############################################
##            Workshop UCIBIO             ##
##                Module 2                ##
##           13 September 2022            ##
##                                        ##
##         Transcriptome assembly         ##
##                                        ##
##  Pedro M. Costa (pmcosta@fct.unl.pt)   ##
############################################

#In this session we will assemble the transcriptome of the regenerating planarian
#Schmidtea mediterranea.
#This less conventional model organism has far fewer genomic resources than, e.g., mammals and zebrafish.
#It is widely employed in regeneration studies.
#Here, this species will be used as a surrogate for "stranger" organisms for which there is no genome
#or transcriptome mapping available.

##Login as usual

##Create work directories.
mkdir ~/Module2
cd ~/Module2
mkdir Input
mkdir Output
mkdir Bash

##Loading necessary software:
module load trinity/2.14.0
module load salmon/1.9.0
module load bowtie2-2.3.0
module load transdecoder/5.5.0
module load blast/2.13.0

##Transfer the files 'SmedIllumina_R1.fastq.gz' and 'uniprot_sprot.fasta.gz' to your directory 'Input'; otherwise, call them directly from their location when needed.
##If needed, change to the tutorial directory:
cd /data/tutorial/modulo2

##Step 1.
##Transcriptome assembly. 'Left' sequences only.
##Create a bash file called 'Assembly.sh' and place it in '/Bash'
##Write the following instructions in it:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
Trinity --seqType fq --single /data/tutorial/modulo2/Input/SmedIllumina_R1.fastq.gz --SS_lib_type F --max_memory 100G --CPU 8 --full_cleanup --output /data/tutorial/modulo2/Output/Trinity/TrinitySmed

##Alternatively, we could assemble the transcriptome from paired-end reads using:
Trinity --seqType fq --left /data/tutorial/modulo2/Input/SmedIllumina_R1.fastq.gz --right /data/tutorial/modulo2/Input/SmedIllumina_R2.fastq.gz --SS_lib_type FR --max_memory 100G --CPU 8 --full_cleanup --output /data/tutorial/modulo2/Output/Trinity/TrinitySmed2R

##Run the bash file:
sbatch Assembly.sh

##You can check the progress of your job with:
squeue -u [username]
##Also check the '.out' files in the 'Bash' directory, especially if errors occur.

##Now let's run some very basic QA. You will note that our data is not very impressive to begin with. However, a small raw data file was mandatory for classes...
/data/seatox/Software/Trinityrnaseq-v2.6.6/util/TrinityStats.pl /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta

##Finding ORFs (a two-step process).
##Create another bash file
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
TransDecoder.LongOrfs -t /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta --output_dir /data/tutorial/modulo2/Output/Transdecoder
TransDecoder.Predict -t /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta --output_dir /data/tutorial/modulo2/Output/Transdecoder

##Annotation by homology matching against the UniProt database.
##For this purpose, we will be contrasting our predicted ORFs against a FASTA file containing UniProt.
##First, we must create a database from the file 'uniprot_sprot.fasta' downloaded from https://www.uniprot.org/.
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
makeblastdb -in /data/tutorial/modulo2/Input/uniprot_sprot.fasta -out /data/tutorial/modulo2/Output/Blast/uniprot_sprot -dbtype prot

##The second step is ORF homology-matching against the database.
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
blastp -query TrinitySmed.Trinity.fasta.transdecoder.pep -db /data/tutorial/modulo2/Output/Blast/uniprot_sprot -max_target_seqs 1 -outfmt '10 delim=; qseqid sacc evalue pident qcovs stitle' -evalue 1e-5 -num_threads 10 > /data/tutorial/modulo2/Output/BlastSmed.csv

##There are many parameters that can be set. The command line above is pretty much the 'standard', and is designed to deliver the results as a ';'-separated CSV file (as defined by outfmt '10').
##outfmt values to be included in the results file:
##
## qseqid  - Query Seq-id
## sacc    - Subject accession
## evalue  - Expect value
## pident  - Percentage of identical matches
## qcovs   - Query coverage per subject
## stitle  - Subject title (full description of the subject)
## delim=; - Column separator (defaults to the customary American-style ',')
##
##Note: 'Subject' refers to the matching result.

##Autonomous exercises!
##
##Let's contrast the findings against UniProt, but only the human proteome subset, in 'UniprotHomoSapiensProteomeReviewed.fasta', located in the 'Input' directory.
##As before, it is a two-step procedure. First, we create a database, then we contrast our predicted ORFs against it.
##Name the output file something like: 'BlastSmedHuman.csv'
##Save both .csv files to your computer and compare the findings!

##END
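
##A possible starting point for the comparison asked for in the last exercise: a minimal sketch, not part of the workshop scripts.
##It assumes both result files use the ';'-separated layout defined by the outfmt string above, with qseqid in column 1.
##The two small files it creates contain made-up toy rows so that the example runs on its own; the names 'all' and 'human' are illustrative. Point them at your own 'BlastSmed.csv' and 'BlastSmedHuman.csv' instead.

```shell
#!/bin/bash
# Sketch: compare two ';'-separated BLAST result files by query ID (column 1).
# The rows written below are invented toy data for illustration only.
tmpdir=$(mktemp -d)
all="$tmpdir/BlastSmed.csv"
human="$tmpdir/BlastSmedHuman.csv"

printf '%s\n' \
  'orf1;P00001;1e-50;80.0;95;Toy protein A' \
  'orf2;P00002;1e-20;45.5;70;Toy protein B' \
  'orf3;P00003;1e-08;30.1;40;Toy protein C' > "$all"
printf '%s\n' \
  'orf1;Q00001;1e-45;78.2;94;Toy human protein A' \
  'orf3;Q00003;1e-06;29.0;38;Toy human protein C' > "$human"

# Unique query IDs with at least one hit in each file, sorted for 'comm'.
cut -d';' -f1 "$all"   | sort -u > "$tmpdir/all.ids"
cut -d';' -f1 "$human" | sort -u > "$tmpdir/human.ids"

# Count ORFs with a hit in each database, and those with a hit in both.
n_all=$(awk 'END{print NR}' "$tmpdir/all.ids")
n_human=$(awk 'END{print NR}' "$tmpdir/human.ids")
n_shared=$(comm -12 "$tmpdir/all.ids" "$tmpdir/human.ids" | awk 'END{print NR}')

# ORFs matched in the full database but not in the human subset.
only_all=$(comm -23 "$tmpdir/all.ids" "$tmpdir/human.ids")

echo "ORFs with a hit (full database): $n_all"
echo "ORFs with a hit (human subset):  $n_human"
echo "ORFs with a hit in both:         $n_shared"
echo "ORFs matched only in the full database:"
echo "$only_all"

rm -rf "$tmpdir"
```

##Comparing by qseqid keeps the sketch independent of the remaining columns, so a ';' inside a subject title does not affect the counts.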