############################################
##            Workshop UCIBIO             ##
##                Module 2                ##
##           13 September 2022            ##
##                                        ##
##        Transcriptome assembly          ##
##                                        ##
##  Pedro M. Costa  (pmcosta@fct.unl.pt)  ##
############################################

#In this session we will assemble the transcriptome of the regenerating planarian
#Schmidtea mediterranea.
#This less conventional model organism has far fewer genomic resources than, e.g., mammals and zebrafish.
#It is widely employed in regeneration studies.
#Here, this species will be used as a surrogate for "stranger" organisms for which there is no reference genome
#or transcriptome available to map against.

##Log in as usual

##Create work directories.

mkdir ~/Module2
cd ~/Module2
mkdir Input
mkdir Output
mkdir Bash
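
##The commands further down write to the sub-directories 'Trinity', 'Transdecoder' and 'Blast' inside 'Output'.
##A minimal sketch for creating them in advance. The function name 'make_output_dirs' is only illustrative,
##and it is not assumed here that each tool would create a missing directory by itself.
make_output_dirs() {
  # $1 = work directory that holds 'Output'; -p also creates missing parent directories
  mkdir -p "$1/Output/Trinity" "$1/Output/Transdecoder" "$1/Output/Blast"
}
##Example: make_output_dirs ~/Module2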

##Load the necessary software:
module load trinity/2.14.0
module load salmon/1.9.0 
module load bowtie2-2.3.0
module load transdecoder/5.5.0
module load blast/2.13.0

##Transfer the files 'SmedIllumina_R1.fastq.gz' and 'uniprot_sprot.fasta.gz' to your 'Input' directory, or call them directly from their location when needed.
##If needed, move to the shared tutorial directory:
cd /data/tutorial/modulo2
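
##Before assembling, it can help to confirm that the read file is intact and to see how many reads it holds.
##A sketch assuming standard four-line FASTQ records; 'count_reads' is an illustrative name, not part of any package.
count_reads() {
  # decompress to stdout and divide the number of lines by four
  gunzip -c "$1" | awk 'END { print NR / 4 }'
}
##Example: count_reads /data/tutorial/modulo2/Input/SmedIllumina_R1.fastq.gz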

##Step 1. Transcriptome assembly ('left' reads only).

##Create a bash file called 'Assembly.sh' and place it in the 'Bash' directory.
##Write the following instructions in it:

#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
Trinity --seqType fq --single /data/tutorial/modulo2/Input/SmedIllumina_R1.fastq.gz --SS_lib_type F --max_memory 100G --CPU 8 --full_cleanup --output /data/tutorial/modulo2/Output/Trinity/TrinitySmed
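
##One way to create this file without opening a text editor is a here-document.
##A sketch only: 'write_assembly_script' is an illustrative name, and the quoted 'EOF' keeps the shell from expanding anything inside the script.
write_assembly_script() {
  # $1 = directory in which 'Assembly.sh' is written (e.g. the 'Bash' directory)
  cat > "$1/Assembly.sh" <<'EOF'
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
Trinity --seqType fq --single /data/tutorial/modulo2/Input/SmedIllumina_R1.fastq.gz --SS_lib_type F --max_memory 100G --CPU 8 --full_cleanup --output /data/tutorial/modulo2/Output/Trinity/TrinitySmed
EOF
}
##Example: write_assembly_script ~/Module2/Bash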

##Alternatively, we could assemble the transcriptome from paired-end reads using:
Trinity --seqType fq --left /data/tutorial/modulo2/Input/SmedIllumina_R1.fastq.gz --right /data/tutorial/modulo2/Input/SmedIllumina_R2.fastq.gz --SS_lib_type FR --max_memory 100G --CPU 8 --full_cleanup --output /data/tutorial/modulo2/Output/Trinity/TrinitySmed2R

##Run the bash file:
sbatch Assembly.sh

##You can check the progress of your job with:
squeue -u [username]

##Also check the '.out' files in the 'Bash' directory, especially if errors occur.
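
##By default, Slurm writes the output of each job to a file named 'slurm-<jobid>.out' in the directory from which 'sbatch' was called.
##A sketch for listing the log files that mention an error. 'logs_with_errors' is an illustrative name,
##and the search is a plain case-insensitive match for the word 'error', so it may miss failures reported in other words.
logs_with_errors() {
  # $1 = directory holding the '.out' files; prints the names of matching files only
  grep -il 'error' "$1"/*.out
}
##Example: logs_with_errors ~/Module2/Bash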

##Now let's run some very basic QA. You will note that our data are not very impressive to begin with; a small raw data file was necessary for class purposes.

/data/seatox/Software/Trinityrnaseq-v2.6.6/util/TrinityStats.pl /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta
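
##'TrinityStats.pl' reports, among other figures, the number of transcripts and the contig N50.
##As a cross-check, the sketch below derives the sequence count and the N50 from any FASTA file with standard tools.
##The function names are illustrative. N50 is taken here as the length of the contig at which the running total,
##counted from the longest contig downwards, first reaches half of all assembled bases.
count_seqs() {
  # every FASTA record starts with a '>' header line
  grep -c '^>' "$1"
}
fasta_n50() {
  # first awk: one length per sequence (sequences may span several lines)
  awk '/^>/ { if (len) print len; len = 0; next } { len += length($0) } END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END { for (i = 1; i <= NR; i++) { run += lens[i]; if (run >= total / 2) { print lens[i]; exit } } }'
}
##Example: fasta_n50 /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta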

##Finding ORFs (a two-step process).
##Create another bash file with the following instructions:

#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
TransDecoder.LongOrfs -t /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta --output_dir /data/tutorial/modulo2/Output/Transdecoder
TransDecoder.Predict -t /data/tutorial/modulo2/Output/Trinity/TrinitySmed.Trinity.fasta --output_dir /data/tutorial/modulo2/Output/Transdecoder
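
##The predicted peptides end up in a FASTA file ending in '.transdecoder.pep'.
##A sketch for tallying the ORF classes, assuming each header carries a label of the form 'type:<class>' (e.g. 'type:complete').
##'orf_type_counts' is an illustrative name; the output is one line per class, as '<label><TAB><count>'.
orf_type_counts() {
  # keep header lines, extract the 'type:...' label, then count each label
  grep '^>' "$1" | grep -o 'type:[A-Za-z0-9_]*' | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
##Example: orf_type_counts TrinitySmed.Trinity.fasta.transdecoder.pep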

##Annotation by homology matching against the UniProt database.
##For this purpose, we will compare our predicted ORFs against a FASTA file of UniProt (Swiss-Prot) sequences.
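
##The UniProt file was transferred compressed ('.gz'), whereas the database step below reads the plain 'uniprot_sprot.fasta'.
##A sketch for unpacking it while keeping the compressed original; 'unpack_fasta' is an illustrative name.
##It prints the number of sequences found, as a quick sanity check.
unpack_fasta() {
  # $1 = path to a '.fasta.gz' file; the output path is the same without '.gz'
  gunzip -c "$1" > "${1%.gz}" && grep -c '^>' "${1%.gz}"
}
##Example: unpack_fasta /data/tutorial/modulo2/Input/uniprot_sprot.fasta.gz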

##First, we must create a BLAST database from the file 'uniprot_sprot.fasta', downloaded from https://www.uniprot.org/.
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
makeblastdb -in /data/tutorial/modulo2/Input/uniprot_sprot.fasta -out /data/tutorial/modulo2/Output/Blast/uniprot_sprot -dbtype prot

##The second step is ORF homology-matching against the database.
#!/bin/bash
#SBATCH --partition=short
#SBATCH --tasks-per-node=1
#SBATCH --nodes=1
blastp -query TrinitySmed.Trinity.fasta.transdecoder.pep -db /data/tutorial/modulo2/Output/Blast/uniprot_sprot -max_target_seqs 1 -outfmt '10 delim=; qseqid sacc evalue pident qcovs stitle' -evalue 1e-5 -num_threads 10 > /data/tutorial/modulo2/Output/BlastSmed.csv

##There are many parameters that can be set. The command line above is fairly standard and is designed to deliver the results as a ';'-separated CSV file (as defined by -outfmt '10').
##Fields and options passed to -outfmt for the results file:
##
## qseqid	- Query Seq-id
## sacc		- Subject accession
## evalue	- Expect value
## pident	- Percentage of identical matches
## qcovs	- Query coverage per subject
## stitle	- Subject title (full description of the matching sequence)
## delim=;	- Column separator (defaults to the customary American-style ',')
##
##Note: 'Subject' refers to the database sequence matched by the query.
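
##With the column order above (qseqid;sacc;evalue;pident;qcovs;stitle), the results file can be summarised with 'awk'.
##A sketch that counts the distinct queries having at least one hit above chosen identity and coverage thresholds.
##'hits_summary' is an illustrative name and the thresholds in the example are arbitrary, not recommended cut-offs.
hits_summary() {
  # $1 = BLAST CSV, $2 = minimum % identity (pident), $3 = minimum % query coverage (qcovs)
  awk -F';' -v id="$2" -v cov="$3" '$4 >= id && $5 >= cov { print $1 }' "$1" |
    sort -u |
    awk 'END { print NR }'
}
##Example: hits_summary /data/tutorial/modulo2/Output/BlastSmed.csv 50 70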

##Autonomous exercises!
##
##Let's compare the findings against UniProt again, but this time using only the human proteome subset, in 'UniprotHomoSapiensProteomeReviewed.fasta', located in the 'Input' directory.
##As before, it is a two-step procedure. First, we create a database, then we compare our predicted ORFs against it.
##Name the output file something like: 'BlastSmedHuman.csv'

##Save both .csv files to your computer and compare the findings!
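
##One simple way to compare the two result files is to ask which query ORFs found a match in both databases, and which in only one.
##A sketch assuming both files are ';'-separated with the query ID in the first column; 'compare_hits' is an illustrative name.
compare_hits() {
  # $1, $2 = the two BLAST CSV files; temporary files hold the sorted, unique query IDs
  a=$(mktemp); b=$(mktemp)
  cut -d';' -f1 "$1" | sort -u > "$a"
  cut -d';' -f1 "$2" | sort -u > "$b"
  echo "both: $(comm -12 "$a" "$b" | awk 'END { print NR }')"
  echo "only_first: $(comm -23 "$a" "$b" | awk 'END { print NR }')"
  echo "only_second: $(comm -13 "$a" "$b" | awk 'END { print NR }')"
  rm -f "$a" "$b"
}
##Example: compare_hits BlastSmed.csv BlastSmedHuman.csv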

##END