Computational Biosciences using HPC systems

Module 3: Phylogenomics

Tutorial

1. Introduction

Software:

File formats:

Datasets

The dataset that we will be using is a multiple sequence alignment in fasta format that is a subset of the original Turtle data set used to assess the phylogenetic position of Turtle relative to Crocodiles and Birds (Chiari et al., 2012 MBC Biology 10; https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65). This dataset contains 29 genes which are a subset of the original 248 genes.

All the datasets here: /data/tutorial/modulo3
Copy these files to your home directory.

cp -r /data/tutorial/module3/data .

2. IQTREE

Where to get help:

http://www.iqtree.org
http://www.iqtree.org/doc/

2.1. Run IQTREE to estimate the best evolutionary model and number of cores for the dataset

Create a new folder called "iqtree" to perform all analyses with this tool. Copy the submission script "run_iqtree.sh" from '/data/tutorial/module3/bin' to your home directory; open it and change the path to your own folder.

The submission script should look like this:

#!/bin/bash
#SBATCH --partition=short  
#SBATCH --nodes=1   
#SBATCH --tasks-per-node=4  
#SBATCH --cpus-per-task=4    

# Path Variables
FOLDER='/data/yeastgenomics/pbrito/Teaching/Workshop_module3/iqtree'
FILE='turtle.fa'
OUTGROUP='protopterus'

# Run code
module load gcc83/iqtree2/2.1.3
srun iqtree2 -s $FOLDER/$FILE -st DNA  -o $OUTGROUP -m TEST -nt AUTO -ntmax 4 -pre model_threads

You can use the iqtree manual of the help menu to understand the previous commands and see what other commands are available.

module load gcc83/iqtree2/2.1.3
iqtree -h

launch the analysis using -> sbatch run_iqtree.sh
check if your analysis is in the queue -> squeue

Parameters used in this analysis:
-s -> Input alignment in PHYLIP/FASTA/NEXUS/CLUSTAL/MSF format
-st -> BIN, DNA, AA, NT2AA, CODON, MORPH (default: auto-detect)
-o -> Outgroup taxon name for writing .treefile
-pre -> Prefix for all output files (default: aln/partition)
-nt -> Number of cores/threads or AUTO for automatic detection
-ntmax -> Max number of threads by -nt AUTO (default: #CPU cores)

Once your analysis is ready analyse the following output file -> turtle.fa.log

Note the following definition:
Speedup: The speedup is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors.

Efficiency: is defined as the ratio of speedup to the number of processors. Efficiency measures the fraction of time for which a processor is usefully utilised.

RESULTS Best model: GTR+F+I+G4 Best number of nodes: 3 (sometimes 1...)

2.2. Run a likelihood analysis with the selected model and assess branch support with ultrafast bootstrap with 1000 replicates. Use the chosen number of nodes

Note: to run this analysis edit the previous submission script and replace the last line of code with this one:

srun iqtree2 -s $FOLDER/$FILE -st DNA -o $OUTGROUP -m GTR+F+I+G4  -nt 3 -bb 1000 -bnni -pre MLBoot

Note: while using the ultrafast bootstrap analysis we use the option -bnni to reduce the risk of overestimating branch support with UFBoot due to severe model violations. With this option UFBoot will further optimize each bootstrap tree using a hill-climbing nearest neighbour interchange (NNI) search based directly on the corresponding bootstrap alignment.

output files: Analyze the log file: MLBoot.log
MLBoot.contree: contains the consensus tree with assigned branch supports where branch lengths are optimized on the original alignment.

Three ways to see the tree that you just inferred using Figtree
  1. Call the program on the server:
    module load FigTree/1.4.4
    figtree
    Note that this will only work if at login you specified "-X" (ssh -X -i ~/.ssh/FCT_rsa username@cirrus.ncg.ingrid.pt)
  2. Download Figtree to your computer (https://github.com/rambaut/figtree/)
  3. Visualize the tree using the web server of ETE Toolkit (http://etetoolkit.org/treeview/)

2.3. Run a maximum-likelihood analysis with a partition model, but use IQTREE ModelFinder to choose for the right partitioning scheme.

ModelFinder implements a greedy strategy (Lanfear et al., 2012) that starts with the full partition model and subsequently merges two genes until the model fit does not increase any further To run this analysis use the following lines of code (note that we removed the parameter that indicates the outgroup):

PARTITION='turtle.nex'
srun iqtree2 -s $FOLDER/$FILE -st DNA -m MFP+MERGE  -nt 3 -p $FOLDER/$PARTITION -bb 1000 -pre partition-merge

3. RAxML - Randomized Axelerated Maximum Likelihood)

3.1. Load the program:
module load gcc83/MrBayes/3.2.7a

You have access to 3 flavours of RAxML: raxmlHPC-HYBRID-SSE3 raxmlHPC-PTHREADS-SSE3 raxmlHPC-MPI-SSE3

Where to get help:

3.1. Performed a simple maximum-likelihood analysis using RAxML

Create a new folder to perform all analyses with this tool and move to that folder. Copy the turtle.fa file to this new folder that you just created.

mkdir raxml
cd raxml
cp ../iqtree/turtle.fa .
ls

Set up the submission script using the following lines of code - you can copy the script from /data/tutorial/modulo3/

#!/bin/bash
#SBATCH --partition=short   
#SBATCH --nodes=1           
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4

# Path Variables
FOLDER='/data/yeastgenomics/pbrito/Teaching/Workshop_module3/raxml'
FILE='turtle.fa'

# Run code
module load gcc83/raxml/8.2.12
#Run simple max-lik search
srun raxmlHPC-PTHREADS-SSE3 -m GTRGAMMA -p 12345 -s $FOLDER/$FILE -# 10 -n ML -T 4

Q: What does each parameter in the code line mean? (use help menu)
Q: How many evolutionary models are available on this tool? (use help menu)
After the analysis is completed:

How many threads should we use?
It is important to know that the parallel efficiency of the PThreads version of RAxML depends on the alignment length or on the number of distinct patterns in the alignment. Normally, you would expect a parallel program to become faster as you increase the number of cores/processors you are using. This is however not generally true. Rule of thumb: use one core/thread per 500 DNA site patterns, i.e., if you have less, then it's probably better to just use the sequential version.

3.2. Perform a bootstrap analysis to get branch support for the previous best-scoring ML tree

This is done with two consecutive commands.

# First
srun raxmlHPC-PTHREADS-SSE3 -m GTRGAMMA -p 12345 -b 12345 -# 100 -s $FOLDER/$FILE -n boot -T 4

The bootstrap replicate trees are saved on file RAxML_bootstrap.boot that can be used to draw support values for all branches. Run now the following line of code:

# Second
srun raxmlHPC-PTHREADS-SSE3 -m GTRGAMMA -p 12345 -f b -t RAxML_bestTree.ML -z RAxML_bootstrap.boot -n bootTree -T 4

Tree with bipartitions written to file: RAxML_bipartitions.bootTree
Tree with bipartitions as branch labels written to file: RAxML_bipartitionsBranchLabels.bootTree
Execution information file written to RAxML_info.bootTree

In case you want to indicate the outgroup this is how it is done:
1 outgroup taxon:
raxmlHPC-SSE3 -p 12345 -o taxon1 -m GTRGAMMA -s dna.phy -n T30
2 outgroups taxa:
raxmlHPC-SSE3 -p 12345 -o taxon1,taxon2 -m GTRGAMMA -s dna.phy -n T31

3.3. Run multiple non-parametric bootstrap with the parallel version of RAxML

There are some differences in the submission script, please note:

#!/bin/bash
#SBATCH --partition=short   
#SBATCH --tasks-per-node=4  
#SBATCH --nodes=4           
#SBATCH --cpus-per-task=1 --> will not use this, by default is 1; if you put 4 we will then have 4*4 cores in each node.

mpirun raxmlHPC-MPI-SSE3  ­-­m GTRGAMMA -s $FOLDER/$FILE -p 12345 -b 12345 -­# 100  -­n boot_parallel

The MPI version is for executing really large runs (i.e. 100 or 1,000 bootstraps) on a Linux cluster. You can also perform multiple inferences on larger datasets in parallel to find a best- known ML tree for your dataset.
Warning: Reduced functionality of MPI version! The current MPI version only works properly if you specify the ­-# or -­N option in the command line, since it has been designed to do multiple inferences or rapid/standard bootstrap searches in parallel! For all remaining options, the usage of this type of coarse-grained parallelism does not make much sense!

Read the log file RAxML_info.boot_parallel The bootstrap trees that you just created are here: RAxML_bootstrap.boot_parallel Run the second step of the bootstrap analyses as before but with the new trees. NOTE that now you do not need to use the mpi version and parallelize the analysis.

# Second
srun raxmlHPC-PTHREADS-SSE3 -m GTRGAMMA -p 12345 -f b -t RAxML_bestTree.ML -z RAxML_bootstrap.boot_parallel -n bootTree_parallel -T 4

4. MrBayes - Bayesian inference of phylogeny

MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters. http://nbisweden.github.io/MrBayes/

Where to get help:

3.1 First get familiar with the main parameters

Create a new folder to perform all analyses with this tool and move to that folder. Copy the turtle_mb.nex file to this new folder that you just created. MrBayes only accepts files with the Nexus format.

mkdir mrbayes
cd mrbayes
cp /data/tutorial/modulo3/data/turtle_mb.nex .
ls

There are four steps to do a typical Bayesian phylogenetic analysis using MrBayes:

  1. Read the input data file (execute input_file)
  2. Set the evolutionary model, the priors, and the search strategy (lset and prset and mcmcp)
  3. Run the analysis (mcmc)
  4. Summarise the samples (sump and sumt)

First get familiar with the main parameters

# load and enter the program in a interactive way
module load gcc83/MrBayes/3.2.7a
mb
# read the input file
execute turtle_mb.nex
# learn about the different parameters
help lset (or showmodel) # sets the evolutionary model parameters
help prset   # sets the priors
help mcmcp   # sets the parameters of the Markov chain Monte Carlo (MCMC) analysis
help mcmc
# if you want to change a parameter value you can do the following:
lset nst=6 rates=gamma   # This sets the evolutionary model to GTR+G
mcmcp ngen=10000 Samplefreq=100  # This changes the search strategy to 10000 generations that are sampled every 100.
# to quit the program:    
quit (q)

The Ngen in mcmcp sets the number of generations for which the analysis will be run. It is useful to run a small number of generations first to make sure that the analysis is correctly set up and to get an idea of how long it will take to complete a longer analysis.
The Samplefreq indicates how often the chain is sampled. By default, the chain is sampled every 100th generation, and this works well for most analyses. When the chain is sampled, the current values of the model parameters are printed to .p file and the topology and branch lengths are printed to a .t file.

By default, MrBayes uses Metropolis coupling to improve the MCMC sampling of the target distribution. The Swapfreq, Nchains, and Temp settings together control the Metropolis coupling behavior. When Nchains is set to 1, no heating is used. When Nchains is set to a value n larger than 1, then n - 1 heated chains are used. By default, Nchains is set to 4, meaning that MrBayes will use 3 heated chains and one "cold" chain. Typically, heating is essential for problems with more than about 50 taxa.

3.1 Run a simple and too short bayesian analysis

Set up the submission script using the following lines of code - you can copy the script from /data/tutorial/modulo3/run_mrbayes.sh

#!/bin/bash
#SBATCH --partition=short   
#SBATCH --nodes=1           
#SBATCH --tasks-per-node=4

# Path Variables
FOLDER='/data/yeastgenomics/pbrito/Teaching/Workshop_module3/mrbayes'
FILE='turtle_mb.nex'
# Load module
module load gcc83/MrBayes/3.2.7a
# Run code
srun mb MrBayes_block.nex > log.short.txt

To run MrBayes in batch mode you need to create a file with all parameter values in nexus format.
We called thios file MrBayes_block.nex
You can copy the file from /data/tutorial/modulo3/run_mrbayes.sh

begin mrbayes;
   set autoclose=yes nowarnings=yes;
   execute ./turtle_mb.nex;
   lset nst=6 rates=gamma;
   mcmcp ngen=10000 Samplefreq=100;
   mcmc;
end;

You are now ready to submit your analysis to the queue:

qsub run_mrbayes.sh
squeue

After the analysis is done enter MrBayes in the interactive mode and execute your file and summarise the sampled results (both parameters and trees)

mb
execute turtle_mb.nex
# Summarize your samples both parameters and tree using a burnin and analyse the results on the output.
sump burnin=50
sumt burnin=50
# Analyse the output results
# Did it run enough time? How was the mixing? Do you need a larger burnin?

You can use the program Tracer to analyse the probability files (.p). Alternatively, you can into Excel and plot the log likelihoods to check how the log likelihoods improve with time.

module load Tracer/1.7.2  
tracer
# open files with extection ".p"

SUMMARIZNG SAMPLES OF TREES AND BRANCH LENGTHS
sumt and sump commands each have separate burn-in settings so it is necessary to give the burn-in here again.

The sumt command will output a tree with clade credibility (posterior probability) values and a phylogram. This command will create three additional files:

* __.parts__ file, which contains the list of taxon bipartitions, their posterior probability, and the branch lengths associated with them.  Open this file with a text editor.     
* __.con__ file that includes two consensus trees. The first one has both the posterior probability of clades (as interior node labels) and the branch lengths in its description. A graphical representation of this tree can be generated with Figtree as before. The second tree only contains the branch lengths and it can be imported into a wide range of phylogenetics programs.     
* __.trprobs__  file, which contains the trees that were found during the MCMC search, sorted by posterior probability. Open this file with a text editor.    

3.2 Run a simple but long bayesian analysis

Edit the scripts you used in the previous analysis and run a bayesian analysis with a more complex search strategy. Summarise the results and compare the final tree and the parameter values.

NOTE: you might want to save the output files in a different folder so to not overwrite them in subsequent analysis.

mkdir RUN1
mv files RUN1/

Final note
Perhaps the most difficult aspect of a Bayesian MCMC analysis is to decide how long to run it. An initial guess may be obtained by observing the log likelihood of the cold chain. In the beginning of the run, the log likelihood of the cold chain typically increases rapidly. This phase of the run is referred to as the burn-in and the samples from this phase are discarded. Once the likelihood of the cold chain stops to increase and starts to randomly fluctuate within a more or less stable range, the run may have reached stationarity. At stationarity, we typically also observe that the cold and heated chains are exchanging states frequently, resulting in the cold chain moving around freely among the columns.

This is the end!