Standard Operating Procedures
In our lab we typically use Funannotate for annotation, though MAKER is another pipeline that we support and have used in the past.
Using RNA-seq data, Funannotate can improve the gene prediction parameters for SNAP and Augustus.
If you are using RNA-seq for training and improving gene models, you can take advantage of the extra speed-up from running PASA with a MySQL server instead of the default, SQLite. On HPCC this requires setting up a MySQL server using the singularity package.
You need to create the file $HOME/pasa.CONFIG.template, which will be customized for your user account. Copy it from the system-installed PASA folder.
A current version on the system is located in /opt/linux/centos/7.x/x86_64/pkgs/PASA/2.3.3/pasa_conf/pasa.CONFIG.template
To copy the template into your home directory (this example uses the 2.4.1 install), run:
rsync /opt/linux/centos/7.x/x86_64/pkgs/PASA/2.4.1/pasa_conf/pasa.CONFIG.template ~/
This can also be done automatically with the latest version of PASA on the system:
module load PASA/2.4.1
FOLDER=$(dirname "$(which pasa)")
rsync -v "$FOLDER/../pasa_conf/pasa.CONFIG.template" ~/
You will need to edit this file. The MYSQLSERVER setting will be updated by the MySQL setup step later, so leave it alone, but fill in the value for MYSQL_RW_USER. The top of the file looks like this:
# server actively running MySQL
# Pass socket connections through Perl DBI syntax e.g. MYSQLSERVER=mysql_socket=/tmp/mysql.sock
# read-write username and password
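If you prefer not to open an editor, the read-write credentials can be filled in with a small sed helper. This is only a minimal sketch: the helper name set_pasa_key, the MYSQL_RW_PASSWORD key name, and the example user and password are assumptions for illustration, and the demo edits a scratch file rather than your real $HOME/pasa.CONFIG.template.

```shell
#!/usr/bin/env bash
# set_pasa_key FILE KEY VALUE: replace the value on an existing KEY=... line.
# Hypothetical helper. It does not add keys that are missing from FILE, and
# values containing | or & would need escaping before being passed to sed.
set_pasa_key() {
    local file=$1 key=$2 value=$3 tmp
    # Write to a temporary file first so FILE is only replaced on success.
    tmp=$(mktemp) || return 1
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on a scratch file standing in for pasa.CONFIG.template.
demo=$(mktemp)
printf '%s\n' '# read-write username and password' \
    'MYSQL_RW_USER=xxxxxx' 'MYSQL_RW_PASSWORD=xxxxxx' > "$demo"
set_pasa_key "$demo" MYSQL_RW_USER pasa_user        # example value only
set_pasa_key "$demo" MYSQL_RW_PASSWORD change_me    # example value only
cat "$demo"
rm -f "$demo"
```

Because the file ends up holding a password, it is worth keeping it readable only by your own account.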
On the UCR HPCC, here are directions on how to set up your own MySQL instance in your account using singularity. If you were running Funannotate on your own Linux/Mac setup, you would just do a native MySQL/MariaDB install and have the server running on your local machine.
The HPCC instructions include the steps to initialize a database, after which you start a job that runs the MySQL instance. This database server needs to be started before you start annotating and shut down when you are finished. I often give it a long lifetime, such as 2 weeks, but it can be stopped at any point.
Make sure your directory has the following:
All other folders will be created by running the steps in the pipeline folder.
The Funannotate steps look like:
Paste the following MySQL code into your pipeline directory. Submit it with sbatch, then proceed through the rest of the Funannotate steps. There is no need to change anything in this script.
This script will run for 5 days, so make sure you finish all the annotation steps in that time.
RepeatModeler builds a repeat-element library from your genome so that the genome can be masked properly.
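#!/usr/bin/bash -l
#SBATCH -p batch --time 2-0:00:00 --ntasks 8 --nodes 1 --mem 120G --out logs/repeatmodeler_attempt.%a.log
# Use the CPUs Slurm allocated; fall back to 1 when run outside Slurm.
CPU=1
if [ -n "$SLURM_CPUS_ON_NODE" ]; then
    CPU=$SLURM_CPUS_ON_NODE
fi
# Assumed from the text: genomes live in genomes/ and samples are listed in samples.csv.
INDIR=genomes
SAMPFILE=samples.csv
mkdir -p repeat_library
# Row to process: the array task ID, or the first command-line argument.
N=${SLURM_ARRAY_TASK_ID}
if [ -z "$N" ]; then
    N=$1
    if [ -z "$N" ]; then
        echo "need to provide a number by --array or cmdline"
        exit 1
    fi
fi
# samples.csv has a header row, so there is one fewer data row than lines.
MAX=$(wc -l < "$SAMPFILE")
MAXSMALL=$((MAX - 1))
if [ "$N" -gt "$MAXSMALL" ]; then
    echo "$N is too big, only $MAXSMALL lines in $SAMPFILE"
    exit 1
fi
# Skip the header, pick row N, and split it on commas (assumes a comma-separated file).
tail -n +2 "$SAMPFILE" | sed -n "${N}p" | while IFS=, read -r SPECIES STRAIN PHYLUM LOCUS
do
    name=$(echo -n "${SPECIES}_${STRAIN}" | perl -p -e 's/\s+/_/g')
    echo "$name"
    module unload perl
    module unload python
    module unload miniconda2
    module unload anaconda3
    module load RepeatModeler
    module load ncbi-blast/2.13.0+
    export AUGUSTUS_CONFIG_PATH=$(realpath lib/augustus/3.3/config)
    #makeblastdb -in $INDIR/$name.sorted.fasta -dbtype nucl -out repeat_library/$name
    BuildDatabase -name "repeat_library/$name" "$INDIR/$name.sorted.fasta"
    RepeatModeler -database "repeat_library/$name" -pa "$CPU"
done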
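To see which genome a given array index selects, the name-building logic can be tried on its own before submitting anything. This is a minimal sketch, assuming samples.csv is comma-separated with a header row and the four columns named in the script's read statement; the function name sample_name and the example rows are made up for illustration.

```shell
#!/usr/bin/env bash
# sample_name FILE N: print SPECIES_STRAIN for the Nth data row (header
# skipped), with runs of whitespace turned into single underscores.
sample_name() {
    local file=$1 n=$2
    tail -n +2 "$file" | sed -n "${n}p" |
    while IFS=, read -r species strain phylum locus; do
        printf '%s_%s\n' "$species" "$strain" | sed -E 's/[[:space:]]+/_/g'
    done
}

# Demo with a scratch CSV; the rows are illustrative, not real lab samples.
csv=$(mktemp)
printf '%s\n' 'SPECIES,STRAIN,PHYLUM,LOCUS' \
    'Genus species,STRAIN1,Phylum,LOC1' \
    'Other species,ST 2,Phylum,LOC2' > "$csv"
sample_name "$csv" 1
sample_name "$csv" 2
rm -f "$csv"
```

The printed name is what the script expects as the prefix of the genome file, that is NAME.sorted.fasta in the genomes folder.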
To run this step, submit the script as an array job:
sbatch --array=1 pipeline/
If you have more than one genome, make sure each one is listed in your samples.csv file and that its FASTA file is in the genomes folder. The submission then looks like:
sbatch --array=1-5 pipeline/
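The upper end of the --array range should match the number of genomes listed in samples.csv. A minimal sketch for working that out, assuming the file has exactly one header row; the helper name array_range is hypothetical.

```shell
#!/usr/bin/env bash
# array_range FILE: print "1-N", where N is the number of non-empty lines
# after the header row. Fails without output when there are no data rows.
array_range() {
    local file=$1 rows
    rows=$(tail -n +2 "$file" | grep -c . || true)
    [ "$rows" -gt 0 ] || return 1
    printf '1-%s\n' "$rows"
}

# Demo with a scratch CSV holding two illustrative data rows.
csv=$(mktemp)
printf '%s\n' 'SPECIES,STRAIN,PHYLUM,LOCUS' \
    'Genus species,STRAIN1,Phylum,LOC1' \
    'Other species,STRAIN2,Phylum,LOC2' > "$csv"
echo "--array=$(array_range "$csv")"
rm -f "$csv"
```

The printed option can then be passed to sbatch together with the pipeline script for the step you are running.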
Submit this script the same way as above:
sbatch --array=1-5 pipeline/
If this ran successfully, your genomes/ folder will contain your original genome along with a masked genome. Make sure your genome file name ends with GENUS_SPECIES.sorted.fasta so that this step runs properly.
If you have RNA-seq data for your genome, you can run this step. Otherwise, skip to the next step (03_predict) in the Funannotate pipeline.
This portion of the pipeline needs a few more items to run properly. Make sure you have the lib/RNASeq folder. If you have RNA-seq data for your genome, or RNA-seq data from a closely related species that you would like to use to improve prediction, add it here. Your folder should look like lib/RNASeq/YOUR_GENOME_SPECIES_NAME/ and should contain two files, Forward.fq.gz and Reverse.fq.gz. You can download the reads from SRA into this folder and rename them to these two names. Once the step completes, you should see an annotate/YOUR_GENOME_SPECIES/training folder.
Now that you have set up, masked and (optionally) trained on your genome, this step produces one of the main results of the pipeline. You can either use the RNA-seq training from the prior step to generate better gene predictions, or skip that step and run the prediction de novo.
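Before submitting the training step it can help to confirm that the read files are where the pipeline expects them. A minimal sketch, assuming the layout described above; the function name rnaseq_ready is hypothetical, and the demo uses placeholder files in a scratch directory rather than real gzipped reads.

```shell
#!/usr/bin/env bash
# rnaseq_ready BASE NAME: succeed only if Forward.fq.gz and Reverse.fq.gz
# both exist and are non-empty under BASE/NAME; report problems on stderr.
rnaseq_ready() {
    local base=$1 name=$2 f status=0
    for f in Forward.fq.gz Reverse.fq.gz; do
        if [ ! -s "$base/$name/$f" ]; then
            echo "missing or empty: $base/$name/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Demo in a scratch directory; in the pipeline BASE would be lib/RNASeq.
work=$(mktemp -d)
mkdir -p "$work/lib/RNASeq/YOUR_GENOME_SPECIES_NAME"
echo placeholder > "$work/lib/RNASeq/YOUR_GENOME_SPECIES_NAME/Forward.fq.gz"
echo placeholder > "$work/lib/RNASeq/YOUR_GENOME_SPECIES_NAME/Reverse.fq.gz"
rnaseq_ready "$work/lib/RNASeq" YOUR_GENOME_SPECIES_NAME && echo "RNASeq inputs look complete"
rm -rf "$work"
```

This only checks that the files are present and non-empty; it does not validate that they are well-formed FASTQ.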
Check your annotate/YOUR_GENOME_SPECIES/predict_results to see if the program has run to completion. You will see a series of files including gbk, gff3 and protein files. But you are not done just yet!!
Check your annotate/YOUR_GENOME_SPECIES/update_results to see if the program has run to completion. You will see a series of files including gbk, gff3 and protein files. But you are not done just yet!!
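The same check works for both predict_results and update_results, so it can be wrapped in one small helper. This is a rough sketch: the function name has_outputs is hypothetical, and the .proteins.fa suffix in the demo is an assumption about how the protein file is named, so adjust the suffixes to what you actually see in your results folder.

```shell
#!/usr/bin/env bash
# has_outputs DIR SUFFIX...: succeed only if DIR holds at least one non-empty
# file ending in each SUFFIX; name every missing suffix on stderr.
has_outputs() {
    local dir=$1 suffix f found status=0
    shift
    for suffix in "$@"; do
        found=0
        for f in "$dir"/*"$suffix"; do
            # An unmatched glob stays literal and fails the -s test.
            if [ -s "$f" ]; then found=1; break; fi
        done
        if [ "$found" -eq 0 ]; then
            echo "no non-empty *$suffix file in $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Demo in a scratch directory standing in for a results folder.
work=$(mktemp -d)
for ext in .gbk .gff3 .proteins.fa; do   # .proteins.fa is an assumed suffix
    echo placeholder > "$work/YOUR_GENOME_SPECIES$ext"
done
has_outputs "$work" .gbk .gff3 .proteins.fa && echo "results folder looks complete"
rm -rf "$work"
```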
At the end of these two scripts you should have two new folders.
Look through them: annotate/YOUR_GENOME_SPECIES/antismash_local/ should have gbk files, one for each scaffold, while annotate/YOUR_GENOME_SPECIES/annotate_misc should have an iprscan.xml file (and it should NOT be empty).
Once the script has run successfully, you should find annotate/YOUR_GENOME_SPECIES/annotate_results, which will contain gbk, gff3, protein and a couple of other file types.
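Both checks can be scripted instead of eyeballed. A minimal sketch, assuming scaffolds are counted as FASTA header lines and gbk files sit directly inside the antismash_local folder; the function names are hypothetical, and a count mismatch is a prompt to look more closely rather than proof that the run failed.

```shell
#!/usr/bin/env bash
# count_scaffolds FASTA: number of sequences, counted as '>' header lines.
count_scaffolds() {
    grep -c '^>' "$1" || true
}

# count_gbk DIR: number of .gbk files directly inside DIR.
count_gbk() {
    local n=0 f
    for f in "$1"/*.gbk; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# annotation_inputs_ok FASTA ANTISMASH_DIR IPRSCAN_XML: succeed only if there
# is one gbk per scaffold and the InterProScan XML exists and is non-empty.
annotation_inputs_ok() {
    local fasta=$1 asdir=$2 xml=$3 want have status=0
    [ -r "$fasta" ] || { echo "cannot read $fasta" >&2; return 1; }
    want=$(count_scaffolds "$fasta")
    have=$(count_gbk "$asdir")
    if [ "$have" -ne "$want" ]; then
        echo "expected $want gbk files, found $have in $asdir" >&2
        status=1
    fi
    if [ ! -s "$xml" ]; then
        echo "missing or empty: $xml" >&2
        status=1
    fi
    return "$status"
}

# Demo in a scratch directory with a two-scaffold placeholder genome.
work=$(mktemp -d)
mkdir -p "$work/antismash_local" "$work/annotate_misc"
printf '>scaffold_1\nACGT\n>scaffold_2\nACGT\n' > "$work/genome.fasta"
echo placeholder > "$work/antismash_local/scaffold_1.gbk"
echo placeholder > "$work/antismash_local/scaffold_2.gbk"
echo '<placeholder/>' > "$work/annotate_misc/iprscan.xml"
annotation_inputs_ok "$work/genome.fasta" "$work/antismash_local" \
    "$work/annotate_misc/iprscan.xml" && echo "ready for the annotate step"
rm -rf "$work"
```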
Congratulations! You have run the Funannotate pipeline!