SOP_data

Standard Operating Procedures


Project maintained by stajichlab

Running Trinity

Trinity, written by Brian Haas and collaborators, is a de novo assembler for RNA-seq reads that can handle fairly large datasets. It also has a mode called Genome Guided Trinity, which first aligns reads to the genome, bins the genome into segments, and assembles only the reads within each region. This helps cut down on paralogous genes being co-assembled, and it can be more accurate and a little faster because the problem is subdivided into smaller portions for the assembler.

Nearly all the information you need for running Trinity is on the website supported by the developers, so it will not be repeated here. Instead, this page emphasizes some points of clarification.

Getting data ready for assembly

The tutorial developed by Running Trinity

FASTQ from sequencing center

These data should work fine out of the gate. Remember that some datasets will be paired-end and some single-end.

Trinity expects data to
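As a minimal sketch of the paired-end versus single-end distinction, the two kinds of run might be wrapped as shown here. The options `--seqType`, `--left`, `--right`, `--single`, `--max_memory`, and `--CPU` are standard Trinity options. The function names, file names, and the 50G / 8 CPU values are placeholders chosen for illustration, not recommendations from this SOP.

```shell
# Hypothetical wrappers; resource values are placeholders to adjust per dataset.
MEM=50G
CPU=8

# Paired-end reads: forward reads go to --left, reverse reads to --right.
trinity_paired() {
    Trinity --seqType fq --left "$1" --right "$2" \
        --max_memory "$MEM" --CPU "$CPU"
}

# Single-end reads: all reads go to --single.
trinity_single() {
    Trinity --seqType fq --single "$1" \
        --max_memory "$MEM" --CPU "$CPU"
}

# Example calls (file names are made up; Trinity must be on the PATH):
# trinity_paired sample_R1.fastq.gz sample_R2.fastq.gz
# trinity_single sample.fastq.gz
```

Keeping the resource settings in variables means the same wrappers can be reused when a larger memory request is needed.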

FASTQ from NCBI / SRA

Fix read names …
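One common form of read-name fix for SRA-derived FASTQ is to cut the header down to the read identifier and append `/1` or `/2`, so that mates can be paired by name. Whether this is the exact fix intended here is an assumption. The sketch below uses a hypothetical helper, `add_pair_suffix`, with made-up file names. It may also be possible to handle this at extraction time with the `--defline-seq` option of `fastq-dump`.

```shell
# add_pair_suffix: rewrite FASTQ header lines (lines 1, 5, 9, ...) read from
# stdin so that each keeps only its first whitespace-delimited field and
# ends in /1 or /2. Headers that already carry the suffix are left unchanged.
# $1: the mate number to append (1 or 2)
add_pair_suffix() {
    awk -v s="$1" '
        NR % 4 == 1 {
            split($0, f, " ")
            name = f[1]
            if (name !~ ("/" s "$")) name = name "/" s
            print name
            next
        }
        { print }
    '
}

# Example calls (file names are made up):
# add_pair_suffix 1 < SRRXXXXXXX_1.fastq > reads_1.fq
# add_pair_suffix 2 < SRRXXXXXXX_2.fastq > reads_2.fq
```

Sequence, separator, and quality lines pass through untouched, so the output stays valid four-line FASTQ.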

Knowing if your data are strand-specific RNAseq

One critical aspect is knowing whether the data are strand-specific and, if so, whether the reads are organized as RF or FR. This can be guessed, but doing so requires a reference genome. The Trinity documentation discusses this; without a genome, you may need to run a whole assembly iteration with Trinity and then re-map the reads against the resulting transcript assembly.

Several tools can help infer this, though generally only if you also have a genome to align the reads to.
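To illustrate the kind of decision such a tool supports, here is a sketch that turns two fractions into a library-type call for paired-end data. It assumes a tool that reports the fraction of read pairs whose first read is in the sense orientation and the fraction whose first read is antisense, as RSeQC's `infer_experiment.py` does (that tool is named here as an example, not as this SOP's recommendation). The helper name and the 0.8 cutoff are arbitrary choices for illustration.

```shell
# classify_strandedness: map two fractions to a Trinity --SS_lib_type guess.
# $1: fraction of pairs with read 1 sense to the gene
# $2: fraction of pairs with read 1 antisense to the gene
# The 0.8 threshold is an assumption, not a published cutoff.
classify_strandedness() {
    awk -v sense="$1" -v anti="$2" -v t=0.8 'BEGIN {
        if (sense >= t)      print "FR"          # first read is sense
        else if (anti >= t)  print "RF"          # first read is antisense (e.g. dUTP)
        else                 print "unstranded"  # no clear majority
    }'
}

# Example:
# classify_strandedness 0.03 0.95
```

A result of `unstranded` means `--SS_lib_type` would simply be left off the Trinity command.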

Running Trinity Genome-Guided (GG) mode

Read about Genome Guided mode, which improves the accuracy of assembled transcripts by first aligning reads to a genome assembly and then building clusters of aligned reads. To run it, you must first have aligned the reads to the genome.

The result of this run is a Trinity-GG.fasta file instead of Trinity.fasta.
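A genome-guided run might look like the sketch below. It assumes the alignments are already in a coordinate-sorted BAM file, which is what Trinity's `--genome_guided_bam` option expects, and it sets `--genome_guided_max_intron`, which Trinity requires in this mode. The wrapper name, BAM file name, intron length, and resource values are placeholders.

```shell
# Hypothetical wrapper; resource values are placeholders.
MEM=50G
CPU=8

# $1: coordinate-sorted BAM of reads aligned to the genome
# $2: maximum intron length to allow (organism dependent)
trinity_genome_guided() {
    Trinity --genome_guided_bam "$1" --genome_guided_max_intron "$2" \
        --max_memory "$MEM" --CPU "$CPU"
}

# Example call (file name and intron length are made up):
# trinity_genome_guided rnaseq.coordSorted.bam 10000
```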

Running Trinity on HPCC

For a large read set (e.g. 200M reads or more) you will want to request a lot of memory. The maximum memory that can be requested is 500 GB on the intel queue and 1 TB (1000 GB) on the highmem queue.
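The queue limits above can be captured in a small helper that picks a queue from the memory a job needs. This is only a sketch: the function name is made up, and the `sbatch` example in the comments assumes the cluster uses SLURM, which this page does not state.

```shell
# pick_queue: print the queue able to satisfy a memory request.
# $1: memory needed, as a whole number of GB
# Limits come from the text above: intel up to 500 GB, highmem up to 1000 GB.
pick_queue() {
    mem_gb="$1"
    if [ "$mem_gb" -le 500 ]; then
        echo intel
    elif [ "$mem_gb" -le 1000 ]; then
        echo highmem
    else
        echo "no queue offers ${mem_gb} GB" >&2
        return 1
    fi
}

# Example (assumes SLURM; trinity_job.sh is a hypothetical job script):
# sbatch -p "$(pick_queue 750)" --mem 750G trinity_job.sh
```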

Inferring proteins from Trinity assembly

The tool TransDecoder (also written by Brian Haas) can be used to infer open reading frames (ORFs) from a transcript assembly.

module load transdecoder
TransDecoder.LongOrfs -t Trinity.fasta
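`TransDecoder.LongOrfs` only extracts candidate ORFs. TransDecoder's second step, `TransDecoder.Predict`, selects the likely coding regions and writes the final peptide file. The sketch below chains the two steps and adds a quick count of the predicted proteins. The function names are made up, and the output name `Trinity.fasta.transdecoder.pep` follows TransDecoder's usual naming for an input called `Trinity.fasta`.

```shell
# run_transdecoder: two-step ORF inference on a transcript FASTA.
# $1: transcript assembly, e.g. Trinity.fasta
run_transdecoder() {
    TransDecoder.LongOrfs -t "$1" &&
    TransDecoder.Predict -t "$1"
}

# count_records: number of sequences in a FASTA file (lines starting with ">").
count_records() {
    grep -c '^>' "$1"
}

# Example (requires the transdecoder module to be loaded):
# run_transdecoder Trinity.fasta
# count_records Trinity.fasta.transdecoder.pep
```

Comparing the protein count with the number of transcripts in `Trinity.fasta` gives a rough sense of how much of the assembly appears to be coding.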