The meaning of RNA-seq data

Many papers I read mention "RNA-seq data", but while searching for the meaning of this term, I could not find any layman's definition.

As far as I understand, RNA-seq data is the complete RNA information in a cell (or all of the cells) of a species that is extracted at a certain time.

Am I correct with this definition? If yes, does this mean that by extracting all the RNA information one knows all the RNA sequences in that species (like AUGGUCAUCAG…)? Or does it mean I have the RNA, but not the sequence?

RNA-seq data usually provides a snapshot in time of the transcriptome of that which is being sequenced. Single-cell sequencing is possible, but less common than RNA-seq on a sample (containing many cells).

You are correct that RNA-seq provides one with knowledge of RNA sequences, like AUGGUCAUCAG and so on. However, one will not necessarily have information about all possible RNA sequences from a particular species. Performing RNA-seq on the same cell type of a species at two different time points, or two different cell types at the same time point, may result in different profiles for the RNA sequences you get back. It depends on which parts of the genome are being expressed by the cells in a sample at the time of RNA extraction.

The data one gets back from RNA sequencing depends on the technology and company used. Typically, RNA-seq involves using a reverse transcription step so that the sequence data you get back is actually reported as the cDNA version of your original mRNA transcripts. Usually one gets a (very large) file of sequencing 'reads'. One can either then 'assemble' these directly to form a snapshot of the transcriptome or one can map them to a known reference genome. One of the main benefits of RNA-seq is that we learn not just what the sequences of the RNA molecules are, but also their relative abundances within the sample. Knowledge of this can be particularly useful in testing or developing a large range of biological hypotheses.

RNA sequencing occurs when you perform an RNA extraction and then sequence the extracted RNA molecules, which usually involves fragmenting them.

You do in fact know all of the sequences for all of the RNA molecules that are being sequenced. I can't see an interpretation that does not result in you having the sequences of the molecules, what with it being called "sequencing" and all.

Lesson 9: RNA Seq Data

High throughput sequencing technologies have greatly expanded the tools available for DNA and RNA exploration. This chapter is focused on assessing differential gene expression by RNA-seq. However, very similar statistical tools are available for other differential studies using sequencing.

Sequencing starts with an RNA sample from a tissue. This may be preprocessed to enrich for certain types of RNA, such as RNA with poly-A tails. Usually it is then converted into cDNA. Typically there are some quality checks to ensure that there is sufficient RNA in the sample and that the RNA is not degraded. The cDNA is then fragmented into pieces. The length of the fragments is under loose control, in the sense that a mean length can be targeted. In some studies there is further selection to obtain only fragments in a narrow size window. In others, most sufficiently long fragments can be sequenced. A sequenced fragment is called a "read".

The basic sequencing unit is a "lane", which essentially holds one sequencing sample. A set of lanes which are processed together is often called a "plate". A single RNA sample may be split across multiple lanes to increase the amount of sequencing done. This is uncommon in current RNA-seq studies, because each lane can now sequence hundreds of millions of RNA fragments, which is more than sufficient for RNA-seq, but it may be done in studies that need very high read counts.

Nucleotide fragments can be labeled by synthesizing short sequences at the end of each fragment. These sequences are called bar codes. The bar code is sequenced along with the rest of the fragment. By barcoding samples, we can mix different samples in the sequencing lane and then determine which reads belong to which samples using the bar code. Barcoding and mixing samples before loading the sequencing lane is called multiplexing. It is frequently used to reduce sequencing costs when the sequencer produces more reads per lane than required by the study. For example, for RNA-seq studies, the recommendation is to obtain about 25 million reads per sample. If the sequencer can produce 200 million reads per lane, 8 samples can be multiplexed on the same lane.
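The multiplexing arithmetic in this example can be sketched directly; the function name here is invented for illustration:

```python
# Hypothetical helper: how many barcoded samples can share one lane,
# given the lane's read capacity and the per-sample read target.
def samples_per_lane(reads_per_lane, reads_per_sample):
    """Floor division: whole samples that fit in one lane."""
    return reads_per_lane // reads_per_sample

# Worked example from the text: 200 million reads per lane,
# ~25 million reads recommended per RNA-seq sample.
print(samples_per_lane(200_000_000, 25_000_000))  # -> 8
```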

Sequencing Technologies

The most recent technologies are "single molecule" sequencers, which can sequence very long strands of DNA. As is usual with new technologies, these are currently expensive, inaccurate, and not high enough in throughput for quantitative studies. Based on the rapid improvement in their precursors, it is likely that these shortcomings will be overcome within a few years. Currently, these technologies are mostly used to improve our knowledge of DNA and RNA sequences.

For quantitative studies, such as gene expression, much shorter fragments are sequenced - typically around 250 bases. Current high throughput technologies typically sequence between 50 and 250 bases per fragment (called the read length), with longer reads being proportionately more expensive. Paired end sequencing, in which each fragment is sequenced from both ends, was popular for a while as a way to achieve greater effective read length. Now that the read length can be quite comparable to the fragment length, paired end sequencing leads to sequencing the center of the fragment twice, and is not cost-effective.

Matching the reads to features

Once the RNA fragments have been sequenced, they need to be identified by matching them to the features of interest. This is called mapping. For many organisms, there is already a reference genome (all the DNA), transcriptome (all the RNA transcripts) or other type of reference. If there is no reference, or if you need a reference more specific to the strain you are working with, the reads can be used to create one. Longer and more accurate reads assist in good mapping, and are particularly important when building a reference. Neither building a reference nor mapping is among the topics of this course. Typically, specialized mapping software is used, and the mapped reads are then matched to genomic regions corresponding to the features of interest.

When I am working with a lab that has no expertise in mapping, I generally ask the sequencing facility to do the mapping. Among other advantages, this means that I do not need to work with the raw data files, which are large - a typical RNA-seq read file is over 10 Gb. Typically, to obtain the raw data, my collaborators send a hard drive to the sequencing facility - courier service is faster than uploading and downloading from the computing cloud!

After mapping to the reference, the reads are converted to counts per feature. Typically a small proportion of the reads do not match anything, and another small proportion match 2 or more locations on the reference (which may be due to similarities within the reference or to read errors). An expert bioinformatician can often get better results than someone following an "off-the-shelf" script by writing scripts that deal in sensible ways with these unmapped and multiply-mapped reads. In any case, the final result is an expression matrix, usually with features in the rows and samples (or lanes, if the samples are split across lanes) in the columns. These are the data that will go into the differential expression analysis. You should also keep the information about how many reads did not map and how many mapped to multiple locations - this information is required to interpret your statistical results.
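As a minimal sketch of the final product, the expression matrix can be assembled from per-sample feature counts like this (feature and sample names are invented for illustration):

```python
# Hypothetical per-sample counts, as produced after mapping.
sample_counts = {
    "sample1": {"geneA": 120, "geneB": 0, "geneC": 35},
    "sample2": {"geneA": 98,  "geneB": 5, "geneC": 40},
}

# Features in rows, samples in columns, as described in the text.
features = sorted({f for counts in sample_counts.values() for f in counts})
samples = sorted(sample_counts)
matrix = [[sample_counts[s].get(f, 0) for s in samples] for f in features]

for f, row in zip(features, matrix):
    print(f, row)
```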

One note of caution! Often sequencing facilities and labs convert the mapped reads to other units such as counts per million reads (CPM) or reads per kilobase of transcript per million mapped reads (RPKM). These types of data are not suitable for the analyses that we will be doing here. The error structure of the data depends on the number of reads. It does not matter that a very small gene, at the same expression level, produces 1/10 of the reads that a huge gene might; all that matters is how many reads you have actually counted. This is because, as we will see in the next section, the variance of a count is related to its mean. If you convert to reads per anything, you will not be able to recover the variance.
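The mean-variance point can be illustrated with simulated counts: for Poisson-distributed counts the variance equals the mean, so the raw count itself tells you how noisy it is, while "per million" rescaling obscures this. A small simulation, using a simple textbook Poisson sampler rather than any RNA-seq tool:

```python
import math
import random

random.seed(0)

def poisson(lam, n):
    # Knuth's Poisson sampler; fine for small lam.
    out = []
    threshold = math.exp(-lam)
    for _ in range(n):
        k, p = 0, 1.0
        while p > threshold:
            k += 1
            p *= random.random()
        out.append(k - 1)
    return out

counts = poisson(10, 50_000)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# For a Poisson count the variance tracks the mean (both near 10 here);
# dividing the counts by a library size would destroy this relationship.
print(round(mean, 1), round(var, 1))
```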

Another important piece of information that is lost when converting to reads per anything is the total number of reads for the sample. For example, in a cancer study we thought that we had a subpopulation of cells in which only a subset of the genes was expressed. It turned out that some of the samples produced only a few thousand reads, while others produced tens of millions. Many of the genes expressed at low levels were simply not detected in the samples with few reads. This was not evident when we received the data as RPKM, but was very obvious once we saw the total reads for each sample. The variability in the total reads was due to technical difficulties, not to the cancer tissues.

From Lanes to Samples

Our units of analysis are features and RNA samples. In many studies, sequencing lanes and samples are not the same. Mapping identifies the features. We also need to summarize by sample.

In some studies, the RNA samples are split across several lanes. It turns out that the error structure is preserved if we simply sum up the reads from each sample to obtain the total reads for each feature in the sample.
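Summing a split sample's lanes can be sketched as follows (sample, lane and feature names are invented for illustration):

```python
# Hypothetical per-lane counts; sampleA was split across two lanes.
lane_counts = {
    ("sampleA", "lane1"): {"geneX": 50, "geneY": 7},
    ("sampleA", "lane2"): {"geneX": 44, "geneY": 9},
    ("sampleB", "lane3"): {"geneX": 61, "geneY": 2},
}

# Sum reads per feature over each sample's lanes.
sample_totals = {}
for (sample, _lane), counts in lane_counts.items():
    totals = sample_totals.setdefault(sample, {})
    for feature, n in counts.items():
        totals[feature] = totals.get(feature, 0) + n

print(sample_totals["sampleA"])  # {'geneX': 94, 'geneY': 16}
```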

In some studies, the RNA samples are barcoded and multiplexed so that several samples are sequenced together. As the reads are mapped to the reference, the bar codes need to be read so that they can also be assigned to samples.

When we finish the mapping and sample assignment process, we should have a data matrix of counts. Typically each row of the matrix is a feature and each column is a sample. The data have the form of counts n_ij, the number of reads mapping to feature i in sample j.

Library Size

The RNA that was sequenced is called the RNA library. With longer read lengths and more accurate sequencing, these days in most organisms, most of the reads are mapped.

Library size could mean one of two things: the total number of reads that were sequenced in the run or the total number of mapped reads. We will use the total number of mapped reads as the library size in our analyses. Normalization of RNA-seq data proceeds by computing an "effective" library size, which is computed from the actual library size and the distribution of the counts.


“Counts” usually refers to the number of reads that align to a particular feature. I’ll refer to the count for feature i by the random variable X_i. These numbers are heavily dependent on two things: (1) the number of fragments you sequenced (this is related to relative abundances) and (2) the length of the feature, or, more appropriately, the effective length. The effective length is the number of possible start positions at which the feature could have generated a fragment of the observed length. In practice, the effective length is usually computed as

    l̃_i = l_i − μ_FLD + 1,

where l_i is the length of feature i and μ_FLD is the mean of the fragment length distribution, which is learned from the aligned reads. If the abundance estimation method you’re using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.
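A minimal sketch of this computation, assuming the standard definition (feature length minus the mean fragment length, plus one):

```python
# Effective length: the number of positions in the feature at which a
# fragment of average length could start.
def effective_length(feature_length, mean_fragment_length):
    return feature_length - mean_fragment_length + 1

# e.g. a 2000 bp transcript with a 200 bp mean fragment length:
print(effective_length(2000, 200))  # -> 1801
```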

Since counts are NOT scaled by the length of the feature, all units in this category are not comparable within a sample without adjusting for the feature length. This means you can’t sum the counts over a set of features to get the expression of that set (e.g. you can’t sum isoform counts to get gene counts).

Counts are often used by differential expression methods since they are naturally represented by a counting model, such as a negative binomial (NB2).

Effective counts

When eXpress came out, it began reporting “effective counts”. These are essentially the same thing as standard counts, except that they are adjusted for the amount of bias in the experiment. Effective counts are computed as

    X̃_i = X_i · (l_i / l̃_i),

where X_i is the observed count for feature i, l_i its length and l̃_i its effective length.

The intuition here is that if the effective length is much shorter than the actual length, then in an experiment with no bias you would expect to see more counts. Thus, the effective counts are scaling the observed counts up.
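That scaling can be sketched as follows, assuming effective counts are the observed counts multiplied by the ratio of actual to effective length (the function name is invented for illustration):

```python
# Effective counts: observed counts scaled up when the effective length
# is shorter than the actual feature length.
def effective_counts(observed, length, eff_length):
    return observed * (length / eff_length)

# A 2000 bp feature with effective length 1801: counts scale up slightly.
print(effective_counts(100, 2000, 1801))  # a bit over 111
```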

Counts per million

Counts per million (CPM) mapped reads are counts scaled by the total number of fragments you sequenced (N) times one million:

    CPM_i = (X_i / N) · 10^6.

This unit is related to the FPKM without length normalization, up to a factor of 10^−3:

    CPM_i = FPKM_i · l̃_i · 10^−3.
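The CPM computation itself is a one-liner; a sketch with invented numbers:

```python
# CPM: a feature's count divided by the total mapped fragments, times 1e6.
def cpm(count, total_fragments):
    return count / total_fragments * 1_000_000

# 500 reads for a gene in a library of 25 million mapped fragments:
print(cpm(500, 25_000_000))  # -> 20.0
```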

I’m not sure where this unit first appeared, but I’ve seen it used with edgeR and talked about briefly in the limma voom paper.

A Beginner's Guide to Analysis of RNA Sequencing Data

Since the first publications coining the term RNA-seq (RNA sequencing) appeared in 2008, the number of publications containing RNA-seq data has grown exponentially, hitting an all-time high of 2,808 publications in 2016 (PubMed). With this wealth of RNA-seq data being generated, it is a challenge to extract maximal meaning from these datasets, and without the appropriate skills and background, there is risk of misinterpretation of these data. However, a general understanding of the principles underlying each step of RNA-seq data analysis allows investigators without a background in programming and bioinformatics to critically analyze their own datasets as well as published data. Our goals in the present review are to break down the steps of a typical RNA-seq analysis and to highlight the pitfalls and checkpoints along the way that are vital for bench scientists and biomedical researchers performing experiments that use RNA-seq.

Keywords: RNA sequencing; bioinformatics; data analysis; transcriptomics.


[Figure captions (truncated in source): Assessing inter- and intragroup variability - (A) principal component (PC) analysis plot…; Determining a low count threshold - (A) the number of genes at…; The effect of group size and intragroup variance on ability to identify differentially expressed genes; Distribution of ANOVA P values for (A) all (n =…); Effect of group size and intragroup variance on ability to identify gene clusters; Individual gene analysis - RPKM expression values for the Cdk2, Il1b, and…]


RNA-Seq data

Four RNA-Seq samples were generated from cell-cycle-synchronized, serum-starved human fibroblasts (NHDF) (31). Briefly, cells were starved for 48 h and then harvested at 0 h and following serum refeeding at 12, 18 and 24 h, as the cells underwent synchronized cell division. RNA-Seq analysis was performed on an Illumina HiSeq2500 (Illumina, San Diego, California, USA) with 100 bp paired-end sequencing according to the manufacturer's recommendations and performed by Edinburgh Genomics (Edinburgh, UK) using the TruSeq™ RNA Sample Prep Kit (Illumina). Poly-(A) RNA was isolated and fragmented to produce fragments of 180 bp on average. Fragmented RNA was reverse transcribed, and the single-stranded DNA template was used to generate double-stranded cDNA, which was blunt-ended using T4 DNA polymerase prior to the addition of an adenosine base to assist ligation of the sequencing adapters. Flow cell preparation was carried out according to Illumina protocols; the libraries were denatured and diluted to a concentration of 15 pM for loading into the flow cells. RNA-Seq data were processed using the Kraken pipeline, a set of tools for quality control and analysis of high-throughput sequence data (32). Expression levels were reported as fragments per kilobase of transcript per million mapped reads (FPKM).

To analyse a broader range of samples, RNA-Seq data from a human tissue atlas (33) representing 27 different tissues were downloaded from the ArrayExpress database (E-MTAB-1733). Primary visualization of the data was performed using IGV to inspect the reads mapped on to the reference genome at particular loci or genes across samples. Long-read sequencing data of human heart, liver and lung samples released by Pacific Biosciences (PacBio) (34) were also used for comparison with the transcript assemblies generated from the short-read data.

Preparation of files for transcript visualization

The pipeline described below is based around a set of linked bash and Python scripts that perform the following tasks. Initial QC and read mapping to the reference genome (GRCh38) were performed using BowTie v1.1.0 (35). Sequence mapping data (BAM) were converted to a text file suitable for graph visualization in the free and open-source tool Graphia Professional (Kajeka Ltd, Edinburgh, UK). Firstly, BAM files were sorted according to mapped chromosomal location using sort from SAMtools (36). The R package GenomicRanges (37) was used to create node annotation information from a GTF file describing gene structure (Ensembl version GRCh38). The output from this step was a tab-delimited file containing read mappings on Ensembl transcript and exon features. Exon-junction-spanning reads are assigned to the exon in which the majority of their sequence resides. This information can be overlaid on to graphs using the class sets function of Graphia, such that upon selection of an Ensembl transcript ID, nodes representing reads that map to this transcript model will be coloured according to exon number.

The next step was to define the similarity between reads mapping to a gene of interest from the BAM and GTF files. A FASTA file containing all sequences mapping to a particular gene was extracted and the supporting information used for the visualization of transcript isoforms in the context of the resultant graph. For read-to-read comparison MegaBLAST ( 38) was used to generate a similarity matrix with edge weights derived from the alignment bit score. Parameterization of this step, i.e. defining the threshold for % sequence similarity (p) and length (l) over which two sequences must be similar in order for an edge to be drawn between them is of particular importance. Ideally, a graph should contain the maximum number of reads (nodes), connected by a minimum number of edges and where possible give rise to a single graph component, i.e. a single group of connected nodes that together represent the mRNA species of interest. For high coverage transcripts, more stringent parameters may be desirable.
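The edge-drawing rule described above can be sketched as a simple filter: an edge is drawn between two reads when their alignment meets a percentage-identity threshold (p) over a minimum overlap length (l). The similarity scores below are invented stand-ins for MegaBLAST output:

```python
# Hypothetical pairwise alignment results:
# (read_a, read_b, percent_identity, alignment_length)
alignments = [
    ("read1", "read2", 99.0, 80),
    ("read1", "read3", 97.0, 90),   # fails the identity threshold
    ("read2", "read3", 98.5, 15),   # fails the length threshold
]

def build_edges(alignments, p=98.0, l=20):
    """Keep only pairs similar enough, over a long enough stretch."""
    return [(a, b) for a, b, ident, length in alignments
            if ident >= p and length >= l]

print(build_edges(alignments))  # -> [('read1', 'read2')]
```

Tightening p and l prunes edges, which is why more stringent settings suit high-coverage transcripts.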

Exploration of graph structure using simulated gene with multiple splicing events

Artificial transcript models representing two splice variants of the same 2706 bp gene were assembled from 10 exons of the gene TTN, the selected exons being between 261 and 282 bp in length. When combined, the two simulated transcripts incorporated an alternative start site (E1a, E1b), mutually exclusive exons (E3a, E3b), a skipped exon (E5) and an alternative 5′ donor site (a 20 bp shorter E7). Using ART (version MountRainier) (39), two levels of sequencing depth/transcript abundance were simulated, so as to provide either 250 or 1000 125 bp reads per transcript model. For each level of transcript abundance, the simulated reads for the two transcripts were merged into a single FASTQ file and aligned to the reference genome (GRCh37) with HISAT (hierarchical indexing for spliced alignment of transcripts) (40). RNA assembly graphs were generated from the resulting BAM files using a percentage similarity threshold (p = 98) and three settings for the length coverage threshold (l = 20, 40, 80). The resultant graphs were visualized in Graphia Professional (Kajeka Ltd) (Figure 2).

Graph layout

The size and unusual topology of DNA/RNA sequence graphs necessitates the use of a highly optimized graph layout approach. Following experimentation (see Supplementary Figure S1 ), the Fast Multipole Multilevel Method (FMMM) ( 41) was shown to be well suited to the layout of these types of graphs. The FMMM algorithm was reimplemented in Java from the Open Graph Drawing Framework (OGDF) ( 42) and incorporated into the Graphia Professional code base ( 29), adding uniquely the ability to perform FMMM graph layout in 3D space. In general, the higher the FMMM quality setting, the more linear a graph becomes, but at the cost of computational runtime.

Collapsing of redundant reads

In the case of highly expressed genes, there can be a significant degree of redundancy in read coverage, i.e. reads of exactly the same sequence may be present in the data many times. Redundant reads add nothing to the interpretation of transcript structure and make the read-to-read comparison step unnecessarily time-consuming and the resultant graph sometimes difficult or impossible to visualize due to its size. Using Tally from the Kraken package ( 32), multiple identical reads were mapped to a single identifier that incorporates the number of occurrences of that specific sequence. When the read unification mode is employed, a single node is used to represent multiple identical reads, where the diameter of a node is proportional to the original number of reads it represents.
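The collapsing step can be sketched with a simple counter (the sequences and identifier format below are invented; Tally's actual output format may differ):

```python
from collections import Counter

# Invented example reads; identical sequences appear multiple times.
reads = ["ACGTACGT", "ACGTACGT", "TTGGCCAA", "ACGTACGT", "TTGGCCAA"]

# One identifier per unique sequence, recording its multiplicity.
collapsed = Counter(reads)

# The multiplicity can then drive node diameter in the graph.
for i, (seq, n) in enumerate(sorted(collapsed.items()), start=1):
    print(f"read{i}_x{n}\t{seq}")
```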

Analysis of the graph structure

Initially, we chose to examine a set of 550 genes whose expression was up-regulated as fibroblasts entered into S-M phases of the cell cycle (18–24 h after being refed serum). A graph derived from the 24 h data was plotted for each gene using MegaBLAST parameters p = 98, l = 31. Where the topology of a given gene graph was relatively simple, an explanation of its structure required only the overlay of individual transcript exon information in order to identify splice variant(s) represented. In other cases more detailed analyses were required. Other graphs were generated from the human tissue atlas data available at ArrayExpress (E-MTAB-1733) ( 33). In the tissue samples, reads may originate from multiple cell types expressing different isoforms of the same gene. The 100 bp paired-end reads for each tissue were individually mapped to the human genome (Ensembl GRCh38.82) with STAR v2.3.0 ( 43). The output from the mapping process (BAM files) was used to generate RNA assembly graphs using our pipeline. Publicly available data for TPM1 were used to compare the network-based RNA-Seq approach with Pacific Biosciences (PacBio) long-read results obtained through their website ( 34). The TPM1 gene models from both data were compared for heart, brain and liver.

Validation of splice variants using RT-PCR

To validate the existence of splice variants predicted by the graph analyses, reverse transcription polymerase chain reaction (RT-PCR) of candidate splice variants was performed. Total RNA from the human fibroblasts used for the RNA-Seq experiment was reverse transcribed to generate single-stranded cDNA. Primers were designed using the Primer3 software (44) to amplify the region for validation of each splice variant. For LRR1, a pair of primers was designed to amplify three splice variants as suggested by the graph visualization, while for PCM1 two pairs of primers were designed across two different splice variant locations. LRR1: forward primer 5′-TGTTGAGCCTCTGTCAGCAG-3′ and reverse 5′-GTGTGGGCAACAGAATGCAG-3′. PCM1 (primer set 1): forward primer 5′-TCTGCTAATGTTGAGCGCCT-3′ and reverse 5′-TGCAGAGCTAGAAGTGCAGC-3′. PCM1 (primer set 2): forward 5′-ACGGAAGAAGACGCCAGTTT-3′ and reverse 5′-AGCTGCAGCTCATGGAAGAG-3′. PCR was carried out for 35 cycles (92°C, 30 s; 60°C, 90 s; 72°C, 60 s). The amplicons were run on a 2% agarose gel in the presence of SYBR-Safe DNA gel stain (Thermo Fisher, Waltham, MA, USA) and the gels visualized by UV illumination.

Access to the pipeline and web-based NGS Graph generator

Documentation and full source code for the NGS Graph Generator package can be downloaded from: The user needs to supply BAM and GTF files to run the pipeline. In addition, we have developed a web interface, designed for demonstration purposes rather than the analysis of a user's own data, that allows the pipeline to be run on a number of predefined datasets. This web interface is called NGS graph generator and can be accessed at Using this resource, a user can select a BAM file from RNA-Seq time-course samples of human fibroblasts or data from the human tissue atlas. Users can adjust the parameters used by MegaBLAST to compute read similarity, and there is an option to discard identical reads. The processing time required depends on the number of reads mapping to the gene of interest. The user must provide their email address and will be informed once the job has finished. The resultant graph layout file will automatically open in Graphia Professional (if installed). Protocols for graph generation and visualization are provided in Supplementary File S1 and a video of graph visualizations is provided in Supplementary File S2.


Preparation to teach this Lesson begins with the instructor acquiring a Galaxy user account (as described in Supporting File S1: RNA-seq Student Tutorial I) and downloading the appropriate data files from NCBI SRA into the user History. At that point, the instructor can work through the complete tutorial at her/his own pace, identifying any potential changes to the Galaxy website and/or instructions that may need to be clarified. The instructor should also carefully review the annotated PowerPoint presentations (Supporting File S4: RNA-seq Annotated Instructor PowerPoints) and potentially do background reading on high throughput sequencing (19,20), RNA-seq (7,18), and/or Galaxy (15). The Lesson is designed for a computer laboratory over three lab sessions (~8 hr total). Students should be introduced to high throughput sequencing and the concept of RNA-seq (e.g., in the lecture section of the course) before starting the tutorial.

The first lab session has the longest instructor presentation, since both the key experimental background and the Galaxy platform are reviewed. Before the lab, students are assigned to read Afgan et al. (2016), which introduces Galaxy, and Shanks et al. (2016), which introduces the Arabidopsis/nematode experimental system (14,15). There is typically enough time to do a brief (~30 minute) round table discussion of the Afgan et al. (2016) article before students begin to work on the tutorial (Supporting File S1: RNA-seq Student Tutorial I). We randomly assign half of the students to work with an "infected replicate 1" file (RNA isolated from A. thaliana roots infected with H. schachtii; NCBI SRR2221834), and the other half work with the "control replicate 1" file (RNA from uninfected A. thaliana roots; NCBI SRR2221833). Students then individually follow the tutorial, which provides detailed and illustrated instructions on registering for a Galaxy user account and uploading the appropriate sequence file from NCBI SRA and the A. thaliana genome annotation file from Ensembl Plants into their Galaxy History. Thus, by the time students reach the end of Tutorial I, they will have established a Galaxy user account and acquired all of the files that they will need to "hit the ground running" during the next lab session. Throughout the first tutorial, students will encounter eight questions, which review basic experimental design concepts and ask them to explore their data. We encourage students to answer the questions as they work through the tutorial and hand in their written answers at the end of the lab period. Alternatively, instructors could ask students to answer some or all of the questions orally or via an instant response polling system as they work through the exercise, allowing "just in time" feedback to clear up potential areas of confusion.

During the second laboratory session, students perform read quality control (using FastQC), read trimming (using Trimmomatic), and read mapping (using HISAT2) (22-24) (Supporting File S2: RNA-seq Student Tutorial II). After performing these steps "manually," they create a computational workflow using Galaxy's "Create Workflow" function. This allows the students to automatically download and analyze two additional RNA-seq data files ("infected replicates" 2+3 or "control replicates" 2+3) without additional hands-on time. Before the students begin the tutorial, the instructor presents the concepts behind each of these computational steps using the provided PowerPoint slides (Supporting File S4: Annotated Instructor PowerPoints). Again, a series of embedded questions keeps the students conceptually "on track" to assure that the tutorial does not simply become a mindless "point and click" exercise. When the second laboratory is complete, students will have generated Counts Tables from three "control" or three "nematode infected" samples. Again, all files are stored on each student's individual Galaxy History (250 gigabyte capacity), so no on-site storage is needed.

It is during the final laboratory session that students begin to see the biologically relevant "payoff" of the RNA-seq experiment: A list of all genes that are significantly up- or down-regulated in A. thaliana roots in response to nematode infection (Supporting File S3: RNA-seq Student Tutorial III). First, students must share Counts Table data with a "partner" (using Galaxy's Share History function) so that they have three "control" replicates and three "nematode infected" replicates in their History (at Wooster, each three-person team processed all six raw data files on their own). Differentially expressed genes are then identified using DESeq2 (25). The differentially expressed gene list from DESeq2 is exported to Excel and sorted. Finally, over-represented functional categories are identified amongst the up- and down-regulated genes using the online Panther Gene List Analysis tool (26). Again, the conceptual basis and operation of all of these tools are introduced in the instructor "pre-lab" presentation. Because students can now begin to glean biological meaning from their data, the questions embedded in the third tutorial are more frequent and detailed, and students typically need to spend time outside of the laboratory session to thoroughly answer them.

A timeline of the Lesson is provided in Table 1, below. In addition, all associated materials are provided as Supporting Files. Specifically, student tutorials are provided as Supporting Files S1-S3 (Supporting File S1: RNA-seq Student Tutorial I; Supporting File S2: RNA-seq Student Tutorial II; Supporting File S3: RNA-seq Student Tutorial III). The three-part Instructor PowerPoint and a grading key for the tutorial questions are also provided (Supporting File S4: Annotated Instructor PowerPoints; Supporting File S5: Instructor grading key). Finally, a document containing additional instructor background and instructions on identifying alternative RNA-seq data sets on NCBI SRA is included (Supporting File S6: Additional Instructor Background).

Table 1. RNAseq - Teaching Timeline


We conclude that the PLDA algorithm with power transformation and the voomNSC classifiers may be the sparse methods of choice if one aims to obtain accurate models for RNA-Seq classification. The SVM and RF algorithms are the overall winners among nonsparse classifiers. When sparsity is the measure of interest, the voomNSC classifiers should be the preferred methods. Along with its accurate and sparse performance, the voomNSC method is fast and applicable to even very large RNA-Seq datasets. Beyond prediction, the voomNSC classifier can be used to identify potential diagnostic biomarkers for a condition of interest. In this way, a small subset of genes relevant to distinguishing the different classes can be detected. These genes can then be investigated further, for example to discover additional genes that interact with them. We leave extending this model to incorporate known biomarkers as a follow-up research study.

Toxicogenomics – A Drug Development Perspective

NGS Technologies – Sequencing-based Approaches for Transcriptomics Study

The arrival of deep sequencing applications for transcriptome analyses, RNA-Seq, may circumvent the above-mentioned disadvantages of microarray platforms. In contrast to microarray, transcriptome sequencing studies have evolved from determining the sequence of individual cDNA clones to more comprehensive attempts to construct cDNA sequencing libraries representing portions of a species' transcriptome [69–72]. The use of sequencing technologies to study the transcriptome is termed RNA-Seq [73,74]. RNA-Seq uses recently developed deep sequencing technologies. In general, a population of RNA is converted to a library of cDNA fragments with adaptors attached to one or both ends. Each molecule, with or without amplification, is then sequenced in a high-throughput manner to obtain short sequences from one or both ends. In principle, any high-throughput sequencing technology can be used for RNA-Seq. This methodology has tremendously reduced the sequencing cost and experimental complexity, as well as improved transcript coverage, rendering sequencing-based transcriptome analysis more readily available and useful to individual laboratories. RNA-Seq technologies have demonstrated some distinct advantages over hybridization-based approaches such as microarrays that will likely enable them to dominate in the near future.

Currently, there are four major commercially available NGS technologies: Roche/454, Illumina HiSeq 2000, Applied Biosystems SOLiD, and Helicos HeliScope. Illumina’s NGS platforms have a strong presence. Their sequencing-by-synthesis approach [75–78] utilizes fluorescently labeled reversible-terminator nucleotides on clonally amplified DNA templates immobilized to an acrylamide coating on the surface of a glass flow cell. The Illumina Genome Analyzer and the more recent HiSeq 2000 have been widely used for high-throughput massively parallel sequencing. In 2011, Illumina also released a lower throughput fast-turnaround instrument, the MiSeq, aimed at smaller laboratories and the clinical diagnostics market.

Although RNA-Seq is unlikely to completely supplant hybridization-based techniques in the near future, it offers a number of improvements over these technologies, for example:

  • unlike hybridization-based approaches, RNA-Seq does not depend on prior knowledge of the transcriptome, and is thus capable of novel discovery and can reveal the precise boundaries of transcripts to single-base precision [79];

  • the technique can also yield information about exon junctions, allowing the study of complex transcription units [80];

  • RNA-Seq has inherently low background and high sensitivity, and its upper detection limit is not constrained, together allowing the study of transcription across a much wider dynamic range than microarrays [56,81].

A discussion of the considerable differences between available RNA-Seq technologies is beyond the scope of this chapter. However, these technologies share many common features. First, the RNA sample is either mRNA enriched or ribosomal RNA depleted. The choice depends on the intent of the experiment. A gene expression profiling experiment would enrich the mRNA and ignore the other RNA species, while an experiment focused on transcriptome characterization would deplete the ribosomal RNA, leaving the mRNA, ncRNA, miRNA, and siRNA. Next, the RNA is fragmented and size selected. The size of RNA fragments required depends on the specific technology. Third, the fragments are reverse-transcribed into cDNA and are clonally amplified and tagged so that they can be attached to beads. The bead-bound fragments are then loaded into a fluidics chamber in the sequencer and sequenced. The chemistry of sequencing varies between the platforms. However, each chemical change in the fluidics chamber (pH in the case of Ion Torrent, fluorescence for the other technologies) corresponds to a specific base, and the sequence is recorded. The technologies described above all rely on the amplification of fragments via polymerase chain reaction (PCR), which will introduce bias and change the relative proportions of the RNA species present. Other technologies, referred to as 'single-molecule sequencing' or 'third-generation sequencing', avoid this amplification step and its attendant bias. However, these technologies have not yet been widely adopted by the scientific community.

Taking all of these advantages into account, RNA-Seq represents a paradigm shift in transcriptomics studies, with concomitant benefits for toxicogenomics. This technology has already been extensively applied to biological research, resulting in significant and remarkable insights into the molecular biology of cells [82–84] . The pharmaceutical industry has already embraced sequence-based technologies, and it is likely that these technologies will have their impact throughout the drug discovery process [85–87] .

A Simple Guideline to Assess the Characteristics of RNA-Seq Data

Next-generation sequencing (NGS) techniques have been used to generate various molecular maps including genomes, epigenomes, and transcriptomes. Transcriptomes from a given cell population can be profiled via RNA-seq. However, there is no simple way to assess the characteristics of RNA-seq data systematically. In this study, we provide a simple method that can intuitively evaluate RNA-seq data using two different principal component analysis (PCA) plots. The gene expression PCA plot provides insights into the association between samples, while the transcript integrity number (TIN) score plot provides a quality map of given RNA-seq data. With this approach, we found that RNA-seq datasets deposited in public repositories often contain a few low-quality samples that can lead to misinterpretations. The effect of sampling errors for differentially expressed gene (DEG) analysis was evaluated with ten RNA-seq datasets from invasive ductal carcinoma tissues and three RNA-seq datasets from adjacent normal tissues taken from a Korean breast cancer patient. The evaluation demonstrated that sampling errors, which select samples that do not represent a given population, can lead to different interpretations when conducting the DEG analysis. Therefore, the proposed approach can be used to avoid sampling errors prior to RNA-seq data analysis.
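The gene-expression half of this two-plot idea is easy to prototype. Below is a rough, dependency-free Python sketch that scores each sample on the first principal component via power iteration on the gene-gene covariance matrix; the expression values are invented, and a real analysis would use thousands of genes and a proper PCA routine:

```python
import math

# Toy expression matrix: rows = samples, columns = genes (log-scale values).
# Samples 0-2 mimic one condition, samples 3-4 another; values are made up.
expr = [
    [5.1, 0.2, 3.3, 7.8],
    [5.0, 0.1, 3.5, 7.9],
    [5.2, 0.3, 3.1, 7.7],
    [1.0, 6.5, 0.4, 2.2],
    [1.1, 6.4, 0.5, 2.1],
]

def first_pc_scores(matrix, iters=200):
    """Score each sample on the first principal component, using power
    iteration on the gene-gene covariance matrix to find its dominant
    eigenvector."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in matrix]
    # covariance matrix (m x m)
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    v = [1.0] * m                       # initial direction
    for _ in range(iters):              # power iteration
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # project each centered sample onto the dominant eigenvector
    return [sum(centered[i][j] * v[j] for j in range(m)) for i in range(n)]

scores = first_pc_scores(expr)
# Replicates of the same condition land close together on PC1,
# while the two conditions separate widely.
print(scores)
```

In a PCA quality plot, a sample that fails to cluster with its replicates in this way is a candidate outlier worth inspecting before DEG analysis.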

RNA-seq Tutorial (with Reference Genome)

This tutorial will serve as a guideline for how to go about analyzing RNA sequencing data when a reference genome is available. We will be going through quality control of the reads, alignment of the reads to the reference genome, conversion of the files to raw counts, analysis of the counts with DESeq2, and finally annotation of the reads using biomaRt. Most of this will be done on the BBC server unless otherwise stated.

The packages we’ll be using can be found here: Page by Dister Deoss

The data we will be using are comparative transcriptomes of soybeans grown at either ambient or elevated O3 levels. Each condition was done in triplicate, giving us a total of six samples we will be working with. The paper that these samples come from (which also serves as a great background reading on RNA-seq) can be found here:

The samples we will be using are described by the following accession numbers: SRR391535, SRR391536, SRR391537, SRR391538, SRR391539, and SRR391541. They can be found in results 13 through 18 of the following NCBI search:

The script for downloading these .SRA files and converting them to fastq can be found in

/common/RNASeq_Workshop/Soybean/Quality_Control as the file . The fastq files themselves are also already saved to this same directory.

Quality Control on the Reads Using Sickle:

Step one is to perform quality control on the reads using Sickle. We are using unpaired reads, as indicated by the “se” flag in the script below. The -f flag designates the input file, -o is the output file, -q is our minimum quality score and -l is the minimum read length. The trimmed output files are what we will be using for the next steps of our analysis.
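For intuition, here is a simplified Python sketch of what a 3'-end quality trimmer does with the -q and -l thresholds. Sickle itself uses a sliding-window algorithm, so this is only an approximation, and the read and quality values below are invented:

```python
def trim_read(seq, quals, min_qual=20, min_len=20):
    """Trim low-quality bases from the 3' end of a read; return None if
    the trimmed read falls below min_len. A simplified stand-in for what
    quality trimmers like Sickle do (Sickle uses a sliding window)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1                      # drop trailing low-quality bases
    trimmed = seq[:end]
    return trimmed if len(trimmed) >= min_len else None

# A made-up read: good quality at the start, degrading at the 3' end.
seq = "ACGTACGTACGTACGTACGTACGTAC"
quals = [35] * 22 + [10, 8, 5, 3]
print(trim_read(seq, quals))  # keeps the first 22 bases
```

Reads whose trimmed length falls under the -l cutoff are discarded entirely, which is why the trimmed fastq files can contain fewer reads than the input.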

The script for running quality control on all six of our samples can be found in

/common/RNASeq_Workshop/Soybean/Quality_Control as the file . The output trimmed fastq files are also stored in this directory.

Alignment of Trimmed Reads Using STAR:

For this next step, you will first need to download the reference genome and annotation file for Glycine max (soybean). The files I used can be found at the following link:

You will need to create a user name and password for this database before you download the files. Once you’ve done that, you can download the assembly file Gmax_275_v2 and the annotation file Gmax_275_Wm82.a2.v1.gene_exons. Having the correct files is important for annotating the genes with Biomart later on.

Now that you have the genome and annotation files, you will create a genome index using the following script:

You will likely have to alter this script slightly to reflect the directory that you are working in and the specific names you gave your files, but the general idea is there. Indexing the genome allows for more efficient mapping of the reads to the genome.
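The payoff of indexing can be sketched in a few lines of Python: precompute where every k-mer occurs, and each read then needs only a dictionary lookup plus verification rather than a scan of the whole genome. STAR's actual index is a suffix array and handles mismatches and splice junctions; this toy, with an invented genome sequence, is illustrative only:

```python
from collections import defaultdict

def build_index(genome, k=4):
    """Map every k-mer in the genome to the positions where it occurs.
    A toy stand-in for the index STAR builds during genomeGenerate."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def locate(index, genome, read, k=4):
    """Look up the read's first k-mer, then verify the full read at each
    candidate position."""
    return [p for p in index.get(read[:k], [])
            if genome[p:p + len(read)] == read]

genome = "TTACGGATCCGGATACGGATTT"   # invented mini-genome
index = build_index(genome)
print(locate(index, genome, "GGATCC"))  # exact start positions of this read
```

Building the index is the expensive step, which is why it is done once up front rather than per read.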

The assembly file, annotation file, as well as all of the files created from indexing the genome can be found in

Now that you have your genome indexed, you can begin mapping your trimmed reads with the following script:

The --genomeDir flag refers to the directory in which your indexed genome is located. The output we get from this step is a set of .bam files: binary alignment files that will be converted to raw counts in our next step.

The script for mapping all six of our trimmed reads to .bam files can be found in

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping as the file . The .bam output files are also stored in this directory.

Convert BAM Files to Raw Counts with HTSeq:

Finally, we will use HTSeq to transform these mapped reads into counts that we can analyze with R. "-s" indicates we do not have strand-specific counts. "-r" indicates the order in which the reads were generated; for us it was by alignment position. "-t" indicates the feature from the annotation file we will be using, which in our case will be exons. "-i" indicates which attribute we will use from the annotation file, here the PAC transcript ID. We specify that we are pulling in a .bam file ("-f bam"), then name the input file and say where the output will go.
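Conceptually, the counting step reduces to overlap-testing each alignment against annotated features and tallying hits per transcript ID. The toy Python sketch below illustrates this; htseq-count itself also handles ambiguous and multi-mapped reads, which are omitted here, and the coordinates and IDs are invented:

```python
# Transcript ID -> (start, end) of its exon on the genome; values invented.
exons = {
    "PAC:001": (100, 200),
    "PAC:002": (300, 450),
}

def count_reads(alignments, exons):
    """Tally reads per transcript. alignments is a list of
    (read_start, read_end) pairs, sorted by position (the tutorial's
    .bam files are likewise sorted by alignment position)."""
    counts = {tid: 0 for tid in exons}
    for start, end in alignments:
        for tid, (ex_start, ex_end) in exons.items():
            if start < ex_end and end > ex_start:   # half-open overlap test
                counts[tid] += 1
                break
    return counts

reads = [(110, 160), (150, 210), (320, 370), (500, 550)]
print(count_reads(reads, exons))  # the last read overlaps no feature
```

The resulting table of per-feature counts is exactly the shape of input DESeq2 expects.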

The script for converting all six .bam files to .count files is located in

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping as the file The .count output files are saved in

Analysis of Counts with DESeq2:

For the remaining steps I find it easier to work from a desktop rather than the server. So you can download the .count files you just created from the server onto your computer. You will also need to download R to run DESeq2, and I'd also recommend installing RStudio, which provides a graphical interface that makes working with R scripts much easier. They can be found here:

The R DESeq2 library also must be installed. To install this package, start the R console and enter:

The R code below is long and slightly complicated, but I will highlight major points. This script was adapted from here and here, and much credit goes to those authors. Some important notes:

  • The most important information comes out as "-replaceoutliers-results.csv"; there we can see adjusted and normal p-values, as well as the log2 fold change, for all of the genes.
  • par(mar) manipulation is used to make the most appealing figures, but these values are not the same for every display or system or figure. Much documentation is available online on how to manipulate and best use par() and ggplot2 graphing parameters: DESeq2 Manual
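The filtering and sorting that the tutorial does in Excel can also be scripted. Here is a hedged Python sketch, assuming illustrative column names similar to DESeq2's output (gene, log2FoldChange, padj) and invented values:

```python
import csv
import io

# A tiny stand-in for the "-replaceoutliers-results.csv" file.
results_csv = """gene,log2FoldChange,padj
GeneA,2.4,0.001
GeneB,-0.3,0.80
GeneC,-3.1,0.0004
GeneD,1.9,0.20
"""

def significant_genes(text, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes that pass both an adjusted-p-value cutoff and a
    fold-change cutoff, sorted by adjusted p-value (the same result the
    Excel sort step accomplishes by hand)."""
    rows = csv.DictReader(io.StringIO(text))
    hits = [r for r in rows
            if float(r["padj"]) < padj_cutoff
            and abs(float(r["log2FoldChange"])) >= lfc_cutoff]
    return sorted(hits, key=lambda r: float(r["padj"]))

for row in significant_genes(results_csv):
    print(row["gene"], row["log2FoldChange"], row["padj"])
```

Filtering on the adjusted p-value (rather than the raw one) is what controls the false discovery rate across the thousands of genes tested.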

The .csv output file that you get from this R code should look something like this:

Below are some examples of the types of plots you can generate from RNAseq data using DESeq2:

Merging Data and Using Biomart:

To continue with analysis, we can use the .csv files we generated from the DESeq2 analysis and find gene ontology. This next script contains the actual biomaRt calls, and uses the .csv files to search through the Phytozome database. If you are trying to search through other datasets, simply replace the "useMart()" command with the dataset of your choice. Again, the biomaRt call is relatively simple, and this script is customizable in which values you want to use and retrieve.

After fetching data from the Phytozome database based on the PAC transcript IDs of the genes in our samples, a .txt file is generated that should look something like this:

Finally, we want to merge the DESeq2 and biomaRt output.
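The merge itself is a simple join on the shared PAC transcript ID column. A minimal Python sketch, with invented column names and values:

```python
import csv
import io

# Toy stand-ins for the DESeq2 results and the biomaRt annotation output.
deseq_csv = ("pacid,log2FoldChange,padj\n"
             "32805000,2.4,0.001\n"
             "32805001,-3.1,0.0004\n")
biomart_txt = ("pacid\tgene_description\n"
               "32805000\ttransferrin receptor-like protein\n"
               "32805001\tunknown protein\n")

def merge_on_pacid(deseq_text, biomart_text):
    """Join the two tables on the shared PAC transcript ID column,
    attaching each gene's annotation to its differential-expression row."""
    annotations = {row["pacid"]: row["gene_description"]
                   for row in csv.DictReader(io.StringIO(biomart_text),
                                             delimiter="\t")}
    merged = []
    for row in csv.DictReader(io.StringIO(deseq_text)):
        row["gene_description"] = annotations.get(row["pacid"], "NA")
        merged.append(row)
    return merged

for row in merge_on_pacid(deseq_csv, biomart_txt):
    print(row["pacid"], row["padj"], row["gene_description"])
```

In R this same join is typically done with merge() on the shared ID column; the logic is identical.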

We get a merged .csv file with our original output from DESeq2 and the Biomart data:

Visualizing Differential Expression with IGV:

To visualize how genes are differentially expressed between treatments, we can use the Broad Institute's Integrative Genomics Viewer (IGV), which can be downloaded from here: IGV

We will be using the .bam files we created previously, as well as the reference genome file in order to view the genes in IGV. IGV requires that .bam files be indexed before being loaded into IGV. There is a script file located in

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping/bam_files called that will accomplish this. The .bam files themselves as well as all of their corresponding index files (.bai) are located here as well. The reference genome file is located at

You will need to download the .bam files, the .bai files, and the reference genome to your computer. Once you have IGV up and running, you can load the reference genome file by going to Genomes -> Load Genome From File… in the top menu. Now you can load each of your six .bam files onto IGV by going to File -> Load from File… in the top menu. Be sure that your .bam files are saved in the same folder as their corresponding index (.bai) files. Once you have everything loaded onto IGV, you should be able to zoom in and out and scroll around on the reference genome to see differentially expressed regions between our six samples.

The differentially expressed gene shown is located on chromosome 10, starts at position 11,454,208, and codes for a transferrin receptor and related proteins containing the protease-associated (PA) domain. This information can be found on line 142 of our merged csv file. You can search this file for information on other differentially expressed genes that can be visualized in IGV!
