Bulk RNAseq Methodology and Analysis Guide

Emory Integrated Computational Core
Required Files for Bulk RNA-seq Analysis
Deliverables
Quality Control and Alignment of FASTQ Files
Bulk RNA-seq Differential Gene Expression Analysis
- The Basics of DESeq2 Analysis
Differential Gene Expression Analysis Outputs
Example Methods Sections for Publication
- Sequencing, Alignment, and Quantification
- Differential Expression
Conclusion

Emory Integrated Computational Core

The Emory Integrated Computational Core (EICC), one of the Emory Integrated Core Facilities (EICF), is supported by the Emory University School of Medicine and the Georgia Clinical & Translational Science Alliance. Our mission is to provide cutting-edge computational support to Emory investigators with large “Omics” datasets.

Required Files for Bulk RNA-seq Analysis

Raw Data Files (FASTQ)
A spreadsheet, also known as the metadata sheet, which contains all the necessary information (FASTQ sample IDs, laboratory sample IDs, and group assignments) about experiment samples. This spreadsheet can also contain various covariate information (e.g., sex, race, age) relevant to each sample. This spreadsheet must match the FASTQ sample IDs to the subjects and associated phenotype of your study. The image below is an example of a properly formatted metadata spreadsheet.
Target organism (Human, Mouse, Rabbit, etc.). If you are using an uncommon reference organism, please provide us with a link to the reference genome and *.gtf annotation file so that we can perform the correct alignment and gene counts assignment.

Return to Table of Contents

Deliverables

All analysis results will be uploaded to a shared OneDrive folder which the client(s) will have access to immediately following the presentation of analysis outcomes. This shared OneDrive folder will contain the following.

A PowerPoint presentation will be created including all agreed upon analysis results with relevant interpretations tailored to the outcomes of your experiment.
Spreadsheets that contain raw counts, normalized counts, complete differential gene expression results, pathway analysis results (when possible), and any other analysis results previously agreed to.
Image files including, but not limited to, volcano plots, PCA plots, and heatmaps.

Return to Table of Contents

Quality Control and Alignment of FASTQ Files

Sample sequences are quality-checked using FastQC for completeness, depth and read quality. Data are trimmed using Trimmomatic to remove adapter contamination. Sequences are aligned to the appropriate reference genome using STAR aligner.
Gene quantification is done using HTSeq-count.
If you are specifically looking for long noncoding RNAs (lncRNAs), we use annotations from GENCODE lncRNAs in alignment (STAR) and gene quantification (HTSeq-count) steps [1].

Return to Table of Contents

Bulk RNA-seq Differential Gene Expression Analysis

In differential gene expression analysis, the basic underlying task is the analysis of count data from RNA-seq experiments for the detection of differentially expressed genes between two sample groups. Count data is provided as a table containing the number of sequence fragments that have been uniquely assigned to each gene. For more information about the methodology and algorithms used to generate the count table, please see the Methods section of the EICC website.

For general bulk RNA-seq analyses, we prefer the use of DESeq2 for differential gene expression results. The original DESeq2 publication, by Anders and Huber, introduces DESeq2 and outlines its principle concepts. For a detailed guide on the installation and implementation of DESeq2, please refer to the DESeq2 Vignette provided by the authors of the package.

The Basics of DESeq2 Analysis

DESeq2 requires count data obtained from RNA-seq or another high-throughput sequencing process. The count matrix is a matrix of integer values where typically each row i is a unique gene and each column j is the number of uniquely assigned reads. DESeq2 internally corrects for the appropriate library size, and it is not recommended that the original count matrix be transformed or normalized prior to being given as input.

Pre-filtering Low Count Genes

Pre-filtering the count matrix for differential expression analysis is typically performed to reduce the memory size requirements of the analysis and to increase the speed of the transformation and testing results. Here at the EICC, our computational power is such that we do not need to perform pre-filtering to increase analysis speed. Strict filtering is performed within the standard DESeq2 pipeline to increase power and is automatically applied on the normalized counts prior to differential expression analysis.

Normalization within DESeq2

DESeq2 assumes that the count matrix follows a negative binomial distribution (also known as the gamma-Poisson distribution). A mean is computed proportionally to the concentration of cDNA fragments from the genes in a given sample and then scaled by a normalization factor. For most applications, the same normalization factor can be used for all the genes in a given sample which accounts for the differences in sequencing depth between samples [1]. Gene specific normalization can be used to account for sources of technical bias, but these methods are not implemented by default.

P-values and Adjusted p-values (q-values)

To test the hypothesis that a gene is differentially expressed between one or more groups with statistical significance, DESeq2 implements the Wald Test by default. In the case of differential gene expression, the Wald Test uses the precision of the log fold change value (LFC) as a weight to compute a test statistic based on the distance between the unrestricted LFC estimate and its hypothetical value under the null hypothesis. The null hypothesis for this statistical test is that there is no difference between the two groups being tested (e.g., case and control). Each gene is tested individually and a single p-value is produced as a result.

For every statistical test, there is an acceptable false positive rate in the determination of a significant difference (usually \(\alpha =\) 0.05). Due to the extremely high dimension of RNA-seq data and therefore large number of tests, a multiple comparison adjustment is recommended when determining statistical significance. By default, DESeq2 implements the Benjamini-Hochberg False Discovery Rate (FDR) multiple testing correction. The FDR is the expected ratio of the number of false positives over the total number of significant differences found across all tests being corrected for. The FDR correction retains high power at the expense of a slight increased allowance of false positives, and is therefore the most commonly used multiple hypothesis corrections for exploratory studies.

Family-wise Error Rate (FWER) Corrections

The class of Family-wise Error Rate (FWER) test corrections are more conservative than the FDR corrections typically applied in exploratory analysis results. If the goal of the analysis is to directly confirm the difference in expression between one or more groups for a specific gene or set of genes, the application of a FWER in publication is preferred. The FWER corrections typically have reduced power and a more strict control over false positive rates. We do not provide this multiple test correction as a default due to its highly conservative nature, but FWER can be applied to your analysis results upon request.

Effect Size Estimation

We perform effect size estimation using the R package apeglm. This package provides empirical Bayes shrinkage estimators which can be used to compute s-values (Stephens, 2017). In short, these s-values provide a confidence level in the direction (or sign) of the log base 2 fold-change value. It is recommended that the threshold be much lower for significance when using these outcomes (i.e., 0.005 instead of 0.05 for significance). We provide these values by default, but leave it up to the researchers to decide to include them in publication.

We also use the apeglm package to provide lfcShrink log₂ fold-change values. The empirical Bayes shrinkage estimators (Zhu, Ibrahim, Love) use heavy-tailed prior distributions to remove noise and to help prevent extremely large differences that may appear due to technical artifacts. Often, when one sample group has an over-abundance of zeros this can lead to highly inflated fold changes that often give very high significant differences between samples. We use lfcShrink to help determine which log₂ fold-change values are more in line with what we would expect biologically, and not what may often be due to technical artifacts from sample processing in the lab.

Return to Table of Contents

Differential Gene Expression Analysis Outputs

One Drive Folder

For each analysis performed, we generate a unique sub-folder inside the “Statistical Analysis” folder of the shared OneDrive. Inside this analysis folder, you will find all the generated results described below. If, for any reason, you are unable to access the files located inside these folders please contact EICC.

Full Analysis Results

We provide the most complete picture of the analysis results as a spreadsheet associated with each individual analysis. We include this spreadsheet as a .csv file for ease-of-use across a variety of operating systems and software environments such a R and Python. A snapshot of this table is shown below.

symbol: Each row of the analysis output corresponds to a unique gene.
baseMean: The mean expression value of this gene from the raw count table.
log2FoldChange: The log₂ fold-change value of this gene, relative to the reference group used in this analysis.
LFC_shrink: Results from the effect size estimation and shrinkage of the original log fold change values.
lfcSE: The standard error of the log₂ fold-change value.
lfcSE_Shrink: The standard error of the log₂ fold-change value.
Stat: The statistic used in the calculation of the p-value. By default, this is the Wald Test.
pvalue: The p-value obtained from the statistical test.
padj: The adjusted p-value after the application of a multiple testing correction. By default, this is the FDR correction.
svalue: The result (essentially a p-value) of a statistical test on the confidence in the sign, or direction, of the log₂ fold-change value.

Principal Component Analysis

In the visualization of your analysis results, we use Principal Component Analysis to reduce the high-dimensionality of the count matrix while preserving as much of the original variation of the data as possible. This allows the plotting of each sample in two dimensions and a clearer representation of the separation between the samples and sample groups. This two-dimensional plot is of the first principal component (the x or horizontal axis) and the second principal component (the y or vertical axis).

Principle components are the eigenvectors of the original data’s covariance matrix. We compute these principle components through the singular value decomposition of the count matrix.

When testing for the differential expression between groups, DESeq2 uses normalization on the original count matrix as described previously. For the visualization of the differences between the two groups, it is often preferable to work with different transformations of the count matrix. Prior to the application of PCA, we perform a variance stabilizing transformation to the original count data (Tibshirani 1988, Anders and Huber 2010). After the application of a variance stabilizing transformation, the PCA results are computed using the R function prcomp.

In the PCA plot above, sample groups are differentiated by color. The group and sample separation are visualized using the first principle component (PC1), which accounts for the highest amount of variation, and the second principal component (PC2), which accounts for the second highest amount of the original data variation. Each point represents a single sample, and the sample label is printed next to each plot point. In the above example, we can see clear separation between the two sample groups along PC1, which is accounting for a high proportion of the overall data variance. The Control group is clustering closer together than the Case (TGF) group. This is an indication of higher between-sample variance in the TGF group.

For our primary PCA plot, we include the 1,000 genes with the highest variance. In general, these genes are contributing the most to the differences between samples and between groups. This number can be changed to any number of genes up to the total number of genes in the count matrix. We also include a number of secondary PCA plots, which include magnitude vectors associated with genes for different numbers of genes. These PCA plots are helpful to visualize the impact of certain genes on the separation of samples and sample groups, but are used as often in publication.

As an output in the PCA plot folder, we also include the spreadsheet of principle components associated with each sample. The information in this spreadsheet is used to generate the PCA plot.

Return to Table of Contents

Volcano Plot

A Volcano Plot is an efficient way to visualize changes in expression values and differentially expressed genes in the two groups being compared in the analysis. We use volcano plots to quickly identify the most meaningful differences between the two sample groups. Volcano plots use the log₂ fold-change and the corrected or adjusted p-value (q-value) to plot each gene in two dimensions. The reference group (typically the control group) is important to consider when referring to the log fold-changes. The log₂ fold-change values are relative to the reference group. These values reflect the change in gene expression of the non-reference group over the reference group. If the reference group is flipped, all the log fold-change values simply change to the opposite sign (e.g., from negative to positive.)

In the above volcano plot, the two groups being compared in this analysis are shown in the main title, and the subtitle displays the reference group. In the legend located at the top of the plot, different colors distinguish between the non-significant genes (grey), genes with a significant q-value but a small log fold-change value (blue), genes that were not significantly different but with a larger log fold-change value (green), and genes that were significantly different and had a high log fold-change value (red). In our volcano plots, we attempt to label as many of the red points as possible, but as the number of highly significant genes increases this becomes difficult.

The log fold-change values are given as log base 2 (binary logarithm) values to improve visualization. A gene with a log₂ value of 1 is twice as expressed in the non-reference group when compared to the reference group.

The horizontal dashed line in the above plot represents a q-value of 0.05. The p-values have been converted to negative log₁₀ (common logarithm) values. This is simply to improve the visualization of the plot. As a results, all genes above the dashed line were significantly different between the two groups being compared. It is important to note that the designation of a q-value of 0.05 as statistically significant and a log₂ fold-change or greater than 1 or less than -1 are commonly chosen values, but can be changed according to the information researchers wish to display in their own volcano plots relative to their own experiments.

Return to Table of Contents

Heatmap

Another common visualization of differential gene expression analysis is the Heatmap. The heatmap is closely related to the volcano plot and provides a visual reference to the differences in gene expression between the two groups being compared. By default, we include the top 40 genes sorted by the lowest adjusted p-values. This is in order to maintain readability and provide a closer visual inspection of the 40 more differentially expressed genes. As outlined previously in PCA analysis and visualization, we apply a variance stabilizing transformation to the original count matrix (Tibshirani 1988, Anders and Huber 2010) prior to generating the heatmap.

We generate heatmaps using the R function pheatmap. The colors of the heatmap are relative to the overall expression values of the entire count matrix after the application of a variance stabilizing transformation. A strong red color indicates a high increase in a gene’s expression level, whereas a strong or dark blue color indicates a high decrease in a gene’s expression level.

To improve the initial visualization, we do not perform clustering on the subjects. Instead, subjects are grouped by the analysis categories that are being examined. The function pheatmap can perform clustering on the rows (genes) of the count matrix. This method of clustering requires examination of a hierarchical dendrogram and the determination of a proper cutoff which yields a desired number of clusters. This is available upon request, and will require some collaboration between the statistical analyst and the researchers to determine the desired or appropriate number of clusters.

Return to Table of Contents

Gene Ontology and Pathway Enrichment Analysis

We perform gene set enrichment analysis (GSEA) and pathway enrichment analysis, also known as over representation analysis (ORA), using the R package clusterProfiler. We limit the size of the pathways examined to between 10 and 500 genes, to limit the number of extremely large or extremely small pathways that are often significant but lack biological interest.

Over Representation Analysis

Over representation analysis is a widely known and used method of pathway analysis to determine if a list of statistically significant genes from a differential expression analysis between two groups are over-represented in a gene ontology pathway. The p-value to determine if this pathway is over-represented or enriched is calculated using the hypergeometric distribution. This corresponds to a one-sided version of the Fisher’s exact test, where the number of differentially expressed genes inside a pathway is significantly larger than what we would expect to see by pure random chance. The Fisher’s Exact test compares the actual number of differentially expressed genes in a pathway with the number that would be expected by random chance. A significant p-value associated with this test demonstrates a significant deviation from the null hypothesis that the differentially expressed genes would appear in the GO pathway purely by random chance. FDR correction is applied to all resulting p-values of pathway enrichment testing. To improve the results of pathway enrichment tests, we remove all non-expressed genes from the background list of genes (the entire set of genes), following the results of [1],[2]. This background set of genes can be further trimmed to include genes that would only be reasonably expressed given the certain tissue being examined.

ID: Gene Ontology pathway identifier.
Description: Common description of pathway.
GeneRatio: The number of genes located in the pathway over the total number of genes submitted for analysis.
BgRatio: The total size of the pathway over the total number of genes annotated for this type of pathway analysis.
pvalue: The resulting p-value from the application of Fisher’s Exact test.
qvalue: Multiple testing correction FDR value.
geneID: The list of statistically significant genes located inside the pathway from the submitted genes for pathway analysis.

The above plot is called a dotplot and demonstrates the top 20 over-represented pathways from the pathway enrichment analysis. The color of the dots represent the significance level of the pathway enrichment, and the pathways are ordered by the number of differentially expressed genes inside the pathway. GeneRatio, the x-axis, shows the percentage of enriched genes in the pathway.

The Upset plot allows us to visualize the overlap between the top 20 most enrichment pathways. The Upset plot is more useful than the Venn Diagram when we are looking at more than 3 sets. Each set is represented by the dots across the bottom, and the number of overlapping genes between the top sets are represented by the bars of the bar graph across the top. The dots and links between the dots show us commonality between the top pathways examined.

Gene Set Enrichment Analysis

In gene set enrichment analysis, all genes from the comparison are examined and ranked based on the size of log fold change values are considered. This allows the pathway analysis tool to determine if all the genes in a given set are changing in a coordinated way regardless of their significance level. The calculation of an enrichment score is main component of GSEA. This score represents to degree that a set of genes is over-represented at the top of the bottom of a gene list that has been ranked by log fold change. The significance level of the enrichment score (p-value) is determined using a permutation test. The p-values are then adjusted for multiple hypothesis testing.

ID: Gene Ontology pathway identifier.
Description: Common description of pathway.
SetSize: The number of genes allocated to a given set.
enrichmentScore: The enrichment score of the ranked genes in the set.
NES: Normalized enrichment score.
pvalue: Permutation test p-value.
qvalue: Multiple testing correction FDR value.
leading_edge: Percentage of genes contributing to the enrichment score.
core_enrichment: Genes inside examined pathway.

The GSEA dotplot allows us to quickly displayed the most significant pathways from the GSEA, and demonstrates whether the pathways are activated or suppressed.

The Ridgeplot demonstrates the distribution of genes in the top pathways based on their log fold change values and their enrichment scores. This plot can demonstrate how the genes are behaving over the length of the pathway.

Return to Table of Contents

Example Methods Sections for Publication

Given below are examples that we hope will help you in writing your manuscripts using the previously outlined methodology.

Sequencing, Alignment, and Quantification

Sequences are quality-checked using FastQC for completeness, depth and read quality. Sequences are aligned to the HG38 reference genome using STAR aligner [1]. Gene quantification is done using HTSeq-count [2]. Data are trimmed using Trimmomatic to remove adapter contamination [3].

Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics (2013). doi:10.1093/bioinformatics/bts635
Anders, S., Pyl, P. T. & Huber, W. HTSeq-A Python framework to work with highthroughput sequencing data. Bioinformatics (2015). doi:10.1093/bioinformatics/btu638
Anthony M. Bolger, Marc Lohse, Bjoern Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data, Bioinformatics, Volume 30, Issue 15, August 2014, Pages 2114–2120, https://doi.org/10.1093/bioinformatics/btu170

Differential Expression

DESeq2 is used determine differentially expressed genes between at least two experimental groups [3–5]. DESeq2 assumes that the gene expression count table follows a negative binomial distribution and implements the Wald Test for differential gene expression analysis. Genes with low counts are filtered by mean normalized counts in DESeq2. Raw p-values are transformed using the Benjamini-Hochberg correction to account for multiple hypothesis testing. Genes considered significantly differentially expressed are those with adjusted p-values, also called False Discovery Rates (or FDR), less than a pre-determined threshold (typically 0.05). Graphs are generated using package-specific recommended data transformations.

Robinson, M., McCarthy, D. & Smyth, G. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, (2010).
Love, M. I., Anders, S. & Huber, W. Differential analysis of count data - the DESeq2 package. Genome Biology (2014). doi:110.1186/s13059-014-0550-8

Return to Table of Contents

Conclusion

If you have any questions about your analysis results, or would like to us to tailor our approach to your specific analysis goals, please feel free to contact us.

Note: The contents of this document are subject to change. We strive to provide the most up-to-date analysis results to our clients, and amend or alter our analysis outputs and methodology to maintain current best practices.

Bulk RNAseq Methodology and Analysis Guide

Henry Claussen, PhD, Emory Integrated Computational Core (EICC)

Last compiled 10 May, 2023

Emory Integrated Computational Core

Required Files for Bulk RNA-seq Analysis

Deliverables

Quality Control and Alignment of FASTQ Files

Bulk RNA-seq Differential Gene Expression Analysis

The Basics of DESeq2 Analysis

Pre-filtering Low Count Genes

Normalization within DESeq2

P-values and Adjusted p-values (q-values)

Family-wise Error Rate (FWER) Corrections

Effect Size Estimation

Differential Gene Expression Analysis Outputs

One Drive Folder

Full Analysis Results

Principal Component Analysis

Volcano Plot

Heatmap

Gene Ontology and Pathway Enrichment Analysis

Over Representation Analysis

Gene Set Enrichment Analysis

Example Methods Sections for Publication

Sequencing, Alignment, and Quantification

Differential Expression

Conclusion