The Emory Integrated Computational Core (EICC), one of the Emory Integrated Core Facilities (EICF), is supported by the Emory University School of Medicine and the Georgia Clinical & Translational Science Alliance. Our mission is to provide cutting-edge computational support to Emory investigators with large “Omics” datasets.
All analysis results will be uploaded to a shared OneDrive folder which the client(s) will have access to immediately following the presentation of analysis outcomes. This shared OneDrive folder will contain the following.
In differential gene expression analysis, the basic underlying task is the analysis of count data from RNA-seq experiments for the detection of differentially expressed genes between two sample groups. Count data is provided as a table containing the number of sequence fragments that have been uniquely assigned to each gene. For more information about the methodology and algorithms used to generate the count table, please see the Methods section of the EICC website.
For general bulk RNA-seq analyses, we prefer the use of DESeq2 for differential gene expression results. The original DESeq2 publication, by Anders and Huber, introduces DESeq2 and outlines its principle concepts. For a detailed guide on the installation and implementation of DESeq2, please refer to the DESeq2 Vignette provided by the authors of the package.
DESeq2 requires count data obtained from RNA-seq or another high-throughput sequencing process. The count matrix is a matrix of integer values where typically each row i is a unique gene and each column j is the number of uniquely assigned reads. DESeq2 internally corrects for the appropriate library size, and it is not recommended that the original count matrix be transformed or normalized prior to being given as input.
Pre-filtering the count matrix for differential expression analysis is typically performed to reduce the memory size requirements of the analysis and to increase the speed of the transformation and testing results. Here at the EICC, our computational power is such that we do not need to perform pre-filtering to increase analysis speed. Strict filtering is performed within the standard DESeq2 pipeline to increase power and is automatically applied on the normalized counts prior to differential expression analysis.
DESeq2 assumes that the count matrix follows a negative binomial distribution (also known as the gamma-Poisson distribution). A mean is computed proportionally to the concentration of cDNA fragments from the genes in a given sample and then scaled by a normalization factor. For most applications, the same normalization factor can be used for all the genes in a given sample which accounts for the differences in sequencing depth between samples [1]. Gene specific normalization can be used to account for sources of technical bias, but these methods are not implemented by default.
To test the hypothesis that a gene is differentially expressed between one or more groups with statistical significance, DESeq2 implements the Wald Test by default. In the case of differential gene expression, the Wald Test uses the precision of the log fold change value (LFC) as a weight to compute a test statistic based on the distance between the unrestricted LFC estimate and its hypothetical value under the null hypothesis. The null hypothesis for this statistical test is that there is no difference between the two groups being tested (e.g., case and control). Each gene is tested individually and a single p-value is produced as a result.
For every statistical test, there is an acceptable false positive rate in the determination of a significant difference (usually \(\alpha =\) 0.05). Due to the extremely high dimension of RNA-seq data and therefore large number of tests, a multiple comparison adjustment is recommended when determining statistical significance. By default, DESeq2 implements the Benjamini-Hochberg False Discovery Rate (FDR) multiple testing correction. The FDR is the expected ratio of the number of false positives over the total number of significant differences found across all tests being corrected for. The FDR correction retains high power at the expense of a slight increased allowance of false positives, and is therefore the most commonly used multiple hypothesis corrections for exploratory studies.
The class of Family-wise Error Rate (FWER) test corrections are more conservative than the FDR corrections typically applied in exploratory analysis results. If the goal of the analysis is to directly confirm the difference in expression between one or more groups for a specific gene or set of genes, the application of a FWER in publication is preferred. The FWER corrections typically have reduced power and a more strict control over false positive rates. We do not provide this multiple test correction as a default due to its highly conservative nature, but FWER can be applied to your analysis results upon request.
We perform effect size estimation using the R package apeglm. This package provides empirical Bayes shrinkage estimators which can be used to compute s-values (Stephens, 2017). In short, these s-values provide a confidence level in the direction (or sign) of the log base 2 fold-change value. It is recommended that the threshold be much lower for significance when using these outcomes (i.e., 0.005 instead of 0.05 for significance). We provide these values by default, but leave it up to the researchers to decide to include them in publication.
We also use the apeglm package to provide lfcShrink log2 fold-change values. The empirical Bayes shrinkage estimators (Zhu, Ibrahim, Love) use heavy-tailed prior distributions to remove noise and to help prevent extremely large differences that may appear due to technical artifacts. Often, when one sample group has an over-abundance of zeros this can lead to highly inflated fold changes that often give very high significant differences between samples. We use lfcShrink to help determine which log2 fold-change values are more in line with what we would expect biologically, and not what may often be due to technical artifacts from sample processing in the lab.
For each analysis performed, we generate a unique sub-folder inside the “Statistical Analysis” folder of the shared OneDrive. Inside this analysis folder, you will find all the generated results described below. If, for any reason, you are unable to access the files located inside these folders please contact EICC.
We provide the most complete picture of the analysis results as a spreadsheet associated with each individual analysis. We include this spreadsheet as a .csv file for ease-of-use across a variety of operating systems and software environments such a R and Python. A snapshot of this table is shown below.
In the visualization of your analysis results, we use Principal Component Analysis to reduce the high-dimensionality of the count matrix while preserving as much of the original variation of the data as possible. This allows the plotting of each sample in two dimensions and a clearer representation of the separation between the samples and sample groups. This two-dimensional plot is of the first principal component (the x or horizontal axis) and the second principal component (the y or vertical axis).
Principle components are the eigenvectors of the original data’s covariance matrix. We compute these principle components through the singular value decomposition of the count matrix.
When testing for the differential expression between groups, DESeq2 uses normalization on the original count matrix as described previously. For the visualization of the differences between the two groups, it is often preferable to work with different transformations of the count matrix. Prior to the application of PCA, we perform a variance stabilizing transformation to the original count data (Tibshirani 1988, Anders and Huber 2010). After the application of a variance stabilizing transformation, the PCA results are computed using the R function prcomp.
In the PCA plot above, sample groups are differentiated by color. The group and sample separation are visualized using the first principle component (PC1), which accounts for the highest amount of variation, and the second principal component (PC2), which accounts for the second highest amount of the original data variation. Each point represents a single sample, and the sample label is printed next to each plot point. In the above example, we can see clear separation between the two sample groups along PC1, which is accounting for a high proportion of the overall data variance. The Control group is clustering closer together than the Case (TGF) group. This is an indication of higher between-sample variance in the TGF group.
For our primary PCA plot, we include the 1,000 genes with the highest variance. In general, these genes are contributing the most to the differences between samples and between groups. This number can be changed to any number of genes up to the total number of genes in the count matrix. We also include a number of secondary PCA plots, which include magnitude vectors associated with genes for different numbers of genes. These PCA plots are helpful to visualize the impact of certain genes on the separation of samples and sample groups, but are used as often in publication.
As an output in the PCA plot folder, we also include the spreadsheet of principle components associated with each sample. The information in this spreadsheet is used to generate the PCA plot.
A Volcano Plot is an efficient way to visualize changes in expression values and differentially expressed genes in the two groups being compared in the analysis. We use volcano plots to quickly identify the most meaningful differences between the two sample groups. Volcano plots use the log2 fold-change and the corrected or adjusted p-value (q-value) to plot each gene in two dimensions. The reference group (typically the control group) is important to consider when referring to the log fold-changes. The log2 fold-change values are relative to the reference group. These values reflect the change in gene expression of the non-reference group over the reference group. If the reference group is flipped, all the log fold-change values simply change to the opposite sign (e.g., from negative to positive.)
In the above volcano plot, the two groups being compared in this analysis are shown in the main title, and the subtitle displays the reference group. In the legend located at the top of the plot, different colors distinguish between the non-significant genes (grey), genes with a significant q-value but a small log fold-change value (blue), genes that were not significantly different but with a larger log fold-change value (green), and genes that were significantly different and had a high log fold-change value (red). In our volcano plots, we attempt to label as many of the red points as possible, but as the number of highly significant genes increases this becomes difficult.
The log fold-change values are given as log base 2 (binary logarithm) values to improve visualization. A gene with a log2 value of 1 is twice as expressed in the non-reference group when compared to the reference group.
The horizontal dashed line in the above plot represents a q-value of 0.05. The p-values have been converted to negative log10 (common logarithm) values. This is simply to improve the visualization of the plot. As a results, all genes above the dashed line were significantly different between the two groups being compared. It is important to note that the designation of a q-value of 0.05 as statistically significant and a log2 fold-change or greater than 1 or less than -1 are commonly chosen values, but can be changed according to the information researchers wish to display in their own volcano plots relative to their own experiments.
Another common visualization of differential gene expression analysis is the Heatmap. The heatmap is closely related to the volcano plot and provides a visual reference to the differences in gene expression between the two groups being compared. By default, we include the top 40 genes sorted by the lowest adjusted p-values. This is in order to maintain readability and provide a closer visual inspection of the 40 more differentially expressed genes. As outlined previously in PCA analysis and visualization, we apply a variance stabilizing transformation to the original count matrix (Tibshirani 1988, Anders and Huber 2010) prior to generating the heatmap.
We generate heatmaps using the R function pheatmap. The colors of the heatmap are relative to the overall expression values of the entire count matrix after the application of a variance stabilizing transformation. A strong red color indicates a high increase in a gene’s expression level, whereas a strong or dark blue color indicates a high decrease in a gene’s expression level.
To improve the initial visualization, we do not perform clustering on the subjects. Instead, subjects are grouped by the analysis categories that are being examined. The function pheatmap can perform clustering on the rows (genes) of the count matrix. This method of clustering requires examination of a hierarchical dendrogram and the determination of a proper cutoff which yields a desired number of clusters. This is available upon request, and will require some collaboration between the statistical analyst and the researchers to determine the desired or appropriate number of clusters.
We perform gene set enrichment analysis (GSEA) and pathway enrichment analysis, also known as over representation analysis (ORA), using the R package clusterProfiler. We limit the size of the pathways examined to between 10 and 500 genes, to limit the number of extremely large or extremely small pathways that are often significant but lack biological interest.
Over representation analysis is a widely known and used method of pathway analysis to determine if a list of statistically significant genes from a differential expression analysis between two groups are over-represented in a gene ontology pathway. The p-value to determine if this pathway is over-represented or enriched is calculated using the hypergeometric distribution. This corresponds to a one-sided version of the Fisher’s exact test, where the number of differentially expressed genes inside a pathway is significantly larger than what we would expect to see by pure random chance. The Fisher’s Exact test compares the actual number of differentially expressed genes in a pathway with the number that would be expected by random chance. A significant p-value associated with this test demonstrates a significant deviation from the null hypothesis that the differentially expressed genes would appear in the GO pathway purely by random chance. FDR correction is applied to all resulting p-values of pathway enrichment testing. To improve the results of pathway enrichment tests, we remove all non-expressed genes from the background list of genes (the entire set of genes), following the results of [1],[2]. This background set of genes can be further trimmed to include genes that would only be reasonably expressed given the certain tissue being examined.
The above plot is called a dotplot and demonstrates the top 20 over-represented pathways from the pathway enrichment analysis. The color of the dots represent the significance level of the pathway enrichment, and the pathways are ordered by the number of differentially expressed genes inside the pathway. GeneRatio, the x-axis, shows the percentage of enriched genes in the pathway.
The Upset plot allows us to visualize the overlap between the top 20 most enrichment pathways. The Upset plot is more useful than the Venn Diagram when we are looking at more than 3 sets. Each set is represented by the dots across the bottom, and the number of overlapping genes between the top sets are represented by the bars of the bar graph across the top. The dots and links between the dots show us commonality between the top pathways examined.
In gene set enrichment analysis, all genes from the comparison are examined and ranked based on the size of log fold change values are considered. This allows the pathway analysis tool to determine if all the genes in a given set are changing in a coordinated way regardless of their significance level. The calculation of an enrichment score is main component of GSEA. This score represents to degree that a set of genes is over-represented at the top of the bottom of a gene list that has been ranked by log fold change. The significance level of the enrichment score (p-value) is determined using a permutation test. The p-values are then adjusted for multiple hypothesis testing.
The GSEA dotplot allows us to quickly displayed the most significant pathways from the GSEA, and demonstrates whether the pathways are activated or suppressed.
The Ridgeplot demonstrates the distribution of genes in the top pathways based on their log fold change values and their enrichment scores. This plot can demonstrate how the genes are behaving over the length of the pathway.
Given below are examples that we hope will help you in writing your manuscripts using the previously outlined methodology.
Sequences are quality-checked using FastQC for completeness, depth and read quality. Sequences are aligned to the HG38 reference genome using STAR aligner [1]. Gene quantification is done using HTSeq-count [2]. Data are trimmed using Trimmomatic to remove adapter contamination [3].
DESeq2 is used determine differentially expressed genes between at least two experimental groups [3–5]. DESeq2 assumes that the gene expression count table follows a negative binomial distribution and implements the Wald Test for differential gene expression analysis. Genes with low counts are filtered by mean normalized counts in DESeq2. Raw p-values are transformed using the Benjamini-Hochberg correction to account for multiple hypothesis testing. Genes considered significantly differentially expressed are those with adjusted p-values, also called False Discovery Rates (or FDR), less than a pre-determined threshold (typically 0.05). Graphs are generated using package-specific recommended data transformations.
If you have any questions about your analysis results, or would like to us to tailor our approach to your specific analysis goals, please feel free to contact us.
Note: The contents of this document are subject to change. We strive to provide the most up-to-date analysis results to our clients, and amend or alter our analysis outputs and methodology to maintain current best practices.