Emory Integrated Computational Core

The Emory Integrated Computational Core (EICC), one of the Emory Integrated Core Facilities (EICF), is supported by the Emory University School of Medicine and the Georgia Clinical & Translational Science Alliance. Our mission is to provide cutting-edge computational support to Emory investigators with large “Omics” datasets.

Why use 16S?

An adult human hosts an estimated 40-100 trillion bacteria in the gut, or digestive tract, considerably more than the number of human cells in the entire body. The human genome contains roughly 23,000-25,000 genes, yet host-microbiome interactions have been implicated in functions and conditions ranging from milk digestion, peanut digestion or allergy, cancer growth, gum disease, Alzheimer's disease, and mental health disorders, and beyond. Seen in that light, our internal flora matter a great deal for our overall health. One could argue that there are more bacteria in "you" than there is "you" yourself, so it is important to explore these bacteria and their role in our health.

The overall goal of a microbiome experiment is to sequence everything we can from samples collected from a body site, or from a model organism under certain conditions. This can be done in many different ways, the most popular of which are 16S rRNA gene-driven studies and shotgun metagenomics. It can be imagined as a barcoding experiment: you search for "hits" in the taxonomy and try to identify patterns in the commensal flora relative to an experimental or test state (disease, exposure to a drug or chemical, et cetera).

What is 16S?

What is the 16S rRNA gene? The 16S rRNA is a structural component of the 30S small ribosomal subunit found in prokaryotes; 5S, 16S, and 23S are the bacterial "big three" rRNAs. Although much of the 16S gene is highly conserved across domains, it contains several regions in which change or mutation is commonly observed. These are called hypervariable regions, and they are conventionally numbered V1-V9. Most sequencing kits focus on the V3/V4 region due to the ease of primer targeting and amplification. A 16S experiment is much cheaper to perform than either shotgun metagenomics or whole genome sequencing, so it came about as a way to quickly identify what is present in given clinical samples.

The 16S gene has been used as a genotyping method for identifying unknown bacteria as far back as the 1970s (Woese et al. 1990), and it is still used today to quickly sequence an individual petri dish of bacteria, or a single isolated cultured bacterium, for rapid identification. One can then BLAST the 16S sequence to confirm, down to a certain taxonomic level, what the sample is.

We amplify and target the V3/V4 regions of your sample to get a snapshot of the taxa we can identify, then put the resulting .fastq.gz files through QIIME2 as described below, followed by downstream statistics. Although 16S is powerful, we are typically studying only about 460 bp of a single gene, so the scope is limited in some ways. One can find a great deal of information about many taxa without individually designing and sequencing primers specific to a given species or genus. 16S is a quick experiment that is considerably cheaper than shotgun metagenomics. However, it is limited in statistical power and scope, both because of the small size of the region being sequenced and because the 16S gene can only differentiate certain taxa to certain taxonomic levels. For example, some species within a genus share identical 16S genes, so resolution is then limited to the genus level rather than the species level.

Requirements for 16S Analysis

  • A: Raw data in FastQ format
  • B: Forward and reverse primers used to amplify the 16S rRNA gene region of your choice
  • C: A metadata spreadsheet mapping each FastQ filename to your original sample IDs as well as the group or phenotype of interest used for statistical comparisons. This metadata spreadsheet can also include additional information that you may want to control for in your statistical analysis, such as age, race, sex, or other metrics recorded during the generation of sample data. The metadata file must include a sample-id column that corresponds to your FastQ filenames and at least one categorical data column (for example, SubjectGroup: which samples are cases and which are your controls). Example below:
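As an illustration of the required layout (all sample IDs, group labels, and extra column names here are hypothetical), a minimal tab-separated metadata file and a check of its required columns might look like:

```python
# Sketch of a QIIME2-style metadata file: one sample-id column matching the
# FastQ filenames, plus at least one categorical column for comparisons.
# Sample IDs, groups, and the Age column are made up for illustration.
import csv
import io

metadata_tsv = """sample-id\tSubjectGroup\tAge
Sample01\tCase\t34
Sample02\tCase\t41
Sample03\tControl\t29
Sample04\tControl\t38
"""

rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))

# The sample-id column is required, plus at least one categorical column
# with two or more levels to compare.
assert "sample-id" in rows[0], "metadata must include a sample-id column"
groups = {r["SubjectGroup"] for r in rows}
assert len(groups) >= 2, "need at least two groups to compare"
print(sorted(groups))  # ['Case', 'Control']
```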



Taxonomic Assignment

Taxonomic composition analysis is performed to determine which kinds of organisms are present in each sample. For all features, we assign taxonomic information by comparison against the GreenGenes (v13_8, 97% and 99% clustered OTUs) [2], GreenGenes2 (late 2022), SILVA, or Human Oral Microbiome Database (HOMD) [3] reference databases, using a naive Bayes classifier with default parameters.
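To make the naive Bayes idea concrete, here is a toy sketch of k-mer-based classification. The reference sequences and genus labels are invented for illustration; the production classifier (trained on full reference databases) is far more sophisticated, but the principle of scoring a query's k-mers against per-taxon k-mer frequencies is the same.

```python
# Toy k-mer naive Bayes taxonomy assignment. References and labels are
# hypothetical; real classifiers train on databases such as SILVA.
from collections import Counter, defaultdict
import math

def kmers(seq, k=4):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(references, k=4):
    """Accumulate k-mer counts per taxon label."""
    model = defaultdict(Counter)
    for genus, seq in references:
        model[genus].update(kmers(seq, k))
    return model

def classify(model, query, k=4):
    """Pick the taxon maximizing the smoothed log-likelihood of the query."""
    best_genus, best_score = None, -math.inf
    for genus, counts in model.items():
        total = sum(counts.values())
        vocab = len(counts) + 1  # simple add-one smoothing denominator
        score = sum(math.log((counts[km] + 1) / (total + vocab))
                    for km in kmers(query, k))
        if score > best_score:
            best_genus, best_score = genus, score
    return best_genus

refs = [("GenusA", "ACGTACGTACGTTTGACGT"),
        ("GenusB", "GGCCGGCCTTAAGGCCGGA")]
model = train(refs)
print(classify(model, "ACGTACGTTT"))  # GenusA
```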

Controls and Mock Microbial Communities

We strongly recommend including traditional negative controls, no-template control (NTC) negatives, positive controls, and Zymo mock microbial community controls in your project [4]. These controls allow us to assess the efficacy of DNA extraction, PCR, library preparation, and sequencing during the wet-lab stages of your project. Without controls, we cannot calibrate your experimental analysis parameters efficiently.

The Basics of 16S

Ribosomal RNA (rRNA) amplicon sequencing methods such as 16S are commonly used to identify the bacteria present across multiple samples (fungi are typically profiled with related markers such as the ITS region). The 16S rRNA gene is a prokaryotic gene that is approximately 1,500 bp long and has nine variable regions distributed among roughly 50 functional domains. The process of 16S rRNA gene detection has three main steps: first, acquire the DNA; second, locate and isolate the 16S rRNA gene; and finally, analyze the 16S rRNA gene sequence.

16S Data Processing

Demultiplexed raw amplicon (16S) sequences in FastQ file format will be processed using the open-source software package QIIME2 (Quantitative Insights Into Microbial Ecology 2), version 2022.11 [1]. Denoising and dereplication of your data, including chimera removal and quality-score-based trimming of reads, will be performed using the Divisive Amplicon Denoising Algorithm 2 (DADA2) module. After data cleaning, DADA2 constructs a feature table containing counts of each unique sequence variant found in the data. A feature is essentially any unit of observation, e.g., an operational taxonomic unit (OTU), an amplicon sequence variant (ASV), a gene, or a metabolite. OTUs are identified via a clustering method such as VSEARCH, and ASVs are identified via DADA2. ASVs differentiate sequences even if they vary by only one base pair, preserving distinct units that would otherwise be lost with any form of OTU clustering. As such, most researchers now use ASVs to increase resolution.
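As a schematic of what a feature table contains (this is not DADA2's algorithm, which also models and corrects sequencing errors), dereplicating a handful of hypothetical reads into exact sequence variants with per-sample counts might look like:

```python
# Schematic dereplication: identical reads collapse into one exact sequence
# variant, and per-sample counts of each variant form the feature table.
# Reads and sample names are hypothetical.
from collections import Counter

reads_by_sample = {
    "Sample01": ["ACGTT", "ACGTT", "GGCCA"],
    "Sample02": ["ACGTT", "GGCCA", "GGCCA", "TTTTA"],
}

# Count each unique sequence per sample.
feature_table = {s: Counter(reads) for s, reads in reads_by_sample.items()}
features = sorted(set().union(*feature_table.values()))  # the variants

for sample, counts in feature_table.items():
    print(sample, [counts[f] for f in features])
# Sample01 counts over ['ACGTT', 'GGCCA', 'TTTTA'] -> [2, 1, 0]
```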

16S Statistical Analysis

Phyloseq

To conduct most of our exploratory 16S analysis, we use the R package phyloseq [5]. This package incorporates the pieces from the bioinformatics pipeline (taxonomy, counts table, phylogenetic tree alignment) into a single R object, which can then be used to create exploratory graphics such as richness and diversity estimates, heat maps, and exploratory bar plots.

Diversity Metrics

We will calculate alpha diversity via the Shannon diversity index, which allows us to compare the alpha diversity of the sample groups. Alpha diversity metrics summarize the structure of an ecological community within a sample, taking into account both the number of taxonomic groups present (richness) and the distribution of abundances among them (evenness).
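As a sketch of the calculation (with made-up counts), the Shannon index is \(H' = -\sum_i p_i \ln p_i\), where \(p_i\) is the proportion of reads assigned to taxon \(i\) in a sample:

```python
# Shannon diversity of one sample from its taxon counts (counts are
# illustrative, not real data).
import math

def shannon(counts):
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

# Four equally abundant taxa give the maximum H' for 4 taxa: ln(4) ~ 1.386
print(round(shannon([10, 10, 10, 10]), 3))  # 1.386
# A sample dominated by one taxon is less even, hence less diverse
print(round(shannon([37, 1, 1, 1]), 3))
```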



We calculate beta diversity using canonical correspondence analysis (CCA) ordination on Bray-Curtis dissimilarities by default, but we can also produce plots for Jaccard dissimilarities and unweighted UniFrac distances upon request. The Bray-Curtis dissimilarity is a statistic that quantifies the difference between two samples or groups based on the counts obtained during 16S data processing.
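The Bray-Curtis calculation itself is simple; a self-contained sketch with illustrative count vectors (a real analysis would use the per-sample ASV counts):

```python
# Bray-Curtis dissimilarity between two count vectors:
# BC = sum(|x_i - y_i|) / sum(x_i + y_i); 0 = identical, 1 = no shared taxa.
def bray_curtis(x, y):
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

print(bray_curtis([5, 3, 0], [5, 3, 0]))  # 0.0  (identical samples)
print(bray_curtis([5, 3, 0], [0, 0, 9]))  # 1.0  (no shared taxa)
print(bray_curtis([6, 2, 2], [2, 2, 6]))  # 0.4  (partial overlap)
```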



Relative Abundance and Bar Plots

Prior to performing an ASV-by-ASV statistical analysis between the groups of comparison, we calculate the relative abundance of each ASV within each sample. The relative abundance is the proportion each ASV contributes to the total number of ASV reads assigned to a particular sample, so that the ASV values for each sample sum to 1.
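The transformation is straightforward; a minimal sketch with hypothetical counts for one sample:

```python
# Raw ASV counts are divided by the sample total, so the transformed
# values for the sample sum to 1. Counts are illustrative.
def relative_abundance(counts):
    total = sum(counts)
    return [c / total for c in counts]

sample_counts = [100, 50, 50]      # hypothetical ASV counts for one sample
props = relative_abundance(sample_counts)
print(props)                       # [0.5, 0.25, 0.25]
print(sum(props))                  # 1.0
```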

We then combine these ASVs at the Species, Genus, and Phylum levels to produce bar plots. At each taxonomic level, we apply a filter with a detection threshold of 1/100 and a prevalence threshold of 30/100. ASVs that do not meet these requirements are grouped as "Other", which improves the readability of the bar plots by displaying only the largest Genus or Species categories. We also produce these bar plots separated by the two or more groups being compared. The following bar plot demonstrates the relative abundance of the top phyla, separated by sample group.
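The detection/prevalence filter can be sketched as follows, using hypothetical taxa and relative abundances: a taxon is kept if its relative abundance reaches the detection threshold (1/100) in at least the prevalence fraction of samples (30/100), and is pooled into "Other" otherwise.

```python
# Detection/prevalence filter sketch; taxa and abundances are invented.
DETECTION, PREVALENCE = 1 / 100, 30 / 100

rel_abund = {  # rows = taxa, columns = samples (relative abundances)
    "Bacteroides": [0.30, 0.25, 0.40, 0.35],
    "Prevotella":  [0.02, 0.00, 0.03, 0.02],
    "RareTaxonX":  [0.005, 0.0, 0.0, 0.004],
}

def passes(values):
    # Fraction of samples where the taxon reaches the detection threshold.
    detected = sum(v >= DETECTION for v in values)
    return detected / len(values) >= PREVALENCE

kept = [t for t, v in rel_abund.items() if passes(v)]
other = [t for t in rel_abund if t not in kept]
print(kept)   # ['Bacteroides', 'Prevotella']
print(other)  # ['RareTaxonX']  -> pooled as "Other" in the bar plot
```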



Linear Decomposition Model (LDM)

To perform both global significance testing (testing for an overall difference between the two groups) and ASV-by-ASV testing, we use the LDM package [6], which implements the Linear Decomposition Model. The LDM controls for multiple testing and can incorporate multiple covariates, either continuous or discrete. It also implements permutation-based p-values that can accommodate clustered data. We apply the LDM to the relative-abundance-transformed data.
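As a schematic of the permutation idea only (the actual LDM tests all ASVs jointly and adjusts for covariates), a permutation p-value for one hypothetical ASV's relative abundance between two groups might be computed like this:

```python
# Minimal permutation test for a difference in mean relative abundance of
# one ASV between two groups. Group labels are shuffled many times and the
# observed statistic is compared to the permutation distribution.
# Data values are invented for illustration.
import random

def perm_test(group_a, group_b, n_perm=999, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # permutation p-value

cases    = [0.31, 0.28, 0.35, 0.30]  # hypothetical relative abundances
controls = [0.05, 0.08, 0.04, 0.06]
print(perm_test(cases, controls))    # small p-value: the groups clearly differ
```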

In our preliminary output, we perform the global difference test using the LDM, which produces a global p-value for the overall difference between the two sample groups being compared. We then perform ASV-by-ASV testing, which incorporates a multiple testing correction (FDR).

P-values and Adjusted p-values (q-values)

For every statistical test, there is an acceptable false positive rate for declaring a significant difference (usually \(\alpha =\) 0.05). Due to the extremely high dimensionality of 16S data, and therefore the large number of tests, a multiple comparison adjustment is recommended when determining statistical significance. By default, the LDM implements the Benjamini-Hochberg False Discovery Rate (FDR) correction. The FDR is the expected ratio of the number of false positives to the total number of significant differences found across all tests being corrected. The FDR correction retains high power at the expense of a slightly increased allowance of false positives, and is therefore the most commonly used multiple hypothesis correction for exploratory studies.
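A minimal sketch of the Benjamini-Hochberg adjustment itself, applied to a few illustrative p-values: each sorted p-value is scaled by m/rank, then monotonicity is enforced from the largest rank down so q-values never decrease as p-values grow.

```python
# Benjamini-Hochberg FDR adjustment (q-values) for a list of p-values.
def bh_adjust(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

pvals = [0.005, 0.02, 0.03, 0.04]          # illustrative p-values
print([round(v, 4) for v in bh_adjust(pvals)])  # [0.02, 0.04, 0.04, 0.04]
```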

Linear Decomposition Model (LDM) Output

The ASV-by-ASV test provides an output listing the full classification (Kingdom to Species) of each ASV tested, along with its p-value and FDR-adjusted q-value. An example of this output can be seen below.


Due to the limitations of certain classification databases, it is often not possible to classify an ASV all the way down to the Species level, resulting in an NA value.

References

[1] https://qiime2.org/
[2] https://greengenes.secondgenome.com/
[3] https://www.homd.org/
[4] https://pubmed.ncbi.nlm.nih.gov/29427429/
[5] https://joey711.github.io/phyloseq/
[6] https://github.com/yijuanhu/LDM