Required Files

Raw Data Files: FASTQ or BCL
Sample sheet or metadata sheet, preferably in CSV format, containing:
- Sample_IDs and/or FASTQ_IDs to match samples with FASTQ files
- if the raw data are in BCL format, the sample index used in library construction (e.g., SI-TT-D9) or the index sequence
- reference organism
- optionally, any important experimental information such as treatment, age, sex, etc.

SampleID	Index	RefOrg	Treatment
DF8_M4AV_A1_SN_5GEX_L_175315_S1	SI-TT-D9	Homo_sapiens	Control
DF3_M4PYR_A1_SN_5GEX_L_175317_S3	SI-TT-D9	Mus_musculus	DrugA
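As an illustration only, a sample sheet like the one above could be checked before submission with a short Python script. The column names follow the example table; the function name `validate_sample_sheet`, the choice of required columns, and the emptied Index cell in the second row are assumptions made for this sketch and are not part of our pipeline.

```python
import csv
import io

# Assumed minimum set of columns, taken from the example table above.
REQUIRED_COLUMNS = ("SampleID", "RefOrg")


def validate_sample_sheet(text, raw_format="FASTQ", delimiter=","):
    """Return a list of problems found in a sample sheet (empty list = OK)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = reader.fieldnames or []
    required = list(REQUIRED_COLUMNS)
    if raw_format.upper() == "BCL":
        required.append("Index")  # the index is needed only for BCL input
    problems = [f"missing column: {c}" for c in required if c not in columns]
    if problems:
        return problems
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in required:
            if not (row.get(col) or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        sample_id = (row.get("SampleID") or "").strip()
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate SampleID {sample_id}")
        seen.add(sample_id)
    return problems


# Example sheet; the Index of the second sample is left empty on purpose.
sheet = (
    "SampleID,Index,RefOrg,Treatment\n"
    "DF8_M4AV_A1_SN_5GEX_L_175315_S1,SI-TT-D9,Homo_sapiens,Control\n"
    "DF3_M4PYR_A1_SN_5GEX_L_175317_S3,,Mus_musculus,DrugA\n"
)
print(validate_sample_sheet(sheet, raw_format="FASTQ"))  # []
print(validate_sample_sheet(sheet, raw_format="BCL"))    # ['line 3: empty Index']
```

For a tab-separated sheet, pass `delimiter="\t"`.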

If cell clusters should be assigned to cell types, also include a spreadsheet with cell-type-specific genes, also known as cell type markers.

Cell Type	Cell Type Specific Markers
Microglia	IBA1, CD11B, CX3CR1, TREM2
Astrocytes	GFAP, S100B, ALDH1L1, AQP4
Oligodendrocytes	OLIG2, MBP, MOG, CNP

Deliverables

The analysis results will be uploaded and shared via a custom OneDrive folder and will include:

A PowerPoint presentation with an overview of the workflow and analysis results.
Plots:
- QC: Violin and Scatter plots of feature (gene), UMI (transcript), and mitochondrial gene counts
- Clustering Evaluation: Clustree, Elbow, and JackStraw plots
- Dimensional Reduction: UMAP and tSNE plots
- Any other plots agreed upon during the preliminary project discussion
Tables: Differential gene expression results
Data:
- Processed Seurat object in the R-importable RDS format
- CLOUPE files generated by CellRanger for 10x Genomics Loupe Browser
- HTML web summary files created by CellRanger

Overview of the Workflow

The initial raw data processing is done with a data-specific tool, such as CellRanger for Chromium single-cell data produced with 10x Genomics technology.

The secondary analysis of the feature-barcode matrices generated by CellRanger (or an analogous tool) is performed using the Seurat package.

Seurat workflow

QC and Filtering: Feature-barcode matrices are converted to Seurat objects, and features present in fewer than 3 cells are removed to reduce noise. An additional cell-filtering step is required to remove dead and/or low-quality cells before clustering and subsequent analysis. The important metrics to consider are:

UMI and Feature counts per cell: Barcodes (representing cells) with unusually low UMI and/or feature counts may represent droplets containing ambient RNAs but not actual cells. Conversely, barcodes with unusually high feature and/or UMI counts may indicate cell multiplets, which occur when two or more cells are accidentally captured together in a single droplet during the cell isolation and library preparation steps.
Ratio of mitochondrial counts: A high fraction of mitochondrial transcripts can indicate an unhealthy cell state and/or cell damage that released cytoplasmic RNAs from the cell while the mitochondrial RNAs were retained.
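The filtering logic described above can be sketched in plain Python on a toy count matrix. This is a conceptual illustration, not Seurat code: the function name `qc_filter`, the toy counts, and the cell-level thresholds are assumptions chosen to fit the tiny example (real thresholds are chosen per dataset from the QC plots). The "MT-" prefix matches human mitochondrial gene symbols; mouse symbols use "mt-".

```python
def qc_filter(counts, genes, cells, min_cells=3, min_features=3,
              max_features=10, max_percent_mt=10.0, mt_prefix="MT-"):
    """Filter a genes x cells count matrix; counts[g][c] is a UMI count."""
    # Step 1: drop features detected in fewer than min_cells cells.
    kept_genes = [g for g, row in zip(genes, counts)
                  if sum(1 for v in row if v > 0) >= min_cells]
    rows = dict(zip(genes, counts))
    metrics, kept_cells = {}, []
    for c, cell in enumerate(cells):
        column = {g: rows[g][c] for g in kept_genes}
        n_umi = sum(column.values())                            # UMIs per cell
        n_features = sum(1 for v in column.values() if v > 0)   # genes per cell
        mt = sum(v for g, v in column.items() if g.startswith(mt_prefix))
        percent_mt = 100.0 * mt / n_umi if n_umi else 0.0
        metrics[cell] = (n_umi, n_features, percent_mt)
        # Step 2: keep cells inside the feature range and below the mito cutoff.
        if (min_features <= n_features <= max_features
                and percent_mt <= max_percent_mt):
            kept_cells.append(cell)
    return kept_genes, kept_cells, metrics


# Toy data (hypothetical): RARE1 is a made-up gene seen in a single cell.
genes = ["MT-CO1", "GAPDH", "ACTB", "CD3E", "RARE1"]
cells = ["c1", "c2", "c3", "c4", "c5"]
counts = [
    [1, 9, 0, 1, 2],   # MT-CO1
    [5, 1, 0, 4, 6],   # GAPDH
    [4, 0, 1, 5, 3],   # ACTB
    [2, 0, 0, 3, 1],   # CD3E
    [0, 0, 0, 1, 0],   # RARE1
]
kept_genes, kept_cells, metrics = qc_filter(counts, genes, cells)
print(kept_genes)   # RARE1 is removed
print(kept_cells)   # only c1 and c4 pass
```

In this toy example c3 fails on feature count, c5 on mitochondrial percentage, and c2 on both.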

Normalization and Variance Stabilization: We use the SCTransform method for normalization and variance stabilization of molecular count data. SCTransform was introduced as an improved alternative to log-transform normalization for removing unwanted sources of technical variation, such as differences in sequencing depth.

In essence, the SCTransform method constructs a generalized linear model (GLM) for each gene, with UMI counts as the response variable and sequencing depth as the explanatory variable. Information is pooled across genes with similar expression levels to regularize the parameter estimates. The resulting residuals are effectively normalized data values that are no longer correlated with sequencing depth. In this way, SCTransform improves the quality and comparability of gene expression data.
If the experimental dataset consists of multiple samples, the SCTransform normalization is applied to each sample separately before the subsequent integration step.
Alternatively, the standard Seurat normalization and scaling workflow, NormalizeData() -> FindVariableFeatures() -> ScaleData(), can be applied to your dataset upon request.
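To make the two normalization ideas concrete, here is a minimal Python sketch of the arithmetic behind each. It is not the Seurat or sctransform code. `log_normalize` mirrors the default log-normalization formula (counts scaled to 10,000 per cell, then log1p), and `nb_pearson_residual` computes a Pearson residual under a negative binomial GLM with a log link and log10 sequencing depth as the covariate. The intercept, slope, and theta values below are hypothetical placeholders; in SCTransform these parameters are fitted per gene and then regularized across genes.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Log-normalize one cell: log1p(count / total_counts * scale_factor)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def nb_pearson_residual(count, depth, intercept, slope, theta):
    """Pearson residual of one UMI count under a negative binomial GLM.

    Expected count: mu = exp(intercept + slope * log10(depth))
    Variance:       mu + mu^2 / theta
    """
    mu = math.exp(intercept + slope * math.log10(depth))
    return (count - mu) / math.sqrt(mu + mu * mu / theta)


# Hypothetical parameters, chosen so that a cell with 1,000 UMIs
# has an expected count of exactly 2 for this gene.
slope = 1.5
intercept = math.log(2) - slope * 3   # log10(1000) == 3
theta = 10.0

print(log_normalize([1, 0, 3]))
print(nb_pearson_residual(2, 1000, intercept, slope, theta))  # observed == expected
print(nb_pearson_residual(5, 1000, intercept, slope, theta))  # above expectation
```

A count equal to its depth-adjusted expectation gives a residual near zero, so the residuals measure deviation from what sequencing depth alone would predict.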

Data Integration Across Samples/Conditions: Seurat performs a series of steps to harmonize and integrate the individual datasets. It aligns the samples based on shared biological variation, identifies sources of unwanted variability, and enables the identification of biological signals that are consistent across samples. This approach involves the identification and correction of batch effects, normalization of expression values, and integration of the datasets into a unified analysis.

The data integration method in Seurat is important for overcoming batch effects, improving statistical power, enhancing cell type identification, and enabling meaningful cross-sample comparisons. It enables robust and comprehensive analyses of scRNA-seq data from diverse sources, leading to more accurate and interpretable results.

Selecting meaningful principal components (PCs): To overcome the substantial technical noise in any individual feature of scRNA-seq data, Seurat clusters cells based on their Principal Component Analysis (PCA) scores. Each PC can be seen as a ‘metafeature’ that combines information from a correlated set of features. Consequently, the leading principal components provide a reliable compression of the dataset.

JackStraw analysis: JackStraw is a statistical method used to assess the significance of each principal component (PC) in capturing the underlying biological variation. It evaluates whether the observed structure in the data is statistically significant or can be attributed to noise. PCs with low p-values are considered important and capture meaningful biological variation. The JackStraw analysis can only be used with the standard Seurat normalization and scaling workflow.
ElbowPlot: ElbowPlot visualizes the standard deviation of each PC, ranked from largest to smallest, and helps determine the number of PCs to retain for downstream analyses. The plot shows the standard deviation on the y-axis and the PC number on the x-axis. Because the standard deviation reflects the variance each PC explains, the “elbow” point, where the curve flattens, marks a reasonable number of PCs to retain: these PCs capture a substantial portion of the variance, while later PCs mostly add noise.
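The elbow is usually judged by eye, but the idea can be illustrated with a simple heuristic. The Python sketch below starts from per-PC standard deviations (assumed to be available as a plain list), converts them to percent of variance among the computed PCs, and returns the PC that follows the last drop larger than a threshold. The function names, the 0.1 percentage-point threshold, and the example values are assumptions for illustration; this is one possible heuristic, not a Seurat function.

```python
def percent_variance(stdevs):
    """Percent of variance per PC, relative to the PCs supplied."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    return [100.0 * v / total for v in variances]


def elbow_pc(stdevs, min_drop=0.1):
    """Number of PCs to keep: the PC right after the last 'large' drop.

    A drop is the decrease in percent variance between consecutive PCs;
    it counts as large when it exceeds min_drop percentage points.
    """
    pct = percent_variance(stdevs)
    drops = [pct[i] - pct[i + 1] for i in range(len(pct) - 1)]
    large = [i for i, d in enumerate(drops) if d > min_drop]
    # drops[i] lies between PC i+1 and PC i+2 (1-based), hence the +2.
    return large[-1] + 2 if large else 1


# Hypothetical standard deviations: three strong PCs, then a flat tail.
stdevs = [10.0, 6.0, 3.0, 1.0, 0.98, 0.97]
print([round(p, 2) for p in percent_variance(stdevs)])
print(elbow_pc(stdevs))
```

Any automatic choice like this should still be checked against the plot and, where applicable, the JackStraw results.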

Clustering: The clustering algorithm used by Seurat is based on shared nearest neighbor (SNN) modularity optimization, which helps identify groups of cells that share similar characteristics. One of the important considerations at this stage is the choice of an optimal clustering resolution, which affects the number and size of the resulting cell clusters. Higher resolutions tend to result in smaller, more fine-grained clusters, while lower resolutions tend to merge similar cell populations into larger clusters.

We determine the optimal clustering resolution by considering the study design information you provide, such as the expected number of unique cell types, and by evaluating the clustering tree generated by the clustree package, which is built from clustering results obtained at resolutions ranging from 0 to 1.4.
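For readers unfamiliar with shared nearest neighbor graphs, the following Python sketch shows the basic construction on a handful of 2D points: find each point's k nearest neighbors, weight each pair of points by the Jaccard overlap of their neighbor sets, and drop weak edges. It is a conceptual illustration only. The function names, the toy coordinates, and the choices of k and pruning cutoff are assumptions, and the modularity optimization step that actually forms clusters at a given resolution is not shown.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest points (itself included)."""
    sets = []
    for p in points:
        order = sorted(range(len(points)),
                       key=lambda j: (math.dist(p, points[j]), j))
        sets.append(set(order[:k]))
    return sets


def snn_edges(points, k=3, prune=1 / 15):
    """Edges (i, j) -> Jaccard overlap of neighbor sets, keeping overlaps > prune."""
    nn = knn_sets(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            jaccard = len(nn[i] & nn[j]) / len(nn[i] | nn[j])
            if jaccard > prune:
                edges[(i, j)] = jaccard
    return edges


# Two well-separated toy groups of three points each (hypothetical coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for edge, weight in sorted(snn_edges(points).items()):
    print(edge, weight)
```

In this toy case every edge connects points within the same group, which is the structure a community detection algorithm then turns into clusters.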

Dimensionality Reduction: Dimensionality reduction refers to the process of reducing the high-dimensional gene expression data to a lower-dimensional space while retaining the most important features or patterns. This reduction allows for visualization and analysis of the data in a more manageable and interpretable form.

Seurat offers several dimensional reduction techniques, including Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP).
It is important to note that any dimensionality reduction technique involves some level of approximation. Researchers should interpret UMAP and tSNE plots with the understanding that while the distances are informative, they may not perfectly reflect the true distances in the original high-dimensional space.
t-SNE better preserves local structure and can reveal finer details of relationships between cells that are located close to each other in high-dimensional space.
UMAP is designed to preserve both local and global structure in the data, providing a more faithful representation of the relationships between cells. It aims to create a low-dimensional representation where similar cells are located close to each other while maintaining the overall structure of the data.

Assigning cell types to clusters: Marker genes are used to annotate the clusters with specific cell types. This step involves comparing the marker gene expression patterns of each cluster to known gene expression signatures of different cell types.

Since we are not experts on all cell types and all systems, it is highly recommended that you provide us with cell type-specific markers.
Differences in normalized expression of marker genes among clusters can be visualized with violin plots and dot plots.
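The comparison of clusters against a marker list can be sketched as a simple scoring exercise. In the Python example below, each cluster is scored by the mean of its average expression over the markers of each cell type, and the best-scoring type is assigned if it passes a minimum score. The marker lists come from the example table in Required Files; the function names, the per-cluster average expression values, and the `min_score` cutoff are hypothetical, and in practice annotation also relies on visual inspection of the marker plots.

```python
MARKERS = {
    "Microglia": ["IBA1", "CD11B", "CX3CR1", "TREM2"],
    "Astrocytes": ["GFAP", "S100B", "ALDH1L1", "AQP4"],
    "Oligodendrocytes": ["OLIG2", "MBP", "MOG", "CNP"],
}


def score_cluster(avg_expr, marker_genes):
    """Mean average-expression over the markers that are present in the data."""
    found = [avg_expr[g] for g in marker_genes if g in avg_expr]
    return sum(found) / len(found) if found else 0.0


def annotate_clusters(cluster_avg, markers, min_score=0.5):
    """Assign each cluster the best-scoring cell type, or 'Unassigned'."""
    labels = {}
    for cluster, avg_expr in cluster_avg.items():
        scores = {ct: score_cluster(avg_expr, genes)
                  for ct, genes in markers.items()}
        best = max(scores, key=scores.get)
        labels[cluster] = best if scores[best] >= min_score else "Unassigned"
    return labels


# Hypothetical per-cluster average normalized expression values.
cluster_avg = {
    "0": {"GFAP": 2.1, "AQP4": 1.8, "S100B": 1.5, "ALDH1L1": 1.2,
          "MBP": 0.1, "CX3CR1": 0.0},
    "1": {"CX3CR1": 2.4, "TREM2": 1.9, "IBA1": 1.7, "GFAP": 0.2},
    "2": {"GFAP": 0.1, "MBP": 0.2, "TREM2": 0.1},
}
print(annotate_clusters(cluster_avg, MARKERS))
```

Cluster "2" stays unassigned because none of its marker scores reaches the cutoff, which is the kind of cluster where additional markers from you are most helpful.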

Differential Expression (DE) Analysis: Seurat supports multiple DE tests:
- “wilcox” : Wilcoxon rank sum test
- “bimod” : Likelihood-ratio test for single cell feature expression
- “roc” : Standard AUC classifier
- “t” : Student’s t-test
- “poisson” : Likelihood ratio test assuming an underlying Poisson distribution. Use only for UMI-based datasets
- “negbinom” : Likelihood ratio test assuming an underlying negative binomial distribution. Use only for UMI-based datasets
- “LR” : Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.
- “MAST” : GLM framework that treats cellular detection rate as a covariate
- “DESeq2” : DE based on a model using the negative binomial distribution

By default, we utilize the Wilcoxon rank-sum test for differential expression analysis due to its non-parametric nature. Unlike parametric tests, the Wilcoxon rank-sum test does not make strong assumptions about the distribution of the data, making it suitable for situations where data distributional assumptions are not met.
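To show what the rank-based test does, here is a minimal Python sketch of a two-sided Wilcoxon rank-sum test for one gene between two groups of cells. It uses average ranks for ties and the normal approximation with a tie-corrected variance, without a continuity correction. This is an illustration of the statistic, not the implementation Seurat calls, so p-values may differ slightly from Seurat output; the function names and expression values are hypothetical.

```python
import math


def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over groups of tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:          # all values identical: no evidence of a shift
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Hypothetical normalized expression of one gene in two groups of cells.
group_1 = [3.1, 2.8, 3.5, 2.9, 3.3]
group_2 = [0.2, 0.0, 0.4, 0.1, 0.3]
u_stat, p_val = rank_sum_test(group_1, group_2)
print(u_stat, p_val)
```

Because only the ranks enter the statistic, the result is unchanged by any monotonic transformation of the expression values, which is why no distributional assumption is needed.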

Gene	p_val	avg_log2FC	pct.1	pct.2	p_val_adj
ISG15	1.505693e-19	3.5088882	0.998	0.229	2.003475e-9
IFIT3	4.128835e-14	2.5684404	0.961	0.052	5.493827e-8
IFI6	2.476e-13	2.4602058	0.965	0.076	3.299190e-10
ISG20	9.2634e-15	2.5618077	1.000	0.666	1.851e-3

A Volcano plot is a commonly used visualization of differential expression (DE) results in genomics and transcriptomics studies. It shows the statistical significance and the fold change of each gene between two experimental conditions or groups. The x-axis represents the log2 fold change (effect size) in gene expression between the two conditions, and the y-axis represents the negative log10-transformed q-value (false discovery rate adjusted p-value) of each gene.
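The coordinates of a Volcano plot are straightforward to compute from a DE table. The Python sketch below turns (gene, log2 fold change, q-value) rows into x/y coordinates and labels each gene as up, down, or not significant. The function name, the cutoffs (|log2FC| of at least 1 and q below 0.05), the floor applied to q-values of zero, and the placeholder gene names and values are all assumptions for illustration; actual cutoffs are agreed on per project.

```python
import math


def volcano_points(results, fc_cutoff=1.0, q_cutoff=0.05, min_q=1e-300):
    """Map (gene, log2FC, q) rows to (gene, x, y, status) for plotting."""
    points = []
    for gene, log2fc, q in results:
        # q-values reported as 0 are floored so that -log10 stays finite.
        y = -math.log10(max(q, min_q))
        if q < q_cutoff and log2fc >= fc_cutoff:
            status = "up"
        elif q < q_cutoff and log2fc <= -fc_cutoff:
            status = "down"
        else:
            status = "ns"  # not significant or effect too small
        points.append((gene, log2fc, y, status))
    return points


# Placeholder rows (hypothetical genes and values).
results = [
    ("GENE_A", 2.5, 1e-8),
    ("GENE_B", -1.7, 0.001),
    ("GENE_C", 0.3, 1e-4),   # significant but small effect
    ("GENE_D", 3.0, 0.2),    # large effect but not significant
    ("GENE_E", 1.2, 0.0),    # q reported as zero
]
for point in volcano_points(results):
    print(point)
```

Genes in the upper left and upper right corners of the plot are those that are both statistically significant and strongly changed.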

Contact

If you have any questions or need assistance with your data analysis, please feel free to contact EICC at eicc@emory.edu or reach out to me directly at sergei.bombin@emory.edu

Bioinformatics Methods for scRNA-seq Analysis

Sergei Bombin, PhD

Compiled: December 28, 2023
