Required Files
Raw Data Files: FASTQ or BCL
Sample sheet or metadata sheet, preferably in CSV format,
containing:
- Sample_IDs and/or FASTQ_IDs to match samples with FASTQ files
- If the raw data are in BCL format, include the sample index that was
used in library construction, e.g., SI-TT-D9, or the index sequence
- reference organism
- optionally, any important experimental information such as
treatment, age, sex, etc.
Sample_ID | Sample index | Reference organism | Treatment
DF8_M4AV_A1_SN_5GEX_L_175315_S1 | SI-TT-D9 | Homo_sapiens | Control
DF3_M4PYR_A1_SN_5GEX_L_175317_S3 | SI-TT-D9 | Mus_musculus | DrugA
- If cell clusters should be assigned to cell types, also include a
spreadsheet with cell type specific genes, also known as cell type
markers
Cell type | Marker genes
Microglia | IBA1, CD11B, CX3CR1, TREM2
Astrocytes | GFAP, S100B, ALDH1L1, AQP4
Oligodendrocytes | OLIG2, MBP, MOG, CNP
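A sample sheet can be sanity-checked before submission. The following is a minimal Python sketch, not part of our pipeline. The column names Sample_ID, Sample_Index, and Reference are assumptions made for this example, as is the rule that Sample_Index is only required for BCL input.

```python
import csv
import io

# Hypothetical column names; the actual sheet layout is agreed per project.
REQUIRED_COLUMNS = ["Sample_ID", "Reference"]


def validate_sample_sheet(csv_text, raw_format="FASTQ"):
    """Return a list of human-readable problems found in a CSV sample sheet."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    required = list(REQUIRED_COLUMNS)
    if raw_format.upper() == "BCL":
        # The sample index is needed to demultiplex BCL data.
        required.append("Sample_Index")
    problems = [f"missing column: {c}" for c in required if c not in columns]
    if problems:
        return problems
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in required:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        sample_id = (row["Sample_ID"] or "").strip()
        if sample_id and sample_id in seen:
            problems.append(f"line {line_no}: duplicate Sample_ID {sample_id}")
        seen.add(sample_id)
    return problems


sheet = """Sample_ID,Sample_Index,Reference,Treatment
DF8_M4AV_A1_SN_5GEX_L_175315_S1,SI-TT-D9,Homo_sapiens,Control
DF3_M4PYR_A1_SN_5GEX_L_175317_S3,SI-TT-D9,Mus_musculus,DrugA
"""
print(validate_sample_sheet(sheet, raw_format="BCL"))
```

An empty list means no problems were detected under these assumed rules.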
Deliverables
The analysis results will be uploaded and shared via a custom
OneDrive folder and will include:
A PowerPoint presentation with an overview of the workflow and
analysis results.
Plots:
- QC: Violin and scatter plots of feature (gene), UMI (transcript),
and mitochondrial gene counts
- Clustering Evaluation: Clustree, Elbow, and Jackstraw plots
- Dimensional Reduction: UMAP and t-SNE plots
- Any other plots agreed upon during a preliminary project discussion
Tables: Differential gene expression results
Data:
- Processed Seurat object in the R-importable RDS format
- CLOUPE files generated by CellRanger for 10x Genomics Loupe
Browser
- HTML web summary files created by CellRanger
Overview of the Workflow
The initial raw data processing is done with a data-specific tool, such
as CellRanger for Chromium single-cell data produced with 10x Genomics
technology.
The secondary analysis of the feature-barcode matrices generated by
CellRanger (or an analogous tool) is performed using the Seurat package.
Seurat workflow
- QC and Filtering: Feature-barcode matrices are
converted to Seurat objects, and features present in < 3 cells are
removed to reduce noise. An additional cell-filtering step is required
to remove dead and/or low-quality cells before clustering and subsequent
analysis. The important metrics to consider are:
- UMI and Feature counts per cell: Barcodes (representing cells) with
unusually low UMI and/or feature counts may represent droplets
containing ambient RNAs but not actual cells. Conversely, barcodes with
unusually high feature and/or UMI counts may indicate cell multiplets,
which occur when two or more cells are accidentally captured together in
a single droplet during the cell isolation and library preparation
steps.
- Ratio of mitochondrial counts: A high fraction of mitochondrial
transcripts can indicate an unhealthy or damaged cell, in which
cytoplasmic RNAs have leaked out of the cell while mitochondrial RNAs
remained intact.
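The metrics above can be illustrated with a small Python sketch that is independent of Seurat. The thresholds, the toy cells, and the "MT-" gene-name prefix used to recognize mitochondrial genes are assumptions for illustration only; actual cutoffs are chosen per dataset after inspecting the QC plots.

```python
def qc_metrics(counts, mito_prefix="MT-"):
    """Summarize one cell given a {gene: UMI count} mapping."""
    n_umi = sum(counts.values())
    n_features = sum(1 for n in counts.values() if n > 0)
    # Upper-casing lets the same prefix match mouse-style names such as mt-Co1.
    mito = sum(n for gene, n in counts.items()
               if gene.upper().startswith(mito_prefix))
    percent_mt = 100.0 * mito / n_umi if n_umi else 0.0
    return {"n_umi": n_umi, "n_features": n_features, "percent_mt": percent_mt}


def passes_qc(metrics, min_features=200, max_features=6000, max_percent_mt=10.0):
    """Keep cells that are neither near-empty, multiplet-like, nor mito-rich."""
    return (min_features <= metrics["n_features"] <= max_features
            and metrics["percent_mt"] <= max_percent_mt)


# Toy cells with hypothetical counts, small enough to check by hand.
cells = {
    "healthy": {"ACTB": 5, "GAPDH": 3, "CD3E": 2, "MT-CO1": 1},
    "damaged": {"ACTB": 1, "MT-CO1": 6, "MT-ND1": 3},
    "near_empty": {"ACTB": 1},
}
for name, counts in cells.items():
    metrics = qc_metrics(counts)
    keep = passes_qc(metrics, min_features=2, max_features=10, max_percent_mt=10.0)
    print(name, metrics, "keep" if keep else "drop")
```

With these toy thresholds only the first cell is kept: the second fails on mitochondrial fraction and the third on feature count.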
- Normalization and Variance Stabilization: We use the SCTransform
method to normalize and variance-stabilize molecular count data.
SCTransform was introduced as an improved alternative to log-transform
normalization for removing unwanted sources of variation.
- In essence, SCTransform fits a generalized linear model (GLM) for
each gene, with UMI counts as the response variable and sequencing depth
as the explanatory variable. Information is pooled across genes with
similar expression levels to regularize the parameter estimates. The
resulting residuals serve as normalized expression values that are no
longer correlated with sequencing depth, which improves the
comparability of gene expression data across cells.
- If the experimental dataset consists of multiple samples, the
SCTransform normalization is applied to each sample separately before
the subsequent integration step.
- Alternatively, the standard Seurat normalization and scaling
workflow, NormalizeData() -> FindVariableFeatures() ->
ScaleData(), can be applied to your dataset upon request.
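For intuition about the standard workflow, the log-normalization step can be sketched in plain Python. This assumes the commonly documented behavior of NormalizeData() with its default settings: each count is divided by the cell's total counts, multiplied by a scale factor of 10,000, and transformed with log1p. The toy cells are hypothetical, and the sketch is not Seurat's implementation.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Depth-normalize one cell's {gene: count} mapping and apply log1p."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(n / total * scale_factor)
            for gene, n in counts.items()}


# Two hypothetical cells with the same composition but a 10x depth difference.
shallow = {"GeneA": 1, "GeneB": 3}
deep = {"GeneA": 10, "GeneB": 30}
print(log_normalize(shallow))
print(log_normalize(deep))
```

Both cells yield the same normalized values, which is the point of dividing by the per-cell total before the log transform.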
- Data Integration Across Samples/Conditions: Seurat
performs a series of steps to harmonize and integrate the individual
datasets. It aligns the samples based on shared biological variation,
identifies sources of unwanted variability, and enables the
identification of biological signals that are consistent across samples.
This approach involves the identification and correction of batch
effects, normalization of expression values, and integration of the
datasets into a unified analysis.
- Data integration in Seurat is important for overcoming batch
effects, improving statistical power, enhancing cell type
identification, and enabling meaningful cross-sample comparisons. It
makes it possible to perform robust and comprehensive analyses of
scRNA-seq data from diverse sources, leading to more accurate and
interpretable results.
- Selecting meaningful principal components (PCs): In
order to address the substantial technical noise present in individual
features within scRNA-seq data, Seurat utilizes cell clustering based on
Principal Component Analysis (PCA) scores. Each PC can be seen as a
‘metafeature’ that amalgamates information from a related set of
features. Consequently, the leading principal components provide a
reliable compression of the dataset.
JackStraw analysis: JackStraw is a statistical method used to
assess the significance of each principal component (PC) in capturing
the underlying biological variation. It evaluates whether the observed
structure in the data is statistically significant or can be attributed
to noise. PCs with low p-values are considered important and capture
meaningful biological variation. The JackStraw analysis can only be used
with the standard Seurat normalization and scaling workflow.

ElbowPlot: ElbowPlot ranks the PCs by the amount of variation each
one captures. It helps determine how many PCs to retain for downstream
analyses. The plot shows the standard deviation of each PC on the y-axis
and the PC number on the x-axis. The “elbow” point, where the curve
flattens, marks a reasonable number of PCs to retain: the PCs before it
capture a substantial portion of the variance, while those after it
contribute mostly noise.
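Reading an elbow is usually done by eye, but the idea can be sketched in Python. The example below converts per-PC standard deviations into shares of variance and applies one simple drop-off rule. The standard deviations and the min_drop tolerance are hypothetical, the shares are relative to the listed PCs only, and this heuristic is an illustration rather than a rule applied by Seurat.

```python
def percent_variance(stdevs):
    """Share of variance per PC, relative to the PCs supplied."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    return [100.0 * v / total for v in variances]


def elbow_by_drop(stdevs, min_drop=0.5):
    """Number of PCs to keep: stop at the first PC whose standard deviation
    exceeds that of the next PC by less than min_drop."""
    for i in range(len(stdevs) - 1):
        if stdevs[i] - stdevs[i + 1] < min_drop:
            return i + 1
    return len(stdevs)


# Hypothetical standard deviations for the first seven PCs.
stdevs = [10.0, 6.0, 4.0, 2.0, 1.0, 1.0, 1.0]
print([round(p, 1) for p in percent_variance(stdevs)])
print(elbow_by_drop(stdevs))
```

In practice the choice is cross-checked against the plot and the downstream clustering rather than taken from a single threshold.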

- Clustering: The clustering algorithm used by Seurat
is based on shared nearest neighbor (SNN) modularity optimization, which
helps identify groups of cells that share similar characteristics. One
of the important considerations at this stage is the choice of an
optimal clustering resolution, which affects the number and size of the
resulting cell clusters. Higher resolutions tend to result in smaller,
more fine-grained clusters, while lower resolutions tend to merge
similar cell populations into larger clusters.
We determine the optimal clustering resolution by considering the
study design information you provide, such as the expected number of
unique cell types, and evaluating the clustering tree generated by the
clustree
package, which is constructed based on clustering results obtained at
various resolutions spanning from 0 to 1.4.

- Dimensionality Reduction: Dimensionality reduction
refers to the process of reducing the high-dimensional gene expression
data to a lower-dimensional space while retaining the most important
features or patterns. This reduction allows for visualization and
analysis of the data in a more manageable and interpretable form.
Seurat offers several dimensional reduction techniques including
Principal Component Analysis (PCA), t-distributed Stochastic Neighbor
Embedding (t-SNE), and Uniform Manifold Approximation and Projection
(UMAP).
It is important to note that any dimensionality reduction
technique involves some level of approximation. UMAP and t-SNE plots
should be interpreted with the understanding that, while the plotted
distances are informative, they may not perfectly reflect the true
distances in the original high-dimensional space.
t-SNE better preserves local structure and can reveal finer
details of relationships between cells that are located close to each
other in high-dimensional space.

UMAP is designed to preserve both local and global structure in
the data, providing a more faithful representation of the relationships
between cells. It aims to create a low-dimensional representation where
similar cells are located close to each other while maintaining the
overall structure of the data.

- Assigning cell types to clusters: Marker genes
are used to annotate the clusters with specific cell types. This step
involves comparing the marker gene expression patterns of each cluster
to known gene expression signatures of different cell types.
- Since we are not experts on all cell types and all systems, it is
highly recommended that you provide us with cell type-specific
markers.
- Differences in normalized expression of marker genes among clusters
can be visualized with violin plots and dot plots.
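The comparison of cluster expression with marker lists can be sketched as a simple scoring scheme in Python. The marker lists are the ones from the example spreadsheet in Required Files. The per-cluster average expression values, the mean-of-markers score, and the min_score cutoff are hypothetical choices for illustration and do not describe a fixed annotation procedure.

```python
MARKERS = {
    "Microglia": ["IBA1", "CD11B", "CX3CR1", "TREM2"],
    "Astrocytes": ["GFAP", "S100B", "ALDH1L1", "AQP4"],
    "Oligodendrocytes": ["OLIG2", "MBP", "MOG", "CNP"],
}


def score_cluster(avg_expr, markers=MARKERS):
    """Mean normalized expression of each cell type's markers in one cluster.
    Genes missing from avg_expr count as zero."""
    return {cell_type: sum(avg_expr.get(g, 0.0) for g in genes) / len(genes)
            for cell_type, genes in markers.items()}


def assign_cell_type(avg_expr, markers=MARKERS, min_score=0.5):
    """Pick the best-scoring cell type, or 'Unassigned' if no score is convincing."""
    scores = score_cluster(avg_expr, markers)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "Unassigned"


# Hypothetical average expression per cluster.
cluster_0 = {"GFAP": 3.1, "AQP4": 2.4, "S100B": 1.9, "MBP": 0.2}
cluster_1 = {"ACTB": 4.0}
print(assign_cell_type(cluster_0))
print(assign_cell_type(cluster_1))
```

Automated scores like this are a starting point; final labels are confirmed by inspecting the marker plots.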

Differential Expression (DE) Analysis: Seurat
supports multiple DE tests:
- “wilcox” : Wilcoxon rank sum test
- “bimod” : Likelihood-ratio test for single cell feature
expression
- “roc” : Standard AUC classifier
- “t” : Student’s t-test
- “poisson” : Likelihood ratio test assuming an underlying Poisson
distribution. Use only for UMI-based datasets
- “negbinom” : Likelihood ratio test assuming an underlying negative
binomial distribution. Use only for UMI-based datasets
- “LR” : Uses a logistic regression framework to determine
differentially expressed genes. Constructs a logistic regression model
predicting group membership based on each feature individually and
compares this to a null model with a likelihood ratio test.
- “MAST” : GLM framework that treats cellular detection rate as a
covariate
- “DESeq2” : DE based on a model using the negative binomial
distribution
- By default, we utilize the Wilcoxon rank-sum test for differential
expression analysis due to its non-parametric nature. Unlike parametric
tests, the Wilcoxon rank-sum test does not make strong assumptions about
the distribution of the data, making it suitable for situations where
data distributional assumptions are not met.
Gene | p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj
ISG15 | 1.505693e-19 | 3.5088882 | 0.998 | 0.229 | 2.003475e-9
IFIT3 | 4.128835e-14 | 2.5684404 | 0.961 | 0.052 | 5.493827e-8
IFI6 | 2.476e-13 | 2.4602058 | 0.965 | 0.076 | 3.299190e-10
ISG20 | 9.2634e-15 | 2.5618077 | 1.000 | 0.666 | 1.851e-3
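To make the default test concrete, here is a minimal Python sketch of a two-sided Wilcoxon rank-sum test for one gene between two groups of cells. It uses midranks for ties and a normal approximation without continuity correction, so p-values can differ slightly from those reported by statistical packages, especially for small groups. The expression values are hypothetical.

```python
import math


def rank_with_ties(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one value")
    pooled = list(x) + list(y)
    n = n1 + n2
    ranks = rank_with_ties(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of t^3 - t over groups of tied values.
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical normalized expression of one gene in two groups of cells.
treated = [5.1, 4.8, 6.0, 5.5, 4.9]
control = [0.0, 0.2, 0.0, 1.1, 0.3]
u, p = wilcoxon_rank_sum(treated, control)
print(u, p)
```

Because the test works on ranks, it is insensitive to the shape of the expression distribution, which is the property the default choice relies on.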
A Volcano plot is a commonly used visualization tool for
displaying the results of differential expression (DE) analysis in
genomics and transcriptomics studies. It provides a graphical
representation of the statistical significance (p-values) and fold
changes of genes between different experimental conditions or groups.
The main features of a Volcano plot are the x-axis, which represents the
log2 fold change (effect size) of gene expression between two
conditions, and the y-axis, which represents the negative
log10-transformed q-values (false discovery rate adjusted p-values)
associated with the differential expression of each gene.
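The axis mapping can be sketched in Python using the genes from the example DE results. It is assumed here that the third value in each table row is the log2 fold change and the last value is the adjusted p-value. The cutoffs of 1.0 for absolute log2 fold change and 0.05 for the adjusted p-value are hypothetical defaults for the illustration.

```python
import math

# (gene, log2 fold change, adjusted p-value), read from the example table
# under the column interpretation stated above.
de_results = [
    ("ISG15", 3.5088882, 2.003475e-9),
    ("IFIT3", 2.5684404, 5.493827e-8),
    ("IFI6", 2.4602058, 3.299190e-10),
    ("ISG20", 2.5618077, 1.851e-3),
]


def volcano_point(log2_fc, p_adj, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    # Clamp to avoid log10(0) when a p-value underflows to zero.
    y = -math.log10(max(p_adj, 1e-300))
    if p_adj < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif p_adj < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label


for gene, log2_fc, p_adj in de_results:
    x, y, label = volcano_point(log2_fc, p_adj)
    print(f"{gene}: x={x:.2f}, y={y:.2f}, {label}")
```

Genes that are both strongly changed and highly significant land in the upper left and upper right corners of the plot.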
