Methods


Whole genome and whole exome sequence analysis

Quality-checked sequencing data will be aligned to the human reference genome (hg38) using PEMapper. Variant calls will be generated using PECaller, and variants will be annotated using bystro.io. A VCF file containing the cleaned variants will be generated.

Somatic mutations from cancer (tumor/normal) data:

Raw sequence reads (WES/WGS) from the tumor and matched-normal samples will be aligned to the human reference genome (e.g., hg38) using the Burrows–Wheeler Aligner (BWA). After alignment and deduplication of reads with Picard, the binary (BAM) version of the Sequence Alignment Map (SAM) file will be sorted and indexed. The Genome Analysis Toolkit (GATK) v4.0 will then be used to perform indel realignment and base quality score recalibration. Somatic variant calling will be performed on matched tumor-normal pairs using MuTect2.

Cytogenomic/SNP Genotyping data analysis:

The widespread use of microarrays allows gene expression profiling, genotyping, mutation detection, and gene discovery throughout the genome. This document aims to provide a workflow for analysis of Infinium CytoSNP-850K v1.2 array data to identify genetic and structural variations.

Prokaryotic Whole Genome Sequencing – Assembly and Functional Annotation of Illumina Reads

Raw Illumina reads from a whole genome sequencing project will be run through an analysis pipeline that includes quality control, read quality and adapter trimming, reference-based or de novo assembly, gene prediction, and functional annotation of the genome. We can assemble a genome from pooled Illumina read libraries or single-cell reads.

ATAC-seq (bulk) data analysis:

ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput Sequencing) is a next-generation sequencing approach for the analysis of open chromatin regions to assess genome-wide chromatin accessibility. ATAC-seq achieves this by simultaneously fragmenting and tagging genomic DNA with sequencing adapters using the hyperactive Tn5 transposase enzyme. This document aims to provide a workflow for the analysis of ATAC-seq data to identify differential chromatin accessibility.

RNA-Seq (Cancer) data analysis:

Quality-filtered sequencing data will be aligned to the reference genome (e.g., hg38) using STAR (Spliced Transcripts Alignment to a Reference). Gene quantification will be done using HTSeq-count. Fusion transcripts are characteristic of cancer tumors. STAR-Fusion uses chimeric reads collected during STAR alignment for fusion RNA prediction. To reduce the number of false-positive fusion genes, fusion events with fewer than 0.1 fusion fragments per million total reads (FFPM < 0.1) and putative fusions between homologous genes will be discarded.
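The false-positive filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the `Fusion` record, its field names, and the `homologous` flag are hypothetical, and in practice these values would come from the fusion caller's output and a separate homology annotation.

```python
from dataclasses import dataclass


@dataclass
class Fusion:
    """Hypothetical record for one predicted fusion event."""
    left_gene: str
    right_gene: str
    fusion_fragments: int   # fragments supporting the fusion
    homologous: bool        # assumed annotation: partners are homologous genes


def ffpm(fusion_fragments, total_reads):
    """Fusion fragments per million total reads."""
    return fusion_fragments * 1_000_000 / total_reads


def filter_fusions(fusions, total_reads, min_ffpm=0.1):
    """Discard fusions with FFPM below min_ffpm or with homologous partners."""
    kept = []
    for f in fusions:
        if f.homologous:
            continue  # putative fusion between homologous genes
        if ffpm(f.fusion_fragments, total_reads) < min_ffpm:
            continue  # too little read support
        kept.append(f)
    return kept


# Toy input: 50 million total reads, three candidate fusions.
candidates = [
    Fusion("GENE_A", "GENE_B", fusion_fragments=40, homologous=False),
    Fusion("GENE_C", "GENE_D", fusion_fragments=2, homologous=False),
    Fusion("GENE_E", "GENE_F", fusion_fragments=60, homologous=True),
]
kept = filter_fusions(candidates, total_reads=50_000_000)
```

With these toy numbers, only the first candidate passes: the second has too few supporting fragments and the third is flagged as a homologous pair.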

RNA-Seq (Bulk) data analysis:

Quality-checked sequencing data will be aligned to the human reference genome (hg38) using STAR (Spliced Transcripts Alignment to a Reference). Gene quantification will be done using HTSeq-count. To characterize expressed genes, a pre-ranked, permutation-based gene set enrichment analysis (GSEA) will be performed.
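For readers unfamiliar with pre-ranked GSEA, the idea can be sketched as a weighted running-sum statistic with a gene-set permutation null. The code below is a simplified teaching sketch, not the GSEA software itself: the function names, the weighting exponent `p`, and the toy gene names are assumptions made for illustration.

```python
import random


def enrichment_score(ranked, gene_set, p=1.0):
    """Running-sum enrichment score.

    ranked: list of (gene, score) pairs sorted by score, highest first.
    Hits step up by |score|**p (normalised); misses step down uniformly.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    weights = [abs(s) ** p if g in gene_set else 0.0 for g, s in ranked]
    n_hits = sum(1 for g, _ in ranked if g in gene_set)
    n_miss = len(ranked) - n_hits
    total = sum(weights)
    if n_hits == 0 or n_miss == 0 or total == 0.0:
        return 0.0
    running, best = 0.0, 0.0
    for (gene, _), w in zip(ranked, weights):
        if gene in gene_set:
            running += w / total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_pvalue(ranked, gene_set, n_perm=200, seed=0, p=1.0):
    """Empirical p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    genes = [g for g, _ in ranked]
    k = sum(1 for g in genes if g in gene_set)
    observed = enrichment_score(ranked, gene_set, p)
    extreme = 0
    for _ in range(n_perm):
        null_es = enrichment_score(ranked, set(rng.sample(genes, k)), p)
        if (observed >= 0 and null_es >= observed) or (observed < 0 and null_es <= observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Toy ranked list (hypothetical gene names and scores).
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es_top = enrichment_score(ranked, {"A", "B"})
es_bottom = enrichment_score(ranked, {"E", "F"})
pval = permutation_pvalue(ranked, {"A", "B"})
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom a negative score. Real analyses also normalise scores across gene sets and correct for multiple testing, which this sketch omits.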

RNA-seq (Single Cell) data analysis:

Initial raw data processing is done with a data-specific tool such as CellRanger for Chromium single-cell data produced with 10x Genomics technology. Secondary analysis of the feature-barcode matrices generated by CellRanger (or an analogous tool) is performed using the Seurat package.

MicroRNAseq data analysis:

Reads are trimmed using Trimmomatic and CutAdapt, and quality-checked with FastQC and MultiQC. miRDeep2 and miRBase are used to align reads. Counts for known mature miRNAs and predicted miRNAs are generated on a per-sample basis.

Proteomics data analysis - Label Free Quantification

Database searches will be performed using the Andromeda search engine with the UniProt-SwissProt human canonical database as a reference, together with a database of common laboratory contaminants. Protein group LFQ (label-free quantification) intensities will be log2-transformed to reduce the effect of outliers. Because some LFQ values will be missing, missing values will be imputed before fitting the models. Two-tailed Student's t-tests will be used for statistical testing.
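The transform, impute, and test sequence above can be illustrated with a small standard-library sketch. The text does not specify the imputation method, so replacing missing values with a fixed floor value is an assumption made purely for illustration; the helper names and the example intensities are likewise hypothetical. The sketch returns the t statistic and degrees of freedom only; a two-tailed p-value would then come from the t distribution (for example via a statistics package).

```python
import math
from statistics import mean, variance


def log2_transform(intensities):
    """log2-transform LFQ intensities; missing or zero values stay None."""
    return [math.log2(x) if x is not None and x > 0 else None for x in intensities]


def impute_floor(values, floor):
    """Assumed imputation: replace missing values with a fixed floor value."""
    return [floor if v is None else v for v in values]


def student_t(a, b):
    """Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical LFQ intensities for one protein group in two conditions.
control = [1024.0, 2048.0, None, 4096.0]
treated = [8192.0, 16384.0, 32768.0, 16384.0]

control_log = impute_floor(log2_transform(control), floor=9.0)
treated_log = impute_floor(log2_transform(treated), floor=9.0)
t_stat, dof = student_t(control_log, treated_log)
```

The choice of floor value strongly affects the result when many values are missing, which is one reason imputation strategy is usually chosen per dataset.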

Amplicon (16S rRNA) data analysis:

Demultiplexed raw amplicon (16S) sequences in FASTQ format will be processed using the open-source software package QIIME2 (Quantitative Insights Into Microbial Ecology), version 2022.11. Denoising and dereplication of the data, including chimera removal and trimming of reads based on quality scores, will be performed using the Divisive Amplicon Denoising Algorithm 2 (DADA2) module. After data cleaning, a feature table containing counts of each unique sequence variant found in the data will be constructed using DADA2. A feature is essentially any unit of observation, e.g., an operational taxonomic unit (OTU), an amplicon sequence variant (ASV), a gene, or a metabolite. OTUs are identified by clustering with VSEARCH [3], and ASVs are identified with DADA2. ASVs differentiate sequences even if they vary by only one base pair, giving distinct units that would otherwise be lost with any form of OTU clustering. As a result, most recent studies use ASVs to increase resolution.
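The difference in resolution between ASVs and OTUs can be shown with a toy example. The sketch below is not DADA2 or VSEARCH: it counts exact unique sequences as stand-ins for ASVs and uses a naive greedy clustering of equal-length sequences as a stand-in for OTU clustering. The 97% identity threshold and all function names are assumptions for illustration.

```python
from collections import Counter


def exact_variant_table(reads):
    """Count each exact unique sequence (stand-in for an ASV feature table)."""
    return Counter(reads)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_table(reads, threshold=0.97):
    """Naive greedy clustering: the most abundant sequences become centroids,
    and each remaining sequence joins the first centroid it matches."""
    centroids = {}
    for seq, n in Counter(reads).most_common():
        for c in centroids:
            if len(c) == len(seq) and identity(c, seq) >= threshold:
                centroids[c] += n
                break
        else:
            centroids[seq] = n  # no close centroid found: start a new OTU
    return centroids


# Toy reads: two variants differing by one base in 100, plus an unrelated one.
variant_1 = "A" * 100
variant_2 = "A" * 99 + "C"
unrelated = "C" * 100
reads = [variant_1] * 3 + [variant_2] * 2 + [unrelated]

asvs = exact_variant_table(reads)
otus = greedy_otu_table(reads)
```

Here the exact-variant table keeps `variant_1` and `variant_2` as separate features, while the clustered table merges them into a single unit, which is the loss of resolution described above.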

Shotgun metagenomic data analysis:

To perform taxonomic profiling (at the phylum, genus, or species level) of shotgun metagenome sequencing reads, the MetaPhlAn2 pipeline will be run in a high-performance cluster computing environment or as a custom Amazon AMI. HUMAnN2 (HMP Unified Metabolic Analysis Network) uses the MetaCyc database and the UniRef gene family catalog to characterize the microbial pathways present in samples.

This service encompasses a pipeline of downstream analyses following generation of results from genomic, epigenomic, transcriptomic, proteomic, or metagenomic services. Leverage our expertise in study design and statistical methodology to provide quality hypothesis development and testing, sample size calculations, data analysis and visualizations, and actionable interpretations.

Research question and Hypothesis development

Power and sample size calculations

Clinical Data Integration and Data Visualizations

Machine Learning Analysis

A one-time, hands-on training workshop on a topic of your choice, covering statistical and technological methodologies as applied to your area of research. Read more about Methods Workshops.