[scRNAseq paper] Single-cell RNA sequencing data analysis workflow: “Current best practices in single-cell RNA-seq analysis: a tutorial”
바이오 대표 2023. 2. 20. 08:58
Summary of the paper “Current best practices in single-cell RNA-seq analysis: a tutorial”
Summary
The number of scRNAseq analysis tools keeps growing → lack of standardization (plus dependence on a particular programming language)
The paper introduces the typical scRNAseq analysis steps as current best-practice recommendations.
- count matrix
- pre-processing
- QC, normalization, data correction, feature selection, dimensionality reduction
- cell- and gene-level downstream analysis
Pre-processing and visualization
Raw data → count matrix
- wet lab (tissue to count)
- single-cell dissociation (tissue digesting)
- single cell isolation
- doublets, multiplets, and empty droplets need to be filtered out later
- library construction
- each droplet contains chemicals to
- break down the cell membrane
- capture mRNA
- RT to cDNA
- amplification (to increase its probability of being measured)
- libraries (cellular barcoded + UMIs) pooled (multiplexed) for sequencing
- pipeline (FASTQ → QC, demultiplexing, alignment)
- count data
Raw data processing pipelines:
- Cell Ranger (or others)
- read QC, assigning reads to their cellular barcodes and mRNA molecules of origin (demultiplexing), genome alignment, quantification
- Definition of demultiplexing
- Demultiplexing in sequencing: sorting reads into separate FASTQ files for the different libraries pooled into a single sequencing run
- Separating multiple samples pooled into a single library
- Tools https://www.10xgenomics.com/resources/analysis-guides/bioinformatics-tools-for-sample-demultiplexing
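The sorting step in the definition above can be sketched in a few lines. This is a toy illustration only, not how Cell Ranger or any real demultiplexer works: the read tuples, the `SAMPLE_INDEX` mapping, and the exact-match lookup are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical sample-index sequences for two libraries pooled in one run.
SAMPLE_INDEX = {"ACGT": "sample_A", "TTAG": "sample_B"}

def demultiplex(reads):
    """Group (index_seq, read_seq) pairs by the library their index maps to."""
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index is not recognised go to an 'undetermined' bin.
        sample = SAMPLE_INDEX.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)

reads = [("ACGT", "GATTACA"), ("TTAG", "CCCGGG"),
         ("ACGT", "TTTTAAA"), ("NNNN", "ACACAC")]
groups = demultiplex(reads)
```

Real tools also have to tolerate sequencing errors in the index, which an exact dictionary lookup like this one does not.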
Resulting count matrix:
# of barcodes (cells) × # of transcripts
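To make the shape of the count matrix concrete, here is a minimal sketch that counts unique UMIs per (barcode, gene) pair. The record format, the barcode and gene names, and the helper `build_count_matrix` are hypothetical; real pipelines also correct barcode and UMI sequencing errors, which is left out here.

```python
import numpy as np

def build_count_matrix(records, barcodes, genes):
    """Count unique UMIs per (barcode, gene); rows are barcodes, columns are genes."""
    row = {b: i for i, b in enumerate(barcodes)}
    col = {g: j for j, g in enumerate(genes)}
    counts = np.zeros((len(barcodes), len(genes)), dtype=int)
    # A set collapses reads sharing barcode, UMI and gene (amplification duplicates).
    for barcode, umi, gene in set(records):
        counts[row[barcode], col[gene]] += 1
    return counts

records = [
    ("BC1", "UMI1", "GeneA"),
    ("BC1", "UMI1", "GeneA"),  # duplicate of the read above, counted once
    ("BC1", "UMI2", "GeneA"),
    ("BC2", "UMI1", "GeneB"),
]
matrix = build_count_matrix(records, ["BC1", "BC2"], ["GeneA", "GeneB"])
```

The two reads sharing `UMI1` count as one molecule, which is why amplification does not inflate the counts.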
Quality Control
= removing doublets and low-quality (dying) cells
QC covariates (main):
- the # of counts per barcode (count depth)
- the # of genes per barcode: the number of genes with at least one count
- the fraction of counts from mt genes per barcode
Considerations:
- Do not judge these covariates one at a time; consider them jointly (an unusual value may carry biological meaning).
- Start w/ permissive QC
- If the distribution of the QC covariates differs between samples, set a separate threshold for each sample
** Heterogeneous cell populations may exhibit multiple QC covariate peaks
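The three covariates above are easy to compute directly from a count matrix. A minimal NumPy sketch, assuming rows are barcodes and columns are genes; the function name `qc_covariates`, the `is_mt` mask, and the thresholds are illustrative assumptions, not values recommended by the paper.

```python
import numpy as np

def qc_covariates(counts, is_mt):
    """Return count depth, genes detected, and mt fraction for each barcode (row)."""
    counts = np.asarray(counts, dtype=float)
    count_depth = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)  # genes with at least one count
    mt_counts = counts[:, is_mt].sum(axis=1)
    # Guard against division by zero for empty barcodes.
    mt_fraction = np.divide(mt_counts, count_depth,
                            out=np.zeros_like(count_depth),
                            where=count_depth > 0)
    return count_depth, n_genes, mt_fraction

counts = np.array([[10, 0, 5, 5],
                   [1, 0, 0, 9],
                   [0, 0, 0, 0]])
is_mt = np.array([False, False, False, True])
depth, n_genes, mt_frac = qc_covariates(counts, is_mt)

# Joint filter with purely illustrative thresholds.
keep = (depth >= 5) & (n_genes >= 2) & (mt_frac <= 0.5)
```

Combining the three conditions in one mask follows the "consider jointly" advice; in practice the thresholds would be chosen per sample after looking at the distributions.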
Tools:
- Scrublet, DoubletFinder, DoubletDecon
- Demuxlet https://si.biostat.washington.edu/sites/default/files/modules/2019_SISG_6_7_JP_Multiplexing
Normalization
[1] GOAL: normalization between cells (count depth)
NORMALIZATION METHODS: (global scaling, linear normalization)
Assumption: every cell initially contains the same number of mRNA molecules, so differences in count depth arise only from sampling
- High-count filtering CPM (Weinreb et al, 2018)
- genes that account for 5% or more of the counts are left out when the size factor is computed
- allows count variability in a few highly expressed genes
- Scran (Lun et al, 2016a): the current top performer!!
- limits variability to fewer than 50% of genes being differentially expressed between cells
- performed well in a small-scale comparison
- excellent for batch correction
** For data showing a strong batch effect, non-linear normalization methods work well (Cole et al, 2019)
- CPM (counts per million)
- count depth scaling - using scale factor proportional to the count depth per cell
- alternative: downsampling (increases technical dropout)
Full-length data: TPM normalization
3’ end data: scran
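CPM and its high-count filtering variant differ only in how the per-cell size factor is computed. A minimal sketch, assuming every cell has non-zero counts; the function name `cpm` and the `max_fraction` parameter are hypothetical, and the 0.5 threshold is used only because the toy matrix has three genes (the notes above cite 5%).

```python
import numpy as np

def cpm(counts, max_fraction=None):
    """Counts per million; optionally ignore dominant genes in the size factor."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if max_fraction is None:
        size = totals
    else:
        # Drop genes that take at least `max_fraction` of the counts in any cell.
        dominant = (counts >= max_fraction * totals).any(axis=0)
        size = counts[:, ~dominant].sum(axis=1, keepdims=True)
    return counts / size * 1e6

counts = np.array([[90, 5, 5],
                   [10, 45, 45]])
plain = cpm(counts)
filtered = cpm(counts, max_fraction=0.5)
```

In the plain version the highly expressed first gene pushes down the normalized values of the other two genes in the first cell; once it is excluded from the size factor, those two genes come out equal across both cells.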
[2] gene normalization: scaling gene counts to z scores (0 mean, unit variance)
Assumption: all genes are weighted equally
** Used in Seurat; others advise against it
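Gene scaling to z-scores is a one-liner per gene. A minimal NumPy sketch, assuming rows are cells and columns are genes; `scale_genes` is a hypothetical helper, not the Seurat function.

```python
import numpy as np

def scale_genes(x):
    """Scale each gene (column) to zero mean and unit variance across cells."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant genes stay at zero instead of dividing by zero
    return (x - mean) / std

x = np.array([[1.0, 100.0, 3.0],
              [3.0, 300.0, 3.0],
              [5.0, 500.0, 3.0]])
z = scale_genes(x)
```

The first two genes differ 100-fold in magnitude but end up with identical z-scores, which is exactly the "all genes are weighted equally" assumption.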
⇒ After normalization, a log(x+1) transform is usually applied
- distances between log-transformed values represent log fold changes
- mitigates the mean-variance relationship
- reduces the skewness of the data
Caution! If the size factor distributions differ strongly between the test groups, the results can be spurious
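A small numerical check of the "distance = log fold change" point, using made-up normalized values for two cells:

```python
import numpy as np

normalized = np.array([[0.0, 9.0, 99.0],
                       [0.0, 99.0, 999.0]])
logged = np.log1p(normalized)  # log(x + 1), defined at zero counts

# A difference on the log scale is the log fold change of the (x + 1) values.
diff = logged[1] - logged[0]
fold_change = np.exp(diff)
```

Here `fold_change` comes out as 1, 10 and 10 for the three genes. Note that it is the fold change of x+1 rather than of x, so it is only approximate for small counts.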
Data correction and integration
GOAL: data correction for batch effects, dropout, or cell cycle
** These effects may carry biological meaning, so take that into account before removing them
[1] Regressing out biological effects
- for particular biological signals of interest
- remove cell cycle effects → can improve trajectory analysis
- remove mt gene expression
- HOW?
- linear regression against cell cycle scores (Scanpy, Seurat)
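The idea behind regressing out a score can be sketched with ordinary least squares: fit each gene against the covariate and keep the residuals. This is a minimal NumPy sketch under that assumption, not the Scanpy or Seurat implementation; `regress_out` here is a hypothetical helper and the cell cycle score is simulated.

```python
import numpy as np

def regress_out(expr, covariates):
    """Return residuals of each gene after a least-squares fit on the covariates."""
    expr = np.asarray(expr, dtype=float)
    cov = np.asarray(covariates, dtype=float)
    if cov.ndim == 1:
        cov = cov[:, None]
    design = np.column_stack([np.ones(len(cov)), cov])  # intercept + covariates
    beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ beta

rng = np.random.default_rng(0)
score = rng.normal(size=50)          # simulated per-cell cell cycle score
noise = rng.normal(size=(50, 2))
expr = np.column_stack([2.0 * score, np.zeros(50)]) + noise
residuals = regress_out(expr, score)
```

After the fit, the residuals have no remaining linear relationship with the score. The same function works for other covariates, such as count depth.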
[2] Regressing out tech effects
- count depth
- for a stricter correction, use downsampling or non-linear normalization (if count depth varies widely between cells)
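Downsampling means randomly discarding counts until a cell reaches a target depth. A minimal sketch for a single cell, assuming integer counts; `downsample_cell` and the target of 20 are illustrative choices.

```python
import numpy as np

def downsample_cell(counts, target, rng):
    """Sample `target` counts without replacement from one cell's count vector."""
    counts = np.asarray(counts)
    if counts.sum() <= target:
        return counts.copy()
    # Expand to one entry per counted molecule, labelled with its gene index.
    molecules = np.repeat(np.arange(counts.size), counts)
    kept = rng.choice(molecules, size=target, replace=False)
    return np.bincount(kept, minlength=counts.size)

rng = np.random.default_rng(0)
cell = np.array([50, 30, 15, 5, 0])
down = downsample_cell(cell, target=20, rng=rng)
```

Because molecules are thrown away, lowly expressed genes can drop to zero, which is the increase in technical dropout noted in the normalization section.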
[3] Batch effects
- correcting between samples or cells
- HOW? (by linear approach)
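The simplest possible linear correction is to shift each batch so that its per-gene mean matches the overall mean. The sketch below shows only that; it is an assumption-laden toy and not the method used in the paper (methods such as ComBat add considerably more machinery). `center_batches` is a hypothetical helper.

```python
import numpy as np

def center_batches(expr, batch):
    """Shift every batch so its per-gene mean equals the overall per-gene mean."""
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    corrected = expr.copy()
    overall = expr.mean(axis=0)
    for b in np.unique(batch):
        mask = batch == b
        corrected[mask] += overall - expr[mask].mean(axis=0)
    return corrected

expr = np.array([[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]])
batch = np.array(["a", "a", "b", "b"])
corrected = center_batches(expr, batch)
```

This assumes the batches contain the same mix of cell types; if they do not, real biological differences get removed along with the batch effect.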
[4] Data Integration
- integrate from multiple experiments
- HOW? (by non-linear approach)
- MNN, Harmony
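MNN starts from mutual nearest neighbour pairs between two batches. The sketch below covers only that pair-finding step with brute-force distances; the correction that MNN then derives from the pairs is not shown, and Harmony works differently. `mutual_nearest_pairs` and the toy coordinates are assumptions for illustration.

```python
import numpy as np

def mutual_nearest_pairs(a, b, k=1):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbours."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nn_of_a = np.argsort(dist, axis=1)[:, :k]    # neighbours in b for each a
    nn_of_b = np.argsort(dist, axis=0)[:k, :].T  # neighbours in a for each b
    return [(i, int(j)) for i in range(len(a)) for j in nn_of_a[i]
            if i in nn_of_b[j]]

batch1 = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])
batch2 = np.array([[0.5, 0.2], [5.3, 4.9]])
pairs = mutual_nearest_pairs(batch1, batch2)
```

The third cell of `batch1` has a nearest neighbour in `batch2`, but that neighbour prefers another cell, so no pair is formed. This is how cell types present in only one batch are left unmatched.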
[5] Expression recovery (not recommended)
- dropout, imputation
Feature selection, dimensionality reduction, and visualization
A human scRNAseq dataset can contain up to 25,000 genes; after QC, ~15,000 remain
GOAL: condense extremely sparse, high-dimensional data into a meaningful representation
Feature Selection
- Use highly variable genes (HVGs) for downstream analysis; a high # of HVGs is recommended
- HOW?
- binned by gene’s mean expression → for each bin, genes with the highest variance-to-mean ratio are selected
- from count data (Seurat) (open question: what is Seurat's default number of genes?)
- from log-transformed data (Cell Ranger)
- ** Not applicable if the data has already been z-score normalized
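The binning procedure above can be sketched as follows. This is a simplified illustration rather than the Seurat or Cell Ranger implementation: it assumes equally sized bins after sorting by mean, and `select_hvgs`, `n_bins`, and `top_per_bin` are hypothetical names. Real implementations differ in how they bin and how they normalize the dispersion.

```python
import numpy as np

def select_hvgs(counts, n_bins=2, top_per_bin=1):
    """Pick genes with the highest variance-to-mean ratio within each mean-expression bin."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    expressed = np.flatnonzero(mean > 0)  # skip undetected genes
    # Sort expressed genes by mean and split them into equally sized bins.
    order = expressed[np.argsort(mean[expressed])]
    selected = []
    for bin_genes in np.array_split(order, n_bins):
        ratio = var[bin_genes] / mean[bin_genes]
        best = bin_genes[np.argsort(ratio)[::-1][:top_per_bin]]
        selected.extend(int(g) for g in best)
    return sorted(selected)

counts = np.array([[1, 0, 10, 0],
                   [1, 4, 10, 40],
                   [1, 0, 10, 0],
                   [1, 4, 10, 40]])
hvgs = select_hvgs(counts)
```

Binning by mean keeps highly expressed genes from dominating the selection just because their variance is larger in absolute terms.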
Dimensionality reduction
- Summarization
- PCA (principal component analysis)
- linear approach, reduces dimension by maximizing the captured residual variance in each further dimension
- used as a pre-processing step for non-linear dimensionality reduction
- distances are meaningful
- Diffusion maps
- non-linear approach, alternative to PCA for trajectory inference summarization
- good for data from continuous processes (e.g. when differentiation is of interest)
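PCA as described above can be computed with a singular value decomposition of the centered data. A minimal NumPy sketch on simulated data; `pca` is a hypothetical helper, and in practice a library implementation would normally be used.

```python
import numpy as np

def pca(x, n_components=2):
    """Project cells onto the top principal components via SVD of the centered data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)  # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# Toy data: 100 cells, 5 genes, with most variance along one direction.
latent = rng.normal(size=(100, 1))
x = latent @ np.array([[3.0, 2.0, 1.0, 0.0, 0.0]]) + 0.1 * rng.normal(size=(100, 5))
scores, explained = pca(x, n_components=2)
```

Each further component captures the largest share of the variance left over by the previous ones, so here the first component carries nearly all of it.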
- Visualization
- non-linear
- t-SNE: focuses on local similarity
- UMAP: the current best!! *can also summarize data in more than 2 dimensions (room to grow!)
- SPRING: graph-based tools
- PAGA: for coarse-grained visualization (for large numbers of cells)
Downstream analysis