바이오 대표

[scRNAseq paper] Single-cell RNA sequencing data analysis workflow: “Current best practices in single-cell RNA-seq analysis: a tutorial”


바이오 대표 2023. 2. 20. 08:58

Summary of the paper “Current best practices in single-cell RNA-seq analysis: a tutorial”

Summary

The number of scRNAseq analysis tools keeps increasing → lack of standardization (+ tools depend on the programming language they are written in)

The paper introduces the typical scRNAseq analysis steps as current best-practice recommendations.

  • count matrix
  • pre-processing
    • QC, normalization, data correction, feature selection, dimensionality reduction
  • cell- and gene-level downstream analysis

 

Pre-processing and visualization

Raw data → count matrix

  • wet lab (tissue to count)
    1. single-cell dissociation (tissue digesting)
    2. single cell isolation
      1. doublets, multiplets, and empty droplets need to be filtered out later
    3. library construction
      1. each droplet contains chemicals to
        1. break down the cell membrane
        2. capture mRNA
        3. RT to cDNA
        4. amplification (to increase its probability of being measured)
    4. libraries (labelled with cellular barcodes + UMIs) are pooled (multiplexed) for sequencing
    5. pipeline (FASTQ → QC, demultiplexing, alignment)
    6. count data

Raw data processing pipelines:

Resulting count matrix:

# of barcodes (cells) × # of transcripts
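As a rough illustration of how barcodes and UMIs turn into a count matrix, here is a minimal sketch. It assumes the reads have already been demultiplexed and aligned, and that each read is reduced to a (cell barcode, UMI, gene) triple. The read list, barcode strings, and gene names are made up for the example; real pipelines also handle sequencing errors in barcodes and UMIs, which is ignored here.

```python
import numpy as np

# Hypothetical aligned reads: (cell barcode, UMI, gene). All values are made up.
reads = [
    ("AAAC", "u1", "GeneA"),
    ("AAAC", "u1", "GeneA"),  # PCR duplicate: same barcode, UMI, and gene
    ("AAAC", "u2", "GeneA"),
    ("AAAC", "u3", "GeneB"),
    ("TTTG", "u1", "GeneB"),
    ("TTTG", "u4", "GeneC"),
]


def build_count_matrix(reads):
    """Collapse reads to unique molecules and return (matrix, barcodes, genes)."""
    # Reads sharing barcode, UMI, and gene come from one molecule, so count once.
    unique_molecules = set(reads)
    barcodes = sorted({barcode for barcode, _, _ in unique_molecules})
    genes = sorted({gene for _, _, gene in unique_molecules})
    row = {barcode: i for i, barcode in enumerate(barcodes)}
    col = {gene: j for j, gene in enumerate(genes)}
    counts = np.zeros((len(barcodes), len(genes)), dtype=int)
    for barcode, _, gene in unique_molecules:
        counts[row[barcode], col[gene]] += 1
    return counts, barcodes, genes


counts, barcodes, genes = build_count_matrix(reads)
print(barcodes, genes)
print(counts)  # rows = barcodes (cells), columns = genes
```

The amplified duplicate of `u1`/`GeneA` is counted only once, which is the point of attaching UMIs before amplification.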

 

Quality Control

= removal of doublets and low-quality (dying) cells

QC covariates (main):

  1. the # of counts per barcode (count depth)
  2. the # of genes per barcode: the number of genes with at least one count
  3. the fraction of counts from mitochondrial (mt) genes per barcode

Considerations:

  • Do not assess these covariates one at a time; consider them jointly (an unusual value may have a biological meaning).
  • Start with permissive QC thresholds
  • If the QC covariate distributions differ between samples, set a separate threshold for each sample

** Heterogeneous cell populations may exhibit multiple QC covariate peaks
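The three QC covariates can be computed directly from the count matrix. The sketch below assumes a cells × genes NumPy array and a boolean mask marking mitochondrial genes. The function names and the threshold values are hypothetical and only chosen to fit the toy data. In practice the thresholds come from looking at the joint distributions per sample, not from fixed numbers.

```python
import numpy as np


def qc_covariates(counts, is_mt):
    """Return count depth, number of detected genes, and mt fraction per cell."""
    counts = np.asarray(counts, dtype=float)
    is_mt = np.asarray(is_mt, dtype=bool)
    count_depth = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)  # genes with at least one count
    mt_counts = counts[:, is_mt].sum(axis=1)
    # Barcodes with zero counts get an mt fraction of 0 instead of NaN.
    mt_fraction = np.divide(
        mt_counts, count_depth, out=np.zeros_like(count_depth), where=count_depth > 0
    )
    return count_depth, n_genes, mt_fraction


def permissive_filter(counts, is_mt, min_counts, min_genes, max_mt_fraction):
    """Boolean mask of cells that pass all three (illustrative) thresholds."""
    depth, n_genes, mt_fraction = qc_covariates(counts, is_mt)
    return (depth >= min_counts) & (n_genes >= min_genes) & (mt_fraction <= max_mt_fraction)


# Toy data: 3 barcodes x 4 genes, the last gene is mitochondrial.
toy_counts = np.array([[5, 3, 0, 2], [1, 0, 0, 9], [0, 0, 0, 0]])
toy_is_mt = np.array([False, False, False, True])

depth, n_genes, mt_fraction = qc_covariates(toy_counts, toy_is_mt)
keep = permissive_filter(toy_counts, toy_is_mt, min_counts=5, min_genes=2, max_mt_fraction=0.5)
print(depth, n_genes, mt_fraction, keep)
```

Returning the covariates separately from the mask makes it easy to plot them against each other before committing to any cutoff.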

Tools:

 

Normalization

[1] GOAL: normalization between cells (count depth)

NORMALIZATION METHODS: (global scaling, linear normalization)

Assumption: all cells initially contain the same number of mRNA molecules, so differences in count depth come from sampling

  • High-count filtering CPM (Weinreb et al, 2018)
    • genes that account for at least 5% of the total counts in a cell are excluded when computing the size factors
    • allows count variability in a few highly expressed genes
  • Scran (Lun et al, 2016a): currently the top choice!!
    • limits variability to fewer than 50% of genes being differentially expressed between cells
    • :) performed better than the other tested methods in a small-scale comparison
    • excellent for batch correction

** Non-linear normalization methods work well on data with strong batch effects (Cole et al, 2019)

  • CPM (counts per million)
    • count depth scaling - using scale factor proportional to the count depth per cell
    • related approach: downsampling (increases technical dropout)
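Count depth scaling and its high-count filtering variant can be sketched in a few lines. This assumes a cells × genes NumPy array. The function name `cpm_normalize` and its parameters `target_sum` and `max_fraction` are made up for this example and are not taken from any particular library.

```python
import numpy as np


def cpm_normalize(counts, target_sum=1e6, max_fraction=None):
    """Scale each cell by a size factor proportional to its count depth.

    If max_fraction is given, genes that reach that fraction of the counts
    in any cell are left out when computing the size factors.
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    if max_fraction is not None:
        fraction = np.divide(counts, depth, out=np.zeros_like(counts), where=depth > 0)
        highly_expressed = (fraction >= max_fraction).any(axis=0)
        depth = counts[:, ~highly_expressed].sum(axis=1, keepdims=True)
    size_factors = depth / target_sum
    # Cells with a zero size factor are returned as all zeros.
    return np.divide(counts, size_factors, out=np.zeros_like(counts), where=size_factors > 0)


toy = np.array([[90, 5, 5], [10, 10, 20]])
plain = cpm_normalize(toy)
filtered = cpm_normalize(toy, max_fraction=0.6)  # 0.6 only suits this toy matrix
print(plain.sum(axis=1))     # every cell sums to target_sum
print(filtered[:, 1:].sum(axis=1))  # only the retained genes sum to target_sum
```

With filtering, the first gene no longer drives the size factor of the first cell, so the remaining genes are compared on an equal footing across cells.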

Full-length data: TPM normalization

3’ end data: scran

 

[2] Gene normalization: scaling gene counts to z-scores (zero mean, unit variance)

Assumption: all genes are weighted equally

** Used in Seurat; others advise against it

⇒ After normalization, the data are usually log(x+1)-transformed

  1. distances between log-transformed values represent log fold changes
  2. mitigates the mean-variance relationship
  3. reduces the skewness of the data

Caution! If the size factor distributions differ strongly between the test groups, the results can be spurious
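The log(x+1) transform and the optional gene scaling are both one-liners on a cells × genes array. A minimal sketch, assuming the input has already been normalized between cells (the function names are made up for illustration):

```python
import numpy as np


def log1p_transform(normalized):
    """Apply log(x + 1) element-wise to normalized counts."""
    return np.log1p(np.asarray(normalized, dtype=float))


def scale_genes(x):
    """Scale every gene (column) to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std < 1e-12] = 1.0  # constant genes are only centered, not divided
    return (x - mean) / std


toy = np.array([[0.0, 1.0, 5.0], [3.0, 1.0, 7.0], [6.0, 1.0, 9.0]])
logged = log1p_transform(toy)
scaled = scale_genes(logged)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every non-constant gene contributes the same variance, which is exactly the "all genes are weighted equally" assumption stated above.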

 

Data correction and integration

GOAL: data correction for batch effects, dropout, or cell cycle

** These effects may carry biological meaning, so consider that before removing them

[1] Regressing out biological effects

  • for particular biological signals of interest
    • remove cell cycle effects → can improve trajectory analysis
    • remove mt gene expression
  • HOW?
    • linear regression against cell cycle scores (Scanpy, Seurat)

[2] Regressing out tech effects

  • count depth
  • for a more rigorous correction: downsampling or non-linear normalization (if count depths vary widely between cells)
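Regressing out a covariate (a cell cycle score, or count depth) amounts to fitting one linear model per gene and subtracting the fitted covariate effect. The following is a minimal sketch with ordinary least squares, assuming a single covariate with one value per cell. It is not the implementation used by Scanpy or Seurat, and the function name is made up.

```python
import numpy as np


def regress_out(expr, covariate):
    """Remove the linear effect of one per-cell covariate from every gene."""
    expr = np.asarray(expr, dtype=float)            # cells x genes
    covariate = np.asarray(covariate, dtype=float)  # one value per cell
    design = np.column_stack([np.ones_like(covariate), covariate])
    # coef[0] holds the intercepts, coef[1] the slopes, one entry per gene.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    # Subtract only the covariate-dependent part and keep each gene's baseline.
    return expr - np.outer(covariate, coef[1])


rng = np.random.default_rng(0)
score = rng.normal(size=50)  # stand-in for a cell cycle score
expr = np.column_stack([2.0 * score + 3.0, rng.normal(size=50)])
corrected = regress_out(expr, score)
print(corrected[:3])
```

The first toy gene depends entirely on the score, so after correction it is flat at its baseline, while the second gene is left almost unchanged.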

[3] Batch effects

[4] Data Integration

  • integrate from multiple experiments
  • HOW? (by non-linear approach)
    • MNN, Harmony

[5] Expression recovery (not recommended)

  • dropout, imputation

 

Feature selection, dimensionality reduction, and visualization

A human scRNAseq dataset can contain up to 25,000 genes → (QC) → ~15,000

GOAL: condense extremely sparse, high-dimensional data into a meaningful representation

Feature Selection

  • highly variable genes (HVGs) are used for downstream analysis; a higher # of HVGs is recommended
    • HOW?
      • genes are binned by mean expression → for each bin, the genes with the highest variance-to-mean ratio are selected
      • from count data (Seurat) // question: what is Seurat's default number of genes?
      • from log-transformed data (Cell Ranger)
    • ** Not applicable if the data have been z-score normalized
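The binning idea can be sketched as follows. This version standardizes the variance-to-mean ratio within each mean-expression bin and then takes the top-ranked genes overall, which is one possible reading of the procedure and not a copy of the Seurat or Cell Ranger code. The function name, `n_bins`, and `n_top` are assumptions for the example.

```python
import numpy as np


def select_hvgs(x, n_bins=20, n_top=2000):
    """Return indices of genes with high variance-to-mean ratio for their mean."""
    x = np.asarray(x, dtype=float)  # cells x genes, not z-scored
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)

    # Bin genes by mean expression, using quantiles so bins have similar sizes.
    edges = np.quantile(mean, np.linspace(0.0, 1.0, n_bins + 1))
    bin_id = np.clip(np.searchsorted(edges, mean, side="right") - 1, 0, n_bins - 1)

    # Standardize the dispersion within each bin so bins are comparable.
    z = np.zeros_like(dispersion)
    for b in range(n_bins):
        in_bin = bin_id == b
        if in_bin.sum() > 1:
            spread = dispersion[in_bin].std()
            if spread > 0:
                z[in_bin] = (dispersion[in_bin] - dispersion[in_bin].mean()) / spread

    n_top = min(n_top, x.shape[1])
    return np.argsort(z)[::-1][:n_top]


rng = np.random.default_rng(0)
toy = rng.poisson(lam=rng.uniform(1, 10, size=200), size=(300, 200)).astype(float)
toy[:, 7] = 0.0
toy[:30, 7] = 50.0  # gene 7 is expressed in only 10% of cells: highly variable
hvgs = select_hvgs(toy, n_bins=10, n_top=20)
print(sorted(hvgs.tolist()))
```

Because a z-scored gene has variance 1 by construction, its variance-to-mean ratio carries no information, which is why this step has to run before any gene scaling.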

Dimensionality reduction

  1. Summarization
    • PCA (principal component analysis)
      • linear approach; reduces dimensions by maximizing the captured residual variance in each further dimension
      • used as a pre-processing step for non-linear dimensionality reduction
      • distances are meaningful
    • Diffusion maps
      • non-linear approach, alternative to PCA for trajectory inference summarization
      • good for data from continuous processes (e.g. when differentiation is of interest)
  2. Visualization
    1. non-linear
    • t-SNE: focuses on local similarity
    • UMAP: currently the best!! *can also summarize data in more than two dimensions (room to grow!)
    • SPRING: graph-based tool
    • PAGA: for coarse-grained visualization (for large numbers of cells)
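For the PCA summarization step, a minimal sketch via singular value decomposition is shown here. It assumes a dense cells × genes array (for example log-normalized HVG expression); the function name and the default of 50 components are choices made for this example. The non-linear visualization methods listed above would typically take the resulting scores as input.

```python
import numpy as np


def pca(x, n_components=50):
    """Return (scores, components, explained_variance_ratio) from an SVD."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)  # PCA requires mean-centered genes
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    n_components = min(n_components, len(s))
    scores = u[:, :n_components] * s[:n_components]  # cell coordinates
    explained = s**2 / (x.shape[0] - 1)
    ratio = explained / explained.sum()
    return scores, vt[:n_components], ratio[:n_components]


rng = np.random.default_rng(0)
toy = rng.normal(size=(30, 8))
scores, components, ratio = pca(toy, n_components=3)
print(scores.shape, components.shape, ratio)
```

Each further component captures the largest share of the variance not explained by the previous ones, so the returned ratios never increase.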

 

Downstream analysis
