[ RNA-seq DE Analysis ] EdgeR/limma (COVID-19 vs Health)

Notice

Recent Posts

Recent Comments

Link

Link to blog "한 사람의 일상"

« 2025/04 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Tags more

Archives

Today

Total

관리 메뉴

바이오 대표

[ RNA-seq DE Analysis ] EdgeR/limma (COVID-19 vs Health) 본문

My works

[ RNA-seq DE Analysis ] EdgeR/limma (COVID-19 vs Health)

바이오 대표 2022. 1. 26. 18:58

EdgeR 은 RNA counts data로부터 DE (Differential Expression) Analysis 를 진행한다.

EdgeR 이 제공하는Functions:

DGEList - 데이터를 효율적으로 사용할 수 있도록, List-based data 태로 변환 ([1] counts [2] samples)
filterByExpr
cacNormFactors
estimateDisp

### Covid- 19 Patients vs Healthy Control DE Analysis ###

해당 글에서 다룰 DE (differntial Expression) Analysis 는 4가지의 실험중 RNA-Seq data 를 R bioconductor edgeR 을 이용한 실험이다. 17 명의 Covid-19 patient 와 17명의 건강한 control 에서 Peripheral Blood Mononuclear Cells (PBMCs - Lymphocytes) 샘플을 추출해 single Illumina HiSeq2000 flow cell 으로 sequencing 을 하였다. Patients 들은 convalescnet, moderate, severe, and intensive 카테고리로 나눌 수 있다. (아래예시 코드에는 COVID-19 vs HEALTHY 만 이용하였다)

2022.01.19 - [My work] - [ Multi-omic Analysis ] Immune Response Against COVID-19: an -omics approach

[ Multi-omic Analysis ] Immune Response Against COVID-19: an -omics approach

여러 -omic 데이터를 통합하여 (more data, potential to more power) 우리는 complex biological big data 를 분석할 수 있다. ( Genomics, Transcriptomics, Proteomics, Epigenomics ) * Epigenome: DNA나 히..

joyful-ugentstudent-note.tistory.com

+ 더 나아가 Gene set Analysis 도 하였다.

1. Set up

1.1 Load packages

library(edgeR)
library(pheatmap)
library(limma)
library(org.Hs.eg.db)

* 만약 당신이 맥북 M1 의 소유자라면, 최선 R 에는 해당 library가 없을 것이다. 그렇다면 BiocManager 을 install 하길 권장한다.

예시) > BiocManager::install("limma ")

* 하지만 그러하더라도 edgeR 이 install 안될 수 있다. 이는 역시나 버전 문제. 엄청난 구글링과 시도끝에 찾은 방법

2022.01.25 - [Extra] - [ M1 chip ] R Bioconductor (R Studio) 맥북 Mac

[ M1 chip ] R Bioconductor (R Studio) 맥북 Mac

당신이 M1 chip (ARM silicon) 의 소유자라면 추후 몇년간은 Rosetta2 emulator 사용이 가장 편할 것이다. Bioconductor는 아직 Apple M1 chip을 기본적으로 지원하지 못한다. (edgeR이 install 되지 않아 찾..

joyful-ugentstudent-note.tistory.com

1.2 Load Dataset

데이터는 NCBI GEO (gene expression omnibus) GSE152418 에서 extracted 하였다.

# read in text file 
rawCount <- read.table("../Dataset/RNAseq1/GSE152418_p20047_Study1_RawCounts.txt", header = T)
head(rawCount)

Rows = ENSEMBL_ID (gene ID) , Columns = sample ID

1.3 Make parameters for design

# Covid-19 vs Healthy
cvsh <- factor(c(rep("Covid-19", 17), rep("Healthy", 17)))
cvsh <- relevel(cvsh,"Healthy")

2. DE Analysis (EdgeR Quasi)

2.1 Data into DGEList

y <- DGEList(rawCount)

이때 y 는 y$counts , y$samples 두개로 구성된다.

2.2 Normalization of library sizes

calcNormFacotrs( ): scaling factors 이용 (TMM - weighted mean of log2 ratio between test and ref. samples 이용)

* library size = sequencing depth = sums of the counts

keep <- rowSums(cpm(y)>cutoff) >= 3
y <- y[keep, ,keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

2.3 MDS plot

MDS (Multidimensional scaling) 을 이용해서 samples 들이 대략적인 유사도를 알 수 있다. 가까우면 비슷, 멀면 안비슷.

plotMDS(y, labels=paste(cvsh2), col=as.double(cvsh2), main ="MDS Plot", cex = 0.8)  # as.double 은 Type 을 숫자형으로 바꿔 주는 것

크게 COVID-19 patients (color) Healthy 그룹이 나뉘는것을 확인할 수 있다.

2.4 Design

estimateDisp( ): negative binomial likelihood 를 최대화해서 common, trended and tagwise dispersions를 구한다.

# Covid-19 vs Healthy
design <- model.matrix(~cvsh)
y <- estimateDisp(y,design)   
plotBCV(y, main="Covid-19 vs Healthy")

BCV (biological coeffieicnt of variation)

* Common dispersion 계산법은 모든 유전자를 common으로 계산하는 방법이고, Tagwise 은 각각의 gene을 고려한 gene-specific dispersion 이다.

2.5 Fitting the model

edgeR 은 아래와 같이 진행된다.

Negative Binomial Model 에 fit
Dispersion estimates
qCML(quantile-adjusted conditional maximum likelihood) 방법을 이용한 exact test 를 통한 DE test

* glmFit 과, 더 나아가 glmQLFit 를 이용하면 exact test 보다 더 자유도가 높아, 더 많은 DE genes 를 찾을 수 있다.

glmQLFit : tagwise NB dispersion을 이용하여 glm에 fit
glmQLFTest : Quasi-likelihood dispersion을 estimate

* GLM 은 outcome/response 에 대해서 distribution을 추축하고 이 distribution에 의거하여 likelihood 를 maximize 하는 것이다. 만약 데이터에 over dispersion 이 있다고 추측되면 QL (Quasi-likelihood)를 적용시켜줄 수 있다. 이는 통계 모델에서 예측되는 variability 보다 더 큰 over-dispersion 을 허용하는 방법이다.

# Covid-19 vs Healthy
fit <- glmQLFit(y,design)
qlf <- glmQLFTest(fit,coef=2) 
res <- topTags(qlf,n=nrow(qlf$table)) 

# filtering at FDR < 0.05
res_edgeR <- res$table[res$table[,ncol(res$table)]<0.05,]

* edgeR 말고 DESeq2 library 도 DE anlaysis 할때 종종 사용되는데 edgeR 이 variance estimate에서 lowly expressed genes 보다 highly expression genes 에 대해 더 관대한다.

2.6 Summary

decideTestDGE( )

qlf$comparison <- "COVID-19 vs Healthy"
summary(dt <-decideTestsDGE(qlf))

2.7 Visualization

# MA Plot

# MA plot
with(res$table,plot(logCPM,logFC,pch=16,cex=0.2, main= 'COVID-19 vs Healthy')) 
with(res$table,points(logCPM[FDR<0.05],logFC[FDR<0.05],pch=16,col="red",cex=0.3)) 
abline(0,0)

# Histogram
hist(res$table$PValue, main="COV19 vs Healthy", xlab = "p-values")

# Heatmap
pheatmap(cpm(y,log=TRUE)[rownames(res$table)[1:30],],labels_col = paste(cvsh), main="COV19 vs Healthy")

3. Gene Set analysis

DE analysis 에서 더 나아가 Gene Set analysis 도 진행하였다. ENSEMBLid, ENTREZid 를 이용해서 ors.Hs.eg DB에서 더 자세한 정보를 얻을 수 있었고 해당 프로젝트는 Immune Genes 에 중점을 두었다.

* GO:0006955 - which indicates the 'immune response' biological process ontology

# Gonna (Gene Ontology)
gsa <- goana(Entrez_gene_IDs,species="Hs")
gsa <- gsa[sort(gsa$P.DE,index.return=T)$ix,]
gsa$P.DE.adj <- p.adjust(gsa$P.DE,n=nrow(gsa),method="BH")

# KEGG (pathway)
kegg <- kegga(Entrez_gene_IDs,species="Hs")
kegg <- kegg[sort(kegg$P.DE,index.return=T)$ix,]
kegg$P.DE.adj <- p.adjust(kegg$P.DE,n=nrow(kegg),method="BH")

참고 https://blog.naver.com/jinp7/222041237039

저작자표시

'My works' 카테고리의 다른 글

[ CITE-seq DE Analysis ] Differential expression analysis for the protein component of CITE-seq data (CiteFuse) (0)	2022.01.27
[ MHC genotyping Tools Benchmarking ] Benchmarking of NGS-based MHC-II genotyping algorithms (0)	2022.01.22
[ Multi-omic Analysis ] Immune Response Against COVID-19: an -omics approach (0)	2022.01.19

'My works' Related Articles

바이오 대표

[ RNA-seq DE Analysis ] EdgeR/limma (COVID-19 vs Health) 본문

[ RNA-seq DE Analysis ] EdgeR/limma (COVID-19 vs Health)

### Covid- 19 Patients vs Healthy Control DE Analysis ###

1. Set up

1.1 Load packages

1.2 Load Dataset

1.3 Make parameters for design

2. DE Analysis (EdgeR Quasi)

2.1 Data into DGEList

2.2 Normalization of library sizes

2.3 MDS plot

2.4 Design

2.5 Fitting the model

2.6 Summary

2.7 Visualization

3. Gene Set analysis

'My works' 카테고리의 다른 글

티스토리툴바