바이오 대표

[NGS scATACseq] scATACseq, Cicero를 이용해서 cis-regulatory gene network 분석하기 본문

Bioinformatics/Tools

[NGS scATACseq] scATACseq, Cicero를 이용해서 cis-regulatory gene network 분석하기

바이오 대표 2022. 12. 28. 04:01

Cicero

Cicero single-cell chromatin accessibility data 를 이용 → co-accessibility examinecis-regulatory network를 구성하고 분석하기 위한 툴이다. 나는 scATACseq을 분석하기 위해 해당 툴을 이용했다.

How? chromatin accessibility data → physically proximity 한 regions of genome 구하기→ 추정되는 enhancer-promoter pair 찾기 → cis-regulatory network 구성

 

Step 0. Create an input CDS

Cicero는 CDS (cell_data_set) class object 를 이용한다.

10x scATACseq 데이터를 사용했다면 filtered_peak_bc_matrix 폴더에 있는 데이터를 input CDS형태로 변형시켜줄 수 있다.

# read in matrix data using the Matrix package
indata <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
# binarize the matrix
indata@x[indata@x > 0] <- 1

# format cell info
cellinfo <- read.table("filtered_peak_bc_matrix/barcodes.tsv")
row.names(cellinfo) <- cellinfo$V1
names(cellinfo) <- "cells"

# format peak info
peakinfo <- read.table("filtered_peak_bc_matrix/peaks.bed")
names(peakinfo) <- c("chr", "bp1", "bp2")
peakinfo$site_name <- paste(peakinfo$chr, peakinfo$bp1, peakinfo$bp2, sep="_")
row.names(peakinfo) <- peakinfo$site_name

row.names(indata) <- row.names(peakinfo)
colnames(indata) <- row.names(cellinfo)

# make CDS
input_cds <-  suppressWarnings(new_cell_data_set(indata,
cell_metadata = cellinfo,
gene_metadata = peakinfo))

input_cds <- monocle3::detect_genes(input_cds)

#Ensure there are no peaks included with zero reads
input_cds <- input_cds[Matrix::rowSums(exprs(input_cds)) != 0,]

 

Step 1. Create a Cicero CDS

single cell accessibility data (scATACseq) 은 sparse (0이 엄청 많은) 데이터 이기 때문에, 정확한 co-accessibility scores 를 얻기 위해서, 비슷한 cells들을 합쳐서 dense count data 를 만들어야한다. Cicero 는 이를 KNN을 이용해서 overlapping sets of cells 를 만들어 사용한다. ← based on UMAP 이지만 어떤 dimensionality reduction 방법을 사용해도 상관 없다.

make_cicero_cds(): input CDS object + reduced dimension coordinate map → create aggregated CDS object

# Create a Cicero CDS 
set.seed(2017)
input_cds <- detect_genes(input_cds)
input_cds <- estimate_size_factors(input_cds)
input_cds <- preprocess_cds(input_cds, method = "LSI")
input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP',
                              preprocess_method = "LSI")
# Monacle로 UMAP 그리기
plot_cells(input_cds)

# UMAP coordinates 얻고, cicero CDS 만들기
umap_coords <- reducedDims(input_cds)$UMAP
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = umap_coords)
## input_cds 예시

# class: cell_data_set 
# dim: 93518 18157 
# metadata(2): cds_version citations
# assays(1): counts
# rownames(93518): chr1_9788_10684 chr1_180636_181315 ... KI270713.1_32971_33663 KI270713.1_34301_35077
# rowData names(5): chr bp1 bp2 site_name num_cells_expressed
# colnames(18157): AAACGAAAGACCGCAA-1 AAACGAAAGAGGCAGG-3 ... TTTGTGTTCTAAACGC-4 TTTGTGTTCTATATCC-1
# colData names(3): cells Size_Factor num_genes_expressed
# reducedDimNames(2): LSI UMAP
# mainExpName: NULL
# altExpNames(0):
  • reduced_coordinates: reduced dimension coordinate map (umap_coords을 확인해보자)
    • data.frame or matrix 형태

  • Monocle 3 과 연동되어 있어서, Monocle’s plotting function도 사용 가능하다.

 

Step 2. Run Cicero

run_cicero to estimate co-accessibility sites

data("human.hg19.genome")
# use only a small part of the genome for speed
sample_genome <- subset(human.hg19.genome, V1 == "chr2")
sample_genome$V2[1] <- 10000000

## Usually use the whole human.hg19.genome ##
## Usually run with sample_num = 100 ##
conns <- run_cicero(cicero_cds, sample_genome, sample_num = 2)
head(conns)

** 다른 reference 데이터를 이용하고 싶다면 https://cole-trapnell-lab.github.io/cicero-release/docs_m3/#frequently-asked-questions 에서 “Frequently Asked Question” 에 해결방법이 나와있다.

만약 GRCh38 데이터를 사용해야한다면 http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ 에서 [hg38.chrom.sizes](<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes>) 을 다운받아 이용하거나 wget [<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes>](<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes>) 로 직접 서버에 다운받을 수 있따.

 

결과

 

run_cicero 를 데이터에 맞게 조율하고 싶다면 다음과 같이 내포되어있는 functions 들 및 해당 parameters들을 따로 실행시킬 수 있다.

  • parameters (default)
    • window (500000): window used for calculating each individual model , 비교할 sites들의 가장 먼 길이라고 보면 된다.
    • sample_num (100): distance parameter을 계산할 때 사용되는 # of sample regions
    • distance_constraint (250000): meaningful cis-regulatory contacts 를 구할때의 distance
    • s (0.75): for human recommendation 0.75 by default in Cicero, which corresponds to the “tension globule” polymer model of DNA (Sanborn et al., 2015)
  • functions 들
    • estimate_distance_parameter. This function calculates the distance penalty parameter based on small random windows of the genome.
    • generate_cicero_models. This function uses the distance parameter determined above and uses graphical LASSO to calculate the co-accessibility scores of overlapping windows of the genome using a distance-based penalty.
    • assemble_connections. This function takes as input the output of generate_cicero_models and reconciles the overlapping models to create the final list of co-accessibility scores

 

Step 3. Gene Annotation

원하는 GTF files 은 https://uswest.ensembl.org/info/data/ftp/index.html 에서 원하는 species의 gene sets column 에서 GTF 파일을 다운받을 수 있다.

 

** for human GRCh38:

wget [<https://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>](<https://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>)

or download.file("<ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>", temp)

# Download the GTF associated with this data (mm9) from ensembl and load it
# using rtracklayer

# download and unzip
temp <- tempfile()
download.file("<ftp://ftp.ensembl.org/pub/release-65/gtf/mus_musculus/Mus_musculus.NCBIM37.65.gtf.gz>", temp)
# download.file("<ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>", temp) for human GRCh38
gene_anno <- rtracklayer::readGFF(temp)
unlink(temp)

# rename some columns to match requirements
gene_anno$chromosome <- paste0("chr", gene_anno$seqid)
gene_anno$gene <- gene_anno$gene_id
gene_anno$transcript <- gene_anno$transcript_id
gene_anno$symbol <- gene_anno$gene_name

 

Step 4. Visualization Cicero Connection

plot_connections(conns, "chr2", 9773451, 9848598,
                 gene_model = gene_anno,
                 coaccess_cutoff = .25,
                 connection_width = .5,
                 collapseTranscripts = "longest" )