일 | 월 | 화 | 수 | 목 | 금 | 토 |
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | |||
5 | 6 | 7 | 8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 | 16 | 17 | 18 |
19 | 20 | 21 | 22 | 23 | 24 | 25 |
26 | 27 | 28 | 29 | 30 | 31 |
- CSS
- drug development
- github
- single cell rnaseq
- julia
- single cell
- pandas
- js
- DataFrame
- MACS2
- Git
- scRNAseq analysis
- ChIPseq
- ngs
- Bioinformatics
- python matplotlib
- 비타민 C
- EdgeR
- drug muggers
- scRNAseq
- Batch effect
- matplotlib
- PYTHON
- single cell analysis
- cellranger
- 싱글셀 분석
- HTML
- CUT&RUN
- javascript
- CUTandRUN
- Today
- Total
바이오 대표
[NGS scATACseq] scATACseq, Cicero를 이용해서 cis-regulatory gene network 분석하기 본문
[NGS scATACseq] scATACseq, Cicero를 이용해서 cis-regulatory gene network 분석하기
바이오 대표 2022. 12. 28. 04:01Cicero
Cicero single-cell chromatin accessibility data 를 이용 → co-accessibility examine → cis-regulatory network를 구성하고 분석하기 위한 툴이다. 나는 scATACseq을 분석하기 위해 해당 툴을 이용했다.
How? chromatin accessibility data → physically proximity 한 regions of genome 구하기→ 추정되는 enhancer-promoter pair 찾기 → cis-regulatory network 구성
Step 0. Create an input CDS
Cicero는 CDS (cell_data_set) class object 를 이용한다.
10x scATACseq 데이터를 사용했다면 filtered_peak_bc_matrix 폴더에 있는 데이터를 input CDS형태로 변형시켜줄 수 있다.
# read in matrix data using the Matrix package
indata <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
# binarize the matrix
indata@x[indata@x > 0] <- 1
# format cell info
cellinfo <- read.table("filtered_peak_bc_matrix/barcodes.tsv")
row.names(cellinfo) <- cellinfo$V1
names(cellinfo) <- "cells"
# format peak info
peakinfo <- read.table("filtered_peak_bc_matrix/peaks.bed")
names(peakinfo) <- c("chr", "bp1", "bp2")
peakinfo$site_name <- paste(peakinfo$chr, peakinfo$bp1, peakinfo$bp2, sep="_")
row.names(peakinfo) <- peakinfo$site_name
row.names(indata) <- row.names(peakinfo)
colnames(indata) <- row.names(cellinfo)
# make CDS
input_cds <- suppressWarnings(new_cell_data_set(indata,
cell_metadata = cellinfo,
gene_metadata = peakinfo))
input_cds <- monocle3::detect_genes(input_cds)
#Ensure there are no peaks included with zero reads
input_cds <- input_cds[Matrix::rowSums(exprs(input_cds)) != 0,]
Step 1. Create a Cicero CDS
single cell accessibility data (scATACseq) 은 sparse (0이 엄청 많은) 데이터 이기 때문에, 정확한 co-accessibility scores 를 얻기 위해서, 비슷한 cells들을 합쳐서 dense count data 를 만들어야한다. Cicero 는 이를 KNN을 이용해서 overlapping sets of cells 를 만들어 사용한다. ← based on UMAP 이지만 어떤 dimensionality reduction 방법을 사용해도 상관 없다.
make_cicero_cds(): input CDS object + reduced dimension coordinate map → create aggregated CDS object
# Create a Cicero CDS
set.seed(2017)
input_cds <- detect_genes(input_cds)
input_cds <- estimate_size_factors(input_cds)
input_cds <- preprocess_cds(input_cds, method = "LSI")
input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP',
preprocess_method = "LSI")
# Monacle로 UMAP 그리기
plot_cells(input_cds)
# UMAP coordinates 얻고, cicero CDS 만들기
umap_coords <- reducedDims(input_cds)$UMAP
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = umap_coords)
## input_cds 예시
# class: cell_data_set
# dim: 93518 18157
# metadata(2): cds_version citations
# assays(1): counts
# rownames(93518): chr1_9788_10684 chr1_180636_181315 ... KI270713.1_32971_33663 KI270713.1_34301_35077
# rowData names(5): chr bp1 bp2 site_name num_cells_expressed
# colnames(18157): AAACGAAAGACCGCAA-1 AAACGAAAGAGGCAGG-3 ... TTTGTGTTCTAAACGC-4 TTTGTGTTCTATATCC-1
# colData names(3): cells Size_Factor num_genes_expressed
# reducedDimNames(2): LSI UMAP
# mainExpName: NULL
# altExpNames(0):
- reduced_coordinates: reduced dimension coordinate map (umap_coords을 확인해보자)
- data.frame or matrix 형태
- Monocle 3 과 연동되어 있어서, Monocle’s plotting function도 사용 가능하다.
Step 2. Run Cicero
run_cicero to estimate co-accessibility sites
data("human.hg19.genome")
# use only a small part of the genome for speed
sample_genome <- subset(human.hg19.genome, V1 == "chr2")
sample_genome$V2[1] <- 10000000
## Usually use the whole human.hg19.genome ##
## Usually run with sample_num = 100 ##
conns <- run_cicero(cicero_cds, sample_genome, sample_num = 2)
head(conns)
** 다른 reference 데이터를 이용하고 싶다면 https://cole-trapnell-lab.github.io/cicero-release/docs_m3/#frequently-asked-questions 에서 “Frequently Asked Question” 에 해결방법이 나와있다.
만약 GRCh38 데이터를 사용해야한다면 http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ 에서 [hg38.chrom.sizes](<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes>) 을 다운받아 이용하거나 wget [<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes>](<http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes>) 로 직접 서버에 다운받을 수 있따.
결과
run_cicero 를 데이터에 맞게 조율하고 싶다면 다음과 같이 내포되어있는 functions 들 및 해당 parameters들을 따로 실행시킬 수 있다.
- parameters (default)
- window (500000): window used for calculating each individual model , 비교할 sites들의 가장 먼 길이라고 보면 된다.
- sample_num (100): distance parameter을 계산할 때 사용되는 # of sample regions
- distance_constraint (250000): meaningful cis-regulatory contacts 를 구할때의 distance
- s (0.75): for human recommendation 0.75 by default in Cicero, which corresponds to the “tension globule” polymer model of DNA (Sanborn et al., 2015)
- functions 들
- estimate_distance_parameter. This function calculates the distance penalty parameter based on small random windows of the genome.
- generate_cicero_models. This function uses the distance parameter determined above and uses graphical LASSO to calculate the co-accessibility scores of overlapping windows of the genome using a distance-based penalty.
- assemble_connections. This function takes as input the output of generate_cicero_models and reconciles the overlapping models to create the final list of co-accessibility scores
Step 3. Gene Annotation
원하는 GTF files 은 https://uswest.ensembl.org/info/data/ftp/index.html 에서 원하는 species의 gene sets column 에서 GTF 파일을 다운받을 수 있다.
** for human GRCh38:
wget [<https://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>](<https://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>)
or download.file("<ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>", temp)
# Download the GTF associated with this data (mm9) from ensembl and load it
# using rtracklayer
# download and unzip
temp <- tempfile()
download.file("<ftp://ftp.ensembl.org/pub/release-65/gtf/mus_musculus/Mus_musculus.NCBIM37.65.gtf.gz>", temp)
# download.file("<ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz>", temp) for human GRCh38
gene_anno <- rtracklayer::readGFF(temp)
unlink(temp)
# rename some columns to match requirements
gene_anno$chromosome <- paste0("chr", gene_anno$seqid)
gene_anno$gene <- gene_anno$gene_id
gene_anno$transcript <- gene_anno$transcript_id
gene_anno$symbol <- gene_anno$gene_name
Step 4. Visualization Cicero Connection
plot_connections(conns, "chr2", 9773451, 9848598,
gene_model = gene_anno,
coaccess_cutoff = .25,
connection_width = .5,
collapseTranscripts = "longest" )
'Bioinformatics > Tools' 카테고리의 다른 글
[ Cut & Tag / Cut & Run ] Cut & Tag 투토리얼 (0) | 2023.03.18 |
---|---|
[ NSG QC / trimming ] TrimGalore (0) | 2023.02.14 |
[ 싱글셀 분석 ] cellranger-atac aggr (0) | 2023.02.03 |
[NGS scRNAseq] cellranger count 의 output 파일, summary.html 해석 (1) | 2022.12.19 |
[NGS scRNAseq] Chromium 10x Illumina의 기본이해(workflow)와 Cell ranger count (0) | 2022.12.18 |