일 | 월 | 화 | 수 | 목 | 금 | 토 |
---|---|---|---|---|---|---|
1 | ||||||
2 | 3 | 4 | 5 | 6 | 7 | 8 |
9 | 10 | 11 | 12 | 13 | 14 | 15 |
16 | 17 | 18 | 19 | 20 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 28 |
- Bioinformatics
- drug development
- ChIPseq
- PYTHON
- github
- matplotlib
- single cell
- CSS
- DataFrame
- javascript
- CUT&RUN
- CUTandRUN
- cellranger
- EdgeR
- js
- Git
- pandas
- python matplotlib
- single cell rnaseq
- MACS2
- HTML
- 비타민 C
- drug muggers
- scRNAseq
- single cell analysis
- ngs
- julia
- 싱글셀 분석
- scRNAseq analysis
- Batch effect
- Today
- Total
바이오 대표
[ Cut & Tag / Cut & Run ] Cut & Tag 투토리얼 본문
해당 글을 https://yezhengstat.github.io/CUTTag_tutorial/#Cite_this_tutorial 을 참고하였습니다.
CUT & TAG Data Processing and Analysis Tutorial
1. Introduction
Eukaryotic nucleus DNA에서 일어나는 모든 dynamic process 는 chromatin landscape/지형 을 따른다고 보면된다. Chromatin landscape 는 nucleosomes, their modification, TF and chromatin-associated complexed 등으로 이루어져있다. 이렇게 다는 chromatin features 을 이용하면 다른 cell type간, development stages 에서의 activating / silencing sites 나 chromatin domain 을 표시 할 수 있다.
Cut & Tag 방법
- Permeabilized cells or nuclei (antibiody 가 통과할 수 있도록 한다.)
- Incubated with antibody → bind to specific target.
- pA-Tn5 + adapter 가 tethered (docking) to antibody-bound sites.
- 마그네슘 아이언을 이용해서, transposome 을 activate 하고 transposome 은 adapter은 주변DNA에 integrate.
- extract DNA, Amplify (PCR).
장점: Antibody-tethered Tn5-based methods achieve high sensitivity owing to stringent washing of samples after pA-Tn5 tethering and the high efficiency of adaptor integration. (Chipseq 의 signal-to-noise improve 가능)
→ 90개 sample 까지 가능
Tutorial
Human Lymphoma K562 cell cline 에서의 histone modification profiling 을 진행한다.
** Any chromatin protein (TF, RNA polymerase II, epitope-tagged proteins) 관련 분석 다 가능하다.
# 필요한 packages
library(dplyr)
library(stringr)
library(ggplot2)
library(viridis) # https://m.blog.naver.com/regenesis90/222201170325
library(GenomicRanges)
library(chromVAR) ## For FRiP analysis and differential analysis
library(DESeq2) ## For differential analysis section
library(ggpubr) ## For customizing figures
library(corrplot) ## For correlation plot
2. Data Pre-processing
QC
- FastQC
- “Per base sequence content” - CUT&TAG 실험을 진행할때, 초반 10bp 는 톱니같은 는 discordant sequence를 포함할 수 있다. 이는 alignment 나 peakcalling 할때 문제가 되지 않는다. (해당 tutorial Bowtie2 parameter 을 이용하면 정확한 mapping 정보를 제공하니, trimming 하는 것을 추천하지 않는다. )
(Merge technical replicates/lanes (if needed))
3. Alignment
하나의 HiSeq 2500 flowcell 에 90 pooled samples 까지 single-index 25x25 PE Illumina sequencing (25 bp for each end?)이 가능하다. 각 샘플마다 unique PCR primer barcode 가 존재한다. 각 Library 를 ~5million paired-end-reads 를 갖을 수 있도록 맞추면 high quality profiling 을 갖을 수 있다.
Bowtie2 alignment
https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-aligner
##== linux command ==##
cores=8
ref="/path/to/bowtie2Index/hg38"
## Build the bowtie2 reference genome index if needed:
## bowtie2-build path/to/hg38/fasta/hg38.fa /path/to/bowtie2Index/hg38©
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${ref} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S ${projPath}/alignment/sam/${histName}_bowtie2.sam &> ${projPath}/alignment/sam/bowtie2_summary/${histName}_bowtie2.txt
-end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 : 10-700bp 길이의 inserts mapping.
만약 25x25 PE sequencing 을 하면 trimming 을 안해도 되지만, longer sequencing 이라면, Cutadapt 혹은 -local --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 를 이용해서, 3’ end of reads 에 남아있는 adapter을 제거해줄 필요가 있다.
(-local 을 사용하면 Bowtie2 가 maximize alignment score을 위해서 끝 부분을 좀 자를 수도 있다.)
Alignment to spike-in genome for spike-in calibration
E.coli DNA is carried along with bacterially-produced pA-Tn5 protein은 E.coli DNA 를 포함하고 있고 during the reaction, random tagmentation한다. … 하지만 결과물이 좋지않아 아직까지는 많이들 안쓴다고한다.
Alignment summary bowite2 manual
2984630 reads; of these: # sequencing depth (total number of paired reads)
2984630 (100.00%) were paired; of these:
125110 (4.19%) aligned concordantly 0 times
2360430 (79.09%) aligned concordantly exactly 1 time
499090 (16.72%) aligned concordantly >1 times # 2360430 + 499090 = successfully mapped reads
95.81% overall alignment rate
- 80%, 1 million mapped fragments gives robust profiles for a histone modification in the human genome
Remove duplicates
pA-Tn5 가 adapter을 DNA 에 삽입 할때, 정확히 어디에 넣을지는 accessibility of surrounding DNA에 의 영향을 받는다. 때문에 정확히 같은 position에 삽입되어 똑같은 fragments 가 생기고 ‘duplicates’라고 한다. (이는 PCR 에서 생긴 duplicates와는 다른 것이다.) 다만 high-quality CUT&RUN dataset에는 duplication rate 가 낮고, apprent ‘duplicate’ fragments가 진짜 fragments이기때문에 잘 제거하지 않는다. 하지만 데이터가 엄청 작거나, PCR 로 의심되는 duplication은 제거가 필요하다.
# example markedDup.bam
$samtools view ./A1_H3K27_IRI_markedDup.bam | head -1
K00124:539:H7NVGBBXX:7:1110:25083:12314 99 scaffold_1 13 11 1S94M1S = 13 96 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCGTTCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ MD:Z:17T0T75 PG:Z:MarkDuplicates XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:172 XS:i:174 YS:i:173 YT:Z:CP
**Unique fragment numbe = MappedFragNum_hg38 * (1-DuplicationRate/100)
Assess mapped fragment size distribution
CUt & RUN tag 실험을 할 때 fragments size 는 nucleosome-sized fragments (~180bp 이거나 x2,x3) 이거나, shorter fragments (factor bound sites) 이다.
**aligned sam file 의 9th column 에서 fragment length 를 확인 할 수 있다.
## Extract the 9th column from the alignment sam file which is the fragment length
samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam | awk -F'\\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print abs($9)}' | sort | uniq -c | awk -v OFS="\\t" '{print $2, $1/2}' >$projPath/alignment/sam/fragmentLen/${histName}_fragmentLen.txt
4. Alignment results Filtering and file format conversion
Mapping Quality
samtools view -q minQualityScore 을 이용해서 minQualityScore 보다 아래이면 제거도 가능하다. 이때, QualityScore 은 MAPQ(x) ****= -10 * log_10 (p) 이다. (p: probability that x is mapped wrongly)
File format conversion
## Filter and keep the mapped read pairs
**samtools view -bS** -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam >$projPath/alignment/bam/${histName}_bowtie2.mapped.bam
# -F 0x04: exclude 0x04 flag = reads that are unmapped for the output
## Convert into bed file format
**bedtools bamtobed** -i $projPath/alignment/bam/${histName}_bowtie2.mapped.bam -bedpe > $projPath/alignment/bed/${histName}_bowtie2.bed
# example bed line
# scaffold_16 71719668 71719805 scaffold_16 71719676 71719825 K00124:539:H7NVGBBXX:7:1101:1225:64966 44 + -
## Keep the read pairs on the same chromosome and fragment length < 1000bp.
awk '$1==$4 && $6-$2 < 1000 {print $0}' $projPath/alignment/bed/${histName}_bowtie2.bed > $projPath/alignment/bed/${histName}_bowtie2.clean.bed
# awk: for text processing and pattern matching
# $1==$4: 1st field == 4th field 인지 확인 (same chromosome)
# $6 position = second pair's start position
# &&(and)
# If both conditions are true, the awk script outputs the entire record ($0), which consists of all the fields in the current line of the input file.
## Only extract the fragment related columns
cut -f 1,2,6 $projPath/alignment/bed/${histName}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n > $projPath/alignment/bed/${histName}_bowtie2.fragments.bed
# first field (-k1,1) in ascending order (-k2,2n) and then based on the second field (-k2,2n) and third field (-k3,3n) in numeric order
Access replicate reproducibility
Reproducibility between replicates를 공부하기 위해서, genome 을 500bp bins 으로 나누고, Pearson correlation of the log2 transfored read counts 를 각 replicate 마다 계산하여 비교해볼수 있다.
## We use the midpoint of each fragment to infer which 500bp bins this fragment belongs to.
binLen=500
awk -v w=$binLen '{print $1, int(($2 + $3)/(2*w))*w + w/2}' $projPath/alignment/bed/${histName}_bowtie2.fragments.bed | sort -k1,1V -k2,2n | uniq -c | awk -v OFS="\\t" '{print $2, $3, $1}' | sort -k1,1V -k2,2n >$projPath/alignment/bed/${histName}_bowtie2.fragmentsCount.bin$binLen.bed
6. Peak calling
SEACR
Sparse Enrichment Analysis for Cut & Run (SEART) 툴을 이용해서 chromatin profiling data 에서 call peaks 과, enriched regions을 알아낼 수 있다. SEACR 을 이용하면 2가지 peak calling 에 사용할 수 있다. (1) narrow peaks - for factor binding sites (2) broad domains - for histone modifications calling
- Input: bedGraph files (from paired-end sequencing). IgG bedGraph file
- Do: IgG control 데이터와 비교하여 high signal intensity peaks를 peak 한다.
SEAR parameter:
- "non" if you normalized fragments with E.coli read count (Spike-in)
- "norm" for normalization
seacr="SEACR/SEACR_1.3.sh"
histControl=$2
mkdir -p $projPath/peakCalling/SEACR
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph \\
$projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph \\
non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks
Number of peaks called
각 replicate 마다 peak 개수를 정리하여 확인하면 데이터를 이해하는데 좋다. → Reproducibility across biological replicates (rep1 , rep2 overlapping peaks) 를 통해서 데이터가 얼마나 confident 한지 확인해 볼수 있다.
Fragment proportion in Peaks regions (FRiPs)
FRiPs 는 signal-to-noise 를 measure 할 수 있는 지수이다. 최소 20은 넘겨야 한다.
Visualization of peak number, peak width, peak reproducibility, and FRiPs
'Bioinformatics > Tools' 카테고리의 다른 글
[SEACR 논문] “Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling” (0) | 2023.04.06 |
---|---|
[ 싱글셀 분석 ] 10x Cell ranger 정복하기 1 (1) | 2023.03.27 |
[ NSG QC / trimming ] TrimGalore (0) | 2023.02.14 |
[ 싱글셀 분석 ] cellranger-atac aggr (0) | 2023.02.03 |
[NGS scATACseq] scATACseq, Cicero를 이용해서 cis-regulatory gene network 분석하기 (0) | 2022.12.28 |