바이오 대표

[ Cut & Tag / Cut & Run ] Cut & Tag 투토리얼 본문

Bioinformatics/Tools

[ Cut & Tag / Cut & Run ] Cut & Tag 투토리얼

바이오 대표 2023. 3. 18. 05:05

해당 글을 https://yezhengstat.github.io/CUTTag_tutorial/#Cite_this_tutorial 을 참고하였습니다. 

 

CUT & TAG Data Processing and Analysis Tutorial

1. Introduction

Eukaryotic nucleus DNA에서 일어나는 모든 dynamic process 는 chromatin landscape/지형 을 따른다고 보면된다. Chromatin landscape 는 nucleosomes, their modification, TF and chromatin-associated complexed 등으로 이루어져있다. 이렇게 다는 chromatin features 을 이용하면 다른 cell type간, development stages 에서의 activating / silencing sites 나 chromatin domain 을 표시 할 수 있다.

Chromatin domain

 

Cut & Tag 방법

  1. Permeabilized cells or nuclei (antibiody 가 통과할 수 있도록 한다.)
  2. Incubated with antibody → bind to specific target.
  3. pA-Tn5 + adapter 가 tethered (docking) to antibody-bound sites.
  4. 마그네슘 아이언을 이용해서, transposome 을 activate 하고 transposome 은 adapter은 주변DNA에 integrate.
  5. extract DNA, Amplify (PCR).

장점: Antibody-tethered Tn5-based methods achieve high sensitivity owing to stringent washing of samples after pA-Tn5 tethering and the high efficiency of adaptor integration. (Chipseq 의 signal-to-noise improve 가능)

→ 90개 sample 까지 가능

 

 

 

 

 

 

Tutorial

Human Lymphoma K562 cell cline 에서의 histone modification profiling 을 진행한다.

** Any chromatin protein (TF, RNA polymerase II, epitope-tagged proteins) 관련 분석 다 가능하다.

# 필요한 packages 
library(dplyr)
library(stringr)
library(ggplot2)
library(viridis) # https://m.blog.naver.com/regenesis90/222201170325
library(GenomicRanges)
library(chromVAR) ## For FRiP analysis and differential analysis
library(DESeq2) ## For differential analysis section
library(ggpubr) ## For customizing figures
library(corrplot) ## For correlation plot

 

 

 

2. Data Pre-processing

QC

  • FastQC
  • “Per base sequence content” - CUT&TAG 실험을 진행할때, 초반 10bp 는 톱니같은 는 discordant sequence를 포함할 수 있다. 이는 alignment 나 peakcalling 할때 문제가 되지 않는다. (해당 tutorial Bowtie2 parameter 을 이용하면 정확한 mapping 정보를 제공하니, trimming 하는 것을 추천하지 않는다. )

(Merge technical replicates/lanes (if needed))

 

3. Alignment

하나의 HiSeq 2500 flowcell 에 90 pooled samples 까지 single-index 25x25 PE Illumina sequencing (25 bp for each end?)이 가능하다. 각 샘플마다 unique PCR primer barcode 가 존재한다. 각 Library 를 ~5million paired-end-reads 를 갖을 수 있도록 맞추면 high quality profiling 을 갖을 수 있다.

 

Bowtie2 alignment

 

 

https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-aligner

##== linux command ==##
cores=8
ref="/path/to/bowtie2Index/hg38"

## Build the bowtie2 reference genome index if needed:
## bowtie2-build path/to/hg38/fasta/hg38.fa /path/to/bowtie2Index/hg38©

bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${ref} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S ${projPath}/alignment/sam/${histName}_bowtie2.sam &> ${projPath}/alignment/sam/bowtie2_summary/${histName}_bowtie2.txt

-end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 : 10-700bp 길이의 inserts mapping.

만약 25x25 PE sequencing 을 하면 trimming 을 안해도 되지만, longer sequencing 이라면, Cutadapt 혹은 -local --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 를 이용해서, 3’ end of reads 에 남아있는 adapter을 제거해줄 필요가 있다.

(-local 을 사용하면 Bowtie2 가 maximize alignment score을 위해서 끝 부분을 좀 자를 수도 있다.)

 

Alignment to spike-in genome for spike-in calibration

E.coli DNA is carried along with bacterially-produced pA-Tn5 protein은 E.coli DNA 를 포함하고 있고 during the reaction, random tagmentation한다. … 하지만 결과물이 좋지않아 아직까지는 많이들 안쓴다고한다.

 

 

Alignment summary bowite2 manual

2984630 reads; of these: # sequencing depth (total number of paired reads)
  2984630 (100.00%) were paired; of these:
    125110 (4.19%) aligned concordantly 0 times
    2360430 (79.09%) aligned concordantly exactly 1 time
    499090 (16.72%) aligned concordantly >1 times # 2360430 + 499090 = successfully mapped reads
95.81% overall alignment rate
  • 80%, 1 million mapped fragments gives robust profiles for a histone modification in the human genome

 

 

 

Remove duplicates

pA-Tn5 가 adapter을 DNA 에 삽입 할때, 정확히 어디에 넣을지는 accessibility of surrounding DNA에 의 영향을 받는다. 때문에 정확히 같은 position에 삽입되어 똑같은 fragments 가 생기고 ‘duplicates’라고 한다. (이는 PCR 에서 생긴 duplicates와는 다른 것이다.) 다만 high-quality CUT&RUN dataset에는 duplication rate 가 낮고, apprent ‘duplicate’ fragments가 진짜 fragments이기때문에 잘 제거하지 않는다. 하지만 데이터가 엄청 작거나, PCR 로 의심되는 duplication은 제거가 필요하다.

# example markedDup.bam
$samtools view ./A1_H3K27_IRI_markedDup.bam | head -1
K00124:539:H7NVGBBXX:7:1110:25083:12314 99      scaffold_1      13      11      1S94M1S =       13      96      AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCGTTCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT        AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ MD:Z:17T0T75    PG:Z:MarkDuplicates     XG:i:0  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:172        XS:i:174        YS:i:173        YT:Z:CP

**Unique fragment numbe = MappedFragNum_hg38 * (1-DuplicationRate/100)

 

Assess mapped fragment size distribution

CUt & RUN tag 실험을 할 때 fragments size 는 nucleosome-sized fragments (~180bp 이거나 x2,x3) 이거나, shorter fragments (factor bound sites) 이다.

**aligned sam file 의 9th column 에서 fragment length 를 확인 할 수 있다.

## Extract the 9th column from the alignment sam file which is the fragment length
samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam | awk -F'\\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print abs($9)}' | sort | uniq -c | awk -v OFS="\\t" '{print $2, $1/2}' >$projPath/alignment/sam/fragmentLen/${histName}_fragmentLen.txt

 

 

4. Alignment results Filtering and file format conversion

Mapping Quality

samtools view -q minQualityScore 을 이용해서 minQualityScore 보다 아래이면 제거도 가능하다. 이때, QualityScore 은 MAPQ(x) ****= -10 * log_10 (p) 이다. (p: probability that x is mapped wrongly)

 

File format conversion

## Filter and keep the mapped read pairs
**samtools view -bS** -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam >$projPath/alignment/bam/${histName}_bowtie2.mapped.bam
	# -F 0x04: exclude 0x04 flag = reads that are unmapped for the output 

## Convert into bed file format
**bedtools bamtobed** -i $projPath/alignment/bam/${histName}_bowtie2.mapped.bam -bedpe > $projPath/alignment/bed/${histName}_bowtie2.bed
	# example bed line 
	# scaffold_16     71719668        71719805        scaffold_16     71719676        71719825        K00124:539:H7NVGBBXX:7:1101:1225:64966  44      +       -

## Keep the read pairs on the same chromosome and fragment length < 1000bp.
awk '$1==$4 && $6-$2 < 1000 {print $0}' $projPath/alignment/bed/${histName}_bowtie2.bed > $projPath/alignment/bed/${histName}_bowtie2.clean.bed
	# awk: for text processing and pattern matching
		# $1==$4: 1st field == 4th field 인지 확인 (same chromosome)
		# $6 position = second pair's start position 
		# &&(and)
	# If both conditions are true, the awk script outputs the entire record ($0), which consists of all the fields in the current line of the input file.

## Only extract the fragment related columns
cut -f 1,2,6 $projPath/alignment/bed/${histName}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n  > $projPath/alignment/bed/${histName}_bowtie2.fragments.bed
	# first field (-k1,1) in ascending order (-k2,2n) and then based on the second field (-k2,2n) and third field (-k3,3n) in numeric order

 

Access replicate reproducibility

Reproducibility between replicates를 공부하기 위해서, genome 을 500bp bins 으로 나누고, Pearson correlation of the log2 transfored read counts 를 각 replicate 마다 계산하여 비교해볼수 있다.

## We use the midpoint of each fragment to infer which 500bp bins this fragment belongs to.
binLen=500
awk -v w=$binLen '{print $1, int(($2 + $3)/(2*w))*w + w/2}' $projPath/alignment/bed/${histName}_bowtie2.fragments.bed | sort -k1,1V -k2,2n | uniq -c | awk -v OFS="\\t" '{print $2, $3, $1}' |  sort -k1,1V -k2,2n  >$projPath/alignment/bed/${histName}_bowtie2.fragmentsCount.bin$binLen.bed

 

 

6. Peak calling

SEACR

Sparse Enrichment Analysis for Cut & Run (SEART) 툴을 이용해서 chromatin profiling data 에서 call peaks 과, enriched regions을 알아낼 수 있다. SEACR 을 이용하면 2가지 peak calling 에 사용할 수 있다. (1) narrow peaks - for factor binding sites (2) broad domains - for histone modifications calling

- Input: bedGraph files (from paired-end sequencing). IgG bedGraph file

- Do: IgG control 데이터와 비교하여 high signal intensity peaks를 peak 한다.

 

SEAR parameter: 

- "non" if you normalized fragments with E.coli read count (Spike-in)

- "norm" for normalization

seacr="SEACR/SEACR_1.3.sh"
histControl=$2
mkdir -p $projPath/peakCalling/SEACR

bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph \\
     $projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph \\
     non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks

bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks

 

Number of peaks called

각 replicate 마다 peak 개수를 정리하여 확인하면 데이터를 이해하는데 좋다. → Reproducibility across biological replicates (rep1 , rep2 overlapping peaks) 를 통해서 데이터가 얼마나 confident 한지 확인해 볼수 있다.

 

Fragment proportion in Peaks regions (FRiPs)

FRiPs 는 signal-to-noise 를 measure 할 수 있는 지수이다. 최소 20은 넘겨야 한다.

 

Visualization of peak number, peak width, peak reproducibility, and FRiPs