Overview

For our ChIP-Seq analysis, we created a set of reference genomic regions from the Gencode (Harrow J 2012) and Ensembl (Andrew Yates 2016) databases.

The following reference genome assemblies were used:

The following software was used:

BEDOPS version 2.4.16 (Shane Neph 2012)
GenomicTools version 3.0.0 (Tsirigos A 2012)
BEDtools version 2.22.0 (Quinlan and Hall 2010)
HOMER version 4.6 (Heinz S 2010)

Find GENCODE TSS Regions

First, we need to download and install BEDOPS.

# downlaod the BEDOPS software
cd ~/software/bin
wget "https://github.com/bedops/bedops/releases/download/v2.4.16/bedops_linux_x86_64-v2.4.16.tar.bz2"
tar jxvf bedops_linux_x86_64-vx.y.z.tar.bz2
PATH=$PATH:~/software/bin

Next we will load the other required software packages and set some environment variables for reference. We will also need a text file with the sizes of each chromosome in our reference genome (hg19). We will use a regions size of 10000 base pairs from the TSS.

# load some programs to be used; this updates the PATH with the corresponding entries
module load homer/v4.6
module unload gcc
module load bedtools/2.22.0

# size of the TSS regions to use
region_size="10000"

# make the directory for the GENCODE data
data_dir="./data/Gencode"
mkdir -p "$data_dir"
cd "$data_dir"

# get the chrom sizes file and copy it here
chrom_file="/ifs/home/kellys04/projects/Bioinformatics/data/hg19_chrom.sizes.txt"
cp "$chrom_file" . 

# file prefix
gen_file="gencode.v19.annotation"

The GENCODE reference data set will be downloaded.

# Download source data ; http://www.gencodegenes.org/releases/19.html
[ ! -f ${gen_file}.gtf.gz ] && wget "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"

The GENCODE data is in GTF format; it will first be converted to BED format with BEDOPS. Then, GenomicTools will be used to obtain locations of the TSS regions. HOMER will be used to remove overlapping peaks and collapse the peak set.

# convert the GTF to BED witht he BEDOPS converter
[ ! -f ${gen_file}.bed ] && zcat ${gen_file}.gtf.gz | convert2bed --input=gtf - > ${gen_file}.bed

# use gtools to get the TSS regions
# VERSION: genomic-tools 3.0.0
[ ! -f ${gen_file}.tss.bed ] && cat ${gen_file}.bed | genomic_regions connect | genomic_regions pos -op 5p > ${gen_file}.tss.bed

# collapse the peaks with HOMER, preserve strand specific peaks
[ ! -f ${gen_file}.tss_collapse.txt ] && mergePeaks ${gen_file}.tss.bed -strand > ${gen_file}.tss_collapse.txt
[ ! -f ${gen_file}.tss_collapse.bed ] && pos2bed.pl ${gen_file}.tss_collapse.txt > ${gen_file}.tss_collapse.bed && rm -f ${gen_file}.tss_collapse.txt

# create 10,000kbp TSS regions with bedtools
if [ ! -f ${gen_file}.tss_collapse_${region_size}.bed ]; then
  cat ${gen_file}.tss_collapse.bed | bedtools slop -g "$(basename $chrom_file)" -b "$region_size" > ${gen_file}.tss_collapse_${region_size}.bed
fi

# collapse the new peak overlaps created by extending the regions
[ ! -f Gencode_TSS.txt ] && mergePeaks ${gen_file}.tss_collapse_${region_size}.bed > Gencode_TSS.txt
[ ! -f Gencode_TSS.bed ] && pos2bed.pl Gencode_TSS.txt > Gencode_TSS.bed && rm -f Gencode_TSS.txt

Check the number of peaks

gencode_bed="data/Gencode/Gencode_TSS.bed"
wc -l $gencode_bed

## 29978 data/Gencode/Gencode_TSS.bed

Find Ensembl TSS Regions

We can now repeat the same procedure with the Ensembl data set. We are starting a new session, so environment parameters need to be set again.

# load some programs to be used; this updates the PATH with the corresponding entries
module load homer/v4.6
module unload gcc
module load bedtools/2.22.0

# size of the TSS regions to use
region_size="10000"

data_dir="./data/Ensembl"
mkdir -p "$data_dir"
cd "$data_dir"

# get the chrom sizes file and copy it here
chrom_file="/ifs/home/kellys04/projects/Bioinformatics/data/hg19_chrom.sizes.txt"
cp "$chrom_file" .

Next we will download the Ensembl data set. Since it is in GTF format, it will need to be converted to BED format. Then we will use the same procedure as before to find the TSS regions and collapse duplicate peaks.

# file prefix
ens_file="genes.ensembl.GRCh37.82"
# download source data
if [ ! -f ${ens_file}.gtf.gz ]; then
  wget "ftp://ftp.ensembl.org/pub/grch37/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz" -O ${ens_file}.gtf.gz
fi

# GTF processing
# Need to filter out the GL, MT entries, need to add 'chr' to the start of each 1st entry
if [ ! -f ${ens_file}_noGLMT.gtf ]; then
  zcat ${ens_file}.gtf.gz | grep -Ev "^#|^GL|^M" > ${ens_file}_noGLMT.gtf
  # cat ${ens_file}_noGLMT.gtf | awk '{ OFS="\t"; $1 = "chr"$1; print }' > ${ens_file}_noGLMT_chr.gtf # this don't work?? 
  cat ${ens_file}_noGLMT.gtf | sed 's/^/chr/' > ${ens_file}_noGLMT_chr.gtf
  
fi

# convert to BED 
if [ ! -f ${ens_file}_noGLMT_chr.bed ]; then
  # cat ${ens_file}_noGLMT_chr.gtf | convert2bed --input=gtf - > ${ens_file}_noGLMT_chr.bed
  gtf2bed < ${ens_file}_noGLMT_chr.gtf > ${ens_file}_noGLMT_chr.bed
fi

# get the TSS regions
if [ ! -f ${ens_file}_noGLMT_chr.tss.bed ]; then
  cat ${ens_file}_noGLMT_chr.bed | genomic_regions connect | genomic_regions pos -op 5p > ${ens_file}_noGLMT_chr.tss.bed
fi

# collapse the regions
 if [ ! -f ${ens_file}_noGLMT_chr.tss_collapse.bed ]; then
  mergePeaks ${ens_file}_noGLMT_chr.tss.bed -strand > ${ens_file}_noGLMT_chr.tss_collapse.txt
  pos2bed.pl ${ens_file}_noGLMT_chr.tss_collapse.txt > ${ens_file}_noGLMT_chr.tss_collapse.bed && rm -f ${ens_file}_noGLMT_chr.tss_collapse.txt
fi

# create 10,000kbp TSS regions
if [ ! -f ${ens_file}_noGLMT_chr.tss_collapse_${region_size}.bed ]; then
  cat ${ens_file}_noGLMT_chr.tss_collapse.bed | bedtools slop -g "$(basename $chrom_file)" -b "$region_size" > ${ens_file}_noGLMT_chr.tss_collapse_${region_size}.bed
fi

# collapse the peaks again 
[ ! -f Ensembl_TSS.txt ] && mergePeaks ${ens_file}_noGLMT_chr.tss_collapse_${region_size}.bed > Ensembl_TSS.txt
[ ! -f Ensembl_TSS.bed ] && pos2bed.pl Ensembl_TSS.txt > Ensembl_TSS.bed && rm -f Ensembl_TSS.txt

Check the number of peaks

ensembl_bed="data/Ensembl/Ensembl_TSS.bed"
wc -l $ensembl_bed

## 29950 data/Ensembl/Ensembl_TSS.bed

Overlap GENCODE and Ensembl regions

We can now overlap the Ensembl and GENCODE TSS region data sets to find the number of regions they have in common. Ideally, there should be a very large overlap.

# load some programs to be used; this updates the PATH with the corresponding entries
module load homer/v4.6

# path to the regions
ens_file="Ensembl/Ensembl_TSS.bed"
gen_file="Gencode/Gencode_TSS.bed"

# set up a directory for the overlapping
merge_dir="merged"
mkdir -p "$merge_dir"
cd "$merge_dir"

# create symlinks to the files
ln -s "../$ens_file" "Ensembl.bed"
ln -s "../$gen_file" "Gencode.bed"

# overlap them with HOMER
mergePeaks *.bed -prefix mergepeaks -venn venn.txt

Now we can load the venn.txt file produced by HOMER into R and create a Venn diagram to visualize the overlap between the peak sets.

# ~~~~~ LOAD PACKAGES ~~~~~~~ #
suppressPackageStartupMessages(library('VennDiagram'))
suppressPackageStartupMessages(library('gridExtra'))
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # 

# custom function for creating the Venn diagram for two comparisons
pair_peak_overlap_Venn <- function(SampleID, venn_table_file){
    
    # ~~~~~ PARSE THE VENN TABLE ~~~~~~~ #
    # read in the venn text
    venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE,check.names = FALSE)
    # venn_table_df
    
    # get the venn categories
    venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 
    # cat("Venn categories are:\n"); venn_categories
    
    # venn_categories
    num_categories<-length(venn_categories)
    # cat("Num categories are:\n"); num_categories
    
    # make a summary table
    venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
    # cat("Venn summary table is categories are:\n"); venn_summary
    # venn_summary
    
    # write summary table
    # write.table(venn_summary,file = "venn_summary.tsv",quote = FALSE,row.names = FALSE)
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # 
    
    # ~~~~~ SET UP THE PLOT ~~~~~~~ #
    # get the areas for the venn; add up all the overlaps that contain the given category 
    
    if (num_categories == 2) {
        # PAIRWISE VENN
        # cat("CREATING PAIR-WISE VENN DIAGRAM\n")
        # area1
        area_n1<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")"),
                                       x = venn_summary$Name,perl = TRUE),][["Total"]])
        
        # area2
        area_n2<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")"),
                                       x = venn_summary$Name,perl = TRUE),][["Total"]])
        
        # n12
        area_n12<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")"),
                                        x = venn_summary$Name,perl = TRUE),][["Total"]])
        
        venn <-draw.pairwise.venn(area1=area_n1,
                                  area2=area_n2,
                                  cross.area=area_n12,
                                  category=gsub(pattern = ".bed",replacement = "",x = venn_categories),
                                  fill=c('red','blue'),
                                  alpha=0.3,
                                  # cat.dist = 0.1,
                                  cex=2,
                                  cat.cex = 2,
                                  margin = 0.1,
                                  ind = FALSE)
        
        # pdf(plot_filepath,width = 8,height = 8)
        # grid.arrange(gTree(children=venn), top=paste0(SampleID," Peak Overlap")) #, bottom="subtitle")
        grid.arrange(gTree(children=venn), top=textGrob(paste0(SampleID," Peak Overlap"), gp=gpar(cex=2), vjust=3))
        # top=textGrob("Total Data and Image", gp=gpar(cex=3), just="top") #  vjust=0.7
        # dev.off()
        
        
    } 
}

# the file produced by HOMER mergePeaks
venn_file <- "data/merged/venn.txt"

# create the plot
pair_peak_overlap_Venn("", venn_file)

For reference, this is our hg19 chromosome sizes file:

chrom_file="/ifs/home/kellys04/projects/Bioinformatics/data/hg19_chrom.sizes.txt"
cat $chrom_file

chrM    16571
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr19   59128983
chr20   63025520
chr21   48129895
chr22   51304566
chrX    155270560
chrY    59373566

Check the number of base pairs covered by our regions

gen_bed_file <- "/ifs/home/kellys04/projects/ChIpSeq_2016-03-10/project_notes/gencode_tss_regions/data/Gencode/Gencode_TSS.bed"

gen_df <-read.delim(gen_bed_file, sep = '\t', header = FALSE, comment.char = '#')

head(gen_df)

##     V1     V2     V3            V4 V5 V6
## 1 chr1   1868  80006 Merged-chr1-1  1  +
## 2 chr1  80049 183862 Merged-chr1-2  1  -
## 3 chr1 218653 277253 Merged-chr1-3  1  -
## 4 chr1 307719 352392 Merged-chr1-4  1  +
## 5 chr1 357639 389467 Merged-chr1-5  1  +
## 6 chr1 429364 480971 Merged-chr1-6  1  +

gen_df['Difference'] <- gen_df['V3'] - gen_df['V2']
total_bp <- sum(gen_df[['Difference']])
print(total_bp)

## [1] 1544625519

Session Information

system('uname -srv',intern=T)

## [1] "Linux 2.6.32-642.4.2.el6.x86_64 #1 SMP Tue Aug 23 19:58:13 UTC 2016"

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS release 6.8 (Final)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] gridExtra_2.2.1     VennDiagram_1.6.17  futile.logger_1.4.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6          digest_0.6.10        gtable_0.2.0        
##  [4] futile.options_1.0.0 formatR_1.4          magrittr_1.5        
##  [7] evaluate_0.9         stringi_1.1.1        rmarkdown_1.0       
## [10] lambda.r_1.1.9       tools_3.3.1          stringr_1.0.0       
## [13] yaml_2.1.13          htmltools_0.3.5      knitr_1.13

References

Andrew Yates, et al. 2016. “Ensembl 2016.” Nucleic Acids Res 44. doi:10.1093/nar/gkv1157.

Harrow J, Frankish et al. 2012. “GENCODE: The Reference Human Genome Annotation for the ENCODE Project.” Genome Research 22 (9): 1760–74. doi:10.1101/gr.135350.111.

Heinz S, Spann N, Benner C. 2010. “Simple Combinations of Lineage-Determining Transcription Factors Prime Cis-Regulatory Elements Required for Macrophage and B Cell Identities.” Mol Cell 38 (4): 576–89. doi:10.1016/j.molcel.2010.05.004.

Quinlan, Aaron R., and Ira M. Hall. 2010. “BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features.” Bioinformatics 26 (6): 841–42. doi:10.1093/bioinformatics/btq033.

Shane Neph, Alex P. Reynolds, M. Scott Kuehn. 2012. “BEDOPS: High-Performance Genomic Feature Operations.” Bioinformatics 28 (14): 1919–20. doi:10.1093/bioinformatics/bts277.

Tsirigos A, Bilal E, Haiminen N. 2012. “GenomicTools: A Computational Platform for Developing High-Throughput Analytics in Genomics.” Bioinformatics. doi:10.1093/bioinformatics/btr646.

GENCODE & Ensembl TSS Region Processing