This example will walk through the analysis of a ChIP-Seq experiment conducted on human samples before and after treatment. Samples were immunoprecipitated for five histone marks (only the results of H3K27AC are shown here). We will compare the differential peak data from DiffBind with gene expression data from a microarray analysis of the same samples. The goal of this analysis is to determine if any patterns can be found between genes that were differentially bound and the changes in expression of those genes. To accomplish this, we will first subset the DiffBind data set for peaks within 3000 base pairs of the gene start site. The table of differential binding values per sample will be merged with the gene expression values. Gene expression values will then be grouped into genes that were upregulate (UP
) and genes that were down regulated (DOWN
). In order to visualize the data, split beanplots will be created for each sample; one side of the split will contain the DiffBind values for the UP
genes, while the other side contains the DiffBind values for the DOWN
genes. Each sample will also be tested for significance of difference between the UP
and DOWN
groupings, and significance will be marked on the plot.
First, required packages will be loaded.
# ~~~~~ LOAD PACKAGES ~~~~~~~~ #
# install.packages(c("devtools"))
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase","preprocessCore"))
library("data.table")
library("beanplot")
library("ggplot2")
library("reshape2")
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
library("preprocessCore")
Due to the complexity of this analysis pipeline and the need for repeated analysis with different parameters, custom functions will be created.
Our pipeline will need to match up information on samples between two data sets. Sample ID’s are saved in the column headings of both data sets, but include multiple entries for each sample set, and not all samples are present in both sets. The get_replacements
function will use a supplied regular expression to return a character vector of unique matching sample ID’s in the supplied column names.
# ~~~~~ CUSTOM FUNCTIONS ~~~~~~~~ #
get_replacements <- function(source_data, uniq=TRUE, pattern = "",
remove_entry = NULL, ...){
# return vector of first matches to regex
matches <- gsub(x = source_data,
pattern = pattern,
replacement = "\\1")
# return only unique entries
if(uniq) matches <- unique(matches)
# remove an entry from the output
if ( ! is.null(remove_entry)){
if(any(grepl(pattern = remove_entry,x = matches))){
matches <- matches[grep(pattern = remove_entry, x = matches,
value = FALSE, invert = TRUE)]
}
}
return(matches)
}
The multi_grep
function will be used to search for multiple patterns in a character vector by passing a vector of patterns to be matched.
multi_grep <- function(source_data, patterns){
# find multiple patterns in a char vector
# make a regex to search with
if (length(patterns) > 1){
search_pattern <- paste(patterns,
collapse = "|")
} else {
search_pattern <- patterns
}
matches <- grep(x = source_data,
pattern = search_pattern,
value = TRUE)
return(matches)
}
The sampleID_intersects
function will be used to find the set of Sample ID’s that are present in both the DiffBind and gene expression data sets.
sampleID_intersects <- function(IDs1, pattern1 = "",
IDs2, pattern2 = "",
removeID2 = NULL ){
# get the intersect between sets of ID's after gsub pattern replacements
ID1_match <- get_replacements(source_data = IDs1, pattern = pattern1)
ID2_intersect_ID1 <- multi_grep(patterns = ID1_match,
source_data = IDs2)
ID2_match <- get_replacements(source_data = ID2_intersect_ID1,
pattern = pattern2, remove_entry = removeID2)
return(ID2_match)
}
The quantile_normalize_df
function will be uses to transform a data frame by quantile normalizing its values.
quantile_normalize_df <- function(df){
# quantile normalization of a dataframe
df <- setnames(x = as.data.frame(normalize.quantiles(as.matrix(df))),
colnames(df))
return(df)
}
The diff_subset
function will subset the DiffBind data frame for differential peak score entries per Sample that are greater than or equal to 1.
diff_subset <- function(df, patientID_matches){
# subset for DiffPeaks >= 1 per sample
for(i in seq_along(patientID_matches)){
tmp_ID <- patientID_matches[i]
# print(tmp_ID %in% colnames(diffbind_df_intersect_min))
# get the diffBind columns with the patient's ID
tmp_colnames <- grep(pattern = tmp_ID,x = colnames(df),value = TRUE)
# make sure there are 2 colnames, otherwise
# skip to the next iteration of the loop!
if (length(tmp_colnames) !=2) next
# get the 2 colname
tmp_colnames_2 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]2.*$"),
x = tmp_colnames,perl = TRUE,value = TRUE)
# get the 1 colname
tmp_colnames_1 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]1.*$"),
x = tmp_colnames,perl = TRUE,value = TRUE)
# diffbind_df <- subset(diffbind_df, tmp_colnames_R >= 1)
# diffbind_df <- subset(diffbind_df, tmp_colnames_D >= 1)
df <- df[which(df[[tmp_colnames_2]] >= 1 | df[[tmp_colnames_1]] >= 1),]
# diffbind_df <- diffbind_df[which(diffbind_df[[tmp_colnames_D]] >= 1),]
}
return(df)
}
The quantile_cutoff
function will apply a quantile cut off to a numeric vector. This ‘quantile flooring’ will raise all values below the 25th quantile up to the 25th quantile. This measure was implemented to account for potential skewing caused by extremely low values.
quantile_cutoff <- function(vec, cutoff = 0.25){
# replace lowest values with the x'th quantile value
vec_quantile <- quantile(vec, probs = cutoff)
vec <- sapply(vec, function(x) max(x, vec_quantile))
return(vec)
}
The calc_fold_change
function calculates the fold change value for two numeric vectors.
calc_fold_change <- function(sample1, sample2, calc_log = TRUE, log_base = 2){
# sample1, sample2: numeric vectors
# fc: fold change output vector
if(calc_log){
fc <- log((sample2/sample1), base = log_base)
} else {
fc <- (sample2/sample1)
}
return(fc)
}
The log_ratio_df
function will create a data frame that contains only the fold change values for the DiffBind samples
log_ratio_df <- function(diffbind_df, diff_gene_colname = "external_gene_name",
patientID_matches, calc_cutoff = FALSE, melt_df = FALSE,
add_status = FALSE, ...){
# create a df to hold the DiffBind log ratio entries
# create an empty data frame to fill with log ratio values
diff_log_ratio <- setNames(data.frame(matrix(nrow = nrow(diffbind_df),
ncol = length(patientID_matches))),
patientID_matches)
# add the genes from DiffBind
diff_log_ratio[["gene"]] <- diffbind_df[[diff_gene_colname]]
# enter the log ratio values into the df per patient
# # iterate over the samples; get only diff min cols per patient
for(i in seq_along(patientID_matches)){
tmp_ID <- patientID_matches[i]
# get the diffBind columns with the patient's ID
tmp_colnames <- grep(pattern = tmp_ID,x = colnames(diffbind_df),
value = TRUE)
# make sure there are 2 colnames,
# otherwise skip to the next iteration of the loop!
if (length(tmp_colnames) !=2) next
# get the 2 colname
tmp_colnames_2 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]2.*$"),
x = tmp_colnames,perl = TRUE,value = TRUE)
# get the 1 colname
tmp_colnames_1 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]1.*$"),
x = tmp_colnames,perl = TRUE,value = TRUE)
if(calc_cutoff){
# apply quantil cutoffs to the 2 and 1 values
tmp_2 <- quantile_cutoff(diffbind_df[[tmp_colnames_2]])
tmp_1 <- quantile_cutoff(diffbind_df[[tmp_colnames_1]])
} else {
tmp_2 <- diffbind_df[[tmp_colnames_2]]
tmp_1 <- diffbind_df[[tmp_colnames_1]]
}
# get the fold change
diff_log_ratio[,tmp_ID] <- calc_fold_change(sample1 = tmp_1,
sample2 = tmp_2,
calc_log = TRUE,
log_base = 2)
}
# melt it into long format
if (melt_df) diff_log_ratio <- reshape2::melt(diff_log_ratio,id.vars="gene",
variable.name="sample",
value.name="diff_peak_log_ratio")
# add a column to hold the 'status'
# e.g. Up regulated vs. Down regulated in the gene_expr table
if (add_status) diff_log_ratio[["gene_expression_status"]] <- NA
return(diff_log_ratio)
}
The mark_test_type
function will retrieve the ‘test’ value for a given histone mark based on the supplied dataframe of marks and tests.
mark_test_type <- function(hist_mark, mark_key){
# get the alternative value for the ttest/utest from the mark key
test_alt <- as.character(subset(mark_key, marks == hist_mark)[['test']])
return(test_alt)
}
The plot_signif_pvalue
function will be used to add a ’*’ character to the plots to indicate if each sample has passed significance testing, using either a one-sided T-test or U-test.
plot_signif_pvalue <- function(df, y_colname, x_colname, x_group, p_value_cutoff=0.05, test_method = "ttest", ...){
# plot points on a barplot / beanplot marking significant categories
y_max <- max(df[[y_colname]],na.rm = TRUE)
for(i in seq_along(unique(as.character(df[[x_group]])))){
tmp_ID <- unique(as.character(df[[x_group]]))[i]
tmp_df <- subset(df, get(x_group) == tmp_ID)
# get the one-sided test type value
test_alt <- mark_test_type(...)
if(test_method == 'ttest') tmp_pvalue <- t.test(get(y_colname)~get(x_colname),
data = tmp_df,
alternative = test_alt,
na.action = na.omit,
paired = FALSE)["p.value"]
# Wilcoxon rank-sum test applied to independent samples
if(test_method == 'utest'){
tmp_pvalue <- wilcox.test(get(y_colname)~get(x_colname),
data = tmp_df,
alternative = test_alt,
na.action = na.omit,
paired = FALSE)["p.value"]
print(wilcox.test(get(y_colname)~get(x_colname),
data = tmp_df,
paired = FALSE))
}
if(tmp_pvalue < p_value_cutoff){
print(points(x = i,
y = y_max - 1,
pch ='*',
col ='red',
cex=2))
print(text(x = i,
y = y_max,
labels = paste0('p = ',
format(tmp_pvalue,
digits = 2)),
cex=0.7))
}
}
}
The diff_beanplot
function will create the DiffBind beanplots.
diff_beanplot <- function(df, x1_colname, x2_colname, y_colname, main_text_1 = "",
y_lab = "", main_text_2 = "", save_output = FALSE,
file = "./plot.pdf", strip_chart = FALSE, ... ){
# make the DiffBind split beanplots
# only make a plot of there are at least 50 genes
if(nrow(unique(subset(df, get(x1_colname) == "UP")["gene"]))>=50 &&
nrow(unique(subset(df, get(x1_colname) == "DOWN")["gene"]))>=50){
if(save_output) pdf(file = file,height = 8,width = 12)
beanplot(get(y_colname)~get(x1_colname)*get(x2_colname),
data=df,
what=c(0,1,1,0),
# # EXTRA OPTIONS HERE: # what=c(0,1,1,1),
# maxstripline = 0, # ll = 0.04,varwidth = TRUE,
border = NA,
bw="nrd0", # gave errors with default and "nrd"
overallline = 'mean', #median',
col=list('grey','purple'),
ylab = y_lab,
main=paste0("DiffBind results for ",
main_text_1,
" ",
main_text_2),
cex.main=0.9,
side = "both"
)
legend('bottomright',
fill=c('grey','purple'),
legend= c('Down Genes', 'Up Genes'))
# add a horizontal line at 0
abline(h=0,col="darkgrey",lty="dotted")
# add p value and *'s
plot_signif_pvalue(df = df,y_colname = y_colname,
x_colname = x1_colname,
x_group = x2_colname, ...)
if (strip_chart){
# creates a stripchart overlay
# # don't use this !!
stripchart(y_colname~x_colname*sample,data = df,
vertical=TRUE,
add=TRUE,
cex=0.1,
method="jitter",
jitter=0.05)
}
if(save_output) dev.off()
}
}
The gene_expression_raw_pipeline
function will contain the steps required for processing the microarray expression data.
gene_expression_raw_pipeline <- function(gene_expr_df, patientID_matches,
gene_names_overlap, quant_norm = FALSE){
# keep only LR columns; log ratio
gene_expr_df <- gene_expr_df[multi_grep(source_data = colnames(gene_expr_df),
patterns = "^.*LR$")]
# quantil normalize the gene expression dataset
if (quant_norm) gene_expr_df <- quantile_normalize_df(gene_expr_df)
# fix colnames
colnames(gene_expr_df) <- get_replacements(source_data = colnames(gene_expr_df),
uniq=TRUE,
pattern = "^([[:alpha:]]*)([[:punct:]]*.*)$")
# keep matched columns
gene_expr_df <- gene_expr_df[patientID_matches]
# add a column with the gene names for melting
gene_expr_df[["gene"]] <- rownames(gene_expr_df)
# subset for intersected genes only
gene_expr_df <- droplevels(gene_expr_df[rownames(gene_expr_df) %in% gene_names_overlap,])
# melt it into long format
gene_expr_df <- reshape2::melt(gene_expr_df,
id.vars="gene",
variable.name="sample",
value.name="gene_expression_log_ratio")
return(gene_expr_df)
}
The diffbind_raw_pipeline
function will hold the steps required to process the DiffBind data set.
diffbind_raw_pipeline <- function(diffbind_df, patientID_matches,
gene_names_overlap, quant_norm = FALSE){
# quantile normalize entires DiffBind dataset
if (quant_norm) {
norm_cols <- multi_grep(source_data = colnames(diffbind_df),
patterns = patientID_matches)
diffbind_df[norm_cols] <- quantile_normalize_df(diffbind_df[norm_cols])
}
# subset for closest genes
diffbind_df <- subset(diffbind_df, abs(distancetoFeature)<=3000)
# subset for intersected genes only
diffbind_df <- droplevels(diffbind_df[diffbind_df[["external_gene_name"]] %in% gene_names_overlap,])
# remove extraneous columns
diffbind_cols_to_remove <-c("seqnames","start","end","width","strand",
"Conc","Conc_D","Conc_R","Fold",
"p.value","FDR","feature","gene_biotype",
"start_position","end_position",
"insideFeature","shortestDistance",
"fromOverlappingOrNearest", "distancetoFeature")
diffbind_df <- diffbind_df[,! colnames(diffbind_df) %in% diffbind_cols_to_remove]
return(diffbind_df)
}
The combined_full_pipeline
function contains the steps needed for running the entire analysis pipeline.
combined_full_pipeline <- function(diffbind_df, gene_expr_df, patientID_matches,
gene_names_overlap, diff_quant_norm = FALSE,
gene_quant_norm = FALSE, diff_lr_cutoff = FALSE,
make_plot = TRUE, data_file = "./data.Rdata",
plot_file = "./plot.pdf", histone_mark,
params_branch, mark_key, ...){
# process the DiffBind data
diffbind_df <- diffbind_raw_pipeline(diffbind_df = diffbind_df,
patientID_matches = patientID_matches,
gene_names_overlap = gene_names_overlap,
quant_norm = diff_quant_norm)
# process the gene expression data
gene_expr_df <- gene_expression_raw_pipeline(gene_expr_df = gene_expr_df,
patientID_matches = patientID_matches,
gene_names_overlap = gene_names_overlap,
quant_norm = gene_quant_norm)
# create the DiffBind log ratio data frame
diff_log_ratio <- log_ratio_df(diffbind_df = diffbind_df,
diff_gene_colname = "external_gene_name",
patientID_matches = patientID_matches,
calc_cutoff = diff_lr_cutoff,
melt_df = TRUE, add_status = TRUE)
# merge the tables
# # merge together the two long format diff_log_ratio's and gene_expr log ratio's
diff_gene_expr_merged <- base::merge(gene_expr_df,
diff_log_ratio,by=c("gene","sample"))
# set the gene expression status to UP or DOWN based on gene expression value
# # 1.5x up/down expression
diff_gene_expr_merged[["gene_expression_status"]] <- ifelse(diff_gene_expr_merged[["gene_expression_log_ratio"]]>=0.58, "UP", ifelse(diff_gene_expr_merged[["gene_expression_log_ratio"]]<=-0.58, "DOWN", no = NA))
# ~~~~~~ SUBSET DIFF FOLD CHANGE ~~~~~~~~~ #
# subset for log ratio >1.5x change; log2 0.58
# diff_gene_expr_merged <- subset(diff_gene_expr_merged, abs(diff_peak_log_ratio) >= 0.58)
# diff_gene_expr_merged <- droplevels(diff_gene_expr_merged)
# ~~~~~~~~~~~~~~~ #
if (make_plot){
# make the DiffBind beanplots
save.image(file=data_file, compress = TRUE)
diff_beanplot(df = diff_gene_expr_merged,
x1_colname = 'gene_expression_status',
x2_colname = 'sample',
y_colname = 'diff_peak_log_ratio',
main_text_1 = histone_mark,
main_text_2 = params_branch,
y_lab = "DiffBind ratio = log2 ( Diff peak 2 / Diff peak 1 )",
test_method = 'utest',
hist_mark = histone_mark,
mark_key = mark_key,
# save_output = TRUE, # use this in script, but not in this document
file = plot_file)
}
}
Input files and output locations will be specified. A key will be created that relates each histone mark studied in our experiment with its role, and its one-sided significance test.
# location for the output
main_outdir <- "."
setwd(main_outdir)
# path to the TSV file containing the microarray gene expression data
gene_expr_file <- "input/gene_expression2.tsv"
# path to the file containing the differential binding DiffBind sheet
diffbind_file <- "input/diffbind_H3K27AC.csv"
# get the unique ID to use for the sample
histone_mark <- "H3K27AC"
# get the analysis params branch for the samples
params_branch <- "diffbind.by_chip.status-peaks.by_sample.macs_broad"
# histone marks included in the experiment
mark_key <- data.frame(marks = c('H3K9ME3', 'H3K27ME3', 'H3K9AC',
'H3K27AC', 'H3K4ME3'),
type = c('repressive', 'repressive','activating',
'activating','activating'),
test = c('greater', 'greater', 'less', 'less', 'less'))
mark_key
## marks type test
## 1 H3K9ME3 repressive greater
## 2 H3K27ME3 repressive greater
## 3 H3K9AC activating less
## 4 H3K27AC activating less
## 5 H3K4ME3 activating less
The DiffBind and microarray gene expression data sets will be loaded.
# ~~~~~~ READ IN FILE ~~~~~~~~ #
# diffbind data
diffbind_df <- read.csv(diffbind_file)
# gene expression data
gene_expr_df <- read.table(gene_expr_file,sep = '\t',header = TRUE,
row.names = 1,quote = "")
# save current state # use this in the script
# save.image(file=paste0(main_outdir,"/raw_data.Rdata"),compress = TRUE)
In our DiffBind data set, each entry corresponds to a peak that has been determined to be differentially bound between sample sets. For our experiment, we used paired samples before and after treatment. For example, here AADC.1.H3K27AC
corresponds to sample AADC before treatment, and AADC.2.H3K27AC
corresponds to the same sample group after treatment. Both samples have undergone chromatin immunoprecipitation for histone mark H3K27AC. In our data set, FDR
corresponds to the false discovery rate, which is a form of adjusted p value.
# check some attributes of the data sets
head(diffbind_df, n = 3)
## seqnames start end width strand Conc Conc_1 Conc_2
## 1 chr1 10009 10438 430 * 1.932386 1.850992 2.009433
## 2 chr1 710223 710919 697 * 2.619188 2.368653 2.832585
## 3 chr1 712387 715259 2873 * 6.365982 6.165604 6.541893
## Fold p.value FDR AADC.1.H3K27AC BBDC.1.H3K27AC
## 1 -0.1584405 0.4227292 1.0000000 7.656727 0.540563
## 2 -0.4639316 0.2293055 0.5103474 5.955232 4.324504
## 3 -0.3762889 0.2762324 0.5419556 115.701655 7.027319
## CCDC.1.H3K27AC EEDC.1.H3K27AC GGDC.1.H3K27AC HHDC.1.H3K27AC
## 1 2.348033 0.5304813 6.523903 0.756952
## 2 4.696066 12.2010689 3.261951 0.756952
## 3 69.658311 127.3155011 122.649376 0.756952
## JJDC.1.H3K27AC KKDC.1.H3K27AC NNDC.1.H3K27AC OODC.1.H3K27AC
## 1 0.76252 0.3777487 21.948941 0.6244618
## 2 0.76252 7.9327230 8.047945 4.3712326
## 3 96.84004 27.1979074 147.789537 74.3109549
## QQDC.1.H3K27AC RRDC.1.H3K27AC AADC.2.H3K27AC BBDC.2.H3K27AC
## 1 0.5870339 0.6324206 5 11.521257
## 2 5.8703390 3.7945233 7 6.670202
## 3 48.1367794 24.0319811 139 89.138150
## CCDC.2.H3K27AC EEDC.2.H3K27AC GGDC.2.H3K27AC HHDC.2.H3K27AC
## 1 9.751324 4.849967 0.6946048 0.7920232
## 2 5.250713 9.699934 2.7784194 0.7920232
## 3 138.018741 147.923993 50.7061531 53.8575756
## JJDC.2.H3K27AC KKDC.2.H3K27AC NNDC.2.H3K27AC OODC.2.H3K27AC
## 1 0.4375714 0.5349915 4.898485 0.495923
## 2 6.5635704 15.5147522 12.736061 9.422537
## 3 105.0171265 89.8785643 140.423237 54.055609
## QQDC.2.H3K27AC RRDC.2.H3K27AC feature external_gene_name
## 1 0.5391373 8.799582 ENSG00000223972 DDX11L1
## 2 3.7739608 5.279749 ENSG00000228327 RP11-206L10.2
## 3 2.1565490 107.941535 ENSG00000237491 RP11-206L10.9
## gene_biotype start_position end_position insideFeature distancetoFeature
## 1 pseudogene 11869 14412 upstream -1645
## 2 lincRNA 700237 714006 inside 3435
## 3 lincRNA 714150 745440 overlapStart -327
## shortestDistance fromOverlappingOrNearest
## 1 1431 shortestDistance
## 2 3087 shortestDistance
## 3 1109 shortestDistance
dim(diffbind_df)
## [1] 126206 44
We can also create some basic plots to visualize some of the data. We can see here that the raw values have a very large and skewed distribution.
# distribution of raw Fold changes for a sample
with(diffbind_df, hist(AADC.2.H3K27AC/AADC.1.H3K27AC))
Using the log2(Fold change) helps to normalize the distribution.
# distribution of log2 Fold changes for a sample
with(diffbind_df, hist(log2(AADC.2.H3K27AC/AADC.1.H3K27AC)))
Our gene expression data set was obtained from microarray analysis of all samples included in the study. For example, here AADC.Exp.1
and AADC.Exp.2
correspond to the samples before and after treatment, while AADC.Exp.LR
is the log2 ratio of the two values. We will preview some aspects of the data set.
head(gene_expr_df, n = 3)
## AADC.Exp.1 AADC.Exp.2 AADC.Exp.LR BBDC.Exp.1 BBDC.Exp.2 BBDC.Exp.LR
## STAT1 6131.25 5941.836 -0.04527245 882.3387 5646.252 2.67788905
## GAPDH 51219.36 49317.550 -0.05458810 51503.7100 55067.740 0.09653104
## ACTB 46530.07 50957.180 0.13112208 27041.2200 43020.270 0.66985634
## CCDC.Exp.1 CCDC.Exp.2 CCDC.Exp.LR DDDC.Exp.1 DDDC.Exp.2 DDDC.Exp.LR
## STAT1 1222.487 705.6041 -0.7928883 2700.842 2402.704 -0.16875032
## GAPDH 30246.960 58412.8800 0.9494964 53715.410 54167.860 0.01210106
## ACTB 25757.660 18002.7900 -0.5167810 35830.510 57658.160 0.68633622
## EEDC.Exp.1 EEDC.Exp.2 EEDC.Exp.LR FFDC.Exp.1 FFDC.Exp.2 FFDC.Exp.LR
## STAT1 1307.528 1969.007 0.5906264 1076.489 1192.044 0.14710391
## GAPDH 45408.600 38131.560 -0.2519800 34473.110 33843.480 -0.02659354
## ACTB 42287.460 37245.500 -0.1831638 23570.470 25168.620 0.09464559
## GGDC.Exp.1 GGDC.Exp.2 GGDC.Exp.LR HHDC.Exp.1 HHDC.Exp.2 HHDC.Exp.LR
## STAT1 1783.034 1554.07 -0.1982827 1327.161 1064.01 -0.3188317
## GAPDH 46598.990 49978.21 0.1010005 79254.210 93127.11 0.2327136
## ACTB 26419.930 15990.41 -0.7244197 20488.180 37259.12 0.8628018
## IIDC.Exp.1 IIDC.Exp.2 IIDC.Exp.LR JJDC.Exp.1 JJDC.Exp.2 JJDC.Exp.LR
## STAT1 1067.311 3152.203 1.562380 2125.583 2636.649 0.3108469
## GAPDH 68302.050 25023.780 -1.448629 47029.500 57089.950 0.2796708
## ACTB 71205.810 35488.070 -1.004661 39072.230 64979.450 0.7338399
## KKDC.Exp.1 KKDC.Exp.2 KKDC.Exp.LR LLDC.Exp.1 LLDC.Exp.2 LLDC.Exp.LR
## STAT1 3710.152 2696.019 -0.4606476 1201.073 662.8243 -0.8576254
## GAPDH 42782.580 53452.320 0.3212291 62244.380 76772.7300 0.3026504
## ACTB 49310.700 53184.350 0.1091010 39070.760 34014.8800 -0.1999233
## MMDC.Exp.1 MMDC.Exp.2 MMDC.Exp.LR NNDC.Exp.1 NNDC.Exp.2 NNDC.Exp.LR
## STAT1 782.8055 1865.457 1.2528033 2423.722 1932.754 -0.3265662
## GAPDH 38957.5300 59453.970 0.6098709 68994.010 49123.550 -0.4900563
## ACTB 20465.2900 66247.930 1.6946963 69432.540 47629.870 -0.5437453
## OODC.Exp.1 OODC.Exp.2 OODC.Exp.LR PPDC.Exp.1 PPDC.Exp.2 PPDC.Exp.LR
## STAT1 46.22742 37.10801 -0.31701843 2126.65 1761.695 -0.2716184
## GAPDH 6177.22959 5903.68645 -0.06534385 15028.38 57574.320 1.9377360
## ACTB 6059.12679 5795.10568 -0.06427492 13552.89 53110.290 1.9703909
## QQDC.Exp.1 QQDC.Exp.2 QQDC.Exp.LR RRDC.Exp.1 RRDC.Exp.2 RRDC.Exp.LR
## STAT1 2186.735 1070.688 -1.0302403 1411.666 1947.939 0.4645497
## GAPDH 29822.860 89003.020 1.5774357 60804.270 55741.410 -0.1254231
## ACTB 37794.860 64049.440 0.7609959 56846.640 51633.440 -0.1387694
dim(gene_expr_df)
## [1] 17426 54
Next we will check the distribution of gene expression values for one sample’s log ratio change in gene expressions.
# distribution of log ratio values for a sample
with(gene_expr_df, hist(AADC.Exp.LR))
We can preview the effects of quantile normalization with this data set. We will normalize all of the Log Ratio columns in the dataset, and re-create the plot for the same sample’s data.
# preview quantile normalization effect
with(quantile_normalize_df(gene_expr_df[multi_grep(source_data = colnames(gene_expr_df),
patterns = "^.*LR$")]),
hist(AADC.Exp.LR))
During DiffBind analysis, some samples failed and were excluded from the output. On the other hand, all samples were included in the microarray gene expression analysis. Because of this, we need to find the samples that both data sets have in common. To accomplish this, we will use our sampleID_intersects
function with a regular expression that returns the intersect of the two lists of samples. Due to failure on other quality control measures, we decided to exclude sample sample HHDC
from the analysis; this will be accomplished with this step as well.
First, we will check on the column headers
# the column headings from the two group
colnames(diffbind_df)
## [1] "seqnames" "start"
## [3] "end" "width"
## [5] "strand" "Conc"
## [7] "Conc_1" "Conc_2"
## [9] "Fold" "p.value"
## [11] "FDR" "AADC.1.H3K27AC"
## [13] "BBDC.1.H3K27AC" "CCDC.1.H3K27AC"
## [15] "EEDC.1.H3K27AC" "GGDC.1.H3K27AC"
## [17] "HHDC.1.H3K27AC" "JJDC.1.H3K27AC"
## [19] "KKDC.1.H3K27AC" "NNDC.1.H3K27AC"
## [21] "OODC.1.H3K27AC" "QQDC.1.H3K27AC"
## [23] "RRDC.1.H3K27AC" "AADC.2.H3K27AC"
## [25] "BBDC.2.H3K27AC" "CCDC.2.H3K27AC"
## [27] "EEDC.2.H3K27AC" "GGDC.2.H3K27AC"
## [29] "HHDC.2.H3K27AC" "JJDC.2.H3K27AC"
## [31] "KKDC.2.H3K27AC" "NNDC.2.H3K27AC"
## [33] "OODC.2.H3K27AC" "QQDC.2.H3K27AC"
## [35] "RRDC.2.H3K27AC" "feature"
## [37] "external_gene_name" "gene_biotype"
## [39] "start_position" "end_position"
## [41] "insideFeature" "distancetoFeature"
## [43] "shortestDistance" "fromOverlappingOrNearest"
colnames(gene_expr_df)
## [1] "AADC.Exp.1" "AADC.Exp.2" "AADC.Exp.LR" "BBDC.Exp.1" "BBDC.Exp.2"
## [6] "BBDC.Exp.LR" "CCDC.Exp.1" "CCDC.Exp.2" "CCDC.Exp.LR" "DDDC.Exp.1"
## [11] "DDDC.Exp.2" "DDDC.Exp.LR" "EEDC.Exp.1" "EEDC.Exp.2" "EEDC.Exp.LR"
## [16] "FFDC.Exp.1" "FFDC.Exp.2" "FFDC.Exp.LR" "GGDC.Exp.1" "GGDC.Exp.2"
## [21] "GGDC.Exp.LR" "HHDC.Exp.1" "HHDC.Exp.2" "HHDC.Exp.LR" "IIDC.Exp.1"
## [26] "IIDC.Exp.2" "IIDC.Exp.LR" "JJDC.Exp.1" "JJDC.Exp.2" "JJDC.Exp.LR"
## [31] "KKDC.Exp.1" "KKDC.Exp.2" "KKDC.Exp.LR" "LLDC.Exp.1" "LLDC.Exp.2"
## [36] "LLDC.Exp.LR" "MMDC.Exp.1" "MMDC.Exp.2" "MMDC.Exp.LR" "NNDC.Exp.1"
## [41] "NNDC.Exp.2" "NNDC.Exp.LR" "OODC.Exp.1" "OODC.Exp.2" "OODC.Exp.LR"
## [46] "PPDC.Exp.1" "PPDC.Exp.2" "PPDC.Exp.LR" "QQDC.Exp.1" "QQDC.Exp.2"
## [51] "QQDC.Exp.LR" "RRDC.Exp.1" "RRDC.Exp.2" "RRDC.Exp.LR"
# get the patient ID intersects
patientID_matches <- sampleID_intersects(IDs1 = colnames(gene_expr_df),
pattern1 = "^([[:alpha:]]*)[[:punct:]]*.*$",
IDs2 = colnames(diffbind_df),
pattern2 = "^([[:alpha:]]*)[[:punct:]]*.*$",
removeID2 = "HHDC")
patientID_matches
## [1] "AADC" "BBDC" "CCDC" "EEDC" "GGDC" "JJDC" "KKDC" "NNDC" "OODC" "QQDC"
## [11] "RRDC"
Similarly, we need to find the list of genes that are shared between the DiffBind and gene expression data sets.
# get the genes in common between the diffbind and gene expression data
gene_names_overlap <- intersect(rownames(gene_expr_df),
unique(diffbind_df[["external_gene_name"]]))
head(gene_names_overlap)
## [1] "STAT1" "GAPDH" "ACTB" "DDR1" "RFC2" "HSPA6"
Due to the wide range of values in our data sets, we need to apply normalization measures to make the data points within and between samples more comparable to each other. In order to evaluate the effects of different normalization methods, we will run the entire pipeline with different sets of options in regards to quantile flooring and quantile normalization of the DiffBind and gene expression data sets. Beanplots showing results will be shown as output. Full-sized plots can be viewed by right-clicking on the images and selecting ‘View Image’ in your web browser.
# without normalization
combined_full_pipeline(diffbind_df = diffbind_df,
gene_expr_df = gene_expr_df,
patientID_matches = patientID_matches,
gene_names_overlap = gene_names_overlap,
diff_quant_norm = FALSE,
gene_quant_norm = FALSE,
diff_lr_cutoff = FALSE,
make_plot = TRUE,
# data_file = paste0(main_outdir,"/data.Rdata"),
# plot_file = paste0(main_outdir,"/beanplot_no_normalization.pdf"),
histone_mark = histone_mark,
params_branch = params_branch,
mark_key = mark_key)
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1963600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2035200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2240600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2058800, p-value = 0.0004066
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2127200, p-value = 0.02365
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1789400, p-value = 0.7282
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1047200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2316400, p-value = 1.174e-09
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 784310, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2938900, p-value = 7.248e-09
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1961800, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
# DiffBind matrix quantile normalization
combined_full_pipeline(diffbind_df = diffbind_df,
gene_expr_df = gene_expr_df,
patientID_matches = patientID_matches,
gene_names_overlap = gene_names_overlap,
diff_quant_norm = TRUE,
gene_quant_norm = FALSE,
diff_lr_cutoff = FALSE,
make_plot = TRUE,
# data_file = paste0(main_outdir,"/data.Rdata"),
# plot_file = paste0(main_outdir,"/beanplot_DiffQnorm.pdf"),
histone_mark = histone_mark,
params_branch = params_branch,
mark_key = mark_key)
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1928400, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1915900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2151200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2083400, p-value = 0.003625
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2217800, p-value = 0.9783
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1780700, p-value = 0.5515
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1009700, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2050800, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 822720, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 3182300, p-value = 0.2363
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2023600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
# DiffBind matrix quantile normalization + per sample quantile cutoffs
combined_full_pipeline(diffbind_df = diffbind_df,
gene_expr_df = gene_expr_df,
patientID_matches = patientID_matches,
gene_names_overlap = gene_names_overlap,
diff_quant_norm = TRUE,
gene_quant_norm = FALSE,
diff_lr_cutoff = TRUE,
make_plot = TRUE,
# data_file = paste0(main_outdir,"/data.Rdata"),
# plot_file = paste0(main_outdir,"/beanplot_DiffQnorm_Qcutoff.pdf"),
histone_mark = histone_mark,
params_branch = params_branch,
mark_key = mark_key)
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1834400, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1836200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2138900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1714700, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2387800, p-value = 2.608e-05
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1754100, p-value = 0.1682
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1029100, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2289300, p-value = 1.592e-11
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 846150, p-value = 1.143e-15
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 3684500, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2364300, p-value = 6.027e-06
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
# per sample quantile cutoffs
combined_full_pipeline(diffbind_df = diffbind_df,
gene_expr_df = gene_expr_df,
patientID_matches = patientID_matches,
gene_names_overlap = gene_names_overlap,
diff_quant_norm = FALSE,
gene_quant_norm = FALSE,
diff_lr_cutoff = TRUE,
make_plot = TRUE,
# data_file = paste0(main_outdir,"/data.Rdata"),
# plot_file = paste0(main_outdir,"/beanplot_Qcutoff.pdf"),
histone_mark = histone_mark,
params_branch = params_branch,
mark_key = mark_key)
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1845100, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1908000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2261500, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1621400, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2311200, p-value = 0.02157
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 1750900, p-value = 0.1417
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 991650, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2442300, p-value = 0.001106
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 811830, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 3579900, p-value = 1.875e-10
## alternative hypothesis: true location shift is not equal to 0
##
##
## Wilcoxon rank sum test with continuity correction
##
## data: get(y_colname) by get(x_colname)
## W = 2083900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
## NULL
## NULL
After reviewing the results, it was determined that DiffBind Raw Data Matrix Quantile Normalization
gave the best results. These parameters normalized some of the data variance that was seen without causing too much distortion of the results. It can be seen from the beanplots that many of the samples had higher differential binding for H3K27AC in genes that were upregulated after treatment.
system('uname -srv',intern=T)
## [1] "Darwin 15.6.0 Darwin Kernel Version 15.6.0: Mon Aug 29 20:21:34 PDT 2016; root:xnu-3248.60.11~1/RELEASE_X86_64"
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] preprocessCore_1.34.0 reshape2_1.4.1 ggplot2_2.1.0
## [4] beanplot_1.2 data.table_1.9.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 digest_0.6.10 plyr_1.8.4 chron_2.3-47
## [5] grid_3.3.1 gtable_0.2.0 formatR_1.4 magrittr_1.5
## [9] scales_0.4.0 evaluate_0.9 stringi_1.1.1 rmarkdown_1.0
## [13] tools_3.3.1 stringr_1.1.0 munsell_0.4.3 yaml_2.1.13
## [17] colorspace_1.2-6 htmltools_0.3.5 knitr_1.14
# save.image(compress = TRUE, ) # use this in the script, not in this document