1 Introduction

This example will walk through the analysis of a ChIP-Seq experiment conducted on human samples before and after treatment. Samples were immunoprecipitated for five histone marks (only the results of H3K27AC are shown here). We will compare the differential peak data from DiffBind with gene expression data from a microarray analysis of the same samples. The goal of this analysis is to determine if any patterns can be found between genes that were differentially bound and the changes in expression of those genes. To accomplish this, we will first subset the DiffBind data set for peaks within 3000 base pairs of the gene start site. The table of differential binding values per sample will be merged with the gene expression values. Gene expression values will then be grouped into genes that were upregulate (UP) and genes that were down regulated (DOWN). In order to visualize the data, split beanplots will be created for each sample; one side of the split will contain the DiffBind values for the UP genes, while the other side contains the DiffBind values for the DOWN genes. Each sample will also be tested for significance of difference between the UP and DOWN groupings, and significance will be marked on the plot.

2 Packages

First, required packages will be loaded.

# ~~~~~ LOAD PACKAGES ~~~~~~~~ #

# install.packages(c("devtools"))
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase","preprocessCore"))
library("data.table")
library("beanplot")
library("ggplot2")
library("reshape2")
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
library("preprocessCore")

3 Functions

Due to the complexity of this analysis pipeline and the need for repeated analysis with different parameters, custom functions will be created.

Our pipeline will need to match up information on samples between two data sets. Sample ID’s are saved in the column headings of both data sets, but include multiple entries for each sample set, and not all samples are present in both sets. The get_replacements function will use a supplied regular expression to return a character vector of unique matching sample ID’s in the supplied column names.

# ~~~~~ CUSTOM FUNCTIONS ~~~~~~~~ #
get_replacements <- function(source_data, uniq=TRUE, pattern = "", 
                             remove_entry = NULL, ...){
    # return vector of first matches to regex
    matches <-  gsub(x = source_data, 
                     pattern = pattern,
                     replacement = "\\1")
    
    # return only unique entries
    if(uniq) matches <- unique(matches)
    
    # remove an entry from the output
    if ( ! is.null(remove_entry)){
        if(any(grepl(pattern = remove_entry,x = matches))){
            matches <- matches[grep(pattern = remove_entry, x = matches, 
                                    value = FALSE, invert = TRUE)]  
        } 
    }
    return(matches)
}

The multi_grep function will be used to search for multiple patterns in a character vector by passing a vector of patterns to be matched.

multi_grep <- function(source_data, patterns){
    # find multiple patterns in a char vector
    # make a regex to search with
    if (length(patterns) > 1){
        search_pattern <- paste(patterns,
                                collapse = "|") 
    } else {
        search_pattern <- patterns
    }
    
    matches <- grep(x = source_data,
                    pattern = search_pattern,
                    value = TRUE)
    return(matches)
}

The sampleID_intersects function will be used to find the set of Sample ID’s that are present in both the DiffBind and gene expression data sets.

sampleID_intersects <- function(IDs1, pattern1 = "", 
                                IDs2, pattern2 = "", 
                                removeID2 = NULL ){
    # get the intersect between sets of ID's after gsub pattern replacements
    ID1_match <- get_replacements(source_data = IDs1, pattern = pattern1)
    
    ID2_intersect_ID1 <-  multi_grep(patterns = ID1_match, 
                                     source_data = IDs2)
    ID2_match <- get_replacements(source_data = ID2_intersect_ID1, 
                                  pattern = pattern2, remove_entry = removeID2)
    return(ID2_match)
}

The quantile_normalize_df function will be uses to transform a data frame by quantile normalizing its values.

quantile_normalize_df <- function(df){
    # quantile normalization of a dataframe
    df <- setnames(x = as.data.frame(normalize.quantiles(as.matrix(df))), 
                   colnames(df))
    return(df)
}

The diff_subset function will subset the DiffBind data frame for differential peak score entries per Sample that are greater than or equal to 1.

diff_subset <- function(df, patientID_matches){
    # subset for DiffPeaks >= 1 per sample
    for(i in seq_along(patientID_matches)){
        tmp_ID <- patientID_matches[i]
        
        # print(tmp_ID %in% colnames(diffbind_df_intersect_min))
        
        # get the diffBind columns with the patient's ID
        tmp_colnames <- grep(pattern = tmp_ID,x = colnames(df),value = TRUE)
        
        # make sure there are 2 colnames, otherwise 
        # skip to the next iteration of the loop!
        if (length(tmp_colnames) !=2) next
        
        # get the 2 colname
        tmp_colnames_2 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]2.*$"),
                               x = tmp_colnames,perl = TRUE,value = TRUE)
        
        # get the 1 colname
        tmp_colnames_1 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]1.*$"),
                               x = tmp_colnames,perl = TRUE,value = TRUE)
        
        # diffbind_df <- subset(diffbind_df, tmp_colnames_R >= 1)
        # diffbind_df <- subset(diffbind_df, tmp_colnames_D >= 1)
        df <- df[which(df[[tmp_colnames_2]] >= 1 | df[[tmp_colnames_1]] >= 1),]
        # diffbind_df <- diffbind_df[which(diffbind_df[[tmp_colnames_D]] >= 1),]
    }
    
    return(df)
}

The quantile_cutoff function will apply a quantile cut off to a numeric vector. This ‘quantile flooring’ will raise all values below the 25th quantile up to the 25th quantile. This measure was implemented to account for potential skewing caused by extremely low values.

quantile_cutoff <- function(vec, cutoff = 0.25){
    # replace lowest values with the x'th quantile value
    vec_quantile <- quantile(vec, probs = cutoff)
    vec <- sapply(vec, function(x) max(x, vec_quantile))
    return(vec)
}

The calc_fold_change function calculates the fold change value for two numeric vectors.

calc_fold_change <- function(sample1, sample2, calc_log = TRUE, log_base = 2){
    # sample1, sample2: numeric vectors
    # fc: fold change output vector
    
    if(calc_log){
        fc <- log((sample2/sample1), base = log_base)  
    } else {
        fc <- (sample2/sample1)
    }
    
    return(fc)
}

The log_ratio_df function will create a data frame that contains only the fold change values for the DiffBind samples

log_ratio_df <- function(diffbind_df, diff_gene_colname = "external_gene_name", 
                         patientID_matches, calc_cutoff = FALSE, melt_df = FALSE, 
                         add_status = FALSE, ...){
    # create a df to hold the DiffBind log ratio entries
    # create an empty data frame to fill with log ratio values
    diff_log_ratio <- setNames(data.frame(matrix(nrow = nrow(diffbind_df), 
                                                 ncol = length(patientID_matches))), 
                               patientID_matches)
    # add the genes from DiffBind
    diff_log_ratio[["gene"]] <- diffbind_df[[diff_gene_colname]]
    
    # enter the log ratio values into the df per patient
    # # iterate over the samples; get only diff min cols per patient
    for(i in seq_along(patientID_matches)){
        tmp_ID <- patientID_matches[i]
        
        # get the diffBind columns with the patient's ID
        tmp_colnames <- grep(pattern = tmp_ID,x = colnames(diffbind_df),
                             value = TRUE)
        
        # make sure there are 2 colnames, 
        # otherwise skip to the next iteration of the loop!
        if (length(tmp_colnames) !=2) next
        
        # get the 2 colname
        tmp_colnames_2 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]2.*$"),
                               x = tmp_colnames,perl = TRUE,value = TRUE)
        
        # get the 1 colname
        tmp_colnames_1 <- grep(pattern = paste0("^",tmp_ID,"[[:punct:]]1.*$"),
                               x = tmp_colnames,perl = TRUE,value = TRUE)
        
        if(calc_cutoff){
            # apply quantil cutoffs to the 2 and 1 values
            tmp_2 <- quantile_cutoff(diffbind_df[[tmp_colnames_2]])
            tmp_1 <- quantile_cutoff(diffbind_df[[tmp_colnames_1]])
        } else {
            tmp_2 <- diffbind_df[[tmp_colnames_2]]
            tmp_1 <- diffbind_df[[tmp_colnames_1]]
        }
        
        # get the fold change
        diff_log_ratio[,tmp_ID] <- calc_fold_change(sample1 = tmp_1, 
                                                    sample2 = tmp_2, 
                                                    calc_log = TRUE, 
                                                    log_base = 2)
        
    }
    
    
    # melt it into long format
    if (melt_df) diff_log_ratio <- reshape2::melt(diff_log_ratio,id.vars="gene",
                                                  variable.name="sample",
                                                  value.name="diff_peak_log_ratio")
    
    # add a column to hold the 'status' 
    # e.g. Up regulated vs. Down regulated in the gene_expr table
    if (add_status) diff_log_ratio[["gene_expression_status"]] <- NA
    
    return(diff_log_ratio)
}

The mark_test_type function will retrieve the ‘test’ value for a given histone mark based on the supplied dataframe of marks and tests.

mark_test_type <- function(hist_mark, mark_key){
    # get the alternative value for the ttest/utest from the mark key
    test_alt <- as.character(subset(mark_key, marks == hist_mark)[['test']])
    return(test_alt)
}

The plot_signif_pvalue function will be used to add a ’*’ character to the plots to indicate if each sample has passed significance testing, using either a one-sided T-test or U-test.

plot_signif_pvalue <- function(df, y_colname, x_colname, x_group, p_value_cutoff=0.05, test_method = "ttest", ...){
    # plot points on a barplot / beanplot marking significant categories
    y_max <- max(df[[y_colname]],na.rm = TRUE)
    
    for(i in seq_along(unique(as.character(df[[x_group]])))){
        tmp_ID <- unique(as.character(df[[x_group]]))[i]
        tmp_df <- subset(df, get(x_group) == tmp_ID)
        
        # get the one-sided test type value
        test_alt <- mark_test_type(...)
        if(test_method == 'ttest') tmp_pvalue <- t.test(get(y_colname)~get(x_colname),
                                                        data = tmp_df, 
                                                        alternative = test_alt,
                                                        na.action = na.omit,
                                                        paired = FALSE)["p.value"]
        
        # Wilcoxon rank-sum test applied to independent samples
        if(test_method == 'utest'){
            tmp_pvalue <- wilcox.test(get(y_colname)~get(x_colname),
                                      data = tmp_df,
                                      alternative = test_alt,
                                      na.action = na.omit, 
                                      paired = FALSE)["p.value"]
            
            print(wilcox.test(get(y_colname)~get(x_colname),
                              data = tmp_df,
                              paired = FALSE))
            
        } 
        if(tmp_pvalue < p_value_cutoff){
            print(points(x = i, 
                         y = y_max - 1,
                         pch ='*', 
                         col ='red',
                         cex=2))
            print(text(x = i, 
                       y = y_max,
                       labels = paste0('p = ',
                                       format(tmp_pvalue,
                                              digits = 2)),
                       cex=0.7))
        }
    }
}

The diff_beanplot function will create the DiffBind beanplots.

diff_beanplot <- function(df, x1_colname, x2_colname, y_colname, main_text_1 = "", 
                          y_lab = "", main_text_2 = "", save_output = FALSE, 
                          file = "./plot.pdf", strip_chart = FALSE, ... ){
    # make the DiffBind split beanplots
    
    # only make a plot of there are at least 50 genes
    if(nrow(unique(subset(df, get(x1_colname) == "UP")["gene"]))>=50 &&
       nrow(unique(subset(df, get(x1_colname) == "DOWN")["gene"]))>=50){
        
        if(save_output) pdf(file = file,height = 8,width = 12)
        
        beanplot(get(y_colname)~get(x1_colname)*get(x2_colname),
                 data=df,
                 what=c(0,1,1,0), 
                 # # EXTRA OPTIONS HERE: # what=c(0,1,1,1),
                 # maxstripline = 0, # ll = 0.04,varwidth = TRUE,
                 border = NA,
                 bw="nrd0", # gave errors with default and "nrd" 
                 overallline = 'mean', #median',
                 col=list('grey','purple'),
                 ylab = y_lab, 
                 main=paste0("DiffBind results for ",
                             main_text_1,
                             " ",
                             main_text_2), 
                 cex.main=0.9,
                 side = "both"
        )
        
        legend('bottomright', 
               fill=c('grey','purple'), 
               legend= c('Down Genes', 'Up Genes'))
        
        # add a horizontal line at 0
        abline(h=0,col="darkgrey",lty="dotted")
        
        # add p value and *'s
        plot_signif_pvalue(df = df,y_colname = y_colname, 
                           x_colname = x1_colname, 
                           x_group = x2_colname, ...) 
        
        if (strip_chart){
            # creates a stripchart overlay
            # # don't use this !!
            stripchart(y_colname~x_colname*sample,data = df,
                       vertical=TRUE,
                       add=TRUE,
                       cex=0.1,
                       method="jitter",
                       jitter=0.05)   
        }
        
        if(save_output) dev.off()
    }
}

The gene_expression_raw_pipeline function will contain the steps required for processing the microarray expression data.

gene_expression_raw_pipeline <- function(gene_expr_df, patientID_matches, 
                                         gene_names_overlap, quant_norm = FALSE){
    # keep only LR columns; log ratio
    gene_expr_df <- gene_expr_df[multi_grep(source_data = colnames(gene_expr_df), 
                                            patterns = "^.*LR$")]
    # quantil normalize the gene expression dataset
    if (quant_norm) gene_expr_df <- quantile_normalize_df(gene_expr_df)
    # fix colnames
    colnames(gene_expr_df) <- get_replacements(source_data = colnames(gene_expr_df), 
                                               uniq=TRUE, 
                                               pattern = "^([[:alpha:]]*)([[:punct:]]*.*)$")
    # keep matched columns
    gene_expr_df <- gene_expr_df[patientID_matches]
    # add a column with the gene names for melting
    gene_expr_df[["gene"]] <- rownames(gene_expr_df)
    
    # subset for intersected genes only
    gene_expr_df <- droplevels(gene_expr_df[rownames(gene_expr_df)  %in% gene_names_overlap,])
    
    # melt it into long format
    gene_expr_df <- reshape2::melt(gene_expr_df,
                                   id.vars="gene",
                                   variable.name="sample",
                                   value.name="gene_expression_log_ratio")
    
    return(gene_expr_df)
}

The diffbind_raw_pipeline function will hold the steps required to process the DiffBind data set.

diffbind_raw_pipeline <- function(diffbind_df, patientID_matches, 
                                  gene_names_overlap, quant_norm = FALSE){
    # quantile normalize entires DiffBind dataset
    if (quant_norm) {
        norm_cols <- multi_grep(source_data = colnames(diffbind_df), 
                                patterns = patientID_matches)
        diffbind_df[norm_cols] <- quantile_normalize_df(diffbind_df[norm_cols])
    }
    
    
    # subset for closest genes
    diffbind_df <- subset(diffbind_df, abs(distancetoFeature)<=3000)
    # subset for intersected genes only
    diffbind_df <- droplevels(diffbind_df[diffbind_df[["external_gene_name"]] %in% gene_names_overlap,])
    # remove extraneous columns
    diffbind_cols_to_remove <-c("seqnames","start","end","width","strand",
                                "Conc","Conc_D","Conc_R","Fold",
                                "p.value","FDR","feature","gene_biotype",
                                "start_position","end_position",
                                "insideFeature","shortestDistance",
                                "fromOverlappingOrNearest", "distancetoFeature")
    diffbind_df <- diffbind_df[,! colnames(diffbind_df) %in% diffbind_cols_to_remove]
    return(diffbind_df)
}

The combined_full_pipeline function contains the steps needed for running the entire analysis pipeline.

combined_full_pipeline <- function(diffbind_df, gene_expr_df, patientID_matches, 
                                   gene_names_overlap, diff_quant_norm = FALSE, 
                                   gene_quant_norm = FALSE, diff_lr_cutoff = FALSE, 
                                   make_plot = TRUE, data_file = "./data.Rdata", 
                                   plot_file = "./plot.pdf", histone_mark, 
                                   params_branch, mark_key, ...){
    
    # process the DiffBind data
    diffbind_df <- diffbind_raw_pipeline(diffbind_df = diffbind_df, 
                                         patientID_matches = patientID_matches, 
                                         gene_names_overlap = gene_names_overlap, 
                                         quant_norm = diff_quant_norm)
    
    # process the gene expression data
    gene_expr_df <- gene_expression_raw_pipeline(gene_expr_df = gene_expr_df, 
                                                 patientID_matches = patientID_matches,
                                                 gene_names_overlap = gene_names_overlap, 
                                                 quant_norm = gene_quant_norm)
    
    # create the DiffBind log ratio data frame
    diff_log_ratio <- log_ratio_df(diffbind_df = diffbind_df, 
                                   diff_gene_colname = "external_gene_name", 
                                   patientID_matches = patientID_matches, 
                                   calc_cutoff = diff_lr_cutoff,
                                   melt_df = TRUE, add_status = TRUE)
    
    # merge the tables
    # # merge together the two long format diff_log_ratio's and gene_expr log ratio's
    diff_gene_expr_merged <- base::merge(gene_expr_df,
                                         diff_log_ratio,by=c("gene","sample")) 
    
    # set the gene expression status to UP or DOWN based on gene expression value
    # # 1.5x up/down expression
    diff_gene_expr_merged[["gene_expression_status"]] <- ifelse(diff_gene_expr_merged[["gene_expression_log_ratio"]]>=0.58, "UP", ifelse(diff_gene_expr_merged[["gene_expression_log_ratio"]]<=-0.58, "DOWN", no = NA))
    
    
    # ~~~~~~ SUBSET DIFF FOLD CHANGE ~~~~~~~~~ #
    # subset for log ratio >1.5x change; log2 0.58
    # diff_gene_expr_merged <- subset(diff_gene_expr_merged, abs(diff_peak_log_ratio) >= 0.58)
    # diff_gene_expr_merged <- droplevels(diff_gene_expr_merged)
    # ~~~~~~~~~~~~~~~ #
    
    
    if (make_plot){
        
        # make the DiffBind beanplots
        save.image(file=data_file, compress = TRUE) 
        
        diff_beanplot(df = diff_gene_expr_merged,
                      x1_colname =  'gene_expression_status', 
                      x2_colname = 'sample',
                      y_colname = 'diff_peak_log_ratio', 
                      main_text_1 = histone_mark, 
                      main_text_2 = params_branch,
                      y_lab = "DiffBind ratio = log2 ( Diff peak 2 / Diff peak 1 )",
                      test_method = 'utest',
                      hist_mark = histone_mark, 
                      mark_key = mark_key,
                      # save_output = TRUE, # use this in script, but not in this document
                      file = plot_file) 
    }
}

4 Setup

Input files and output locations will be specified. A key will be created that relates each histone mark studied in our experiment with its role, and its one-sided significance test.

# location for the output
main_outdir <- "."
setwd(main_outdir)

# path to the TSV file containing the microarray gene expression data
gene_expr_file <- "input/gene_expression2.tsv"

# path to the file containing the differential binding DiffBind sheet
diffbind_file <- "input/diffbind_H3K27AC.csv"

# get the unique ID to use for the sample
histone_mark <- "H3K27AC"

# get the analysis params branch for the samples
params_branch <- "diffbind.by_chip.status-peaks.by_sample.macs_broad"

# histone marks included in the experiment
mark_key <- data.frame(marks = c('H3K9ME3', 'H3K27ME3', 'H3K9AC', 
                                 'H3K27AC', 'H3K4ME3'), 
                       type = c('repressive', 'repressive','activating',
                                'activating','activating'),
                       test = c('greater', 'greater', 'less', 'less', 'less'))

mark_key
##      marks       type    test
## 1  H3K9ME3 repressive greater
## 2 H3K27ME3 repressive greater
## 3   H3K9AC activating    less
## 4  H3K27AC activating    less
## 5  H3K4ME3 activating    less

5 Load Data

The DiffBind and microarray gene expression data sets will be loaded.

# ~~~~~~ READ IN FILE ~~~~~~~~ #
# diffbind data
diffbind_df <- read.csv(diffbind_file)
# gene expression data
gene_expr_df <- read.table(gene_expr_file,sep = '\t',header = TRUE,
                           row.names = 1,quote = "")

# save current state # use this in the script
# save.image(file=paste0(main_outdir,"/raw_data.Rdata"),compress = TRUE)

5.1 DiffBind data

In our DiffBind data set, each entry corresponds to a peak that has been determined to be differentially bound between sample sets. For our experiment, we used paired samples before and after treatment. For example, here AADC.1.H3K27AC corresponds to sample AADC before treatment, and AADC.2.H3K27AC corresponds to the same sample group after treatment. Both samples have undergone chromatin immunoprecipitation for histone mark H3K27AC. In our data set, FDR corresponds to the false discovery rate, which is a form of adjusted p value.

# check some attributes of the data sets
head(diffbind_df, n = 3)
##   seqnames  start    end width strand     Conc   Conc_1   Conc_2
## 1     chr1  10009  10438   430      * 1.932386 1.850992 2.009433
## 2     chr1 710223 710919   697      * 2.619188 2.368653 2.832585
## 3     chr1 712387 715259  2873      * 6.365982 6.165604 6.541893
##         Fold   p.value       FDR AADC.1.H3K27AC BBDC.1.H3K27AC
## 1 -0.1584405 0.4227292 1.0000000       7.656727       0.540563
## 2 -0.4639316 0.2293055 0.5103474       5.955232       4.324504
## 3 -0.3762889 0.2762324 0.5419556     115.701655       7.027319
##   CCDC.1.H3K27AC EEDC.1.H3K27AC GGDC.1.H3K27AC HHDC.1.H3K27AC
## 1       2.348033      0.5304813       6.523903       0.756952
## 2       4.696066     12.2010689       3.261951       0.756952
## 3      69.658311    127.3155011     122.649376       0.756952
##   JJDC.1.H3K27AC KKDC.1.H3K27AC NNDC.1.H3K27AC OODC.1.H3K27AC
## 1        0.76252      0.3777487      21.948941      0.6244618
## 2        0.76252      7.9327230       8.047945      4.3712326
## 3       96.84004     27.1979074     147.789537     74.3109549
##   QQDC.1.H3K27AC RRDC.1.H3K27AC AADC.2.H3K27AC BBDC.2.H3K27AC
## 1      0.5870339      0.6324206              5      11.521257
## 2      5.8703390      3.7945233              7       6.670202
## 3     48.1367794     24.0319811            139      89.138150
##   CCDC.2.H3K27AC EEDC.2.H3K27AC GGDC.2.H3K27AC HHDC.2.H3K27AC
## 1       9.751324       4.849967      0.6946048      0.7920232
## 2       5.250713       9.699934      2.7784194      0.7920232
## 3     138.018741     147.923993     50.7061531     53.8575756
##   JJDC.2.H3K27AC KKDC.2.H3K27AC NNDC.2.H3K27AC OODC.2.H3K27AC
## 1      0.4375714      0.5349915       4.898485       0.495923
## 2      6.5635704     15.5147522      12.736061       9.422537
## 3    105.0171265     89.8785643     140.423237      54.055609
##   QQDC.2.H3K27AC RRDC.2.H3K27AC         feature external_gene_name
## 1      0.5391373       8.799582 ENSG00000223972            DDX11L1
## 2      3.7739608       5.279749 ENSG00000228327      RP11-206L10.2
## 3      2.1565490     107.941535 ENSG00000237491      RP11-206L10.9
##   gene_biotype start_position end_position insideFeature distancetoFeature
## 1   pseudogene          11869        14412      upstream             -1645
## 2      lincRNA         700237       714006        inside              3435
## 3      lincRNA         714150       745440  overlapStart              -327
##   shortestDistance fromOverlappingOrNearest
## 1             1431         shortestDistance
## 2             3087         shortestDistance
## 3             1109         shortestDistance
dim(diffbind_df)
## [1] 126206     44

We can also create some basic plots to visualize some of the data. We can see here that the raw values have a very large and skewed distribution.

# distribution of raw Fold changes for a sample
with(diffbind_df, hist(AADC.2.H3K27AC/AADC.1.H3K27AC))

Using the log2(Fold change) helps to normalize the distribution.

# distribution of log2 Fold changes for a sample
with(diffbind_df, hist(log2(AADC.2.H3K27AC/AADC.1.H3K27AC)))

5.2 Gene Expression data

Our gene expression data set was obtained from microarray analysis of all samples included in the study. For example, here AADC.Exp.1 and AADC.Exp.2 correspond to the samples before and after treatment, while AADC.Exp.LR is the log2 ratio of the two values. We will preview some aspects of the data set.

head(gene_expr_df, n = 3)
##       AADC.Exp.1 AADC.Exp.2 AADC.Exp.LR BBDC.Exp.1 BBDC.Exp.2 BBDC.Exp.LR
## STAT1    6131.25   5941.836 -0.04527245   882.3387   5646.252  2.67788905
## GAPDH   51219.36  49317.550 -0.05458810 51503.7100  55067.740  0.09653104
## ACTB    46530.07  50957.180  0.13112208 27041.2200  43020.270  0.66985634
##       CCDC.Exp.1 CCDC.Exp.2 CCDC.Exp.LR DDDC.Exp.1 DDDC.Exp.2 DDDC.Exp.LR
## STAT1   1222.487   705.6041  -0.7928883   2700.842   2402.704 -0.16875032
## GAPDH  30246.960 58412.8800   0.9494964  53715.410  54167.860  0.01210106
## ACTB   25757.660 18002.7900  -0.5167810  35830.510  57658.160  0.68633622
##       EEDC.Exp.1 EEDC.Exp.2 EEDC.Exp.LR FFDC.Exp.1 FFDC.Exp.2 FFDC.Exp.LR
## STAT1   1307.528   1969.007   0.5906264   1076.489   1192.044  0.14710391
## GAPDH  45408.600  38131.560  -0.2519800  34473.110  33843.480 -0.02659354
## ACTB   42287.460  37245.500  -0.1831638  23570.470  25168.620  0.09464559
##       GGDC.Exp.1 GGDC.Exp.2 GGDC.Exp.LR HHDC.Exp.1 HHDC.Exp.2 HHDC.Exp.LR
## STAT1   1783.034    1554.07  -0.1982827   1327.161    1064.01  -0.3188317
## GAPDH  46598.990   49978.21   0.1010005  79254.210   93127.11   0.2327136
## ACTB   26419.930   15990.41  -0.7244197  20488.180   37259.12   0.8628018
##       IIDC.Exp.1 IIDC.Exp.2 IIDC.Exp.LR JJDC.Exp.1 JJDC.Exp.2 JJDC.Exp.LR
## STAT1   1067.311   3152.203    1.562380   2125.583   2636.649   0.3108469
## GAPDH  68302.050  25023.780   -1.448629  47029.500  57089.950   0.2796708
## ACTB   71205.810  35488.070   -1.004661  39072.230  64979.450   0.7338399
##       KKDC.Exp.1 KKDC.Exp.2 KKDC.Exp.LR LLDC.Exp.1 LLDC.Exp.2 LLDC.Exp.LR
## STAT1   3710.152   2696.019  -0.4606476   1201.073   662.8243  -0.8576254
## GAPDH  42782.580  53452.320   0.3212291  62244.380 76772.7300   0.3026504
## ACTB   49310.700  53184.350   0.1091010  39070.760 34014.8800  -0.1999233
##       MMDC.Exp.1 MMDC.Exp.2 MMDC.Exp.LR NNDC.Exp.1 NNDC.Exp.2 NNDC.Exp.LR
## STAT1   782.8055   1865.457   1.2528033   2423.722   1932.754  -0.3265662
## GAPDH 38957.5300  59453.970   0.6098709  68994.010  49123.550  -0.4900563
## ACTB  20465.2900  66247.930   1.6946963  69432.540  47629.870  -0.5437453
##       OODC.Exp.1 OODC.Exp.2 OODC.Exp.LR PPDC.Exp.1 PPDC.Exp.2 PPDC.Exp.LR
## STAT1   46.22742   37.10801 -0.31701843    2126.65   1761.695  -0.2716184
## GAPDH 6177.22959 5903.68645 -0.06534385   15028.38  57574.320   1.9377360
## ACTB  6059.12679 5795.10568 -0.06427492   13552.89  53110.290   1.9703909
##       QQDC.Exp.1 QQDC.Exp.2 QQDC.Exp.LR RRDC.Exp.1 RRDC.Exp.2 RRDC.Exp.LR
## STAT1   2186.735   1070.688  -1.0302403   1411.666   1947.939   0.4645497
## GAPDH  29822.860  89003.020   1.5774357  60804.270  55741.410  -0.1254231
## ACTB   37794.860  64049.440   0.7609959  56846.640  51633.440  -0.1387694
dim(gene_expr_df)
## [1] 17426    54

Next we will check the distribution of gene expression values for one sample’s log ratio change in gene expressions.

# distribution of log ratio values for a sample
with(gene_expr_df, hist(AADC.Exp.LR))

We can preview the effects of quantile normalization with this data set. We will normalize all of the Log Ratio columns in the dataset, and re-create the plot for the same sample’s data.

# preview quantile normalization effect
with(quantile_normalize_df(gene_expr_df[multi_grep(source_data = colnames(gene_expr_df), 
                                            patterns = "^.*LR$")]), 
     hist(AADC.Exp.LR))

6 Run Pipeline

During DiffBind analysis, some samples failed and were excluded from the output. On the other hand, all samples were included in the microarray gene expression analysis. Because of this, we need to find the samples that both data sets have in common. To accomplish this, we will use our sampleID_intersects function with a regular expression that returns the intersect of the two lists of samples. Due to failure on other quality control measures, we decided to exclude sample sample HHDC from the analysis; this will be accomplished with this step as well.

First, we will check on the column headers

# the column headings from the two group
colnames(diffbind_df)
##  [1] "seqnames"                 "start"                   
##  [3] "end"                      "width"                   
##  [5] "strand"                   "Conc"                    
##  [7] "Conc_1"                   "Conc_2"                  
##  [9] "Fold"                     "p.value"                 
## [11] "FDR"                      "AADC.1.H3K27AC"          
## [13] "BBDC.1.H3K27AC"           "CCDC.1.H3K27AC"          
## [15] "EEDC.1.H3K27AC"           "GGDC.1.H3K27AC"          
## [17] "HHDC.1.H3K27AC"           "JJDC.1.H3K27AC"          
## [19] "KKDC.1.H3K27AC"           "NNDC.1.H3K27AC"          
## [21] "OODC.1.H3K27AC"           "QQDC.1.H3K27AC"          
## [23] "RRDC.1.H3K27AC"           "AADC.2.H3K27AC"          
## [25] "BBDC.2.H3K27AC"           "CCDC.2.H3K27AC"          
## [27] "EEDC.2.H3K27AC"           "GGDC.2.H3K27AC"          
## [29] "HHDC.2.H3K27AC"           "JJDC.2.H3K27AC"          
## [31] "KKDC.2.H3K27AC"           "NNDC.2.H3K27AC"          
## [33] "OODC.2.H3K27AC"           "QQDC.2.H3K27AC"          
## [35] "RRDC.2.H3K27AC"           "feature"                 
## [37] "external_gene_name"       "gene_biotype"            
## [39] "start_position"           "end_position"            
## [41] "insideFeature"            "distancetoFeature"       
## [43] "shortestDistance"         "fromOverlappingOrNearest"
colnames(gene_expr_df)
##  [1] "AADC.Exp.1"  "AADC.Exp.2"  "AADC.Exp.LR" "BBDC.Exp.1"  "BBDC.Exp.2" 
##  [6] "BBDC.Exp.LR" "CCDC.Exp.1"  "CCDC.Exp.2"  "CCDC.Exp.LR" "DDDC.Exp.1" 
## [11] "DDDC.Exp.2"  "DDDC.Exp.LR" "EEDC.Exp.1"  "EEDC.Exp.2"  "EEDC.Exp.LR"
## [16] "FFDC.Exp.1"  "FFDC.Exp.2"  "FFDC.Exp.LR" "GGDC.Exp.1"  "GGDC.Exp.2" 
## [21] "GGDC.Exp.LR" "HHDC.Exp.1"  "HHDC.Exp.2"  "HHDC.Exp.LR" "IIDC.Exp.1" 
## [26] "IIDC.Exp.2"  "IIDC.Exp.LR" "JJDC.Exp.1"  "JJDC.Exp.2"  "JJDC.Exp.LR"
## [31] "KKDC.Exp.1"  "KKDC.Exp.2"  "KKDC.Exp.LR" "LLDC.Exp.1"  "LLDC.Exp.2" 
## [36] "LLDC.Exp.LR" "MMDC.Exp.1"  "MMDC.Exp.2"  "MMDC.Exp.LR" "NNDC.Exp.1" 
## [41] "NNDC.Exp.2"  "NNDC.Exp.LR" "OODC.Exp.1"  "OODC.Exp.2"  "OODC.Exp.LR"
## [46] "PPDC.Exp.1"  "PPDC.Exp.2"  "PPDC.Exp.LR" "QQDC.Exp.1"  "QQDC.Exp.2" 
## [51] "QQDC.Exp.LR" "RRDC.Exp.1"  "RRDC.Exp.2"  "RRDC.Exp.LR"
# get the patient ID intersects
patientID_matches <- sampleID_intersects(IDs1 = colnames(gene_expr_df), 
                                         pattern1 = "^([[:alpha:]]*)[[:punct:]]*.*$", 
                                         IDs2 = colnames(diffbind_df), 
                                         pattern2 = "^([[:alpha:]]*)[[:punct:]]*.*$", 
                                         removeID2 = "HHDC")

patientID_matches
##  [1] "AADC" "BBDC" "CCDC" "EEDC" "GGDC" "JJDC" "KKDC" "NNDC" "OODC" "QQDC"
## [11] "RRDC"

Similarly, we need to find the list of genes that are shared between the DiffBind and gene expression data sets.

# get the genes in common between the diffbind and gene expression data
gene_names_overlap <- intersect(rownames(gene_expr_df),
                                unique(diffbind_df[["external_gene_name"]]))

head(gene_names_overlap)
## [1] "STAT1" "GAPDH" "ACTB"  "DDR1"  "RFC2"  "HSPA6"

Due to the wide range of values in our data sets, we need to apply normalization measures to make the data points within and between samples more comparable to each other. In order to evaluate the effects of different normalization methods, we will run the entire pipeline with different sets of options in regards to quantile flooring and quantile normalization of the DiffBind and gene expression data sets. Beanplots showing results will be shown as output. Full-sized plots can be viewed by right-clicking on the images and selecting ‘View Image’ in your web browser.

6.1 Without Normalization

# without normalization
combined_full_pipeline(diffbind_df = diffbind_df, 
                       gene_expr_df = gene_expr_df, 
                       patientID_matches = patientID_matches, 
                       gene_names_overlap = gene_names_overlap,
                       diff_quant_norm = FALSE, 
                       gene_quant_norm = FALSE, 
                       diff_lr_cutoff = FALSE,
                       make_plot = TRUE, 
                       # data_file = paste0(main_outdir,"/data.Rdata"), 
                       # plot_file = paste0(main_outdir,"/beanplot_no_normalization.pdf"), 
                       histone_mark = histone_mark, 
                       params_branch = params_branch, 
                       mark_key = mark_key)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1963600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2035200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2240600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2058800, p-value = 0.0004066
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2127200, p-value = 0.02365
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1789400, p-value = 0.7282
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1047200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2316400, p-value = 1.174e-09
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 784310, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2938900, p-value = 7.248e-09
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1961800, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL

6.2 DiffBind Raw Data Matrix Quantile Normalization

# DiffBind matrix quantile normalization
combined_full_pipeline(diffbind_df = diffbind_df, 
                       gene_expr_df = gene_expr_df, 
                       patientID_matches = patientID_matches, 
                       gene_names_overlap = gene_names_overlap,
                       diff_quant_norm = TRUE, 
                       gene_quant_norm = FALSE, 
                       diff_lr_cutoff = FALSE,
                       make_plot = TRUE, 
                       # data_file = paste0(main_outdir,"/data.Rdata"), 
                       # plot_file = paste0(main_outdir,"/beanplot_DiffQnorm.pdf"), 
                       histone_mark = histone_mark, 
                       params_branch = params_branch, 
                       mark_key = mark_key)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1928400, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1915900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2151200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2083400, p-value = 0.003625
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2217800, p-value = 0.9783
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1780700, p-value = 0.5515
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1009700, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2050800, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 822720, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 3182300, p-value = 0.2363
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2023600, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL

6.3 DiffBind Raw Value Quantile Normalization & Log Ratio Cutoffs

# DiffBind matrix quantile normalization + per sample quantile cutoffs
combined_full_pipeline(diffbind_df = diffbind_df, 
                       gene_expr_df = gene_expr_df, 
                       patientID_matches = patientID_matches, 
                       gene_names_overlap = gene_names_overlap,
                       diff_quant_norm = TRUE, 
                       gene_quant_norm = FALSE, 
                       diff_lr_cutoff = TRUE,
                       make_plot = TRUE, 
                       # data_file = paste0(main_outdir,"/data.Rdata"), 
                       # plot_file = paste0(main_outdir,"/beanplot_DiffQnorm_Qcutoff.pdf"), 
                       histone_mark = histone_mark, 
                       params_branch = params_branch, 
                       mark_key = mark_key)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1834400, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1836200, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2138900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1714700, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2387800, p-value = 2.608e-05
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1754100, p-value = 0.1682
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1029100, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2289300, p-value = 1.592e-11
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 846150, p-value = 1.143e-15
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 3684500, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2364300, p-value = 6.027e-06
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL

6.4 DiffBind Log Ratio Quantile Cutoffs

# per sample quantile cutoffs
combined_full_pipeline(diffbind_df = diffbind_df, 
                       gene_expr_df = gene_expr_df, 
                       patientID_matches = patientID_matches, 
                       gene_names_overlap = gene_names_overlap,
                       diff_quant_norm = FALSE, 
                       gene_quant_norm = FALSE, 
                       diff_lr_cutoff = TRUE,
                       make_plot = TRUE, 
                       # data_file = paste0(main_outdir,"/data.Rdata"), 
                       # plot_file = paste0(main_outdir,"/beanplot_Qcutoff.pdf"), 
                       histone_mark = histone_mark, 
                       params_branch = params_branch, 
                       mark_key = mark_key)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1845100, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1908000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2261500, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1621400, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2311200, p-value = 0.02157
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 1750900, p-value = 0.1417
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 991650, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2442300, p-value = 0.001106
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 811830, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 3579900, p-value = 1.875e-10
## alternative hypothesis: true location shift is not equal to 0
## 
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  get(y_colname) by get(x_colname)
## W = 2083900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 
## NULL
## NULL

7 Conclusion

After reviewing the results, it was determined that DiffBind Raw Data Matrix Quantile Normalization gave the best results. These parameters normalized some of the data variance that was seen without causing too much distortion of the results. It can be seen from the beanplots that many of the samples had higher differential binding for H3K27AC in genes that were upregulated after treatment.

8 System Information

system('uname -srv',intern=T)
## [1] "Darwin 15.6.0 Darwin Kernel Version 15.6.0: Mon Aug 29 20:21:34 PDT 2016; root:xnu-3248.60.11~1/RELEASE_X86_64"
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] preprocessCore_1.34.0 reshape2_1.4.1        ggplot2_2.1.0        
## [4] beanplot_1.2          data.table_1.9.6     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7      digest_0.6.10    plyr_1.8.4       chron_2.3-47    
##  [5] grid_3.3.1       gtable_0.2.0     formatR_1.4      magrittr_1.5    
##  [9] scales_0.4.0     evaluate_0.9     stringi_1.1.1    rmarkdown_1.0   
## [13] tools_3.3.1      stringr_1.1.0    munsell_0.4.3    yaml_2.1.13     
## [17] colorspace_1.2-6 htmltools_0.3.5  knitr_1.14
# save.image(compress = TRUE, ) # use this in the script, not in this document