This is a basic EWAS tutorial, introducing some of the concepts and tools for analysing methylation data.

The workshop uses public Illumina 450K data from several studies (GSE40279, GSE48472, GSE41114, GSE42700, GSE46573, GSE48472, GSE50586). It is currently designed to investigate the methylation differences between blood (whole blood) and buccal (cheek swabs) tissues. It is aimed at giving a very basic overview of the structure and type of data used in EWAS as well as a few of the available tools and methods to explore this data.

Note: Before starting this workshop please visit GitHub and make sure you have downloaded the repository and have set up all the required software for this workshop.


1. load required packages

We need the minfi package installed for this workshop.

If you need to install minfi follow the steps below:

source("https://bioconductor.org/biocLite.R")
biocLite("minfi")

Now load the package:

require(minfi)

2. load data

The data for this workshop has been provided for you. There are 2 main data files:

  1. annotation for the Illumina 450k methylation array
  2. methylation data (beta values) for 5 whole blood and 5 buccal samples

Load this into R:

meth.anno <- readRDS('data/Ilmn450K_anno.RDS')
betas <- readRDS('data/blood_buccal.RDS')

3. set up samples/data and basic QC explore

We’ll create an object called tissues which provides identification of our blood and buccal samples:

# define samples
tissues <- c(rep(0, 5), rep(1, 5))
tissues <- as.factor(tissues)
levels(tissues) <- c('Blood', 'Buccal')

Now we’ll explore the distribution of methylation values (beta values) by creating a density plot. A methylation distribution for a given individual follows a bimodal distribution, with peaks around 0 (unmethylated) and 1 (methylated).

Create a density plot for all samples, what do you see?

# create a density plot
densityPlot(betas, sampGroups = tissues)

Another way of exploring a data set is by cluster analysis.

Here we perform hierarchical clustering on the 10 samples using all available data:

# Hierarchical Clustering
d <- t(betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on all probes", labels = tissues, xlab = '', sub="")

What do you conclude from the above?


4. run dmpFinder - logistic regression

To run our analysis exploring the difference in methylation between blood and buccal we’ll use the dmpFinder function from the minfi package.

Depending on the β€˜outcome’ (trait/phenotype being investigated) dmpFinder can be configured for either categorical (logistic regression) or continuous (linear regression). Here we’re looking at blood tissue vs buccal tissue, so a categorical trait - we’ll be using logistic regression modelling.

Run dmpFinder:

# minfi dmpfinder
results <- dmpFinder(betas, pheno = tissues, type = "categorical", qCutoff = 1)
results <- na.omit(results)

Above we create a results object which contains all CpG sites (methylation sites) ranked by p-value (significance), with the most significant sites first.

Look at the head of results (most significant):

head(results)
intercept f pval qval
cg26297005 0.9248713 2779.165 0 2.6e-06
cg16780847 0.9185569 2179.933 0 2.6e-06
cg08780218 0.8801236 2021.894 0 2.6e-06
cg07759042 0.9810960 2003.060 0 2.6e-06
cg05350268 0.1052552 1776.359 0 3.3e-06
cg26071526 0.0753421 1444.052 0 4.4e-06

We can also explore the tail (least significant):

tail(results)
intercept f pval qval
cg18765133 0.9163265 0 0.9999703 0.3358529
cg13671590 0.8678395 0 0.9999785 0.3358539
cg01260615 0.9012077 0 0.9999799 0.3358539
cg14241748 0.6394218 0 0.9999801 0.3358539
cg25647092 0.8777785 0 0.9999956 0.3358582
cg10942084 0.8558640 0 0.9999973 0.3358582

The Q-value is an adjustment for multiple correction bias. We’re going to extract all CpG sites which have a Q-value less than 0.00005.

First, how many sites will this be?

# explore some results
table(results$qval < 0.00005)
FALSE TRUE
440320 6316

Now extract these sites into an object called top.results:

top.results <- results[results$qval <= 0.00005,]

For fun we can explore how many sites pass a stringent Bonferroni correction:

Note: Bonferroni threshold is 0.05/[all observations], so we’ll use 0.05/450000 which is 1.1e-7.

# bonferroni correction
table(results$pval <= (0.05/450000))
FALSE TRUE
445694 942

5. plotting the results

It is important to be able to visualise your data/results, so here we’re going to use some of the graphing functions of R to do just that.

First lets try the plotCpg function from the minfi package:

## very basic plotting top CpG marker ('best'/most significantly associated)
plotCpg(betas, cpg = "cg05350268", pheno = tissues, type = "categorical", measure = "beta", 
    ylab = "methylation (beta)", ylim = c(0, 1))
# bottom CpG marker ('best'/most significantly associated)
plotCpg(betas, cpg = "cg07309136", pheno = tissues, type = "categorical", measure = "beta", 
    ylab = "methylation (beta)", ylim = c(0, 1))

We can also use base R plotting to create boxplots:

## boxplots lets look at a few boxplots
boxplot(betas["cg05350268", ] ~ tissues, col = c("darkred", "cadetblue"), ylim = c(0, 
    1), main = "cg05350268")
boxplot(betas["cg19653345", ] ~ tissues, col = c("darkred", "cadetblue"), ylim = c(0, 
    1), main = "cg19653345")

If we want to create more than one plot at a time we can do so:

## multiple boxplots loop through a few sites for the top markers (most
## significantly associated)
tissue.sites <- rownames(top.results)[c(1:9)]

par(mfrow = c(3, 3))
for (cpg in tissue.sites) {
    
    boxplot(betas[cpg, ] ~ tissues, col = c("darkred", "cadetblue"), ylab = "methylation (beta)", 
        xlab = "tissue", main = cpg, ylim = c(0, 1))
    
}

We can also create a pdf containing these plots:

## multiple boxplots can also write this out to a file (pdf)
pdf("figures/blood_v_buccal_topsites_plot.pdf", width = 7, height = 10)
tissue.sites <- rownames(top.results)[c(1:12)]

par(mfrow = c(4, 3))
for (cpg in tissue.sites) {
    
    boxplot(betas[cpg, ] ~ tissues, col = c("darkred", "cadetblue"), ylab = "methylation (beta)", 
        xlab = "tissue", main = cpg)
    
}
dev.off()

If you navigate to the figures/ directory you’ll find the pdf created above.

Hierarchical Clustering 2

We previously performed clustering on all CpG sites, now we’re going to cluster on those that pass our defined threshold.

First extract a smaller matrix of all significant CpG sites:

# subset the beta matrix by the top markers and cluster
sig.betas <- betas[rownames(betas) %in% rownames(top.results), ]

Now perform the clustering:

# Hierarchical Clustering
d <- t(sig.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top tissue discrimination CpG sites", labels = tissues, 
    xlab = "", sub = "")

What do you see?


6. prepare and run pathways analysis

Before we can perform a pathways analysis we need to create a list of genes which underlie the CpG sites extracted from our analysis.

First we need to understand what the array annotation looks like:

# explore annotation
head(meth.anno)  # see what the annotation looks like
IlmnID CHR MAPINFO UCSC_RefGene_Name UCSC_RefGene_Group CHR_hg38 start_38 end_hg38 strand variant SNP filter
11645 cg00000165 chr1 91194674 chr1 90729117 90729118 - NA NA 1
11646 cg00000363 chr1 230560793 chr1 230425047 230425048 + NA NA 1
11647 cg00000957 chr1 5937253 NPHP4 Body chr1 5877193 5877194 + G/A rs199992272 1
11648 cg00001349 chr1 166958439 MAEL TSS200 chr1 166989202 166989203 - NA NA 1
11649 cg00001364 chr1 214170376 PROX1 Body chr1 213997033 213997034 - NA NA 1
11650 cg00001446 chr1 43831041 ELOVL1 Body chr1 43365370 43365371 - NA NA 1

Now we get a subset of this annotation - this is sub-setting the annotation based on the sites that passed our analysis:

# subset annotation based on the top.results
anno.sub <- meth.anno[meth.anno$IlmnID %in% rownames(top.results), ]

Lets explore this a little. How many β€˜significant’ CpG sites are there per chromosome?

# how many different features can be explored?
table(anno.sub$CHR)  # number of significant sites per chromosome
chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
750 262 419 288 114 205 217 373 544 51 367 413 148 56 182 297 197 271 401 367 249 139 6

How many unique genes are there?

length(unique(unlist(strsplit(anno.sub$UCSC_RefGene_Name, split = ";"))))  # number of unique genes
## [1] 3223

We’ll extract these to an object called genes:

# get the gene list from the provided 450K annotation object
genes <- unique(unlist(strsplit(anno.sub$UCSC_RefGene_Name, split = ";")))

Then we can write this to a text file:

# write list of genes out to a file
write.table(genes, "EWAS_blood_buccal_genelist.txt", row.names = F, col.names = F, 
    quote = F)

This is the final list of genes which can now be uploaded to your pathways analysis tool.

Pathways analysis

For this section we will be using a popular web-server tool, WEB-based GEne SeT AnaLysis Toolkit (http://webgestalt.org).

Here is an overview of parameters to set:

You’re now ready to click the submit button and explore your pathways analysis.

Explore the resultant output.

Have a look at the genes that make up some of the enriched pathways, for example does it make sense to see these pathways? Explore some of the functionality of genes such as CORO1C, EGFR, LAT, and MAEA, does this support your hypothesis about the role these pathways might be playing?

Have a β€˜play’ with selecting other parameters and seeing what else you can potentially discover about the genes involved in differentiating these two tissue types at a methlyation level.


Extra: Explore sex effects on the data

(Note: This section can be run if there is spare time at the end of the workshop.)

Sex can be an important factor and potentially confounder to many analyses, so here we will explore whether there are samples from both sexes present and if they impact on the final results.

Install/Require packages

We’re going to be using the brilliant FactoMineR package (http://factominer.free.fr/) during this little experiment, so if it’s not currently installed make sure to do so, then load the package:

# install.packages('FactoMineR')
require(FactoMineR)   # used for PCA and plotting

Define and extract sex markers (chrX and Y)

First define how many X and Y chr markers are present on the 450k array:

table(meth.anno$CHR == 'chrX' | meth.anno$CHR == 'chrY')
FALSE TRUE
473697 11644

These can be extracted:

sex.markers <- meth.anno[meth.anno$CHR == 'chrX' | meth.anno$CHR == 'chrY',]$IlmnID

Then you can figure out how many of these are in the data that you have loaded:

table(rownames(betas) %in% sex.markers)
FALSE TRUE
436292 10344
sex.betas <- betas[rownames(betas) %in% sex.markers,]

Look at the data, can you see any patterns?

head(sex.betas)
GSM1179534 GSM1179535 GSM1179536 GSM1179537 GSM990423 GSM1179540 GSM1179542 GSM1224032 GSM1224035 GSM1224037
cg00063477 0.5313657 0.5313294 0.9171985 0.9198911 0.4691203 0.3705699 0.4787351 0.9481618 0.4540514 0.9248766
cg00121626 0.5313752 0.5313375 0.4340551 0.4122939 0.4691767 0.2490480 0.4787418 0.4317573 0.0001380 0.4147529
cg00212031 0.5313847 0.5313474 0.0471097 0.0457456 0.4692831 0.1768156 0.4787480 0.0001249 0.3888040 0.0175207
cg00213748 0.5313897 0.5313551 0.7857719 0.7865042 0.4692893 0.4675472 0.4787521 0.8891483 0.0837567 0.8326294
cg00214611 0.5313956 0.5313607 0.0506264 0.0525789 0.4693121 0.1979152 0.4787573 0.0153420 0.4159638 0.0223633
cg00271873 0.5314066 0.5313697 0.8009416 0.8236546 0.4694095 0.1734459 0.4787680 0.7834561 0.1834110 0.7495785

How does it look when clustered?

# Hierarchical Clustering 
d <- t(sex.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on X and Y Chr CpG sites", labels = tissues, xlab = '', sub="")

This data is obviously mixed sex. We actually know that there are 4 males in this data set so we can create a sex variable to use in further exploratory analysis:

sex <- as.factor(c('female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'female', 'male'))

We are now going to perform a principle components analysis (PCA) as a means to further explore this data:

# set up the data for PCA
pca.data <- t(sex.betas)
pca.data <- as.data.frame(pca.data)
pca.data$tissue <- as.factor(tissues)
pca.data$sex <- sex
# perform the PCA
res.pca <- PCA(pca.data, quali.sup = c(ncol(pca.data), ncol(pca.data)-1), graph = FALSE)

Create a PCA plot:

plotellipses(res.pca, keepvar = c("tissue", "sex"), axes = c(1, 2), cex = 0.65, 
    legend = list(bty = "y", x = "topright"))

What can you infer from the PCA plots? Does it seem like sex could be influencing the analysis?

We can actually run a PCA on ALL CpG sites, this will help us to decide whether sex is potentially impacting our analysis.

WARNING: the below will take several minutes (~5 depending on the machine) to complete.

# set up the data for PCA
pca.data <- t(betas)
pca.data <- as.data.frame(pca.data)
pca.data$tissue <- as.factor(tissues)
pca.data$sex <- sex
# perform the PCA
res.pca <- PCA(pca.data, quali.sup = c(ncol(pca.data), ncol(pca.data) - 1), 
    graph = FALSE)
plotellipses(res.pca, keepvar = c("tissue", "sex"), axes = c(1, 2), cex = 0.65, 
    legend = list(bty = "y", x = "topright"))

What can we infer about sex and it’s influence on our analysis from the above?


Powered by RMarkdown