Introduction to methlyation and EWAS workshop

This is a basic EWAS tutorial, introducing some of the concepts and tools for analysing methylation data.

The workshop uses public Illumina 450K data from several studies (GSE40279, GSE48472, GSE41114, GSE42700, GSE46573, GSE48472, GSE50586). It is currently designed to investigate the methylation differences between blood (whole blood) and buccal (cheek swabs) tissues. It is aimed at giving a very basic overview of the structure and type of data used in EWAS as well as a few of the available tools and methods to explore this data.

Note: Before starting this workshop please visit GitHub and make sure you have downloaded the repository and have set up all the required software for this workshop.

1. load required packages

We need the minfi package installed for this workshop.

If you need to install minfi follow the steps below:

source("https://bioconductor.org/biocLite.R")
biocLite("minfi")

Now load the package:

require(minfi)

2. load data

The data for this workshop has been provided for you. There are 2 main data files:

annotation for the Illumina 450k methylation array
methylation data (beta values) for 5 whole blood and 5 buccal samples

Load this into R:

meth.anno <- readRDS('data/Ilmn450K_anno.RDS')
betas <- readRDS('data/blood_buccal.RDS')

3. set up samples/data and basic QC explore

We’ll create an object called tissues which provides identification of our blood and buccal samples:

# define samples
tissues <- c(rep(0, 5), rep(1, 5))
tissues <- as.factor(tissues)
levels(tissues) <- c('Blood', 'Buccal')

Now we’ll explore the distribution of methylation values (beta values) by creating a density plot. A methylation distribution for a given individual follows a bimodal distribution, with peaks around 0 (unmethylated) and 1 (methylated).

Create a density plot for all samples, what do you see?

# create a density plot
densityPlot(betas, sampGroups = tissues)

Another way of exploring a data set is by cluster analysis.

Here we perform hierarchical clustering on the 10 samples using all available data:

# Hierarchical Clustering
d <- t(betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on all probes", labels = tissues, xlab = '', sub="")

What do you conclude from the above?

4. run dmpFinder - logistic regression

To run our analysis exploring the difference in methylation between blood and buccal we’ll use the dmpFinder function from the minfi package.

Depending on the ‘outcome’ (trait/phenotype being investigated) dmpFinder can be configured for either categorical (logistic regression) or continuous (linear regression). Here we’re looking at blood tissue vs buccal tissue, so a categorical trait - we’ll be using logistic regression modelling.

Run dmpFinder:

# minfi dmpfinder
results <- dmpFinder(betas, pheno = tissues, type = "categorical", qCutoff = 1)
results <- na.omit(results)

Above we create a results object which contains all CpG sites (methylation sites) ranked by p-value (significance), with the most significant sites first.

Look at the head of results (most significant):

head(results)

	intercept	f	qval
cg26297005	0.9248713	2779.165	2.6e-06
cg16780847	0.9185569	2179.933	2.6e-06
cg08780218	0.8801236	2021.894	2.6e-06
cg07759042	0.9810960	2003.060	2.6e-06
cg05350268	0.1052552	1776.359	3.3e-06
cg26071526	0.0753421	1444.052	4.4e-06

We can also explore the tail (least significant):

tail(results)

	intercept	pval	qval
cg18765133	0.9163265	0.9999703	0.3358529
cg13671590	0.8678395	0.9999785	0.3358539
cg01260615	0.9012077	0.9999799	0.3358539
cg14241748	0.6394218	0.9999801	0.3358539
cg25647092	0.8777785	0.9999956	0.3358582
cg10942084	0.8558640	0.9999973	0.3358582

The Q-value is an adjustment for multiple correction bias. We’re going to extract all CpG sites which have a Q-value less than 0.00005.

First, how many sites will this be?

# explore some results
table(results$qval < 0.00005)

FALSE	TRUE
440320	6316

Now extract these sites into an object called top.results:

top.results <- results[results$qval <= 0.00005,]

For fun we can explore how many sites pass a stringent Bonferroni correction:

Note: Bonferroni threshold is 0.05/[all observations], so we’ll use 0.05/450000 which is 1.1e-7.

# bonferroni correction
table(results$pval <= (0.05/450000))

FALSE	TRUE
445694	942

5. plotting the results

It is important to be able to visualise your data/results, so here we’re going to use some of the graphing functions of R to do just that.

First lets try the plotCpg function from the minfi package:

## very basic plotting top CpG marker ('best'/most significantly associated)
plotCpg(betas, cpg = "cg05350268", pheno = tissues, type = "categorical", measure = "beta", 
    ylab = "methylation (beta)", ylim = c(0, 1))
# bottom CpG marker ('best'/most significantly associated)
plotCpg(betas, cpg = "cg07309136", pheno = tissues, type = "categorical", measure = "beta", 
    ylab = "methylation (beta)", ylim = c(0, 1))

We can also use base R plotting to create boxplots:

## boxplots lets look at a few boxplots
boxplot(betas["cg05350268", ] ~ tissues, col = c("darkred", "cadetblue"), ylim = c(0, 
    1), main = "cg05350268")
boxplot(betas["cg19653345", ] ~ tissues, col = c("darkred", "cadetblue"), ylim = c(0, 
    1), main = "cg19653345")

If we want to create more than one plot at a time we can do so:

## multiple boxplots loop through a few sites for the top markers (most
## significantly associated)
tissue.sites <- rownames(top.results)[c(1:9)]

par(mfrow = c(3, 3))
for (cpg in tissue.sites) {
    
    boxplot(betas[cpg, ] ~ tissues, col = c("darkred", "cadetblue"), ylab = "methylation (beta)", 
        xlab = "tissue", main = cpg, ylim = c(0, 1))
    
}

We can also create a pdf containing these plots:

## multiple boxplots can also write this out to a file (pdf)
pdf("figures/blood_v_buccal_topsites_plot.pdf", width = 7, height = 10)
tissue.sites <- rownames(top.results)[c(1:12)]

par(mfrow = c(4, 3))
for (cpg in tissue.sites) {
    
    boxplot(betas[cpg, ] ~ tissues, col = c("darkred", "cadetblue"), ylab = "methylation (beta)", 
        xlab = "tissue", main = cpg)
    
}
dev.off()

If you navigate to the figures/ directory you’ll find the pdf created above.

Hierarchical Clustering 2

We previously performed clustering on all CpG sites, now we’re going to cluster on those that pass our defined threshold.

First extract a smaller matrix of all significant CpG sites:

# subset the beta matrix by the top markers and cluster
sig.betas <- betas[rownames(betas) %in% rownames(top.results), ]

Now perform the clustering:

# Hierarchical Clustering
d <- t(sig.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top tissue discrimination CpG sites", labels = tissues, 
    xlab = "", sub = "")

What do you see?

6. prepare and run pathways analysis

Before we can perform a pathways analysis we need to create a list of genes which underlie the CpG sites extracted from our analysis.

First we need to understand what the array annotation looks like:

# explore annotation
head(meth.anno)  # see what the annotation looks like

	IlmnID	CHR	MAPINFO	UCSC_RefGene_Name	UCSC_RefGene_Group	CHR_hg38	start_38	end_hg38	strand	variant	SNP	filter
11645	cg00000165	chr1	91194674			chr1	90729117	90729118	-	NA	NA	1
11646	cg00000363	chr1	230560793			chr1	230425047	230425048	+	NA	NA	1
11647	cg00000957	chr1	5937253	NPHP4	Body	chr1	5877193	5877194	+	G/A	rs199992272	1
11648	cg00001349	chr1	166958439	MAEL	TSS200	chr1	166989202	166989203	-	NA	NA	1
11649	cg00001364	chr1	214170376	PROX1	Body	chr1	213997033	213997034	-	NA	NA	1
11650	cg00001446	chr1	43831041	ELOVL1	Body	chr1	43365370	43365371	-	NA	NA	1

Now we get a subset of this annotation - this is sub-setting the annotation based on the sites that passed our analysis:

# subset annotation based on the top.results
anno.sub <- meth.anno[meth.anno$IlmnID %in% rownames(top.results), ]

Lets explore this a little. How many ‘significant’ CpG sites are there per chromosome?

# how many different features can be explored?
table(anno.sub$CHR)  # number of significant sites per chromosome

chr1	chr10	chr11	chr12	chr13	chr14	chr15	chr16	chr17	chr18	chr19	chr2	chr20	chr21	chr22	chr3	chr4	chr5	chr6	chr7	chr8	chr9	chrX
750	262	419	288	114	205	217	373	544	51	367	413	148	56	182	297	197	271	401	367	249	139	6

How many unique genes are there?

length(unique(unlist(strsplit(anno.sub$UCSC_RefGene_Name, split = ";"))))  # number of unique genes

## [1] 3223

We’ll extract these to an object called genes:

# get the gene list from the provided 450K annotation object
genes <- unique(unlist(strsplit(anno.sub$UCSC_RefGene_Name, split = ";")))

Then we can write this to a text file:

# write list of genes out to a file
write.table(genes, "EWAS_blood_buccal_genelist.txt", row.names = F, col.names = F, 
    quote = F)

This is the final list of genes which can now be uploaded to your pathways analysis tool.

Pathways analysis

For this section we will be using a popular web-server tool, WEB-based GEne SeT AnaLysis Toolkit (http://webgestalt.org).

Here is an overview of parameters to set:

select hsapiens as organism;
select Overrepresentation Enrichment Analysis (ORA) as method;
select geneontology AND Biological_Process
- NOTE: after you’ve run through Biological_Process try again with Cellular_Component and Molecular_Function
for Gene ID select genesymbol
then use the browse button to select and upload the list of genes you created earlier
for reference set select genome
in the Advanced parameters section set minimum number of genes to 2 and leave the rest of these settings as default

You’re now ready to click the submit button and explore your pathways analysis.

Explore the resultant output.

Have a look at the genes that make up some of the enriched pathways, for example does it make sense to see these pathways? Explore some of the functionality of genes such as CORO1C, EGFR, LAT, and MAEA, does this support your hypothesis about the role these pathways might be playing?

Have a ‘play’ with selecting other parameters and seeing what else you can potentially discover about the genes involved in differentiating these two tissue types at a methlyation level.

Extra: Explore sex effects on the data

(Note: This section can be run if there is spare time at the end of the workshop.)

Sex can be an important factor and potentially confounder to many analyses, so here we will explore whether there are samples from both sexes present and if they impact on the final results.

Install/Require packages

We’re going to be using the brilliant FactoMineR package (http://factominer.free.fr/) during this little experiment, so if it’s not currently installed make sure to do so, then load the package:

# install.packages('FactoMineR')
require(FactoMineR)   # used for PCA and plotting

Define and extract sex markers (chrX and Y)

First define how many X and Y chr markers are present on the 450k array:

table(meth.anno$CHR == 'chrX' | meth.anno$CHR == 'chrY')

FALSE	TRUE
473697	11644

These can be extracted:

sex.markers <- meth.anno[meth.anno$CHR == 'chrX' | meth.anno$CHR == 'chrY',]$IlmnID

Then you can figure out how many of these are in the data that you have loaded:

table(rownames(betas) %in% sex.markers)

FALSE	TRUE
436292	10344

sex.betas <- betas[rownames(betas) %in% sex.markers,]

Look at the data, can you see any patterns?

head(sex.betas)

	GSM1179534	GSM1179535	GSM1179536	GSM1179537	GSM990423	GSM1179540	GSM1179542	GSM1224032	GSM1224035	GSM1224037
cg00063477	0.5313657	0.5313294	0.9171985	0.9198911	0.4691203	0.3705699	0.4787351	0.9481618	0.4540514	0.9248766
cg00121626	0.5313752	0.5313375	0.4340551	0.4122939	0.4691767	0.2490480	0.4787418	0.4317573	0.0001380	0.4147529
cg00212031	0.5313847	0.5313474	0.0471097	0.0457456	0.4692831	0.1768156	0.4787480	0.0001249	0.3888040	0.0175207
cg00213748	0.5313897	0.5313551	0.7857719	0.7865042	0.4692893	0.4675472	0.4787521	0.8891483	0.0837567	0.8326294
cg00214611	0.5313956	0.5313607	0.0506264	0.0525789	0.4693121	0.1979152	0.4787573	0.0153420	0.4159638	0.0223633
cg00271873	0.5314066	0.5313697	0.8009416	0.8236546	0.4694095	0.1734459	0.4787680	0.7834561	0.1834110	0.7495785

How does it look when clustered?

# Hierarchical Clustering 
d <- t(sex.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on X and Y Chr CpG sites", labels = tissues, xlab = '', sub="")

This data is obviously mixed sex. We actually know that there are 4 males in this data set so we can create a sex variable to use in further exploratory analysis:

sex <- as.factor(c('female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'female', 'male'))

We are now going to perform a principle components analysis (PCA) as a means to further explore this data:

# set up the data for PCA
pca.data <- t(sex.betas)
pca.data <- as.data.frame(pca.data)
pca.data$tissue <- as.factor(tissues)
pca.data$sex <- sex
# perform the PCA
res.pca <- PCA(pca.data, quali.sup = c(ncol(pca.data), ncol(pca.data)-1), graph = FALSE)

Create a PCA plot:

plotellipses(res.pca, keepvar = c("tissue", "sex"), axes = c(1, 2), cex = 0.65, 
    legend = list(bty = "y", x = "topright"))

What can you infer from the PCA plots? Does it seem like sex could be influencing the analysis?

We can actually run a PCA on ALL CpG sites, this will help us to decide whether sex is potentially impacting our analysis.

WARNING: the below will take several minutes (~5 depending on the machine) to complete.

# set up the data for PCA
pca.data <- t(betas)
pca.data <- as.data.frame(pca.data)
pca.data$tissue <- as.factor(tissues)
pca.data$sex <- sex
# perform the PCA
res.pca <- PCA(pca.data, quali.sup = c(ncol(pca.data), ncol(pca.data) - 1), 
    graph = FALSE)

plotellipses(res.pca, keepvar = c("tissue", "sex"), axes = c(1, 2), cex = 0.65, 
    legend = list(bty = "y", x = "topright"))

What can we infer about sex and it’s influence on our analysis from the above?