This is a basic EWAS tutorial, introducing some of the concepts and tools for analysing methylation data.
The workshop uses public Illumina 450K data from several studies (GSE40279, GSE48472, GSE41114, GSE42700, GSE46573, GSE48472, GSE50586). It is currently designed to investigate the methylation differences between blood (whole blood) and buccal (cheek swabs) tissues. It is aimed at giving a very basic overview of the structure and type of data used in EWAS as well as a few of the available tools and methods to explore this data.
Note: Before starting this workshop please visit GitHub and make sure you have downloaded the repository and have set up all the required software for this workshop.
We need the minfi
package installed for this workshop.
If you need to install minfi
follow the steps below:
source("https://bioconductor.org/biocLite.R")
biocLite("minfi")
Now load the package:
require(minfi)
The data for this workshop has been provided for you. There are 2 main data files:
Load this into R:
meth.anno <- readRDS('data/Ilmn450K_anno.RDS')
betas <- readRDS('data/blood_buccal.RDS')
Weβll create an object called tissues
which provides identification of our blood and buccal samples:
# define samples
tissues <- c(rep(0, 5), rep(1, 5))
tissues <- as.factor(tissues)
levels(tissues) <- c('Blood', 'Buccal')
Now weβll explore the distribution of methylation values (beta values) by creating a density plot. A methylation distribution for a given individual follows a bimodal distribution, with peaks around 0 (unmethylated) and 1 (methylated).
Create a density plot for all samples, what do you see?
# create a density plot
densityPlot(betas, sampGroups = tissues)
Another way of exploring a data set is by cluster analysis.
Here we perform hierarchical clustering on the 10 samples using all available data:
# Hierarchical Clustering
d <- t(betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on all probes", labels = tissues, xlab = '', sub="")
What do you conclude from the above?
To run our analysis exploring the difference in methylation between blood and buccal weβll use the dmpFinder
function from the minfi
package.
Depending on the βoutcomeβ (trait/phenotype being investigated) dmpFinder
can be configured for either categorical (logistic regression) or continuous (linear regression). Here weβre looking at blood tissue vs buccal tissue, so a categorical trait - weβll be using logistic regression modelling.
Run dmpFinder
:
# minfi dmpfinder
results <- dmpFinder(betas, pheno = tissues, type = "categorical", qCutoff = 1)
results <- na.omit(results)
Above we create a results
object which contains all CpG sites (methylation sites) ranked by p-value (significance), with the most significant sites first.
Look at the head of results
(most significant):
head(results)
intercept | f | pval | qval | |
---|---|---|---|---|
cg26297005 | 0.9248713 | 2779.165 | 0 | 2.6e-06 |
cg16780847 | 0.9185569 | 2179.933 | 0 | 2.6e-06 |
cg08780218 | 0.8801236 | 2021.894 | 0 | 2.6e-06 |
cg07759042 | 0.9810960 | 2003.060 | 0 | 2.6e-06 |
cg05350268 | 0.1052552 | 1776.359 | 0 | 3.3e-06 |
cg26071526 | 0.0753421 | 1444.052 | 0 | 4.4e-06 |
We can also explore the tail (least significant):
tail(results)
intercept | f | pval | qval | |
---|---|---|---|---|
cg18765133 | 0.9163265 | 0 | 0.9999703 | 0.3358529 |
cg13671590 | 0.8678395 | 0 | 0.9999785 | 0.3358539 |
cg01260615 | 0.9012077 | 0 | 0.9999799 | 0.3358539 |
cg14241748 | 0.6394218 | 0 | 0.9999801 | 0.3358539 |
cg25647092 | 0.8777785 | 0 | 0.9999956 | 0.3358582 |
cg10942084 | 0.8558640 | 0 | 0.9999973 | 0.3358582 |
The Q-value is an adjustment for multiple correction bias. Weβre going to extract all CpG sites which have a Q-value less than 0.00005.
First, how many sites will this be?
# explore some results
table(results$qval < 0.00005)
FALSE | TRUE |
---|---|
440320 | 6316 |
Now extract these sites into an object called top.results
:
top.results <- results[results$qval <= 0.00005,]
For fun we can explore how many sites pass a stringent Bonferroni correction:
Note: Bonferroni threshold is 0.05/[all observations]
, so weβll use 0.05/450000
which is 1.1e-7
.
# bonferroni correction
table(results$pval <= (0.05/450000))
FALSE | TRUE |
---|---|
445694 | 942 |
It is important to be able to visualise your data/results, so here weβre going to use some of the graphing functions of R
to do just that.
First lets try the plotCpg
function from the minfi
package:
## very basic plotting top CpG marker ('best'/most significantly associated)
plotCpg(betas, cpg = "cg05350268", pheno = tissues, type = "categorical", measure = "beta",
ylab = "methylation (beta)", ylim = c(0, 1))
# bottom CpG marker ('best'/most significantly associated)
plotCpg(betas, cpg = "cg07309136", pheno = tissues, type = "categorical", measure = "beta",
ylab = "methylation (beta)", ylim = c(0, 1))
We can also use base R
plotting to create boxplots:
## boxplots lets look at a few boxplots
boxplot(betas["cg05350268", ] ~ tissues, col = c("darkred", "cadetblue"), ylim = c(0,
1), main = "cg05350268")
boxplot(betas["cg19653345", ] ~ tissues, col = c("darkred", "cadetblue"), ylim = c(0,
1), main = "cg19653345")
If we want to create more than one plot at a time we can do so:
## multiple boxplots loop through a few sites for the top markers (most
## significantly associated)
tissue.sites <- rownames(top.results)[c(1:9)]
par(mfrow = c(3, 3))
for (cpg in tissue.sites) {
boxplot(betas[cpg, ] ~ tissues, col = c("darkred", "cadetblue"), ylab = "methylation (beta)",
xlab = "tissue", main = cpg, ylim = c(0, 1))
}
We can also create a pdf containing these plots:
## multiple boxplots can also write this out to a file (pdf)
pdf("figures/blood_v_buccal_topsites_plot.pdf", width = 7, height = 10)
tissue.sites <- rownames(top.results)[c(1:12)]
par(mfrow = c(4, 3))
for (cpg in tissue.sites) {
boxplot(betas[cpg, ] ~ tissues, col = c("darkred", "cadetblue"), ylab = "methylation (beta)",
xlab = "tissue", main = cpg)
}
dev.off()
If you navigate to the figures/
directory youβll find the pdf created above.
We previously performed clustering on all CpG sites, now weβre going to cluster on those that pass our defined threshold.
First extract a smaller matrix of all significant CpG sites:
# subset the beta matrix by the top markers and cluster
sig.betas <- betas[rownames(betas) %in% rownames(top.results), ]
Now perform the clustering:
# Hierarchical Clustering
d <- t(sig.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top tissue discrimination CpG sites", labels = tissues,
xlab = "", sub = "")
What do you see?
Before we can perform a pathways analysis we need to create a list of genes which underlie the CpG sites extracted from our analysis.
First we need to understand what the array annotation looks like:
# explore annotation
head(meth.anno) # see what the annotation looks like
IlmnID | CHR | MAPINFO | UCSC_RefGene_Name | UCSC_RefGene_Group | CHR_hg38 | start_38 | end_hg38 | strand | variant | SNP | filter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
11645 | cg00000165 | chr1 | 91194674 | chr1 | 90729117 | 90729118 | - | NA | NA | 1 | ||
11646 | cg00000363 | chr1 | 230560793 | chr1 | 230425047 | 230425048 | + | NA | NA | 1 | ||
11647 | cg00000957 | chr1 | 5937253 | NPHP4 | Body | chr1 | 5877193 | 5877194 | + | G/A | rs199992272 | 1 |
11648 | cg00001349 | chr1 | 166958439 | MAEL | TSS200 | chr1 | 166989202 | 166989203 | - | NA | NA | 1 |
11649 | cg00001364 | chr1 | 214170376 | PROX1 | Body | chr1 | 213997033 | 213997034 | - | NA | NA | 1 |
11650 | cg00001446 | chr1 | 43831041 | ELOVL1 | Body | chr1 | 43365370 | 43365371 | - | NA | NA | 1 |
Now we get a subset of this annotation - this is sub-setting the annotation based on the sites that passed our analysis:
# subset annotation based on the top.results
anno.sub <- meth.anno[meth.anno$IlmnID %in% rownames(top.results), ]
Lets explore this a little. How many βsignificantβ CpG sites are there per chromosome?
# how many different features can be explored?
table(anno.sub$CHR) # number of significant sites per chromosome
chr1 | chr10 | chr11 | chr12 | chr13 | chr14 | chr15 | chr16 | chr17 | chr18 | chr19 | chr2 | chr20 | chr21 | chr22 | chr3 | chr4 | chr5 | chr6 | chr7 | chr8 | chr9 | chrX |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
750 | 262 | 419 | 288 | 114 | 205 | 217 | 373 | 544 | 51 | 367 | 413 | 148 | 56 | 182 | 297 | 197 | 271 | 401 | 367 | 249 | 139 | 6 |
How many unique genes are there?
length(unique(unlist(strsplit(anno.sub$UCSC_RefGene_Name, split = ";")))) # number of unique genes
## [1] 3223
Weβll extract these to an object called genes
:
# get the gene list from the provided 450K annotation object
genes <- unique(unlist(strsplit(anno.sub$UCSC_RefGene_Name, split = ";")))
Then we can write this to a text file:
# write list of genes out to a file
write.table(genes, "EWAS_blood_buccal_genelist.txt", row.names = F, col.names = F,
quote = F)
This is the final list of genes which can now be uploaded to your pathways analysis tool.
For this section we will be using a popular web-server tool, WEB-based GEne SeT AnaLysis Toolkit (http://webgestalt.org).
Here is an overview of parameters to set:
hsapiens
as organism;Overrepresentation Enrichment Analysis (ORA)
as method;geneontology
AND Biological_Process
Biological_Process
try again with Cellular_Component
and Molecular_Function
genesymbol
genome
2
and leave the rest of these settings as defaultYouβre now ready to click the submit
button and explore your pathways analysis.
Explore the resultant output.
Have a look at the genes that make up some of the enriched pathways, for example does it make sense to see these pathways? Explore some of the functionality of genes such as CORO1C, EGFR, LAT, and MAEA, does this support your hypothesis about the role these pathways might be playing?
Have a βplayβ with selecting other parameters and seeing what else you can potentially discover about the genes involved in differentiating these two tissue types at a methlyation level.
(Note: This section can be run if there is spare time at the end of the workshop.)
Sex can be an important factor and potentially confounder to many analyses, so here we will explore whether there are samples from both sexes present and if they impact on the final results.
Weβre going to be using the brilliant FactoMineR
package (http://factominer.free.fr/) during this little experiment, so if itβs not currently installed make sure to do so, then load the package:
# install.packages('FactoMineR')
require(FactoMineR) # used for PCA and plotting
First define how many X and Y chr markers are present on the 450k array:
table(meth.anno$CHR == 'chrX' | meth.anno$CHR == 'chrY')
FALSE | TRUE |
---|---|
473697 | 11644 |
These can be extracted:
sex.markers <- meth.anno[meth.anno$CHR == 'chrX' | meth.anno$CHR == 'chrY',]$IlmnID
Then you can figure out how many of these are in the data that you have loaded:
table(rownames(betas) %in% sex.markers)
FALSE | TRUE |
---|---|
436292 | 10344 |
sex.betas <- betas[rownames(betas) %in% sex.markers,]
Look at the data, can you see any patterns?
head(sex.betas)
GSM1179534 | GSM1179535 | GSM1179536 | GSM1179537 | GSM990423 | GSM1179540 | GSM1179542 | GSM1224032 | GSM1224035 | GSM1224037 | |
---|---|---|---|---|---|---|---|---|---|---|
cg00063477 | 0.5313657 | 0.5313294 | 0.9171985 | 0.9198911 | 0.4691203 | 0.3705699 | 0.4787351 | 0.9481618 | 0.4540514 | 0.9248766 |
cg00121626 | 0.5313752 | 0.5313375 | 0.4340551 | 0.4122939 | 0.4691767 | 0.2490480 | 0.4787418 | 0.4317573 | 0.0001380 | 0.4147529 |
cg00212031 | 0.5313847 | 0.5313474 | 0.0471097 | 0.0457456 | 0.4692831 | 0.1768156 | 0.4787480 | 0.0001249 | 0.3888040 | 0.0175207 |
cg00213748 | 0.5313897 | 0.5313551 | 0.7857719 | 0.7865042 | 0.4692893 | 0.4675472 | 0.4787521 | 0.8891483 | 0.0837567 | 0.8326294 |
cg00214611 | 0.5313956 | 0.5313607 | 0.0506264 | 0.0525789 | 0.4693121 | 0.1979152 | 0.4787573 | 0.0153420 | 0.4159638 | 0.0223633 |
cg00271873 | 0.5314066 | 0.5313697 | 0.8009416 | 0.8236546 | 0.4694095 | 0.1734459 | 0.4787680 | 0.7834561 | 0.1834110 | 0.7495785 |
How does it look when clustered?
# Hierarchical Clustering
d <- t(sex.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on X and Y Chr CpG sites", labels = tissues, xlab = '', sub="")
This data is obviously mixed sex. We actually know that there are 4 males in this data set so we can create a sex variable to use in further exploratory analysis:
sex <- as.factor(c('female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'female', 'male'))
We are now going to perform a principle components analysis (PCA) as a means to further explore this data:
# set up the data for PCA
pca.data <- t(sex.betas)
pca.data <- as.data.frame(pca.data)
pca.data$tissue <- as.factor(tissues)
pca.data$sex <- sex
# perform the PCA
res.pca <- PCA(pca.data, quali.sup = c(ncol(pca.data), ncol(pca.data)-1), graph = FALSE)
Create a PCA plot:
plotellipses(res.pca, keepvar = c("tissue", "sex"), axes = c(1, 2), cex = 0.65,
legend = list(bty = "y", x = "topright"))
What can you infer from the PCA plots? Does it seem like sex could be influencing the analysis?
We can actually run a PCA on ALL CpG sites, this will help us to decide whether sex is potentially impacting our analysis.
WARNING: the below will take several minutes (~5 depending on the machine) to complete.
# set up the data for PCA
pca.data <- t(betas)
pca.data <- as.data.frame(pca.data)
pca.data$tissue <- as.factor(tissues)
pca.data$sex <- sex
# perform the PCA
res.pca <- PCA(pca.data, quali.sup = c(ncol(pca.data), ncol(pca.data) - 1),
graph = FALSE)
plotellipses(res.pca, keepvar = c("tissue", "sex"), axes = c(1, 2), cex = 0.65,
legend = list(bty = "y", x = "topright"))
What can we infer about sex and itβs influence on our analysis from the above?
Powered by RMarkdown