class: center, middle, inverse, title-slide

# .huge[Data Science:]
## skills and capabilities
### Dr Miles Benton<br />Senior Scientist, ESR
### (or soon to be, 2<sup>nd</sup> July 2018)

---
class: middle

# .center[Advocate for reproducible research...]

## .center[Where possible my presentations and code are available online]

<br />

<p>
.center[
<!-- ![](images/github_logo.png) -->
<img src="images/github_logo.png" style="width: 520px; margin-right: 1%; margin-top: 1.5em;"/>
<img src="images/sirselim_qrcode.png" style="width: 202px; margin-right: 1%; margin-top: 1.5em;"/>
]
</p>

<br />

.center[[sirselim.github.io/presentations](http://sirselim.github.io/presentations)]

---
class: inverse

### Molecular Pathology, Human Genetics, <span style="color:lightblue">Computational Genomics</span> / Bioinformatics PhD

.pull-left[.medium[
#### <span style="color:lightblue">Languages</span>

- *Proficient*
  - R
  - GNU/UNIX
  - Bash
  - HTML / CSS
- *Developing*
  - Python
  - JavaScript
  - Julia

#### <span style="color:lightblue">Writing/Reporting</span>

- Markdown / RMarkdown
- LaTeX

#### <span style="color:lightblue">Versioning</span>

- Git (GitHub / GitLab / Bitbucket)
- SVN
]]

.pull-right[.medium[
#### <span style="color:lightblue">Analysis</span>

- Statistics
- Machine learning (glmnet, elastic net, random-forest)
- Pipeline development
- WGS & Exome
- Methylation (array and sequence)
- Transcript, smallRNA

#### <span style="color:lightblue">Visualisation</span>

- Shiny / Shinydashboard
- Markdown / RMarkdown
- D3
- HTML
- xaringan / reveal.js
- Inkscape

#### <span style="color:lightblue">Other</span>

- Linux system admin
- Cluster computing
]]

---
# Overview

**1. Me briefly**

<br>

--

**2. Select projects to highlight skill set<sup>*</sup>**

- Obesity Methylation
- bootNet
- WGS pipeline
- Methylation pipeline
- Diagnostics Annotation and Variant Reporting software (DART)
- Electronic notebook

<sup>*</sup><i>aim to demonstrate integration of data, analysis and visualisation with these</i>

<br>

--

**3. Other tools**

<br>

--

**4. Collaborations**

---
class: inverse middle

.large[...it starts with...]

# NORFOLK ISLAND

---
class: middle center
---
class: inverse middle center

<p>
.center[<img src="images/NI_ped_mt.png" style="width: 100%; margin-right: 1%; margin-top: -0.5em; border: 3px solid white;"/>]
</p>

40% of current population haplogroup <span style="color:lightblue">B4a1a[...]</span><br />

<p style="font-size: 14px">Benton MC <i>et al.:</i> <a href="https://investigativegenetics.biomedcentral.com/articles/10.1186/s13323-015-0028-9" target="blank"><i>“Mutiny on the Bounty”: the genetic history of Norfolk Island reveals extreme gender-biased admixture.</i></a> Investigative Genetics 2015, 6:11.</p>

---
class: inverse

### <span style="color:lightblue">My PhD took me from basic GWAS...</span>

<span style="color:#3498DB">**Benton MC**</span>, Lea RA, <span style="color:#3498DB">**Macartney-Coxson D**</span>, Carless MA, Bellis C, Hanna M, Eccles D, Chambers GK, Curran JE, Blangero J and Griffiths LR. (2015) *Serum bilirubin concentration is modified by UGT1A1 Haplotypes and Influences Risk of Type-2 Diabetes in the Norfolk Island Genetic Isolate*. **BMC Genetics** 16(1) [[article]](http://www.biomedcentral.com/1471-2156/16/136)

--

### <span style="color:lightblue">through integrative 'omics...</span>

<span style="color:#3498DB">**Benton MC**</span>, Lea RA, <span style="color:#3498DB">**Macartney-Coxson D**</span>, Carless MA, Göring HH, Bellis C, Hanna M, Eccles D, Chambers GK, Curran JE, Harper JL, Blangero J and Griffiths LR. (2013) *Mapping eQTLs in the Norfolk Island Genetic Isolate Identifies Candidate Genes for CVD-risk Traits*. **American Journal of Human Genetics** 93(6): 1087-99 [[article]](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3853002/)

--

### <span style="color:lightblue">and the application of 'outside-of-the-box' statistical approaches...</span>

<span style="color:#3498DB">**Benton MC**</span>, Lea RA, <span style="color:#3498DB">**Macartney-Coxson D**</span>, Carless MA, Göring HH, Bellis C, Hanna M, Eccles D, Chambers GK, Curran JE, Harper JL, Blangero J and Griffiths LR.
(2015) *A phenomic scan of the Norfolk Island genetic isolate identifies a major pleiotropic effect locus associated with renal disorder markers*. **PLoS Genetics** 11(10) [[article]](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005593)

---
class: middle

# Projects

---
## Project: Obesity Methylation

Collaborators: Donia Macartney-Coxson

Illumina methylation arrays
- brand new = limited tools
- use what's available and 'bolt on' the rest...

Basic statistical approaches (t-test, regression, clustering)

--

<br>

Spawned some small GitHub projects:
- illumina_filtering: https://github.com/sirselim/illumina450k_filtering
  + updated to work on 450k and 850k arrays
- CpG_annotate: https://github.com/sirselim/cpg_annotate
- EWAS_workshop: https://github.com/sirselim/methylation_EWAS_workshop

<br>

--

Was able to discuss this project with **Terry Speed**, who gave me a book to read...

...which led to...

---
class: inverse

## Project: bootNet (https://github.com/sirselim/bootNet)

Collaborators: Nicole White, Ray Blick

> bootNet is a wrapper for the fantastic [glmnet](https://cran.r-project.org/web/packages/glmnet/index.html) R package - it brings bootstrapping and parallel processing to the elastic-net framework.

bootNet was designed to identify methylation sites predicting an outcome (now agnostic and integrative)

--

- the outcome can either be:
  + quantitative (continuous, e.g. age, BMI, weight … )
  + qualitative (categorical, e.g. gender, smoking, case/control … )
- integrated ability to perform bootstrapping for a given number of iterations
  + user defined sub-sampling at each iteration
  + additional sampling approaches (e.g.
'leave one out')

--

- wrote a second function to allow parallelisation
  + this allows a user to select the number of processors (cores) to split the task across
  + running on multiple cores greatly speeds up analysis

Full presentation: http://sirselim.github.io/bootNet/bootNet_presentation.html

---
class: inverse

## Project: bootNet (https://github.com/sirselim/bootNet)

Applied the principles to detect age-related genes
- [training set] Norfolk Island data
- [validation] >2500 public methylomes ([blood](images/aging_meth.png))

.small[<span style="color:#3498DB">**Benton MC**</span>, Sutherland HG, <span style="color:#3498DB">**Macartney-Coxson D**</span>, Haupt LM, Lea RA, and Griffiths LR. (2017) *Methylome-wide association study of whole blood DNA in the Norfolk Island isolate identifies robust loci associated with age*. Aging 9(3) [[article]](http://www.aging-us.com/article/101187)]

<br>

--

Under final development, R package available via GitHub upon publication

<br>

--

Graph theory: Mike Langston

Contrasting methods...

.small[<span style="color:#3498DB">**Macartney-Coxson D**</span>, <span style="color:#3498DB">**Benton MC**</span>, Blick R, Stubbs RS, Hagan RD, and <span style="color:#3498DB">**Langston MA**</span>. (2017) *Genome-wide DNA methylation analysis reveals loci that distinguish different types of adipose tissue in obese individuals*.
Clinical Epigenetics 9(48) [[article]](https://clinicalepigeneticsjournal.biomedcentral.com/articles/10.1186/s13148-017-0344-4)]

---
## Project: Whole Genome Sequencing

Collaborators: David Eccles

.medium[
<span style="color:darkblue">**N=108**</span> core pedigree individuals sequenced

Platform – **Illumina HiSeq-X10 (Garvan)**

Bioinformatics:
- BOWTIE2 -> SAMTOOLS -> VCF annotation -> dbSNP -> SNPsift{dbNSFP} -> VEP -> custom beds = fully annotated VCF

Coverage >=30X
]

--

## Project: Functional Founder Effect Variants

.medium[
<span style="color:darkblue">**Functional**</span> = Predicted damaging in *in silico* tests:
- SIFT, POLYPHEN2, MUTATIONTASTER, PROVEAN, MUTATION ASSESSOR, CADD

<span style="color:darkblue">**Founder effect**</span> = increased allele freq in NI compared to general population
- (MAF <0.01% in 1000G, >5% in NI)

<span style="color:darkblue">**Variant**</span> = single nucleotide variant (SNV)
]

---
class: inverse

## Project: Identification of allele-specific methylation profiles

measuring genome-wide allele-specific methylation (ASM)
- NGS bisulphite sequencing
- SeqCap Epi CpGiant (Illumina HiSeq)

collected data for <span style="color:lightblue"><b>108</b></span> NI individuals

<br />

--

fully customised QC and analysis pipeline:<span style="color:lightblue"><b>*</b></span>
- (fastqc, trimgalore) >>> replaced now by [fastp](https://github.com/OpenGene/fastp)
- bismark, sambamba, picard tools
- methpipe (ASM estimation)
- MethylDackel (originally PileOMeth), R and methylkit
- Shiny webserver visualisation
- all wrapped into a docker container

parallel processing enabled for local and remote machines

.center[
<span style="color:lightblue; font-size: 75%"><b>*<i>once wrangled into shape scripts/container will be accessible via GitHub/Docker Hub</i></b></span>
]

---
<p>
.center[<img src="images/circos_170906v4.png" style="width: 62%; margin-right: 1%; margin-top: -1.5em; border: 3px solid white;"/>]
</p>

---
<p>
.center[<img
src="images/circos_170906v4.png" style="width: 155%; margin-right: 1%; margin-top: -1.5em; border: 1px solid white;"/>]
</p>

---
## Project: Diagnostics Annotation and Variant Reporting software

<p>
.center[<img src="images/Pipeline_flowchart_v2.png" style="width: 80%; margin-right: 1%; margin-top: 1.5em; border: 3px solid white;"/>]
</p>

---
class: inverse

## Project: Electronic Notebook

A project I undertook to try and make life a little easier for my Honours student.
- in Bioinformatics: Electronic Notebook **>>>** Physical Lab Notebook (IMO)
- R package [`blogdown`](https://github.com/rstudio/blogdown) to the rescue

--

.center[*With free software we are able to produce an extensible and reproducible electronic lab notebook.*]

.center[.huge[RStudio -> Blogdown -> Hugo -> GitHub]]

<br>

--

### Examples
- https://example-lab-notebook.netlify.com/
- https://martha-labbook.netlify.com/

<br>

--

.center[Everything you need to get set up is available from my GitHub repo: https://github.com/sirselim/electronic_lab_notebook]

---
## Other tools worth noting?

**REDCap** (https://www.project-redcap.org/)
+ diabetes app questionnaire (Jeremy Krebs)
+ guinea pig data entry tool (Max Berry)

<br>

--

**MongoDB** (https://www.mongodb.com/)
+ reboot of a large public methylation array database (Sam Beardman)

<br>

--

**Docker** (https://www.docker.com/)
+ creating a container for DART (Sam Beardman)

<br>

--

**VSCode** (https://code.visualstudio.com)
- Yep, I'm using a Microsoft product!
---
class: middle inverse

# Collaborations

---
class: inverse

# Collaborations / Networks

.pull-left[.medium[
#### <span style="color:lightblue">UK</span>

Peter Donnelly (*Director, Wellcome Trust Centre*)

Jim Wilson (*Oxford University*)

#### <span style="color:lightblue">USA</span>

Greg Gibson (*Georgia Institute of Technology*)

Mike Langston (*University of Tennessee*)

John Blangero (*University of Texas Rio Grande Valley*)

Melanie Carless (*Texas Biomedical Research Institute*)

Saumya Das (*Harvard Medical School*)

#### <span style="color:lightblue">Canada</span>

Sam Beardman - Vancouver (*BestBuy, Boeing*)
]]

.pull-right[.medium[
#### <span style="color:lightblue">Australia</span>

Kerrie Mengersen (*QUT*)

Nicole White (*QUT*)

Marcel Dinger (*Garvan Institute*)

Ray Blick (*University of New South Wales*)

#### <span style="color:lightblue">Local</span>

Mik Black (*University of Otago*)

Max Berry (*University of Otago*)

Jeremy Krebs (*University of Otago*)

Kirsty Danielson (*Otago University Medical School*)

David Eccles (*Gringene Bioinformatics*)

Geoff Chambers (*VUW*)
]]

<br />

.large[.center[<b>a collection of my important contacts</b>]]

---
class: middle inverse

<p>
.huge[.center[<b>Thank you</b>]]
</p>

---
<iframe src="https://shiny.rstudio.com/gallery/word-cloud.html?showcase=0" width="100%" height="600px"></iframe>