From cell line to command line: my journey to bioinformatics



slide-1
SLIDE 1

From cell line to command line: my journey to bioinformatics

Ming (Tommy) Tang
Research scientist
Twitter: @tangming2005
MD Anderson Cancer Center, Houston, TX
UFGI Genetics & Genomics program seminar

slide-2
SLIDE 2

Self-introduction: 2008 UF Genetics and Genomics graduate student

slide-3
SLIDE 3

Hallmarks of cancer

Douglas Hanahan and Robert A. Weinberg. 2011. Cell

  • Epithelial-mesenchymal transition (EMT)
  • Glycolysis vs. mitochondria
  • Vascular endothelial growth factor (VEGF)

Tang M et al. 2011 PNAS; Kamarajugadda S et al. 2012 MCB; Cai Q et al. 2012 Oncogene; Tang M et al. 2013 JBC

slide-4
SLIDE 4

Challenges that I was facing

  • How do I open this 2 GB ChIP-seq file?
  • Excel fails me.
  • How do I download the files from GEO and process the raw data?
  • That's how I started to teach myself Unix, R and Python.
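
A minimal Python sketch of one way to handle a file that is too big for Excel: read it as a stream, one line at a time, so memory use stays small. It assumes the file is a tab-separated, BED-like text file (optionally gzip-compressed); the function names and the file name in the usage note are hypothetical.

```python
import gzip
from collections import Counter
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for line-by-line reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def count_by_chromosome(lines):
    """Count records per chromosome in tab-separated, BED-like lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and common header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split("\t", 1)[0].strip()
        counts[chrom] += 1
    return counts
```

Usage would look roughly like `with open_text("peaks.bed.gz") as handle: print(count_by_chromosome(handle))`, where `peaks.bed.gz` is a placeholder name. Only one line is held in memory at a time, so file size is not the limit it is in a spreadsheet.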

slide-5
SLIDE 5

2015.03: joined MD Anderson with Dr. Roel Verhaak for a computational biology postdoc

slide-6
SLIDE 6

Teaching

University of Miami 2015 Software Carpentry workshop

slide-7
SLIDE 7

A book chapter published in April 2017

slide-8
SLIDE 8

More about me

slide-9
SLIDE 9

Starting a new job in October

Moving to Harvard FAS Informatics as a bioinformatics scientist

slide-10
SLIDE 10

Challenges and opportunities

slide-11
SLIDE 11

Data deluge

Sean Davis

slide-12
SLIDE 12

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

slide-13
SLIDE 13

Credit: Titus Brown

slide-14
SLIDE 14

Superman/Wonder Woman

Credit: Torsten Seemann

slide-15
SLIDE 15

What is bioinformatics

https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby063/5066445

slide-16
SLIDE 16

A typical day of my life as a bioinformatics scientist

  • Googling (error messages etc.)
  • Converting file formats.
  • Tidying the data.
  • Installing software.
  • Real analysis (plotting etc.) 20%
slide-17
SLIDE 17

Google is what we do

slide-18
SLIDE 18

Ask for help

  • Google
  • SEQanswers
  • Biostars
  • Stack Overflow
slide-19
SLIDE 19

bioinFORMATics

  • A real variant calling example:
  • FASTQ
  • SAM
  • BAM
  • VCF
  • BED
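
As a small illustration of why format conversion takes so much time, here is a hedged Python sketch that turns VCF data lines into BED intervals. The detail that matters is the coordinate system: VCF positions are 1-based, while BED intervals are 0-based and half-open. The function names are made up for this example, and it only covers simple REF/ALT records (no symbolic alleles or structural variants).

```python
def vcf_line_to_bed(line):
    """Convert one VCF data line to a (chrom, start, end, name) tuple.

    VCF POS is 1-based; BED is 0-based and half-open, so
    start = POS - 1 and end = start + len(REF).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    start = int(pos) - 1
    end = start + len(ref)
    return chrom, start, end, f"{ref}>{alt}"


def vcf_to_bed(lines):
    """Yield tab-separated BED lines for every data line in a VCF stream."""
    for line in lines:
        # Header and metadata lines start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        chrom, start, end, name = vcf_line_to_bed(line)
        yield f"{chrom}\t{start}\t{end}\t{name}"
```

Off-by-one errors between these two conventions are a common source of bugs, which is one reason to rely on established tools whenever they cover the conversion you need.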
slide-20
SLIDE 20

conda and bioconda

slide-21
SLIDE 21

Learn command line

  • Why command line?
  • More efficient/powerful
  • HPC, cloud computing
slide-22
SLIDE 22

Terminal

http://rik.smith-unna.com/command_line_bootcamp/

Use a Mac/Ubuntu or Windows 10 has a built-in

slide-23
SLIDE 23

Learn some Python

slide-24
SLIDE 24

Automation saves you time in the long run

Computers are good at repetitive work
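
A rough sketch of what "repetitive work" can look like in practice: counting reads in every FASTQ file in a folder instead of opening them one by one. It assumes standard four-line FASTQ records; the function names and the `*.fastq*` pattern are illustrative choices, not part of any particular pipeline.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def summarize_directory(directory, pattern="*.fastq*"):
    """Return {file name: read count} for every matching file, sorted by name."""
    files = sorted(Path(directory).glob(pattern))
    return {p.name: count_fastq_reads(p) for p in files}
```

Once written, the same two functions work for three files or three thousand, and the script itself records exactly what was done.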

slide-25
SLIDE 25

Side effect of automation

  • The best documentation is automation
  • Write scripts for everything unless it is not possible. (For manual editing, document it!)

Credit to someone in the Twitter-verse :)

slide-26
SLIDE 26
slide-27
SLIDE 27

DNA-seq Snakemake pipeline

slide-28
SLIDE 28

A real run

slide-29
SLIDE 29

Output files from the pipeline are organized by folders and uniformly named
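
One way to get outputs that are organized by folders and uniformly named is to derive every path from the sample name with a single function. The sketch below is only an illustration: the step folders and file suffixes are hypothetical and are not taken from the pipeline shown in the talk.

```python
from pathlib import PurePosixPath

# Hypothetical pipeline steps and file suffixes, for illustration only.
STEPS = [
    ("01_aligned", ".sorted.bam"),
    ("02_dedup", ".dedup.bam"),
    ("03_variants", ".vcf.gz"),
]


def output_paths(sample, root="results"):
    """Return one uniformly named output path per step for a sample."""
    return [
        str(PurePosixPath(root) / folder / f"{sample}{suffix}")
        for folder, suffix in STEPS
    ]
```

Because every path comes from the same rule, downstream scripts can find any file from the sample name alone. Workflow managers such as Snakemake express the same idea with wildcards in rule inputs and outputs.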
slide-30
SLIDE 30

Learn some R

  • RStudio (IDE)
  • Bioconductor
  • Tidyverse and ggplot2
slide-31
SLIDE 31

Do not reinvent the wheel

slide-32
SLIDE 32

Tidying data

R for Data Science by Hadley Wickham & Garrett Grolemund
http://r4ds.had.co.nz/

slide-33
SLIDE 33

Data visualization

slide-34
SLIDE 34

Wait, do you really need to learn C and C++?

http://ivory.idyll.org/blog/2015-bioinformatics-middle-class.html
Titus Brown

slide-35
SLIDE 35

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/

slide-36
SLIDE 36

Reproducibility crisis

http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method

slide-37
SLIDE 37

Titus Brown

slide-38
SLIDE 38

Method matters

Credit: Nicolas Robine

slide-39
SLIDE 39

How to ensure reproducibility

  • Git version control
  • Jupyter/R Notebook, documentation
  • Containers (Docker, Singularity, BioContainers: https://biocontainers.pro/)
slide-40
SLIDE 40
slide-41
SLIDE 41

Version control

  • Git
  • Github
  • Gitlab
slide-42
SLIDE 42

Notebooks

slide-43
SLIDE 43

Docker

  • Why Docker?
  • Imagine you are working on an analysis in R and you send your code to a friend. Your friend runs exactly this code on exactly the same data set but gets a slightly different result. This can have various reasons, such as a different operating system, a different version of an R package, etc. Docker is trying to solve problems like that.

https://cyverse-cybercarpentry-container-workshop-2018.readthedocs-hosted.com/en/latest/docke
https://ropenscilabs.github.io/r-docker-tutorial/01-what-and-why.html
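
The slide's example is about R, but the same problem exists in any language. The Python sketch below does not replace a container; it only records the operating system, interpreter version, and package versions, so that two people can compare their environments when results differ. The function name is made up for this example.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=()):
    """Collect OS, Python, and package versions as a plain dictionary."""
    snapshot = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": {},
    }
    for name in packages:
        try:
            snapshot["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            snapshot["packages"][name] = None
    return snapshot
```

Saving this dictionary next to the results makes a version mismatch easy to spot. A container goes one step further by shipping the whole environment along with the code.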

slide-44
SLIDE 44

Other important untaught skills

  • Naming files
  • Project organization
  • Data organization, backup plans
slide-45
SLIDE 45

Naming files

  • Three principles for (file) names:
  • 1. Machine readable (do not put special characters or spaces in the name)
  • 2. Human readable (easy to figure out what the heck something is, based on its name; add a slug)
  • 3. Plays well with default ordering:
  • * Put something numeric first
  • * Use the ISO 8601 standard for dates (YYYY-MM-DD)
  • * Left pad other numbers with zeros

Jenny Bryan
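
The three naming principles can be sketched in a few lines of Python. This is one possible convention under those principles, not a standard: the helper names, the underscore separator, and the three-digit padding are all assumptions made for the example.

```python
import re
from datetime import date


def slugify(text):
    """Lowercase text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_filename(day, index, description, ext, width=3):
    """Build a name from an ISO 8601 date, a zero-padded number, and a slug."""
    return f"{day.isoformat()}_{index:0{width}d}_{slugify(description)}.{ext}"
```

For example, `make_filename(date(2018, 9, 14), 3, "ChIP-seq peaks", "bed")` gives `2018-09-14_003_chip-seq-peaks.bed`. Names built this way contain no spaces or special characters, stay readable, and sort chronologically with a plain alphabetical sort.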

slide-46
SLIDE 46

Jenny Bryan

slide-47
SLIDE 47

Jenny Bryan: https://rawgit.com/Reproducible-Science-Curriculum/rr-organization1/master/organization-01-slid

slide-48
SLIDE 48

TCGA barcode
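
A TCGA barcode is a good example of a machine-readable name: every field is separated by a hyphen and has a fixed meaning. The Python sketch below splits a full aliquot-level barcode into its documented fields (project, tissue source site, participant, sample and vial, portion and analyte, plate, center). The class and function names are hypothetical, and shorter barcode levels are deliberately not handled.

```python
from typing import NamedTuple


class TcgaBarcode(NamedTuple):
    project: str
    tss: str          # tissue source site
    participant: str
    sample: str       # two-digit sample type code
    vial: str
    portion: str
    analyte: str
    plate: str
    center: str


def parse_tcga_barcode(barcode):
    """Split an aliquot-level TCGA barcode into named fields."""
    parts = barcode.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 hyphen-separated fields, got {len(parts)}")
    project, tss, participant, sample_vial, portion_analyte, plate, center = parts
    return TcgaBarcode(
        project, tss, participant,
        sample_vial[:2], sample_vial[2:],
        portion_analyte[:2], portion_analyte[2:],
        plate, center,
    )


def is_tumor(parsed):
    """Sample type codes 01-09 denote tumor samples; 10-19 denote normals."""
    return 1 <= int(parsed.sample) <= 9
```

For instance, `parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01")` yields sample code `01` with vial `C`, so `is_tumor` returns `True` for it.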

slide-49
SLIDE 49

Organization of each project's downstream analysis

slide-50
SLIDE 50

RStudio R project

slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53

Stay current in bioinformatics

  • Bioinformatics evolves so fast!
  • E.g. sequencing technology: long reads (PacBio, Nanopore) and single cell; all of these require new tools to analyze the associated data.
  • I started bioinformatics after reading:

http://www.gettinggeneticsdone.com/2012/05/how-to-stay-current-in.html

slide-54
SLIDE 54

Social media networks

  • Twitter
  • Follow papers/tools, jobs, outreach
  • Go to conferences/talk to other people
  • Blog posts
slide-55
SLIDE 55

https://divingintogeneticsandgenomics.rbind.io/

slide-56
SLIDE 56

Titus Brown

slide-57
SLIDE 57

Where to start

  • Never too late to start to learn!
  • ANGUS workshop
  • http://angus.readthedocs.io/en/2018/
  • I would love to come back and give workshops!
  • Software/Data Carpentries 2-day workshop (Ethan White is heavily involved at UF)
  • Many other resources
  • https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources

slide-58
SLIDE 58

Massive open online courses (MOOCs) and others

  • 1. Coursera
  • 2. edX
  • 3. Udacity
  • 4. DataCamp (not free, interactive lessons)
slide-59
SLIDE 59

Learn by doing

slide-60
SLIDE 60

What questions do you have?

slide-61
SLIDE 61

Acknowledgements

  • Thanks Titus Brown, Torsten Seemann and Sean Davis for letting me borrow their slides.
  • Thanks Stephen Turner for writing his blog posts.
  • Thanks Roel Verhaak lab for giving me the opportunity to learn computational biology.
  • Thanks Samir Amin for teaching me so much!
slide-62
SLIDE 62

Use Excel wisely

slide-63
SLIDE 63

Caution with Excel

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

slide-64
SLIDE 64

Converted to dates

http://blogs.nature.com/naturejobs/2017/02/27/escape-gene-name-mangling-with-escape-excel/
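
Before opening a gene list in a spreadsheet, it can help to check which symbols are at risk of being converted to dates (for example SEPT2 or MARCH1). The Python sketch below flags symbols that look like a month name followed by a day number. The list of month prefixes is an illustrative assumption and is not guaranteed to match everything a spreadsheet program will convert.

```python
import re

# Month-like prefixes followed by one or two digits, e.g. SEPT2, MARCH1, DEC1.
# This pattern is an illustrative guess at what spreadsheets treat as dates.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)


def date_like_symbols(symbols):
    """Return the gene symbols that a spreadsheet may turn into dates."""
    return [s for s in symbols if DATE_LIKE.match(s)]
```

Running this on a column of gene symbols before and after a round trip through a spreadsheet is a cheap way to catch silent changes.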

slide-65
SLIDE 65
slide-66
SLIDE 66

https://www.sciencedirect.com/science/article/pii/S0018506X18302599?via%3Dihub

slide-67
SLIDE 67
  • https://www.youtube.com/watch?v=s3JldKoA0zw&feature=youtu.be

slide-68
SLIDE 68

Regular expression