Using Disco and MapReduce to study mRNA complexity (Dan Williams)



SLIDE 1

7/14/2011 | Life Technologies Proprietary & Confidential |


Using Disco and MapReduce to study mRNA complexity

Dan Williams

SciPy 2011 Lightning Talk

SLIDE 2


Disco

  • MapReduce framework written in Python and Erlang

    − useful for dealing with massive data

  • Users specify map and reduce operations as Python functions, then chain them together to get stuff done
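The idea of writing map and reduce steps as plain Python functions and chaining them can be sketched without a cluster. The block below is a minimal single-process illustration, not Disco's API: `run_mapreduce` is a hypothetical stand-in for the framework, and the base-counting task and its function names are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    pairs = [kv for record in records for kv in map_fn(record)]
    pairs.sort(key=itemgetter(0))  # bring equal keys together for groupby
    grouped = (
        (key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    )
    return [out for key, values in grouped for out in reduce_fn(key, values)]


def count_bases_map(sequence):
    # Emit one (base, 1) pair per nucleotide.
    for base in sequence:
        yield base, 1


def sum_reduce(key, values):
    # Collapse all values for one key into a single total.
    yield key, sum(values)


def keep_frequent_map(pair):
    # Second stage: consume the first stage's output and keep common bases.
    base, count = pair
    if count >= 3:
        yield "frequent", base


def collect_reduce(key, values):
    # Gather every value for the key into one sorted list.
    yield key, sorted(values)


# Chaining: the output of one map/reduce pair is the input of the next.
stage1 = run_mapreduce(["AACG", "ACGT", "AAT"], count_bases_map, sum_reduce)
stage2 = run_mapreduce(stage1, keep_frequent_map, collect_reduce)
```

In a real framework the grouping step would be distributed across workers; here it is a sort followed by `groupby`, which is enough to show how the stages connect.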

SLIDE 3


mRNA molecules contain three distinct regions:

AAATGACGACAACGGTGAGGGTTCTCGGGCGGGGCCTGGGACAGGCAGCTCCGGGGTCCGCGGTTTCACATCGGAAACAAAACAGCGG CTGGTCTGGAAGGAACCTGAGCTACGAGCCGCGGCGGCAGCGGGGCGGCGGGGAAGCGTATACCTAATCTGGGAGCCTGCAAGTGACA ACAGCCTTTGCGGTCCTTAGACAGCTTGGCCTGGAGGAGAACACATGAAAGAAAGAACCTCAAGAGGCTTTGTTTTCTGTGAAACAGT ATTTCTATACAGTTGCTCCAATGACAGAGTTACCTGCACCGTTGTCCTACTTCCAGAATGCACAGATGTCTGAGGACAACCACCTGAG CAATACTGTACGTAGCCAGAATGACAATAGAGAACGGCAGGAGCACAACGACAGACGGAGCCTTGGCCACCCTGAGCCATTATCTAAT GGACGACCCCAGGGTAACTCCCGGCAGGTGGTGGAGCAAGATGAGGAAGAAGATGAGGAGCTGACATTGAAATATGGCGCCAAGCATG TGATCATGCTCTTTGTCCCTGTGACTCTCTGCATGGTGGTGGTCGTGGCTACCATTAAGTCAGTCAGCTTTTATACCCGGAAGGATGG GCAGCTAATCTATACCCCATTCACAGAAGATACCGAGACTGTGGGCCAGAGAGCCCTGCACTCAATTCTGAATGCTGCCATCATGATC AGTGTCATTGTTGTCATGACTATCCTCCTGGTGGTTCTGTATAAATACAGGTGCTATAAGGTCATCCATGCCTGGCTTATTATATCAT CTCTATTGTTGCTGTTCTTTTTTTCATTCATTTACTTGGGGGAAGTGTTTAAAACCTATAACGTTGCTGTGGACTACATTACTGTTGC ACTCCTGATCTGGAATTTTGGTGTGGTGGGAATGATTTCCATTCACTGGAAAGGTCCACTTCGACTCCAGCAGGCATATCTCATTATG ATTAGTGCCCTCATGGCCCTGGTGTTTATCAAGTACCTCCCTGAATGGACTGCGTGGCTCATCTTGGCTGTGATTTCAGTATATGATT TAGTGGCTGTTTTGTGTCCGAAAGGTCCACTTCGTATGCTGGTTGAAACAGCTCAGGAGAGAAATGAAACGCTTTTTCCAGCTCTCAT TTACTCCTCAACAATGGTGTGGTTGGTGAATATGGCAGAAGGAGACCCGGAAGCTCAAAGGAGAGTATCCAAAAATTCCAAGTATAAT GCAGAAAGCACAGAAAGGGAGTCACAAGACACTGTTGCAGAGAATGATGATGGCGGGTTCAGTGAGGAATGGGAAGCCCAGAGGGACA GTCATCTAGGGCCTCATCGCTCTACACCTGAGTCACGAGCTGCTGTCCAGGAACTTTCCAGCAGTATCCTCGCTGGTGAAGACCCAGA GGAAAGGGGAGTAAAACTTGGATTGGGAGATTTCATTTTCTACAGTGTTCTGGTTGGTAAAGCCTCAGCAACAGCCAGTGGAGACTGG AACACAACCATAGCCTGTTTCGTAGCCATATTAATTGGTTTGTGCCTTACATTATTACTCCTTGCCATTTTCAAGAAAGCATTGCCAG CTCTTCCAATCTCCATCACCTTTGGG

Research question: Do the three mRNA regions generally differ in information content?

SLIDE 4


Method: Calculate the Shannon entropy of each 21-nucleotide segment of each mRNA from a well-known database. Group results by region and compare.

MapReduce with Disco speeds the computation (across ~30k mRNA sequences)
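A minimal sketch of the per-segment calculation, under assumptions the slides do not state: entropy is taken over single-nucleotide frequencies within the segment, H = -Σ p_b log2(p_b), measured in bits, and the 21-nucleotide segments are overlapping windows that advance one base at a time. The function names are illustrative.

```python
import math
from collections import Counter

K = 21  # segment length stated on the slide


def shannon_entropy(segment):
    """Entropy in bits of the nucleotide frequencies within one segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def kmers(sequence, k=K):
    """Yield every window of length k (overlapping windows are an assumption)."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]
```

With four possible nucleotides, this measure ranges from 0 bits (a segment made of one repeated base) to 2 bits (all four bases equally frequent).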

SLIDE 5


  • Map 21-mer segments and regions to 1

  • Reduce to remove duplicates

  • Map Shannon entropy of 21-mer segment to region

  • Reduce to get a boxplot for each region
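The four steps above can be sketched as two chained map/reduce stages in plain Python. This is an illustration under stated assumptions rather than the talk's actual code: it does not use Disco, `group_by_key` is a hypothetical stand-in for the framework's shuffle step, the input is assumed to arrive as `(region_label, sequence)` pairs already split by region, and the final reduce returns minimum, median, and maximum instead of drawing a boxplot.

```python
import math
from collections import Counter, defaultdict
from statistics import median

K = 21  # segment length from the slides


def shannon_entropy(segment):
    """Entropy in bits of single-nucleotide frequencies (an assumed definition)."""
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in Counter(segment).values())


def group_by_key(pairs):
    """Hypothetical stand-in for the shuffle step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


# Stage 1: map each (segment, region) pair to 1, then reduce to drop duplicates.
def map_segments(record, k=K):
    region, sequence = record
    for start in range(len(sequence) - k + 1):
        yield (sequence[start:start + k], region), 1


def reduce_unique(key, values):
    # One output per distinct (segment, region), however often it occurred.
    yield key


# Stage 2: map each unique segment's entropy to its region, then summarize.
def map_entropy(key):
    segment, region = key
    yield region, shannon_entropy(segment)


def reduce_summary(region, entropies):
    # Minimum, median, and maximum: a reduced set of the values a boxplot shows.
    yield region, (min(entropies), median(entropies), max(entropies))


def run_pipeline(records, k=K):
    stage1 = group_by_key(kv for rec in records for kv in map_segments(rec, k))
    unique = [out for key, vals in stage1.items() for out in reduce_unique(key, vals)]
    stage2 = group_by_key(kv for key in unique for kv in map_entropy(key))
    return dict(out for region, vals in stage2.items() for out in reduce_summary(region, vals))
```

Keying stage 1 on the `(segment, region)` pair means a segment repeated within one region is counted once, while the same segment appearing in two regions contributes to both.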

SLIDE 6


SLIDE 7


Thank you!