Using the Network Structure of Annotation Data to Gain Insights into Gene Interactions and the Organization of Biological Function
in collaboration with: Kimberly Glass, Ed Ott, Wolfgang Losert
networks that are highly regular (e.g. the lattice connections of atoms in a solid) or highly random (e.g. the interactions of gas molecules).
which to extend the tools of statistical physics.
applying their approaches to many body problems in other fields: animal flocking, market behaviors, etc.
functions (terms).
Image from: “Gene Ontology: Tool for the Unification of Biology”
hierarchy.
[Figure: Bipartite Graph of Gene Annotations (terms, genes), shown with the Term Network, the Gene Network, and an example binary annotation matrix.]
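A bipartite annotation graph can be collapsed onto either of its two node sets. The sketch below is one minimal way to do that, assuming an edge weight equal to the number of shared neighbors; that weighting, the function names, and the toy gene and term labels are illustrative assumptions, not the weighting or data used in the slides.

```python
from itertools import combinations

# Hypothetical toy annotations (gene -> set of terms); not data from the slides.
annotations = {
    "g1": {"tA", "tB"},
    "g2": {"tA"},
    "g3": {"tB", "tC"},
    "g4": {"tC"},
}

def invert(mapping):
    """Turn gene -> terms into term -> genes (or the reverse)."""
    inverted = {}
    for node, neighbors in mapping.items():
        for neighbor in neighbors:
            inverted.setdefault(neighbor, set()).add(node)
    return inverted

def project(mapping):
    """One-mode projection: link two nodes if they share at least one neighbor.

    The weight is the shared-neighbor count, which is an assumed choice.
    """
    weights = {}
    for a, b in combinations(sorted(mapping), 2):
        shared = len(mapping[a] & mapping[b])
        if shared:
            weights[(a, b)] = shared
    return weights

gene_network = project(annotations)          # genes linked by shared terms
term_network = project(invert(annotations))  # terms linked by shared genes
```

The same `project` function serves both networks because the projection is symmetric in the two node types; only the direction of the mapping changes.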
[Figure: Degree distribution of GO Terms (Number of Terms vs. Degree of Term) and Degree distribution of annotated genes (Number of Genes vs. Degree of Gene), on log-log axes, for All Annotations, Biological Process, Molecular Function, and Cellular Component.]
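Degree distributions like these can be tabulated directly from a list of gene-term annotation pairs. The following is a minimal sketch, assuming each pair appears only once; the pair list and the `degree_distribution` helper are hypothetical and are not taken from the slides.

```python
from collections import Counter

# Hypothetical (gene, term) annotation pairs; not data from the slides.
pairs = [
    ("g1", "tA"), ("g1", "tB"),
    ("g2", "tA"),
    ("g3", "tA"), ("g3", "tB"), ("g3", "tC"),
]

def degree_distribution(pairs, side):
    """Return {degree: number of nodes with that degree}.

    side=0 counts gene degrees (terms per gene);
    side=1 counts term degrees (genes per term).
    """
    degrees = Counter(pair[side] for pair in pairs)
    return dict(Counter(degrees.values()))

gene_dist = degree_distribution(pairs, 0)
term_dist = degree_distribution(pairs, 1)
```

Plotting the resulting counts against degree on logarithmic axes gives the kind of view shown in the figure.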
[Figure: the example binary annotation matrix, labeled with the values 1/2, 1, 1/4, and 1/3.]
each only have the same single gene annotation.
no common annotations.
share few common annotations.
Adolescent friendship network, from Jim Moody
communities can be quantified by the modularity function:
\[ Q = \sum_{i=1}^{k} \left[ \frac{e_i}{m} - \left( \frac{d_i}{2m} \right)^{2} \right] \]
where e_i is the number of edges within community i, d_i is the number of edge ends that connect to vertices in community i, and m is the total number of edges.
Newman and Girvan, PRE 2004
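The modularity of a given division can be evaluated directly from an edge list. The sketch below assumes an undirected, unweighted graph and a dictionary mapping each vertex to its community; the function name and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Q = sum_i [ e_i/m - (d_i/(2m))^2 ] for an undirected, unweighted graph."""
    m = len(edges)
    e = {}  # e[i]: edges with both ends inside community i
    d = {}  # d[i]: edge ends attached to vertices in community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        d[cu] = d.get(cu, 0) + 1
        d[cv] = d.get(cv, 0) + 1
        if cu == cv:
            e[cu] = e.get(cu, 0) + 1
    return sum(e.get(c, 0) / m - (d[c] / (2 * m)) ** 2 for c in d)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community)
```

For this toy graph each triangle has e_i = 3 and d_i = 7 with m = 7, so Q = 2(3/7 - 1/4) = 5/14. Placing every vertex in one community gives Q = 0.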
modularity function.
Brandes et al. 2007, Clauset et al. 2004, Newman 2006, Massen and Doye 2006
\[ Q = \sum_{i=1}^{k} \left[ \frac{e_i}{m} - \left( \frac{d_i}{2m} \right)^{2} \right] \]
Hypergeometric probability returns a p-value for the similarity of the cancer signature to the genes annotated to terms in the branch of the hierarchy and for the similarity of the signature to genes annotated to terms in a community.
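A hypergeometric p-value of this kind can be sketched as an upper-tail sum. The version below assumes a one-sided test for over-representation against a background of N genes; the parameter names and the choice of background are assumptions, not details given in the slides.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for the overlap X between a signature and a gene set.

    N: genes in the background (assumed: all annotated genes)
    K: genes annotated to the branch or community being tested
    n: genes in the signature
    k: genes shared by the signature and the tested gene set
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total

# Hypothetical numbers: 10 background genes, 4 annotated, signature of 3, overlap of 2.
p = hypergeom_pvalue(N=10, K=4, n=3, k=2)
```

Here the tail is (6*6 + 4*1)/120 = 1/3. The same function applies whether the tested gene set comes from a branch of the hierarchy or from a community, which is what makes the two p-values comparable.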
[Figure panels: GO Terms, Communities]
Signatures defined in “Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles”
Cancer Signatures
[Figure: the example binary annotation matrix, labeled with the values 1/2, 1, 1/4, and 1/3, each paired with the exponent α.]
In the limit of large α, the edges in G take a particular ordering such that those genes connected through many low-degree terms have the highest weight.
many low degree terms.
no common annotations.
a single high degree term.
annotation data such that every gene pair whose Gij is above the threshold is considered connected.
regulatory network.
utility of our gene-gene network for capturing true regulatory interactions.
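\[ F = 2\,\frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
\[ \mathrm{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \]
\[ \mathrm{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \]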
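The thresholding and scoring steps can be sketched together: gene pairs whose weight exceeds a threshold are treated as predicted interactions and compared with a reference network. The weights, the reference pairs, and the `f_score` helper below are hypothetical, and returning 0 when a denominator vanishes is an assumed convention.

```python
def f_score(weights, reference, threshold):
    """Threshold gene-pair weights, then score predictions against a reference.

    weights: {frozenset({gene_i, gene_j}): G_ij}
    reference: set of frozenset gene pairs taken as true interactions
    Returns (precision, recall, F).
    """
    predicted = {pair for pair, w in weights.items() if w > threshold}
    tp = len(predicted & reference)   # true positives
    fp = len(predicted - reference)   # false positives
    fn = len(reference - predicted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def pair(a, b):
    """Unordered gene pair."""
    return frozenset((a, b))

# Hypothetical weights and reference network; not data from the slides.
weights = {pair("g1", "g2"): 0.9, pair("g1", "g3"): 0.6,
           pair("g2", "g3"): 0.2, pair("g3", "g4"): 0.7}
reference = {pair("g1", "g2"), pair("g2", "g3"), pair("g3", "g4")}
precision, recall, f = f_score(weights, reference, threshold=0.5)
```

Sweeping the threshold and recording F at each value is one way to compare how well different gene-gene networks recover the reference interactions.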
Context-Likelihood-of-Relatedness
expression data.
pairs of genes.
genes experiments
reference for CLR algorithm: Faith, PLoS Biology, 2007.
hierarchy, which provides an independent framework with which to describe and predict the functions of experimentally identified groups
regulatory interactions.
interactions than those predicted by the CLR algorithm and can be combined with CLR further to improve predictive power.
which the term belongs.
terms.
and G2, one constructs an n_G1 × n_G2 matrix, where n_G1 (n_G2) is the number of terms annotated to G1 (G2), and populates it with the semantic similarity values between all pairs of terms. The semantic similarity between the two genes is then the average of all values in the matrix.
\[ \mathrm{SemSim}(t_1, t_2) = -\log \min_{t \in T(t_1, t_2)} p(t) \]
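The term-level and gene-level similarities can be sketched as follows. This assumes that T(t1, t2) is the set of ancestors shared by the two terms (each term counted among its own ancestors) and that p(t) is a precomputed probability for each term; both readings, along with the toy hierarchy and function names, are assumptions made for illustration.

```python
from math import log

def sem_sim(t1, t2, ancestors, p):
    """SemSim(t1, t2) = -log of the smallest p(t) over the shared ancestors t."""
    shared = ancestors[t1] & ancestors[t2]
    return -log(min(p[t] for t in shared))

def gene_sim(terms1, terms2, ancestors, p):
    """Average SemSim over the n_G1 x n_G2 matrix of term pairs."""
    matrix = [[sem_sim(a, b, ancestors, p) for b in terms2] for a in terms1]
    return sum(map(sum, matrix)) / (len(terms1) * len(terms2))

# Hypothetical toy hierarchy: root -> {x, y}, x -> {x1, x2}.
ancestors = {
    "root": {"root"},
    "x": {"x", "root"},
    "y": {"y", "root"},
    "x1": {"x1", "x", "root"},
    "x2": {"x2", "x", "root"},
}
# Hypothetical term probabilities; the root covers everything, so p = 1.
p = {"root": 1.0, "x": 0.5, "y": 0.5, "x1": 0.25, "x2": 0.25}

s_siblings = sem_sim("x1", "x2", ancestors, p)          # share x, so -log(0.5)
s_genes = gene_sim(["x1"], ["x2", "y"], ancestors, p)   # mean of the 1 x 2 matrix
```

Terms whose only shared ancestor is the root score 0, since p(root) = 1, so the gene-level average is pulled down by unrelated term pairs.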
\[ X = \frac{N_{11} - N_{00}}{N_T}, \qquad \kappa = \frac{X - \bar{X}}{1 - \bar{X}} \]