Machine Learning
Node classification
Classifying the function of proteins in the interactome
Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein-protein interactions. Nature.
(Supervised) Machine Learning Lifecycle: requires feature engineering every single time!

Raw Data → (Feature Engineering) → Structured Data → Learning Algorithm → Model → Downstream task
Instead: automatically learn the features.

Goal: Efficient task-independent feature learning for machine learning in networks!
Task: we map each node in a network to a point in a low-dimensional space.

f: u → ℝ^d (feature representation, embedding of node u)
- Distributed representation for nodes
- Similarity of embeddings between nodes indicates their network similarity
- Encode network information and generate node representation
2D embedding of nodes of the Zachary's Karate Club network:

Image from: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
The modern deep learning toolbox is designed for simple sequences or grids:
- CNNs for fixed-size images/grids
- RNNs or word2vec for text/sequences
But networks are far more complex!
- Complex topological structure (no spatial locality like grids)
- No fixed node ordering or reference point
- Often dynamic, with multimodal features

(Networks vs. Images vs. Text)
Assume we have a graph G:
- V is the vertex set
- A is the adjacency matrix (assumed binary)
- No node features or extra information is used!
Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network.
similarity(u, v) ≈ z_v^T z_u

(left side: similarity of u and v in the original network, which we still need to define!; right side: similarity of the embeddings)
1. Define an encoder (i.e., a mapping from nodes to embeddings)
2. Define a node similarity function (i.e., a measure of similarity in the original network)
3. Optimize the parameters of the encoder so that:

   similarity(u, v) ≈ z_v^T z_u
- The encoder maps each node to a low-dimensional vector
- The similarity function specifies how relationships in vector space map to relationships in the original network
ENC(v) = z_v   (v: node in the input graph; z_v: its d-dimensional embedding)

similarity(u, v) ≈ z_v^T z_u   (similarity of u and v in the original network ≈ dot product between node embeddings)
Simplest encoding approach: the encoder is just an embedding-lookup:
ENC(v) = Z v

- Z ∈ ℝ^{d×|V|}: matrix whose columns are the d-dim node embeddings (this is what we learn!)
- v ∈ 𝕀^{|V|}: indicator vector, all zeroes except a one in the column indicating node v
Z = [embedding matrix: one column per node, each column being the embedding vector for a specific node; the number of rows is the dimension/size of the embeddings]
Each node is assigned a unique embedding vector. Many methods take this approach: node2vec, DeepWalk, LINE.
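As a minimal numpy sketch (all names and sizes here are illustrative, not from the lecture), the lookup encoder is literally a matrix times an indicator vector:

import numpy as np

num_nodes, d = 34, 16                  # illustrative sizes (e.g., Karate Club, 16 dims)
rng = np.random.default_rng(0)
Z = rng.normal(size=(d, num_nodes))    # embedding matrix: one column per node (learned)

def encode(v):
    # ENC(v) = Z @ one_hot(v), i.e., just pick out column v of Z
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot                 # identical to Z[:, v]

z_u = encode(0)                        # the 16-dimensional embedding of node 0

In practice no one materializes the one-hot vector; the lookup Z[:, v] is the whole encoder.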
The key design choice across methods is how they define node similarity. E.g., should two nodes have similar embeddings if they:
- are connected?
- share neighbors?
- have similar "structural roles"?
- …?
Material based on:
- Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
- Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
z_u^T z_v ≈ probability that u and v co-occur on a random walk over the network

(z_u … embedding of node u)
1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R
2. Optimize embeddings to encode these random walk statistics: similarity (here: dot product ∝ cos(θ)) encodes random walk "similarity"
Why random walks? Two benefits:
1. Expressivity: a flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information
2. Efficiency: we do not need to consider all node pairs when training; we only need to consider pairs that co-occur on random walks
- Intuition: find an embedding of nodes in d-dimensional space such that node similarity is preserved
- Idea: learn node embeddings such that nodes that are nearby in the network end up close together in the embedding space
- Given a node u, how do we define nearby nodes?
  - N_R(u) … neighbourhood of u obtained by some strategy R
- Given G = (V, E)
- Our goal is to learn a mapping z: u → ℝ^d
- Maximize the log-likelihood objective:

  max_z Σ_{u∈V} log P(N_R(u) | z_u)

  where N_R(u) is the neighborhood of node u
- Given node u, we want to learn feature representations that are predictive of the nodes in its neighborhood N_R(u)
1. Run short fixed-length random walks starting from each node on the graph using some strategy R
2. For each node u, collect N_R(u), the multiset* of nodes visited on random walks starting from u
3. Optimize embeddings according to: given node u, predict its neighbors N_R(u)

   max_z Σ_{u∈V} log P(N_R(u) | z_u)
*N_R(u) can have repeat elements, since nodes can be visited multiple times on random walks.
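A minimal sketch of steps 1-2 in plain Python. The adjacency-list dict adj, the walk count, and the walk length are assumptions of this example, and the strategy R shown is the simple unbiased walk:

import random

def unbiased_walk(adj, start, walk_length):
    # Step 1: one short fixed-length random walk starting from `start`
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(random.choice(adj[walk[-1]]))  # assumes every node has >= 1 neighbor
    return walk

def collect_neighborhoods(adj, num_walks=10, walk_length=5):
    # Step 2: N_R(u) = multiset of nodes visited on walks starting from u
    N_R = {u: [] for u in adj}
    for u in adj:
        for _ in range(num_walks):
            N_R[u].extend(unbiased_walk(adj, u, walk_length)[1:])  # repeats kept: multiset
    return N_R

# e.g., adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}; collect_neighborhoods(adj)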
max_z Σ_{u∈V} log P(N_R(u) | z_u)

- Assumption: the conditional likelihood factorizes over the set of neighbors:

  log P(N_R(u) | z_u) = Σ_{v∈N_R(u)} log P(z_v | z_u)

- Softmax parametrization:

  P(v | z_u) = exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n)
Why softmax? We want node v to be most similar to node u (out of all nodes n). Intuition: Σ_i exp(y_i) ≈ max_i exp(y_i).
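The softmax parametrization reads directly off the formula; a numpy sketch (Z is a d × |V| embedding matrix as in the lookup-encoder sketch above):

import numpy as np

def softmax_prob(Z, u, v):
    # P(v | z_u) = exp(z_u . z_v) / sum_n exp(z_u . z_n); note the O(|V|) denominator
    scores = Z[:, u] @ Z              # z_u . z_n for every node n at once, shape (|V|,)
    scores = scores - scores.max()    # shift for numerical stability (cancels in the ratio)
    exp_scores = np.exp(scores)
    return exp_scores[v] / exp_scores.sum()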
Putting it all together:

L = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

- outer sum: over all nodes u
- inner sum: over nodes v seen on random walks starting from u
- the ratio: predicted probability of u and v co-occurring on a random walk

Optimizing random walk embeddings = finding node embeddings z that minimize L.
But doing this naively is too expensive! The nested sum over nodes gives O(|V|^2) complexity:

L = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )
The normalization term from the softmax is the culprit… can we approximate it?
Solution: negative sampling. Instead of normalizing w.r.t. all nodes, just normalize against k random "negative samples" n_i.
log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )  ≈  log σ(z_u^T z_v) − Σ_{i=1}^{k} log σ(z_u^T z_{n_i}),   n_i ∼ P_V

- σ: sigmoid function (makes each term a "probability" between 0 and 1)
- P_V: random distribution over all nodes

Why is the approximation valid? Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using logistic regression (the sigmoid) to distinguish the target node v from nodes n_i sampled from the background distribution P_V.
More at https://arxiv.org/pdf/1402.3722.pdf and https://arxiv.org/pdf/1410.8251.pdf
- Sample the k negative nodes proportional to degree
- Two considerations for choosing k (the number of negative samples):
  1. Higher k gives more robust estimates
  2. Higher k corresponds to a higher prior on negative events
- In practice: k = 5-20
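Here is a sketch of the resulting per-pair loss (numpy). It is written in the standard word2vec/NCE form, where the negative term uses σ(−z_u^T z_n); neg_probs and k are example inputs:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, neg_probs, k=5, rng=None):
    # -log sigma(z_u . z_v) - sum_{i=1..k} log sigma(-z_u . z_{n_i}),  n_i ~ P_V
    rng = rng or np.random.default_rng()
    negs = rng.choice(Z.shape[1], size=k, p=neg_probs)   # k negatives drawn from P_V
    pos_term = -np.log(sigmoid(Z[:, u] @ Z[:, v]))
    neg_term = -np.log(sigmoid(-(Z[:, u] @ Z[:, negs]))).sum()
    return pos_term + neg_term

# neg_probs: probability vector over all nodes, e.g., degrees / degrees.sum()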
1. Run short fixed-length random walks starting from each node on the graph using some strategy R
2. For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u
3. Optimize embeddings using Stochastic Gradient Descent:

   L = Σ_{u∈V} Σ_{v∈N_R(u)} −log P(v | z_u)

   We can efficiently approximate this using negative sampling!
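A sketch of one SGD update for a single co-occurring pair (u, v), using the gradients of the negative-sampling loss above (the learning rate is illustrative; sigmoid and neg_probs are as in the previous sketch):

def sgd_step(Z, u, v, neg_probs, lr=0.025, k=5, rng=None):
    # One stochastic update for a pair (u, v) that co-occurred on a walk
    rng = rng or np.random.default_rng()
    negs = rng.choice(Z.shape[1], size=k, p=neg_probs)
    z_u = Z[:, u].copy()                        # freeze z_u for this update
    # positive pair: gradient of -log sigma(z_u . z_v)
    g = sigmoid(z_u @ Z[:, v]) - 1.0
    Z[:, u] -= lr * g * Z[:, v]
    Z[:, v] -= lr * g * z_u
    # negative samples: gradient of -log sigma(-z_u . z_n)
    for n in negs:
        g = sigmoid(z_u @ Z[:, n])
        Z[:, u] -= lr * g * Z[:, n]
        Z[:, n] -= lr * g * z_u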
- So far we have described how to optimize embeddings given random walk statistics
- What strategies should we use to run these random walks?
  - Simplest idea: just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014)
  - The issue is that this notion of similarity is too constrained
  - How can we generalize it?
- Goal: embed nodes with similar network neighborhoods close in the feature space
- We frame this goal as a prediction-task-independent maximum likelihood optimization problem
- Key observation: a flexible notion of the network neighborhood N_R(u) of node u leads to rich node embeddings
- Develop a biased 2nd-order random walk R to generate the network neighborhood N_R(u) of node u
Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).
Two classic strategies to define a neighborhood N_R(u) of a given node u, illustrated with a walk of length 3 (N_R(u) of size 3):

- N_BFS(u) = {s1, s2, s3} (local microscopic view)
- N_DFS(u) = {s4, s5, s6} (global macroscopic view)

(Figure: node u with its immediate neighbors reached by BFS and more distant nodes reached by DFS.)
Biased fixed-length random walk R that, given a node u, generates the neighborhood N_R(u).

Two parameters:
- Return parameter p: return back to the previous node
- In-out parameter q: moving outwards (DFS) vs. inwards (BFS)
  - Intuitively, q is the "ratio" of BFS vs. DFS
Biased 2nd-order random walks explore network neighborhoods. Idea: remember where the walk came from.

- The walk just traversed edge (s1, w) and is now at w
- Insight: the neighbors of w can only be back at s1, at the same distance from s1, or farther from s1
- The walker came over edge (s1, w) and is now at w. Where to go next?
- p, q model the transition probabilities:
  - p … return parameter
  - q … "walk away" parameter
- 1/p, 1, 1/q are "unnormalized" probabilities (weights we later convert to a probability distribution)
- BFS-like walk: low value of p
- DFS-like walk: low value of q
- N_R(u) are the nodes visited by the biased walk
Unnormalized transition probabilities from w, segmented by distance from s1:

Target t | Prob. | Dist(s1, t)
s1       | 1/p   | 0
s2       | 1     | 1
s3       | 1/q   | 2
s4       | 1/q   | 2
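The table translates into a few lines of Python. Given that the walk arrived at w via edge (s1, w), each neighbor x of w falls into exactly one of the three distance cases; a sketch with the same adjacency-dict assumption as before:

def transition_probs(adj, prev, curr, p, q):
    # Unnormalized node2vec weights for each neighbor x of `curr`,
    # given that the walk arrived via edge (prev, curr)
    prev_nbrs = set(adj[prev])
    weights = {}
    for x in adj[curr]:
        if x == prev:               # dist(prev, x) = 0: go back
            weights[x] = 1.0 / p
        elif x in prev_nbrs:        # dist(prev, x) = 1: stay at the same distance
            weights[x] = 1.0
        else:                       # dist(prev, x) = 2: move farther away
            weights[x] = 1.0 / q
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}  # normalize to a distribution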
1. Compute the random walk probabilities
2. Simulate r random walks of length l starting from each node u
3. Optimize the node2vec objective using Stochastic Gradient Descent

Linear-time complexity; all 3 steps are individually parallelizable.
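Putting the pieces together, a sketch of one biased walk (transition_probs is from the previous snippet; choosing the first step uniformly is an assumption of this sketch, since there is no previous node yet):

import random

def node2vec_walk(adj, start, length, p, q):
    # One biased 2nd-order walk; the transition weights are computed on the fly
    walk = [start, random.choice(adj[start])]   # no previous node yet: uniform first step
    while len(walk) < length:
        probs = transition_probs(adj, walk[-2], walk[-1], p, q)
        nodes = list(probs)
        walk.append(random.choices(nodes, weights=[probs[x] for x in nodes])[0])
    return walk

# Step 3 then feeds co-occurring (u, v) pairs from these walks into SGD on the
# negative-sampling objective (see sgd_step above).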
BFS: micro-view of the neighbourhood. DFS: macro-view of the neighbourhood.

Example: a small network of interactions of characters in a novel:
- p = 1, q = 2: microscopic view of the network neighbourhood
- p = 1, q = 0.5: macroscopic view of the network neighbourhood
(Not covered in detail here, but for your reference:)

- Different kinds of biased random walks:
  - based on node attributes (Dong et al., 2017)
  - based on learned weights (Abu-El-Haija et al., 2017)
- Alternative optimization schemes:
  - directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al., 2015)
- Network preprocessing techniques:
  - run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017's struc2vec, Chen et al. 2016's HARP)
How to use embeddings z_u of nodes:
- Clustering/community detection: cluster nodes based on z_u
- Node classification: predict label f(z_u) of node u based on z_u
- Link prediction: predict edge (u, v) based on f(z_u, z_v), where we can concatenate, average, multiply, or take a difference between the embeddings:
  - Concatenate: f(z_u, z_v) = g([z_u, z_v])
  - Hadamard: f(z_u, z_v) = g(z_u ∗ z_v) (per-coordinate product)
  - Sum/Avg: f(z_u, z_v) = g(z_u + z_v)
  - Distance: f(z_u, z_v) = g(‖z_u − z_v‖_2)
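For link prediction, each of these operators turns a pair of node embeddings into a single edge feature vector for a downstream classifier; a numpy sketch:

import numpy as np

def edge_features(Z, u, v, op="hadamard"):
    # Combine two node embeddings into one edge representation
    z_u, z_v = Z[:, u], Z[:, v]
    if op == "concat":
        return np.concatenate([z_u, z_v])             # g([z_u, z_v])
    if op == "hadamard":
        return z_u * z_v                              # per-coordinate product
    if op == "avg":
        return (z_u + z_v) / 2.0                      # sum/average
    if op == "distance":
        return np.array([np.linalg.norm(z_u - z_v)])  # ||z_u - z_v||_2
    raise ValueError(f"unknown operator: {op}")

# The resulting features feed a standard classifier (e.g., logistic regression).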
- So what method should I use?
- No one method wins in all cases…
  - E.g., node2vec performs better on node classification, while multi-hop methods perform better on link prediction (Goyal and Ferrara, 2017 survey)
- Random walk approaches are generally more efficient
- In general: you must choose a definition of node similarity that matches your application!
Tasks:
- Classifying toxic vs. non-toxic molecules
- Identifying carcinogenic molecules
- Graph anomaly detection
- Classifying social networks
Goal: we want to embed an entire graph G into a single vector z_G.
Simple idea:
- Run a standard node embedding technique on the (sub)graph G
- Then just sum (or average) the node embeddings in the (sub)graph G
- Used by Duvenaud et al., 2016 to classify molecules based on their graph structure
z_G = Σ_{v∈G} z_v
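As a sketch (Z as in the earlier snippets; nodes is the node set of the (sub)graph G):

def graph_embedding(Z, nodes, average=False):
    # z_G = sum (or mean) of the node embeddings in the (sub)graph
    z_G = Z[:, list(nodes)].sum(axis=1)
    return z_G / len(nodes) if average else z_G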
Idea: introduce a "virtual node" to represent the (sub)graph, and run a standard graph embedding technique.
- Proposed by Li et al., 2016 as a general technique for subgraph embedding