Machine Learning
Tim Althoff, UW CS547: Machine Learning for Big Data
http://www.cs.washington.edu/cse547


SLIDE 1

SLIDE 2

Machine Learning

SLIDE 3

Machine Learning

Node classification

SLIDE 4

Classifying the function of proteins in the interactome

Image from: Ganapathiraju et al. 2016. Schizophrenia interactome with 504 novel protein–protein interactions. Nature.

SLIDE 5

(Supervised) Machine Learning Lifecycle: requires feature engineering every single time!

Pipeline: Raw Data → [Feature Engineering] → Structured Data → Learning Algorithm → Model → Downstream task

Goal: Automatically learn the features.

SLIDE 6

Goal: Efficient task-independent feature learning for machine learning in networks!

We learn a mapping f: u → ℝ^d that assigns each node u a feature representation (embedding) in ℝ^d.

SLIDE 7

Task: We map each node in a network to a point in a low-dimensional space

• Distributed representation for nodes
• Similarity of embeddings between nodes indicates their network similarity
• Encode network information and generate node representations

SLIDE 8

2D embedding of nodes of Zachary's Karate Club network:

Image from: Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014.

SLIDE 9

• The modern deep learning toolbox is designed for simple sequences or grids:
  – CNNs for fixed-size images/grids
  – RNNs or word2vec for text/sequences

SLIDE 10

But networks are far more complex!

• Complex topological structure (no spatial locality like grids)
• No fixed node ordering or reference point
• Often dynamic and have multimodal features

[Figure: Networks vs. Images vs. Text]

SLIDE 11

SLIDE 12

Assume we have a graph G:

• V is the vertex set
• A is the adjacency matrix (assume binary)
• No node features or extra information is used!

SLIDE 13

• Goal: encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the original network

SLIDE 14

similarity(u, v) ≈ z_v^T z_u

Left side: similarity of u and v in the original network (goal: we need to define this!). Right side: similarity of the embeddings.

SLIDE 15

1. Define an encoder (i.e., a mapping from nodes to embeddings)
2. Define a node similarity function (i.e., a measure of similarity in the original network)
3. Optimize the parameters of the encoder so that:

   similarity(u, v) ≈ z_v^T z_u

   (similarity in the original network ≈ similarity of the embeddings)

SLIDE 16

• Encoder maps each node to a low-dimensional vector:

   enc(v) = z_v, where v is a node in the input graph and z_v its d-dimensional embedding

• Similarity function specifies how relationships in vector space map to relationships in the original network:

   similarity(u, v) ≈ z_v^T z_u

   (similarity of u and v in the original network ≈ dot product between node embeddings)

SLIDE 17

• Simplest encoding approach: the encoder is just an embedding-lookup:

   enc(v) = Z v, with Z ∈ ℝ^{d×|V|} and v ∈ I^{|V|}

   Z: matrix in which each column is a d-dimensional node embedding [what we learn!]
   v: indicator vector, all zeroes except a one in the column indicating node v
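
A minimal numpy sketch of this lookup (the names and dimensions are illustrative, not from the slides): enc(v) = Zv is just column indexing into a learnable matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 5, 3                      # |V| = 5 nodes, d = 3 dimensions

# Z in R^{d x |V|}: one learnable column per node (this is what we learn!)
Z = rng.normal(size=(d, num_nodes))

def enc(v: int) -> np.ndarray:
    """Embedding lookup: equivalent to Z @ one_hot(v)."""
    return Z[:, v]

# Same result via the indicator-vector formulation enc(v) = Z v:
v_indicator = np.zeros(num_nodes)
v_indicator[2] = 1.0
assert np.allclose(enc(2), Z @ v_indicator)
```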

SLIDE 18

• Simplest encoding approach: the encoder is just an embedding-lookup:

   Z = embedding matrix with one column per node; the number of rows is the dimension/size of the embeddings, and each column is the embedding vector for a specific node

SLIDE 19

Simplest encoding approach: the encoder is just an embedding-lookup.
• Each node is assigned a unique embedding vector.
• Many methods: node2vec, DeepWalk, LINE.

SLIDE 20

The key choice across methods is how they define node similarity. E.g., should two nodes have similar embeddings if they…

• are connected?
• share neighbors?
• have similar "structural roles"?
• …?

SLIDE 21

Material based on:

  • Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.
  • Grover et al. 2016. node2vec: Scalable Feature Learning for Networks. KDD.
SLIDE 22

z_u^T z_v ≈ probability that u and v co-occur on a random walk over the network

(z_v … embedding of node v)

SLIDE 23

1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R.

2. Optimize embeddings to encode these random walk statistics: similarity here (dot product ≈ cos(θ)) encodes random walk "similarity".

SLIDE 24

1. Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information.

2. Efficiency: Do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks.

SLIDE 25

• Intuition: Find an embedding of nodes in d-dimensional space so that node similarity is preserved.
• Idea: Learn node embeddings such that nearby nodes are close together in the network.
• Given a node u, how do we define nearby nodes?
  – N_R(u) … neighborhood of u obtained by some strategy R

SLIDE 26

• Given G = (V, E)
• Our goal is to learn a mapping z: u → ℝ^d
• Maximize the log-likelihood objective:

   max_z  Σ_{u∈V} log P(N_R(u) | z_u)

  – where N_R(u) is the neighborhood of node u
• Given node u, we want to learn feature representations that are predictive of the nodes in its neighborhood N_R(u).

SLIDE 27

1. Run short fixed-length random walks starting from each node u on the graph, using some strategy R.

2. For each node u, collect N_R(u), the multiset* of nodes visited on random walks starting from u.

3. Optimize embeddings according to: given node u, predict its neighbors N_R(u):

   max_z  Σ_{u∈V} log P(N_R(u) | z_u)

*N_R(u) can have repeat elements since nodes can be visited multiple times on random walks.
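
A sketch of steps 1 and 2 for the simplest (unbiased, fixed-length) strategy R; the toy adjacency list and the walk parameters below are illustrative assumptions, not from the deck.

```python
import random
from collections import defaultdict

def random_walk(adj, start, length):
    """One unbiased fixed-length walk: each step moves to a uniformly random neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(adj[walk[-1]]))
    return walk

def collect_neighborhoods(adj, walks_per_node=10, walk_length=5):
    """N_R(u): multiset of nodes visited on walks starting from u (repeats kept)."""
    N_R = defaultdict(list)
    for u in adj:
        for _ in range(walks_per_node):
            N_R[u].extend(random_walk(adj, u, walk_length)[1:])  # drop position 0 (u itself)
    return N_R

# Toy graph as an adjacency list (illustrative):
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(collect_neighborhoods(adj)[0])
```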

SLIDE 28

   max_z  Σ_{u∈V} log P(N_R(u) | z_u)

• Assumption: The conditional likelihood factorizes over the set of neighbors:

   log P(N_R(u) | z_u) = Σ_{v∈N_R(u)} log P(z_v | z_u)

• Softmax parametrization:

   P(z_v | z_u) = exp(z_v · z_u) / Σ_{n∈V} exp(z_n · z_u)

Why softmax? We want node v to be most similar to node u (out of all nodes n). Intuition: Σ_i exp(x_i) ≈ max_i exp(x_i).

SLIDE 29

Putting it all together:

   L = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

• Outer sum: over all nodes u
• Inner sum: over nodes v seen on random walks starting from u
• The fraction: predicted probability of u and v co-occurring on a random walk

Optimizing random walk embeddings = finding the node embeddings z that minimize L.
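
A deliberately naive numpy transcription of L, using a toy Z and toy neighborhoods (illustrative assumptions); it makes the expensive inner sum over all of V explicit.

```python
import numpy as np

def naive_loss(Z, N_R):
    """L = sum_u sum_{v in N_R(u)} -log( exp(z_u.z_v) / sum_{n in V} exp(z_u.z_n) )."""
    L = 0.0
    for u, neighbors in N_R.items():
        scores = Z.T @ Z[:, u]                     # z_u . z_n for every n in V
        log_denominator = np.log(np.exp(scores).sum())
        for v in neighbors:                        # -log softmax = -(z_u.z_v - log sum exp)
            L -= Z[:, v] @ Z[:, u] - log_denominator
    return L

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 4))                        # d = 3, |V| = 4 (toy numbers)
N_R = {0: [1, 2, 1], 1: [0], 2: [3, 0], 3: [2]}    # toy multiset neighborhoods
print(naive_loss(Z, N_R))
```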

SLIDE 30

But doing this naively is too expensive! The nested sum over nodes gives O(|V|^2) complexity:

   L = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

SLIDE 31

But doing this naively is too expensive! The normalization term from the softmax is the culprit… can we approximate it?

   L = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )

SLIDE 32

• Solution: Negative sampling. Instead of normalizing w.r.t. all nodes, just normalize against k random "negative samples" n_i:

   log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )
      ≈ log σ(z_u^T z_v) − Σ_{i=1}^{k} log σ(z_u^T z_{n_i}),   n_i ∼ P_V

where σ is the sigmoid function (makes each term a "probability" between 0 and 1) and P_V is a random distribution over all nodes.

Why is the approximation valid? Technically, this is a different objective. But negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using logistic regression (the sigmoid) to distinguish the target node v from nodes n_i sampled from the background distribution P_V.

More at https://arxiv.org/pdf/1402.3722.pdf and https://arxiv.org/pdf/1410.8251.pdf

SLIDE 33

   log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) )
      ≈ log σ(z_u^T z_v) − Σ_{i=1}^{k} log σ(z_u^T z_{n_i}),   n_i ∼ P_V (a random distribution over all nodes)

• Sample k negative nodes proportional to degree.
• Two considerations for k (the number of negative samples):
  1. Higher k gives more robust estimates.
  2. Higher k corresponds to a higher prior on negative events.
• In practice, k = 5–20.
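
A sketch of one negative-sampled loss term, following the slide's formula, with degree-proportional sampling of the k negatives; the toy matrix and degree values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_term(Z, u, v, degrees, k=5):
    """-( log sigma(z_u.z_v) - sum_{i=1..k} log sigma(z_u.z_{n_i}) ), n_i ~ P_V."""
    P_V = degrees / degrees.sum()                        # degree-proportional background dist.
    negatives = rng.choice(len(degrees), size=k, p=P_V)  # k negative samples n_i
    positive = np.log(sigmoid(Z[:, u] @ Z[:, v]))
    negative = sum(np.log(sigmoid(Z[:, u] @ Z[:, n])) for n in negatives)
    return -(positive - negative)

Z = rng.normal(size=(3, 4))                              # d = 3, |V| = 4 (toy numbers)
degrees = np.array([2.0, 2.0, 3.0, 1.0])                 # node degrees of a toy graph
print(neg_sampling_term(Z, u=0, v=1, degrees=degrees))
```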

SLIDE 34

1. Run short fixed-length random walks starting from each node u on the graph, using some strategy R.

2. For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u.

3. Optimize embeddings using Stochastic Gradient Descent:

   L = Σ_{u∈V} Σ_{v∈N_R(u)} −log P(v | z_u)

We can efficiently approximate this using negative sampling!

SLIDE 35

• So far we have described how to optimize embeddings given random walk statistics.
• What strategies should we use to run these random walks?
  – Simplest idea: Just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014).
  – The issue is that this notion of similarity is too constrained.
  – How can we generalize it?

SLIDE 36

• Goal: Embed nodes with similar network neighborhoods close in the feature space.
• We frame this goal as a prediction-task-independent maximum likelihood optimization problem.
• Key observation: A flexible notion of the network neighborhood N_R(u) of node u leads to rich node embeddings.
• Develop a biased 2nd-order random walk R to generate the network neighborhood N_R(u) of node u.

SLIDE 37

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016).

[Figure: walks from node u over nodes s1–s9, contrasting BFS and DFS exploration]

SLIDE 38

Two classic strategies to define a neighborhood N_R(u) of a given node u, for a walk of length 3 (N_R(u) of size 3):

   N_BFS(u) = {s1, s2, s3}  … local, microscopic view
   N_DFS(u) = {s4, s5, s6}  … global, macroscopic view
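
To make the contrast concrete, a small sketch (toy graph, illustrative neighborhood size 3) that takes the first three nodes reached breadth-first vs. depth-first from u.

```python
from collections import deque

def bfs_neighborhood(adj, u, size=3):
    """First `size` nodes reached breadth-first: a local, microscopic view."""
    seen, queue, out = {u}, deque(adj[u]), []
    while queue and len(out) < size:
        node = queue.popleft()
        if node not in seen:
            seen.add(node); out.append(node); queue.extend(adj[node])
    return out

def dfs_neighborhood(adj, u, size=3):
    """First `size` nodes reached depth-first: a global, macroscopic view."""
    seen, stack, out = {u}, list(adj[u]), []
    while stack and len(out) < size:
        node = stack.pop()
        if node not in seen:
            seen.add(node); out.append(node); stack.extend(adj[node])
    return out

# Toy graph with two chains hanging off u=0: BFS stays near u, DFS runs away.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0, 5], 5: [4, 6], 6: [5]}
print(bfs_neighborhood(adj, 0), dfs_neighborhood(adj, 0))  # [1, 4, 2] vs [4, 5, 6]
```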

SLIDE 39

A biased fixed-length random walk R that, given a node u, generates a neighborhood N_R(u).

• Two parameters:
  – Return parameter p: return back to the previous node.
  – In-out parameter q: moving outwards (DFS) vs. inwards (BFS); intuitively, q is the "ratio" of BFS vs. DFS.

SLIDE 40

Biased 2nd-order random walks explore network neighborhoods:

• The random walk just traversed edge (s1, w) and is now at w.
• Insight: the neighbors of w can only be back at s1, at the same distance to s1, or farther from s1.
• Idea: remember where the walk came from.

SLIDE 41

• The walker came over edge (s1, w) and is at w. Where to go next?
• p and q model the transition probabilities:
  – p … return parameter
  – q … "walk away" parameter

1, 1/p, and 1/q are "unnormalized" probabilities (weights we later convert to a probability distribution).

[Figure: walker at w, arrived from s1; candidate next steps s1 (weight 1/p), s2 (weight 1), s3 and s4 (weight 1/q)]

SLIDE 42

• The walker came over edge (s1, w) and is at w. Where to go next?
  – BFS-like walk: low value of p
  – DFS-like walk: low value of q

N_R(u) are the nodes visited by the biased walk.

Unnormalized transition probabilities from w, segmented by distance from s1:

   Target t | Prob. (unnormalized) | Dist(s1, t)
   s1       | 1/p                  | 0
   s2       | 1                    | 1
   s3       | 1/q                  | 2
   s4       | 1/q                  | 2
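
A sketch of a single biased transition using the unnormalized weights from the table above; the toy graph and the p, q values are illustrative assumptions.

```python
import random

def biased_step(adj, prev, curr, p, q):
    """Sample the next node of a node2vec-style walk, given the previous edge (prev, curr)."""
    neighbors = adj[curr]
    weights = []
    for x in neighbors:
        if x == prev:                 # back to where we came from: weight 1/p
            weights.append(1.0 / p)
        elif prev in adj[x]:          # same distance from prev: weight 1
            weights.append(1.0)
        else:                         # farther from prev: weight 1/q
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

# Toy graph: walker arrived at w=2 from s1=1; node 3 stays near s1, node 4 moves away.
adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
print(biased_step(adj, prev=1, curr=2, p=1.0, q=2.0))
```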

SLIDE 43

1. Compute the random walk transition probabilities.
2. Simulate r random walks of length l starting from each node u.
3. Optimize the node2vec objective using Stochastic Gradient Descent.

Linear-time complexity; all three steps are individually parallelizable.

SLIDE 44

BFS: Micro-view of the neighbourhood of u
DFS: Macro-view of the neighbourhood of u

SLIDE 45

Small network of interactions of characters in a novel:

• p=1, q=2: microscopic view of the network neighbourhood
• p=1, q=0.5: macroscopic view of the network neighbourhood

SLIDE 46

(Not covered in detail here, but for your reference.)

• Different kinds of biased random walks:
  – Based on node attributes (Dong et al., 2017)
  – Based on learned weights (Abu-El-Haija et al., 2017)
• Alternative optimization schemes:
  – Directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al., 2015)
• Network preprocessing techniques:
  – Run random walks on modified versions of the original network (e.g., Ribeiro et al. 2017's struc2vec, Chen et al. 2016's HARP)

SLIDE 47

• How to use the embeddings z_i of nodes:
  – Clustering/community detection: cluster nodes based on z_i
  – Node classification: predict the label f(z_i) of node i based on z_i
  – Link prediction: predict edge (i, j) based on f(z_i, z_j), where we can concatenate, average, take a product of, or take a difference between the embeddings:
    • Concatenate: f(z_i, z_j) = g([z_i, z_j])
    • Hadamard: f(z_i, z_j) = g(z_i ∗ z_j) (per-coordinate product)
    • Sum/Avg: f(z_i, z_j) = g(z_i + z_j)
    • Distance: f(z_i, z_j) = g(‖z_i − z_j‖_2)
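
The four combination operators as a small numpy sketch; the downstream classifier g is left out, and the helper name is hypothetical.

```python
import numpy as np

def edge_features(z_i, z_j, op="hadamard"):
    """Combine two node embeddings into one feature vector for an edge (i, j)."""
    if op == "concat":
        return np.concatenate([z_i, z_j])
    if op == "hadamard":
        return z_i * z_j                           # per-coordinate product
    if op == "avg":
        return (z_i + z_j) / 2.0
    if op == "distance":
        return np.array([np.linalg.norm(z_i - z_j)])
    raise ValueError(f"unknown op: {op}")

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=3), rng.normal(size=3)  # toy 3-dim embeddings
for op in ["concat", "hadamard", "avg", "distance"]:
    print(op, edge_features(z_i, z_j, op))
```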

SLIDE 48

• So what method should I use?
• No one method wins in all cases…
  – E.g., node2vec performs better on node classification, while multi-hop methods perform better on link prediction (Goyal and Ferrara, 2017 survey).
• Random walk approaches are generally more efficient.
• In general: you must choose a definition of node similarity that matches your application!

SLIDE 49

SLIDE 50

• Tasks:
  – Classifying toxic vs. non-toxic molecules
  – Identifying carcinogenic molecules
  – Graph anomaly detection
  – Classifying social networks

SLIDE 51

• Goal: We want to embed an entire graph G into an embedding z_G.

SLIDE 52

Simple idea:
• Run a standard node embedding technique on the (sub)graph G.
• Then just sum (or average) the node embeddings in the (sub)graph G:

   z_G = Σ_{v∈G} z_v

• Used by Duvenaud et al., 2016 to classify molecules based on their graph structure.
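
A minimal numpy sketch of this sum (or average) readout over a toy embedding matrix; dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 4))     # d = 3 embeddings for a toy 4-node (sub)graph

z_G_sum = Z.sum(axis=1)         # z_G = sum over v in G of z_v
z_G_avg = Z.mean(axis=1)        # averaged variant
print(z_G_sum, z_G_avg)
```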

SLIDE 53

• Idea: Introduce a "virtual node" to represent the (sub)graph and run a standard graph embedding technique.
• Proposed by Li et al., 2016 as a general technique for subgraph embedding.