

SLIDE 1

Graph Convolutional Networks (GCNs)

Dimitris Papatheodorou
Aalto University
dimitrispapatheodorou95@gmail.com

May 21, 2019

SLIDE 2

Overview

1. Introduction
   - Problem Setting
   - Graph Laplacian

2. Graph Convolutional Networks
   - The ideas behind the problem
   - GCN idea and convolutions on graphs
   - Spectral Graph Convolutions (SGC)
   - Implementation and results

SLIDE 3

Problem Setting

Graphs are structured representations of data, such as citation networks, social networks, the World-Wide Web, protein-interaction networks, and others.

There has been much recent work on generalizing neural networks to graphs in various ways and for different tasks (graph classification, node classification, clustering, link prediction, node embeddings, and more).

We will denote an undirected graph as $G = (V, E)$, with:
- nodes $v_i \in V$ ($N$ nodes)
- edges $\epsilon_{ij} = (v_i, v_j) \in E$ ($M$ edges)
- adjacency matrix $A \in \mathbb{R}^{N \times N}$ (binary or weighted)
- degree matrix $D_{ii} = \sum_j A_{ij}$
- unnormalized graph Laplacian $\Delta = D - A$
- normalized graph Laplacian $L = I_N - D^{-1/2} A D^{-1/2}$
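To make these objects concrete, here is a minimal NumPy sketch (added for illustration; the toy graph is hypothetical) that builds $A$, $D$, $\Delta$ and $L$ for a small undirected graph:

```python
import numpy as np

# Toy undirected graph on N = 4 nodes with M = 4 edges (hypothetical example).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N = 4

# Adjacency matrix A (binary and symmetric for an undirected graph).
A = np.zeros((N, N))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Degree matrix D with D_ii = sum_j A_ij.
D = np.diag(A.sum(axis=1))

# Unnormalized graph Laplacian: Delta = D - A.
Delta = D - A

# Normalized graph Laplacian: L = I_N - D^{-1/2} A D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

print(Delta)
print(L)
```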

SLIDE 4

Intuition of graph Laplacian

The graph Laplacian can be considered the discrete analogue, on graphs, of the Laplacian operator $\nabla^2$: the differential operator given by the divergence of the gradient of a function $f$ on Euclidean space,
$$\Delta f = \nabla^2 f = \mathrm{div}(\mathrm{grad}(f)).$$

The Gradient Operator
For a function on Euclidean space, the gradient operator gives the derivative of the function along each direction at every point. For a function on a discrete "graph space", the graph gradient operator gives the difference of the function along each edge at every vertex:
for an edge $\epsilon = (u, v)$: $\mathrm{grad}(f)|_{\epsilon} = f(u) - f(v)$.
$\Rightarrow \mathrm{grad}(f) = K^{\top} f$, where $K$ is the $N \times M$ incidence matrix (obtained by assigning an arbitrary orientation to the edges).

SLIDE 5

Intuition of graph Laplacian

SLIDE 6

Intuition of graph Laplacian

The Divergence Operator
In Euclidean space, the divergence at a point gives the net outward flux of a vector field. For graphs, the vector field is just the gradient of a graph function. In the discrete "graph space", we define the graph divergence of a function $g$ defined over the edges of a graph (e.g., the graph gradient) as the mapping from $g$ to $K g$.
$\Rightarrow \Delta f = \mathrm{div}(\mathrm{grad}(f)) = K K^{\top} f$, where $K K^{\top}$ is the Laplacian.

Notice that the Laplacian here factors as $\Delta = K K^{\top}$ (a Gram matrix), so it is positive semi-definite.
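As an added illustration (toy graph and edge orientation chosen arbitrarily), the following NumPy sketch builds an oriented incidence matrix $K$, computes the graph gradient $K^{\top} f$, and checks that $K K^{\top}$ reproduces $\Delta = D - A$:

```python
import numpy as np

# Same toy graph as before: 4 nodes, 4 edges, arbitrary edge orientation.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N, M = 4, len(edges)

# Oriented incidence matrix K of size N x M: +1 at the edge's tail, -1 at its head.
K = np.zeros((N, M))
for e, (u, v) in enumerate(edges):
    K[u, e] = 1.0
    K[v, e] = -1.0

# Graph gradient of a node function f: one difference f(u) - f(v) per edge.
f = np.array([1.0, 2.0, 0.5, -1.0])
grad_f = K.T @ f

# Divergence of the gradient gives the Laplacian: K K^T = D - A.
A = np.zeros((N, N))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
Delta = np.diag(A.sum(axis=1)) - A

print(grad_f)
print(np.allclose(K @ K.T, Delta))  # True: K K^T equals D - A
```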

SLIDE 7

Intuition of graph Laplacian

SLIDE 8

Intuition of graph Laplacian

Circled items (in the figure): the degrees of the vertices! Now the definition $\Delta = D - A$ is clearer.

SLIDE 9

Intuition of graph Laplacian

Another example

SLIDE 10

Intuition of graph Laplacian

More intuition
For continuous spaces, the Laplacian is the second derivative, so it measures how smooth a function is over its domain. It is the same for graph Laplacians: a smooth graph function's values do not change by much from one node to an adjacent one.

Formally (general case of weighted graphs):
$$E(f) = \frac{1}{2} \sum_{u \sim v} w_{uv} \big(f(u) - f(v)\big)^2 = \big\| K^{\top} f \big\|^2 = f^{\top} \Delta f$$

This is the graph analogue of the Dirichlet energy, which for an open set $\Omega \subseteq \mathbb{R}^n$ and a function $f: \Omega \to \mathbb{R}$ is
$$E(f) = \frac{1}{2} \int_{\Omega} \| \nabla f(x) \|^2 \, dx,$$
a measure of how variable a function is.
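A quick numerical sanity check of this identity (added here; the random toy graph is not from the slides):

```python
import numpy as np

# Check that f^T Delta f equals (1/2) * sum_{i,j} A_ij (f_i - f_j)^2
# on a random binary adjacency matrix (each edge counted twice in the sum).
rng = np.random.default_rng(0)
N = 6
A = rng.integers(0, 2, size=(N, N)).astype(float)
A = np.triu(A, 1)
A = A + A.T                      # symmetric binary adjacency, no self-loops
Delta = np.diag(A.sum(axis=1)) - A

f = rng.normal(size=N)
quad_form = f @ Delta @ f
pairwise = 0.5 * sum(A[i, j] * (f[i] - f[j]) ** 2 for i in range(N) for j in range(N))

print(np.isclose(quad_form, pairwise))  # True: a small value means f is smooth over the graph
```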

SLIDE 11

Intuition of graph Laplacian

So, minimizing the variation of a graph function leads us to the Laplacian. The functions that minimize $f^{\top} \Delta f$ are the eigenvectors of $\Delta$. This can be shown either directly, or via the Courant-Fischer-Weyl min-max principle / variational theorem on the Rayleigh quotient of the Laplacian for unit-norm functions. (See the Algorithmic Methods of Data Mining course slides for more.)

SLIDE 12

Intuition of graph Laplacian

Interesting Properties
- $\Delta = K K^{\top}$, thus the Laplacian is a Gram matrix.
- The multiplicity of its zero eigenvalue $\lambda_0$ is equal to the number of connected components of the graph. (Multiplicity: recall the characteristic polynomial $\det(\Delta - \lambda I) = 0$.)
- The second smallest eigenvalue (aka the Fiedler value) of the Laplacian matrix is zero if and only if the graph is disconnected. The smaller the second smallest eigenvalue, the less "connected" the graph.
- Interlacing property: for a graph with Laplacian $\Delta$ and eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, if we delete an edge, the new eigenvalues $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_n$ interlace the old ones: $\lambda_1 \geq \mu_1 \geq \lambda_2 \geq \mu_2 \geq \cdots \geq \lambda_n \geq \mu_n \geq 0$. An analogous interlacing result holds for the adjacency matrix when deleting a node.
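An added toy check of the zero-eigenvalue property (the two-component graph below is a hypothetical example):

```python
import numpy as np

# The multiplicity of the eigenvalue 0 of Delta = D - A equals the number of
# connected components. Graph: a triangle {0,1,2} plus a separate edge {3,4}.
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
N = 5
A = np.zeros((N, N))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
Delta = np.diag(A.sum(axis=1)) - A

eigvals = np.linalg.eigvalsh(Delta)          # Delta is symmetric and PSD
zero_multiplicity = int(np.sum(np.isclose(eigvals, 0.0)))
print(zero_multiplicity)                     # 2, one per connected component
```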

SLIDE 13

The ideas behind the problem

Kipf and Welling's paper [Kipf and Welling (2016)] focuses on node classification, where node labels are available only for a small number of nodes. That is a graph-based semi-supervised learning problem.

It can be tackled by smoothing label information over the graph via some form of explicit graph-based regularization:
$$J = J_0 + \lambda J_{reg}, \qquad J_{reg} = \sum_{i,j} A_{ij} \, \big\| f(X_i) - f(X_j) \big\|^2 = f(X)^{\top} \Delta f(X)$$
- $J_{reg}$ is the graph Laplacian regularization term
- $J_0$ is the supervised loss w.r.t. the labeled part of the graph
- $f(\cdot)$ is a differentiable function (e.g., a neural network)
- $X = \{X_i \mid i = 1, \ldots, N\}$ is a matrix of node feature vectors
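For illustration only, a minimal PyTorch sketch of this regularized objective, using a plain linear map as the differentiable function $f$ (the toy graph, the sizes, and $\lambda = 0.1$ are my own choices, not the paper's setup):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (not from the paper) of J = J0 + lambda * Jreg with a linear f.
N, d, C = 5, 3, 2                       # nodes, input features, classes (toy sizes)
A = torch.tensor([[0., 1, 0, 0, 0],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]])
Delta = torch.diag(A.sum(1)) - A        # unnormalized Laplacian D - A

X = torch.randn(N, d)
labeled = torch.tensor([0, 1])          # indices of the few labeled nodes
y = torch.tensor([0, 1])                # their labels
W = torch.randn(d, C, requires_grad=True)

logits = X @ W                                        # f(X): differentiable in W
J0 = F.cross_entropy(logits[labeled], y)              # supervised part on labeled nodes
Jreg = torch.trace(logits.T @ Delta @ logits)         # smoothness of f(X) over edges
J = J0 + 0.1 * Jreg
J.backward()                                          # gradients flow through both terms
```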

SLIDE 14

Are they good enough?

Is that any good? Yes, but it assumes that connected nodes are likely to share the same label. Edges do not necessarily encode node similarity! They may contain additional information. This assumption restricts the modeling capacity of our classifier.

SLIDE 15

GCNs main idea

A better idea? GCNs encode the graph structure directly using a neural network $f(X, A)$, trained with the unregularized supervised loss $J_0$. By conditioning $f$ on $A$, gradient information from $J_0$ is distributed over the graph, enabling the model to learn representations for nodes both with and without labels.

SLIDE 16

Convolutions on graphs

For a multi-layer GCN $f(X, A)$, a simple propagation rule could be:
$$H^{(0)} = X, \qquad H^{(\ell+1)} = f\big(H^{(\ell)}, A\big) = \alpha\big(A H^{(\ell)} W^{(\ell)}\big)$$
where $\alpha$ is an activation function and $W^{(\ell)}$ is the trainable weight matrix of the $\ell$-th layer.

For better results:
- Enforce self-loops: $\tilde{A} = A + I_N$
- Symmetric normalization: $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$

$$H^{(\ell+1)} = f\big(H^{(\ell)}, A\big) = \alpha\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(\ell)} W^{(\ell)}\big)$$
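A minimal NumPy sketch of one such propagation step with self-loops and symmetric normalization (the toy graph and layer sizes are arbitrary):

```python
import numpy as np

# One GCN propagation step on a toy graph (illustrative sketch).
rng = np.random.default_rng(0)
N, C_in, C_out = 5, 4, 3

A = np.zeros((N, N))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    A[u, v] = A[v, u] = 1.0

# Add self-loops, then symmetrically normalize.
A_tilde = A + np.eye(N)
D_tilde_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_tilde_inv_sqrt @ A_tilde @ D_tilde_inv_sqrt

H = rng.normal(size=(N, C_in))           # H^(0) = X, node features
W = rng.normal(size=(C_in, C_out))       # trainable weights of this layer

H_next = np.maximum(A_hat @ H @ W, 0.0)  # alpha = ReLU applied to A_hat H W
print(H_next.shape)                      # (N, C_out)
```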

SLIDE 17

Spectral Graph Convolutions (SGC)

→ To understand this idea we take a look at Spectral Graph Convolutions.

What we need:
- Signal $x \in \mathbb{R}^N$ (a scalar for every node)
- Filter $g_{\theta} = \mathrm{diag}(\theta)$, $\theta \in \mathbb{R}^N$
- Eigendecomposition of the normalized Laplacian: $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top}$
- The SGC operator $\star$: multiplication in the Fourier (frequency) domain
- $U^{\top} x$: the graph Fourier transform of the signal
- We can understand $g_{\theta}$ as a function of $\Lambda$: $g_{\theta}(\Lambda)$

What we get:
$$g_{\theta} \star x = U g_{\theta} U^{\top} x$$

→ This procedure is computationally expensive though: the eigendecomposition is expensive and multiplication with $U$ is $O(N^2)$.
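An illustrative NumPy sketch of this spectral filtering pipeline (the toy graph and the particular filter $g_{\theta}(\Lambda)$ are arbitrary choices, not from the slides):

```python
import numpy as np

# Filter a node signal x in the graph Fourier domain using the
# eigendecomposition of the normalized Laplacian.
rng = np.random.default_rng(1)
N = 6
A = np.zeros((N, N))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]:
    A[u, v] = A[v, u] = 1.0

d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(N) - d_inv_sqrt @ A @ d_inv_sqrt          # normalized Laplacian
eigvals, U = np.linalg.eigh(L)                        # L = U diag(eigvals) U^T

x = rng.normal(size=N)                                # signal: one scalar per node
theta = np.exp(-2.0 * eigvals)                        # an arbitrary filter g_theta(Lambda)

x_hat = U.T @ x                                       # graph Fourier transform
y = U @ (theta * x_hat)                               # g_theta * x = U g_theta(Lambda) U^T x
print(y)
```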

SLIDE 18

Spectral Graph Convolutions (SGC)

Solution by [Hammond et al. (2011)]: approximate $g_{\theta}(\Lambda)$ using Chebyshev polynomials $T_k(x)$ up to order $K$.

Chebyshev Polynomials Review:
- Recurrence formula:
$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$$
- Using Rodrigues' formula:
$$T_n(x) = \frac{(-2)^n \, n!}{(2n)!} \sqrt{1 - x^2} \, \frac{d^n}{dx^n} \big(1 - x^2\big)^{n - 1/2}$$

SLIDE 19

Spectral Graph Convolutions (SGC)

So we get this approximation:
$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda}),$$
where
- $\Lambda$ is rescaled: $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$,
- $\lambda_{max}$ is the largest eigenvalue of $L$,
- $\theta' \in \mathbb{R}^K$ are the Chebyshev coefficients.

Thus for the convolution we get:
$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L}) x,$$
which is $O(|E|)$, with $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$.

→ This is a K-localized expression: it depends only on nodes that are at most K steps away from the central node.
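A small NumPy sketch (added for illustration) of evaluating this filter with the Chebyshev recurrence, so that only matrix-vector products with $\tilde{L}$ are needed; the graph and the coefficients are arbitrary:

```python
import numpy as np

# Compute sum_k theta_k T_k(L_tilde) x via the Chebyshev recurrence.
def chebyshev_filter(L, x, theta, lam_max):
    N = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(N)    # rescale the spectrum into [-1, 1]
    T_prev, T_curr = x, L_tilde @ x              # T_0(L~) x and T_1(L~) x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * L_tilde @ T_curr - T_prev  # T_k = 2 L~ T_{k-1} - T_{k-2}
        out += theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

# Toy usage (arbitrary small graph and coefficients).
A = np.array([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - d_inv_sqrt @ A @ d_inv_sqrt
x = np.array([1.0, -1.0, 2.0, 0.5])
lam_max = np.linalg.eigvalsh(L).max()
print(chebyshev_filter(L, x, theta=[0.5, 0.3, 0.2], lam_max=lam_max))
```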

SLIDE 20

Stacked SGCs → Profit?

Simply stacking SGCs gives us a neural network. What if we only use K = 1? We get a function that is linear w.r.t. $L$, i.e., linear in the graph Laplacian spectrum. Is that any good? :S

Yes! Stacking such simple functions still lets us explore a rich class of convolutional filters and comes with some extra benefits:
- such filters are not limited by the explicit form and parameterization of the approximation,
- they are less likely to overfit on local neighborhoods,
- but can still convolve a k-th order neighborhood through k layers,
- and they are less expensive, so we can STACK MORE LAYERS! to increase the model's capacity.

SLIDE 21

Stacked SGCs → Profit?

Furthermore, because we are engineers, we can assume $\lambda_{max} \approx 2$ and hope the gods of neural networks help the parameters adapt to this change in scale during training.

- The simplified version is:
$$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N) x = \theta'_0 x - \theta'_1 D^{-1/2} A D^{-1/2} x$$
- Constraining the number of parameters even further addresses overfitting and computational cost:
$$g_{\theta} \star x \approx \theta \big(I_N + D^{-1/2} A D^{-1/2}\big) x, \qquad \text{where } \theta = \theta'_0 = -\theta'_1$$
- The eigenvalues of $I_N + D^{-1/2} A D^{-1/2}$ lie in $[0, 2]$ → exploding/vanishing gradients. Solution: apply the renormalization trick again:
$$g_{\theta} \star x \approx \theta \, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x$$
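An added numerical illustration (toy graph, not from the slides) comparing the spectrum of $I_N + D^{-1/2} A D^{-1/2}$ with that of the renormalized operator:

```python
import numpy as np

# Spectra of I_N + D^{-1/2} A D^{-1/2} (eigenvalues in [0, 2]) versus the
# renormalized operator D~^{-1/2} A~ D~^{-1/2}, on a small toy graph.
N = 8
edges = [(i, (i + 1) % N) for i in range(N)] + [(0, 4), (2, 6)]
A = np.zeros((N, N))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

def sym_norm(M):
    d_inv_sqrt = np.diag(1.0 / np.sqrt(M.sum(axis=1)))
    return d_inv_sqrt @ M @ d_inv_sqrt

op_before = np.eye(N) + sym_norm(A)            # first-order filter without the trick
op_after = sym_norm(A + np.eye(N))             # renormalization trick: A~ = A + I_N

print(np.linalg.eigvalsh(op_before).round(3))  # spread over [0, 2]
print(np.linalg.eigvalsh(op_after).round(3))   # contained in [-1, 1]
```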

SLIDE 22

Stacked SGCs → Profit?

General form for a signal $X \in \mathbb{R}^{N \times C}$ (C channels) and F filters with filter parameters $\Theta \in \mathbb{R}^{C \times F}$:
$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta,$$
with $Z \in \mathbb{R}^{N \times F}$ being the convolved signal matrix. The complexity of the operation is $O(|E| F C)$, as $\tilde{A} X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.
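A sketch of how this is typically implemented with sparse matrices (illustrative only; the sizes, density, and the SciPy-based pipeline are my own choices):

```python
import numpy as np
import scipy.sparse as sp

# Sparse implementation sketch of Z = D~^{-1/2} A~ D~^{-1/2} X Theta (toy sizes).
N, C, F = 1000, 16, 8
rng = np.random.default_rng(3)

# Random sparse symmetric adjacency (weighted, for illustration only).
A = sp.random(N, N, density=0.005, random_state=3, format="csr")
A = A + A.T
A_tilde = A + sp.identity(N, format="csr")                  # add self-loops

deg = np.asarray(A_tilde.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt                   # stays sparse

X = rng.normal(size=(N, C))
Theta = rng.normal(size=(C, F))
Z = A_hat @ (X @ Theta)                                     # sparse-dense product: O(|E| F C)
print(Z.shape)                                              # (N, F)
```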

SLIDE 23

An example

A 2-layer GCN:
- Preprocess: $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$
- Neural network: $Z = f(X, A) = \mathrm{softmax}\big(\hat{A} \, \mathrm{ReLU}\big(\hat{A} X W^{(0)}\big) W^{(1)}\big)$
- Cross-entropy loss: $J = -\sum_{\ell \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{\ell f} \ln Z_{\ell f}$, where $\mathcal{Y}_L$ is the set of node indices that have labels
- Adam on a full batch of the dataset + early stopping
- Dropout for all layers and L2 regularization for the first one
- Glorot weight initialization (aka Xavier normal)

SLIDE 24

Experiments

Datasets

SLIDE 25

Experiments

Classification

SLIDE 26

Experiments

Evaluation of propagation models on classification accuracy, using random weight initialization

SLIDE 27

Experiments

Node Embeddings
$$Z = \tanh\Big(\hat{A} \tanh\big(\hat{A} \tanh\big(\hat{A} X W^{(0)}\big) W^{(1)}\big) W^{(2)}\Big)$$

SLIDE 28

Experiments

Node embeddings for classification: adding a softmax layer to the previous model.

SLIDE 29

Some code

Graph Convolution Layer (1/2)

SLIDE 30

Some code

Graph Convolution Layer (2/2)
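The code on these slides appears as screenshots that were not transcribed. As a substitute, here is a minimal PyTorch sketch of a graph convolution layer implementing the renormalized rule $\hat{A} H W$; the class name and details are my own and need not match the slide's code:

```python
import torch
import torch.nn as nn

class GraphConvolution(nn.Module):
    """One GCN layer: out = A_hat @ x @ W (+ bias); the activation is applied outside."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        # Glorot / Xavier-normal initialization, as mentioned on the example slide.
        nn.init.xavier_normal_(self.weight)

    def forward(self, x, adj):
        # x: (N, in_features) node features; adj: (N, N) precomputed A_hat
        # (dense here for simplicity; a sparse adj with torch.sparse.mm also works).
        support = x @ self.weight
        out = adj @ support
        return out + self.bias if self.bias is not None else out
```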

SLIDE 31

Some code

Graph Convolution Network
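Likewise, a hypothetical sketch of the two-layer GCN from the example slide, reusing the GraphConvolution layer above (dropout on all layers, as described in the slides):

```python
import torch.nn as nn
import torch.nn.functional as F

class GCN(nn.Module):
    """Two-layer GCN from the example slide: softmax(A_hat ReLU(A_hat X W0) W1)."""

    def __init__(self, in_features, hidden, num_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GraphConvolution(in_features, hidden)
        self.gc2 = GraphConvolution(hidden, num_classes)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.dropout(x, self.dropout, training=self.training)
        x = F.relu(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gc2(x, adj)
        return F.log_softmax(x, dim=1)   # log-probabilities, paired with an NLL loss
```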

SLIDE 32

Some code

Training (1/2)

SLIDE 33

Some code

Training (2/2)
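And a hypothetical full-batch training loop following the recipe on the example slide (Adam, cross-entropy on the labeled nodes, L2 on the first layer only, early stopping); the data tensors are placeholders, not a real dataset loader:

```python
import torch
import torch.nn.functional as F

# Placeholder data; in practice X, A_hat, y and the index splits come from the dataset.
N, D, H, C = 100, 50, 16, 3
X = torch.randn(N, D)                       # node features (placeholder)
A_hat = torch.eye(N)                        # precomputed D~^{-1/2} A~ D~^{-1/2} (placeholder)
y = torch.randint(0, C, (N,))
train_idx, val_idx = torch.arange(0, 60), torch.arange(60, 80)

model = GCN(D, H, C)
optimizer = torch.optim.Adam([
    {"params": model.gc1.parameters(), "weight_decay": 5e-4},  # L2 on the first layer only
    {"params": model.gc2.parameters(), "weight_decay": 0.0},
], lr=0.01)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(X, A_hat)                   # one full-batch forward pass
    loss = F.nll_loss(out[train_idx], y[train_idx])
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = F.nll_loss(model(X, A_hat)[val_idx], y[val_idx]).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # early stopping on validation loss
            break
```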

SLIDE 34

References

David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

SLIDE 35

The End