class: center, middle, title-slide
count: false

# Deep Learning on Graphs
# (3/3)
.bold[Marc Lelarge]

.bold[[www.dataflowr.com](https://www.dataflowr.com)]

---

# .gray[(1) Node embedding]
## .gray[Language model]
### .gray[one fixed graph, no signal. Ex: community detection]

# .gray[(2) Signal processing on graphs]
## .gray[Fourier analysis on graphs]
### .gray[one fixed graph, various signals. Ex: classification of signals]

# (3) Graph embedding
## Graph Neural Networks
### various graphs. Ex: classification of graphs

---

# Graph embedding
## Graph Neural Networks
### various graphs. Ex: classification of graphs

--
count: false

The result of viewing an image as a graph whose nodes are the pixels, with the grid replaced by the complete graph:

.center.width-80[![](images/graphs/permuted_image.png)]

---

# How to represent a graph?

.center.width-60[![](images/graphs/isomorph.jpeg)]

--
count: false

In graph theory, graph canonization is the problem of finding a canonical form of a given graph $G$ (i.e. every graph that is isomorphic to $G$ should have the same canonical form as $G$). Thus, a solution to the graph canonization problem would also solve the graph isomorphism problem...

---

# Why do graph symmetries matter?

Start with a linear regression: your task is to estimate a linear model $\beta_1 x_1+\dots + \beta_n x_n$ from noisy observations $({\bf x},y)$.

**Q:** How many parameters do you need to estimate if you know in addition that the model is .red[invariant] to permutations of the input $(x_1,\dots, x_n)$?

--
count: false

*A:* There is only one parameter to estimate because .red[invariance] implies $\beta_1=\dots=\beta_n$.

--
count: false

**Q:** A linear regression on graphs: estimate a linear function of the adjacency matrix in $\mathbb{R}^{n\times n}$. How many parameters are there to estimate?

--
count: false

*A:* There are only two parameters to estimate for a linear function $f:\mathbb{R}^{n\times n}\to \mathbb{R}$ invariant to permutations of the rows and columns:
$$
f(A) = \alpha \sum\_{i=j}A\_{i,j}+\beta \sum\_{i\neq j}A\_{i,j},
$$
whatever the value of $n$!

---

# Invariant and equivariant functions

We only consider algorithms whose result does not depend on the particular representation of the graph.

--
count: false

Graphs are represented by their adjacency matrix $G\in \mathbb{F}^{n^2}$. For $\sigma\in \mathcal{S}\_n$, we define:
- for $X\in \mathbb{F}^n$, $(\sigma \star X)\_{\sigma(i)} = X\_i$
- for $G\in \mathbb{F}^{n^2}$, $(\sigma \star G)\_{\sigma(i\_1),\sigma(i\_2)} = G\_{i\_1, i\_2}$.

--
count: false

$G_1,G_2$ are isomorphic iff $G_1=\sigma \star G_2$ for some permutation $\sigma$.

--
count: false

A function $f:\mathbb{F}^{n^k} \to \mathbb{F}$ is said to be .red[invariant] if $f(\sigma \star Z) = f(Z)$ for every permutation $\sigma$ and every $Z \in \mathbb{F}^{n^k}$, $k=1,2$.

A function $f:\mathbb{F}^{n^k} \to \mathbb{F}^{n}$ is said to be .red[equivariant] if $f(\sigma \star Z) = \sigma \star f(Z)$ for every permutation $\sigma$ and every $Z \in \mathbb{F}^{n^k}$, $k=1,2$.

---

# Message passing GNN (MGNN)

.center.width-40[![](images/graphs/mgnnlayer.png)]

A .red[MGNN] takes as input a discrete graph $G=(V,E)$ with $n$ nodes and features on the nodes $h^0\in \mathbb{F}^n$. Its layers are defined inductively: with $h^\ell\_i \in \mathbb{F}$ the features at layer $\ell$ associated with node $i$,
$$
h^{\ell+1}\_i = f\left( h\_i^\ell, \left[h\_j^\ell\right]\_{j\sim i}\right),
$$
where $f$ is a learnable function and $[\cdot]$ represents the multiset.
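--
count: false

A minimal PyTorch sketch of one such layer (an illustrative instance of $f$ using sum aggregation and a small MLP, not a reference implementation; the class and argument names are made up for this example):

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    # h_i^{l+1} = f(h_i^l, multiset of neighboring h_j^l); here the multiset
    # is aggregated by a sum and f is taken to be a small MLP (an assumption).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, adj, h):
        # adj: (n, n) adjacency matrix, h: (n, in_dim) node features
        agg = adj @ h  # sum of the neighbors' features
        return self.mlp(torch.cat([h, agg], dim=-1))

# toy usage: a triangle graph with random scalar features
adj = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
h0 = torch.rand(3, 1)
h1 = MessagePassingLayer(1, 8)(adj, h0)  # features at layer 1, shape (3, 8)
```

The sum is one common way to handle the neighbor multiset; a mean or a max would work as well.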
--
count: false

The message passing layer $\mathbb{F}^n\to \mathbb{F}^n$ mapping the features $h^\ell$ at layer $\ell$ to the features $h^{\ell+1}$ at layer $\ell+1$ is equivariant.

---

# The many flavors of MGNN

The message passing layer can be expressed as (i.e. for each $f$ there exist $f\_0$ and $f\_1$ such that):
$$
h^{\ell+1}\_i = f\left( h\_i^\ell, \left[h\_j^\ell\right]\_{j\sim i}\right)= f\_0\left(h\_i^\ell, \sum\_{j\sim i}f\_1\left( h^\ell\_i, h\_j^\ell\right)\right).
$$
By varying the functions $f\_0$ and $f\_1$, you get: [vanilla GCN](https://arxiv.org/abs/1609.02907), [GraphSage](https://arxiv.org/abs/1706.02216), [Graph Attention Network](https://arxiv.org/abs/1710.10903), [MoNet](https://openaccess.thecvf.com/content_cvpr_2017/html/Monti_Geometric_Deep_Learning_CVPR_2017_paper.html), [Gated Graph ConvNet](https://arxiv.org/abs/1711.07553), [Graph Isomorphism Networks](https://arxiv.org/abs/1810.00826)...

--
count: false

### A problem with regular graphs

These GNNs are unable to distinguish non-isomorphic $d$-regular graphs of the same size.

--
count: false

### Another problematic pair

.center.width-30[![](images/graphs/pbwl2.png)]

---

# Separating power

Let $\mathcal{F}$ be a set of functions $f$ defined on a set $X$, where each $f$ takes its values in some $Y\_f$. The equivalence relation $\rho(\mathcal{F})$ defined by $\mathcal{F}$ on $X$ is: for any $x, x' \in X$,
$$
(x, x') \in \rho(\mathcal{F}) \iff \forall f\in \mathcal{F},\ f(x) = f(x')\,.
$$
Given two sets of functions $\mathcal{F}$ and $\mathcal{E}$, we say that $\mathcal{F}$ is more separating than $\mathcal{E}$ if $\rho(\mathcal{F}) \subset \rho(\mathcal{E})$.

.center.width-30[![](images/graphs/separation.png)]

--
count: false

## What is the separating power of MGNNs?

---

# $2$-Weisfeiler-Lehman test

.center.width-70[![](images/graphs/2wl.png)]

Designed for the graph isomorphism problem, but non-isomorphic graphs might give the same output.

--
count: false

.center.width-50[![](images/graphs/2wl2.png)]

---

# How powerful are GNNs?

MGNNs are as powerful as the $2$-Weisfeiler-Lehman test:
$$
\rho(\text{MGNN}) = \rho(2\text{-WL}),
$$
as proved in [Xu et al. (2019)](https://arxiv.org/abs/1810.00826).

--
count: false

## Consequence:

.center.width-40[![](images/graphs/mgnnpb.png)]

---

# Results with GIN

.center.width-80[![](images/graphs/result_GIN.png)]

---

# Graphs as higher order tensors

.center.width-70[![](images/graphs/tensor3.png)]

---

# Invariant and equivariant linear operators

For an order-$k$ tensor $T\in \mathbb{R}^{n^k}$, we define for $\sigma\in \mathcal{S}\_n$:
$$
(\sigma \star T)\_{\sigma(i\_1),...,\sigma(i\_k)} = T\_{i\_1,...,i\_k}.
$$
A higher order tensor representation of a graph captures more information. Hence, it is tempting to construct GNNs with linear layers (LGNN).

--
count: false

A function $f:\mathbb{R}^{n^k}\to\mathbb{R}$ is said to be .red[invariant] if $f(\sigma \star T) = f(T)$ for every permutation $\sigma$.

A function $f:\mathbb{R}^{n^k}\to\mathbb{R}^{n^\ell}$ is said to be .red[equivariant] if $f(\sigma \star T) = \sigma\star f(T)$.

--
count: false

The space of equivariant .red[linear] operators $f:\mathbb{R}^{n^k}\to\mathbb{R}^{n^\ell}$ has a basis of size $b(k+\ell)$ (a Bell number). The dimension of this space does not depend on the number of nodes $n$: the same linear layer can be applied to graphs of different sizes... but such layers cannot be very expressive.
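--
count: false

As a quick numerical illustration (not from the cited paper): in the simplest case $k=\ell=1$, the $b(2)=2$ basis operators are the identity and the all-ones matrix, and the same two parameters define a layer for any $n$. The helper name below is made up for this sketch.

```python
import torch

def equivariant_linear(x, alpha, beta):
    # simplest equivariant linear layer (k = l = 1): [alpha*Id + beta*11^T] x
    return alpha * x + beta * x.sum() * torch.ones_like(x)

alpha, beta = 0.7, -0.3  # arbitrary parameters

# the same (alpha, beta) define a layer for vectors of any size n,
# and the layer commutes with permutations of the entries (equivariance)
for n in (3, 5, 10):
    x = torch.rand(n)
    perm = torch.randperm(n)
    assert torch.allclose(equivariant_linear(x[perm], alpha, beta),
                          equivariant_linear(x, alpha, beta)[perm])
```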
.citation[[Invariant and Equivariant Graph Networks](https://arxiv.org/abs/1812.09902)]

---

# Invariant linear GNN (LGNN)

A linear invariant function $f:\mathbb{R}^{n}\to\mathbb{R}$ is of the form $f(x) = \alpha 1^Tx$.

A linear equivariant function $f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is of the form $f(x) = [\alpha \mathrm{Id} + \beta 11^T]x$.

A linear invariant function $f:\mathbb{R}^{n\times n}\to\mathbb{R}$ is of the form

--
count: false

$$
f(A) = \alpha\sum\_{i\neq j} A\_{i,j} + \beta \sum\_{i=j}A\_{i,j}.
$$

In order to get more expressive layers, we need to use tensors of higher order. [On the universality of invariant networks](https://arxiv.org/abs/1901.09342) shows that order $k\geq n^2$ is needed to be able to approximate any invariant function. This is of little practical value... we need another approach!

---

# Folklore GNN (FGNN)

Inspired by the folklore Weisfeiler-Lehman test, [Maron et al. (2019)](https://arxiv.org/abs/1905.11136) proposed the following layer:
$$
h^{\ell+1}\_{i\to j} = f\_0\left( h^\ell\_{i\to j} , \sum\_{k\in V} f\_1(h^\ell\_{i\to k}) \odot f\_2(h^\ell\_{k\to j})\right),
$$
where $f\_0, f\_1$ and $f\_2$ are learnable functions.

--
count: false

They proved that FGNN has more separating power than MGNN and the same separating power as LGNN with tensors of order $3$:
$$
\rho(\text{FGNN}) = \rho(3\text{-LGNN}) \subsetneq \rho(\text{MGNN})= \rho(2\text{-WL}).
$$
More results are available in [Azizian et al. (2021)](https://openreview.net/forum?id=lxHgXYN4bwl): FGNN has the best approximation power among all architectures working with tensors of order $2$ presented so far.

### Main drawback

The layer requires a dense matrix multiplication.

---

# .gray[(1) Node embedding]
## .gray[Language model]
### .gray[one fixed graph, no signal. Ex: community detection]

# .gray[(2) Signal processing on graphs]
## .gray[Fourier analysis on graphs]
### .gray[one fixed graph, various signals. Ex: classification of signals]

# (3) Graph embedding
## Graph Neural Networks
### various graphs. Ex: classification of graphs

---

class: end-slide, center
count: false

The end.

.bold[[www.dataflowr.com](https://www.dataflowr.com)]