Invariant and equivariant layers with applications to GNN, PointNet and Transformers

author: Marc Lelarge, course: dataflowr

date: April 23, 2021

Invariant and equivariant functions

As shown in the module on GNN, invariant and equivariant functions are crucial for GNNs. For example, the message passing GNN (MGNN) layer is defined by:

$$h^{\ell+1}_i = f(h^\ell_i , \{\{ h^\ell_j\}\}_{j\sim i}), \tag{1}$$

where $i\sim j$ means that nodes $i$ and $j$ are neighbors, and the function $f$ should not depend on the order of the elements in the multiset $\{\{ h^\ell_j\}\}_{j\sim i}$. This layer is applied in parallel to all nodes (with the same function $f$), producing a mapping from ${\bf h}^\ell = (h^\ell_1,\dots, h^\ell_n)$ to $F({\bf h}^\ell) = {\bf h}^{\ell+1}$ with $F:\mathbb{R}^n \to \mathbb{R}^n$, where $n$ is the number of nodes in the graph (and only real hidden states are considered for simplicity). It is easy to see that $F$ is an equivariant function, i.e. permuting its input permutes its output.
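To make this concrete, here is a minimal numpy check of the equivariance of such a layer; the particular choice of $f$ (averaging $h_i$ with the sum over the neighbor multiset) and the random graph are only toy choices made for this sanity check, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                    # random symmetric adjacency, no self-loops
h = rng.random(n)                                 # one real hidden state per node

def mgnn_layer(h, A):
    # toy instance of f: average h_i with the sum over the neighbor multiset {{h_j}}_{j~i}
    return 0.5 * (h + A @ h)

sigma = rng.permutation(n)
P = np.zeros((n, n)); P[sigma, np.arange(n)] = 1  # permutation matrix: (P v)[sigma[i]] = v[i]

# equivariance: relabeling the nodes (hidden states and graph) permutes the output accordingly
print(np.allclose(mgnn_layer(P @ h, P @ A @ P.T), P @ mgnn_layer(h, A)))   # True
```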

Another example of invariant and equivariant functions is given by the attention layer $\text{Attention}(Q,K,V) = Z$, defined for $Q$ a tensor of row queries, $K$ the keys and $V$ the values, with $Q,K,V\in \mathbb{R}^{n\times d}$, by

$$Z_j = \sum_{i=1}^n \text{softmax}_i(Q_jK_i^T) V_i.$$

The queries are obtained from a tensor $X\in \mathbb{R}^{n\times c}$ by $Q= XW_Q^T$, and the keys and values are obtained from a tensor $X' \in \mathbb{R}^{n\times c'}$ by $K = X' W_K^T$ and $V = X' W_V^T$. We see that when the queries are fixed, the attention layer is invariant in the pair (keys, values):

$$Z_j = \sum_{i=1}^n \text{softmax}_{i}(Q_j K_{\sigma(i)}^T) V_{\sigma(i)},$$

hence $\text{Attention}(X,X')$ is invariant in $X'$. Similarly, when the pair (keys, values) is fixed, the attention layer is equivariant in the queries:

$$Z_{\sigma(j)} = \sum_{i=1}^n \text{softmax}_{i}(Q_{\sigma(j)}K_{i}^T) V_{i},$$

hence $\text{Attention}(X,X')$ is equivariant in $X$. If $X'=X$, we get the self-attention layer, so that $\text{SelfAttention}(X) = \text{Attention}(X,X)$ is equivariant in $X$.
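We can check both properties numerically; the dimensions, the random weight matrices and the row-wise softmax below are arbitrary choices made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, cp = 5, 4, 3, 3

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Xp, WQ, WK, WV):
    Q, K, V = X @ WQ.T, Xp @ WK.T, Xp @ WV.T      # queries from X, keys/values from X'
    return softmax(Q @ K.T, axis=-1) @ V          # Z_j = sum_i softmax_i(Q_j K_i^T) V_i

X, Xp = rng.random((n, c)), rng.random((n, cp))
WQ, WK, WV = rng.random((d, c)), rng.random((d, cp)), rng.random((d, cp))
sigma = rng.permutation(n)

Z = attention(X, Xp, WQ, WK, WV)
print(np.allclose(attention(X, Xp[sigma], WQ, WK, WV), Z))         # invariant in X'
print(np.allclose(attention(X[sigma], Xp, WQ, WK, WV), Z[sigma]))  # equivariant in X
```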

In this post, we will characterize invariant and equivariant functions following the ideas given in the paper Deep Sets.

Representation of invariant and equivariant functions

We start with some definitions.

For a vector ${\bf x} = (x_1,\dots, x_n)\in \mathbb{R}^n$ and a permutation $\sigma \in \mathcal{S}_n$, we define

$$\sigma \star {\bf x} = (x_{\sigma^{-1}(1)},\dots, x_{\sigma^{-1}(n)}).$$

Definitions:

  • A function $f:\mathbb{R}^n\to \mathbb{R}$ is invariant if for all ${\bf x}$ and all permutations $\sigma\in\mathcal{S}_n$, $f(\sigma \star {\bf x}) = f({\bf x})$.

  • A function $f:\mathbb{R}^n\to \mathbb{R}^n$ is equivariant if for all ${\bf x}$ and all permutations $\sigma\in\mathcal{S}_n$, $f(\sigma \star {\bf x}) = \sigma \star f({\bf x})$.
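These definitions are easy to check numerically; in the sketch below, the sum is used as an example of an invariant function and $x \mapsto x + \sum_i x_i$ as an example of an equivariant one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x, sigma = rng.random(n), rng.permutation(n)

def star(sigma, x):
    # sigma ⋆ x as defined above: (sigma ⋆ x)[sigma[i]] = x[i]
    out = np.empty_like(x)
    out[sigma] = x
    return out

f_inv = lambda x: x.sum()              # invariant: f(sigma ⋆ x) = f(x)
f_equi = lambda x: x + x.sum()         # equivariant: f(sigma ⋆ x) = sigma ⋆ f(x)

print(np.isclose(f_inv(star(sigma, x)), f_inv(x)))                  # True
print(np.allclose(f_equi(star(sigma, x)), star(sigma, f_equi(x))))  # True
```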

We can now state our main result:

Theorem

  • invariant case: let $f:[0,1]^n \to \mathbb R$ be a continuous function. $f$ is invariant if and only if there are continuous functions $\phi: [0,1] \to \mathbb R^n$ and $\rho: \mathbb R^n\to \mathbb R$ such that

$$f({\bf x}) = \rho\left( \sum_{i=1}^n \phi(x_i)\right) \tag{6}$$
  • equivariant case: let $f:[0,1]^n \to \mathbb R^n$ be a continuous function. $f$ is equivariant if and only if there are continuous functions $\phi: [0,1] \to \mathbb R^n$ and $\rho: [0,1]\times \mathbb R^n\to \mathbb R$ such that

$$f_j({\bf x}) = \rho\left( x_j, \sum_{i=1}^n \phi(x_i)\right) \tag{7}$$

We give some remarks before providing the proof below. For the sake of simplicity, we consider here a fixed number of points $n$ on the unit interval $[0,1]$. For results with a varying number of points, see On the Limitations of Representing Functions on Sets, and for points in higher dimension $[0,1]^d$ with $d>1$, see On Universal Equivariant Set Networks and Expressive Power of Invariant and Equivariant Graph Neural Networks.

Our proof will make the mapping $\phi$ explicit, and it will not depend on the function $f$. The mapping $\phi$ can be seen as an embedding of the points of $[0,1]$ in a space of high dimension. Indeed, this embedding space has to be of dimension at least the number of points $n$ in order to ensure universality. This is an important remark: in a learning scenario, the size of the embedding is typically fixed, and this will limit the expressiveness of the algorithm.
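For illustration, here is a minimal PyTorch sketch of the two forms appearing in the theorem, with learned $\phi$ and $\rho$; the module names, the MLP sizes and the embed_dim hyperparameter are our own choices and, as noted above, taking embed_dim smaller than $n$ loses universality.

```python
import torch
import torch.nn as nn

class DeepSetsInvariant(nn.Module):
    """rho( sum_i phi(x_i) ), the invariant form (6)."""
    def __init__(self, embed_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, embed_dim))
        self.rho = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                         # x: (batch, n) points in [0,1]
        e = self.phi(x.unsqueeze(-1)).sum(dim=1)  # sum-pool the embeddings over the set
        return self.rho(e)                        # (batch, 1)

class DeepSetsEquivariant(nn.Module):
    """f_j(x) = rho( x_j, sum_i phi(x_i) ), the equivariant form (7)."""
    def __init__(self, embed_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, embed_dim))
        self.rho = nn.Sequential(nn.Linear(embed_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                                        # x: (batch, n)
        e = self.phi(x.unsqueeze(-1)).sum(dim=1, keepdim=True)   # pooled embedding, (batch, 1, embed_dim)
        e = e.expand(-1, x.shape[1], -1)                         # broadcast it to every position
        return self.rho(torch.cat([x.unsqueeze(-1), e], dim=-1)).squeeze(-1)

# quick equivariance check
x, perm = torch.rand(2, 5), torch.randperm(5)
model = DeepSetsEquivariant(embed_dim=5)
print(torch.allclose(model(x)[:, perm], model(x[:, perm]), atol=1e-6))   # True
```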

Coming back to the GNN layer (1), our result on the invariant case tells us that we can always rewrite it as:

$$h^{\ell+1}_i =\rho\left( h_i^{\ell}, \sum_{j\sim i} \phi(h^\ell_j)\right), \tag{8}$$

and the dimension of the embedding $\phi(h)$ needs to be of the same order as the maximum degree in the graph. Note that (8) is not of the form of (7), as the sum inside the $\rho$ function is taken only over neighbors. Indeed, we know that message passing GNNs are not universal (see Expressive Power of Invariant and Equivariant Graph Neural Networks).
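In PyTorch, a layer of the form (8) with learned $\phi$ and $\rho$ can be sketched as follows; the MLP sizes and the dense-adjacency message passing are our own simplifications for the sketch.

```python
import torch
import torch.nn as nn

class MGNNLayer(nn.Module):
    """h_i^{l+1} = rho( h_i^l, sum_{j~i} phi(h_j^l) ), the form (8)."""
    def __init__(self, embed_dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, embed_dim))
        self.rho = nn.Sequential(nn.Linear(embed_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h, A):                   # h: (n,) hidden states, A: (n, n) adjacency matrix
        msg = A @ self.phi(h.unsqueeze(-1))    # sum of phi(h_j) over the neighbors of each node
        return self.rho(torch.cat([h.unsqueeze(-1), msg], dim=-1)).squeeze(-1)
```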

As a last remark, note that the original PointNet architecture $f$ is of the form $f_i({\bf x}) = \rho(x_i)$, which is not universal equivariant. Indeed, it is impossible to approximate the equivariant function $g_i({\bf x}) = \sum_i x_i$, as shown below (we denote ${\bf e}_1=(1,0,\dots,0)$):

$$\begin{aligned}
\|f(0) - g(0)\|^2 &= n \rho(0)^2,\\
\|f({\bf e}_1) -g({\bf e}_1)\|^2 &= (\rho(1)-1)^2 + (n-1)(\rho(0)-1)^2\geq (n-1)(\rho(0)-1)^2,
\end{aligned}$$

and these quantities cannot be small together: whatever the value of $\rho(0)$, one of them is at least $(n-1)/4$. Hence PointNet is not universal equivariant, but as shown in On Universal Equivariant Set Networks, modifying PointNet by adding the term $\sum_{i=1}^n \phi(x_i)$ inside the $\rho$ function as in (7) makes it universal equivariant. We refer to Are Transformers universal approximators of sequence-to-sequence functions? for similar results about transformers based on self-attention.
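The lower bound can also be checked numerically; in the sketch below, $n$ and the grid of candidate values for $\rho(0)$ are arbitrary, and the second error term drops the non-negative $(\rho(1)-1)^2$ contribution.

```python
import numpy as np

n = 10
rho0 = np.linspace(-2.0, 3.0, 2001)        # sweep candidate values of rho(0)
err_at_zero = n * rho0**2                  # ||f(0) - g(0)||^2
err_at_e1 = (n - 1) * (rho0 - 1)**2        # lower bound on ||f(e_1) - g(e_1)||^2
print(np.maximum(err_at_zero, err_at_e1).min())   # ~2.37 here, never below (n-1)/4 = 2.25
```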

Proof of the Theorem

We first show that the equivariant case is not more difficult than the invariant case. Assume that we have proved the invariant case. Consider a permutation $\sigma\in \mathcal S_n$ such that $\sigma(1)=1$, so that $f(\sigma \star {\bf x}) = \sigma \star f({\bf x})$ gives for the first component:

$$f_1(x_1,x_{\sigma(2)},\dots, x_{\sigma(n)}) = f_1(x_1,x_2,\dots, x_n).$$

For any $x_1$, the mapping $(x_2,\dots, x_n) \mapsto f_1(x_1, x_2,\dots, x_n)$ is invariant. Hence by (6), we have

$$f_1(x_1,x_2,\dots, x_n) = \rho\left(x_1, \sum_{i\neq 1}\phi(x_i) \right)$$

Now consider a permutation such that $\sigma(1)=k$, $\sigma(k)=1$ and $\sigma(i)=i$ for $i\neq 1,k$; then we have

$$f_k(x_1,x_2,\dots, x_n) = f_1(x_k,x_2,\dots, x_1,\dots, x_n),$$

hence $f_k(x_1,x_2,\dots, x_n)=\rho\left(x_k, \sum_{i\neq k}\phi(x_i) \right)$, and (7) follows (up to replacing $\rho(x,s)$ by $\rho(x, s-\phi(x))$ so that the sum runs over all $i$).

Hence, we only need to prove (6), and we follow the proof given in Deep Sets. We start with a crucial result stating that a set of $n$ real points is characterized by the first $n$ moments of its empirical measure. Let's see what it means for $n=2$: we can recover the values of $x_1$ and $x_2$ from the quantities $p_1=x_1+x_2$ and $p_2=x_1^2+x_2^2$. To see that this is correct, note that

$$p_1^2 = x_1^2+2x_1x_2+x_2^2 = p_2+2x_1x_2,$$

so that $x_1x_2 = \frac{p_1^2-p_2}{2}$. As a result, we have

$$(x-x_1)(x-x_2) = x^2-p_1x+\frac{p_1^2-p_2}{2},$$

and clearly $x_1$ and $x_2$ can be recovered as the roots of this polynomial, whose coefficients are functions of $p_1$ and $p_2$. The result below extends this argument to a general $n$:
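In code, the $n=2$ recovery looks as follows (a small numpy sketch with arbitrary values for $x_1$ and $x_2$):

```python
import numpy as np

x1, x2 = 0.2, 0.7
p1, p2 = x1 + x2, x1**2 + x2**2
# roots of x^2 - p1*x + (p1^2 - p2)/2, whose coefficients depend only on p1 and p2
print(np.sort(np.roots([1.0, -p1, (p1**2 - p2) / 2])))   # [0.2 0.7]: the multiset {x1, x2} is recovered
```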

Proposition

The mapping $\Phi:[0,1]_{\leq}^n \to \mathbb{R}^{n}$, where $[0,1]_{\leq}^n = \{ {\bf x}\in [0,1]^n,\: x_1\leq x_2\leq \dots\leq x_n\}$, defined by

$$\Phi(x_1,x_2,\dots, x_n) = \left( \sum_i x_i, \sum_i x_i^2,\dots, \sum_i x_i^n\right)$$

is injective and has a continuous inverse mapping.

The proof follows from Newton's identities. For $k\leq n$, we denote by $p_k = \sum_{i=1}^n x_i^k$ the power sums and by $e_k$ the elementary symmetric polynomials (note that all these polynomials are functions of $x_1,\dots, x_n$):

$$\begin{aligned}
e_0 &= 1,\\
e_1 &= \sum_i x_i,\\
e_2 &= \sum_{i < j} x_i x_j,\\
&\dots
\end{aligned}$$

From Newton's identities, we have, for $k\leq n$,

$$k e_k = \sum_{i=1}^k (-1)^{i-1}e_{k-i}p_i,$$

so that we can express the elementary symmetric polynomials from the power sums:

$$\begin{aligned}
e_1 &= p_1,\\
2e_2 &= e_1p_1-p_2 = p_1^2-p_2,\\
3e_3 &= e_2p_1-e_1p_2+p_3 = \frac{1}{2}p_1^3-\frac{3}{2}p_1p_2+p_3,\\
&\dots
\end{aligned}$$
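These identities are easy to verify numerically; in the sketch below, the brute-force values of the $e_k$ are computed directly as sums of products over $k$-subsets.

```python
import numpy as np
from itertools import combinations
from math import prod

x = np.random.default_rng(0).random(5)
n = len(x)
p = [np.sum(x**k) for k in range(1, n + 1)]        # power sums p_1, ..., p_n

e = [1.0]                                          # e_0 = 1
for k in range(1, n + 1):                          # k e_k = sum_{i=1}^k (-1)^{i-1} e_{k-i} p_i
    e.append(sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1)) / k)

# direct computation of e_k as sums of products over k-subsets of the points
e_direct = [1.0] + [sum(prod(c) for c in combinations(x, k)) for k in range(1, n + 1)]
print(np.allclose(e, e_direct))                    # True
```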

Note that $\Phi(x_1,x_2,\dots, x_n) = (p_1,\dots, p_n)$, and since

$$\prod_{i=1}^n (x-x_i) = x^n -e_1x^{n-1}+e_2x^{n-2}-\dots + (-1)^n e_n,$$

if $\Phi({\bf x}) = \Phi({\bf y})$ then $\prod_{i=1}^n (x-x_i)=\prod_{i=1}^n (x-y_i)$, so that $\{\{x_1,\dots, x_n\}\} = \{\{y_1,\dots, y_n\}\}$ and ${\bf x}={\bf y} \in [0,1]^n_{\leq}$, showing that $\Phi$ is injective.

Hence we proved that $\Phi:[0,1]^n_{\leq} \to \text{Im}(\Phi)$, where $\text{Im}(\Phi)$ is the image of $\Phi$, is a bijection. We now need to prove that $\Phi^{-1}$ is continuous, and we prove it directly. Let ${\bf y}_k \to {\bf y} \in\text{Im}(\Phi)$; we need to show that $\Phi^{-1}({\bf y}_k) \to \Phi^{-1}({\bf y})$. If $\Phi^{-1}({\bf y}_k) \not\to \Phi^{-1}({\bf y})$, since $[0,1]^n_{\leq}$ is compact, there exists a convergent subsequence of $\Phi^{-1}({\bf y}_{k})$ with $\Phi^{-1}({\bf y}_{m_k}) \to {\bf x}\neq \Phi^{-1}({\bf y})$. But by continuity of $\Phi$, we have ${\bf y}_{m_k} \to \Phi({\bf x}) = {\bf y}$, so that ${\bf x} = \Phi^{-1}({\bf y})$, a contradiction. This proves the continuity of $\Phi^{-1}$ and finishes the proof of the proposition.
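The inverse map can be made explicit: rebuild the coefficients of $\prod_i(X-x_i)$ from the power sums via Newton's identities and take the roots of this polynomial. Below is a numpy sketch of this round trip (for larger $n$, recovering roots from power sums becomes numerically delicate, but this does not affect the continuity argument above).

```python
import numpy as np

def Phi(x):
    # power-sum map on sorted points: (p_1, ..., p_n)
    return np.array([np.sum(np.sort(x) ** k) for k in range(1, len(x) + 1)])

def Phi_inv(p):
    # elementary symmetric polynomials from Newton's identities, then roots of prod_i (X - x_i)
    n = len(p)
    e = [1.0]
    for k in range(1, n + 1):
        e.append(sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1)) / k)
    coeffs = [(-1) ** k * e[k] for k in range(n + 1)]   # X^n - e_1 X^{n-1} + ... + (-1)^n e_n
    return np.sort(np.roots(coeffs).real)

x = np.sort(np.random.default_rng(0).random(4))
print(np.allclose(Phi_inv(Phi(x)), x))                  # True: Phi is inverted on [0,1]^n_<=
```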

We are now ready to prove (6). Let $\phi:[0,1] \to \mathbb R^n$ be defined by $\phi(x) = (x,x^2,\dots, x^n)$ and let $\rho = f\circ \Phi^{-1}$. Note that $\rho: \text{Im}(\Phi) \to \mathbb R$ and $\sum_{i}\phi(x_i) = \Phi({\bf x}_{\leq})$, where ${\bf x}_{\leq}$ is the vector ${\bf x}$ with components sorted in non-decreasing order. Hence, as soon as $f$ is invariant, we have $f({\bf x}) = f({\bf x}_{\leq})$, so that (6) is valid. We only need to extend the function $\rho$ from the domain $\text{Im}(\Phi)$ to $\mathbb R^n$ in a continuous way. This can be done by considering the projection $\pi$ onto the compact set $\text{Im}(\Phi)$ and defining $\rho({\bf x}) = f\circ \Phi^{-1}(\pi({\bf x}))$.
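Putting everything together, here is a sketch of the construction used in the proof: $\phi(x) = (x, x^2, \dots, x^n)$ and $\rho = f\circ \Phi^{-1}$, checked on the invariant function $f = \max$ (the choice of $f$, the random points and the helper names are ours).

```python
import numpy as np

def phi(x, n):
    # the explicit embedding phi(x) = (x, x^2, ..., x^n)
    return np.array([x ** k for k in range(1, n + 1)])

def rho(s, f):
    # rho = f o Phi^{-1}: recover the sorted points from the power sums s, then apply f
    n = len(s)
    e = [1.0]
    for k in range(1, n + 1):
        e.append(sum((-1) ** (i - 1) * e[k - i] * s[i - 1] for i in range(1, k + 1)) / k)
    points = np.sort(np.roots([(-1) ** k * e[k] for k in range(n + 1)]).real)
    return f(points)

f = np.max                                    # any continuous invariant function
x = np.random.default_rng(1).random(5)
s = sum(phi(xi, len(x)) for xi in x)          # sum_i phi(x_i) = Phi(x sorted)
print(np.isclose(rho(s, f), f(x)))            # True: f(x) = rho(sum_i phi(x_i))
```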


Thanks for reading!