author: Marc Lelarge, course: dataflowr
date: April 23, 2021
As shown in the module on GNN, invariant and equivariant functions are crucial for GNN. For example, the message passing GNN (MGNN) layer is defined by:
$$h^{\ell+1}_i = f\left(h^\ell_i, \{\{ h^\ell_j \}\}_{j\sim i}\right), \tag{1}$$

where $i\sim j$ means that nodes $i$ and $j$ are neighbors, and the function $f$ should not depend on the order of the elements in the multiset $\{\{ h^\ell_j \}\}_{j\sim i}$. This layer is applied in parallel to all nodes (with the same function $f$), producing a mapping from $h^\ell = (h^\ell_1,\dots, h^\ell_n)$ to $F(h^\ell) = h^{\ell+1}$ with $F:\mathbb{R}^n\to\mathbb{R}^n$, where $n$ is the number of nodes in the graph (and only real hidden states are considered for simplicity). It is easy to see that $F$ is an equivariant function, i.e. permuting its input will permute its output.
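To make this concrete, here is a minimal sketch in PyTorch, with a hypothetical choice $f(h, S) = \mathrm{ReLU}(a\,h + b\sum S)$ (not a layer from the course), checking numerically that permuting the hidden states, together with the corresponding relabeling of the adjacency matrix, permutes the output of the layer.

```python
# A minimal sketch of a message passing layer and of its permutation equivariance.
# f(h, S) = relu(a*h + b*sum(S)) is an arbitrary illustrative choice of f.
import torch

def mpnn_layer(h, adj, a=0.5, b=1.5):
    # h: (n,) real hidden states, adj: (n, n) symmetric 0/1 adjacency matrix.
    # The sum over the multiset of neighbors is adj @ h, which is order-independent.
    return torch.relu(a * h + b * (adj @ h))

n = 5
h = torch.randn(n)
adj = (torch.rand(n, n) < 0.4).float()
adj = ((adj + adj.T) > 0).float()
adj.fill_diagonal_(0)

perm = torch.randperm(n)          # a permutation of the nodes
h_perm = h[perm]                  # permuted hidden states
adj_perm = adj[perm][:, perm]     # the graph relabeled by the same permutation

# equivariance: F(permuted input) == permuted F(input)
print(torch.allclose(mpnn_layer(h_perm, adj_perm),
                     mpnn_layer(h, adj)[perm], atol=1e-6))   # True
```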
Another example of invariant and equivariant functions is given by the attention layer $\mathrm{Attention}(Q,K,V) = Z$, defined for $Q$ a tensor of row queries, $K$ the keys and $V$ the values, with $Q,K,V\in\mathbb{R}^{n\times d}$, by
$$Z_j = \sum_{i=1}^n \mathrm{softmax}_i\left(Q_j K_i^T\right) V_i. \tag{2}$$

The queries are obtained from a tensor $X\in\mathbb{R}^{n\times c}$ by $Q = XW_Q^T$, and the keys and values are obtained from a tensor $X'\in\mathbb{R}^{n\times c'}$ by $K = X'W_K^T$ and $V = X'W_V^T$. We see that when the queries are fixed, the attention layer is invariant in the pair (keys, values):
$$Z_j = \sum_{i=1}^n \mathrm{softmax}_i\left(Q_j K_{\sigma(i)}^T\right) V_{\sigma(i)}, \tag{3}$$

hence $\mathrm{Attention}(X, X')$ is invariant in $X'$. Similarly, when the pair (keys, values) is fixed, the attention layer is equivariant in the queries:
$$Z_{\sigma(j)} = \sum_{i=1}^n \mathrm{softmax}_i\left(Q_{\sigma(j)} K_i^T\right) V_i, \tag{4}$$

hence $\mathrm{Attention}(X, X')$ is equivariant in $X$. If $X' = X$, we get the self-attention layer, so that $\mathrm{SelfAttention}(X) = \mathrm{Attention}(X, X)$ is equivariant in $X$.
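Below is a minimal sketch of this layer together with numerical checks of (3) and (4); the shapes $n, c, c', d$ and the random weight matrices are illustrative placeholders only.

```python
# A minimal sketch of Z_j = sum_i softmax_i(Q_j K_i^T) V_i and of its two symmetries:
# invariance in X' (keys and values) and equivariance in X (queries).
import torch

def attention(X, Xp, WQ, WK, WV):
    Q, K, V = X @ WQ.T, Xp @ WK.T, Xp @ WV.T
    return torch.softmax(Q @ K.T, dim=-1) @ V   # softmax over i, for each fixed row j

n, c, cp, d = 6, 4, 5, 3
X, Xp = torch.randn(n, c), torch.randn(n, cp)
WQ, WK, WV = torch.randn(d, c), torch.randn(d, cp), torch.randn(d, cp)
perm = torch.randperm(n)

# invariance in X': permuting the rows of X' permutes keys and values together
print(torch.allclose(attention(X, Xp[perm], WQ, WK, WV),
                     attention(X, Xp, WQ, WK, WV), atol=1e-6))          # True
# equivariance in X: permuting the queries permutes the rows of Z
print(torch.allclose(attention(X[perm], Xp, WQ, WK, WV),
                     attention(X, Xp, WQ, WK, WV)[perm], atol=1e-6))    # True
```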
In this post, we will characterize invariant and equivariant functions following the ideas given in the paper Deep Sets.
We start with some definitions.
For a vector $x = (x_1,\dots,x_n)\in\mathbb{R}^n$ and a permutation $\sigma\in\mathcal{S}_n$, we define
$$\sigma\star x = \left(x_{\sigma^{-1}(1)},\dots, x_{\sigma^{-1}(n)}\right). \tag{5}$$

Definitions:
- A function $f:\mathbb{R}^n\to\mathbb{R}$ is invariant if for all $x$ and all $\sigma\in\mathcal{S}_n$, we have $f(\sigma\star x) = f(x)$.
- A function $f:\mathbb{R}^n\to\mathbb{R}^n$ is equivariant if for all $x$ and all $\sigma\in\mathcal{S}_n$, we have $f(\sigma\star x) = \sigma\star f(x)$.
We can now state our main result:
Theorem
- invariant case: let $f:[0,1]^n\to\mathbb{R}$ be a continuous function. $f$ is invariant if and only if there are continuous functions $\phi:[0,1]\to\mathbb{R}^n$ and $\rho:\mathbb{R}^n\to\mathbb{R}$ such that
$$f(x) = \rho\left(\sum_{i=1}^n \phi(x_i)\right). \tag{6}$$
- equivariant case: let $f:[0,1]^n\to\mathbb{R}^n$ be a continuous function. $f$ is equivariant if and only if there are continuous functions $\phi:[0,1]\to\mathbb{R}^n$ and $\rho:[0,1]\times\mathbb{R}^n\to\mathbb{R}$ such that
$$f_j(x) = \rho\left(x_j, \sum_{i=1}^n \phi(x_i)\right). \tag{7}$$

We give some remarks before providing the proof below. For the sake of simplicity, we consider here a fixed number of points $n$ on the unit interval $[0,1]$. For results with a varying number of points, see On the Limitations of Representing Functions on Sets, and for points in higher dimension $[0,1]^d$ with $d>1$, see On Universal Equivariant Set Networks and Expressive Power of Invariant and Equivariant Graph Neural Networks.
Our proof will make the mapping $\phi$ explicit, and it will not depend on the function $f$. The mapping $\phi$ can be seen as an embedding of the points of $[0,1]$ into a space of high dimension. Indeed, this embedding space has to be of dimension at least the number of points $n$ in order to ensure universality. This is an important remark: in a learning scenario, the size of the embedding is typically fixed, and this will limit the expressiveness of the algorithm.
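For instance, a Deep Sets-style parametrization of the two forms (6) and (7) with such a fixed embedding size could look like the following sketch (the MLPs playing the roles of $\phi$ and $\rho$, and their sizes, are arbitrary choices and not dictated by the theorem):

```python
# A minimal Deep Sets-style sketch of the invariant form (6) and equivariant form (7).
import torch
import torch.nn as nn

class InvariantNet(nn.Module):
    # f(x) = rho(sum_i phi(x_i)), form (6)
    def __init__(self, emb=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, emb), nn.ReLU(), nn.Linear(emb, emb))
        self.rho = nn.Sequential(nn.Linear(emb, emb), nn.ReLU(), nn.Linear(emb, 1))

    def forward(self, x):                              # x: (batch, n)
        s = self.phi(x.unsqueeze(-1)).sum(dim=1)       # sum over the n points
        return self.rho(s).squeeze(-1)                 # (batch,)

class EquivariantNet(nn.Module):
    # f_j(x) = rho(x_j, sum_i phi(x_i)), form (7)
    def __init__(self, emb=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, emb), nn.ReLU(), nn.Linear(emb, emb))
        self.rho = nn.Sequential(nn.Linear(emb + 1, emb), nn.ReLU(), nn.Linear(emb, 1))

    def forward(self, x):                              # x: (batch, n)
        s = self.phi(x.unsqueeze(-1)).sum(dim=1, keepdim=True)   # (batch, 1, emb)
        s = s.expand(-1, x.shape[1], -1)                         # same sum for each x_j
        return self.rho(torch.cat([x.unsqueeze(-1), s], dim=-1)).squeeze(-1)  # (batch, n)

x = torch.rand(2, 5)                                   # 5 points in [0,1]
perm = torch.randperm(5)
f, g = InvariantNet(), EquivariantNet()
print(torch.allclose(f(x[:, perm]), f(x), atol=1e-6))            # invariance
print(torch.allclose(g(x[:, perm]), g(x)[:, perm], atol=1e-6))   # equivariance
```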
Coming back to the GNN layer (1), our result on the invariant case tells us that we can always rewrite it as:
$$h^{\ell+1}_i = \rho\left(h^\ell_i, \sum_{j\sim i} \phi(h^\ell_j)\right), \tag{8}$$

and the dimension of the embedding $\phi(h)$ needs to be of the same order as the maximum degree in the graph. Note that (8) is not of the form (7), as the sum inside the $\rho$ function is taken only over the neighbors. Indeed, we know that message passing GNNs are not universal (see Expressive Power of Invariant and Equivariant Graph Neural Networks).
As a last remark, note that the original PointNet architecture $f$ is of the form $f_i(x) = \rho(x_i)$, which is not universal equivariant. Indeed, it is impossible to approximate the equivariant function $g_i(x) = \sum_j x_j$, as shown below (we denote $e_1 = (1,0,\dots,0)$):
$$\|f(0) - g(0)\|^2 = n\,\rho(0)^2,$$
$$\|f(e_1) - g(e_1)\|^2 = (\rho(1)-1)^2 + (n-1)(\rho(0)-1)^2 \geq (n-1)(\rho(0)-1)^2,$$

and these two quantities cannot both be small. Hence PointNet is not universal equivariant, but, as shown in On Universal Equivariant Set Networks, modifying PointNet by adding the term $\sum_{i=1}^n\phi(x_i)$ inside the $\rho$ function as in (7) makes it universal equivariant. We refer to Are Transformers universal approximators of sequence-to-sequence functions? for similar results about transformers based on self-attention.
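A small numerical illustration of this trade-off: whatever value we pick for $\rho(0)$ (swept over a hypothetical grid below), one of the two squared errors computed above remains bounded away from zero.

```python
# Illustrating why f_i(x) = rho(x_i) cannot approximate g_i(x) = sum_j x_j:
# whichever value rho(0) takes, one of the two squared errors stays large.
import numpy as np

n = 10
for rho0 in np.linspace(-1, 2, 7):            # sweep of hypothetical values of rho(0)
    d0 = n * rho0**2                          # ||f(0) - g(0)||^2
    d1_lower = (n - 1) * (rho0 - 1)**2        # lower bound on ||f(e1) - g(e1)||^2
    print(f"rho(0)={rho0:+.2f}  d0={d0:8.2f}  d1>={d1_lower:8.2f}")
```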
We first show that the equivariant case is not more difficult than the invariant case. Assume that we have proved the invariant case. Consider a permutation $\sigma\in\mathcal{S}_n$ such that $\sigma(1)=1$, so that $f(\sigma\star x) = \sigma\star f(x)$ gives for the first component:
$$f_1\left(x_1, x_{\sigma(2)},\dots, x_{\sigma(n)}\right) = f_1(x_1, x_2,\dots, x_n).$$

For any $x_1$, the mapping $(x_2,\dots,x_n)\mapsto f_1(x_1,x_2,\dots,x_n)$ is invariant. Hence by (6), we have
$$f_1(x_1, x_2,\dots, x_n) = \rho\left(x_1, \sum_{i\neq 1}\phi(x_i)\right).$$

Now consider a permutation such that $\sigma(1)=k$, $\sigma(k)=1$ and $\sigma(i)=i$ for $i\neq 1,k$; then we have
$$f_k(x_1, x_2,\dots, x_n) = f_1(x_k, x_2,\dots, x_1,\dots, x_n),$$

hence $f_k(x_1,x_2,\dots,x_n) = \rho\left(x_k, \sum_{i\neq k}\phi(x_i)\right)$, and (7) follows (to get exactly the form (7), replace $\rho(y,s)$ by $\tilde\rho(y,s) = \rho(y, s - \phi(y))$, so that the sum can be taken over all $i$).
Hence, we only need to prove (6), and we follow the proof given in Deep Sets. We start with a crucial result stating that a set of $n$ real points is characterized by the first $n$ moments of its empirical measure. Let us see what this means for $n=2$: we can recover the values of $x_1$ and $x_2$ from the quantities $p_1 = x_1 + x_2$ and $p_2 = x_1^2 + x_2^2$. To see that this is correct, note that
$$p_1^2 = x_1^2 + 2x_1x_2 + x_2^2 = p_2 + 2x_1x_2,$$

so that $x_1x_2 = \frac{p_1^2 - p_2}{2}$. As a result, we have
$$(x - x_1)(x - x_2) = x^2 - p_1 x + \frac{p_1^2 - p_2}{2},$$

and clearly $x_1$ and $x_2$ can be recovered as the roots of this polynomial, whose coefficients are functions of $p_1$ and $p_2$. The result below extends this argument to a general $n$:
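This two-point recovery is easy to check numerically, for instance with numpy (an illustrative sketch, with arbitrary values for $x_1, x_2$):

```python
# Recovering {x1, x2} from the first two power sums p1, p2 (the n = 2 case above).
import numpy as np

x1, x2 = 0.3, 0.8
p1, p2 = x1 + x2, x1**2 + x2**2
# roots of x^2 - p1*x + (p1^2 - p2)/2, whose coefficients only involve p1 and p2
roots = np.roots([1.0, -p1, (p1**2 - p2) / 2])
print(np.sort(roots))   # [0.3, 0.8] up to floating-point error
```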
Proposition
The mapping $\Phi:[0,1]^n_{\leq}\to\mathbb{R}^n$, where $[0,1]^n_{\leq} = \{x\in[0,1]^n,\ x_1\leq x_2\leq\dots\leq x_n\}$, defined by
$$\Phi(x_1, x_2,\dots, x_n) = \left(\sum_i x_i, \sum_i x_i^2,\dots, \sum_i x_i^n\right)$$
is injective and has a continuous inverse mapping.
The proof follows from Newton's identities. For $k\leq n$, we denote by $p_k = \sum_{i=1}^n x_i^k$ the power sums and by $e_k$ the elementary symmetric polynomials (note that all these polynomials are functions of $x_1,\dots,x_n$):
$$e_0 = 1,\quad e_1 = \sum_i x_i,\quad e_2 = \sum_{i<j} x_i x_j,\quad \dots$$

From Newton's identities, we have for $k\leq n$,
$$k\, e_k = \sum_{i=1}^k (-1)^{i-1} e_{k-i}\, p_i,$$

so that we can express the elementary symmetric polynomials from the power sums:
$$e_1 = p_1,\quad 2e_2 = e_1 p_1 - p_2 = p_1^2 - p_2,\quad 3e_3 = e_2 p_1 - e_1 p_2 + p_3 = \tfrac{1}{2}p_1^3 - \tfrac{3}{2}p_1 p_2 + p_3,\quad \dots$$

Note that $\Phi(x_1, x_2,\dots, x_n) = (p_1,\dots, p_n)$, and since
$$\prod_{i=1}^n (x - x_i) = x^n - e_1 x^{n-1} + e_2 x^{n-2} - \dots + (-1)^n e_n,$$

if $\Phi(x) = \Phi(y)$ then $\prod_{i=1}^n (x - x_i) = \prod_{i=1}^n (x - y_i)$, so that $\{\{x_1,\dots,x_n\}\} = \{\{y_1,\dots,y_n\}\}$ and $x = y\in[0,1]^n_{\leq}$, showing that $\Phi$ is injective.
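The same computation gives an explicit numerical inverse of $\Phi$ for a general $n$: convert the power sums to elementary symmetric polynomials with Newton's identities, then read off the points as the roots of the polynomial above. A sketch (the helper name newton_inverse is mine, not from the paper):

```python
# A sketch of Phi^{-1}: from the power sums (p_1, ..., p_n), Newton's identities
# give the elementary symmetric polynomials (e_1, ..., e_n), and the points are
# recovered as the roots of x^n - e1 x^{n-1} + e2 x^{n-2} - ... + (-1)^n e_n.
import numpy as np

def newton_inverse(p):
    n = len(p)
    e = [1.0]                                   # e_0 = 1
    for k in range(1, n + 1):
        # k * e_k = sum_{i=1}^{k} (-1)^{i-1} e_{k-i} p_i
        e.append(sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1)) / k)
    coeffs = [(-1) ** k * e[k] for k in range(n + 1)]   # coefficients of prod_i (x - x_i)
    return np.sort(np.roots(coeffs).real)

x = np.sort(np.random.rand(5))                  # 5 points in [0,1], sorted
p = [np.sum(x ** k) for k in range(1, 6)]       # Phi(x) = (p_1, ..., p_5)
print(np.allclose(newton_inverse(p), x, atol=1e-6))   # True, up to numerical error
```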
Hence we proved that $\Phi:[0,1]^n_{\leq}\to \mathrm{Im}(\Phi)$, where $\mathrm{Im}(\Phi)$ is the image of $\Phi$, is a bijection. We now need to prove that $\Phi^{-1}$ is continuous, and we prove it directly. Let $y^k\to y\in \mathrm{Im}(\Phi)$; we need to show that $\Phi^{-1}(y^k)\to \Phi^{-1}(y)$. Now if $\Phi^{-1}(y^k)\not\to \Phi^{-1}(y)$, since $[0,1]^n_{\leq}$ is compact, there exists a convergent subsequence of $\Phi^{-1}(y^k)$ with $\Phi^{-1}(y^{m_k})\to x\neq \Phi^{-1}(y)$. But by continuity of $\Phi$, we then have $y^{m_k}\to \Phi(x)\neq y$, so that we get a contradiction, proving the continuity of $\Phi^{-1}$ and finishing the proof of the proposition.
We are now ready to prove (6). Let $\phi:[0,1]\to\mathbb{R}^n$ be defined by $\phi(x) = (x, x^2,\dots, x^n)$ and $\rho = f\circ\Phi^{-1}$. Note that $\rho:\mathrm{Im}(\Phi)\to\mathbb{R}$ and $\sum_i \phi(x_i) = \Phi(x_{\leq})$, where $x_{\leq}$ is the vector $x$ with its components sorted in non-decreasing order. Hence as soon as $f$ is invariant, we have $f(x) = f(x_{\leq})$, so that (6) is valid. We only need to extend the function $\rho$ from the domain $\mathrm{Im}(\Phi)$ to $\mathbb{R}^n$ in a continuous way. This can be done by considering the projection $\pi$ onto the compact set $\mathrm{Im}(\Phi)$ and defining $\rho(x) = f\circ\Phi^{-1}(\pi(x))$.
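Reusing the newton_inverse sketch above, the explicit construction $\phi(x) = (x, x^2,\dots,x^n)$ and $\rho = f\circ\Phi^{-1}$ can be checked numerically on the interior of the domain for any invariant $f$ (here $f(x) = \max_i x_i - \min_i x_i$, an arbitrary example; the continuous extension by $\pi$ is not implemented):

```python
# Checking f(x) = rho(sum_i phi(x_i)) for the explicit phi and rho of the proof.
# Requires the newton_inverse helper defined in the previous sketch.
import numpy as np

n = 4
phi = lambda t: np.array([t ** k for k in range(1, n + 1)])   # phi(t) = (t, t^2, ..., t^n)
f = lambda x: np.max(x) - np.min(x)                           # an arbitrary invariant function

def rho(s):                      # rho = f o Phi^{-1}, defined on Im(Phi)
    return f(newton_inverse(s))

x = np.random.rand(n)
print(np.isclose(rho(sum(phi(t) for t in x)), f(x)))   # True: f(x) = rho(sum_i phi(x_i))
```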