Notes on Learning to Solve Routing [KHW-19]
Paper link. The paper is super clearly written.
Graph attention recap
For simplicity, we omit the layer index. Unless stated otherwise, the discussion pertains to one layer of a GNN.
Single-head attention. Let $h_i$ denote the embedding of a vertex $i$. Define $d_k$ and $d_v$ to be the dimensions of keys and values, respectively. Let $d_h$ denote the dimension of $h_i$.
In graph attention, the new embedding $h_i'$ in the next layer is computed as follows:

$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} a_{ij}\, v_j\Big),$$

where

- $\mathcal{N}(i)$ is the set of neighbors of $i$; $\sigma$ is a nonlinearity.
- $a_{ij}$ is the attention weight.
- $v_j$ is the value of vertex $j$.
Formally, $a_{ij}$ and $v_j$ are computed as follows. Let $W^Q$ ($d_k \times d_h$), $W^K$ ($d_k \times d_h$), and $W^V$ ($d_v \times d_h$) be the learnable parameters. For a vertex $i$, its query, key, and value are computed by projecting $h_i$:

$$q_i = W^Q h_i, \qquad k_i = W^K h_i, \qquad v_i = W^V h_i.$$
For a neighbor $j$ of $i$, the weight $a_{ij}$ is the output of a softmax over the set $\{u_{ij'} : j' \in \mathcal{N}(i)\}$, where the compatibility $u_{ij}$ is computed as follows:

$$u_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}, \qquad a_{ij} = \frac{e^{u_{ij}}}{\sum_{j' \in \mathcal{N}(i)} e^{u_{ij'}}}.$$
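
To make the recipe concrete, here is a minimal NumPy sketch of one single-head graph attention layer. The function and variable names are mine, not the paper's, and ReLU stands in for the unspecified nonlinearity $\sigma$:

```python
import numpy as np

def single_head_attention(h, neighbors, Wq, Wk, Wv):
    """One single-head graph attention layer (sketch).

    h         : (n, d_h) current vertex embeddings.
    neighbors : dict mapping vertex i to the list of its neighbors N(i).
    Wq, Wk    : (d_k, d_h) query / key projection matrices.
    Wv        : (d_v, d_h) value projection matrix.
    Returns     (n, d_v) updated embeddings h_i'.
    """
    d_k = Wq.shape[0]
    q = h @ Wq.T                                  # queries  q_i = Wq h_i
    k = h @ Wk.T                                  # keys     k_i = Wk h_i
    v = h @ Wv.T                                  # values   v_i = Wv h_i

    h_new = np.zeros((h.shape[0], Wv.shape[0]))
    for i, nbrs in neighbors.items():
        u = (q[i] @ k[nbrs].T) / np.sqrt(d_k)     # compatibilities u_ij
        a = np.exp(u - u.max())
        a /= a.sum()                              # softmax -> attention weights a_ij
        h_new[i] = a @ v[nbrs]                    # weighted sum of neighbor values
    return np.maximum(h_new, 0.0)                 # nonlinearity sigma (ReLU here, as an example)

# Tiny usage example on a fully connected graph of 5 vertices.
rng = np.random.default_rng(0)
n, d_h, d_k, d_v = 5, 16, 8, 8
h = rng.normal(size=(n, d_h))
neighbors = {i: [j for j in range(n) if j != i] for i in range(n)}
Wq, Wk = rng.normal(size=(d_k, d_h)), rng.normal(size=(d_k, d_h))
Wv = rng.normal(size=(d_v, d_h))
h_new = single_head_attention(h, neighbors, Wq, Wk, Wv)   # shape (5, 8)
```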
Multi-head attention. Let $M$ be the number of heads, and compute $h_i'$ independently $M$ times, resulting in a set of $M$ embeddings. Importantly, each head has its own set of learnable projection matrices. Following this, one can set $d_k = d_v = d_h / M$, concatenate the $M$ embeddings, and apply another projection. One can also average instead of concatenating if multi-head attention is used in the final prediction layer.
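
Continuing the sketch above (again with names of my own choosing), multi-head attention runs the single-head layer $M$ times with separate projections, concatenates the results, and applies an output projection, assuming $d_k = d_v = d_h / M$:

```python
def multi_head_attention(h, neighbors, heads, Wo):
    """Multi-head graph attention: concatenate the M heads, then project (sketch).

    heads : list of M tuples (Wq, Wk, Wv), one set of projections per head.
    Wo    : (d_h, M * d_v) output projection applied to the concatenation.
    """
    per_head = [single_head_attention(h, neighbors, Wq, Wk, Wv)
                for Wq, Wk, Wv in heads]           # M arrays of shape (n, d_v)
    concat = np.concatenate(per_head, axis=-1)     # (n, M * d_v)
    return concat @ Wo.T                           # back to (n, d_h)

# Usage: M = 8 heads with d_k = d_v = d_h / M = 2 for d_h = 16.
M, d_head = 8, d_h // 8
heads = [(rng.normal(size=(d_head, d_h)),
          rng.normal(size=(d_head, d_h)),
          rng.normal(size=(d_head, d_h))) for _ in range(M)]
Wo = rng.normal(size=(d_h, M * d_head))
h_multi = multi_head_attention(h, neighbors, heads, Wo)   # shape (5, 16)
```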