The attention mechanism assigns weights to words so that some contribute more than others, and self-attention captures the relationships between the words of a sequence.
In self-attention we weight the importance of each word within its context: the weights tell the model which words it should attend to.
Self-attention works with Query, Key and Value vectors computed for each word, and its output is a tensor of attention-weighted values.
In attention, the query is a vector representation of the words, and its dot product with the transpose of the key gives a similarity matrix. Putting the pieces together, scaled dot-product attention is

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the attention dimension.
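To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The shapes, the toy sequence length of 3 and the attention dimension of 4 are assumptions chosen just for illustration, not fixed by the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices, one row per word (assumed layout).
    d_k = Q.shape[-1]
    # Similarity matrix: every query scored against every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of the value vectors.
    return weights @ V, weights

# Toy example: 3 "words" with an attention dimension of 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `weights` is the attention distribution of one word over all the words in the sequence.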
Remember that the dot product of two vectors gives us an idea of their similarity, so the numerator QK^T gives a matrix that relates every word with every other word. We apply the softmax (a sort of multidimensional sigmoid that normalizes the values so they sum to 1) to turn the similarities into weights, and the denominator sqrt(d_k) keeps the dot products from growing with the dimension, which would saturate the softmax and cause vanishing gradients.
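To see the effect of that denominator, here is a small NumPy sketch comparing the softmax of raw dot products with the softmax of scaled ones. The dimension of 512 and the random vectors are arbitrary choices for the demonstration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(1)
d_k = 512                       # an assumed attention dimension
q = rng.normal(size=d_k)        # one query vector
K = rng.normal(size=(5, d_k))   # five key vectors

raw_scores = K @ q                        # magnitudes grow roughly with sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)

print(softmax(raw_scores))     # nearly one-hot: the softmax is saturated
print(softmax(scaled_scores))  # smoother distribution with usable gradients
```

Without the scaling, the softmax output collapses onto a single word and its gradients become tiny; dividing by sqrt(d_k) keeps the scores in a range where the softmax still spreads attention and trains well.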