Analyzing Self-Attention 

Garrett Mulcahy, University of Washington
PDL C-401

Introduced in 2017, the transformer is a deep learning architecture that has taken the machine learning world by storm, with applications in domains as diverse as natural language processing, computer vision, and antibody design. An essential component of the transformer architecture is self-attention, the mechanism by which the model learns how the various elements of an input sequence relate to one another. Treating the elements of an input sequence as an interacting particle system, self-attention defines a system of ODEs governing the evolution of these particles in time. In this expository talk, we will survey some exciting results about the behavior of this system from the recent paper “The Emergence of Clusters in Self-Attention Dynamics” by Geshkovski et al. (2023). We will also take this opportunity to introduce some essential notions from optimal transport theory, such as the Wasserstein distance and the connection between the continuity equation and curves in the space of probability measures.
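
For orientation, here is a schematic version of self-attention dynamics of the kind studied in the paper, written for a single attention head with fixed query, key, and value matrices $Q$, $K$, $V$; the notation is illustrative and the paper's precise normalization may differ. Tokens $x_1(t), \dots, x_n(t) \in \mathbb{R}^d$ evolve as an interacting particle system in which each token is pulled toward a softmax-weighted average of the value-transformed others:
\[
  \dot{x}_i(t) \;=\; \sum_{j=1}^{n} P_{ij}(t)\, V x_j(t),
  \qquad
  P_{ij}(t) \;=\; \frac{\exp\!\big(\langle Q x_i(t),\, K x_j(t)\rangle\big)}
                       {\sum_{k=1}^{n} \exp\!\big(\langle Q x_i(t),\, K x_k(t)\rangle\big)}.
\]
The optimal transport notions mentioned above can likewise be previewed in standard form: the 2-Wasserstein distance between probability measures $\mu, \nu$ on $\mathbb{R}^d$ and the continuity equation satisfied by a curve of measures $(\mu_t)$ driven by a velocity field $v_t$,
\[
  W_2(\mu,\nu) \;=\; \Big( \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x-y\|^2 \, d\pi(x,y) \Big)^{1/2},
  \qquad
  \partial_t \mu_t + \nabla \cdot (v_t\, \mu_t) = 0,
\]
where $\Pi(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$.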