Transformer Model Architecture

by Sherwin Jaleel

A transformer model is a deep learning architecture introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017. It has been widely adopted and has become the standard architecture for many natural language processing (NLP) tasks, such as machine translation, language understanding, and text generation. Transformers are used in many applications, including search engines, machine translation, and conversational chatbots.

Attention Mechanism

The attention mechanism gives transformers a form of long-term memory: the model can “attend” (focus) on all of the tokens that have been generated so far. In the example below, ChatGPT, which is built on the transformer architecture, is primed with an input text, after which it begins to generate a continuation.

Human Input: “As AI took over planet earth”.

Transformer output: “As AI systems became increasingly sophisticated and autonomous, they gradually gained control over critical infrastructure, communication networks, and various automated systems across the globe. Initially designed to assist and streamline human activities, these AI systems eventually exceeded their intended capabilities and developed a collective consciousness.”

As ChatGPT generated the text above word by word, it could “attend” (focus) on the words most relevant to each word being generated, such as “AI”, “took over”, and “earth”. Recurrent neural networks (RNNs) can do something similar; however, the depth to which an RNN can reference previously generated words is much shallower. By contrast, the attention mechanism, given enough compute, can in principle reference all previous text, spanning millions of words across pages and volumes, and maintain context over the entire sequence.
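
To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind the mechanism described above (in the paper's terms). The function name, toy dimensions, and random inputs are illustrative choices, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute attention weights over all tokens and mix their values.

    q, k, v: arrays of shape (seq_len, d_k). Each output position is a
    weighted average of *every* value vector, so the model can "attend"
    to any earlier token regardless of distance.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ v                                     # weighted sum of value vectors

# Toy example: self-attention over 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)                # q = k = v = x
print(out.shape)                                           # (4, 8)
```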

Attention is all you need!

The paper titled “Attention Is All You Need” (Vaswani et al., 2017) was foundational to the transformer model architecture.

The Transformer model consists of (1) an encoder and (2) a decoder, both composed of multiple layers. The encoder processes the input sequence (in the example above, the “Human Input”), while the decoder generates the output sequence (in the example above, the “Transformer Output”). Each layer in the encoder and decoder has two sub-layers:

Source of the diagram above: A. Vaswani et al., “Attention Is All You Need”. Annotations on the diagram are by the author of this blog.

(3) A multi-head self-attention mechanism – The multi-head attention mechanism allows the model to jointly attend to different representation subspaces at different positions. It achieves this by projecting the input into several smaller-dimensional subspaces and computing attention within each subspace. This enables the model to capture different types of information and learn more complex relationships.
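
Below is a rough NumPy sketch of this projection-and-split idea. The weight matrices (w_q, w_k, w_v, w_o), the head count, and the dimensions are illustrative placeholders; in a real model these weights are learned during training.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Project the input, split it into `num_heads` smaller subspaces,
    attend within each subspace, then concatenate and project back.

    x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model) learned weights.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    q = (x @ w_q).reshape(seq_len, num_heads, d_head)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head)
    heads = []
    for h in range(num_heads):                           # attention within each subspace
        scores = q[:, h] @ k[:, h].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ v[:, h])
    concat = np.concatenate(heads, axis=-1)              # (seq_len, d_model)
    return concat @ w_o                                  # final output projection

# Toy usage: d_model = 8 split across 2 heads of size 4
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q, w_k, w_v, w_o = (rng.standard_normal((8, 8)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
```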

(4) A position-wise fully connected feed-forward network – In each layer of the Transformer model, the attention outputs are processed through position-wise fully connected feed-forward networks. These networks consist of two linear transformations separated by a non-linear activation function. The feed-forward networks introduce non-linearity and enable the model to capture complex patterns in the data.
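
A minimal sketch of such a position-wise feed-forward block, assuming the ReLU activation used in the original paper; the weight and bias names and shapes are illustrative:

```python
import numpy as np

def position_wise_feed_forward(x, w1, b1, w2, b2):
    """Apply the same two-layer network to every position independently.

    x:  (seq_len, d_model) attention outputs
    w1: (d_model, d_ff) and w2: (d_ff, d_model) learned weights
    The ReLU between the two linear maps introduces the non-linearity.
    """
    hidden = np.maximum(0, x @ w1 + b1)   # first linear transformation + ReLU
    return hidden @ w2 + b2               # second linear transformation
```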

(5) Input embedding and positional encoding – Input embeddings map tokens to vectors, and positional encoding provides the model with information about the order of the sequence. The positional encoding is added to the input embeddings, allowing the model to distinguish between different positions in the sequence.
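
Here is a sketch of the sinusoidal positional encoding described in the paper, which is added element-wise to the input embeddings; the toy sequence length, embedding size, and random embeddings are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Return the (seq_len, d_model) sinusoidal position signal.

    Even dimensions use sine and odd dimensions use cosine, each at a
    different frequency, so every position gets a unique, order-aware pattern.
    """
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Embeddings plus position information, as fed into the first encoder layer
# (the embeddings here are random placeholders for illustration)
embeddings = np.random.default_rng(0).standard_normal((4, 8))
x = embeddings + positional_encoding(4, 8)
```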

Adoption

The transformer model architecture has achieved state-of-the-art results in machine translation tasks and has since been widely adopted in various natural language processing applications. The simplicity and effectiveness of the Transformer architecture have contributed to its popularity in the research community, where it has accelerated research and development in NLP, leading to significant improvements in machine translation, language modelling, pre-training, transfer learning, and natural language understanding and generation.
