Mokabyte

Since 1996: architectures, methodologies, software development


In this issue, 2025

LLMs models: how do they work?

Part 4: Transformer Architecture. Requirements

Luca Vetti Tagliati

Luca is a seasoned professional who has held high-profile roles in London, New York, and Switzerland for tier-one banks including UBS, Credit Suisse, Goldman Sachs, Lehman Brothers, Deutsche Bank, and HSBC.

He is fully qualified to work as an Enterprise Architect, CTO, and Technical Program Manager, managing projects from a multi-role perspective. His deep understanding of technology, combined with targeted management strategies, has allowed Luca to deliver complex programmes on aggressive timelines.

His thinking blends vision and pragmatism. His academic background, which includes a PhD, is solidly complemented by extensive hands-on experience: he completed all of his university studies while working full time in industry.

In recent years, Luca has rediscovered his passion for artificial intelligence, thanks to his position as Global Head of AI for Data Analytics and Innovations.

He is also the author of numerous publications, including:

  • UML e l’ingegneria del software: dalla teoria alla pratica (2004)
  • Java Best Practices (2008)
  • Verso Java SE 8: Note per lo sviluppatore in Java 7 (2013)

Luca has also published numerous articles on AI and other topics, available in IT magazines and on the web.

 


Intro

This article represents another step in our journey to explore and highlight the inner workings and functionalities of Large Language Models (LLMs). In our previous paper, we discussed how language models represent words as n-dimensional vectors, known as embeddings. These embeddings capture semantic relationships between words, enabling computers to process and analyse language in numerical form. The mathematical function that transforms text into these numeric vectors allows machine learning models to better understand and “reason” about human language.

In this paper, we focus on the Transformer architecture (TA), designed at Google and introduced in the influential paper “Attention Is All You Need” (refer to [Vaswani et al., 2023]), first published in 2017. The Transformer is a neural network architecture that relies on attention mechanisms to identify relationships between different parts of an input sequence. Originally designed for sequence-to-sequence (Seq2Seq) tasks such as machine translation, the Transformer has since proven effective across a wide range of natural language processing (NLP) tasks. By replacing traditional recurrent layers (such as Long Short-Term Memory networks) with attention mechanisms, the Transformer achieves better parallelization and therefore delivers improved performance. The TA has revolutionised the design of large language models to the extent that virtually all modern LLMs now implement some variation of this architecture.

In this paper, we focus on the core requirements that led to the design of the architecture. In the next paper, we will present the TA, providing both a high-level overview and a detailed analysis.

 

Core requirements

Before delving into the details of the Transformer architecture (TA), let’s first understand the main requirements that underpin its design.

1. Context-Aware Input Processing

Requirement: The model must be capable of focusing on different parts of the input sequence when processing (input) tokens, moving beyond a strictly sequential processing approach. This capability is essential for understanding relationships between words, even when they are far apart in the sequence.

Rationale: This feature allows the model to effectively capture long-range dependencies and relationships within the data, which is critical for tasks such as translation, summarization, and question answering. It helps the model understand context and relationships between words, even if they are separated by other words in the sequence. Traditional models often struggle with maintaining accuracy when processing long or complex texts.

Solution: The Self-Attention Mechanism is a core innovation of the TA, directly addressing this requirement by enabling the model to dynamically weight the importance of different parts of the input sequence.

For example, considering the phrase:

“The dog sat on the mat because it was comfortable”

The model needs to understand the relationships between words, even if they are distantly placed in the sentence. In particular, the model can learn that “dog” and “mat” are related, despite being separated by other words.

2. Parallel Processing of the Input Sequence

Requirement: The model must be able to analyse input sequences from multiple perspectives simultaneously, capturing diverse relationships and patterns between tokens to enhance its understanding of complex data. This requirement builds upon the previous one: instead of calculating a single set of attention scores, the model must compute multiple sets (or “heads”) of scores in parallel. Each head is designed to focus on different aspects of the input sequence.

Rationale: Implementing this requirement enables the model to focus on various parts of the input sequence simultaneously, learning different aspects of the relationships between tokens.

For example:

  • One head might focus on short-range dependencies, such as the relationships between adjacent words.
  • Another head might focus on long-range dependencies, such as connections between words that are far apart in the sequence.
  • Other heads might capture different linguistic features, such as syntax, semantics, or positional relationships.

For instance, consider the above-mentioned sentence:

“The dog sat on the mat because it was comfortable”.

  • Head 1 could focus on short-range dependencies, such as the relationship between “The” and “dog” or “sat” and “on.” This helps the model understand the basic structure of the sentence (e.g., subject-verb-object).
  • Head 2 could focus on long-range dependencies, such as the relationship between “it” and “mat,” helping the model understand that “it” refers to “the mat” and not “the dog.”
  • Head 3 could focus on causal relationships, such as the connection between “because” and “comfortable.”

By combining these different perspectives, multi-head attention allows the model to fully grasp the meaning of the sentence, including its structure, context, and relationships between words, even when they are distantly placed. This approach is significantly more effective than a single attention mechanism, which might overlook some of these nuances. This parallel processing enables the model to build a richer and more nuanced understanding of the input, which is especially critical for tasks such as translation, summarization, and question answering, where context and multiple layers of meaning are essential.

Solution: Multi-Head Attention is a key component of the TA that enables the model to process input sequences in parallel by applying multiple attention mechanisms (or “heads”) simultaneously.
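As a rough sketch, the idea can be illustrated in Python/NumPy. Note that this simplified version merely slices the embedding into per-head subspaces; a real Transformer uses learned projection matrices per head plus a final output projection, and all values here are random placeholders:

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, num_heads):
    """Split the model dimension into `num_heads` subspaces, attend in each,
    then concatenate the per-head outputs (learned projections omitted)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        sub = X[:, h * d_head:(h + 1) * d_head]  # each head sees a different slice
        heads.append(attention(sub, sub, sub))
    return np.concatenate(heads, axis=-1)        # back to shape (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(6, 8))  # 6 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2)
print(out.shape)  # (6, 8)
```

Each head computes its own attention pattern independently, which is what allows one head to specialise in short-range dependencies while another tracks long-range ones.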

3. Positional Encoding

Requirement: The model must be capable of encoding the order of tokens in a sequence. This is a logical consequence of the previous requirements, which state that the model processes input sequences in parallel (unlike RNNs, which process sequences sequentially).

Rationale: The model needs to understand the order of words, as this is essential for capturing the meaning of sentences and the relationships between tokens. Without this capability, the model would struggle to interpret the context and structure of the input.

Solution: Positional Encodings. Positional encodings are numerical vectors (not simple indexes) added to each token embedding to provide information about the position of each token in a sequence. Since transformers process input sequences in parallel, they lack an inherent sense of order. Positional encodings address this limitation by encoding positional information into the input. These encodings are typically generated using sinusoidal functions (sine and cosine). Sinusoidal functions help the model learn and represent relative positions (linear relationships) without adding significant complexity to the calculations. They are critical for enabling transformers to understand the order and relationships between tokens in a sequence.
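The sinusoidal scheme described above can be sketched in a few lines of Python/NumPy (the sequence length and model dimension below are arbitrary illustration values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    (d_model assumed even for simplicity)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
# The encodings are simply added element-wise to the token embeddings:
#   x = embeddings + pe
print(pe.shape)  # (10, 16)
```

Because each position maps to a unique pattern of frequencies, the model can infer both absolute and relative token positions from these vectors.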

4. Ability to Capture Complex Patterns

Requirement: The model must be able to capture complex patterns in the data. While the attention mechanism processes the input and identifies relationships between tokens, additional transformations of the data are needed to capture more intricate patterns and features.

Rationale: To enhance the model’s ability to learn and represent complex relationships in the data, it is essential to further transform the input representations produced by the attention mechanism into richer and more abstract features.

Solution: Feed-Forward Layers. These are fully connected neural network layers applied independently to each position in the sequence. They serve as position-wise transformation layers that enhance the features of each token independently. The layers consist of two linear transformations with a non-linear activation function (typically ReLU) in between. Feed-forward layers transform the input representations from the attention mechanism into more abstract and expressive features. Since they operate on each token position independently, they are computationally efficient and can be executed in parallel. These layers are critical for capturing complex patterns and relationships in the data, working in tandem with the self-attention mechanism to enhance the model’s overall performance.

 

Self-Attention Mechanism

Before proceeding further, let us understand in detail how the self-attention mechanism works. While a high-level understanding of self-attention is often sufficient, this section provides a more detailed explanation to offer deeper insight.

Embeddings/Token Representation

Each word in the sentence is mapped to a vector: its embedding.

Calculation of Q, K, and V

For each token embedding, the model computes three vectors:

  • Query (Q): the information this position wants to retrieve.
  • Key (K): what this position offers to others.
  • Value (V): the information to be aggregated from this position.

Attention Scores

The attention score for a token is computed by comparing its Query vector with the Key vectors of all other tokens in the sequence. This score determines how much attention each token should pay to the others.

Weighted Values

The attention scores, normalized via softmax, are then used to weight the Value vectors, producing a context-aware representation of each token.
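Putting the four steps together, a minimal self-attention sketch in Python/NumPy might look as follows. The 1/√d_k scaling is the stabilisation used in the original paper; the inputs are random placeholders, and for brevity the embeddings are used directly as Q, K, and V (a real model would first multiply by learned weight matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 3: Query-Key similarity scores
    weights = softmax(scores)           # step 4a: each row sums to 1
    return weights @ V, weights         # step 4b: weighted sum of Values

# Toy example: 3 tokens, embedding dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = self_attention(X, X, X)
print(out.shape)          # (3, 4): one context-aware vector per token
print(w.sum(axis=-1))     # each token's attention weights sum to 1
```

Row i of `w` tells us how much token i attends to every token in the sequence, and row i of `out` is the corresponding context-aware representation.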

 

Analogy: A football (soccer) team

To better understand self-attention and the roles of Q, K, and V, imagine a football team in which each player represents a word in a sentence.

1. Query (Q): What am I trying to do?

The Query is like a player’s goal or intention on the field. For example:

  • The striker’s Query is: “Who can help me score a goal?”
  • The midfielder’s Query is: “Who is in the best position to receive a pass?”
  • The defender’s Query is: “Who is the opponent I need to block?”

Each player (token) has their own Query, which determines what they are looking for in the game.

2. Key (K): What can I offer to the team?

The Key represents each player’s skill set: what that player is good at and what they contribute to the game. For example:

  • The striker’s Key might be: “I’m good at scoring goals”.
  • The midfielder’s Key might be: “I’m very good at passing and creating opportunities”.
  • The defender’s Key might be: “I’m good at blocking and intercepting”.

Each player (token) has their own Key, which tells others what they can contribute to the team.

3. Value (V): What information do I provide?

The Value corresponds to the actual action or contribution the player makes during the game. It is the real output of their role. For example:

  • The striker’s Value might be: “I’m in a good scoring position: pass me the ball!”
  • The midfielder’s Value might be: “I see space on the wing: let me pass the ball out wide.”
  • The defender’s Value might be: “I’m marking this opponent: don’t worry about this player.”

Each player (word) has their own Value, which is the actual information they provide to the team.

How It All Comes Together

Now, let’s say the striker (representing a word in the sentence) is trying to decide who to pass the ball to. The self-attention mechanism works as follows:

  1. The striker (Query) looks at the Keys of all the other players to figure out who is in the best position to help them score a goal. For example:
    • The striker compares their Query with the Key of the midfielder and sees that the midfielder is in a good position to pass the ball forward.
    • The striker also compares their Query with the Key of the defender but realizes the defender is not as relevant for scoring a goal.
  2. The striker assigns attention scores to each player based on how well their Key matches the striker’s Query. For example:
    • Midfielder gets a high score because their Key (good at passing) matches the striker’s Query (looking for someone to pass to).
    • Defender gets a low score because their Key (blocking opponents) doesn’t match the striker’s Query.
  3. The striker uses the Values of the other players to decide what to do. For example:
    • The midfielder’s Value might say, “I’m in a good position to pass the ball to the winger.”
    • The winger’s Value might say, “I’m open and ready to receive the ball.”
    • The striker combines these Values (weighted by the attention scores) to decide the best course of action.

 

A numerical example

Let’s translate what we have just seen into a numerical example.

1. Input Representation: embeddings

The first step is to represent each word in the sentence as a vector. Let us assume the following (hypothetical) word embeddings for each word:

  • “The” → [1, 0, 0]
  • “dog” → [0, 1, 0]
  • “sat” → [0, 0, 1]
  • “on” → [1, 1, 0]
  • “the” → [0, 1, 1]
  • “mat” → [1, 0, 1]

These embeddings are learned during training and represent the semantic meaning of each word.

2. Calculate Query, Key, and Value Vectors

These vectors are calculated by multiplying the word embeddings by learned weight matrices.

For simplicity, let us assume the following (hypothetical) vectors for the word “dog”:

  • Query (Q): [0.2, 0.8, 0.1]
  • Key (K): [0.5, 0.3, 0.7]
  • Value (V): [0.6, 0.9, 0.4]

Each word in the sentence will have its own Q, K, and V vectors.
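The multiplication by learned weight matrices can be sketched as follows (the weight matrices below are random placeholders standing in for learned parameters):

```python
import numpy as np

# Hypothetical embedding for "dog" (from the list above)
x_dog = np.array([0.0, 1.0, 0.0])

# Learned weight matrices; random placeholders here, learned during training
rng = np.random.default_rng(3)
W_Q = rng.normal(size=(3, 3))
W_K = rng.normal(size=(3, 3))
W_V = rng.normal(size=(3, 3))

q = x_dog @ W_Q   # Query vector for "dog"
k = x_dog @ W_K   # Key vector for "dog"
v = x_dog @ W_V   # Value vector for "dog"

# Because this toy embedding is one-hot on index 1, q is simply row 1 of W_Q
print(q.shape)  # (3,)
```

The same three multiplications are performed for every token, so in practice Q, K, and V are computed for the whole sequence in one batched matrix product each.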

3. Computing Attention Scores

To determine how much the word “dog” should attend to each of the other words, we compute a dot product (presented in the previous paper) between:

  • Q(dog) · K(other word)

For example:

  • Score(dog → The) = Q(dog) · K(The)
  • Score(dog → dog) = Q(dog) · K(dog)
  • Score(dog → sat) = Q(dog) · K(sat)
  • Score(dog → on) = Q(dog) · K(on)
  • Score(dog → the) = Q(dog) · K(the)
  • Score(dog → mat) = Q(dog) · K(mat)

The higher the score, the more relevant that word is to “dog.”

The dot product gives a scalar value for each pair of words, which represents how much “dog” should focus on that word.
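With the hypothetical vectors above, the score of “dog” attending to itself works out as:

```python
import numpy as np

q_dog = np.array([0.2, 0.8, 0.1])   # Query for "dog" (hypothetical, from above)
k_dog = np.array([0.5, 0.3, 0.7])   # Key for "dog"

# Dot product: 0.2*0.5 + 0.8*0.3 + 0.1*0.7 = 0.10 + 0.24 + 0.07
score = float(q_dog @ k_dog)
print(round(score, 2))  # 0.41
```

The same dot product is computed between Q(dog) and the Key of every other token, yielding one raw score per pair.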

4. Softmax Normalization

The raw attention scores are passed through a softmax function to convert them into probabilities (weights) that sum to 1. In other words, the softmax takes as input the scaled dot product of the Query (Q) and Key (K) matrices. These weights represent how much attention “dog” should pay to each word in the sentence. The softmax produces a matrix of attention weights, where each row is a probability distribution over the tokens in the sequence, indicating how much attention each token should pay to the others. These attention weights are then used to compute a weighted sum of the Value (V) vectors, which forms the final output of the self-attention mechanism.

Example Softmax output (hypothetical):

  • Weight(“dog”, “The”)   =   0.1
  • Weight(“dog”, “dog”)   =   0.5
  • Weight(“dog”, “sat”)    =   0.2
  • Weight(“dog”, “on”)     =   0.1
  • Weight(“dog”, “the”)    =   0.05
  • Weight(“dog”, “mat”)   =   0.05

This means “dog” focuses mostly on itself (0.5) and somewhat on “sat” (0.2), while the remaining words receive less attention.

5. Weighted Sum of Value Vectors

The model then uses these weights to compute a weighted sum of the Value (V) vectors for all words. This gives the final representation of “dog” in the context of the entire sentence.

For example:

  • Final representation of “dog” = (0.1 × V(The)) + (0.5 × V(dog)) + (0.2 × V(sat)) + (0.1 × V(on)) + (0.05 × V(the)) + (0.05 × V(mat))

This new vector for “dog” now contains information about how it relates to the other words in the sentence.
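Using the softmax weights above and hypothetical Value vectors for the six tokens (only V(dog) was given earlier; the other five are assumed for illustration), the weighted sum can be computed as:

```python
import numpy as np

# Softmax weights for "dog" from the example above
weights = np.array([0.1, 0.5, 0.2, 0.1, 0.05, 0.05])

# Value vectors for the six tokens; only V(dog) comes from the example
V = np.array([
    [0.1, 0.2, 0.3],   # "The"  (assumed)
    [0.6, 0.9, 0.4],   # "dog"  (from the example)
    [0.5, 0.1, 0.8],   # "sat"  (assumed)
    [0.3, 0.3, 0.2],   # "on"   (assumed)
    [0.2, 0.4, 0.1],   # "the"  (assumed)
    [0.7, 0.5, 0.6],   # "mat"  (assumed)
])

# Weighted sum of Value vectors = context-aware representation of "dog"
dog_context = weights @ V
print(dog_context)  # [0.485 0.565 0.445]
```

Note how the result is dominated by V(dog) and V(sat), mirroring the attention weights 0.5 and 0.2.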

 

6. Iterate for All Words

The same process is repeated for every word in the sentence. For example:

  • When processing “sat,” the model will calculate how much attention “sat” should pay to “The,” “dog,” “on,” “the,” and “mat.”
  • This ensures that each word’s final representation is context-aware and captures its relationships with other words in the sentence.

In the sentence “The dog sat on the mat”, the self-attention mechanism helps the model understand relationships such as:

  • “dog” is the subject performing the action “sat”
  • “mat” is the target/location of the action.
  • The two occurrences of “the” refer to different nouns and thus serve different contextual roles.

By using self-attention, the model can selectively focus on the most relevant parts of the sentence, enabling it to capture meaning and relationships more effectively.

 

Conclusion

The Transformer Architecture (TA) is a neural network designed specifically for NLP (natural language processing) tasks. Unlike older models such as RNNs (recurrent neural networks) and LSTMs (long short-term memory networks), the Transformer processes sequential data, such as text, in parallel rather than word by word, making it significantly more efficient.

The key services that a TA must implement include:

  • Capturing relationships between tokens: Achieved through the Self-Attention Mechanism, which allows the model to understand dependencies between words.
  • Focusing on multiple aspects of the input: Enabled by Multi-Head Attention, which processes different relationships in parallel.
  • Providing information about token order: Handled by Positional Encoding, which encodes the sequence order of tokens.
  • Learning complex patterns: Accomplished through Feedforward Layers, which transform token representations.
  • Stabilizing and accelerating training: Managed by Layer Normalization, which ensures consistent input distributions for each layer.
  • Retaining information and improving gradient flow: Supported by Residual Connections, which help preserve information from earlier layers.
  • Facilitating sequence-to-sequence tasks: Enabled by the Encoder-Decoder Structure, which is essential for tasks like machine translation.
  • Efficiently processing large datasets: Achieved through Scalability and Parallelization, which allow the model to process tokens simultaneously.
  • Converting text into numerical representations: Handled by Tokenization and Embedding, which transform text into a format the model can process.
  • Generating predictions or outputs: Managed by the Output Layer, which produces the final results.

In summary, the TA has revolutionised NLP by providing a robust, scalable, and efficient framework for understanding and generating human language. Its ability to process large datasets, capture complex relationships, and adapt to diverse tasks makes it the backbone of modern large language models.

 

 

References

[Vaswani et al., 2023] Ashish Vaswani – Noam Shazeer – Niki Parmar – Jakob Uszkoreit – Llion Jones – Aidan N. Gomez – Łukasz Kaiser – Illia Polosukhin, Attention Is All You Need. 2 August 2023.
https://arxiv.org/pdf/1706.03762.pdf

 

[Devlin, J., et al., 2018] Jacob Devlin – Ming-Wei Chang – Kenton Lee – Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
https://arxiv.org/abs/1810.04805

 

[Google Gemini Team, 2023] Google Gemini Team, Gemini: A Family of Highly Capable Multimodal Models.
https://arxiv.org/abs/2312.11805

 

[Radford, A. et al., 2018] Alec Radford – Jeffrey Wu – Rewon Child – David Luan – Dario Amodei – Ilya Sutskever, Language Models are Unsupervised Multitask Learners.
https://t.ly/4RI02

 

[Touvron, H., et al., 2023] Hugo Touvron – Thibaut Lavril – Gautier Izacard – Xavier Martinet – Marie-Anne Lachaux – Timothée Lacroix – Baptiste Rozière – Naman Goyal, – Eric Hambro – Faisal Azhar – Aurelien Rodriguez – Armand Joulin – Edouard Grave – Guillaume Lample, LLaMA: Open and Efficient Foundation Language Models.
https://arxiv.org/abs/2302.13971

 

[Belcic, Stryker, 2024] Ivan Belcic – Cole Stryker, What is Claude AI? IBM, September 2024.
https://t.ly/K5LV1

 

 
