Introduction
This paper represents the next step in a series dedicated to AI models, with a particular focus on Large Language Models (LLMs). Building on the foundation established in the first two articles, which examined the internal structure of LLMs, this article turns to another key topic: embeddings. Specifically, we explore how models transform and represent words, as well as the mathematical approach to estimating similarities using cosine similarity.
An embedding can be thought of as a compact “meaning fingerprint” for a text or an entity. Embeddings play a crucial role in LLMs, as they enable the model to compare meanings, identify related text, search documents, cluster topics, and perform other tasks that would likely be impossible using a straightforward textual representation alone.
In the next paper, the focus will shift to the overall functioning of LLMs, where all the various components will be integrated.
Embeddings, aka Vectors
LLMs need to store information about both real-world entities and abstract concepts. To achieve this, they manage information as arrays of numbers known as embeddings or vectors (essentially two terms for the same concept: a structured list of numbers).
These vectors are commonly used to represent data points in a multi-dimensional space. Each element of a vector corresponds to a specific dimension, and the combination of these elements defines the vector’s position in that space. Vectors are crucial for measuring similarities and transforming data.
While visualizing even a three-dimensional space can be challenging for humans, software does not face this limitation. For example, GPT-3 handles 12,288-dimensional word vectors (see [Timothy and Trott, 2023b]), while GPT-4 is believed to handle roughly 16,000-dimensional word vectors. However, the latter figure should be considered indicative, as it has been mentioned by several sources and commercial articles but has not been officially confirmed.
Through this numeric representation, LLMs can compare meanings, find related text, search documents, cluster topics, and more – tasks that would be very complex or even impossible with textual representation alone.
As we discussed in the first paper, a neural network is trained so that text with similar meanings produces similar vectors. This training uses large text corpora and objectives that pull related items closer together in vector space.
We can think of an embedding as a compact “meaning fingerprint” for a piece of text or an entity. Just as coordinates locate a place on a map, an embedding locates meaning in a semantic space. As Tomas Mikolov et al. put it (see [Tomas Mikolov et al., 2013]), “An embedding is a learned representation for text where words that have the same meaning have a similar representation.”
An example, based on cars
Let us consider a simple example based on cars. The model could create a 4-dimensional embedding with the following attributes:
- on-road capability
- off-road capability
- passenger capacity
- water fording capability
Now, let us consider the following vehicles:
- a family car
- an SUV
- a Formula 1 car
- a rally car
- a Jeep
Although these categories are still quite generic (for instance, the SUV category includes many models with varying characteristics), this simplification suffices for the purpose of the example.
Now, we could assign the values summarised in Table 1.

Even a superficial look at these evaluations already allows some initial conclusions. For example, the SUV and the Jeep are very similar, while the Formula 1 car and the Jeep are quite dissimilar. However, models need a scientific, mathematical approach to compare elements and draw conclusions, and this requires the extra steps described in the following paragraphs.
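To make this concrete, the comparison can be sketched in a few lines of Python. The vectors below are hypothetical placeholders standing in for the values of Table 1 (they are not the actual figures from the table), and the cosine similarity function used here anticipates the measure discussed later in this article.

import numpy as np

# Illustrative 4-dimensional vectors: (on-road, off-road, passengers, water fording).
# The values are hypothetical and only mirror the spirit of Table 1.
vehicles = {
    "family car": np.array([7, 2, 5, 1]),
    "SUV":        np.array([6, 6, 5, 3]),
    "Formula 1":  np.array([10, 0, 1, 0]),
    "rally car":  np.array([5, 8, 2, 1]),
    "Jeep":       np.array([5, 9, 4, 6]),
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vehicles["SUV"], vehicles["Jeep"]))        # high: similar vehicles
print(cosine_similarity(vehicles["Formula 1"], vehicles["Jeep"]))  # lower: dissimilar vehicles

With values in this spirit, the SUV/Jeep pair scores much higher than the Formula 1/Jeep pair, matching the intuition above.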
An example based on flying “things”
Following the example set by Guillaume Desagulier (see [Desagulier, 2018]), let us assume there is a function that, given a word, can return its representation based on the following three factors:
- wings
- engine
- sky
Consider the following words:
- bee
- eagle
- goose
- helicopter
- drone
- rocket
- jet
In LLM terminology, these words are referred to as “tokens”.
Now, imagine submitting these tokens to a function that returns their representation in terms of the three contexts: wings, engine, and sky. This process allows us to create the following table (see Table 2).

Each word occupies a specific position in the vector space. According to this hypothetical model, the coordinates for “bee” might be (3, 0, 2), while “helicopter” could be represented by (0, 2, 4). This allows us to visualize these words in a three-dimensional space (see Figure 1). Although we have chosen three contexts for this example, in real-world scenarios, as discussed earlier, the embedding vectors in an LLM easily span several thousand dimensions.

Same example, with a different representation
Furthermore, it is possible to explicitly depict the seven vectors as arrows originating from the point where all three axes intersect, extending to their respective endpoints as defined by their coordinates (refer to Figure 2).

Once the vectors are established, we can calculate the distance or similarity between them.
Cosine Similarity
One common method for this is the cosine similarity function. It yields a value within the range from -1 to 1. A value close to 0 indicates that the two vectors are orthogonal (perpendicular) to each other, suggesting little to no similarity. A value close to 1 indicates that the angle between the vectors is small, signifying a higher degree of similarity, with 1 representing maximum similarity. For example, the cosine similarity between the vectors representing “helicopter” and “rocket” is 0.80, suggesting a relatively high similarity. Finally, a value of -1 indicates that the vectors are diametrically opposite, pointing in opposite directions. In our example, the similarity values all fall within the sub-range from 0 to 1, because the items share at least one common feature: they can all fly.
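As a minimal illustration, here is how the computation could look in Python, using the two coordinate vectors quoted above (bee and helicopter); the remaining tokens would be handled in exactly the same way. The resulting value is only indicative, since the full table of coordinates is not reproduced here.

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

bee = np.array([3, 0, 2])         # wings, engine, sky
helicopter = np.array([0, 2, 4])

print(round(cosine_similarity(bee, helicopter), 2))  # ≈ 0.50: both fly, but differ on wings/engine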
By applying the cosine similarity function to every pair of words, it is possible to generate the following matrix (see Table 3):

The example described so far is simple, but it should provide readers with a basic understanding of the type of internal representations that the model could use.
Third example, based on cities
Numerous papers (see [Timothy and Trott, 2023a]) and books provide examples based on cities.
For instance, we could consider the following cities along with their coordinates:

We can define the distance between cities as the straight-line (“as the crow flies”) distance. For example, the distance from Rome to Florence is 274.95 km, and from Rome to Venice is 529.63 km, and so on. Consequently, we can say that Naples is near Palermo, and Rome is close to Naples. Conversely, Venice is far from Palermo. If we record the distances from each city to all the others, we create a numerical representation of that city in a multi-dimensional space, similar to what is shown in Table 2.
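A short sketch, under the assumption of approximate latitude/longitude coordinates for each city, shows both ideas: the “as the crow flies” distance (here the great-circle, or haversine, distance) and the representation of each city as the vector of its distances to all the others. Because the coordinates are approximate, the resulting figures will not match the distances quoted above exactly.

import math

# Approximate coordinates (latitude, longitude) in degrees; chosen for illustration only.
cities = {
    "Rome":     (41.90, 12.50),
    "Florence": (43.77, 11.26),
    "Venice":   (45.44, 12.32),
    "Naples":   (40.85, 14.27),
    "Palermo":  (38.12, 13.36),
}

def crow_flies_km(a, b):
    # Great-circle (haversine) distance between two (lat, lon) points, in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

# Each city becomes a vector of its distances to every city (including itself, which is 0).
for name, coord in cities.items():
    print(name, [round(crow_flies_km(coord, other)) for other in cities.values()])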
Similarly, in language models, each word (or token) is represented by a vector in a high-dimensional space. Words with similar meanings are located near each other, just as nearby cities are close in geographic space. For example, words close to “Ferrari” might include “Car,” “F1,” “Red,” “Power,” “Bolide,” “Enzo,” and so on.
Mathematical aspects of the Cosine Similarity function
This section is optional and provides deeper insights for those interested in the technical details (see [Singhal 2001]).
Cosine similarity is a powerful tool for measuring how similar two “entities” (or, more precisely, their vector representations) are. It works by calculating the cosine of the angle between the two vectors:
- 1 means the vectors are identical in direction;
- 0 means they are orthogonal (unrelated);
- –1 means they are opposite.
The calculation itself is straightforward: it’s the dot product of the two vectors divided by the product of their magnitudes. This simple yet effective method is widely used in applications like search engines, recommendation systems, and natural language processing to identify relationships and patterns in data.
In formula form:
Cosine similarity = (A⋅B) / (∥A∥ × ∥B∥)
where:
- A⋅B is the dot product (sum of corresponding elements multiplied);
- ∥A∥ and ∥B∥ are the lengths (magnitudes) of the vectors.
Let us consider the following two very small vectors:
- A=[1,2]
- B=[2,3]
The dot product is:
A⋅B=(1×2)+(2×3)=2+6=8
and the magnitudes will be:
∥A∥ = √(1² + 2²) = √5 ≈ 2.24 and ∥B∥ = √(2² + 3²) = √13 ≈ 3.61
Therefore, the cosine similarity is:
Cosine similarity = 8 / (√5 × √13) ≈ 8 / 8.06 ≈ 0.99
So, the similarity is about 0.99, meaning the vectors, and therefore the entities they represent, are very similar.
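The same calculation can be verified with a couple of lines of Python (a minimal sketch using NumPy):

import numpy as np

A = np.array([1, 2])
B = np.array([2, 3])

dot = np.dot(A, B)                                          # (1 × 2) + (2 × 3) = 8
similarity = dot / (np.linalg.norm(A) * np.linalg.norm(B))  # 8 / (√5 × √13)
print(round(float(similarity), 2))                          # 0.99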
Embeddings in the LLM context
In the next paper, we will dive into the high-level mechanics of LLMs, providing a clear and structured view of how they operate. Think of it as assembling a finely tuned machine, where every gear clicks into place to power the system as a whole.
Before wrapping up this article, we will also set the stage for understanding embeddings in the right context via an example.
Let us imagine a user entering the following prompt:
“Summarize the main points of AI.”
Tokenizer
The first step in the process is tokenization – a critical component of how LLMs work. Tokenization breaks the text into smaller, digestible units called tokens, which the model can process. These tokens might represent words, sub-words, or even individual characters, depending on the tokenizer used.
For the prompt
“Summarize the main points of AI”
a tokenizer might break it down into word-level tokens like this:
[“Summarize”, “the”, “main”, “points”, “of”, “AI”]
If a sub-word tokenizer is applied (such as Byte Pair Encoding or WordPiece, which are common in models like GPT), it might look like this:
[“Summ”, “arize”, “the”, “main”, “points”, “of”, “AI”]
“Summarize” could be split into two sub-words so that the model can capture similarities with related forms such as “summary” or “summarizing”.
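For readers who want to experiment, the open-source tiktoken library (OpenAI’s BPE tokenizer) can be used to see an actual sub-word split; note that the exact tokens depend on the vocabulary chosen and will likely differ from the illustrative split shown above.

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of the BPE vocabularies used by GPT models

ids = enc.encode("Summarize the main points of AI.")
print(ids)                                   # the token IDs (a list of integers)
print([enc.decode([i]) for i in ids])        # the corresponding text pieces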
Embedding generation
Once tokenized, each token is converted into a numerical representation (an embedding). These embeddings are high-dimensional vectors that capture the semantic meaning of the tokens. This happens in the following steps.
Token ID: Each token is mapped to a unique ID from the model’s vocabulary. For example:
[“Summarize”, “the”, “main”, “points”, “of”, “AI”] → [1234, 56, 789, 1011, 23, 4567]
Please note that these numbers are just illustrative examples: the actual IDs depend on the model’s vocabulary.
Embedding Lookup: The token IDs are then used to retrieve their corresponding embeddings from a pre-trained embedding matrix. Each token ID maps to a vector of fixed size (e.g., 768 dimensions for smaller GPT-style models, or 12,288 for GPT-3, as noted earlier). For example:
1234 → [0.12, -0.34, 0.56, ..., 0.78]
56 → [0.01, 0.45, -0.67, ..., 0.89]
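The lookup itself is nothing more than indexing into a large matrix. The sketch below uses NumPy with a randomly initialised matrix standing in for a pre-trained one, and reuses the illustrative token IDs and a 768-dimension embedding size.

import numpy as np

vocab_size, dim = 50_000, 768                           # illustrative sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, dim))   # stand-in for a pre-trained matrix

token_ids = [1234, 56, 789, 1011, 23, 4567]             # the illustrative IDs from above
token_embeddings = embedding_matrix[token_ids]          # one row (vector) per token

print(token_embeddings.shape)                           # (6, 768): one vector per token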
Positional Encoding: To account for the order of tokens in the sequence, positional encodings are added to the embeddings. This ensures that the model understands the sequence structure (e.g., “AI of points main the Summarize” would mean something different).
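As an illustration of the idea, the sketch below implements the sinusoidal positional encoding from the original Transformer paper and adds it to the token embeddings; many GPT-style models actually learn their positional embeddings during training, but the purpose of injecting order information is the same.

import numpy as np

def sinusoidal_positional_encoding(seq_len, dim):
    # PE[pos, 2i] = sin(pos / 10000^(2i/dim)); PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    positions = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
    div = 10000 ** (np.arange(0, dim, 2) / dim)      # shape (dim/2,)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# Stand-in for the six token embeddings produced in the previous sketch.
token_embeddings = np.random.default_rng(0).normal(size=(6, 768))
model_input = token_embeddings + sinusoidal_positional_encoding(6, 768)  # same vectors, now order-aware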
Final Output
The result is a sequence of embeddings, one for each token, which is passed into the AI model for further processing. For example:
- [0.12, -0.34, 0.56, …, 0.78] → “Summarize”
- [0.01, 0.45, -0.67, …, 0.89] → “the”
- [0.23, -0.12, 0.34, …, 0.56] → “main”
- …
It should be noted that some classical text-processing pipelines discard articles and other stop words, as they contribute little to the semantic value of a sentence; modern LLM tokenizers, by contrast, generally keep every token.
These embeddings are then used by the model’s neural network layers to perform tasks like understanding, summarizing, or generating text.
Conclusion
Language models represent words as n-dimensional vectors called embeddings. These embeddings capture the semantic relationships between words, enabling computers to process and analyze language numerically. The mathematical function that converts text into these numeric vectors allows machine learning models to understand and reason about human language.
When a user provides a prompt to the model, the first step is tokenization. This process breaks the text into smaller, digestible units called tokens, which the model can process. Each token is assigned a unique ID from the model’s vocabulary and then converted into a numerical representation.
These token IDs are used to retrieve their corresponding embeddings from a pre-trained embedding matrix, where each token ID maps to a vector of fixed size (e.g., 12,288 dimensions for GPT-3, as noted earlier). To account for the order of tokens in the sequence, positional encodings are added to the embeddings, ensuring the model understands the sequence structure (e.g., “AI of points main the Summarize” would mean something entirely different). The result is a sequence of embeddings, one for each token, which is then passed into the AI model for further processing.
References
[Timothy and Trott, 2023a] Timothy B. Lee and Sean Trott, “Large language models, explained with a minimum of math and jargon,” Understanding AI, 27 July 2023.
https://www.understandingai.org/p/large-language-models-explained-with
[Timothy and Trott, 2023b] Timothy B. Lee and Sean Trott, “A jargon-free explanation of how AI large language models work,” Ars Technica, 31 July 2023.
https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/8/
[Tomas Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, last revised 7 September 2013.
https://arxiv.org/abs/1301.3781
[Desagulier, 2018] Guillaume Desagulier, “Word embeddings: the (very) basics,” Hypotheses, 25 April 2018.
https://corpling.hypotheses.org/495
[Singhal 2001] Amit Singhal, “Modern Information Retrieval: A Brief Overview,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4): 35–43, 2001.

