Large language models (LLMs) are advanced AI systems that act like supercharged autocomplete: they read vast volumes of text—books, articles, websites—and learn which words tend to follow one another. When you give an LLM a prompt, it draws on these learned patterns to predict and generate text that feels natural.
How LLMs Learn from Text
Reading and memorising patterns
The model “reads” billions of sentences. It doesn’t memorise them word for word, but it adjusts millions (or billions) of internal settings—called weights—so that, over time, it captures grammar, facts and writing style.
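To make “learning which words follow one another” concrete, here is a toy counting model in Python. It is only an illustration: a real LLM keeps no explicit counts and instead adjusts its weights by gradient descent, but the underlying statistical idea is the same.

from collections import Counter, defaultdict

# Toy "which word follows which" statistics from a tiny made-up corpus.
# Real LLMs capture this implicitly in their weights over billions of sentences.
corpus = "the weather today is sunny . the weather yesterday was rainy .".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

# Which words were seen after "weather", and how often?
print(follows["weather"].most_common())  # [('today', 1), ('yesterday', 1)]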
Splitting text into tokens
Before it can process your words, it chops them into tokens (whole words or word fragments). For example, “happiest” might become “happi” + “est.” Smaller chunks let the model build up unfamiliar words from known pieces.
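As a rough sketch of how a word can be built from known pieces, here is a greedy longest-match splitter over a tiny hand-made vocabulary. Real tokenizers (such as byte-pair encoding) learn their vocabularies from data; the pieces below are invented purely for illustration.

# Greedy longest-match subword splitting over a made-up vocabulary (illustration only).
VOCAB = {"happi", "est", "happy", "ness", "un", "cat", "s"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary piece that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown piece: fall back to a single character
            i += 1
    return tokens

print(tokenize("happiest"))  # ['happi', 'est']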
Turning tokens into numbers—embeddings
Each token is converted into a vector (a list of numbers), known as an embedding. Words with similar meanings end up with similar embeddings.
Example: “cat” → [0.7, 1.1], “dog” → [0.6, 0.9], “car” → [−0.2, 0.4].
Because “cat” and “dog” are both animals, their number lists lie close together in this abstract space.
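A quick way to see this “closeness” is cosine similarity, which measures the angle between two vectors. The sketch below reuses the toy two-dimensional numbers from the example above; real embeddings have hundreds or thousands of learned dimensions.

import numpy as np

# Toy 2-D embeddings copied from the example above (hand-picked, not learned).
emb = {
    "cat": np.array([0.7, 1.1]),
    "dog": np.array([0.6, 0.9]),
    "car": np.array([-0.2, 0.4]),
}

def cosine(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(emb["cat"], emb["dog"]), 3))  # ~1.0  (very similar)
print(round(cosine(emb["cat"], emb["car"]), 3))  # ~0.51 (much less similar)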
Attention: Focusing on What Matters
Rather than treating every earlier word as equally important, an LLM uses self-attention to decide which previous tokens matter most when predicting the next word (a small code sketch follows this list):
Query, Key, Value: For each token, the model computes three new vectors:
Query (Q): what it’s looking for,
Key (K): an identifier for that token,
Value (V): the token’s content to pass along.
Scoring relevance: To predict the word that follows “The weather today is”, the model takes the query for the latest token (“is”), compares it (via dot products) with the keys of all tokens in the context, and scales the result by the square root of the key dimension.
Softmax weighting: Those scores become attention weights via softmax (turning them into probabilities that add to 1), showing how much each previous token should influence the prediction.
Blending values: It multiplies each value vector by its weight and sums them, yielding a single context vector that emphasises the most relevant words.
Multiple heads: Several attention processes (heads) run in parallel, each learning a different kind of relationship: one might focus on the subject–verb link (“weather → is”), another on broader topical cues spread across the sentence.
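Putting the pieces above together, here is a minimal single-head, scaled dot-product attention in NumPy. It assumes the Q, K and V vectors have already been computed (in a real model they come from learned projection matrices), and it skips the causal mask and the extra heads described above.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (toy sketch: no masking, no learned projections)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how relevant each key is to each query
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # blend the value vectors into context vectors

# Four tokens ("The", "weather", "today", "is"), each with an 8-number Q, K and V vector.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one blended context vector per token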
From Context to Text: Generating Words
Limited memory: The model only “sees” a fixed number of recent tokens (its context window, e.g. 4,096 tokens); anything earlier falls out of view.
Scoring candidates: It uses the final context vector to score every token in its vocabulary (these raw scores are called logits); a softmax turns them into probabilities for the next token.
Choosing strategies (compared in the sketch after this list):
Greedy decoding: pick the single highest-probability word.
Beam search: track a few top sentence candidates in parallel.
Sampling with temperature: add randomness—higher temperature for creativity, lower for safe choices.
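The sketch below contrasts greedy decoding with temperature sampling over a made-up five-word vocabulary and hand-picked scores (logits); beam search is omitted for brevity. A real model would produce one score per token in a vocabulary of tens of thousands.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Hypothetical next-word scores (logits) after "The weather today is".
vocab  = ["sunny", "rainy", "cold", "banana", "is"]
logits = np.array([2.5, 1.8, 1.2, -3.0, -1.0])

# Greedy decoding: always pick the single highest-scoring word.
print("greedy:", vocab[int(np.argmax(logits))])   # sunny

# Sampling with temperature: divide the logits by T before the softmax.
rng = np.random.default_rng(0)
for T in (0.5, 1.5):                              # low T = safe picks, high T = more adventurous
    probs = softmax(logits / T)
    print(f"T={T}:", list(rng.choice(vocab, size=5, p=probs)))

At a low temperature the distribution concentrates on “sunny”; at a higher temperature the less likely words are sampled more often.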
Why It Matters
LLMs power chatbots, translate languages, draft emails or code, summarise documents and even help writers brainstorm. By modelling language statistically, they offer flexible tools across industries—from customer service to creative writing.