Training Data


  • an AI model is only as good as the data it was trained on
  • common sources of general training data
    • Common Crawl: sporadic crawls of web data done by a nonprofit organization https://commoncrawl.org/
      • Google has a clean subset of this data called the Colossal Clean Crawled Corpus (C4); see the loading sketch after this list
  • general-purpose foundation models will typically perform less well on domain-specific tasks (since those domains are underrepresented in the training data)
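
As a quick illustration of working with this kind of data, the sketch below streams a few C4 documents via the Hugging Face `datasets` library. The `allenai/c4` dataset name, the `"en"` config, and the `"text"` field are assumptions about the public Hub mirror, not something from these notes.

```python
# Minimal sketch: streaming a few documents from C4 (the cleaned
# Common Crawl subset) via the Hugging Face `datasets` library.
# Assumes the `allenai/c4` dataset with the "en" config is available
# on the Hub and that each record exposes a "text" field.
from datasets import load_dataset

# streaming=True avoids downloading the full multi-hundred-GB corpus
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:200])  # first 200 characters of each document
    if i >= 2:
        break
```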

Modeling


  • What should modelers consider?
    • Model architecture?
    • Number of parameters?

Model Architecture

  • The Transformer architecture (see glossary) is currently the dominant architecture for language-based foundation models
    • training is based on the attention mechanism
    • intended to solve some prevailing problems:
      1. the previous seq2seq architecture generated output based only on the final hidden state of the input (analogy: like answering questions about a book after reading only its summary)
      2. using an RNN encoder and decoder meant that input processing and output generation are done sequentially - slow for inputs with many tokens
  • Transformer Architecture Inference: leverages the parallel nature of execution
    1. Prefill step: model processes input tokens in parallel
      • this produces key and value vectors for all input tokens
    2. Decode step: the model generates one output token at a time
  • Attention Mechanism
    • use key, value, and query vectors
    • From ChatGPT: the key vectors are used for matching and weighting (determining “where to look”), while the value vectors provide the substantive information (“what to show”) during the attention computation
    • From ChatGPT: query vector is a high-dimensional representation derived from an input element (e.g., a token in a sentence). It represents the “question” that the model asks of other tokens in the sequence
    • the attention mechanism computes how much attention to give each input token by taking the dot product between the query vector and that token’s key vector (see the sketch after this list)
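
A minimal NumPy sketch of scaled dot-product attention with a key/value cache, showing the prefill step (keys and values computed for all input tokens in parallel) and a single decode step. The dimensions, random weights, and toy embeddings are illustrative assumptions, not taken from any real model.

```python
import numpy as np

d = 8                       # illustrative head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy projection matrices

def attention(q, K, V):
    # dot product between the query and every key decides "where to look"
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # the value vectors provide "what to show", mixed according to the weights
    return weights @ V

# --- Prefill: process all input tokens in parallel ---
x = rng.standard_normal((5, d))        # 5 input-token embeddings (toy data)
K_cache, V_cache = x @ Wk, x @ Wv      # key/value vectors for every input token
prefill_out = np.stack([
    attention(q, K_cache[:i + 1], V_cache[:i + 1])   # causal: attend to itself and earlier tokens
    for i, q in enumerate(x @ Wq)
])

# --- Decode: generate one output token at a time, reusing the cache ---
new_tok = rng.standard_normal(d)       # embedding of the latest generated token
K_cache = np.vstack([K_cache, new_tok @ Wk])
V_cache = np.vstack([V_cache, new_tok @ Wv])
decode_out = attention(new_tok @ Wq, K_cache, V_cache)

print(prefill_out.shape, decode_out.shape)   # (5, 8) (8,)
```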

Model Size

  • The number of parameters is usually appended to the model name: e.g. Llama-13B ~ 13 billion parameters
  • Generally more parameters means more capacity to learn
    • however, newer models often perform better even when they are smaller
    • a parameter is typically stored in 2 bytes (16 bits), e.g. fp16/bf16 (see the arithmetic sketch after this list)
  • a sparse model has a large proportion of zero-valued parameters
  • Training size
    • dataset sizes are measured by the number of training samples
    • Language Models: a training sample can be a sentence, a Wikipedia page, a chat conversation, or a book
    • currently LLMs are trained on datasets representing trillions of tokens
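
A back-of-the-envelope sketch of what the 2-bytes-per-parameter figure implies for the weight memory of a 13B-parameter model. The numbers are rough; activations, KV cache, and optimizer state are ignored.

```python
# Rough memory footprint of the model weights alone, assuming
# 16-bit (2-byte) parameters such as fp16 or bf16.
params = 13e9            # e.g. Llama-13B ~ 13 billion parameters
bytes_per_param = 2      # 16 bits
weight_bytes = params * bytes_per_param

print(f"{weight_bytes / 1e9:.0f} GB of weights")   # ~26 GB
```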

Post Training

  • Refine trained models: e.g. move from text completion to conversation
  • Post training - high level steps
    1. Supervised finetuning (SFT): finetune the model on high-quality instruction data to optimize it for conversation instead of completion (see the illustrative sample after this list)
    2. Preference finetuning: further finetune the model to output responses that align with human preferences, typically using reinforcement learning
  • Pretraining focuses on optimizing token-level quality; post-training focuses on the quality of the overall response
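
An illustrative sketch of what a single SFT instruction sample and its flattened training sequence might look like. The field names and prompt template here are assumptions chosen for the example, not a fixed standard.

```python
# Illustrative SFT training sample: one instruction/response pair.
# The field names ("instruction", "input", "response") are a common
# convention, not a fixed standard; real datasets vary.
sft_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "The transformer architecture replaced recurrent models ...",
    "response": "The paragraph explains how transformers replaced RNNs ...",
}

# During SFT the pair is flattened into a single token sequence and the
# model is trained with the usual next-token loss to produce the response.
prompt = (
    f"### Instruction:\n{sft_example['instruction']}\n\n"
    f"### Input:\n{sft_example['input']}\n\n"
    "### Response:\n"
)
target = sft_example["response"]
print(prompt + target)
```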

Inference


When a production LLM receives a string prompt, it follows these high-level steps to generate an inference:

  1. Tokenization: The raw text is processed by a tokenizer, which breaks the prompt into tokens (words, subwords, or characters) and converts them into numerical IDs.
  2. Embedding: These token IDs are mapped to continuous vector representations (embeddings) using a learned lookup table. These embeddings capture semantic and syntactic features of the tokens.
  3. Model Processing: The sequence of embeddings is fed into the model’s architecture (often a Transformer). Here, layers of self-attention and feed-forward networks integrate context from the entire sequence, using the model’s learned weights, biases, and parameters.
  4. Logit Generation: The model outputs a set of logits (raw scores) for the next token, representing the unnormalized likelihoods of each token in the vocabulary.
  5. Decoding: The logits are converted into probabilities (typically via a softmax function). Based on these probabilities and a chosen decoding strategy (like greedy decoding, beam search, or sampling), the model selects the next token (see the decoding sketch after this list).
  6. Iteration: The newly generated token is appended to the prompt, and steps 3–5 are repeated until a termination condition is met (e.g., an end-of-sequence token or reaching a maximum length).
  7. Detokenization: Finally, the generated tokens are converted back into human-readable text by reversing the tokenization process.
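
A small sketch of step 5 with toy logits, showing the softmax conversion and the difference between greedy decoding and temperature sampling. The logits and temperature value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.2, -1.0])   # toy scores over a 4-token vocabulary

def softmax(x):
    x = x - x.max()                        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

probs = softmax(logits)
greedy_token = int(np.argmax(probs))       # greedy decoding: always pick the top token

temperature = 0.7                          # <1 sharpens, >1 flattens the distribution
sampled_token = int(rng.choice(len(probs), p=softmax(logits / temperature)))

print(probs, greedy_token, sampled_token)
```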

In summary, the production LLM relies on tokenization, embedding, iterative processing through its network, and decoding to transform a text prompt into a complete inference.
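
The full cycle can be seen end to end in a minimal greedy-decoding loop using the Hugging Face `transformers` library. The `gpt2` checkpoint is used here only as a small, convenient stand-in for a production model; that choice, the prompt, and the 20-token limit are assumptions of the sketch.

```python
# Minimal greedy-decoding loop with Hugging Face transformers.
# "gpt2" is only a small stand-in model; a production LLM follows the
# same tokenize -> forward -> decode -> iterate -> detokenize cycle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Steps 1-2: tokenization (the embedding lookup happens inside the model)
input_ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                          # step 6: iterate until a stop condition
        logits = model(input_ids).logits         # steps 3-4: forward pass -> logits
        next_id = logits[0, -1].argmax()         # step 5: greedy decoding
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0]))            # step 7: detokenization
```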

