Understanding Foundational Models
Training Data
- an AI model is only as good as the data it was trained on
- common sources of general training data
- Common Crawl: sporadic crawls of web data done by a nonprofit organization https://commoncrawl.org/
- Google has a clean subset of this data called the Colossal Clean Crawled Corpus (or C4) - a loading sketch follows this list
- general-purpose foundation models typically perform less well on domain-specific tasks (due to less domain-specific training data)
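As a hedged illustration (not from the notes): C4 can be streamed from the Hugging Face Hub without downloading the full corpus. The dataset name `allenai/c4` and the `en` config are assumptions about how it is published there.

```python
# Hedged sketch: stream a few documents from a hosted copy of C4.
# The repo id "allenai/c4" and config "en" are assumptions about the Hub listing.
from datasets import load_dataset

# streaming=True avoids downloading the full corpus (hundreds of GB)
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    print(doc["text"][:200])  # each record is a cleaned web page; "text" holds its contents
    if i == 2:
        break
```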
Modeling
- What should modelers consider?
- Model architecture?
- Number of parameters?
Model Architecture
- Transformer architecture (see glossary) is currently the dominant architecture for language-based foundation models
- training is based on the attention mechanism - intended to solve some prevailing problems:
- previous seq2seq architectures generated output based only on the final hidden state of the input (analogy: like generating answers about a book after reading only its summary)
- using an RNN encoder and decoder meant that input processing and output generation were done sequentially - slow for inputs with lots of tokens
- Transformer Architecture Inference: leverages the parallel nature of execution
- Prefill step: the model processes input tokens in parallel
- this produces key and value vectors for all input tokens
- Decode step: the model generates one output token at a time
- Attention Mechanism
- uses `key`, `value`, and `query` vectors
- From ChatGPT: the `key` vectors are used for matching and weighting (determining “where to look”), while the `value` vectors provide the substantive information (“what to show”) during the attention computation
- From ChatGPT: a `query` vector is a high-dimensional representation derived from an input element (e.g., a token in a sentence). It represents the “question” that the model asks of other tokens in the sequence
- the attention mechanism computes how much attention to give an input token by performing a dot product between a `query` vector and its `key` vector (a minimal sketch follows this list)
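To make the query/key/value bullets concrete, here is a minimal NumPy sketch of scaled dot-product attention over a single sequence. The toy shapes and random vectors are assumptions for illustration, not taken from any specific model.

```python
# Minimal sketch of scaled dot-product attention (single head, single sequence).
# Shapes and values are toy assumptions for illustration only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of query, key, and value vectors."""
    d_k = Q.shape[-1]
    # dot product between each query and every key: "how much attention to give"
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len)
    # softmax turns scores into attention weights that sum to 1 per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted mix of the value vectors ("what to show")
    return weights @ V                          # (seq_len, d_k)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                             # 4 tokens, 8-dimensional vectors (toy sizes)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```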
Model Size
- Number of parameters is usually appended to the model name: e.g. `Llama-13B` ~ 13 billion parameters
- Generally, more parameters means more capacity to learn
- However, newer models generally perform better even if they are smaller
- a parameter is usually stored in 2 bytes (16 bits) - see the worked memory estimate after this list
- a `sparse model` has a lot of zero-value parameters
- Training size
- dataset sizes are measured by the number of training samples
- Language Models: a training sample can be a sentence, a Wikipedia page, a chat conversation, or a book
- currently LLMs are trained on datasets representing trillions of tokens
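As a worked example of the 2-bytes-per-parameter point (my own arithmetic, not from the notes), a rough estimate of the memory needed just to hold the weights:

```python
# Back-of-the-envelope weight-memory estimate: parameters * bytes per parameter.
# The 13B parameter count and 2-byte (16-bit) precision follow the notes above;
# this ignores activations, the KV cache, and optimizer state.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9  # gigabytes

print(weight_memory_gb(13e9))                       # Llama-13B at 16-bit precision -> ~26.0 GB
print(weight_memory_gb(13e9, bytes_per_param=1))    # hypothetical 8-bit quantization -> ~13.0 GB
```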
Post Training
- Refine trained models: e.g. move from text completion to conversation
- Post-training high-level steps:
- Supervised finetuning (SFT): finetune the model on high-quality instruction data to optimize it for conversations instead of completion
- Preference finetuning: further finetune the model to output responses that align with human preference - typically using reinforcement learning (a data-format sketch follows this list)
- Pretraining focuses on optimizing token-level quality. Post-training focuses on the quality of the overall response
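To illustrate the difference between the two steps, here is a hedged sketch of what the training data often looks like. The field names (`messages`, `prompt`, `chosen`, `rejected`) are common conventions assumed for illustration, not taken from the notes.

```python
# Hedged sketch of typical post-training data (field names are assumptions).

# Supervised finetuning (SFT): high-quality (instruction, response) pairs,
# usually expressed as a chat-style conversation rather than raw text to complete.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarize the water cycle in two sentences."},
        {"role": "assistant", "content": "Water evaporates, condenses into clouds, and falls "
                                         "back as precipitation. It then collects in rivers "
                                         "and oceans and the cycle repeats."},
    ]
}

# Preference finetuning: for one prompt, a response humans preferred and one they rejected.
# A reward model / RL step (e.g. RLHF) then pushes the model toward the "chosen" style.
preference_example = {
    "prompt": "Explain recursion to a beginner.",
    "chosen": "Recursion is when a function solves a problem by calling itself on a smaller piece...",
    "rejected": "Recursion is recursion.",
}
```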
Inference
When a production LLM receives a string prompt, it follows these high-level steps to generate a response:
1. Tokenization: The raw text is processed by a tokenizer, which breaks the prompt into tokens (words, subwords, or characters) and converts them into numerical IDs.
2. Embedding: These token IDs are mapped to continuous vector representations (embeddings) using a learned lookup table. These embeddings capture semantic and syntactic features of the tokens.
3. Model Processing: The sequence of embeddings is fed into the model's architecture, often a Transformer. Here, layers of self-attention and feed-forward networks integrate context from the entire sequence, using the model's learned weights, biases, and parameters.
4. Logit Generation: The model outputs a set of logits (raw scores) for the next token, representing the unnormalized likelihoods of each token in the vocabulary.
5. Decoding: The logits are converted into probabilities (typically via a softmax function). Based on these probabilities and a chosen decoding strategy (like greedy decoding, beam search, or sampling), the model selects the next token.
6. Iteration: The newly generated token is appended to the prompt, and steps 3–5 are repeated until a termination condition is met (e.g., an end-of-sequence token or reaching a maximum length).
7. Detokenization: Finally, the generated tokens are converted back into human-readable text by reversing the tokenization process.
In summary, a production LLM relies on tokenization, embedding, iterative processing through its network, and decoding to transform a text prompt into a complete response.
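A minimal sketch of this loop using the Hugging Face `transformers` library is shown below. The `gpt2` checkpoint, greedy decoding, and the 20-token cap are illustrative assumptions; a production server would add batching, sampling strategies, and KV caching.

```python
# Minimal greedy-decoding sketch mirroring the steps above
# (the "gpt2" checkpoint and 20-token cap are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# 1. Tokenization: text -> token IDs
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                  # 6. iterate until a stop condition
        # 2-4. Embedding, model processing, and logit generation happen in the forward pass
        logits = model(input_ids).logits                 # (batch, seq_len, vocab_size)
        # 5. Decoding: here, greedy - pick the highest-probability next token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:     # end-of-sequence token
            break
        input_ids = torch.cat([input_ids, next_id], dim=-1)

# 7. Detokenization: token IDs -> human-readable text
print(tokenizer.decode(input_ids[0]))
```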