Training Data


  • an AI model is only as good as the data it was trained on
  • common sources of general training data
    • Common Crawl: sporadic crawls of web data done by a nonprofit organization https://commoncrawl.org/
      • Google has a clean subset of this data called the Colossal Clean Crawled Corpus (C4); see the loading sketch after this list
  • general-purpose foundation models will typically perform less well on domain-specific tasks (since those domains are underrepresented in the training data)
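
As a quick illustration of working with this kind of data, the sketch below streams a few C4 documents via the Hugging Face `datasets` library. The `allenai/c4` dataset name, the `"en"` config, and the `"text"` field are assumptions about the public Hub mirror, not something from these notes.

```python
# Minimal sketch: streaming a few documents from C4 (the cleaned
# Common Crawl subset) via the Hugging Face `datasets` library.
# Assumes the `allenai/c4` dataset with the "en" config is available
# on the Hub and that each record exposes a "text" field.
from datasets import load_dataset

# streaming=True avoids downloading the full multi-hundred-GB corpus
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:200])  # first 200 characters of each document
    if i >= 2:
        break
```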

Modeling


  • What should modelers consider?
    • Model architecture?
    • Number of parameters?

Model Architecture

  • The Transformer architecture (see glossary) is currently the dominant architecture for language-based foundation models
    • training is based on the attention mechanism
    • intended to solve some prevailing problems:
      1. the previous seq2seq architecture generated output based only on the final hidden state of the input (analogy: like answering questions about a book after reading only its summary)
      2. using an RNN encoder and decoder meant that input processing and output generation are done sequentially - slow for inputs with many tokens
  • Transformer Architecture Inference: leverages the parallel nature of execution
    1. Prefill step: model processes input tokens in parallel
      • this produces key and value vectors for all input tokens
    2. Decode step: the model generates one output token at a time
  • Attention Mechanism
    • use key, value, and query vectors
    • From ChatGPT: the key vectors are used for matching and weighting (determining “where to look”), while the value vectors provide the substantive information (“what to show”) during the attention computation
    • From ChatGPT: query vector is a high-dimensional representation derived from an input element (e.g., a token in a sentence). It represents the “question” that the model asks of other tokens in the sequence
    • the attention mechanism computes how much attention to give each input token by taking the dot product between the query vector and that token’s key vector (see the sketch after this list)
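
A minimal NumPy sketch of scaled dot-product attention with a key/value cache, showing the prefill step (keys and values computed for all input tokens in parallel) and a single decode step. The dimensions, random weights, and toy embeddings are illustrative assumptions, not taken from any real model.

```python
import numpy as np

d = 8                       # illustrative head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy projection matrices

def attention(q, K, V):
    # dot product between the query and every key decides "where to look"
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # the value vectors provide "what to show", mixed according to the weights
    return weights @ V

# --- Prefill: process all input tokens in parallel ---
x = rng.standard_normal((5, d))        # 5 input-token embeddings (toy data)
K_cache, V_cache = x @ Wk, x @ Wv      # key/value vectors for every input token
prefill_out = np.stack([
    attention(q, K_cache[:i + 1], V_cache[:i + 1])   # causal: attend to itself and earlier tokens
    for i, q in enumerate(x @ Wq)
])

# --- Decode: generate one output token at a time, reusing the cache ---
new_tok = rng.standard_normal(d)       # embedding of the latest generated token
K_cache = np.vstack([K_cache, new_tok @ Wk])
V_cache = np.vstack([V_cache, new_tok @ Wv])
decode_out = attention(new_tok @ Wq, K_cache, V_cache)

print(prefill_out.shape, decode_out.shape)   # (5, 8) (8,)
```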

Model Size

  • The number of parameters is usually appended to the model name: e.g. Llama-13B ~ 13 billion parameters
  • Generally more parameters means more capacity to learn
    • however, newer models often perform better even when they are smaller
    • a parameter is typically stored in 2 bytes (16 bits), e.g. fp16/bf16 (see the arithmetic sketch after this list)
  • a sparse model has a large proportion of zero-valued parameters
  • Training size
    • dataset sizes are measured by the number of training samples
    • Language Models: a training sample can be a sentence, a Wikipedia page, a chat conversation, or a book
    • currently LLMs are trained on datasets representing trillions of tokens
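
A back-of-the-envelope sketch of what the 2-bytes-per-parameter figure implies for the weight memory of a 13B-parameter model. The numbers are rough; activations, KV cache, and optimizer state are ignored.

```python
# Rough memory footprint of the model weights alone, assuming
# 16-bit (2-byte) parameters such as fp16 or bf16.
params = 13e9            # e.g. Llama-13B ~ 13 billion parameters
bytes_per_param = 2      # 16 bits
weight_bytes = params * bytes_per_param

print(f"{weight_bytes / 1e9:.0f} GB of weights")   # ~26 GB
```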

Post Training

  • Refine trained models: e.g. move from text completion to conversation
  • Post training - high level steps
    1. Supervised finetuning (SFT): finetune the model on high-quality instruction data to optimize it for conversation instead of completion (see the illustrative sample after this list)
    2. Preference finetuning: further finetune the model to output responses that align with human preferences, typically using reinforcement learning
  • Pretraining focuses on optimizing token-level quality; post-training focuses on the quality of the overall response
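
An illustrative sketch of what a single SFT instruction sample and its flattened training sequence might look like. The field names and prompt template here are assumptions chosen for the example, not a fixed standard.

```python
# Illustrative SFT training sample: one instruction/response pair.
# The field names ("instruction", "input", "response") are a common
# convention, not a fixed standard; real datasets vary.
sft_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "The transformer architecture replaced recurrent models ...",
    "response": "The paragraph explains how transformers replaced RNNs ...",
}

# During SFT the pair is flattened into a single token sequence and the
# model is trained with the usual next-token loss to produce the response.
prompt = (
    f"### Instruction:\n{sft_example['instruction']}\n\n"
    f"### Input:\n{sft_example['input']}\n\n"
    "### Response:\n"
)
target = sft_example["response"]
print(prompt + target)
```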

Inference


When a production LLM receives a string prompt, it follows these high-level steps to generate an inference:

  1. Tokenization: The raw text is processed by a tokenizer, which breaks the prompt into tokens (words, subwords, or characters) and converts them into numerical IDs.
  2. Embedding: These token IDs are mapped to continuous vector representations (embeddings) using a learned lookup table. These embeddings capture semantic and syntactic features of the tokens.
  3. Model Processing: The sequence of embeddings is fed into the model’s architecture (often a Transformer). Here, layers of self-attention and feed-forward networks integrate context from the entire sequence, using the model’s learned weights, biases, and parameters.
  4. Logit Generation: The model outputs a set of logits (raw scores) for the next token, representing the unnormalized likelihoods of each token in the vocabulary.
  5. Decoding: The logits are converted into probabilities (typically via a softmax function). Based on these probabilities and a chosen decoding strategy (like greedy decoding, beam search, or sampling), the model selects the next token (see the decoding sketch after this list).
  6. Iteration: The newly generated token is appended to the prompt, and steps 3–5 are repeated until a termination condition is met (e.g., an end-of-sequence token or reaching a maximum length).
  7. Detokenization: Finally, the generated tokens are converted back into human-readable text by reversing the tokenization process.
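
A small sketch of step 5 with toy logits, showing the softmax conversion and the difference between greedy decoding and temperature sampling. The logits and temperature value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.2, -1.0])   # toy scores over a 4-token vocabulary

def softmax(x):
    x = x - x.max()                        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

probs = softmax(logits)
greedy_token = int(np.argmax(probs))       # greedy decoding: always pick the top token

temperature = 0.7                          # <1 sharpens, >1 flattens the distribution
sampled_token = int(rng.choice(len(probs), p=softmax(logits / temperature)))

print(probs, greedy_token, sampled_token)
```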

In summary, the production LLM relies on tokenization, embedding, iterative processing through its network, and decoding to transform a text prompt into a complete inference.
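
The full cycle can be seen end to end in a minimal greedy-decoding loop using the Hugging Face `transformers` library. The `gpt2` checkpoint is used here only as a small, convenient stand-in for a production model; that choice, the prompt, and the 20-token limit are assumptions of the sketch.

```python
# Minimal greedy-decoding loop with Hugging Face transformers.
# "gpt2" is only a small stand-in model; a production LLM follows the
# same tokenize -> forward -> decode -> iterate -> detokenize cycle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Steps 1-2: tokenization (the embedding lookup happens inside the model)
input_ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                          # step 6: iterate until a stop condition
        logits = model(input_ids).logits         # steps 3-4: forward pass -> logits
        next_id = logits[0, -1].argmax()         # step 5: greedy decoding
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0]))            # step 7: detokenization
```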

