Summary by Sofia Aparicio
What was the context? What was studied already?
Large Language Models (LLMs) have billions of parameters, but training them takes a lot of time ⏰ and computation power.
<aside> ⚠️ Larger and larger models are being trained while the training set is kept at around 300 billion tokens, in the expectation that performance will keep increasing.
</aside>
Since large models may only be trained once, identifying the optimal hyperparameters for a given compute budget is crucial. In practice, the training compute budget is known in advance: practitioners have access to a fixed number of accelerators for a fixed amount of time.
Important Related Paper
<aside> 💡 Kaplan et al. found a power-law link between the number of parameters in an autoregressive language model (LM) and its performance (measured as evaluation loss); a toy fit of such a power law is sketched after this aside.
This image can be found in their NeurIPS presentation
</aside>
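To make the power-law relationship concrete, here is a minimal sketch of fitting $L(N) = (N_c / N)^{\alpha_N}$ to a handful of runs. The model sizes, loss values, and fitted constants below are hypothetical placeholders, not Kaplan et al.'s numbers.

```python
import numpy as np

# Kaplan-style power law: L(N) = (N_c / N) ** alpha_N, i.e. evaluation loss
# falls off as a power of the parameter count N.
# These (size, loss) points are hypothetical stand-ins, not Kaplan et al.'s runs.
N = np.array([1e7, 1e8, 1e9, 1e10])     # model sizes (parameters)
loss = np.array([4.2, 3.5, 2.9, 2.4])   # evaluation loss at each size

# A power law is a straight line in log-log space, so a linear fit on the logs
# recovers the exponent and the scale constant.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_N = -slope                         # power-law exponent
N_c = np.exp(intercept / alpha_N)        # scale constant

print(f"alpha_N ≈ {alpha_N:.3f}, N_c ≈ {N_c:.2e}")
print(f"predicted loss at 100B params: {(N_c / 1e11) ** alpha_N:.2f}")
```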
However, Kaplan et al. fixed the number of training tokens and the learning rate schedule, which prevented them from modeling the effect of these choices on the loss.
What was the objective? How was the data collected?
Main Research Question:
Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
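To see why this is a genuine trade-off, note that training compute is commonly approximated as C ≈ 6·N·D FLOPs, where N is the parameter count and D the number of training tokens, so fixing C means a larger model must be trained on fewer tokens. A minimal sketch with a made-up budget:

```python
# Training compute is commonly approximated as C ≈ 6 * N * D FLOPs
# (N = parameters, D = training tokens). For a fixed budget C, choosing a
# model size pins down how many tokens it can be trained on.
BUDGET_FLOPS = 5.7e23  # made-up budget, roughly the scale of a very large run

for n_params in (1e9, 10e9, 70e9, 280e9):
    n_tokens = BUDGET_FLOPS / (6 * n_params)
    print(f"{n_params / 1e9:6.0f}B params -> {n_tokens / 1e9:8.0f}B tokens")
```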
To answer the research question, they developed two approaches:
They varied the number of training steps for a fixed family of models (70M to over 10B parameters), training each model on 4 different amounts of training data; a rough sketch of this kind of analysis follows below.
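Roughly, this first approach amounts to the following procedure: for each level of compute, keep the configuration that reached the lowest loss, then fit a power law N_opt ∝ C^a through those points. The run data below is hypothetical and only meant to show the shape of the analysis, not the study's actual results.

```python
import numpy as np
from collections import defaultdict

# Hypothetical runs: (parameters, training tokens, final loss). In the actual
# study these come from models of 70M to >10B parameters, each trained over
# several different training horizons.
runs = [
    (70e6,  4e9, 3.9), (70e6, 16e9, 3.6),
    (400e6, 4e9, 3.4), (400e6, 16e9, 3.1),
    (1e9,   4e9, 3.2), (1e9,  16e9, 2.9),
]

# Group runs by approximate compute (C ≈ 6 * N * D) and keep, for each compute
# level, the model size that reached the lowest loss (the "optimal" envelope).
best = defaultdict(lambda: (np.inf, None))     # FLOP bucket -> (loss, params)
for n_params, n_tokens, loss in runs:
    flops = 6 * n_params * n_tokens
    bucket = int(round(np.log10(flops)))       # coarse order-of-magnitude bucket
    if loss < best[bucket][0]:
        best[bucket] = (loss, n_params)

# Fit N_opt(C) ∝ C^a through the envelope points (a line in log-log space).
C = np.array([10.0 ** b for b in best])
N_opt = np.array([params for _, params in best.values()])
a, _ = np.polyfit(np.log(C), np.log(N_opt), 1)
print(f"estimated exponent a ≈ {a:.2f} in N_opt ∝ C^a")
```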