Summary by Sofia Aparicio
What was the context? What was studied already?
Large Language Models (LLMs) have billions of parameters, but training them takes a lot of time ⏰ and computation power.
<aside> ⚠️ Larger and larger models are being trained while the training set is kept at around 300 billion tokens, in the expectation that performance will keep increasing.
</aside>
Since large models may only be trained once, identifying the optimal hyperparameters for a given compute budget is crucial. In practice, the training compute budget is known in advance: practitioners have access to a fixed number of accelerators for a fixed amount of time.
Important Related Paper
<aside> 💡 Kaplan et al. found a power-law link between the number of parameters in an autoregressive language model (LM) and its performance (measured as evaluation loss); a toy fit of such a power law is sketched after this aside.
This image can be found in their NeurIPS presentation
</aside>
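To make the power-law relationship concrete, here is a minimal sketch of fitting $L(N) = (N_c / N)^{\alpha_N}$ to a handful of runs. The model sizes, loss values, and fitted constants below are hypothetical placeholders, not Kaplan et al.'s numbers.

```python
import numpy as np

# Kaplan-style power law: L(N) = (N_c / N) ** alpha_N, i.e. evaluation loss
# falls off as a power of the parameter count N.
# These (size, loss) points are hypothetical stand-ins, not Kaplan et al.'s runs.
N = np.array([1e7, 1e8, 1e9, 1e10])     # model sizes (parameters)
loss = np.array([4.2, 3.5, 2.9, 2.4])   # evaluation loss at each size

# A power law is a straight line in log-log space, so a linear fit on the logs
# recovers the exponent and the scale constant.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_N = -slope                         # power-law exponent
N_c = np.exp(intercept / alpha_N)        # scale constant

print(f"alpha_N ≈ {alpha_N:.3f}, N_c ≈ {N_c:.2e}")
print(f"predicted loss at 100B params: {(N_c / 1e11) ** alpha_N:.2f}")
```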
However, Kaplan et al. fixed the number of training tokens and the learning rate schedule, which prevented them from modeling the effect of these choices on the loss.
What was the objective? How was the data collected?
Main Research Question:
Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?
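To see why this is a genuine trade-off, note that training compute is commonly approximated as C ≈ 6·N·D FLOPs, where N is the parameter count and D the number of training tokens, so fixing C means a larger model must be trained on fewer tokens. A minimal sketch with a made-up budget:

```python
# Training compute is commonly approximated as C ≈ 6 * N * D FLOPs
# (N = parameters, D = training tokens). For a fixed budget C, choosing a
# model size pins down how many tokens it can be trained on.
BUDGET_FLOPS = 5.7e23  # made-up budget, roughly the scale of a very large run

for n_params in (1e9, 10e9, 70e9, 280e9):
    n_tokens = BUDGET_FLOPS / (6 * n_params)
    print(f"{n_params / 1e9:6.0f}B params -> {n_tokens / 1e9:8.0f}B tokens")
```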
To answer the research question, they developed two approaches:
They varied the number of training steps for a fixed family of models (70M to over 10B parameters), training each model on 4 different amounts of training data; a rough sketch of this kind of analysis follows below.
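Roughly, this first approach amounts to the following procedure: for each level of compute, keep the configuration that reached the lowest loss, then fit a power law N_opt ∝ C^a through those points. The run data below is hypothetical and only meant to show the shape of the analysis, not the study's actual results.

```python
import numpy as np
from collections import defaultdict

# Hypothetical runs: (parameters, training tokens, final loss). In the actual
# study these come from models of 70M to >10B parameters, each trained over
# several different training horizons.
runs = [
    (70e6,  4e9, 3.9), (70e6, 16e9, 3.6),
    (400e6, 4e9, 3.4), (400e6, 16e9, 3.1),
    (1e9,   4e9, 3.2), (1e9,  16e9, 2.9),
]

# Group runs by approximate compute (C ≈ 6 * N * D) and keep, for each compute
# level, the model size that reached the lowest loss (the "optimal" envelope).
best = defaultdict(lambda: (np.inf, None))     # FLOP bucket -> (loss, params)
for n_params, n_tokens, loss in runs:
    flops = 6 * n_params * n_tokens
    bucket = int(round(np.log10(flops)))       # coarse order-of-magnitude bucket
    if loss < best[bucket][0]:
        best[bucket] = (loss, n_params)

# Fit N_opt(C) ∝ C^a through the envelope points (a line in log-log space).
C = np.array([10.0 ** b for b in best])
N_opt = np.array([params for _, params in best.values()])
a, _ = np.polyfit(np.log(C), np.log(N_opt), 1)
print(f"estimated exponent a ≈ {a:.2f} in N_opt ∝ C^a")
```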