Replies: 17 comments 13 replies
-
You mentioned the D:N ratio came out to ~8 vs Chinchilla's ~20, and speculated it might be Muon or small-model artifacts. Is there a simple way to think about why an optimizer would change this ratio? My intuition is that a more efficient optimizer "fills up" model capacity faster, so you'd shift compute toward bigger models rather than longer training. Is that directionally right, or is something else going on? Separate question on depth as the dial... you fixed aspect ratio and swept depth. Is there prior work (or intuition) suggesting depth is the more natural axis to vary vs width? Or was this mainly a practical choice to keep the sweep clean?
-
This resonates with your “dial / family of models” framing. One question I’ve been thinking about: we treat compute / params / tokens as the primary dials, but there seems to be an upstream dial that usually stays implicit — the human-side pre-filtering and curation regime that defines what even enters the training pipeline. Have you ever tried holding model + compute fixed, while systematically varying how data is curated or labeled (e.g. exploratory vs delivery-driven filtering), and then measuring downstream calibration or OOD behavior? My hunch is that some fragility attributed to scaling limits may actually come from an unablated observer interface earlier in the pipeline.
-
Title: [2025++]
-
I might be missing something or perhaps I misunderstood the concept, but I have a question regarding the individual metrics. Since users are often interested in performance on specific tasks, do the individual benchmarks that make up the CORE score also show the same smooth scaling laws?
-
Wonderful stuff. How does FP precision affect parameter count? Is 2B 8-bit equal to 1B 16-bit? Could this be related to the discrepancies in the optimal ratios between Chinchilla and this: we are training at lower precision (are we?) so we need less information to "fill up" each parameter?
-
This is so good, thank you! 🙏
-
Thanks @karpathy - did you include the scripts for the
-
very interesting, how big was the improvement in metrics by making this change? |
-
You are a goddamn hero. I lost it when you trained a model to estimate GPT-3 scores. Genius.
-
Hi Andrej, thank you for sharing your detailed and granular empirical analysis. Building on an earlier piece of research, A Resource Based Model For Neural Scaling Laws:
From my theoretical derivation with width-only scaling (fixed depth), N_p ∝ N², which implies that N ∝ N_p^(1/2). Since FLOPs (C) grow proportionally to the resources (neurons) N, this means that N ∝ C^0.5, which is pretty close to your 0.49 for the optimal model size exponent. This seems to suggest that width-only scaling beyond critical depth is not just theoretically sound but also matches (at least in your setting) the optimal scaling behavior you've observed in practice. Additionally, here is an illustration (from my preprint) of the projection in relation to the Chinchilla paper:
This alignment between theory and empirical evidence reinforces the value of your miniseries approach -- it's helping to bridge the gap between theoretical understanding and practical implementation of scaling laws. Best regards,
-
Amazingly similar to the Cobb–Douglas model (economics). Sub-linear returns if scaling occurs in only one input factor, but linear returns when you scale factors together.
-
Amazing post! I have just one quick question: did you consider using muP-style hyperparameter transfer for these scaling experiments? If so, what tradeoffs led you to stick with empirical tuning instead?
-
Update Jan 11: pushed a number of upgrades and re-ran the miniseries.
-
This is a great repository and a good write-up. Many thanks. However, I am afraid that the scaling laws are more artifact than fact. They are largely a consequence of architectural limitations (a.k.a. transformers). In general, compute is not proportional to the number of learnable parameters; CNNs are one of the simplest counterexamples. The idea that the affine forward/backward weight operations are the bulk of neural computing is a similar fallacy and a consequence of the same limitations. Full connectivity is the worst possible synaptic map and the other culprit. Although distributed semantics is arguably one of the greatest discoveries in (computational) linguistics in the 20th century, embeddings are not an effective way to implement and attempt to learn them. All of those things are ephemeral and they will not survive much longer. The scaling laws depend on them and will likewise collapse. They will certainly not survive my own research.
-
What should I do
-
Update Feb 7. I re-ran the scaling laws overnight because I was curious where the repo is at, and it's been exactly 1 month since Jan 7. First, the val_bpb scaling laws look nice and clean:
The CORE score, sadly, is a bit noisier than I'd like at this small scale. I did a lot of experimentation on how to smooth it out and succeeded in making a "SmoothCORE", but decided to abandon it out of fear of confusing people and introducing another metric. So e.g. I think the "regression" at d16 is just noise. d30 reaches a CORE of almost exactly 30.0:
That said, the fit to the (slightly noisy) CORE scores implies the following costs for reproducing various models in the GPT-2 and GPT-3 miniseries:
So GPT-3 is now predicted to cost $32.3K and take ~56.1 days to train with the nanochat d75 model. Recall that on Jan 7 this estimate was all the way up around $1,076.5K, so that's a pretty solid improvement. But again, this is all a little bit weird and noisy and not to be trusted exactly. (Ideally we'd have targets in val_bpb, which is a clean metric, but it's not comparable here due to training data distribution shift.) Still, that's the rough napkin math at this point.
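As a quick back-of-the-envelope check on that number, assuming the same 8XH100 at $3/GPU/hour = $24/hour rate used in the main post below:

```python
# Rough check of the GPT-3 reproduction estimate above: 8XH100 at $3/GPU/hr = $24/hr.
days = 56.1
cost = days * 24 * 24   # hours per day * dollars per hour
print(f"${cost:,.0f}")  # -> $32,314, i.e. the ~$32.3K quoted above
```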












Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Why miniseries. The correct way to think about LLMs is that you are not optimizing for a single specific model but for a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do careful science of scaling laws, and ultimately this is what gives you the confidence that when you pay for "the big run", the extrapolation will work and your money will be well spent. For the first public release of nanochat my focus was on an end-to-end pipeline that runs the whole LLM workflow with all of its stages. Now, I'm coming back around to flesh out some of the parts that I sped through, starting of course with pretraining, which is both computationally heavy and critical as the foundation of intelligence and knowledge in these models.
Miniseries v1. In nanochat, that single dial is the depth of the model. For example, d12 is my favorite model (it's the size of GPT-1!) - it has 12 layers and currently trains in ~6 minutes. The setting of depth determines the number of channels in the Transformer (via the constant "aspect ratio"), and in turn the number of parameters and flops per token of the Transformer, the optimization hyperparameters (the learning rate in particular), and finally, via scaling laws analysis, the horizon of training to obtain a "compute optimal" model (more on this below). As of the latest commit, the script miniseries.sh sweeps out the family of nanochat models from d10 to d20. All of these fit into a single 8XH100 node at the training batch size of 2**19 = 524,288 tokens without having to reach for micro batches and gradient accumulation. nanochat already supports gradient accumulation and I've trained much larger models (e.g. d34 recently), but I wanted to focus on this simplest setting first. The wandb plots look like this; the x-axis is flops and the y-axis is validation bpb (bits per byte, i.e. loss):
What you're seeing here are models d10...d20. These 11 models took ~4 hours back to back on my trusty 8XH100 node to train, for ~$100 of total cost. If your code, architecture and optimization are properly arranged and you did your scaling laws right, these curves should not intersect. Each one represents the unique, compute optimal way to reach a target validation loss.
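As a rough illustration of the dial (a hedged sketch, not the actual nanochat code; the aspect ratio of 64 channels per layer and the head size of 128 are assumptions made here for concreteness), everything downstream can be derived from depth alone:

```python
# Hedged sketch: deriving a model config from the single `depth` dial.
# The aspect ratio (64 channels per layer) and head size (128) are illustrative assumptions.
def config_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> dict:
    n_embd = depth * aspect_ratio          # channel count follows depth via the fixed aspect ratio
    n_head = max(1, n_embd // head_dim)    # head count follows the channel count
    return {"n_layer": depth, "n_embd": n_embd, "n_head": n_head}

print(config_from_depth(12))  # -> {'n_layer': 12, 'n_embd': 768, 'n_head': 6} under these assumptions
```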
Comparison to GPT-2/GPT-3 miniseries. I did not want to use the validation loss to compare models because, while it is simple, it can be subtle and deceiving. For example, modded nanogpt (which I otherwise love) merged a few changes that I thought were mildly gaming the metric (e.g. using very long sequences and batch size 1). It's a bit subtle, but stretching out your validation batches into one long row (i.e. batch size B=1) just means there are fewer tokens with cropped contexts at the first few columns of your (B, T) batches than when B>>1. This changes the validation loss by increasing the amount of context for many of these tokens, so it's not apples to apples, and the resulting "improvement" is not real. In addition, validation loss is a bad way to compare against the GPT-2 and GPT-3 models because they were pretrained on a very different and unknown data mix distribution, so comparing FineWeb loss is not fair or informative. Only actual metrics are real and comparable. Earlier in the year I stumbled upon the DCLM paper, where they presented a nice ensemble metric over a lot of different datasets. The metric is called CORE and it incorporates performance across 22 nice, high-quality datasets. The DCLM code had a complicated and bloated way of calculating it, so I stripped it all the way down to a single, simple, dependency-free file that evaluates the CORE metric given a model. Then we can chart a nice, valid comparison of our miniseries v1 models to GPT-2 and GPT-3 (more on how I calculated/estimated their CORE scores below), where the x-axis is resource spend (FLOPs, time) and the y-axis is CORE score. To get $ as the x-axis, simply take the time (hours) and multiply by $24 (as the cost of 8XH100 is $3/GPU/hour X 8 GPUs = $24/hour).
The goal for miniseries v2 is now simple: to further optimize the pretraining code and to lift up (and ideally tilt!) this line; to get more bang for the buck.
Details
Scaling laws. One of the important and trickier aspects of getting this to work is doing a good job with your scaling laws (see Kaplan et al. and Chinchilla/Hoffmann et al.). The problem essentially is as follows. Suppose I want to train a d12 model. How many iterations should I train it for? Remember that at this small scale we are in the infinite data regime so there are no concerns of overfitting and therefore it makes no sense to, for example, train until your validation loss starts climbing. In the infinite data regime, the validation loss keeps going down indefinitely as you train longer (it just starts to level off slowly), moreover your train loss is basically equal to your validation loss - no overfitting. The answer to the problem is that the question is not quite right - you don't really want to train a d12. Instead, you have a certain compute budget of FLOPs (e.g. I want to run my cluster for exactly one day) and you want the lowest achievable loss. The real question then is: should you train a small model for many iterations or should you train a bigger model for fewer iterations? Scaling laws are a way of determining how to map from the single variable you have control over (the total number of flops) to the optimal setting of N (the number of parameters of your model) and D (the number of tokens you will train for, which is trivially related to the number of iterations or the length of time given a fixed batch size per step of the optimization).
Given a nanochat model of a certain depth, the way to calculate its flops is as follows:
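The exact helper isn't reproduced here; as a hedged sketch, the standard dense-Transformer approximation (6 FLOPs per parameter per token for the matmuls, plus an attention term) looks like this:

```python
# Sketch of the standard approximation, not necessarily the exact nanochat code:
# FLOPs per token ~= 6*N (fwd+bwd matmuls) + 12 * n_layer * d_model * seq_len (attention scores/values).
def flops_per_token(n_params: int, depth: int, d_model: int, seq_len: int = 2048) -> float:
    return 6 * n_params + 12 * depth * d_model * seq_len

# Total training FLOPs are then roughly flops_per_token(...) * total_training_tokens.
```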
And the way to calculate its parameters is simply as:
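And a correspondingly hedged sketch for the parameter count (assuming a 64-channels-per-layer aspect ratio, a 65,536-token vocab, untied embeddings, a 4x MLP, and ignoring norms and biases; the real script may differ):

```python
# Sketch: approximate parameter count for a depth-d model under the assumptions above.
def num_params(depth: int, aspect_ratio: int = 64, vocab_size: int = 65536) -> int:
    d_model = depth * aspect_ratio
    per_layer = 12 * d_model * d_model           # 4*d^2 attention projections + 8*d^2 for the 4x MLP
    embeddings = 2 * vocab_size * d_model        # untied input/output embeddings assumed
    return depth * per_layer + embeddings

print(num_params(20))  # ~5.6e8 under these assumptions
```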
Next we use the --target_flops flag of base_train.py to fix the flops to a specific target (e.g. 3e18) and run models of a few depths. The code will automatically scale the number of iterations so that it exactly gets to your desired target flops (a rough sketch of this bookkeeping follows below). Small depths train long, large depths train short; however, all of these models of different depths will end up costing exactly 3e18 FLOPs.
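A hedged sketch of that iteration scaling, reusing the approximate flops_per_token and num_params helpers from above (the real script's bookkeeping may differ):

```python
# Sketch: given a FLOPs budget, how many optimizer steps does a model of depth d get?
def iters_for_budget(depth: int, target_flops: float,
                     batch_tokens: int = 2**19, seq_len: int = 2048) -> int:
    d_model = depth * 64                                       # assumed aspect ratio of 64
    fpt = flops_per_token(num_params(depth), depth, d_model, seq_len)
    total_tokens = target_flops / fpt                          # tokens we can afford at this budget
    return round(total_tokens / batch_tokens)                  # steps at ~0.5M tokens per step

# e.g. at a 3e18 budget, d8 gets many more steps than d20:
print(iters_for_budget(8, 3e18), iters_for_budget(20, 3e18))
```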
The individual runs will look like this:
You see how the big models (e.g. d20, brown) ran for very few iterations and small models (e.g. d8, magenta) ran for many iterations. All of these models cost the exact same amount of FLOPs, but clearly somewhere in between, one of them (d16 here) struck the correct balance and reached the lowest loss. That specific combination of model size and training length is compute optimal. When you repeat this process for a few FLOP budgets, you get surprisingly nice U shapes where for each one there is a concrete setting of the model size that is compute optimal:
When you look at the optimal points (stars) after a quadratic fit you get:
For comparison, here is the same plot from the Chinchilla paper except with more compute:
Now there are a few important things to note. First, notice how (exactly like Chinchilla) the optimal number of parameters and tokens to train for is proportional to compute C to the power of ~0.5. As a sanity check, the two exponents add up to ~1 because C ~= 6ND. But more importantly, they are equal to each other. This is a remarkable result that we reproduce: it means that parameters and tokens are on equal footing w.r.t. compute optimal models. For example, if you double (2X) the compute budget, this is saying you should 1.41X your parameters and 1.41X your number of tokens (1.41 ~= sqrt(2)). In particular and most importantly, the fact that they are equal means that the optimal ratio of D:N is constant, regardless of the compute budget C. This is because if we model the number of parameters as N = k₁ · C^a and tokens D = k₂ · C^b, then D/N = (k₂/k₁) · C^(b-a), and so if a = b = 0.5, then C^(0.5 - 0.5) = C^0 = 1 = constant! So the optimal ratio D/N = k₂/k₁ is constant regardless of the compute level of interest C. Note that this really didn't have to be the case; it could have been something else, complicated, and a function of the model size (the original scaling laws paper of Kaplan et al. found this to be the case, incorrectly, due to a major bug in learning rate decay), but nature decided that compute optimal Transformers fall exactly on a straight line (what???). Chinchilla pointed it out and nanochat reproduces this surprising finding. In any case, practically speaking this is huge because it means that we have a single constant telling us the optimal ratio between D and N, and therefore we can simply have a target_param_data_ratio in the nanochat base_train script, which calculates the optimal number of tokens to train for regardless of the depth of the model. In Chinchilla, they empirically measure k₂/k₁ to be 20. In nanochat, when you do the fits you actually get something much lower: 8. It's possible that some specifics of nanochat (the Muon optimizer, or...?) make it so that nanochat prefers bigger models trained shorter. Or it could be an artifact of the smaller model sizes we're looking at here. In any case, we now know how long we should train any given model. You take the number of its parameters N, you multiply by 8 to get the number of target tokens D, then divide D by the batch size (~0.5M) to get your number of iterations, done. We're now only training compute optimal models.
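As a worked example of that recipe (a back-of-the-envelope sketch reusing the approximate num_params from above; the real script's accounting of N may differ, e.g. whether embeddings are counted):

```python
# Sketch: turning the target D:N ratio of 8 into an iteration count for a d12 model.
N = num_params(12)                  # approximate parameter count (sketch from earlier)
D = 8 * N                           # target training tokens, D = 8 * N
num_iterations = round(D / 2**19)   # batch size of 2**19 = 524,288 tokens per step
```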
Hyperparameter sweeps. I ran a few more tuning sweeps that I won't spend a lot of time on. The learning rates are close to optimal after a small nudge to the embedding learning rate. The warmdown ratio was the biggest surprise and I nudged it 0.2 -> 0.4. A sequence length of 2048 turns out to be quite good, balancing context length and document diversity in our batch of 0.5M tokens. The batch size of 0.5M is a little bit on the larger side and purely flops-wise should be a little bit smaller (~half), but wallclock-wise it is good as is. All this to say that I did some basic tuning for miniseries v1, but it was by no means exhaustive and there are still many ideas to try.
GPT-2 / GPT-3 CORE scores. Another challenge was calculating CORE scores for the GPT-2 and GPT-3 miniseries. The GPT-2 miniseries was easy because the models were released and are available, so you can just download them and run the eval. For the GPT-3 miniseries I had to get more creative because the models were never released, but we do have the paper with their evaluation results in the tables. I posted the full approach to a jupyter notebook, but basically I found 6 tasks that are both in the CORE metric and reported in the GPT-3 paper in a very similar evaluation setting. They are ['HellaSwag 0-shot', 'LAMBADA', 'HellaSwag 10-shot', 'PIQA', 'ARC Easy', 'ARC Challenge']. I then use the GPT-2 models for calibration, meaning that I trained a simple model that takes the performance on these 6 tasks and estimates the CORE score (over 22 tasks), using 3 different approaches. The fact that these 6 are solid evals and that the points lined up very nicely gave me confidence that these CORE scores are not far off.
After all this work we get our targets:
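The notebook itself isn't reproduced here; one simple way to do this kind of calibration (a hedged sketch of the idea, not necessarily one of the 3 approaches actually used, and with placeholder numbers rather than the real scores) is an ordinary least-squares fit on the GPT-2 models, then applying it to the GPT-3 paper numbers:

```python
import numpy as np

# Hedged sketch: calibrate 6-task scores -> CORE on the GPT-2 models, then apply to GPT-3 rows.
rng = np.random.default_rng(0)
gpt2_six_task = rng.uniform(0.2, 0.7, size=(4, 6))   # placeholder: 4 GPT-2 models x 6 shared task scores
gpt2_core     = rng.uniform(0.1, 0.5, size=4)        # placeholder: their measured 22-task CORE scores

X = np.hstack([gpt2_six_task, np.ones((4, 1))])      # add a bias column
w, *_ = np.linalg.lstsq(X, gpt2_core, rcond=None)    # least-squares calibration fit

def estimate_core(six_task_scores: np.ndarray) -> float:
    """Estimate CORE from the 6 shared task scores (e.g. read off the GPT-3 paper tables)."""
    return float(np.append(six_task_scores, 1.0) @ w)
```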
(Note 1: You'll notice that GPT-3 at the same parameter counts as GPT-2 is a slightly better model with stronger performance, due to various improvements to data, architecture and optimization, and the GPT-3 models are also trained on a lot more tokens (300B) compared to GPT-2's estimated token budget of somewhere around ~100B tokens. Note 2: All of these are CORE scores v1, not v2 (iykyk, otherwise nvm).)
Miniseries v1 CORE scores. Second, here are the nanochat miniseries v1 models in their full detail:
NOTE: Do not confuse the v1 miniseries with the previous nanochat models I have trained so far in the rest of the discussions. Those models were trained with a D:N ratio of 20 (Chinchilla); these models use 8. So they are trained for less (at each depth), but are compute optimal for their respective validation loss.
We can also take the models d12+ (discarding some of the smaller models for fear of outliers at that tiny scale) and do a fit of the asymptotic form predicting (total) parameters -> CORE, which gives CORE = 1.0000 - 3.7555 * FLOPs^(-0.0344). With this, we can extrapolate to see what we need to reach all of the GPT-2 and GPT-3 models:
Depth to match GPT-2/3 CORE (d>=12 fit, 12:1 D:N ratio, 8xH100 @ $3/GPU/hr):
Now, I wouldn't really read into this too much because we're doing a lot of extrapolation over many orders of magnitude of flops based on very few datapoints. And to estimate the time and cost we're optimistically assuming the same utilization at scale as that of the d20 run, which fully utilizes the 8XH100 box. But still, it's encouraging that, as a sanity check, the predicted FLOPs needed to get to GPT-3 level is 5.7e23 (the real amount needed for the GPT-3 run was ~3e23). If these numbers are to be trusted, then we're in a pretty decent spot, but still with quite a bit of room for improvement.
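The extrapolation amounts to inverting that fit. A hedged sketch (the 0.35 target CORE and the ~5e15 FLOP/s node throughput below are placeholders, not values from the post):

```python
# Hedged sketch: invert CORE = 1.0 - 3.7555 * FLOPs^(-0.0344) to get the compute for a target CORE.
def flops_for_core(target_core: float) -> float:
    return (3.7555 / (1.0 - target_core)) ** (1.0 / 0.0344)

flops = flops_for_core(0.35)        # placeholder target, not an actual GPT-2/3 CORE value
days = flops / (5e15 * 86400)       # placeholder: ~5e15 usable FLOP/s for the 8XH100 node
cost = days * 24 * 24               # $24/hour for the node
print(f"{flops:.2e} FLOPs, {days:.1f} days, ${cost:,.0f}")
```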
For any Q&A please feel free to use the discussion below; alternatively, find me on the Discord channel #nanochat.