
NeurIPS 2023 Highlights | Amii

Published Dec 20, 2023

MSc. graduate Revan MacQueen shares his three favourite insights from the conference.

Neural Information Processing Systems (NeurIPS) is to machine learning conferences what Coachella is to music festivals (ok, maybe fewer sunburns, but you get the point). Between the thousands of papers, posters and participants, the scale of this conference is mind-boggling.

The conference is extremely interdisciplinary, with exciting work happening in machine learning (ML), neuroscience, social science and game theory, to name just a few areas. But to no one’s surprise, large language models (LLMs) took centre stage this year.

A theme I noticed during my exploration was “rethinking common narratives” as many studies challenged existing stories in ML by arguing for alternative views. My three favourite papers all fell into this category for different research areas: LLMs, statistical learning theory, and deep reinforcement learning.

Seeing Emergence in LLMs Clearly

Are Emergent Abilities of Large Language Models a Mirage? Schaeffer et al.

LLMs have gained huge popularity in recent years due to an explosion in performance—but what causes this explosion? As models grow in size (measured by the number of parameters), researchers have noticed that there seems to be a critical point where models suddenly undergo rapid qualitative improvements in performance: an effect dubbed emergence.

Emergence in LLMs is often viewed with a mix of awe and apprehension. For example, simply scaling the training FLOPs from 10^22 to 10^24 led to GPT-3’s modular arithmetic accuracy shooting up from near 0% to over 30%! A figure in the paper shows this jump, with GPT-3 as the purple line.

It’s truly amazing that simply increasing the size of models leads to sudden increases in ability. Seems like something really interesting is going on with emergence, right?

Or maybe it’s all a mirage?

Schaeffer et al. believe that what we call emergence in LLMs is simply the effect of using a non-linear metric to evaluate models. Let me explain.

The performance of deep neural networks scales with the size of the training set, the size of the network, and the computational resources used in training. Schaeffer et al. simplify this picture with a mathematical model in which the performance (measured by cross-entropy loss) of a hypothetical LLM depends only on the number of parameters. Empirical observations suggest this relationship follows a power law, so Schaeffer et al. use a power law for illustrative purposes, but the argument doesn’t hinge on that particular form.

Now suppose you train with a cross-entropy loss but evaluate with another metric, say, 5-token accuracy (1 if the model selects all 5 tokens correctly and 0 otherwise). 5-token accuracy is a very nonlinear metric, and according to Schaeffer et al., this nonlinearity creates the illusion of emergence.

Say you start with a tiny model with very poor 5-token accuracy, and gradually increase the number of parameters. As you increase the number of parameters, the cross-entropy loss decreases fairly smoothly, but performance on 5-token accuracy remains poor. However, there will come a point where, by minimizing cross-entropy loss, the model has learned enough that 5-token accuracy will start to increase. To someone monitoring only 5-token accuracy, this looks like emergence, but it’s actually the result of a non-linear metric.
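To make this concrete, here is a minimal sketch of that toy model in Python (the constants and the power-law exponent are made up for illustration; the paper’s actual model is more careful):

```python
import numpy as np

# Toy model: per-token cross-entropy loss falls as a smooth power law
# in the number of parameters. All constants are illustrative.
def cross_entropy_loss(n_params, c=1e7, alpha=0.5):
    return (n_params / c) ** (-alpha)

n_params = np.logspace(6, 11, 6)      # 1M to 100B parameters
loss = cross_entropy_loss(n_params)   # decreases smoothly with scale

per_token_acc = np.exp(-loss)         # chance of getting a single token right
five_token_acc = per_token_acc ** 5   # all 5 tokens must be right

for n, p1, p5 in zip(n_params, per_token_acc, five_token_acc):
    print(f"N={n:9.1e}  per-token acc={p1:.3f}  5-token acc={p5:.3f}")

# Per-token accuracy improves gradually, but 5-token accuracy stays near
# zero until per-token accuracy is already high, then shoots up: a sharp,
# "emergent"-looking jump produced entirely by the nonlinear metric.
```

Plotting five_token_acc against n_params on a log x-axis gives the same sudden-jump shape as the emergence plots, even though the underlying loss curve is perfectly smooth.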

The authors demonstrate this hypothesis through numerous experiments. They show that changing the evaluation metric to be smooth and continuous eliminates the emergence effect. They even artificially recreate emergence in vision tasks—which haven’t previously exhibited these effects—by designing an appropriate nonlinear metric.

The authors make clear that they don’t want to imply that LLMs cannot exhibit emergence; they only want to show that previously observed emergence can be explained by a non-linear metric. Our own perception of emergence in LLMs can also be explained by this theory: we judge the abilities of LLMs by highly non-linear metrics, such as “Can it write a grammatically correct sentence?” and “Is it doing the correct arithmetic steps?”

Rethinking Double Descent

A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning, Curth et al. 2023

In introductory ML classes, you learn about the relationship between the size of a model (its number of parameters) and its ability to generalize. With few parameters, the model cannot capture complex patterns. As the number of parameters increases, the representational capacity increases, and generalization improves. However, there comes a point where additional parameters only increase the model’s ability to overfit to the training data, thereby weakening generalization. The result is a “U-shaped” curve when test error is plotted against the number of parameters.

Belkin et al. (2019) challenged this intuition by showing that when the number of parameters exceeds the number of data points in a dataset, generalization begins to improve again. This phenomenon has been dubbed “double descent,” since after the first U-shaped curve the test error descends a second time. What’s going on here?

Belkin et al. suggested that when the number of parameters surpasses the number of data points, we enter a new so-called interpolation regime, where models learn effective internal representations that interpolate between training examples. This would explain why generalization improves in the second descent.

Curth et al. challenge this notion in their paper “A U-turn on Double Descent,” proposing that there are really two different axes along which the “number of parameters” can grow. The double descent effect appears when these two dimensions are folded into a single parameter-count axis; a figure in the paper shows the curve unfolding once the dimensions are separated.

As an example, the authors demonstrate this hypothesis using decision trees. The first dimension is the number of leaves in a tree: increasing the number of leaves gives the classic U-shaped test-error curve. But once the number of leaves equals the number of data points, how do you keep adding parameters? One way is to instead increase the number of trees, giving us a random forest. Here we enter the second descent, where adding trees continues to decrease test error. Once we transition to adding trees, we are adding parameters along a different dimension than when we were just adding leaves (complexity dimension 2 in the paper’s figure). Decision trees are a case where the two dimensions are nicely separable, but the authors show something similar for linear regression (a bit too technical for a blog post; check it out in the paper!).
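Here is a rough sketch of that style of experiment in scikit-learn (not the paper’s exact setup; the dataset, model settings and split are made up): first grow a single tree by allowing more leaves, then, once a tree could interpolate the training set, keep adding parameters by adding trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Complexity dimension 1: grow a single tree by allowing more leaves
# (100 leaves is enough to fit the 100 training points exactly).
for leaves in [2, 5, 10, 25, 50, 100]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
    print(f"leaves={leaves:4d}  test MSE={mean_squared_error(y_te, tree.predict(X_te)):8.1f}")

# Complexity dimension 2: once a single tree interpolates the training set,
# keep adding parameters by adding *more trees* (a random forest).
for trees in [1, 5, 25, 100]:
    forest = RandomForestRegressor(n_estimators=trees, random_state=0).fit(X_tr, y_tr)
    print(f"trees={trees:4d}   test MSE={mean_squared_error(y_te, forest.predict(X_te)):8.1f}")
```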

The authors also define a generalized measure of the number of effective parameters for a class of methods called smoothers [Hastie & Tibshirani, 1986], a measure which combines the different parameter dimensions in a principled way. Effective parameters redefine the x-axis of generalization curves, and counting parameters this way re-establishes the classic U (or L) shapes of those curves.
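For intuition, the classical version of this idea for a linear smoother ŷ = Sy (Hastie & Tibshirani) counts effective parameters as the trace of the smoother matrix S; Curth et al. build a generalization of this. A small sketch for ridge regression, with made-up data, shows how the effective count is controlled by the penalty rather than by the raw number of coefficients:

```python
import numpy as np

# Classical effective parameters for a linear smoother y_hat = S @ y:
# the trace of the smoother matrix S (Hastie & Tibshirani, 1986).
# Ridge regression is a simple smoother; the data here are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # 100 points, 20 raw coefficients

def ridge_effective_params(X, lam):
    # S = X (X^T X + lam * I)^{-1} X^T, effective parameters = trace(S)
    n_features = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T)
    return np.trace(S)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:6.1f}  effective parameters={ridge_effective_params(X, lam):6.2f}")
```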

What about deep learning? This work doesn’t explicitly address it, but that’s an obvious next step. The lottery ticket hypothesis [Frankle & Carbin, 2018] already suggests that large networks contain smaller “subnetworks,” in which case the larger network may aggregate these subnetworks just as random forests aggregate decision trees. My guess is that an analogous recharacterization of double descent for deep learning is not far away.

When Does Deep RL Actually Work?

Bridging RL Theory and Practice with the Effective Horizon, Laidlaw et al.

Anyone who’s worked in deep RL understands how tricky it can be to get these algorithms to work. The same algorithm will perform great in one environment but utterly fail in another, and there doesn’t seem to be a clear characterization of the types of environments where deep RL will work. This is partly due to a wide gap between theory and practice: theoretical bounds on performance are often many orders of magnitude looser than observed performance. Can we do better?

Laidlaw et al. seek to characterize the environments in which deep RL algorithms will perform well. However, to figure out how good an algorithm is, you need to compare the policy it learns to the optimal policy, and the optimal policy is unknown for many standard benchmarks.

To address this problem, this paper introduces BRIDGE, a huge benchmark of 155 deterministic Markov decision processes (MDPs), including Atari games and gridworlds, along with their tabular representations. For each MDP, Laidlaw et al. find the optimal policy using its tabular representation — an enormous engineering effort, as some of these MDPs have nearly 100 million states. With these optimal policies in hand, you can now figure out in which MDPs deep RL algorithms learn the optimal policy and in which they don’t. Can we draw any conclusions about these different types of MDP?

Laidlaw et al. notice that when deep RL does well in an environment, a naive baseline they call GORP (Greedy Over Random Policy) also does well. GORP works by choosing the next action greedily with respect to the uniform random policy’s Q-function. In other words, GORP chooses the best action at the current state assuming it will play randomly thereafter (the actual GORP in the paper is a bit more general).
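Here is a minimal Monte Carlo sketch of that one-step idea (the paper’s GORP is more general, e.g. it can look ahead more than one step). The environment interface used here (reset_to(state) and step(action) returning (next_state, reward, done)) is a hypothetical stand-in for a small deterministic MDP:

```python
import numpy as np

def gorp_action(env, state, n_actions, n_rollouts=100, horizon=50):
    """One-step GORP sketch: pick the action with the highest Monte Carlo
    estimate of the uniform random policy's Q-value, i.e. act once, then
    act randomly. Simplified relative to the paper, and the env interface
    (reset_to / step) is hypothetical."""
    returns = np.zeros(n_actions)
    for a in range(n_actions):
        total = 0.0
        for _ in range(n_rollouts):
            env.reset_to(state)
            _, reward, done = env.step(a)          # candidate first action
            ret = reward
            for _ in range(horizon):               # behave randomly afterwards
                if done:
                    break
                _, reward, done = env.step(np.random.randint(n_actions))
                ret += reward
            total += ret
        returns[a] = total / n_rollouts
    return int(np.argmax(returns))                 # greedy over random-policy Q
```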

It turns out that in MDPs where GORP finds the optimal policy, it is likely that the deep RL algorithm PPO does too. Conversely, if GORP does not find the optimal policy, it’s unlikely that PPO will either. So, deep RL works in environments that are easily solvable by a myopic algorithm, which suggests that these environments are pretty easy.

This observation inspired Laidlaw et al. to develop a measure of environmental complexity called the effective horizon. Intuitively, the effective horizon is how far an agent needs to look ahead to determine the optimal action, given that the policy after that horizon is uniformly random. The effective horizon gives much tighter bounds for deep RL algorithms and strongly correlates with the performance of PPO.

Characterizing environments where current deep RL approaches perform well is an important step toward using deep RL algorithms in the real world. This lets us understand which environments make good applications right now and helps guide research on more capable algorithms in complex environments (with large effective horizons).

Conclusion

Among the over 3500 papers accepted at NeurIPS 2023, these three works stood out to me for their new perspectives on LLMs, statistical learning theory and deep RL. They challenged existing narratives through good science—an approach which I think will become more common, as researchers may not have the compute to train cutting-edge models.


Revan MacQueen is an MSc. graduate from the University of Alberta. His research focuses on multi-agent machine learning and game theory, which is summarized in a one-minute video here.
