From writing marketing copy and emails to solving homework 🫣, large language models (LLMs) like ChatGPT seem able to do it all. It's no wonder they're being integrated into workflows across multiple settings.
As these models become more pervasive, we've seen people interact with them in a consistent pattern. And a new method of training these models is emerging, based on that pattern of use: instruction tuning.
Tune It Like You Use It
When people use LLMs, what typically happens is:
A person supplies instructions (e.g. "Write an email to my client")
They include some additional information (e.g. new product available, features, pricing)
The LLM follows the initial instructions and considers the context to generate its response (see the sketch below)
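Concretely, a prompt assembled from those pieces might look something like the following Python sketch (the product name and details here are made up purely for illustration):

```python
# A hypothetical prompt a user might send to an LLM: an instruction
# followed by the extra context the model should draw on.
instruction = "Write an email to my client announcing our new product."

context = """
Product: SnapSync file-sharing app (made-up example details)
Key features: end-to-end encryption, 1 TB storage, team folders
Pricing: $12/user/month, 20% discount for annual billing
"""

# The instruction and the context are combined into a single prompt;
# the LLM reads both and generates the email.
prompt = f"{instruction}\n\n{context.strip()}\n\nEmail:"
print(prompt)
```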
If this is the way people are using LLMs, then it makes sense to train them in a similar way! The paper Finetuned Language Models are Zero-Shot Learners (Wei et al.) introduces instruction tuning, a training method that mirrors this real-world usage pattern and teaches LLMs to better understand and follow instructions. Surprisingly, this method also improves zero-shot performance on multiple downstream tasks like natural language inference and translation, sometimes outperforming models specifically trained for the task.
So what is instruction tuning? Before we get to that, let’s review how most LLMs are trained for new tasks. Normally we start with a pretrained model and train it further on data targeted at a specific task. For example, if we were training a model to answer reading comprehension questions, we would start with a pretrained model and further train (finetune) it on a dataset of passages and associated questions so that it learns to answer the questions correctly. Often we prepend the passage with a prompt like “Answer the following question about this passage.” We then test this model on a set of held-out passages and questions and report its accuracy on that held-out data. The final result is a model specifically trained for reading comprehension question answering.
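As a rough sketch of that setup (with toy data and an assumed example format), the finetuning data might be assembled like this, with every example drawn from the same task and wrapped in the same prompt:

```python
# A minimal sketch of standard task-specific finetuning data: every example
# comes from ONE task (reading comprehension), with the same prompt
# prepended to each passage. (Toy data, assumed format.)
examples = [
    {
        "passage": "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "question": "When was the Eiffel Tower completed?",
        "answer": "1889",
    },
    # ... thousands more reading comprehension examples ...
]

def to_training_pair(example):
    """Turn one example into an (input, target) pair for finetuning."""
    prompt = (
        "Answer the following question about this passage.\n\n"
        f"Passage: {example['passage']}\n"
        f"Question: {example['question']}\n"
        "Answer:"
    )
    return prompt, example["answer"]

train_pairs = [to_training_pair(ex) for ex in examples]
# A pretrained LLM would then be finetuned on train_pairs and evaluated
# on held-out passages and questions from this same task.
print(train_pairs[0][0])
```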
Instruction tuning changes the way we finetune a model. Instead of finetuning on a single task (e.g. reading comprehension question answering), we train on multiple datasets with a similar format but different tasks. Each dataset has a specific prompt or instruction (e.g. answer the following question, translate from French to English) followed by some passage, and the model is expected to produce a response that takes into account both the instruction and the passage. The model is trained on many such datasets, but crucially the training data contains neither the general skill (e.g. question answering, translation) nor the specific dataset used at test time. Testing is therefore zero-shot, but the model has seen the general format of instruction tasks (i.e., read an instruction and some context, then generate a response).
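Here is a minimal sketch of that mixture (the templates and examples are simplified illustrations, not the paper’s exact ones): several instruction-formatted tasks are mixed together for training, while the task cluster used for zero-shot evaluation is held out entirely.

```python
# A sketch of building an instruction-tuning mixture: several tasks, each
# with its own natural-language instruction template. The task cluster we
# want to evaluate zero-shot (here, NLI) is held out of training entirely.
# (Templates and data are simplified illustrations.)
templates = {
    "summarization": "Summarize the following article:\n{text}",
    "translation_fr_en": "Translate this sentence from French to English:\n{text}",
    "sentiment": "Is the sentiment of this review positive or negative?\n{text}",
    "nli": "Does the premise entail the hypothesis?\n{text}",  # held out
}

raw_data = {
    "summarization": [{"text": "Long article ...", "target": "Short summary."}],
    "translation_fr_en": [{"text": "Bonjour le monde.", "target": "Hello world."}],
    "sentiment": [{"text": "I loved this film!", "target": "positive"}],
    "nli": [{"text": "Premise ... Hypothesis ...", "target": "entailment"}],
}

held_out_cluster = {"nli"}  # evaluated zero-shot, never seen during training

train_mixture = [
    (templates[task].format(text=ex["text"]), ex["target"])
    for task, examples in raw_data.items()
    if task not in held_out_cluster
    for ex in examples
]
# One model is finetuned on this mixed-task data, then tested on NLI prompts
# it has never seen, relying only on the shared instruction format.
print(train_mixture[0][0])
```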
Zero-Shot Testing & Generalization in Instruction-Tuned Models
How well does instruction tuning work? The authors find that, across a variety of tasks, instruction tuning improves zero-shot performance by a large margin. They even find that this zero-shot performance can beat few-shot performance, where the model is shown a few examples like those in the test set as part of its prompt (see the figure below). These are impressive results that tell a compelling story.
Source: Finetuned Language Models are Zero-Shot Learners (https://openreview.net/forum?id=gEZrGCozdqR)
But we should always be critical when considering the performance of models trained in a new way. What exactly is the change that led to this improvement? FLAN (the instruction-tuned model introduced in the paper) starts from a pretrained LaMDA model, which differs from GPT-3 in both architecture and training data. Is that the source of the improvement? And are the instructions really special? Could some other on-topic text help just as much as a natural-language instruction describing the task?
To handle the first concern: the authors show that the base LaMDA-PT model performs worse than GPT-3 in the zero-shot setting. So, if anything, starting with the LaMDA model might put them at a slight disadvantage. It doesn’t appear that the architecture or the pretraining data are providing this edge.
Onto the second concern: are the instructions really that important? Maybe the fact that these models are trained on a bunch of additional datasets is what gives FLAN its edge.
To test this, the authors experiment with FLAN trained on the additional data but with no instructions, and find that its performance is about 20 points lower than that of an instruction-tuned model (see the figure below). But what if we use text that is related to the task but different from an instruction? To test this, the authors use the name of the dataset in place of the instruction. This gives the model some information about what it should do, but no explicit directions. Again, they find that this model’s performance falls below that of an instruction-tuned model. So the instructions do seem to be adding something!
Source: Finetuned Language Models are Zero-Shot Learners (https://openreview.net/forum?id=gEZrGCozdqR)
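To make the comparison concrete, here is a hypothetical sketch of the three training formats being compared; the wording is illustrative, not the paper’s exact templates:

```python
# A sketch of the three ablation conditions (illustrative wording, not the
# paper's exact templates) applied to the same underlying example.
passage = "La vie est belle."
target = "Life is beautiful."

variants = {
    # Full natural-language instruction (standard instruction tuning)
    "instruction": f"Translate this sentence from French to English:\n{passage}",
    # Dataset name only: hints at the task, but gives no explicit directions
    "dataset_name": f"[French-English translation dataset]\n{passage}",
    # No instruction at all: just the input text
    "no_instruction": passage,
}

for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n-> {target}\n")
```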
Because the term instruction tuning sounds a bit like prompt tuning, there is sometimes confusion about what makes the two different. One of the big differences is that prompt tuning (usually) does not update the parameters of the LLM; instead, it tunes the parameters of a prompt to find the best way to cue an LLM to do a specific task. This leads to another way in which prompt-tuned models differ: prompt tuning produces model-prompt pairs that are specific to a task, whereas instruction-tuned models are ready to be applied to any new task. So, if you want a model that can do many things, an instruction-tuned model is likely the way to go. If you only want your model to do one specific task, you might consider prompt tuning.
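If it helps, here is a minimal PyTorch sketch of which parameters each approach updates, using a stand-in model and illustrative hyperparameters rather than any real FLAN or prompt-tuning code:

```python
import torch

# A conceptual sketch (not the paper's code) of the key difference:
# prompt tuning freezes the LLM and optimizes a small set of "soft prompt"
# embeddings for ONE task; instruction tuning updates the LLM's own weights
# across MANY tasks.
d_model, prompt_len = 512, 20
llm = torch.nn.Transformer(d_model=d_model)  # stand-in for a real LLM

# --- Prompt tuning: the model is frozen, only the soft prompt is trainable ---
for p in llm.parameters():
    p.requires_grad = False
soft_prompt = torch.nn.Parameter(torch.randn(prompt_len, d_model))
prompt_optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)  # illustrative lr

# --- Instruction tuning: all of the model's weights are trainable ---
for p in llm.parameters():
    p.requires_grad = True
instruction_optimizer = torch.optim.Adam(llm.parameters(), lr=1e-5)  # illustrative lr
```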
Conclusion
So, that’s instruction tuning in a nutshell! It’s a new way to train LLMs that makes them more useful for a variety of applications. If you want to know more, check out the paper “Finetuned Language Models are Zero-Shot Learners” by Wei et al. And a few instruction-tuned models are available for download; you can find them linked from the instruction tuning GitHub page.
This article was written by Amii Fellow and Canada CIFAR AI Chair Alona Fyshe. Alona is also an associate professor at the University of Alberta with a joint appointment in the departments of computing science and psychology. She combines her interests in computational linguistics, machine learning and neuroscience to study the way the human brain processes language.