Defeating the entire alpaca family: Meta AI's new self-alignment method requires very little manually labeled data

Original source: Qubit

Running short on manually labeled data?

Meta's new method builds a high-quality instruction-following language model with only a small amount of seed data.

In other words, where fine-tuning large language models normally requires massive amounts of human-labeled instruction data, the model can now infer instructions automatically from unlabeled text in a web corpus.

It then trains on the instruction data it generated itself, truly producing its own goods and consuming them too.

And the model trained this way outperforms the open-source Alpaca and its family of derivative models on the Alpaca benchmark.

LeCun tweeted that the study is a sensation in the area of model self-alignment:

A netizen summed it up in one sentence:

The alpaca started training itself.

And summed up in two sentences:

Originally you needed an instruction → response dataset (which requires manual labeling); now you only need to train a "reverse model" that maps response → instruction. Any text can then be freely converted into an instruction dataset.
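To make the "reverse model" idea concrete, here is a minimal Python sketch of the pair flipping that underlies it; the data and variable names are made up for illustration:

```python
# Minimal sketch of the core trick: flip (instruction, response) pairs so a
# model can be trained in the response -> instruction direction.
# The example data is illustrative, not from the paper.

seed_pairs = [
    {"instruction": "Summarize the following article.", "response": "The article argues that..."},
    {"instruction": "Write a haiku about the sea.", "response": "Waves fold into foam..."},
]

# Forward direction: instruction -> response (the usual fine-tuning setup).
forward_examples = [(p["instruction"], p["response"]) for p in seed_pairs]

# Backward direction: response -> instruction. A model fine-tuned this way can
# read any web text and guess an instruction that the text would answer.
backward_examples = [(p["response"], p["instruction"]) for p in seed_pairs]

for source, target in backward_examples:
    print(f"INPUT:  {source}\nTARGET: {target}\n")
```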

Another netizen posed a probing question:

Am I the only one who thinks this looks like a path to superintelligence? If you can get LLMs that keep getting smarter without additional high-quality external data, then this is a self-improving closed system. Perhaps all that's needed is a reinforcement learning system to provide the signal, and the LLM's own iterations can do the rest.

Alpaca: I used data to train a whale

This scalable new method is called instruction backtranslation, and Meta named the model trained with it Humpback (the humpback whale).

(The researchers said the name was chosen because of its connection to camels' humped backs, in keeping with the alpaca family, while the whale's larger size corresponds to the larger scale of the model.)

Training a Humpback boils down to this: start from a small amount of labeled data, use the language model to generate candidate instructions for unlabeled text, forming candidate training data. Then use the model itself to assess data quality and select only the high-quality pairs for retraining. Repeating this process improves the model further.
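Spelled out as code, the loop looks roughly like the sketch below. The helpers fine_tune, generate_instruction, and score_quality are trivial stubs standing in for the real model operations; the names are ours, not Meta's, and only the control flow matters:

```python
import random

def fine_tune(base_model, pairs):
    # Stub: "training" here just records the data it was given.
    return {"base": base_model, "data": list(pairs)}

def generate_instruction(backward_model, text):
    # Stub for the backward model inferring an instruction from raw text.
    return f"Write a passage that begins: {text[:30]}..."

def score_quality(model, instruction, output):
    # Stub for the model rating an (instruction, output) pair from 1 to 5.
    return random.randint(1, 5)

def instruction_backtranslation(base_model, seed_data, unlabeled_texts,
                                iterations=2, min_score=4):
    # Backward model: learns response -> instruction from flipped seed pairs.
    backward = fine_tune(base_model, [(y, x) for (x, y) in seed_data])
    # Initial forward model, fine-tuned on seed data only; it also judges quality.
    model = fine_tune(base_model, seed_data)
    for _ in range(iterations):
        # Self-augment: infer a candidate instruction for each unlabeled text.
        candidates = [(generate_instruction(backward, y), y) for y in unlabeled_texts]
        # Self-curate: keep only the pairs the current model rates highly.
        curated = [(x, y) for (x, y) in candidates
                   if score_quality(model, x, y) >= min_score]
        # Retrain on seed data plus the curated augmented data.
        model = fine_tune(base_model, seed_data + curated)
    return model

final = instruction_backtranslation(
    "llama-stub",
    seed_data=[("Summarize this.", "A short summary.")],
    unlabeled_texts=["Some deduplicated web text."],
)
print(len(final["data"]), "pairs in the final training round")
```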

The "materials" that need to be prepared are:

  • A base model: LLaMa.
  • Seed data: 3,200 examples from the Open Assistant dataset, each consisting of an instruction and its corresponding output.
  • Unlabeled data: 502K texts from the ClueWeb corpus, deduplicated, filtered, and with potentially low-quality paragraphs removed.

With the labeled examples and the corpus in place, the next step is the self-augmentation stage.

The researchers fine-tuned the base model LLaMa on the seed data to obtain an instruction prediction model, then used this model to infer a candidate instruction for each piece of unlabeled text. Each candidate instruction is combined with its text into an instruction-output pair, forming the candidate augmented training data: Augmented Data A.
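As a rough illustration of this inference step with Hugging Face transformers, where "my-backward-llama" is a hypothetical checkpoint fine-tuned in the response → instruction direction and the prompt format is our own guess, not the paper's:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical backward model: a LLaMa fine-tuned on response -> instruction pairs.
tok = AutoTokenizer.from_pretrained("my-backward-llama")
model = AutoModelForCausalLM.from_pretrained("my-backward-llama")

def infer_instruction(text: str) -> str:
    # Condition on the output text and ask for the instruction it would answer.
    prompt = f"{text}\n\nThe instruction this text answers is:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True).strip()
```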

However, A's data cannot be used for training directly: the quality of the unlabeled text itself is uneven, and the generated candidate instructions are noisy too.

Hence the key self-curation step: use the model itself to predict data quality and select only high-quality samples for training.

Specifically, the researchers scored the candidate data with an instruction model fine-tuned only on the seed data. Scoring is on a 5-point scale, and only high-scoring pairs are kept as candidate data for the next round.
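A sketch of what such a scoring filter could look like. The rating prompt below paraphrases the idea and is not the paper's exact wording; `chat` stands in for any call to the seed-tuned model:

```python
RATING_PROMPT = """Below is an instruction and a candidate answer.
Rate how well the answer follows the instruction on a scale from 1 to 5,
where 5 means a high-quality, correct, and complete answer.
Respond with the number only.

Instruction: {instruction}
Answer: {output}
Score:"""

def curate(pairs, chat, min_score=4):
    """Keep only (instruction, output) pairs scored at min_score or above."""
    kept = []
    for instruction, output in pairs:
        reply = chat(RATING_PROMPT.format(instruction=instruction, output=output))
        try:
            score = int(reply.strip()[0])
        except (ValueError, IndexError):
            continue  # unparseable rating: drop the pair
        if score >= min_score:
            kept.append((instruction, output))
    return kept
```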

To keep improving the model's instruction predictions, the researchers trained it iteratively on the candidate data, and across iterations the data quality gets better and better.

In addition, when combining the seed data and the augmented data to fine-tune the model, they used different system prompts to mark the two data sources (see the sketch after this list):

  • Seed data uses the prompt "Answer in the style of an AI Assistant."
  • Augmented data uses the prompt "Answer with knowledge from web search."
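Building the joint fine-tuning set could then look like the following sketch; the example format is illustrative, not Meta's:

```python
SEED_PROMPT = "Answer in the style of an AI Assistant."
AUG_PROMPT = "Answer with knowledge from web search."

def build_training_set(seed_pairs, augmented_pairs):
    # Tag each example with the system prompt matching its data source.
    examples = []
    for system, pairs in ((SEED_PROMPT, seed_pairs), (AUG_PROMPT, augmented_pairs)):
        for instruction, output in pairs:
            examples.append({
                "system": system,
                "instruction": instruction,
                "output": output,
            })
    return examples
```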

After two iterations, the final model is fresh out of the oven.

Merging the two kinds of training data: 1 + 1 > 2

Let's take a look at the results of the researchers' analysis:

Figure: Instruction diversity of the seed data and the augmented data. The inner circle shows common root verbs; the outer circle shows the common nouns that pair with them.

The figure above shows instruction-diversity statistics for a sample of the seed data (8%) and of the augmented data (13%).

Intuitively, the augmented data is more diverse in the long tail, and it complements the existing human-labeled seed data, filling in instruction types that do not appear in the seed set.
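The verb-noun statistics behind such a diversity plot can be reproduced with ordinary dependency parsing: take each instruction's root verb and its direct object. A small sketch using spaCy (it assumes the en_core_web_sm model is installed; the sample instructions are made up):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def verb_noun(instruction):
    # Return (root verb lemma, direct-object lemma) for one instruction, if any.
    doc = nlp(instruction)
    for tok in doc:
        if tok.dep_ == "ROOT" and tok.pos_ == "VERB":
            objs = [c.lemma_ for c in tok.children if c.dep_ == "dobj"]
            return (tok.lemma_, objs[0] if objs else None)
    return None

instructions = ["Write a poem about autumn.", "Summarize the article below."]
counts = Counter(p for i in instructions if (p := verb_noun(i)) is not None)
print(counts.most_common(10))  # most frequent verb-noun pairs
```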

Second, the researchers compared three variants of the augmented dataset: all of the augmented data (no self-curation) and two self-curated subsets filtered at increasingly strict quality-score thresholds, which contain less data but of higher quality.

Experiments show that although the curated dataset becomes smaller, model performance improves along with the improved quality of the training data.

Figure: Evaluating self-augmented data of different sizes and quality with self-curation. The y-axis is the win rate against text-davinci-003 when fine-tuning LLaMa 7B on data of the given size and quality.

(text-davinci-003 is a GPT-3-based instruction-following model, fine-tuned with reinforcement learning on human-written instruction data, outputs, model responses, and human preferences.)
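For reference, the win-rate metric itself is simple; one common convention counts a tie as half a win. A sketch, with `judge` standing in for whatever preference model or human rater supplies the verdicts:

```python
def win_rate(prompts, model_a, model_b, judge):
    # Fraction of prompts where model_a's response is preferred over model_b's,
    # counting ties as half a win (a common convention, assumed here).
    wins = ties = 0
    for prompt in prompts:
        verdict = judge(prompt, model_a(prompt), model_b(prompt))  # "a", "b", or "tie"
        if verdict == "a":
            wins += 1
        elif verdict == "tie":
            ties += 1
    return (wins + 0.5 * ties) / len(prompts)
```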

Finally, the results on the Alpaca leaderboard: Humpback significantly outperforms all other methods that do not rely on distilled data, and it narrows the gap to the proprietary models.

  • Non-distilled: models trained without relying on any external model as any form of supervision.
  • Distilled: models that introduce a more powerful external model during training, for example by using data distilled from it.
  • Proprietary: models trained with proprietary data and techniques.

Figure: Win rates against text-davinci-003.

Compared with the open-source models LIMA 65B, Guanaco 65B, and Falcon-Instruct 40B, and with the proprietary models davinci-003 and Claude, Humpback's outputs also align better with human preferences.

Additionally, the researchers noted limitations of the method:

Since the training texts come from a web corpus, the fine-tuned model may amplify biases present in web data. The fine-tuned model does detect bias more accurately than the base model, but that does not mean the problem is fully solved.

Paper: Self-Alignment with Instruction Backtranslation (arXiv:2308.06259)
