GPT

Key point: after pre-training a Transformer decoder model, downstream tasks are handled with very little change to the model architecture.

Video link: GPT, GPT-2, GPT-3 论文精读【论文精读】 (paper walkthrough series)


title: Improving Language Understanding by Generative Pre-Training

Abstract

GPT: Generative Pre-Training, a general-purpose pre-trained model.

The overall model structure is shown in Figure 1 of the paper: the Transformer architecture and training objectives on the left, and the input transformations for fine-tuning on different tasks on the right.

First train a pre-trained model, then fine-tune it. The pre-training, however, uses unlabeled data.

Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for
discriminatively trained models to perform adequately. We demonstrate that large
gains on these tasks can be realized by generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task.

1 Introduction

To make use of unlabeled text data, two problems need to be solved.

First: it is not clear which type of optimization objective to use.

Second: there is no good way to transfer the learned text representations to downstream tasks.

Leveraging more than word-level information from unlabeled text, however, is challenging for two
main reasons. First, it is unclear what type of optimization objectives are most effective at learning
text representations that are useful for transfer. Recent research has looked at various objectives
such as language modeling [44], machine translation [38], and discourse coherence [22], with each
method outperforming the others on different tasks. Second, there is no consensus on the most
effective way to transfer these learned representations to the target task. Existing techniques involve
a combination of making task-specific changes to the model architecture [43, 44], using intricate
learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made
it difficult to develop effective semi-supervised learning approaches for language processing.

Semi-supervised learning: there is some labeled data plus a large amount of unlabeled data, and the question is how to bring the unlabeled data into training.

This paper: a semi-supervised method that combines an unsupervised pre-trained model with supervised fine-tuning.

The goal: learn a universal representation that, with a little adaptation, transfers to a wide range of downstream tasks. (Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks.)

In this paper, we explore a semi-supervised approach for language understanding tasks using a
combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal
representation that transfers with little adaptation to a wide range of tasks. We assume access to
a large corpus of unlabeled text and several datasets with manually annotated training examples
(target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled
corpus. We employ a two-stage training procedure. First, we use a language modeling objective on
the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt
these parameters to a target task using the corresponding supervised objective.

The paper uses the Transformer architecture. It was published about a year after the Transformer came out, and at the time choosing between a Transformer and an RNN was not an obvious decision.

The authors' reasoning: the features the Transformer learns transfer more robustly, possibly because the Transformer has a more structured memory and can handle longer spans of text, allowing it to extract better sentence-level and paragraph-level semantic information.

For transfer, the approach is to use task-specific input representations.

For our model architecture, we use the Transformer [62], which has been shown to perform strongly on
various tasks such as machine translation [62], document generation [34], and syntactic parsing [29].
This model choice provides us with a more structured memory for handling long-term dependencies in
text, compared to alternatives like recurrent networks, resulting in robust transfer performance across
diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style
approaches [52], which process structured text input as a single contiguous sequence of tokens. As
we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal
changes to the architecture of the pre-trained model.

6 Conclusion

We introduced a framework for achieving strong natural language understanding with a single
task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training
on a diverse corpus with long stretches of contiguous text our model acquires significant world
knowledge and ability to process long-range dependencies which are then successfully transferred to
solving discriminative tasks such as question answering, semantic similarity assessment, entailment
determination, and text classification, improving the state of the art on 9 of the 12 datasets we
study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long
been an important goal of Machine Learning research. Our work suggests that achieving significant
performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets
(text with long range dependencies) work best with this approach. We hope that this will help enable
new research into unsupervised learning, for both natural language understanding and other domains,
further improving our understanding of how and when unsupervised learning works.

2 Related Work

Semi-supervised learning for NLP

Unsupervised pre-training

Auxiliary training objectives

3 Framework

3.1 Unsupervised pre-training

A standard language-modeling objective is used: maximize the following likelihood.

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

The language model predicts the probability of the $i$-th token, written $u_i$, from the $k$ consecutive tokens before it, where $k$ is the context window.

GPT uses the Transformer decoder.

Transformer encoder: can see the entire input text.

Transformer decoder: because of the mask, it can only see the current token and the tokens before it.

Since a standard language model is used here, predicting the probability of the $i$-th token must not look at the tokens after it, hence the Transformer decoder.

We use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the
input context tokens followed by position-wise feedforward layers to produce an output distribution
over target tokens:

$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{\top})$$
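
To make the objective concrete, here is a minimal PyTorch sketch (not the authors' code) of a decoder-only language model: learned token and position embeddings, causally masked self-attention, and the next-token cross-entropy that corresponds to maximizing $L_1(\mathcal{U})$. The layer sizes follow the paper; the class and variable names, and the exact vocabulary size, are illustrative.

```python
# Minimal decoder-only LM sketch: learned token/position embeddings, a causal
# mask so position i attends only to positions <= i, and tied output weights.
import torch
import torch.nn as nn

vocab_size, n_ctx, d_model, n_layer, n_head = 40000, 512, 768, 12, 12  # per the paper

class GPTDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # W_e
        self.pos_emb = nn.Embedding(n_ctx, d_model)        # W_p (learned, not sinusoidal)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=3072,
            activation="gelu", batch_first=True)
        # Encoder layers + a causal mask act as a Transformer decoder without
        # cross-attention, which is the block GPT uses.
        self.blocks = nn.TransformerEncoder(layer, n_layer)

    def forward(self, idx):                                # idx: (B, T) token ids
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)          # h_0 = U W_e + W_p
        causal = torch.triu(torch.full((T, T), float("-inf"), device=idx.device),
                            diagonal=1)                    # mask out future positions
        h = self.blocks(h, mask=causal)                    # h_l = transformer_block(h_{l-1})
        return h @ self.tok_emb.weight.T                   # logits = h_n W_e^T

model = GPTDecoder()
tokens = torch.randint(0, vocab_size, (2, 64))             # toy batch
logits = model(tokens[:, :-1])                             # predict token i+1 from the prefix
lm_loss = nn.functional.cross_entropy(                     # minimizing this maximizes L_1
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
```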

Differences between GPT and BERT

First: GPT uses the Transformer decoder, which only sees the words before the current word.

BERT does not use a standard language model; it uses a masked language model (a cloze task: given a sentence with some words masked out, predict them). When making a prediction it can see both the words before and the words after the masked position, which corresponds to the Transformer encoder.

Second, and the main difference: the choice of objective.

GPT's task is harder: given the history, predict the future. BERT instead predicts masked-out words in the middle.

3.2 Supervised fine-tuning

Fine-tuning: each example in the labeled dataset is a token sequence $(x^1, \ldots, x^m)$ with a label $y$; given the sequence, predict $y$.

The approach: feed the sequence into the previously trained GPT, take the final transformer block's activation at the last token, $h_l^m$, multiply it by an output layer, and apply a softmax to obtain probabilities.

Then maximize the following likelihood:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$$
$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$$

In addition, keeping the language-modeling objective during fine-tuning also works well:

We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], who also observed improved performance with
such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

The two objectives are added together, with a tunable weight $\lambda$.
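
A minimal sketch of the combined objective $L_3 = L_2 + \lambda L_1$, assuming the final-layer hidden states have already been computed (for example with the GPTDecoder sketch above); `clf_head` plays the role of the new output layer $W_y$, and the function signature is an assumption for illustration.

```python
# Sketch of the fine-tuning loss L3(C) = L2(C) + λ·L1(C). `hidden` holds the
# (B, T, d) final-layer states for a labelled sequence, `clf_head` is the new
# linear output layer W_y, and `tok_emb` is the tied token embedding.
import torch
import torch.nn.functional as F

def combined_loss(hidden, tokens, labels, clf_head, tok_emb, lam=0.5):
    # L2: classify from the last token's state h_l^m through W_y.
    logits_cls = clf_head(hidden[:, -1, :])                    # (B, n_classes)
    L2 = F.cross_entropy(logits_cls, labels)
    # L1: auxiliary language modeling on the same labelled sequence.
    logits_lm = hidden[:, :-1, :] @ tok_emb.weight.T           # next-token logits
    L1 = F.cross_entropy(logits_lm.reshape(-1, logits_lm.size(-1)),
                         tokens[:, 1:].reshape(-1))
    return L2 + lam * L1                                       # L3 = L2 + λ·L1
```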

3.3 Task-specific input transformations

The main model stays the same; only the input is changed.

Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens (⟨s⟩, ⟨e⟩).

[Figure 1 (right): input transformations for classification, entailment, similarity, and multiple choice]

Task 1: classification

The text is wrapped with a start token and an extract token to form a single sequence.


A new linear layer is added here: during fine-tuning, a freshly constructed, randomly initialized linear layer is introduced, and its output size matches the number of labels.
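
An illustrative sketch of the classification transformation, assuming hypothetical special-token ids and a `pretrained_decoder_states` helper that returns final-layer states; the only new parameters are the special-token embeddings and the linear head.

```python
# Classification: wrap the text in start/extract tokens, run the unchanged
# pre-trained decoder, and classify from the extract token's final state with
# a newly initialized linear layer. START/EXTRACT ids, n_classes, and the
# pretrained_decoder_states helper are assumptions.
import torch.nn as nn

START, EXTRACT = 40000, 40001          # new special tokens (randomly initialized embeddings)
n_classes = 2

def build_classification_input(text_ids):
    return [START] + text_ids + [EXTRACT]

clf_head = nn.Linear(768, n_classes)   # new, randomly initialized; learned during fine-tuning

# seq = torch.tensor([build_classification_input(ids)])   # (1, T)
# h = pretrained_decoder_states(seq)                      # (1, T, 768) final-layer states
# logits = clf_head(h[:, -1, :])                          # prediction from the ⟨e⟩ state
```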

Task 2: entailment


Given two pieces of text (a premise and a hypothesis), this is a three-way classification problem: entailment, contradiction, or neutral.

A start token, a delimiter token, and an extract token join them into one sequence, which is then fed to the Transformer.

Textual entailment For entailment tasks, we concatenate the premise p and hypothesis h token
sequences, with a delimiter token ($) in between.
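
The entailment transformation differs only in how the sequence is assembled; a tiny sketch with the same hypothetical token ids:

```python
# Entailment: premise and hypothesis joined by a delimiter, wrapped in
# start/extract tokens, then classified with a 3-way head exactly as above.
START, DELIM, EXTRACT = 40000, 40002, 40001

def build_entailment_input(premise_ids, hypothesis_ids):
    return [START] + premise_ids + [DELIM] + hypothesis_ids + [EXTRACT]
```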

Task 3: similarity

Decide whether two texts are similar: for example, whether a search query and a document are similar, whether two documents are similar, or whether two questions are similar.

Similarity is symmetric: if a is similar to b, then b is similar to a. But our language model is order-sensitive, so two sequences are built, one with text1 first and one with text2 first.

The two sequences are fed into the model separately, their outputs are added element-wise, and the sum goes through a linear layer to produce the binary classification result.


Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared.
To reflect this, we modify the input sequence to contain both possible sentence orderings (with a
delimiter in between) and process each independently to produce two sequence representations $h_l^m$
which are added element-wise before being fed into the linear output layer.
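
A sketch of the similarity transformation under the same assumptions (hypothetical token ids and a `decoder_states` helper): encode both orderings, sum the two extract-token states element-wise, then apply the linear head.

```python
# Similarity: encode both orderings of the two texts, take the extract-token
# state from each, add them element-wise, and apply the linear head.
# decoder_states is a hypothetical helper returning (1, T, 768) final states.
import torch
import torch.nn as nn

START, DELIM, EXTRACT = 40000, 40002, 40001
sim_head = nn.Linear(768, 2)                       # similar / not similar

def similarity_logits(decoder_states, text1_ids, text2_ids):
    seq_a = torch.tensor([[START] + text1_ids + [DELIM] + text2_ids + [EXTRACT]])
    seq_b = torch.tensor([[START] + text2_ids + [DELIM] + text1_ids + [EXTRACT]])
    h_a = decoder_states(seq_a)[:, -1, :]          # extract-token state, ordering 1
    h_b = decoder_states(seq_b)[:, -1, :]          # extract-token state, ordering 2
    return sim_head(h_a + h_b)                     # element-wise sum, then linear layer
```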

Task 4: multiple choice

A question is posed with several candidate answers, and the model must pick the one it considers correct.

Approach: with n candidate answers, construct n sequences, each consisting of the question and one answer. Feed each into the model separately, then apply a linear projection layer with output size 1, producing a confidence score that this answer is the correct one. Each sequence yields such a scalar; a final softmax over them gives the confidence for each of the n answers.


Question Answering and Commonsense Reasoning For these tasks, we are given a context
document z, a question q, and a set of possible answers {a_k}. We concatenate the document context
and question with each possible answer, adding a delimiter token in between to get [z; q; $; a_k]. Each
of these sequences are processed independently with our model and then normalized via a softmax
layer to produce an output distribution over possible answers.
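
A sketch of the multiple-choice transformation under the same assumptions: one sequence per candidate answer, a linear layer mapping each extract-token state to a scalar, and a softmax across the candidates.

```python
# Multiple choice: one [context; question; $; answer_k] sequence per candidate,
# a linear layer mapping each extract-token state to a scalar score, and a
# softmax across the n candidates. Names and ids are illustrative.
import torch
import torch.nn as nn

START, DELIM, EXTRACT = 40000, 40002, 40001
score_head = nn.Linear(768, 1)                     # one confidence score per candidate

def choice_probs(decoder_states, context_ids, question_ids, answer_ids_list):
    scores = []
    for answer_ids in answer_ids_list:
        seq = torch.tensor([[START] + context_ids + question_ids
                            + [DELIM] + answer_ids + [EXTRACT]])
        h = decoder_states(seq)[:, -1, :]          # extract-token state for this candidate
        scores.append(score_head(h).squeeze(-1))   # scalar score
    return torch.softmax(torch.cat(scores), dim=0) # distribution over the n answers
```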

The pre-trained Transformer model stays fixed: when doing downstream tasks, the model architecture is not changed; only the input is modified.

4 Experiments

4.1 Setup

Dataset: BooksCorpus, which contains over 7,000 unpublished books.

Unsupervised pre-training We use the BooksCorpus dataset [71] for training the language model.
It contains over 7,000 unique unpublished books from a variety of genres including Adventure,
Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the
generative model to learn to condition on long-range information.

Model parameters

Model size: a 12-layer Transformer decoder, with a hidden dimension of 768 per layer and 12 attention heads.

Model specifications

Our model largely follows the original transformer work [62]. We trained a
12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12
attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate
was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.
We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
Since layernorm [2] is used extensively throughout the model, a simple weight initialization of
N(0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53]
and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also
employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non bias or
gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We
used learned position embeddings instead of the sinusoidal version proposed in the original work.
We use the ftfy library to clean the raw text in BooksCorpus, standardize some punctuation and
whitespace, and use the spaCy tokenizer.
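
For reference, the pre-training hyperparameters quoted above collected into a single config sketch (the dataclass itself is illustrative, not OpenAI's code):

```python
from dataclasses import dataclass

@dataclass
class GPT1PretrainConfig:
    n_layer: int = 12             # decoder-only transformer blocks
    d_model: int = 768            # hidden size
    n_head: int = 12              # attention heads
    d_ff: int = 3072              # position-wise feed-forward inner size
    n_ctx: int = 512              # tokens per training sequence
    bpe_merges: int = 40000       # byte-pair encoding vocabulary merges
    batch_size: int = 64
    epochs: int = 100
    max_lr: float = 2.5e-4        # Adam; linear warmup over the first 2000 updates,
    warmup_updates: int = 2000    # then cosine annealing to 0
    dropout: float = 0.1          # residual, embedding, and attention dropout
    weight_decay: float = 0.01    # modified L2 on non-bias/gain weights
    init_std: float = 0.02        # weights initialized from N(0, 0.02)
    activation: str = "gelu"
    positions: str = "learned"    # learned position embeddings, not sinusoidal
```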

For comparison, BERT's parameters:

BERT base: also 12 layers with a dimension of 768, but it uses the Transformer encoder (it can see information both before and after the current position; there is no mask).

Fine-tuning details

Fine-tuning details Unless specified, we reuse the hyperparameter settings from unsupervised
pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate
of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient
for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training.
λ was set to 0.5.
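
A small sketch of the fine-tuning schedule described above: warmup over 0.2% of training followed by linear decay, with the hyperparameters from the quote; the helper function is illustrative.

```python
# Fine-tuning schedule sketch: linear warmup over 0.2% of the total steps,
# then linear decay to zero, with max_lr = 6.25e-5 as in the quote above.
def finetune_lr(step, total_steps, max_lr=6.25e-5, warmup_frac=0.002):
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return max_lr * step / warmup_steps                        # linear warmup
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)  # linear decay

lambda_aux = 0.5            # λ, the weight on the auxiliary LM objective
batch_size, epochs = 32, 3  # per the fine-tuning details above
```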

Summary

GPT's key point: after pre-training a Transformer decoder model, downstream tasks require very little change to the model architecture.

Note the difference from BERT: BERT uses the Transformer encoder.

References

https://github.com/mli/paper-reading

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
