Key point: after pre-training a Transformer decoder, GPT handles downstream tasks without substantially changing the model architecture.

Video: GPT,GPT-2,GPT-3 论文精读【论文精读】 (a close reading of the GPT, GPT-2, and GPT-3 papers)


title: Improving Language Understanding by Generative Pre-Training


GPT (Generative Pre-Training): a general-purpose pre-trained model.



Although large unlabeled text corpora are abundant,
labeled data for learning these specific tasks is scarce, making it challenging for
discriminatively trained models to perform adequately. We demonstrate that large
gains on these tasks can be realized by generative pre-training of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task.

1 Introduction




Leveraging more than word-level information from unlabeled text, however, is challenging for two
main reasons. First, it is unclear what type of optimization objectives are most effective at learning
text representations that are useful for transfer. Recent research has looked at various objectives
such as language modeling [44], machine translation [38], and discourse coherence [22], with each
method outperforming the others on different tasks. Second, there is no consensus on the most
effective way to transfer these learned representations to the target task. Existing techniques involve
a combination of making task-specific changes to the model architecture [43, 44], using intricate
learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made
it difficult to develop effective semi-supervised learning approaches for language processing.

Semi-supervised learning: alongside some labeled data there is a large amount of unlabeled data, and the question is how to make use of it.


Goal: learn one general representation that can be transferred to a large number of downstream tasks after only slight adjustment. ("Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks.")

In this paper, we explore a semi-supervised approach for language understanding tasks using a
combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal
representation that transfers with little adaptation to a wide range of tasks. We assume access to
a large corpus of unlabeled text and several datasets with manually annotated training examples
(target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled
corpus. We employ a two-stage training procedure. First, we use a language modeling objective on
the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt
these parameters to a target task using the corresponding supervised objective.

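The model uses the Transformer architecture. The paper came out about a year after the Transformer, and at the time the choice between a Transformer and an RNN was not obvious.

The authors' reasoning: in transfer learning, the features a Transformer learns are more robust. The likely cause is that the Transformer has a more structured memory and can handle longer-range text, which lets it extract better sentence-level and paragraph-level semantics.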


For our model architecture, we use the Transformer [62], which has been shown to perform strongly on
various tasks such as machine translation [62], document generation [34], and syntactic parsing [29].
This model choice provides us with a more structured memory for handling long-term dependencies in
text, compared to alternatives like recurrent networks, resulting in robust transfer performance across
diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style
approaches [52], which process structured text input as a single contiguous sequence of tokens. As
we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal
changes to the architecture of the pre-trained model.

6 Conclusion

We introduced a framework for achieving strong natural language understanding with a single
task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training
on a diverse corpus with long stretches of contiguous text our model acquires significant world
knowledge and ability to process long-range dependencies which are then successfully transferred to
solving discriminative tasks such as question answering, semantic similarity assessment, entailment
determination, and text classification, improving the state of the art on 9 of the 12 datasets we
study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long
been an important goal of Machine Learning research. Our work suggests that achieving significant
performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets
(text with long range dependencies) work best with this approach. We hope that this will help enable
new research into unsupervised learning, for both natural language understanding and other domains,
further improving our understanding of how and when unsupervised learning works.

2 Related Work

Semi-supervised learning for NLP

Unsupervised pre-training

Auxiliary training objectives

3 Framework

3.1 Unsupervised pre-training



The language model predicts the probability of the $i$-th token, written $u_i$, from the $k$ consecutive tokens before it; $k$ is the context window.
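Written out, this is the standard language modeling objective that the paper gives as its Equation 1. Here $\mathcal{U} = \{u_1, \dots, u_n\}$ is the unlabeled token corpus and $\Theta$ denotes the network parameters:

$$
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)
$$

Pre-training maximizes $L_1$, that is, the log-likelihood of each token given its $k$ predecessors.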

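GPT uses the Transformer decoder.

Transformer encoder: each position can see the entire input text.

Transformer decoder: because of the attention mask, each position sees only itself and the positions before it.

GPT uses a standard language model: when predicting the $i$-th token it does not look at any later token, which is why the Transformer decoder is the natural fit.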
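The visibility difference between encoder and decoder can be illustrated with a small sketch. This is not the paper's code: the function names are made up for illustration, and the all-zero attention scores are a toy input chosen so that the effect of the mask is easy to read off.

```python
import math

def causal_mask(n):
    """Decoder-style mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy input (assumption): uniform raw attention scores for 4 positions.
scores = [[0.0] * 4 for _ in range(4)]
decoder_weights = masked_softmax(scores, causal_mask(4))
encoder_weights = masked_softmax(scores, full_mask(4))
```

With the causal mask, the first position can only attend to itself, while the last position spreads its weight over all four positions. With the full mask, every row is uniform over all four positions.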

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is
a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the
input context tokens followed by position-wise feedforward layers to produce an output distribution
over target tokens:
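The forward pass referred to here is the paper's Equation 2. In the paper's notation, $U = (u_{-k}, \dots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix:

$$
\begin{aligned}
h_0 &= U W_e + W_p \\
h_l &= \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) &= \mathrm{softmax}(h_n W_e^{T})
\end{aligned}
$$

Note that the output projection reuses the embedding matrix $W_e$, so the input and output token embeddings are tied.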



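First difference from BERT: GPT uses the Transformer decoder, so it sees only the tokens before the current one.

BERT does not use a standard language model. It uses a masked language model (a cloze task: some words in the middle of a sentence are blanked out and the model has to predict them), so a prediction can use both the words before the blank and the words after it. This corresponds to the Transformer encoder.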
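The difference between the two training setups can be sketched by building the training examples each one would see. This is a simplified illustration with made-up function names: real BERT samples the masked positions at random and sometimes keeps or swaps the token, which is omitted here.

```python
def causal_lm_examples(tokens, k):
    """GPT-style pairs: predict token i from up to k tokens before it."""
    return [(tokens[max(0, i - k):i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, masked_positions, mask_token="[MASK]"):
    """BERT-style cloze: hide chosen positions; both sides stay visible."""
    hidden = set(masked_positions)
    visible = [mask_token if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return visible, targets

sentence = ["the", "cat", "sat", "on", "the", "mat"]
gpt_pairs = causal_lm_examples(sentence, k=3)       # left context only
bert_input, bert_targets = masked_lm_example(sentence, [2])
```

Each GPT pair contains only tokens to the left of the target, whereas the BERT input keeps "on the mat" visible to the right of the blank at position 2.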



3.2 Supervised fine-tuning

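Fine-tuning: each labeled example is a token sequence $(x^1, \dots, x^m)$ with a label $y$; given the sequence, the model predicts $y$.

How: feed the sequence through the pre-trained GPT, take the activation $h_l^m$ that the final transformer block outputs at the last token, multiply it by an output layer, and apply a softmax to obtain the probabilities.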
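A minimal sketch of that classification head, computing softmax($h_l^m W_y$), where $W_y$ is the paper's name for the added output layer. The activation and weight values below are toy numbers chosen for illustration, not anything from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_last, W_y):
    """P(y | x^1..x^m) = softmax(h_l^m W_y).

    h_last: final-block activation at the last token, a list of length d.
    W_y:    output layer, a list of d rows with one column per class.
    """
    num_classes = len(W_y[0])
    logits = [sum(h_last[i] * W_y[i][c] for i in range(len(h_last)))
              for c in range(num_classes)]
    return softmax(logits)

h_last = [1.0, 0.0, -1.0]                    # toy activation, d = 3
W_y = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]   # toy weights, 2 classes
probs = classify(h_last, W_y)                # logits are [2.0, -2.0]
```

The only new parameters this adds on top of the pre-trained model are the entries of `W_y` (plus embeddings for any delimiter tokens), which is what keeps the architecture change small.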



We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. This is in line with prior work [50, 43], who also observed improved performance with
such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):


The two objectives are added together, with an adjustable weight $\lambda$.
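In the paper's notation, with $\mathcal{C}$ the labeled dataset and $L_1$ the language modeling objective from pre-training (now evaluated on the fine-tuning text), the supervised objective and the combined objective are:

$$
\begin{aligned}
L_2(\mathcal{C}) &= \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m) \\
L_3(\mathcal{C}) &= L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
\end{aligned}
$$

Setting $\lambda = 0$ recovers plain supervised fine-tuning; larger values give the auxiliary language modeling term more influence.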

3.3 Task-specific input transformations


Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens ($\langle s \rangle$, $\langle e \rangle$).






Task 2: entailment




Textual entailment For entailment tasks, we concatenate the premise p and hypothesis h token
sequences, with a delimiter token ($) in between.
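As a rough sketch, the entailment transformation amounts to list concatenation. The string spellings `<s>`, `$`, and `<e>` are placeholders for the start, delimiter, and end tokens, and whitespace-split words stand in for real BPE tokens.

```python
START, DELIM, END = "<s>", "$", "<e>"  # placeholder spellings (assumption)

def entailment_input(premise, hypothesis):
    """Build [start; premise; delimiter; hypothesis; end] as one sequence."""
    return [START] + premise + [DELIM] + hypothesis + [END]

seq = entailment_input(["a", "dog", "runs"], ["an", "animal", "moves"])
```

The model then reads `seq` left to right like any other text, and the classifier uses the representation at the final position.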

Task 3: similarity





Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared.
To reflect this, we modify the input sequence to contain both possible sentence orderings (with a
delimiter in between) and process each independently to produce two sequence representations $h_l^m$
which are added element-wise before being fed into the linear output layer.
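The point of encoding both orderings can be seen in a small sketch. `toy_encode` is a hypothetical, deliberately order-sensitive stand-in for the transformer, and the token spellings are placeholders; only the "both orderings, then element-wise sum" logic follows the description above.

```python
START, DELIM, END = "<s>", "$", "<e>"  # placeholder spellings (assumption)

def similarity_inputs(sent_a, sent_b):
    """Return both orderings of the pair, each as its own token sequence."""
    return ([START] + sent_a + [DELIM] + sent_b + [END],
            [START] + sent_b + [DELIM] + sent_a + [END])

def toy_encode(seq):
    """Hypothetical stand-in for the transformer: an order-sensitive vector."""
    return [float(seq.index(DELIM)), float(len(seq))]

def pair_representation(sent_a, sent_b, encode=toy_encode):
    """Encode each ordering independently, then add element-wise."""
    seq_ab, seq_ba = similarity_inputs(sent_a, sent_b)
    return [x + y for x, y in zip(encode(seq_ab), encode(seq_ba))]

a, b = ["a", "cat"], ["one", "small", "feline"]
rep_ab = pair_representation(a, b)
rep_ba = pair_representation(b, a)
```

Although `toy_encode` gives different vectors for the two orderings, the summed representation is the same whichever sentence is passed first, so the downstream linear layer cannot depend on argument order.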

Task 4: multiple choice




Question Answering and Commonsense Reasoning For these tasks, we are given a context
document z, a question q, and a set of possible answers $\{a_k\}$. We concatenate the document context
and question with each possible answer, adding a delimiter token in between to get $[z; q; \$; a_k]$. Each
of these sequences are processed independently with our model and then normalized via a softmax
layer to produce an output distribution over possible answers.
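A sketch of this scheme, under the same placeholder assumptions for the token spellings: one sequence is built per candidate answer, each sequence would be scored independently by the model, and a softmax normalizes the scores across answers. The scores below are made-up numbers standing in for model outputs.

```python
import math

START, DELIM, END = "<s>", "$", "<e>"  # placeholder spellings (assumption)

def multiple_choice_inputs(context, question, answers):
    """Build one sequence [z; q; $; a_k] per candidate answer a_k."""
    return [[START] + context + question + [DELIM] + a + [END] for a in answers]

def answer_distribution(scores):
    """Softmax over the per-answer scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

context = ["it", "rained"]
question = ["what", "fell", "?"]
answers = [["rain"], ["snow"], ["leaves"]]
sequences = multiple_choice_inputs(context, question, answers)
probs = answer_distribution([2.0, 0.5, -1.0])  # hypothetical model scores
```

Because every candidate shares the same context and question prefix, the sequences differ only after the delimiter.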


4 Experiments

4.1 Setup

Dataset: BooksCorpus, which contains over 7,000 unpublished books.

Unsupervised pre-training We use the BooksCorpus dataset [71] for training the language model.
It contains over 7,000 unique unpublished books from a variety of genres including Adventure,
Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the
generative model to learn to condition on long-range information.


Model size: a 12-layer Transformer decoder with 768-dimensional states and 12 attention heads per layer.

Model specifications

Our model largely follows the original transformer work [62]. We trained a
12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12
attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate
was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.
We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
Since layernorm [2] is used extensively throughout the model, a simple weight initialization of
$\mathcal{N}(0, 0.02)$ was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53]
and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also
employed a modified version of L2 regularization proposed in [37], with $w = 0.01$ on all non bias or
gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We
used learned position embeddings instead of the sinusoidal version proposed in the original work.
We use the ftfy library to clean the raw text in BooksCorpus, standardize some punctuation and
whitespace, and use the spaCy tokenizer.
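The learning rate schedule described above (linear warmup over the first 2000 updates to a peak of 2.5e-4, then cosine annealing to 0) can be sketched as follows. The excerpt does not state the total number of updates, so `total_steps` is a parameter the caller must supply, and the value used in the example is purely illustrative.

```python
import math

def pretrain_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup from 0 to max_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

TOTAL_STEPS = 100_000  # illustrative value only (assumption, not from the paper)
lr_start = pretrain_lr(0, TOTAL_STEPS)
lr_mid_warmup = pretrain_lr(1000, TOTAL_STEPS)
lr_peak = pretrain_lr(2000, TOTAL_STEPS)
lr_end = pretrain_lr(TOTAL_STEPS, TOTAL_STEPS)
```

The rate starts at zero, reaches half the peak midway through warmup, hits the peak at update 2000, and decays to zero at the final update.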


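BERT base also has 12 layers and a hidden size of 768, but it is a Transformer encoder (each position can see both earlier and later tokens; there is no causal mask).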


Fine-tuning details Unless specified, we reuse the hyperparameter settings from unsupervised
pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate
of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient
for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training.
$\lambda$ was set to 0.5.


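GPT's key point: after pre-training a Transformer decoder, downstream tasks are handled without substantially changing the model architecture.

Note the difference from BERT: BERT uses the Transformer encoder.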