RoBERTa

RoBERTa makes a few relatively simple modifications to the BERT model:

  1. Train on more data, with larger batch sizes, for longer;
  2. Remove BERT's next sentence prediction (NSP) pretraining task (see the sketch after this list);
  3. Train on longer sequences;
  4. Use a dynamic masking strategy.
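
To make point 2 concrete, here is a minimal sketch, assuming the Hugging Face transformers library: BERT's pretraining model carries both a masked-LM head and an NSP head, while RoBERTa's pretraining model keeps only the masked-LM head.

    from transformers import BertConfig, BertForPreTraining, RobertaConfig, RobertaForMaskedLM

    # Randomly initialized models, used here only to inspect the head structure.
    bert = BertForPreTraining(BertConfig())
    roberta = RobertaForMaskedLM(RobertaConfig())

    print(bert.cls)         # BertPreTrainingHeads: predictions (MLM) + seq_relationship (NSP)
    print(roberta.lm_head)  # RobertaLMHead: masked-LM head only, no NSP head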

In summary, the contributions of this paper are:

  1. We present a set of important BERT design choices and training strategies and introduce alternatives that lead to better downstream task performance;
  2. We use a novel dataset, CCNEWS, and confirm that using more data for pretraining further improves performance on downstream tasks;
  3. Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods. We release our model, pretraining and fine-tuning code implemented in PyTorch.

Dynamic Masking

The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
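
As a rough sketch of the idea, assuming the Hugging Face transformers library: a data collator that samples the masked positions every time a batch is built reproduces dynamic masking, since the same sequence generally receives a different mask in each epoch.

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    # Masked positions are sampled at collation time, not fixed once during preprocessing.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    ids = tokenizer("RoBERTa samples a new mask every time a batch is built.")["input_ids"]
    example = [{"input_ids": ids}]

    # Two calls on the same sequence generally yield different masked positions,
    # because each token is masked independently with probability 0.15 per call.
    print(collator(example)["input_ids"])
    print(collator(example)["input_ids"])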

For full details, see the paper: https://arxiv.org/pdf/1907.11692.pdf

In Hugging Face transformers, the pretrained weights can then be loaded for a downstream classification task:

    from transformers import RobertaForSequenceClassification
    model = RobertaForSequenceClassification.from_pretrained('model/roberta_chinese_clue_large')
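
A fuller, hedged inference sketch with such a checkpoint, assuming a matching tokenizer is saved at the same local path and the checkpoint includes a fine-tuned classification head (the path, example sentence, and label index are illustrative):

    import torch
    from transformers import AutoTokenizer, RobertaForSequenceClassification

    path = 'model/roberta_chinese_clue_large'              # illustrative local path from above
    tokenizer = AutoTokenizer.from_pretrained(path)        # assumes tokenizer files are saved here
    model = RobertaForSequenceClassification.from_pretrained(path)
    model.eval()

    inputs = tokenizer('这部电影真好看', return_tensors='pt')  # "This movie is really good"
    with torch.no_grad():
        logits = model(**inputs).logits                    # shape: (1, num_labels)
    print(logits.argmax(dim=-1).item())                    # index of the predicted class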