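SKEP is a transformer model pre-trained with sentiment knowledge enhancement, which makes it better suited to sentiment classification scenarios. The model is shown in the figure below: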
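SKEP's masking strategy is called Hybrid Sentiment Masking. It divides the tokens of the input sequence into three types: aspect-sentiment pairs, sentiment words, and common tokens. The masking strategy for each type is as follows:
- Aspect-sentiment Pair Masking: SKEP masks at most two aspect-sentiment pairs per input, and the aspect word and the sentiment word of a pair are always masked together. For example, in the figure above, product and fast form an aspect-sentiment pair, so SKEP masks both tokens at once.
- Sentiment Word Masking: some of the sentiment words that have not been masked yet are randomly selected and masked. The total number of masked tokens must not exceed 10% of the input tokens.
- Common Token Masking: if the 10% limit has still not been reached after the sentiment words are masked in the second step, common tokens are randomly selected and masked until the number of masked tokens reaches 10%.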
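The three steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code used by SKEP: the function name `hybrid_sentiment_mask`, its parameters, and the choice to round the 10% budget down are all hypothetical, and the aspect-sentiment pairs and sentiment word positions are assumed to be already known.

```python
import random

MASK = "[MASK]"


def hybrid_sentiment_mask(tokens, pairs, sentiment_idx, ratio=0.10, max_pairs=2, seed=0):
    """Illustrative sketch of the three masking steps (hypothetical helper).

    tokens:        list of input tokens
    pairs:         list of (aspect_index, sentiment_index) tuples
    sentiment_idx: indices of all sentiment words in `tokens`
    """
    rng = random.Random(seed)
    budget = int(len(tokens) * ratio)  # assumption: the 10% budget is rounded down
    masked = set()

    # Step 1: mask at most `max_pairs` pairs; both words of a pair are masked together.
    for aspect_i, sentiment_i in rng.sample(pairs, min(max_pairs, len(pairs))):
        masked.update((aspect_i, sentiment_i))

    # Step 2: mask remaining sentiment words while the budget is not exhausted.
    candidates = [i for i in sentiment_idx if i not in masked]
    rng.shuffle(candidates)
    for i in candidates:
        if len(masked) >= budget:
            break
        masked.add(i)

    # Step 3: fill up to the budget with randomly chosen common tokens.
    common = [i for i in range(len(tokens)) if i not in masked and i not in sentiment_idx]
    rng.shuffle(common)
    for i in common:
        if len(masked) >= budget:
            break
        masked.add(i)

    masked_tokens = [MASK if i in masked else token for i, token in enumerate(tokens)]
    return masked_tokens, sorted(masked)


if __name__ == "__main__":
    demo_tokens = ["t%d" % i for i in range(20)]
    # Position 1 is an aspect word, positions 4 and 9 are sentiment words.
    print(hybrid_sentiment_mask(demo_tokens, pairs=[(1, 4)], sentiment_idx=[4, 9]))
```

Note that in this sketch the pair masking in step 1 is not limited by the 10% budget; the budget only constrains steps 2 and 3, which matches how the list above is worded.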
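SKEP's optimization objective is shown in the figure below. It consists of 3 different objectives:
Lsw denotes the Sentiment Word Prediction objective, which predicts the masked sentiment words. The formula is as follows:
In the formula above, W and b are the parameters of the output layer that produces the logits for each label, and the loss is the negative log-likelihood. Pay attention to the parameter m here: its value is either 0 or 1. Only the masked tokens have m equal to 1, and all other tokens have m equal to 0. In short, the loss is computed only over the masked tokens.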
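For reference, the sentiment word objective described above can be written out as follows. I transcribed this from my reading of the SKEP paper rather than from the figure, so treat the exact notation as an assumption: $\tilde{x}_i$ is the encoder output at position $i$, $y_i$ is the one-hot target, $m_i$ is the 0/1 mask indicator, and $n$ is the sequence length.

```latex
% Assumed notation, transcribed from the SKEP paper; check against the original.
\hat{y}_i = \mathrm{softmax}(\tilde{x}_i W + b)

L_{sw} = -\sum_{i=1}^{n} m_i \, y_i \log \hat{y}_i
```

Because $m_i = 0$ for every unmasked position, those terms drop out of the sum, which is exactly the "only masked tokens contribute" behaviour.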
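Lwp denotes the Word Polarity Prediction objective, that is, a binary classification that predicts whether a sentiment word is positive or negative.
Lap denotes the Aspect-sentiment Pair Prediction objective, which uses the [CLS] token to predict the aspect-sentiment pairs. The pairs are stored in a separate table, and each aspect-sentiment pair has its own ID. The formula is as follows: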
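As a reference, the pair objective and the combined loss can be written as below. Again this is my transcription of the paper, so the symbols are assumptions: $\tilde{x}_1$ is the encoder output of the [CLS] token, $W_{ap}$ and $b_{ap}$ are the parameters of the pair output layer, $y_a$ is the target for pair $a$, and $A$ is the number of target pairs.

```latex
% Assumed notation, transcribed from the SKEP paper; check against the original.
\hat{y}_a = \mathrm{sigmoid}(\tilde{x}_1 W_{ap} + b_{ap})

L_{ap} = -\sum_{a=1}^{A} y_a \log \hat{y}_a

% The three objectives are summed into the overall pre-training loss.
L = L_{sw} + L_{wp} + L_{ap}
```

A sigmoid is used instead of a softmax because a single sentence may contain more than one aspect-sentiment pair, which makes this a multi-label prediction.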
For more details about the SKEP model, see the paper: https://arxiv.org/pdf/2005.05635.pdf

Next, we use the SKEP pre-trained model to extract opinions from input sentences.
import warnings
warnings.filterwarnings('ignore')
import paddlenlp.data
import pandas as pd
from paddlenlp.transformers import SkepForTokenClassification
from paddlenlp.transformers import SkepTokenizer
# Note: DataLoader is imported from PyTorch here. It is used only as a
# generic batch iterator together with a custom collate function.
from torch.utils.data import DataLoader
import paddle
from paddlenlp.data import Pad
from tqdm import tqdm
import paddle.nn as nn
import paddle.optimizer as optim
import glob
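1. Model Training
There are a few things to know about the training process:
- My RTX 2060 graphics card has only 6 GB of VRAM, and the SKEP pre-trained model used here, skep_ernie_1.0_large_ch, is too large to train on that GPU. I also did not use AMP or similar training techniques, so the model is trained directly on the CPU.
- Opinion extraction is essentially a token-level classification task as well. It differs from NER only in the label scheme. Since we only extract aspect terms and opinion words here, we use a BIO label scheme with 5 labels in total: B-Aspect, I-Aspect, B-Opinion, I-Opinion, and O.
- When SkepTokenizer converts input tokens to ids, it skips spaces and does not encode them. The spaces in the raw training data, together with their labels, therefore have to be removed so that the number of tokens and the number of labels do not get out of sync.
- The optimizer is Adam, the learning rate is 2e-5, batch_size is 32, and the model is trained for 15 epochs.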
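To make the label scheme and the space handling from the list above concrete, here is a small sketch. The helpers `to_bio` and `drop_spaces` and the example spans are hypothetical and only for illustration; the real training data already comes with labels.

```python
def to_bio(text, spans):
    """Build one BIO label per character (hypothetical helper).

    spans: list of (start, end, kind) with `end` exclusive and
           kind being 'Aspect' or 'Opinion'.
    """
    labels = ['O'] * len(text)
    for start, end, kind in spans:
        labels[start] = 'B-' + kind
        for position in range(start + 1, end):
            labels[position] = 'I-' + kind
    return labels


def drop_spaces(chars, labels):
    """Remove space characters together with their labels so both stay aligned."""
    kept = [(char, label) for char, label in zip(chars, labels) if char != ' ']
    return [char for char, _ in kept], [label for _, label in kept]


if __name__ == '__main__':
    text = '环境 干净'  # contains a space that the tokenizer would skip
    labels = to_bio(text, [(0, 2, 'Aspect'), (3, 5, 'Opinion')])
    chars, labels = drop_spaces(list(text), labels)
    print(chars)
    print(labels)
```

After `drop_spaces`, the character list and the label list have the same length again, which is the property the training code relies on.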
The complete training code is as follows:
paddle.set_device('cpu')


# Training data processing
def train_process():

    traindata = pd.read_csv('data/train_ext.txt', delimiter='\t').to_numpy().tolist()

    # Map labels to integer ids
    label_to_index, index_to_label = {}, {}
    with open('data/label_ext.dict') as label_file:
        for index, label in enumerate(label_file):
            label = label.strip()
            label_to_index[label] = index
            index_to_label[index] = label

    # Split each sentence into characters and convert the labels to ids
    token_list, label_list = [], []
    for tokens, labels in traindata:

        tokens = list(tokens)
        labels = [label_to_index[label] for label in labels.split()]

        # Drop space characters together with their labels
        my_tokens, my_labels = [], []
        for token, label in zip(tokens, labels):
            if token == ' ':
                continue
            my_tokens.append(token)
            my_labels.append(label)

        token_list.append(my_tokens)
        label_list.append(my_labels)

    return token_list, label_list


class OpinionExtractionDataset:

    def __init__(self):
        self.token_list, self.label_list = train_process()
        self.train_size = len(self.token_list)

    def __len__(self):
        return self.train_size

    def __getitem__(self, index):
        return self.token_list[index], self.label_list[index]


def train():

    estimator = SkepForTokenClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=5)
    tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

    def collate_function(batch_data):

        inputs, labels = [], []
        for tokens, label in batch_data:
            inputs.append(tokens)
            labels.append(label)

        inputs = tokenizer.batch_encode(inputs,
                                        is_split_into_words=True,
                                        padding='longest',
                                        return_tensors='pd',
                                        add_special_tokens=False)
        # Pad the labels with -100 so that padded positions are ignored by the loss
        labels = paddle.to_tensor(Pad(pad_val=-100)(labels))

        return inputs, labels

    traindata = OpinionExtractionDataset()
    dataloader = DataLoader(traindata, batch_size=32, shuffle=True, collate_fn=collate_function)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(parameters=estimator.parameters(), learning_rate=2e-5)

    for epoch in range(15):

        progress = tqdm(range(len(dataloader)))
        total_loss = 0.0

        for inputs, labels in dataloader:

            outputs = estimator(**inputs)
            loss = criterion(outputs, labels)

            optimizer.clear_grad()
            loss.backward()
            optimizer.step()

            # total_loss is the sum of the batch losses within the epoch
            total_loss += loss.item()

            progress.set_description('epoch %2d loss %7.4f' % (epoch, total_loss))
            progress.update()

        progress.close()

        # Save a checkpoint after every epoch, named by epoch number and loss
        tokenizer.save_pretrained('model/%d-%.4f' % (epoch + 1, total_loss))
        estimator.save_pretrained('model/%d-%.4f' % (epoch + 1, total_loss))
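The training code pads the label sequences with -100, which is the default ignore_index of paddle.nn.CrossEntropyLoss, so padded positions do not contribute to the loss. The following pure-Python sketch illustrates the idea. The helpers `pad_labels` and `token_cross_entropy` are hypothetical stand-ins written for this illustration, not the Paddle implementation, and averaging over the non-padded tokens is my assumption about the intended behaviour.

```python
import math

IGNORE_INDEX = -100


def pad_labels(batch, pad_val=IGNORE_INDEX):
    """Pad every label sequence to the length of the longest one."""
    longest = max(len(sequence) for sequence in batch)
    return [sequence + [pad_val] * (longest - len(sequence)) for sequence in batch]


def token_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean token-level cross entropy that skips positions labelled ignore_index.

    logits: nested list of shape [batch, length, num_classes]
    labels: nested list of shape [batch, length]
    """
    total, count = 0.0, 0
    for sequence_logits, sequence_labels in zip(logits, labels):
        for scores, label in zip(sequence_logits, sequence_labels):
            if label == ignore_index:
                continue  # padded position: no contribution to the loss
            log_normalizer = math.log(sum(math.exp(score) for score in scores))
            total += log_normalizer - scores[label]
            count += 1
    return total / count


if __name__ == '__main__':
    labels = pad_labels([[1, 2, 0], [3]])
    logits = [[[0.0] * 5 for _ in range(3)] for _ in range(2)]
    print(labels)
    print(token_cross_entropy(logits, labels))
```

With this convention the second sentence in the demo contributes only one token to the loss, no matter what the model predicts at its two padded positions.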
Training output:
[2023-03-10 01:37:58,573] [ INFO] - Already cached /root/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2023-03-10 01:38:15,510] [ INFO] - Already cached /root/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt
[2023-03-10 01:38:15,521] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/skep_ernie_1.0_large_ch/tokenizer_config.json
[2023-03-10 01:38:15,522] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/skep_ernie_1.0_large_ch/special_tokens_map.json
epoch 0 loss 5.5341: 100%|████████████████████| 25/25 [29:38<00:00, 71.15s/it]
[2023-03-10 02:07:54,196] [ INFO] - tokenizer config file saved in model/1-5.5341/tokenizer_config.json
[2023-03-10 02:07:54,196] [ INFO] - Special tokens file saved in model/1-5.5341/special_tokens_map.json
epoch 1 loss 2.7045: 100%|████████████████████| 25/25 [30:34<00:00, 73.39s/it]
[2023-03-10 02:38:38,636] [ INFO] - tokenizer config file saved in model/2-2.7045/tokenizer_config.json
[2023-03-10 02:38:38,644] [ INFO] - Special tokens file saved in model/2-2.7045/special_tokens_map.json
epoch 2 loss 2.1480: 100%|████████████████████| 25/25 [28:44<00:00, 68.97s/it]
[2023-03-10 03:07:33,407] [ INFO] - tokenizer config file saved in model/3-2.1480/tokenizer_config.json
[2023-03-10 03:07:33,416] [ INFO] - Special tokens file saved in model/3-2.1480/special_tokens_map.json
epoch 3 loss 1.7210: 100%|████████████████████| 25/25 [28:53<00:00, 69.34s/it]
[2023-03-10 03:36:37,017] [ INFO] - tokenizer config file saved in model/4-1.7210/tokenizer_config.json
[2023-03-10 03:36:37,026] [ INFO] - Special tokens file saved in model/4-1.7210/special_tokens_map.json
epoch 4 loss 1.4525: 100%|████████████████████| 25/25 [28:54<00:00, 69.36s/it]
[2023-03-10 04:05:41,490] [ INFO] - tokenizer config file saved in model/5-1.4525/tokenizer_config.json
[2023-03-10 04:05:41,499] [ INFO] - Special tokens file saved in model/5-1.4525/special_tokens_map.json
epoch 5 loss 1.1014: 100%|████████████████████| 25/25 [28:54<00:00, 69.36s/it]
[2023-03-10 04:34:45,747] [ INFO] - tokenizer config file saved in model/6-1.1014/tokenizer_config.json
[2023-03-10 04:34:45,759] [ INFO] - Special tokens file saved in model/6-1.1014/special_tokens_map.json
epoch 6 loss 0.8688: 100%|████████████████████| 25/25 [29:03<00:00, 69.75s/it]
[2023-03-10 05:04:00,275] [ INFO] - tokenizer config file saved in model/7-0.8688/tokenizer_config.json
[2023-03-10 05:04:00,285] [ INFO] - Special tokens file saved in model/7-0.8688/special_tokens_map.json
epoch 7 loss 0.6463: 100%|████████████████████| 25/25 [30:48<00:00, 73.96s/it]
[2023-03-10 05:34:59,627] [ INFO] - tokenizer config file saved in model/8-0.6463/tokenizer_config.json
[2023-03-10 05:34:59,637] [ INFO] - Special tokens file saved in model/8-0.6463/special_tokens_map.json
epoch 8 loss 0.5815: 100%|████████████████████| 25/25 [31:02<00:00, 74.50s/it]
[2023-03-10 06:06:12,441] [ INFO] - tokenizer config file saved in model/9-0.5815/tokenizer_config.json
[2023-03-10 06:06:12,452] [ INFO] - Special tokens file saved in model/9-0.5815/special_tokens_map.json
epoch 9 loss 0.3756: 100%|████████████████████| 25/25 [32:16<00:00, 77.44s/it]
[2023-03-10 06:38:38,967] [ INFO] - tokenizer config file saved in model/10-0.3756/tokenizer_config.json
[2023-03-10 06:38:38,977] [ INFO] - Special tokens file saved in model/10-0.3756/special_tokens_map.json
epoch 10 loss 0.2703: 100%|████████████████████| 25/25 [35:17<00:00, 84.71s/it]
[2023-03-10 07:14:07,185] [ INFO] - tokenizer config file saved in model/11-0.2703/tokenizer_config.json
[2023-03-10 07:14:07,195] [ INFO] - Special tokens file saved in model/11-0.2703/special_tokens_map.json
epoch 11 loss 0.2044: 100%|████████████████████| 25/25 [39:21<00:00, 94.46s/it]
[2023-03-10 07:53:38,926] [ INFO] - tokenizer config file saved in model/12-0.2044/tokenizer_config.json
[2023-03-10 07:53:38,939] [ INFO] - Special tokens file saved in model/12-0.2044/special_tokens_map.json
epoch 12 loss 0.1824: 100%|████████████████████| 25/25 [40:13<00:00, 96.53s/it]
[2023-03-10 08:34:03,207] [ INFO] - tokenizer config file saved in model/13-0.1824/tokenizer_config.json
[2023-03-10 08:34:03,217] [ INFO] - Special tokens file saved in model/13-0.1824/special_tokens_map.json
epoch 13 loss 0.1728: 100%|███████████████████| 25/25 [42:33<00:00, 102.14s/it]
[2023-03-10 09:16:46,986] [ INFO] - tokenizer config file saved in model/14-0.1728/tokenizer_config.json
[2023-03-10 09:16:46,996] [ INFO] - Special tokens file saved in model/14-0.1728/special_tokens_map.json
epoch 14 loss 0.1069: 100%|███████████████████| 25/25 [48:18<00:00, 115.94s/it]
[2023-03-10 10:05:15,823] [ INFO] - tokenizer config file saved in model/15-0.1069/tokenizer_config.json
[2023-03-10 10:05:15,834] [ INFO] - Special tokens file saved in model/15-0.1069/special_tokens_map.json
2. Model Inference
def opinion_extraction(token_labels, inputs):

    # Map labels to integer ids
    with open('data/label_ext.dict') as label_file:
        label_to_index = {label.strip(): index for index, label in enumerate(label_file)}

    begin_aspect = label_to_index['B-Aspect']
    inner_aspect = label_to_index['I-Aspect']
    begin_opinion = label_to_index['B-Opinion']
    inner_opinion = label_to_index['I-Opinion']

    # Called when a B-Aspect or B-Opinion label is found: collect the whole word
    def decode_word(index, next_label):

        # Take the first character
        word = [inputs[index]]
        current_index = index + 1
        # Keep appending characters while the matching I- label continues
        while current_index < len(token_labels):
            if token_labels[current_index] == next_label:
                word.append(inputs[current_index])
                current_index += 1
            else:
                break

        return current_index, ''.join(word)

    # Decode all candidate aspect and opinion words
    index = 0
    decode_words = []
    while index < len(token_labels):

        current_label = token_labels[index]

        if current_label == begin_aspect:
            index, word = decode_word(index, inner_aspect)
            decode_words.append((word, 'aspect'))
            continue

        if current_label == begin_opinion:
            index, word = decode_word(index, inner_opinion)
            decode_words.append((word, 'opinion'))
            continue

        index += 1

    # decode_words now looks like:
    # [('很好', 'opinion'), ('地方', 'aspect'), ('牛扒', 'aspect'), ('一般', 'opinion'), ('牛扒', 'aspect'), ('一般', 'opinion')]

    # Group consecutive aspects and opinions together
    index = 0
    result = {}
    while index < len(decode_words):

        current_word, current_type = decode_words[index]

        if current_type == 'opinion':
            # Opinions that come first are attached to the next aspect.
            # key stays None if no aspect follows.
            key = None
            val = [current_word]
            opinion_index = index + 1
            while opinion_index < len(decode_words):
                opinion_word, opinion_type = decode_words[opinion_index]
                if opinion_type == 'opinion':
                    val.append(opinion_word)
                if opinion_type == 'aspect':
                    key = opinion_word
                    break
                opinion_index += 1
            index = opinion_index + 1
            result[key] = val

        elif current_type == 'aspect':
            # An aspect collects all opinions that follow it up to the next aspect
            key = current_word
            val = []
            aspect_index = index + 1
            while aspect_index < len(decode_words):
                aspect_word, aspect_type = decode_words[aspect_index]
                if aspect_type == 'opinion':
                    val.append(aspect_word)
                if aspect_type == 'aspect':
                    break
                aspect_index += 1
            index = aspect_index
            result[key] = val

    # Drop aspects that have no opinion
    extracted_opinions = {aspect: opinions for aspect, opinions in result.items() if len(opinions) > 0}

    return extracted_opinions


def extraction():

    # Load the checkpoint saved after the 15th epoch
    model_name = glob.glob('model/15-*')[0]
    estimator = SkepForTokenClassification.from_pretrained(model_name, num_classes=5)
    estimator.eval()
    tokenizer = SkepTokenizer.from_pretrained(model_name)

    # inputs = '蓝色的水飞流直下别有风情很好的地方'
    # inputs = '牛扒很一般,所谓的自助餐玩文字游戏'
    # inputs = '一个字好两个字好吃,但是口味绝对的赞'
    inputs = '搓澡的大姐太给力了,不仅搓澡好,洗澡的环境干净,挺温馨的'
    # Remove whitespace, which the tokenizer would skip
    inputs = ''.join(inputs.split())

    inputs_encode = tokenizer.encode(list(inputs),
                                     is_split_into_words=True,
                                     return_tensors='pd',
                                     add_special_tokens=False)

    with paddle.no_grad():
        outputs = estimator(**inputs_encode)
        pred_labels = paddle.argmax(outputs, axis=-1)

    result = opinion_extraction(pred_labels.tolist()[0], inputs)
    print('句子:', inputs)
    print('观点:', result)


if __name__ == '__main__':
    extraction()
The opinions extracted for the 4 inputs given in the program are as follows:
句子: 蓝色的水飞流直下别有风情很好的地方
观点: {'地方': ['很好']}

句子: 牛扒很一般,所谓的自助餐玩文字游戏
观点: {'牛扒': ['一般']}

句子: 一个字好两个字好吃,但是口味绝对的赞
观点: {'个字': ['好'], '口味': ['赞']}

句子: 搓澡的大姐太给力了,不仅搓澡好,洗澡的环境干净,挺温馨的
观点: {'大姐': ['给力', '好'], '环境': ['干净', '温馨']}