Using Transformers – pipeline

Transformer models often have hundreds of millions or even tens of billions of parameters, so training and deploying them is a complex undertaking. On top of that, new models are released almost daily and each one has its own implementation, which makes working with them directly far from easy. The Transformers library provides a simple, unified interface for loading, training, and saving these models, and it supports both the TensorFlow and PyTorch deep learning frameworks.

1. pipeline workflow

  1. The Tokenizer splits the input text into tokens according to its vocabulary and maps each token to an integer, e.g. This → 2023 and course → 2607, and adds special tokens, e.g. 101 at the start and 102 at the end, marking the beginning and end of the sentence. The pipeline loads the matching tokenizer automatically;
  2. The Model takes the Tokenizer's output and produces hidden_states, which represent the Transformer model's understanding of the input;
  3. The hidden_states alone are sometimes useful, but usually they are fed into another component so that the model is tied to a concrete task. The Transformers library therefore ships models with task-specific heads (see the sketch after this list), mainly:
    1. *Model
    2. *ForCausalLM
    3. *ForMaskedLM
    4. *ForMultipleChoice
    5. *ForQuestionAnswering
    6. *ForSequenceClassification
    7. *ForTokenClassification
    8. …others
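
To make the difference between a bare *Model and a model with a task head concrete, here is a minimal sketch (not part of the original example; it assumes bert-base-chinese, the checkpoint also used below). The bare model returns hidden states, while the *ForSequenceClassification variant adds a head that maps them to per-class logits; note that this head is freshly initialized here and only becomes meaningful after fine-tuning.

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

# bare *Model: outputs hidden_states only
base_model = AutoModel.from_pretrained('bert-base-chinese')
# *ForSequenceClassification: same body plus a (randomly initialized) classification head
cls_model = AutoModelForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)

encoded = tokenizer('我是一个中国人', return_tensors='pt')
with torch.no_grad():
    hidden = base_model(**encoded).last_hidden_state   # (batch, seq_len, hidden_size)
    logits = cls_model(**encoded).logits                # (batch, num_labels)
print(hidden.shape, logits.shape)                       # torch.Size([1, 9, 768]) torch.Size([1, 2])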

Next, using what we just covered, let's write a short example program. First, import the required packages:

# import a model class with a MaskedLM head
import torch
from transformers import AutoModelForMaskedLM
# import the tokenizer class that matches the model
from transformers import AutoTokenizer

Instantiate the model with a MaskedLM head and the matching tokenizer:

# 1. Load the pretrained model; if it is not present locally, it is downloaded automatically and cached under ~/.cache/huggingface/transformers
model = AutoModelForMaskedLM.from_pretrained('bert-base-chinese')

# 2. Load the tokenizer used to preprocess the input text
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

Converting the input text into the format the model expects is handled by the model's tokenizer:

inputs = ['我是一个非常[MASK]的人', '小朋友们在打[MASK]球']
inputs_ids = tokenizer(inputs, padding=True, return_tensors='pt')
print(inputs_ids)

The contents of inputs_ids are as follows:

{
    'input_ids':tensor([
        [ 101, 2769, 3221,  671,  702, 7478, 2382,  103, 4638,  782,  102],
        [ 101, 2207, 3301, 1351,  812, 1762, 2802,  103, 4413,  102,    0]]),
    'token_type_ids': tensor([
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
}
  1. input_ids is the index representation of the text after the start and end markers have been added
  2. attention_mask is the mask used in attention computations; it tells the attention layers which positions should actually be attended to
  3. token_type_ids distinguishes multiple sentences within one input; we will come back to it when it is needed (see the sketch below)
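
As a quick illustration (a sketch reusing the tokenizer and inputs_ids from above): encoding a sentence pair shows token_type_ids switching from 0 to 1 for the second sentence, and converting the padded second sample back to tokens shows the trailing [PAD] that attention_mask marks with a 0.

# token_type_ids: 0 for the first sentence (including its [SEP]), 1 for the second
pair = tokenizer('我是一个非常[MASK]的人', '小朋友们在打[MASK]球', return_tensors='pt')
print(pair['token_type_ids'])

# the shorter sample in the batch above was padded; attention_mask is 0 at that position
print(tokenizer.convert_ids_to_tokens(inputs_ids['input_ids'][1].tolist()))
# ['[CLS]', '小', '朋', '友', '们', '在', '打', '[MASK]', '球', '[SEP]', '[PAD]']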

Then feed the data preprocessed by the Tokenizer into the model:

outputs = model(**inputs_ids)
print(outputs.logits.shape, outputs)

The resulting outputs.logits has shape torch.Size([2, 11, 21128]): 2 is the number of samples we fed into the model, 11 is the number of tokens in each sample, and 21128 is the vocabulary size, i.e. one prediction score per vocabulary token at each position. The full output:

MaskedLMOutput(loss=None, logits=tensor([
    [[ -7.9921,  -7.9514,  -7.9941,  ...,  -6.8486,  -6.9934,  -7.0647],
     [ -8.5090,  -8.3233,  -8.2923,  ...,  -6.6875,  -7.1689,  -6.0902],
     [-17.2248, -15.6046, -17.1670,  ...,  -5.3784,  -8.0310,  -7.7882],
     ...,
     [-10.5774, -11.2269, -10.6044,  ...,  -6.0821,  -6.9838,  -7.9344],
     [-11.1292, -11.6596, -11.5012,  ...,  -7.9874,  -7.3275,  -6.1994],
     [ -9.5499,  -9.5935,  -9.6982,  ...,  -6.6942,  -7.0926,  -7.5598]],

    [[ -8.4039,  -8.3126,  -8.3381,  ...,  -7.2512,  -7.2176,  -7.2747],
     [ -8.2717,  -8.2131,  -8.2040,  ...,  -7.1573,  -6.8951,  -6.5240],
     [-18.2110, -17.2316, -17.4112,  ..., -10.5202, -12.1790, -15.4470],
     ...,
     [-11.4894, -11.5857, -11.7265,  ...,  -9.3840,  -7.4743,  -8.0007],
     [ -9.9360, -10.0215, -10.2464,  ...,  -7.2153,  -7.9782,  -8.0221],
     [-10.2763, -10.2004, -10.3447,  ...,  -7.4666,  -5.5277,  -6.6890]]],
       grad_fn=<AddBackward0>), hidden_states=None, attentions=None)
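
The three dimensions can be checked directly (a small sanity check, not part of the original code):

print(inputs_ids['input_ids'].shape)   # torch.Size([2, 11]) -> batch of 2, 11 tokens each
print(model.config.vocab_size)         # 21128 -> one score per vocabulary entry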

Since we want the word at the [MASK] position, we only need the index of the largest of the 21128 scores at that position.

# position 7 is where the [MASK] token (id 103) sits in both input_ids sequences above
sentence1_mask_id = torch.argmax(outputs.logits[0][7])
sentence2_mask_id = torch.argmax(outputs.logits[1][7])
print(sentence1_mask_id, sentence2_mask_id)

# decode the predicted ids back to text
sentence1_mask_word = tokenizer.decode(sentence1_mask_id)
sentence2_mask_word = tokenizer.decode(sentence2_mask_id)
# '我是一个非常[MASK]的人'  [MASK] ==> 好
# '小朋友们在打[MASK]球' [MASK] ==> 篮
print(sentence1_mask_word, sentence2_mask_word)
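
Hard-coding position 7 only works because both example sentences happen to have their [MASK] at the same index. A more general sketch (same variables as above) locates the masked positions via tokenizer.mask_token_id and turns the scores into ranked candidates:

# boolean mask over the (2, 11) input_ids marking the [MASK] positions
mask_positions = inputs_ids['input_ids'] == tokenizer.mask_token_id
mask_logits = outputs.logits[mask_positions]        # (num_masks, vocab_size)

# softmax over the vocabulary and keep the 5 best candidates per masked position
probs = torch.softmax(mask_logits, dim=-1)
top_probs, top_ids = probs.topk(5, dim=-1)
for ids, ps in zip(top_ids, top_probs):
    print([(tokenizer.convert_ids_to_tokens(int(i)), round(float(p), 4)) for i, p in zip(ids, ps)])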

2. pipeline usage

import torch
from transformers import pipeline
import numpy as np

# 1. Sentiment analysis
def test01():

    # https://huggingface.co/techthiyanes/chinese_sentiment
    model = pipeline('sentiment-analysis', model='techthiyanes/chinese_sentiment')
    print(model('我爱你'))
    print(model('我恨你'))

    # Output
    # [{'label': 'star 5', 'score': 0.5765597820281982}]
    # [{'label': 'star 1', 'score': 0.4358566999435425}]


# 2. Feature extraction
def test02():

    # https://huggingface.co/bert-base-chinese
    model = pipeline('feature-extraction', model='bert-base-chinese')
    output = model('我是一个中国人')
    print(np.array(output).shape)

    # Output
    # [CLS] 我 是 一 个 中 国 人 [SEP]
    # (1, 9, 768)


# 3. Fill-mask (cloze)
def test03():

    # https://huggingface.co/bert-base-chinese
    # whole-word-masking model: https://huggingface.co/hfl/chinese-bert-wwm
    model = pipeline('fill-mask', model='hfl/chinese-bert-wwm')
    inputs = '我想明天去[MASK]家吃饭.'
    print(model(inputs))

    # Output
    # {'sequence': '我 想 明 天 去 她 家 吃 饭.', 'score': 0.3433133959770202, 'token': 1961, 'token_str': '她'},
    # {'sequence': '我 想 明 天 去 你 家 吃 饭.', 'score': 0.2533259987831116, 'token': 872, 'token_str': '你'},
    # {'sequence': '我 想 明 天 去 他 家 吃 饭.', 'score': 0.1874391734600067, 'token': 800, 'token_str': '他'},
    # {'sequence': '我 想 明 天 去 我 家 吃 饭.', 'score': 0.1273055076599121, 'token': 2769, 'token_str': '我'},
    # {'sequence': '我 想 明 天 去 您 家 吃 饭.', 'score': 0.0216297898441553, 'token': 2644, 'token_str': '您'}

# 4. Reading comprehension (extractive QA)
def test04():

    # https://huggingface.co/luhua/chinese_pretrain_mrc_roberta_wwm_ext_large
    model = pipeline('question-answering', model='luhua/chinese_pretrain_mrc_roberta_wwm_ext_large')
    print(model(context='我是一个好人.', question='我是谁?'))

    # Output
    # {'score': 3.6716744228337816e-11, 'start': 4, 'end': 6, 'answer': '好人'}


# 5. Text summarization (uses the default English summarization model)
def test05():

    model = pipeline('summarization')
    text = '''
    In this notebook we will be using the transformer model, first introduced in this paper. Specifically, we will be using the BERT (Bidirectional Encoder Representations from Transformers) model from this paper.
    Transformer models are considerably larger than anything else covered in these tutorials. As such we are going to use the transformers library to get pre-trained transformers and use them as our embedding layers. We will freeze (not train) the transformer and only train the remainder of the model which learns from the representations produced by the transformer. In this case we will be using a multi-layer bi-directional GRU, however any model can learn from these representations.
    '''
    print(model(text))

    # [{'summary_text': ' In this notebook we will be using the transformer model, first introduced in this paper .
    # Transformer models are considerably larger than anything else covered in these tutorials .
    # We will freeze (not train) the transformer and only train the remainder of the model which learns from the representations produced by the transformer .'}]


# 6. Named entity recognition
def test06():

    # https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese
    model = pipeline('ner', model='uer/roberta-base-finetuned-cluener2020-chinese')
    print(model('我去北京的家乐福超市买东西'))

    # {'entity': 'B-address', 'score': 0.913372, 'index': 3, 'word': '北', 'start': 2, 'end': 3},
    # {'entity': 'I-address', 'score': 0.8613379, 'index': 4, 'word': '京', 'start': 3, 'end': 4},
    # {'entity': 'B-address', 'score': 0.47611365, 'index': 6, 'word': '家', 'start': 5, 'end': 6},
    # {'entity': 'I-address', 'score': 0.6594356, 'index': 7, 'word': '乐', 'start': 6, 'end': 7},
    # {'entity': 'I-address', 'score': 0.5203492, 'index': 8, 'word': '福', 'start': 7, 'end': 8}


if __name__ == '__main__':
    test06()
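
Finally, pipelines accept a few options worth knowing. The snippet below is a sketch based on recent transformers releases, not part of the original article: device selects a GPU, top_k limits the number of fill-mask candidates, and aggregation_strategy merges the B-/I- pieces from test06 into whole entities (older versions use grouped_entities=True instead).

from transformers import pipeline

# device=0 runs on the first GPU; omit it (or use device=-1) to stay on the CPU
fill = pipeline('fill-mask', model='hfl/chinese-bert-wwm', device=0)
# return only the 3 most likely fillers for the masked position
print(fill('我想明天去[MASK]家吃饭.', top_k=3))

# merge the token-level B-/I- tags into whole entities such as 北京
ner = pipeline('ner', model='uer/roberta-base-finetuned-cluener2020-chinese',
               aggregation_strategy='simple')
print(ner('我去北京的家乐福超市买东西'))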
