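Model evaluation does the following: it extracts all entity names from the test set, groups them into the ORG, PER and LOC classes, and then computes precision and recall for each class, plus an overall accuracy figure.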

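1. Extracting entity names

I wrote the extract_decode function, which takes a sentence together with the entity label of each character in it, extracts all the entities marked in that sentence, and stores them by class.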

def extract_decode(label_list, text):
    """
    :param label_list: one-dimensional list of label ids output by the model, one per character
    :param text: the sentence fed to the model
    :return: dict mapping each entity class to the entity names extracted
    """

    labels = ['O', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']
    label_to_index = {label: index for index, label in enumerate(labels)}
    B_ORG, I_ORG = label_to_index['B-ORG'], label_to_index['I-ORG']
    B_PER, I_PER = label_to_index['B-PER'], label_to_index['I-PER']
    B_LOC, I_LOC = label_to_index['B-LOC'], label_to_index['I-LOC']

    # Collect the entity that starts at start_index and continues while the label equals next_label
    def extract_word(start_index, next_label):

        # index ends up at the first position after the entity
        index, entity = start_index + 1, [text[start_index]]
        while index < len(label_list) and label_list[index] == next_label:
            entity.append(text[index])
            index += 1

        return index, ''.join(entity)

    # Stores the extracted named entities
    extract_entities, index = {'ORG': [], 'PER': [],  'LOC': []}, 0
    # Maps each B- label to the I- label that continues it
    next_label = {B_ORG: I_ORG, B_PER: I_PER, B_LOC: I_LOC}
    # Maps each B- label to its entity class
    word_class = {B_ORG: 'ORG', B_PER: 'PER', B_LOC: 'LOC'}

    while index < len(label_list):
        # Label at the current position
        label = label_list[index]
        if label in next_label:
            # Pass the current position and the matching continuation label to extract_word
            index, word = extract_word(index, next_label[label])
            extract_entities[word_class[label]].append(word)
            continue
        index += 1

    return extract_entities
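
To make the return value concrete, here is a small self-contained sketch. decode_bio is a hypothetical, condensed stand-in for extract_decode that assumes the same label order as the labels list above; the sentence and label ids are made up for illustration.

```python
# Hypothetical condensed stand-in for extract_decode, for illustration only.
LABELS = ['O', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']


def decode_bio(label_ids, text):
    """Group B-/I- label ids into entity strings, keyed by entity class."""
    entities = {'ORG': [], 'PER': [], 'LOC': []}
    index = 0
    while index < len(label_ids):
        name = LABELS[label_ids[index]]
        if not name.startswith('B-'):
            index += 1
            continue
        kind = name[2:]
        end = index + 1
        # Extend the span while the following labels are the matching I- tag
        while end < len(label_ids) and LABELS[label_ids[end]] == 'I-' + kind:
            end += 1
        entities[kind].append(text[index:end])
        index = end
    return entities


# Made-up example: one label id per character
text = '小明在北京工作'
label_ids = [3, 4, 0, 5, 6, 0, 0]
result = decode_bio(label_ids, text)
print(result)
```

The two B- labels open a PER span and a LOC span, and each span is extended over the I- labels that follow it.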

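2. Evaluation function

The evaluate function does three things. First, it extracts all entity names from the test set. Next, it runs the model on the test data to obtain the predicted entity labels. Finally, it compares the predictions with the ground truth to compute precision and recall for each class, for example the precision and recall of the model's predictions for LOC (location) entities.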
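
The third step reduces to two ratios per class: precision divides the number of correct predictions by everything the model predicted, and recall divides it by the number of true entities. A minimal sketch, assuming entities are compared by name against a pooled list as in this article; precision_recall and the sample lists are hypothetical.

```python
def precision_recall(true_entities, pred_entities):
    """Return (precision, recall) for one entity class.

    A prediction counts as correct when its name appears among the true entities.
    """
    true_set = set(true_entities)
    correct = sum(1 for entity in pred_entities if entity in true_set)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    return precision, recall


# Made-up LOC entities for illustration
true_loc = ['香港', '中国', '洛杉矶', '北京']
pred_loc = ['香港', '中国', '上海']
precision, recall = precision_recall(true_loc, pred_loc)
print('LOC precision: %.3f' % precision)
print('LOC recall: %.3f' % recall)
```

Here two of the three predictions are correct and two of the four true entities are found.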

We trained several models earlier, and each of them is evaluated here. Some of the results are shown below:

# Evaluation results of ner-model-12 on the test set

ORG recall: 0.929
ORG precision: 0.844
--------------------------------------------------
PER recall: 0.973
PER precision: 0.936
--------------------------------------------------
LOC recall: 0.933
LOC precision: 0.902
--------------------------------------------------
accuracy: 0.942

# Evaluation results of ner-model-37 on the test set

ORG recall: 0.924
ORG precision: 0.814
--------------------------------------------------
PER recall: 0.961
PER precision: 0.926
--------------------------------------------------
LOC recall: 0.924
LOC precision: 0.894
--------------------------------------------------
accuracy: 0.933

... and so on

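The results show that precision and recall on ORG, PER and LOC differ from model to model. In other words, each model performs differently on each class of named entity. The complete evaluation code follows: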

import torch
from datasets import load_from_disk
from transformers import BertTokenizer, BertForTokenClassification


def evaluate():

    # Load the test data
    valid_data = load_from_disk('data/03-train')['valid_data']

    # 1. Collect the true entities of each class

    # True entities found in the test set
    total_entities = {'ORG': [], 'PER': [], 'LOC': []}
    def calculate_handler(data_inputs, data_label_ids):
        # Turn data_inputs into a sentence without separating spaces
        text = ''.join(data_inputs.split())
        # Extract the entities in the sentence
        extract_entities = extract_decode(data_label_ids, text)
        # Accumulate the entities of each class
        for key, value in extract_entities.items():
            total_entities[key].extend(value)

    # Run the handler over the whole test set
    valid_data.map(calculate_handler, input_columns=['data_inputs', 'data_label_ids'])
    print(total_entities)


    # 2. Collect the entities predicted by each model
    # Initialize the tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

    for index in range(11, 38):

        # Initialize the model and switch it to evaluation mode
        model = BertForTokenClassification.from_pretrained('data/ner-model-%d' % index, num_labels=7)
        model.eval()

        model_entities = {'ORG': [], 'PER': [], 'LOC': []}
        def start_evaluate(data_inputs):

            # Tokenize the input text
            model_inputs = tokenizer(data_inputs, add_special_tokens=False, return_tensors='pt')
            # Run the model on the text
            with torch.no_grad():
                outputs = model(**model_inputs)

            # Predicted label id for every token; [0] selects the single sequence in the batch
            label_list = torch.argmax(outputs.logits[0], dim=-1).tolist()
            text = ''.join(data_inputs.split())

            # Extract entity names from the prediction
            extract_entities = extract_decode(label_list, text)
            for key, value in extract_entities.items():
                model_entities[key].extend(value)

        # Run the prediction over the whole test set
        valid_data.map(start_evaluate, input_columns=['data_inputs'], batched=False)
        print(model_entities)

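        # 3. Compute recall and precision for each class
        print('#%d\n' % index)
        total_pred_correct = 0
        total_true_correct = 0
        for key in total_entities.keys():

            # True and predicted entity lists for class key
            true_entities = total_entities[key]
            true_entities_num = len(true_entities)
            pred_entities = model_entities[key]

            # Split the predictions: pred_correct counts hits, pred_incorrect counts misses
            pred_correct, pred_incorrect = 0, 0
            for pred_entity in pred_entities:
                if pred_entity in true_entities:
                    pred_correct += 1
                    continue
                pred_incorrect += 1

            # Number of entities the model predicted for class key (correct + incorrect);
            # precision divides by this, not by true_entities_num + pred_incorrect
            model_pred_key_num = pred_correct + pred_incorrect

            # Running total of correctly predicted entities
            total_pred_correct += pred_correct
            # Running total of true entities
            total_true_correct += true_entities_num

            # Recall and precision for class key (precision is 0 when nothing was predicted)
            precision = pred_correct / model_pred_key_num if model_pred_key_num else 0.0
            print(key, 'recall: %.3f' % (pred_correct / true_entities_num))
            print(key, 'precision: %.3f' % precision)
            print('-' * 50)

        # Correct predictions over true entities across all classes
        print('accuracy: %.3f' % (total_pred_correct / total_true_correct))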
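
The membership test pred_entity in true_entities compares names against the pooled list for the whole test set, so a predicted name counts as correct even when it occurs in a different sentence. A stricter variant, sketched here under the assumption that gold and predicted label ids are available sentence by sentence, compares (start, end, class) spans within each sentence. extract_spans and span_scores are hypothetical helpers, not part of the code above, and the label ids are made up.

```python
LABELS = ['O', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']


def extract_spans(label_ids):
    """Return the set of (start, end, kind) spans encoded by a BIO label id list."""
    spans, index = set(), 0
    while index < len(label_ids):
        name = LABELS[label_ids[index]]
        if name.startswith('B-'):
            kind, end = name[2:], index + 1
            while end < len(label_ids) and LABELS[label_ids[end]] == 'I-' + kind:
                end += 1
            spans.add((index, end, kind))
            index = end
        else:
            index += 1
    return spans


def span_scores(gold_sentences, pred_sentences):
    """Micro-averaged precision and recall over exact span matches."""
    correct = gold_total = pred_total = 0
    for gold_ids, pred_ids in zip(gold_sentences, pred_sentences):
        gold, pred = extract_spans(gold_ids), extract_spans(pred_ids)
        correct += len(gold & pred)
        gold_total += len(gold)
        pred_total += len(pred)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall


# Made-up label ids for two sentences; the LOC span in the first one is cut short
gold = [[3, 4, 0, 5, 6], [1, 2, 2, 0]]
pred = [[3, 4, 0, 5, 0], [1, 2, 2, 0]]
precision, recall = span_scores(gold, pred)
print(precision, recall)
```

With exact span matching, the truncated LOC prediction counts as a miss even though it overlaps the true entity.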

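3. Prediction function

The prediction function is fairly simple: pass in a sentence and it extracts the entities the sentence contains. The implementation is as follows: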

def entity_extract(text):

    # Initialize the tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    # Initialize the model and switch it to evaluation mode
    model = BertForTokenClassification.from_pretrained('data/ner-model-39', num_labels=7)
    model.eval()

    # Split the text into characters separated by spaces so that the Bert tokenizer splits it character by character
    input_text = ' '.join(list(text))
    print(text)
    inputs = tokenizer.encode_plus(input_text, add_special_tokens=False, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**inputs)
        y_pred = torch.argmax(outputs.logits, dim=-1)[0].tolist()

    return extract_decode(y_pred, text)


if __name__ == '__main__':
    text = '今年７月１日我国政府恢复对香港行使主权，标志着“一国两制”构想的巨大成功，标志着中国人民在祖国统一大业的道路上迈出了重要的一步。'
    result = entity_extract(text)
    print(result)
    text = '同时，三毛集团自身也快速扩张，企业新创造了３０００多个就业岗位，安置了一大批下岗职工。'
    result = entity_extract(text)
    print(result)
    text = '我要感谢洛杉矶市民议政论坛、亚洲协会南加中心、美中关系全国委员会、美中友协美西分会等友好团体的盛情款待。'
    result = entity_extract(text)
    print(result)

The program produces the following output:

今年７月１日我国政府恢复对香港行使主权，标志着“一国两制”构想的巨大成功，标志着中国人民在祖国统一大业的道路上迈出了重要的一步。
{'ORG': [], 'PER': [], 'LOC': ['香港', '中国']}
同时，三毛集团自身也快速扩张，企业新创造了３０００多个就业岗位，安置了一大批下岗职工。
{'ORG': ['三毛集团'], 'PER': [], 'LOC': []}
我要感谢洛杉矶市民议政论坛、亚洲协会南加中心、美中关系全国委员会、美中友协美西分会等友好团体的盛情款待。
{'ORG': ['亚洲协会南加中心', '美中关系全国委员会', '美中友协美西分会'], 'PER': [], 'LOC': ['洛杉矶']}
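
entity_extract appears to rely on the model returning exactly one label per character of text, which is why the characters are separated by spaces before tokenization. The sketch below is a small sanity check for that assumption; tag_characters is a hypothetical helper and the label ids are made up.

```python
LABELS = ['O', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']


def tag_characters(text, label_ids):
    """Pair each character with its label name, failing loudly on a length mismatch."""
    if len(text) != len(label_ids):
        raise ValueError('expected %d labels, got %d' % (len(text), len(label_ids)))
    return [(char, LABELS[label_id]) for char, label_id in zip(text, label_ids)]


# Made-up prediction: the first four characters form one ORG entity
pairs = tag_characters('三毛集团扩张', [1, 2, 2, 2, 0, 0])
for char, label in pairs:
    print(char, label)
```

Running such a check before extract_decode would surface a misaligned prediction as an explicit error rather than as a shifted or truncated entity.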

Implementing NER with Bert – Model Evaluation
