BERT Learns OpenAI Sentence Embeddings

OpenAI provides the text-embedding-ada-002 model for computing vector representations of input sentences. The model consists of a multi-layer bidirectional Transformer encoder and an average-pooling layer that turns the encoder outputs into a fixed-length embedding. Its input is a piece of text of arbitrary length, and its output is a 1536-dimensional vector that captures the semantic information of the input sequence.
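As a quick illustration (a minimal sketch using the same legacy openai SDK call as the full script below; the openai_api_key file and the example question are taken from that script), a single request returns the 1536-dimensional vector:

import openai

openai.api_key = open('openai_api_key').read().strip()

# one request for one sentence; the embedding sits under data[0].embedding
response = openai.Embedding.create(model='text-embedding-ada-002',
                                   input='今天早上突然发烧了,浑身酸疼,吃点啥药啊?')
vector = response['data'][0]['embedding']
print(len(vector))  # 1536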

The model is trained on a large text corpus so that it learns to map text into meaningful semantic representations. As a result, it performs well on a wide range of natural language processing tasks, including text classification, sentiment analysis, question answering, and natural language generation.

If we feed our own training corpus into text-embedding-ada-002 to obtain vector representations, and then have a pre-trained BERT model learn to reproduce those vectors, can we end up with a model that produces good sentence representations? Here we give it a try:

  1. First, feed the questions into the text-embedding-ada-002 model to compute their vector representations
  2. Then, train a network consisting of BERT plus a linear layer to fit the text-embedding-ada-002 representations
  3. Finally, use cosine distance as the loss function (the objective is sketched just below this list)
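
Put differently (a sketch of the objective, not stated explicitly in the code below): writing f_θ for the BERT + linear network and e(x_i) for the ada-002 embedding of question x_i, training minimizes the mean cosine distance

\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \cos\big(f_\theta(x_i),\, e(x_i)\big) \right)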

1. Obtaining the OpenAI Sentence Embeddings

Since OpenAI rate-limits API requests, we use the tenacity library to retry on failure. Sending one request per question, the 10,000 questions took roughly 2 hours and 45 minutes. The complete code is as follows:

import openai
import tenacity
import torch
import pickle
import pandas as pd
import numpy as np
from tqdm import tqdm
openai.api_key = open('openai_api_key').read().strip()


@tenacity.retry(wait=tenacity.wait_fixed(3))
def get_embedding(question):
    outputs = openai.Embedding.create(model='text-embedding-ada-002', input=question)
    return outputs['data'][0]['embedding']

# 10,000 questions: about 2 hours 45 minutes
def generate_embedding():

    trains = pd.read_csv('data/question.csv', usecols=['question']).to_numpy().tolist()
    trains = [train[0] for train in trains]
    progress = tqdm(range(len(trains)), desc='generate embedding')
    question_embeddings = []
    for question in trains:
        embedding = get_embedding(question)
        question_embeddings.append(embedding)
        progress.update()
    progress.close()

    pickle.dump(question_embeddings, open('data/question_embedded.pkl', 'wb'))
    open('data/questions_selected.txt', 'w').write('\n'.join(trains))


def similarity_matching():

    embeddings = pickle.load(open('data/question_embedded.pkl', 'rb'))
    message_inputs = '今天早上突然发烧了,浑身酸疼,吃点啥药啊?'
    print('输入问题:', message_inputs)
    print('-' * 60)
    message_embeds = get_embedding(message_inputs)
    questions = open('data/questions_selected.txt').readlines()

    question_scores = []
    for embed in embeddings:
        # cosine similarity between the input question and each stored question embedding
        score = torch.cosine_similarity(torch.tensor([embed]), torch.tensor([message_embeds]))
        question_scores.append(score.item())

    sorted_indexes = np.argsort(-np.array(question_scores))[:5]
    print('相似问题:')
    for index in sorted_indexes:
        print(round(question_scores[index], 4), questions[index].strip())


if __name__ == '__main__':
    generate_embedding()
    similarity_matching()

2. Building the BERT Embedding Model

Since BERT outputs 768-dimensional vectors while text-embedding-ada-002 outputs 1536-dimensional vectors, we add a linear layer that projects the 768-dimensional output up to 1536 dimensions. The complete model code is as follows:

import torch
import torch.nn as nn
from transformers import BertModel

import logging
logging.getLogger("transformers").setLevel(logging.ERROR)


class SentenceEmbeddingBert(nn.Module):

    def __init__(self, pretrained=None, embed_dim=1536):
        super(SentenceEmbeddingBert, self).__init__()
        if pretrained is None:
            self.base_model = BertModel.from_pretrained('pretrained/bert-base-chinese')
        else:
            self.base_model = BertModel.from_pretrained(pretrained)

        self.transpose = nn.Linear(self.base_model.config.hidden_size, embed_dim)

    def get_inputs_embedding(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        # use the [CLS] token vector as the sentence representation
        inputs_embedding = outputs.last_hidden_state[:, 0]
        inputs_embedding = self.transpose(inputs_embedding)
        return inputs_embedding

    def get_loss(self, inputs_embeddings, labels):
        """余弦距离损失"""
        distance = 1 - torch.cosine_similarity(inputs_embeddings, labels)
        loss = torch.mean(distance)
        return loss

    def forward(self, input_ids, attention_mask, labels):
        sentence_embeddings = self.get_inputs_embedding(input_ids, attention_mask)
        return self.get_loss(sentence_embeddings, labels)


def test():
    estimator = SentenceEmbeddingBert()
    target = torch.randn(2, 1536)
    estimator(input_ids=torch.tensor([[1, 2, 3, 4], [10, 20, 30, 40]]), attention_mask=torch.tensor([[1, 1, 1, 1], [1, 1, 1, 1]]), labels=target)


if __name__ == '__main__':
    test()

3. Training the BERT Embedding Model

We use the [CLS] vector as the representation of each sentence, compute the cosine-distance loss against the corresponding OpenAI embedding, and keep only the 3 checkpoints with the smallest loss. The complete training code is as follows:

from sentence_bert import SentenceEmbeddingBert
from transformers import BertTokenizer
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from ignite.engine import Engine
import pickle
from ignite.engine import Events
from tqdm import tqdm
from ignite.handlers.early_stopping import EarlyStopping
from ignite.handlers import Checkpoint


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
description_format = 'epoch %2d loss %7.2f'


def train_step(engine, batch_data):
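    # forward, backward and parameter update for one batch; the returned value is the
    # batch-mean loss scaled by batch size (i.e. the summed per-sample loss), which the
    # iteration handler accumulates into the running epoch total shown on the progress bar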
    loss = engine.estimator(**batch_data)
    engine.optimizer.zero_grad()
    loss.backward()
    engine.optimizer.step()
    return {'loss': loss.item() * batch_data['labels'].size(0)}


def on_train_epoch_started(engine):
    engine.progress = tqdm(range(engine.state.epoch_length))
    description = description_format % (engine.state.epoch, 0)
    engine.progress.set_description(description)
    engine.totaloss = 0.0


def on_train_epoch_completed(engine):
    engine.progress.close()


def on_train_iteration_started(engine):
    engine.progress.update()


def on_train_iteration_completed(engine):
    engine.totaloss += engine.state.output['loss']
    description = description_format % (engine.state.epoch, engine.totaloss)
    engine.progress.set_description(description)


class QuestionDataset:

    def __init__(self):
        self.embedding = pickle.load(open('data/question_embedded.pkl', 'rb'))
        self.questions = [line.strip() for line in open('data/questions_selected.txt')]

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, index):
        return {'question': self.questions[index], 'embedding': self.embedding[index]}


def do_train():

    estimator = SentenceEmbeddingBert().to(device)
    optimizer = optim.Adam(estimator.parameters(), lr=1e-5)
    tokenizer = BertTokenizer.from_pretrained('pretrained/bert-base-chinese')

    def collate_function(batch_data):
        labels = []
        inputs = []
        for data in batch_data:
            inputs.append(data['question'])
            labels.append(data['embedding'])
        labels = torch.tensor(labels, device=device)
        inputs = tokenizer(inputs, padding='longest', return_token_type_ids=False, return_tensors='pt')
        inputs = {key: value.to(device) for key, value in inputs.items()}
        inputs['labels'] = labels
        return inputs

    trainer = Engine(train_step)
    trainer.estimator = estimator
    trainer.optimizer = optimizer
    trainer.tokenizer = tokenizer

    trainer.add_event_handler(Events.EPOCH_STARTED, on_train_epoch_started)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, on_train_epoch_completed)
    trainer.add_event_handler(Events.ITERATION_STARTED, on_train_iteration_started)
    trainer.add_event_handler(Events.ITERATION_COMPLETED, on_train_iteration_completed)

    early_stopping = EarlyStopping(patience=2, score_function=lambda engine: -engine.totaloss, trainer=trainer)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, early_stopping)

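    # keep only the 3 checkpoints with the lowest summed epoch loss, saved under the finish/ directory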
    checkpoint = Checkpoint(to_save={'estimator': estimator, 'optimizer': optimizer, 'trainer': trainer},
                            save_handler='finish',
                            score_function=lambda engine: -engine.totaloss,
                            n_saved=3,
                            filename_pattern='{name}-{global_step}-loss-{score}.pt')
    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint)

    dataloader = DataLoader(QuestionDataset(), batch_size=8, collate_fn=collate_function)
    trainer.run(dataloader, max_epochs=50)


if __name__ == '__main__':
    do_train()

4. Computing Embeddings with the BERT Model

Encode all questions with the trained BERT model and store the resulting embeddings. The complete code is as follows:

from sentence_bert import SentenceEmbeddingBert
from transformers import BertTokenizer
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import pickle
import numpy as np


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def build_embedding():
    estimator = SentenceEmbeddingBert().to(device).eval()
    checkpoint = torch.load('finish/checkpoint-None-loss--102.3389.pt')
    estimator.load_state_dict(checkpoint['estimator'])
    tokenizer = BertTokenizer.from_pretrained('pretrained/bert-base-chinese')
    questions = [line.strip() for line in open('data/questions_selected.txt')]

    def collate_function(batch_data):
        inputs = tokenizer(batch_data, padding='longest', return_token_type_ids=False, return_tensors='pt')
        inputs = {key: value.to(device) for key, value in inputs.items()}
        return inputs

    dataloader = DataLoader(questions, batch_size=16, collate_fn=collate_function)
    embeddings = []
    progress = tqdm(range(len(dataloader)))
    for inputs in dataloader:
        with torch.no_grad():
            outputs = estimator.get_inputs_embedding(**inputs)
            embeddings.extend(outputs)
            progress.update()
    progress.close()

    pickle.dump(embeddings, open('data/bert_question_embeded.pkl', 'wb'))

if __name__ == '__main__':
    build_embedding()

5. Enter a Question and Check the Recall Results

from sentence_bert import SentenceEmbeddingBert
from transformers import BertTokenizer
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import pickle
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def predict(inputs):

    estimator = SentenceEmbeddingBert().eval().to(device)
    checkpoint = torch.load('finish/checkpoint-None-loss--102.3389.pt')
    estimator.load_state_dict(checkpoint['estimator'])
    tokenizer = BertTokenizer.from_pretrained('pretrained/bert-base-chinese')
    questions_embeds = pickle.load(open('data/bert_question_embeded.pkl', 'rb'))
    questions_select = [line.strip() for line in open('data/questions_selected.txt')]

    inputs = tokenizer([inputs], return_token_type_ids=False, return_tensors='pt')
    inputs = {key: value.to(device) for key, value in inputs.items()}
    simis = []
    with torch.no_grad():
        embedding = estimator.get_inputs_embedding(**inputs)
        for embed in questions_embeds:
            simi = torch.cosine_similarity(embedding, embed.unsqueeze(0))
            simis.append(simi.squeeze().item())

    indexes = np.argsort(-np.array(simis))[:5]
    for index in indexes:
        print(round(simis[index], 4), questions_select[index])


def test():

    inputs = '我得了痔疮了,真疼啊,用什么药啊?'
    print('输入问题:', inputs)
    print('相似问题:')
    predict(inputs)
    print('-' * 60)

    inputs = '肚子疼得厉害,我是不是得什么病了,咋办啊?'
    print('输入问题:', inputs)
    print('相似问题:')
    predict(inputs)


if __name__ == '__main__':
    test()

Program output:

输入问题: 我得了痔疮了,真疼啊,用什么药啊?
相似问题:
0.9544 我痔疮又犯了该用什么药我痔疮又犯了,该用什么药好呢?
0.9537 痔疮用什么药好,我现在会疼
0.9389 痔疮用什么药好?请帮忙回答下好吗?
0.9367 痔疮用什么药疗效快啊?
0.9357 痔疮用药可以治愈吗????
------------------------------------------------------------
输入问题: 肚子疼得厉害,我是不是得什么病了,咋办啊?
相似问题:
0.935 有时候我胃痛的厉害,请问是不是胃病?应注意什么?
0.932 肛门痒是怎么回事,怎么办?
0.9289 我肚子疼,还有头晕耳鸣,眼睛花,我该怎么办?
0.9161 总是大便拉稀,肚子很痛。这个是不是肠炎啊?
0.9123 我是得了一种叫做痔疮的病好像小肚子有石头很重我发现拉出来的屎里面都有血有时候头晕人也没有血色干一点活儿就喘气不停吃不了多少东西肚子胀身体总是觉得累我觉得拉屎肛门痛痔疮会引起怎样的危险?