SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.
You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.
The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.
Paper: https://arxiv.org/pdf/1908.10084.pdf
Install it with:
pip install sentence-transformers
1. Sentence Similarity
See https://www.sbert.net/docs/pretrained_models.html for the available pre-trained models. The examples below load the pre-trained model paraphrase-multilingual-MiniLM-L12-v2.
We use cosine similarity to compute how similar two input sentences are. Example code:
from sentence_transformers import SentenceTransformer
import sentence_transformers.util as util
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def test():
    # Build the model
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', device=device)

    # Encode single sentences ('I am Chinese' / 'I am a Chinese person living in Beijing')
    sentence_embedding1 = model.encode('我是中国人')
    sentence_embedding2 = model.encode('我是生活在北京的中国人')
    similarity = util.cos_sim(sentence_embedding1, sentence_embedding2)
    print('Similarity:', '%.2f' % similarity)

    # Encode multiple sentences in one batch
    sentences = ['我是中国人', '我是生活在北京的中国人']
    sentence_embeddings = model.encode(sentences)
    similarity = util.cos_sim(sentence_embeddings[0], sentence_embeddings[1])
    print('Similarity:', '%.2f' % similarity)


if __name__ == '__main__':
    test()
Program output:
Similarity: 0.89
Similarity: 0.89
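The library also ships a helper for the paraphrase mining use case mentioned in the introduction: sentence_transformers.util.paraphrase_mining compares all sentences in a single list against each other. A minimal sketch (the sentence list is illustrative, not from the original):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
sentences = ['我是中国人', '我是生活在北京的中国人', '我今天去公园玩了很长时间']

# Each entry is [score, i, j]: the cosine similarity of sentences[i] and
# sentences[j], sorted by decreasing score
pairs = paraphrase_mining(model, sentences)
for score, i, j in pairs:
    print('%.2f' % score, sentences[i], '<->', sentences[j])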
2. Semantic Search
First compute embeddings for the N sentences in the corpus, then use the query embedding to retrieve the K semantically closest sentences from the corpus.
The sentence_transformers.util.semantic_search function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for information retrieval / semantic search on corpora of up to about 1 million entries.
Parameters:
- query_embeddings – A 2 dimensional tensor with the query embeddings.
- corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.
- query_chunk_size – Number of queries processed simultaneously (100 by default). Increasing this value increases the speed, but requires more memory.
- corpus_chunk_size – Number of corpus entries scanned at a time (100k by default). Increasing this value increases the speed, but requires more memory.
- top_k – Retrieve top k matching entries.
- score_function – Function for computing scores. By default, cosine similarity.
Returns:
- A list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing cosine similarity score.
Example code:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.util import semantic_search
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def test():
    # Build the model
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', device=device)

    # Corpus sentences
    corpus = ['我是中国人', '我是生活在北京的中国人', '我今天去公园玩了很长时间', '前天老王去他邻居家玩了']
    # Query sentence
    query = '我是个生活在北京的中国平民'

    # Corpus embeddings
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True, device=device)
    # Query embedding
    query_embedding = model.encode(query, convert_to_tensor=True, device=device)

    # Retrieve the top-K semantically closest sentences
    results = semantic_search(query_embeddings=query_embedding,
                              corpus_embeddings=corpus_embeddings,
                              top_k=2,
                              score_function=cos_sim)

    # Print the results
    print('Search results:', results)
    search_sentences = [corpus[result['corpus_id']] for result in results[0]]
    print('Search results:', search_sentences)


if __name__ == '__main__':
    test()
Program output:
Search results: [[{'corpus_id': 1, 'score': 0.9394572377204895}, {'corpus_id': 0, 'score': 0.8251887559890747}]]
Search results: ['我是生活在北京的中国人', '我是中国人']
To speed up search on large corpora, you can use Elasticsearch, Annoy, Faiss, or hnswlib; see:
https://www.sbert.net/examples/applications/semantic-search/README.html#speed-optimization
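For illustration, here is a minimal sketch of approximate nearest neighbor search with hnswlib (pip install hnswlib). This is not from the original post, and the index parameters (ef_construction, M, ef) are typical example values, not tuned recommendations:

import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
corpus = ['我是中国人', '我是生活在北京的中国人', '我今天去公园玩了很长时间', '前天老王去他邻居家玩了']
corpus_embeddings = model.encode(corpus)

# Build the HNSW index; the 'cosine' space stores cosine distance = 1 - cosine similarity
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus), ef_construction=200, M=16)
index.add_items(corpus_embeddings, list(range(len(corpus))))
index.set_ef(50)  # higher ef: better recall, slower queries

query_embedding = model.encode('我是个生活在北京的中国平民')
labels, distances = index.knn_query(query_embedding, k=2)
for corpus_id, distance in zip(labels[0], distances[0]):
    print('%.2f' % (1 - distance), corpus[corpus_id])

For a corpus this small, the exact semantic_search above is the better choice; an ANN index pays off once the corpus grows to hundreds of thousands of entries.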
3. Fine-tuning a Model
The training overview is at https://www.sbert.net/docs/training/overview.html. Example code:
from sentence_transformers import SentenceTransformer
from sentence_transformers import InputExample
from sentence_transformers import losses
from sentence_transformers import models
from sentence_transformers import evaluation
from torch.utils.data import DataLoader

if __name__ == '__main__':
    # Define the model: either from scratch, as here (a transformer plus a
    # pooling layer), or by loading a pre-trained SentenceTransformer model
    word_embedding_model = models.Transformer('bert-base-chinese')
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    # Define your train examples. You need more than just two examples...
    train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                      InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

    # Define your train dataset, the dataloader and the train loss
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)

    # Evaluation data: the evaluator matches sentences1[i] with sentences2[i],
    # computes their cosine similarity and compares it to scores[i]
    sentences1 = ['This list contains the first column', 'With your sentences', 'You want your model to evaluate on']
    sentences2 = ['Sentences contains the other column', 'The evaluator matches sentences1[i] with sentences2[i]', 'Compute the cosine similarity and compares it to scores[i]']
    scores = [0.3, 0.6, 0.2]
    evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1=sentences1, sentences2=sentences2, scores=scores)

    # Tune the model
    model.fit(train_objectives=[(train_dataloader, train_loss)],
              epochs=1,
              warmup_steps=100,
              evaluator=evaluator,
              evaluation_steps=500)

    # Save the fine-tuned model
    model.save('model')
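The directory written by model.save('model') can be reloaded like any pre-trained model. A brief usage sketch (not from the original):

from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the local 'model' directory saved above
model = SentenceTransformer('model')
embedding = model.encode('我是中国人')
print(embedding.shape)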