PaddleNLP Data Augmentation Functions

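The PaddleNLP library provides easy-to-use implementations of text data augmentation, mainly:

WordSubstitute: word substitution
WordDelete: word deletion
WordSwap: word swapping
WordInsert: word insertion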

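In addition, WordSubstitute and WordInsert support four substitution and insertion strategies:

synonym: substitute or insert synonyms
homonym: substitute or insert homonyms (same-sounding words, judging by the example output below)
custom: substitute or insert words from a user-defined dictionary
mlm: substitute or insert words predicted by a masked language model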

A simple example of a custom dictionary:

{"中国": ["德国", "美国", "日本"]}
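
The example code in this post passes the dictionary to WordSubstitute through custom_file_path='aug.txt'. As a minimal sketch, assuming the file simply holds the JSON object shown above, it could be created as follows (the helper name write_custom_dict is made up for illustration and is not part of PaddleNLP):

```python
import json


def write_custom_dict(path, mapping):
    """Write a word -> candidate-list mapping to `path` as UTF-8 JSON.

    Hypothetical helper, not a PaddleNLP function.
    """
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Chinese words readable in the file
        json.dump(mapping, f, ensure_ascii=False)


# Assumption: the custom dictionary file is plain JSON, as in the example above
custom_dict = {"中国": ["德国", "美国", "日本"]}
write_custom_dict("aug.txt", custom_dict)
```

Writing the file with an explicit UTF-8 encoding avoids platform-dependent defaults when the dictionary contains Chinese text.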

Example usage:

from paddlenlp.dataaug import WordSubstitute
from paddlenlp.dataaug import WordDelete
from paddlenlp.dataaug import WordSwap
from paddlenlp.dataaug import WordInsert


# 1. Word substitution
def test01():

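    # aug_type: where replacement words come from; the values covered here are
    #   ['synonym', 'homonym', 'custom', 'mlm']
    #   (synonym dictionary, homonym dictionary, custom dictionary, MLM model)
    # create_n: number of augmented texts to generate
    # aug_percent: fraction of the words in the sentence to replace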


    # 1.1 Synonym substitution
    aug = WordSubstitute(aug_type='synonym', create_n=2, aug_percent=0.8)
    text = aug.augment('我是中国人，我生活在美丽的中国大地上')
    # Output: ['我是中国人，我生活在美丽的中国举世上', '我是中国人，我生活在美丽的赤县大地上']
    # First text: "大地" was replaced with "举世"; second text: "中国" was replaced with "赤县"
    print(text)


    # 1.2 Custom dictionary substitution
    aug = WordSubstitute(aug_type='custom', create_n=1, aug_percent=1, custom_file_path='aug.txt')
    # Words found in the dictionary are replaced with a randomly chosen candidate
    text = aug.augment('我是中国人，我生活在美丽的中国大地上')
    # Output: ['我是中国人，我生活在美丽的德国大地上']
    # "中国" was replaced with a random candidate from {"中国": ["德国", "美国", "日本"]}
    print(text)

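    # 1.3 Homonym substitution
    # In English, words with the same form (pronunciation and spelling) but
    # unrelated meanings are called homonyms, e.g. light (illumination) and light (not heavy).
    # In the Chinese example below, the replacement is a word that sounds the same.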

    aug = WordSubstitute(aug_type='homonym', create_n=1, aug_percent=1)
    text = aug.augment('我是中国人，我生活在美丽的中国大地上')
    # Output: ['我是中国人，我声活在美丽的中国大地上']
    # "生活" was replaced with the same-sounding "声活"
    print(text)

    # 1.4 MLM
    # Uses the ERNIE 1.0 pretrained model by default
    # Some words are randomly masked and the model predicts them, producing a new sentence
    aug = WordSubstitute(aug_type='mlm', create_n=1, aug_percent=1)
    text = aug.augment('我是中国人，我生活在美丽的中国大地上')
    # Output: ['我是中国人，我生活在广袤的中国大地上']
    # "美丽" was masked and the model predicted "广袤" in its place
    print(text)


# 2. Randomly delete some words
def test02():

    aug = WordDelete()
    text = aug.augment('我是中国人，我生活在美丽的中国大地上')
    # Output: ['我是中国人，我生活在美丽的中国上']
    # "大地" was deleted from the original text
    print(text)


# 3. Randomly swap pairs of words
def test03():

    aug = WordSwap(aug_n=2)
    text = aug.augment('以下场景实操教程均已提供数据集，供您快速体验零代码AI开发落地。')
    # Output: ['以下场景实操教程均已提供集数据，供您快速零体验代码AI开发落地。']
    # "数据" and "集" swapped places, and so did "体验" and "零"
    print(text)


# 4. Random insertion
def test04():

    # Randomly insert synonyms, homonyms, custom dictionary words, etc.
    aug = WordInsert(aug_type='synonym')
    text = aug.augment('我是中国人，我生活在美丽的中国大地上')
    # Output: ['我是中国人唐人，我生活在美丽的中国大地上']
    # The synonym "唐人" was inserted after "中国人"
    print(text)


if __name__ == '__main__':
    test04()
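
To make the four operations above more concrete, here is a library-free sketch of what delete, swap, substitute, and insert do to a list of words. It is not PaddleNLP's implementation: the sentence is segmented by hand, the lookup table stands in for a synonym or custom dictionary, and every function name is made up for illustration.

```python
import random


def delete_words(tokens, n, rng):
    """Drop n randomly chosen tokens, keeping the rest in order."""
    drop = set(rng.sample(range(len(tokens)), n))
    return [t for i, t in enumerate(tokens) if i not in drop]


def swap_adjacent(tokens, n, rng):
    """Perform n swaps of a randomly chosen adjacent pair (needs >= 2 tokens)."""
    out = list(tokens)
    for _ in range(n):
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def substitute_words(tokens, table, rng):
    """Replace every token found in `table` with a random candidate."""
    return [rng.choice(table[t]) if t in table else t for t in tokens]


def insert_words(tokens, table, rng):
    """After every token found in `table`, insert a random candidate."""
    out = []
    for t in tokens:
        out.append(t)
        if t in table:
            out.append(rng.choice(table[t]))
    return out


rng = random.Random(0)  # fixed seed so repeated runs give the same result
# Hand-segmented sentence; a real pipeline would use a word segmenter
tokens = ["我", "生活", "在", "美丽", "的", "中国", "大地", "上"]
table = {"中国": ["德国", "美国", "日本"]}

deleted = delete_words(tokens, 1, rng)
swapped = swap_adjacent(tokens, 2, rng)
substituted = substitute_words(tokens, table, rng)
inserted = insert_words(tokens, table, rng)

for name, result in [("delete", deleted), ("swap", swapped),
                     ("substitute", substituted), ("insert", inserted)]:
    print(name, "".join(result))
```

Each helper returns a new list and leaves the input untouched, so the same tokenized sentence can be fed to several operations in turn.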
