Text Data Augmentation – EDA

In "Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks", the authors note:

Text classifiers trained on small datasets tend to perform poorly, so the authors experimented with several data augmentation methods inspired by techniques used in computer vision, and found that they help train more robust models. For each sentence selected from the training set, EDA randomly chooses and applies one of the following operations:

  1. Synonym Replacement (SR)
  2. Random Insertion (RI)
  3. Random Swap (RS)
  4. Random Deletion (RD)

Paper link: https://arxiv.org/pdf/1901.11196.pdf

Below is an implementation based on the paper's core ideas, with a few modifications.

1. Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

from contextlib import contextmanager
import sys
import os
import math

@contextmanager
def no_print():
    # temporarily silence stdout (the synonyms package prints while loading its word vectors)
    sys.stdout = open(os.devnull, 'w')
    try:
        yield
    finally:
        sys.stdout = sys.__stdout__

import jieba
import logging
jieba.setLogLevel(logging.CRITICAL)

with no_print():
    import synonyms

import random


def synonym_replacement(sentence, proportion=1.0):
    with open('data/stopwords.txt', encoding='utf-8') as f:
        stopwords = set(word.strip() for word in f)
    sentence_words = jieba.lcut(sentence)
    non_stopwords = [word for word in set(sentence_words) if word not in stopwords]
    # number of words to sample, as a proportion of the non-stopword count
    select_number = int(math.floor(len(non_stopwords) * proportion))
    random_words = random.sample(non_stopwords, k=select_number)

    for current_word in random_words:
        # fetch the synonym list for the current word
        synonym_words, _ = synonyms.nearby(current_word, 5)
        # nearby lists the word itself first; fall back to the word when it has no synonyms
        synonym_word = random.choice(synonym_words[1:]) if len(synonym_words) > 1 else current_word
        # replace every occurrence of the word with the chosen synonym
        sentence_words = [synonym_word if origin_word == current_word else origin_word for origin_word in sentence_words]

    return ''.join(sentence_words)


if __name__ == '__main__':
    
    data = '来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!'
    print('Before SR:', data)
    data = synonym_replacement(data)
    print('After SR:', data)

Program output:

Before SR: 来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!
After SR: 来杭州四天都没法织围脖,一直都在忙活,再加上又过敏性了,好无助《眼泪》不过杭州给我的觉得灰常好!
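
For reference, synonyms.nearby returns a tuple of (words, scores), and the first entry of the word list is the query word itself, which is why the code above skips it with the [1:] slice. A quick check (the exact neighbors depend on the word-vector data shipped with your installed version of the library):

import synonyms

words, scores = synonyms.nearby('感冒', 5)
# words[0] is the query word itself; scores holds the aligned similarity values
print(words)
print(scores)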

2. Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

from contextlib import contextmanager
import sys
import os
import math

@contextmanager
def no_print():
    # temporarily silence stdout (the synonyms package prints while loading its word vectors)
    sys.stdout = open(os.devnull, 'w')
    try:
        yield
    finally:
        sys.stdout = sys.__stdout__

import jieba
import logging
jieba.setLogLevel(logging.CRITICAL)

with no_print():
    import synonyms

import random


def random_insertion(sentence, proportion=0.8):

    with open('data/stopwords.txt', encoding='utf-8') as f:
        stopwords = set(word.strip() for word in f)
    sentence_words = jieba.lcut(sentence)
    non_stopwords = [word for word in set(sentence_words) if word not in stopwords]
    # number of words to sample, as a proportion of the non-stopword count
    select_number = int(math.floor(len(non_stopwords) * proportion))
    random_words = random.sample(non_stopwords, k=select_number)

    insert_words = []
    for current_word in random_words:
        # fetch the synonym list for the current word
        synonym_words, _ = synonyms.nearby(current_word, 5)
        # nearby lists the word itself first; fall back to the word when it has no synonyms
        synonym_word = random.choice(synonym_words[1:]) if len(synonym_words) > 1 else current_word
        insert_words.append(synonym_word)

    # insert each chosen synonym at a random position
    for current_word in insert_words:
        index = random.choice(range(0, len(sentence_words)))
        sentence_words.insert(index, current_word)

    return ''.join(sentence_words)


if __name__ == '__main__':

    data = '来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!'
    print('Before RI:', data)
    data = random_insertion(data)
    print('After RI:', data)

Program output:

Before RI: 来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!
After RI: 来广州忙碌两天杭州都没能却没能织围脖,一直都在灰常好忙,蓝雨织围脖再几天加上又感冒了,好痛苦《泪》不过样子广州给我的感觉灰常好!
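
synonym_replacement and random_insertion duplicate the same synonym-lookup logic. A small helper could factor it out; the name get_synonym is mine, and the sketch assumes the same synonyms import (done under no_print() above) as the snippets so far:

import random
import synonyms  # imported under no_print() above to silence its load-time output


def get_synonym(word, size=5):
    # return a random synonym of word, or the word itself when none is found
    synonym_words, _ = synonyms.nearby(word, size)
    if len(synonym_words) > 1:
        # skip the first entry, which is the word itself
        return random.choice(synonym_words[1:])
    return word

Both functions could then replace their repeated nearby/choice block with a single call to get_synonym(current_word).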

3. Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

import jieba
import logging
jieba.setLogLevel(logging.CRITICAL)
import random
import string as en
import zhon.hanzi as cn


def random_swap(sentence, k=8):
    sentence_words = jieba.lcut(sentence)
    punctuation = en.punctuation + cn.punctuation
    # only swap non-punctuation words, so punctuation stays in place
    index_range = [index for index, word in enumerate(sentence_words) if word not in punctuation]
    # nothing to swap if the sentence contains no non-punctuation words
    if not index_range:
        return sentence
    for _ in range(k):
        # the two indices may coincide, making that swap a no-op
        index1 = random.choice(index_range)
        index2 = random.choice(index_range)
        sentence_words[index1], sentence_words[index2] = sentence_words[index2], sentence_words[index1]

    return ''.join(sentence_words)

if __name__ == '__main__':

    data = '来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!'
    print('Before RS:', data)
    data = random_swap(data)
    print('After RS:', data)

Program output:

Before RS: 来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!
After RS: 来广州两天都了围脖,一直都加上忙,又广州再感冒在,好痛苦《泪》不过灰常好给我的感觉没能织!
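
One difference from the paper: random_swap above uses a fixed k, while the paper scales the number of operations with sentence length, setting n = α·l for SR, RI, and RS, where l is the word count and α a tunable fraction. A sketch adapting random_swap to that convention (random_swap_alpha and the default alpha are illustrative, not from the original code):

import jieba


def random_swap_alpha(sentence, alpha=0.1):
    # the paper sets the number of operations per sentence to n = alpha * l;
    # tokenize once here to measure l, then reuse random_swap with that k
    length = len(jieba.lcut(sentence))
    return random_swap(sentence, k=max(1, int(alpha * length)))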

4. Random Deletion

Randomly remove each word in the sentence with probability p.

import jieba
import logging
jieba.setLogLevel(logging.CRITICAL)
import random


def random_deletion(sentence, p=0.2):
    sentence_words = jieba.lcut(sentence)
    # keep each word with probability 1 - p
    kept_words = [word for word in sentence_words if random.random() > p]
    # if every word was deleted, keep one random word from the original sentence
    if len(kept_words) == 0:
        kept_words = [random.choice(sentence_words)]
    return ''.join(kept_words)


if __name__ == '__main__':

    data = '来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!'
    print('Before RD:', data)
    data = random_deletion(data)
    print('After RD:', data)

Program output:

Before RD: 来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!
After RD: 来广州两天没能织围脖一直都在忙,再加上又好痛苦《》不过给我的感觉灰常好!
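
Finally, as described at the top, EDA randomly selects one operation to apply to each chosen training sentence. A minimal driver sketch, assuming the four functions above are defined in the same module (the eda name is mine, not from the paper):

import random


def eda(sentence):
    # randomly select and apply one of the four EDA operations
    operation = random.choice([
        synonym_replacement,
        random_insertion,
        random_swap,
        random_deletion,
    ])
    return operation(sentence)


if __name__ == '__main__':
    data = '来广州两天都没能织围脖,一直都在忙,再加上又感冒了,好痛苦《泪》不过广州给我的感觉灰常好!'
    print('Before EDA:', data)
    print('After EDA:', eda(data))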