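1. Causal convolution

Causal convolution is convolution adapted to causal language modeling. It is still the ordinary convolution we already know, with a few tricks added so that:

The number of output tokens equals the number of input tokens
Each token can see only itself and the tokens before it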

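When we use Conv1D to compute token representations for text, the convolution proceeds as shown in the figure below:

Problems:

With kernel_size=2, 6 input tokens become 5 tokens after the convolution.
With kernel_size=3, the representation of the first token already uses information from the tokens that follow it.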
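
Both problems can be reproduced with a few lines of plain Python. This is only a minimal sketch: each token is a single number instead of a vector, the helper names `conv1d_valid` and `conv1d_centered_k3` are made up for illustration, and the second helper assumes the figure used centered (one position per side) padding for kernel_size=3.

```python
def conv1d_valid(xs, kernel):
    """Plain 1D convolution (cross-correlation): no padding, stride 1."""
    k = len(kernel)
    return [sum(kernel[j] * xs[i + j] for j in range(k))
            for i in range(len(xs) - k + 1)]


def conv1d_centered_k3(xs, kernel):
    """kernel_size=3 with one zero padded on each side (assumed layout)."""
    return conv1d_valid([0.0] + list(xs) + [0.0], kernel)


tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # 6 scalar "tokens"

# Problem 1: kernel_size=2 shrinks 6 tokens to 5.
out_k2 = conv1d_valid(tokens, [1.0, 1.0])
print(len(tokens), len(out_k2))              # 6 5

# Problem 2: with kernel_size=3, output 0 depends on token 1 (a future token).
before = conv1d_centered_k3(tokens, [1.0, 1.0, 1.0])
changed = list(tokens)
changed[1] = 100.0                           # modify only the second token
after = conv1d_centered_k3(changed, [1.0, 1.0, 1.0])
print(before[0], after[0])                   # 3.0 101.0
```

The first output changes when only the second token changes, which is exactly the leak from later tokens that a causal model must avoid.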

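The fix is shown in the figure below:

Pad the input sequence on both sides according to kernel_size (kernel_size - 1 positions per side), then run the convolution. Padding both sides produces kernel_size - 1 more outputs than there are input tokens, and those extra outputs on the right depend on the right-hand padding. Strictly speaking, padding only the left side would be enough; because we pad both sides, we use slicing to drop the trailing outputs. After slicing, the output has the same number of tokens as the input, and each token sees only itself and the tokens before it; later tokens are invisible.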
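
The pad-then-slice trick can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the Paddle implementation: tokens are scalars, stride is 1, and the function name `causal_conv1d` is made up. The `dilation` argument is included so the same helper also covers dilated convolution (padding becomes `(kernel_size - 1) * dilation`).

```python
def causal_conv1d(xs, kernel, dilation=1):
    """Pad both sides by (k-1)*d, convolve, then drop the last (k-1)*d outputs."""
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    # Stride-1 convolution over the padded input: len(xs) + pad outputs.
    full = [sum(kernel[j] * padded[i + j * dilation] for j in range(k))
            for i in range(len(padded) - pad)]
    # "Chomp": remove the outputs that looked at the right-hand padding.
    return full[:len(full) - pad] if pad else full


tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
out = causal_conv1d(tokens, [1.0, 1.0])       # kernel_size = 2
print(out)                                    # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

# Causality check: changing the last token must not affect earlier outputs.
changed = list(tokens)
changed[-1] = 100.0
out_changed = causal_conv1d(changed, [1.0, 1.0])
print(out[:5] == out_changed[:5])             # True
```

Output i is computed from tokens i - (k-1)*d through i only, so the output length matches the input and no later token is visible.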

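2. Dilated convolution

We want the last token to capture the semantics of the whole input, just as an RNN uses the hidden state of the final time step to represent the entire input sequence. With the result above, however, the last token sees only the previous token and itself, which is not enough to represent the whole sequence. To fix this, we stack more convolution layers and use dilated convolution, as shown in the figure below: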

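As the figure shows, we stacked 3 convolution layers. A brief analysis:

The last token A of the first layer can attend to only 2 tokens
The last token B of the second layer can attend to 4 tokens
The last token C of the third layer can attend to 6 tokens

The higher the layer, the more preceding tokens its last token can attend to. Stacking several dilated causal convolutions therefore yields a vector representation that covers more of the preceding tokens. For a classification task, we simply use the vector of the last token in the last layer to make the prediction.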
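
The counts 2, 4 and 6 can be reproduced with a small receptive-field calculation. The sketch below assumes one causal convolution per layer with kernel_size=2 and dilations 1, 2, 4 (doubling per layer is a common choice for TCNs, but the figure's exact dilations are an assumption here), on a 6-token input. The function name `receptive_fields` is made up for illustration.

```python
def receptive_fields(kernel_size, dilations):
    """Receptive field of the last position after each stacked causal conv layer."""
    rf = 1
    fields = []
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer reaches (k-1)*d steps further back
        fields.append(rf)
    return fields


seq_len = 6
fields = receptive_fields(kernel_size=2, dilations=[1, 2, 4])
print(fields)                          # [2, 4, 8]

# The last token cannot see more real tokens than the sequence contains.
visible = [min(f, seq_len) for f in fields]
print(visible)                         # [2, 4, 6]
```

Under these assumptions the third layer has a receptive field of 8 positions, which already covers the whole 6-token input; the remaining positions fall on the left padding.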

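Also note: if in_channels and out_channels differ, the dimension of each output token changes, so the block's input can no longer be added directly to its output in the residual connection. In that case we add one more Conv1D with kernel size 1 that maps the in_channels dimension to out_channels.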
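
To make the channel mapping concrete, here is a minimal plain-Python sketch of a kernel-size-1 convolution used on the residual path. The helper names, the weights, and the `block_out` values are all invented for illustration; a real layer would learn the weights and also carry a bias.

```python
def pointwise_conv(x, weight):
    """1x1 convolution. x: [in_channels][length], weight: [out_channels][in_channels]."""
    length = len(x[0])
    return [[sum(row[c] * x[c][t] for c in range(len(x))) for t in range(length)]
            for row in weight]


def add_relu(a, b):
    """Element-wise ReLU(a + b) for two [channels][length] arrays of equal shape."""
    return [[max(0.0, p + q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                       # in_channels=2, 3 tokens
block_out = [[-9.0, 0.0, 1.0],
             [0.0, 0.0, 0.0],
             [1.0, 1.0, 1.0]]               # made-up conv output, out_channels=3
w = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]                            # maps 2 channels to 3

res = pointwise_conv(x, w)                  # now [3][3], same shape as block_out
y = add_relu(block_out, res)
print(y)   # [[0.0, 2.0, 4.0], [4.0, 5.0, 6.0], [6.0, 8.0, 10.0]]
```

The 1x1 convolution changes only the channel dimension; the number of tokens stays the same, so the sum is well defined.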

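3. Code implementation

Each layer is not a plain Conv1D layer but a temporal block (Temporal Block), whose architecture is as follows:

Example code (the implementation in Paddle):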

import paddle
import paddle.nn as nn
from paddle.nn.utils import weight_norm


class Chomp1d(nn.Layer):
    """
    Remove the elements on the right.

    Args:
        chomp_size (int): The number of elements removed.
    """

    def __init__(self, chomp_size):
        super(Chomp1d, self).__init__()
        self.chomp_size = chomp_size

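    def forward(self, x):
        # x[:, :, :-0] would be empty, so skip slicing when nothing was padded.
        if self.chomp_size == 0:
            return x
        return x[:, :, :-self.chomp_size]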


class TemporalBlock(nn.Layer):
    """
    The TCN block: two dilated causal convolutions, each followed by ReLU and
    dropout, plus a residual connection.

    Args:
        n_inputs (int): The number of channels in the input tensor.
        n_outputs (int): The number of filters.
        kernel_size (int): The filter size.
        stride (int): The stride size.
        dilation (int): The dilation size.
        padding (int): The number of zeros padded on each side.
        dropout (float, optional): Probability of dropping a unit. Defaults to 0.2.
    """

    def __init__(self,
                 n_inputs,
                 n_outputs,
                 kernel_size,
                 stride,
                 dilation,
                 padding,
                 dropout=0.2):

        super(TemporalBlock, self).__init__()
        self.conv1 = weight_norm(
            nn.Conv1D(n_inputs,
                      n_outputs,
                      kernel_size,
                      stride=stride,
                      padding=padding,
                      dilation=dilation))
        # Chomp1d is used to make sure the network is causal.
        # We pad by (k-1)*d on the two sides of the input for convolution,
        # and then use Chomp1d to remove the (k-1)*d output elements on the right.
        self.chomp1 = Chomp1d(padding)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(dropout)

        self.conv2 = weight_norm(
            nn.Conv1D(n_outputs,
                      n_outputs,
                      kernel_size,
                      stride=stride,
                      padding=padding,
                      dilation=dilation))
        self.chomp2 = Chomp1d(padding)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(dropout)

        self.net = nn.Sequential(self.conv1, self.chomp1, self.relu1,
                                 self.dropout1, self.conv2, self.chomp2,
                                 self.relu2, self.dropout2)
        self.downsample = nn.Conv1D(n_inputs, n_outputs,
                                    1) if n_inputs != n_outputs else None
        self.relu = nn.ReLU()
        self.init_weights()

    def init_weights(self):
        self.conv1.weight.set_value(
            paddle.tensor.normal(0.0, 0.01, self.conv1.weight.shape))
        self.conv2.weight.set_value(
            paddle.tensor.normal(0.0, 0.01, self.conv2.weight.shape))
        if self.downsample is not None:
            self.downsample.weight.set_value(
                paddle.tensor.normal(0.0, 0.01, self.downsample.weight.shape))

    def forward(self, x):
        out = self.net(x)
        res = x if self.downsample is None else self.downsample(x)
        return self.relu(out + res)
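
The padding comment in the code can be checked with the standard Conv1D output-length formula. The sketch below is plain Python rather than Paddle, and `conv1d_out_len` is a hypothetical helper name; it shows that with stride 1 and padding (k-1)*d, removing `padding` elements restores the input length.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution with symmetric zero padding."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 6
print(conv1d_out_len(seq_len, kernel_size=2))          # 5, no padding

for k, d in [(2, 1), (2, 2), (3, 4)]:
    p = (k - 1) * d                                    # padding used by the block
    conv_len = conv1d_out_len(seq_len, k, padding=p, dilation=d)
    # After the convolution there are p extra outputs; chomping p restores seq_len.
    print(k, d, conv_len, conv_len - p)
```

Note that this only works out for stride 1. With a larger stride, for example `conv1d_out_len(6, 2, stride=2, padding=1)`, the length after chomping no longer matches the input, so the residual addition would fail.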

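The weight normalization used in the code reparameterizes each weight vector in a neural network so that its length and direction are decoupled. That is, the weight vector is replaced by two parameters:

weight_g, which represents the length
weight_v, which represents the direction

Weight normalization paper: https://arxiv.org/pdf/1602.07868.pdf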
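
In the paper's notation the weight is recomposed as w = g * v / ||v||. A minimal plain-Python sketch for a single weight vector follows; the function name `compose_weight` is made up, and a framework applies the same idea to slices of the weight tensor rather than to one flat vector.

```python
import math


def compose_weight(g, v):
    """Rebuild the weight vector from its length g and direction vector v."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


g = 2.0                     # plays the role of weight_g (length)
v = [3.0, 4.0]              # plays the role of weight_v (direction, ||v|| = 5)
w = compose_weight(g, v)
w_norm = math.sqrt(sum(x * x for x in w))
print(w, w_norm)            # the norm of w equals g, up to floating-point rounding

# Rescaling v changes nothing: only its direction matters.
w_scaled = compose_weight(g, [30.0, 40.0])
```

Because the norm of w always equals g, an optimizer can adjust the length and the direction of the weight independently.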
