Encoder-Decoder models with attention can use many different attention mechanisms. This article summarizes, based on the original papers, how Bahdanau attention and Luong attention are computed.
- Bahdanau Attention
- Luong Attention
Reference: https://www.zhihu.com/question/68482809/answer/1742071699
1. Bahdanau Attention
Paper: Neural Machine Translation By Jointly Learning To Align And Translate

The alignment score is computed as
score_ij = v_a^T · tanh(W_a · s_{i-1} + U_a · h_j)
where score_ij is the score between the decoder hidden state at time step i-1 and the encoder hidden state at time step j:
1. s_{i-1}: the decoder hidden state from the previous time step (the state entering decoding step i);
2. h_j: the encoder hidden state at time step j;
3. W_a: the parameters of the linear transformation applied to s_{i-1};
4. U_a: the parameters of the linear transformation applied to h_j;
5. v_a: the parameters of the linear transformation applied to the tanh output;
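The scores over all encoder time steps are then normalized with a softmax to obtain the attention weights, and the context vector for decoding step i is the weighted sum of the encoder hidden states (as in the original paper):
α_ij = exp(score_ij) / Σ_k exp(score_ik)
c_i = Σ_j α_ij · h_j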

Example code:
import torch
import torch.nn as nn
import torch.nn.functional as F


class BahdanauAttention(nn.Module):

    def __init__(self, encoder_hidden_dim, decoder_hidden_dim, attn_dim):
        super(BahdanauAttention, self).__init__()
        self.decoder_linear = nn.Linear(decoder_hidden_dim, attn_dim)
        self.encoder_linear = nn.Linear(encoder_hidden_dim, attn_dim)
        self.score_linear = nn.Linear(attn_dim, 1)

    def forward(self, value, query):
        # value (encoder output) has shape (batch_size, seq_len, encoder_hidden_dim)
        # query (decoder hidden state) has shape (batch_size, 1, decoder_hidden_dim)
        # q has shape (batch_size, 1, attn_dim)
        # k has shape (batch_size, seq_len, attn_dim)
        q = self.decoder_linear(query)
        k = self.encoder_linear(value)
        # score has shape (batch_size, seq_len, 1)
        # attn_weight has shape (batch_size, seq_len, 1)
        score = self.score_linear(torch.tanh(q + k))
        attn_weight = F.softmax(score, dim=1)
        # attn_tensor has shape (batch_size, encoder_hidden_dim)
        attn_tensor = torch.sum(attn_weight * value, dim=1)
        return attn_tensor, attn_weight


def test():
    # encoder output tensor: batch size 32, seq_len 300, hidden dimension 256
    encoder_output = torch.randn(32, 300, 256)
    # decoder hidden state (previous time step): batch size 32, seq_len 1, hidden dimension 256
    decoder_hidden = torch.randn(32, 1, 256)
    attention = BahdanauAttention(encoder_hidden_dim=256, decoder_hidden_dim=256, attn_dim=64)
    attn_tensor, attn_weight = attention(encoder_output, decoder_hidden)
    print(attn_tensor.shape)
    print(attn_weight.shape)


if __name__ == '__main__':
    test()
Program output:
torch.Size([32, 256])
torch.Size([32, 300, 1])
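In a complete seq2seq decoder, the context vector attn_tensor returned above is typically concatenated with the embedding of the current target token before being fed to the recurrent cell. Below is a minimal standalone sketch of one such decoding step, assuming a single-layer GRU decoder; the tensor names and the embedding dimension are illustrative and not part of the original example:
import torch
import torch.nn as nn

batch_size, embed_dim, hidden_dim = 32, 128, 256

# Illustrative inputs matching the shapes of the example above
embedded_input = torch.randn(batch_size, 1, embed_dim)      # embedding of the current target token
attn_tensor = torch.randn(batch_size, hidden_dim)           # context vector from BahdanauAttention
decoder_hidden = torch.randn(1, batch_size, hidden_dim)     # previous decoder hidden state (num_layers, batch, hidden)

gru = nn.GRU(input_size=embed_dim + hidden_dim, hidden_size=hidden_dim, batch_first=True)

# Concatenate the token embedding and the context vector along the feature dimension
rnn_input = torch.cat([embedded_input, attn_tensor.unsqueeze(1)], dim=-1)  # (32, 1, 384)
output, next_hidden = gru(rnn_input, decoder_hidden)
print(output.shape)       # torch.Size([32, 1, 256])
print(next_hidden.shape)  # torch.Size([1, 32, 256])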
2. Luong Attention
Paper: Effective Approaches To Attention-Based Neural Machine Translation
Luong attention takes as input the encoder hidden states from all time steps and the decoder hidden state at the current time step. The paper gives the following three ways to compute the score:
- dot: score(h_t, h̄_s) = h_t^T · h̄_s
- general: score(h_t, h̄_s) = h_t^T · W_a · h̄_s
- concat: score(h_t, h̄_s) = v_a^T · tanh(W_a · [h_t; h̄_s])
1. h̄_s (written with an overbar in the paper) denotes the encoder hidden states over all time steps;
2. h_t denotes the decoder hidden state at the current time step, not the hidden state of the previous time step;
3. The scores are passed through a softmax to obtain the attention weight distribution, which finally yields the context vector c_t.
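Spelled out in the paper's notation, the remaining steps are:
a_t(s) = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))
c_t = Σ_s a_t(s) · h̄_s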
Example code:
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuongAttention(nn.Module):

    def __init__(self, encoder_hidden_dim, decoder_hidden_dim, attn_dim):
        super(LuongAttention, self).__init__()
        # general: W_a applied to the encoder states
        self.encoder_linear = nn.Linear(encoder_hidden_dim, decoder_hidden_dim)
        # concat: W_a applied to [h_t; h̄_s], then v_a producing a scalar score
        self.linear1 = nn.Linear(encoder_hidden_dim + decoder_hidden_dim, attn_dim)
        self.linear2 = nn.Linear(attn_dim, 1)

    def forward(self, value, query):
        # value (encoder outputs) has shape (batch_size, seq_len, encoder_hidden_dim)
        # query (current decoder hidden state) has shape (batch_size, 1, decoder_hidden_dim)

        # dot: requires encoder_hidden_dim == decoder_hidden_dim,
        # score has shape (batch_size, seq_len, 1)
        score = value @ query.transpose(1, 2)
        print(score.shape)

        # general: score has shape (batch_size, seq_len, 1)
        score = self.encoder_linear(value) @ query.transpose(1, 2)
        print(score.shape)

        # concat: score has shape (batch_size, seq_len, 1)
        expanded_query = query.expand(-1, value.size(1), -1)
        score = self.linear2(torch.tanh(self.linear1(torch.cat([expanded_query, value], dim=-1))))
        print(score.shape)

        # attention weights and context vector computed from the last score
        attn_weight = F.softmax(score, dim=1)
        attn_tensor = torch.sum(attn_weight * value, dim=1)
        return attn_tensor, attn_weight


def test():
    # encoder output tensor: batch size 32, seq_len 300, hidden dimension 256
    encoder_output = torch.randn(32, 300, 256)
    # current decoder hidden state: batch size 32, seq_len 1, hidden dimension 256
    decoder_hidden = torch.randn(32, 1, 256)
    attention = LuongAttention(encoder_hidden_dim=256, decoder_hidden_dim=256, attn_dim=64)
    attn_tensor, attn_weight = attention(encoder_output, decoder_hidden)
    print(attn_tensor.shape)
    print(attn_weight.shape)


if __name__ == '__main__':
    test()
Program output:
torch.Size([32, 300, 1])
torch.Size([32, 300, 1])
torch.Size([32, 300, 1])
torch.Size([32, 256])
torch.Size([32, 300, 1])
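In the paper, the context vector is not used directly as the attention output; it is combined with the current decoder hidden state into the attentional vector h̃_t = tanh(W_c · [c_t; h_t]), which then feeds the output layer. Below is a minimal standalone sketch of that final step; the nn.Linear named w_c is illustrative and not part of the example above:
import torch
import torch.nn as nn

batch_size, hidden_dim = 32, 256

attn_tensor = torch.randn(batch_size, hidden_dim)          # context vector c_t from LuongAttention
decoder_hidden = torch.randn(batch_size, 1, hidden_dim)    # current decoder hidden state h_t

# w_c maps the concatenation [c_t; h_t] back to the hidden dimension
w_c = nn.Linear(2 * hidden_dim, hidden_dim)
h_tilde = torch.tanh(w_c(torch.cat([attn_tensor, decoder_hidden.squeeze(1)], dim=-1)))
print(h_tilde.shape)  # torch.Size([32, 256])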
