The VGG network was introduced at the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where it achieved excellent results on the image classification task. Its core idea is to improve performance by using small 3×3 convolution kernels and increasing network depth. Compared with the 11×11 and 5×5 kernels used in AlexNet, small kernels reduce the parameter count, while stacking several of them covers the same receptive field and adds extra non-linearity between layers, which significantly improves accuracy.
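The parameter saving is easy to verify with a quick calculation. The sketch below compares two stacked 3×3 convolutions (which cover the same 5×5 receptive field as a single 5×5 convolution) at an arbitrary illustrative channel width of 256; the helper function and the channel count are chosen here for illustration only.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k conv layer (bias ignored): k * k * c_in * c_out."""
    return k * k * c_in * c_out

c = 256  # illustrative channel width, same for input and output

# Two stacked 3x3 convs have a 5x5 receptive field but fewer parameters.
stacked_3x3 = 2 * conv_params(3, c, c)  # 2 * 9 * c^2 = 18 * c^2
single_5x5 = conv_params(5, c, c)       # 25 * c^2

print(stacked_3x3, single_5x5)  # 1179648 1638400
```

The same argument applies to three stacked 3×3 convolutions versus one 7×7 convolution (27c² vs. 49c² weights), and each extra layer contributes an additional ReLU.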
1. Network Architecture
The main VGG variants are VGG-11, VGG-13, VGG-16, and VGG-19, where the number counts the weight layers (convolutional and fully connected). The most widely used are VGG-16 and VGG-19; below we take VGG-16 as an example and walk through its structure.
| Layer | Convolution / Linear | Activation | Pooling / Dropout |
| --- | --- | --- | --- |
| 1 | Conv2d(3, 64, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 2 | Conv2d(64, 64, k=3, s=1, p=1) | ReLU(inplace=True) | MaxPool2d(k=2, s=2, p=0) |
| 3 | Conv2d(64, 128, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 4 | Conv2d(128, 128, k=3, s=1, p=1) | ReLU(inplace=True) | MaxPool2d(k=2, s=2, p=0) |
| 5 | Conv2d(128, 256, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 6 | Conv2d(256, 256, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 7 | Conv2d(256, 256, k=3, s=1, p=1) | ReLU(inplace=True) | MaxPool2d(k=2, s=2, p=0) |
| 8 | Conv2d(256, 512, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 9 | Conv2d(512, 512, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 10 | Conv2d(512, 512, k=3, s=1, p=1) | ReLU(inplace=True) | MaxPool2d(k=2, s=2, p=0) |
| 11 | Conv2d(512, 512, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 12 | Conv2d(512, 512, k=3, s=1, p=1) | ReLU(inplace=True) | |
| 13 | Conv2d(512, 512, k=3, s=1, p=1) | ReLU(inplace=True) | MaxPool2d(k=2, s=2, p=0) |
| | AdaptiveAvgPool2d(output_size=(7, 7)) | | |
| 14 | Linear(in=25088, out=4096, bias=True) | ReLU(inplace=True) | Dropout(p=0.5) |
| 15 | Linear(in=4096, out=4096, bias=True) | ReLU(inplace=True) | Dropout(p=0.5) |
| 16 | Linear(in=4096, out=1000, bias=True) | | |
2. Fine-Tuning
When fine-tuning the pretrained model, it is advisable to keep the 224×224 input size, since that is the resolution the pretrained weights were learned at. When training from scratch (randomly initialized weights), the input size need not be fixed at 224×224. Also note that torchvision provides two models, vgg16 and vgg16_bn; the latter adds BatchNorm2d layers to the network. Below, we fine-tune on top of the vgg16 pretrained model.
Since we fine-tune on CIFAR-10 while the pretrained model's output layer has 1000 classes, we need to replace the last layer by hand.
```python
num_features = estimator.classifier[6].in_features  # input dimension of the last layer
estimator.classifier[6] = nn.Linear(num_features, 10)  # CIFAR-10 has 10 classes
```
```python
import torch
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn as nn
from tqdm import tqdm


def train():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    transform = transforms.Compose([transforms.Resize((224, 224)),
                                    transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
    # Download the CIFAR-10 training set
    train_data = torchvision.datasets.CIFAR10(root='data', train=True, download=True, transform=transform)
    dataloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)

    # Load the VGG16 pretrained model
    # IMAGENET1K_V1      : includes the fully connected layers; classify directly, full fine-tuning
    # IMAGENET1K_FEATURES: convolutional layers only; transfer learning, feature extraction
    # Download weights: https://download.pytorch.org/models/vgg16-397923af.pth
    estimator = models.vgg16()
    estimator.load_state_dict(torch.load('vgg16-397923af.pth'))
    estimator.train()

    # Replace the final fully connected layer
    num_features = estimator.classifier[6].in_features  # input dimension of the last layer
    estimator.classifier[6] = nn.Linear(num_features, 10)  # CIFAR-10 has 10 classes
    estimator = estimator.to(device)

    # Loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(estimator.parameters(), lr=0.0001)

    num_epochs = 10
    for epoch in range(num_epochs):
        running_loss, running_size = 0.0, 0
        progress = tqdm(range(len(dataloader)), desc='Epoch: %2d Loss: %.3f' % (0, 0))
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = estimator(inputs)
            cur_loss = criterion(outputs, labels)
            cur_loss.backward()
            optimizer.step()
            running_loss += (cur_loss.item() * len(labels))
            running_size += len(labels)
            progress.set_description('Epoch: %2d Loss: %.3f' % (epoch + 1, running_loss / running_size))
            progress.update()
        progress.close()

    torch.save(estimator.state_dict(), 'vgg16_cifar10.pth')


if __name__ == '__main__':
    train()
```
```
Epoch:  1 Loss: 0.437: 100%|██████████████████| 782/782 [09:31<00:00, 1.37it/s]
Epoch:  2 Loss: 0.197: 100%|██████████████████| 782/782 [09:32<00:00, 1.37it/s]
```
3. Model Evaluation
```python
import torch
import torchvision
import torchvision.models as models
from tqdm import tqdm
from torchvision.transforms import transforms
from sklearn.metrics import accuracy_score


def evaluate():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    transform = transforms.Compose([transforms.Resize((224, 224)),
                                    transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (0.5,))])
    test_data = torchvision.datasets.CIFAR10(root='data', train=False, download=True, transform=transform)
    dataloader = torch.utils.data.DataLoader(test_data, batch_size=64, shuffle=False)

    estimator = models.vgg16(num_classes=10)
    estimator.load_state_dict(torch.load('vgg16_cifar10.pth'))
    estimator = estimator.to(device)
    estimator.eval()  # disable Dropout during evaluation

    progress = tqdm(range(len(dataloader)), desc='Acc: %.2f' % 0)
    y_true, y_pred = [], []
    with torch.no_grad():  # no gradients needed for inference
        for inputs, batch_true in dataloader:
            inputs = inputs.to(device)
            outputs = estimator(inputs)
            batch_pred = torch.argmax(outputs, dim=-1)
            y_true.extend(batch_true.tolist())
            y_pred.extend(batch_pred.cpu().tolist())
            accuracy = accuracy_score(y_true, y_pred)
            progress.set_description('Acc: %.2f' % accuracy)
            progress.update()
    progress.close()


if __name__ == '__main__':
    evaluate()
```
```
Acc: 0.91: 100%|██████████████████████████████| 157/157 [00:39<00:00, 3.96it/s]
```
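Overall accuracy hides which CIFAR-10 categories the model confuses. Since the evaluation script already collects `y_true` and `y_pred`, a per-class breakdown is one extra call to scikit-learn's `classification_report`; the toy label lists here are placeholders for illustration only:

```python
from sklearn.metrics import classification_report

# Placeholder labels; in the evaluation script, pass the collected y_true / y_pred.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 1, 0]

# Per-class precision, recall, and F1 alongside overall accuracy.
print(classification_report(y_true, y_pred, digits=3))
```

Classes with low recall in the report are the ones the fine-tuned model most often misclassifies, which is a useful guide for targeted data augmentation.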