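We train models with PyTorch all the time, and every now and then we run out of GPU memory. Besides knowing the usual workarounds, such as gradient accumulation and automatic mixed precision, it helps to understand where GPU memory actually goes during training. There are four main consumers:

CUDA runtime memory
The model's fixed parameters
The model's forward and backward computation
Optimizer statistics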

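1. CUDA runtime memory

CUDA (Compute Unified Device Architecture) is the computing platform from the GPU vendor NVIDIA. It lets us use the GPU's processing power to speed up computation dramatically.

For our purposes, CUDA is a software layer that runs on the GPU hardware; our computations have to run on top of it to use the GPU. Because it is software, CUDA itself takes up some GPU memory once it is running. How much depends on the CUDA version, among other things: some setups take around 600 MB, others more than 1 GB.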

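First, let's look at how PyTorch manages memory. GPU memory is the total resource available to us. Anyone who knows C/C++ will know that acquiring and releasing resources frequently, for example with malloc/free in C or new/delete in C++, hurts performance badly. Resource pools exist to cut down on these operations. The idea is to request one large chunk from the total available resources up front. When the program needs a resource, it takes it from the pool, which skips the complicated and slow system call. When a resource is released, it goes back into the pool. Only when the pool is exhausted does the program request more from the system. This makes resource handling much cheaper.

PyTorch allocates tensor memory the same way: it requests a larger chunk of memory first, tensors take memory from the pool when they need it, and they return it to the pool when they are done. Without this caching, PyTorch would run much more slowly.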
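The pooling idea described above can be sketched in a few lines of plain Python. This is only a toy model, not PyTorch's allocator: the class `ToyPool` and its `backend_calls` counter are hypothetical names, and the counter simply stands in for the slow allocation calls that a pool is meant to avoid.

```python
class ToyPool:
    """Toy caching pool: counts how often the (expensive) backend is asked for memory."""

    def __init__(self):
        self.free_blocks = {}    # size -> list of cached block ids
        self.backend_calls = 0   # stands in for slow driver/system allocations
        self._next_id = 0

    def alloc(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            return cached.pop()  # cache hit: reuse a block, no backend call
        self.backend_calls += 1  # cache miss: ask the backend for a new block
        self._next_id += 1
        return self._next_id

    def free(self, size, block):
        # Hand the block back to the pool, not to the backend
        self.free_blocks.setdefault(size, []).append(block)


pool = ToyPool()
for _ in range(1000):
    block = pool.alloc(512)
    pool.free(512, block)

print(pool.backend_calls)  # 1
```

A thousand allocate/free cycles reach the backend only once; every later request is served from the cached block.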

Next, let's verify with some code that the CUDA runtime takes up GPU memory once it is running. First install a library:

pip install pynvml

import torch
import pynvml


# Initialize the pynvml library
pynvml.nvmlInit()


# Convert bytes to MB
def convert(nbytes):
    return int(nbytes / 1024 / 1024)


# Get a handle to GPU 0
device_object = pynvml.nvmlDeviceGetHandleByIndex(0)


# Show GPU memory usage
def show_usage():
    # Query memory information
    device_memory = pynvml.nvmlDeviceGetMemoryInfo(device_object)
    # Total GPU memory
    total = convert(device_memory.total)
    # Memory in use
    used = convert(device_memory.used)
    # Memory still free
    free = convert(device_memory.free)
    print('Total:', total, 'Used:', used, 'Free:', free)


# 1. Initializing CUDA takes up some GPU memory
def test01():
    show_usage()
    # A tensor created on the CPU uses no GPU memory and does not initialize CUDA
    torch.tensor(0.0, device='cpu')
    show_usage()
    torch.tensor(0.0, device='cuda')
    # Empty the cache
    torch.cuda.empty_cache()
    show_usage()


if __name__ == '__main__':
    test01()

Program output (in MB):

Total: 5932 Used: 0 Free: 5932
Total: 5932 Used: 0 Free: 5932
Total: 5932 Used: 586 Free: 5346

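If the code above does not empty the cache, the last line reports 588 instead of 586: 588 = 586 + PyTorch's cache. The CUDA tensor we created is not bound to any name, so it is freed right after creation, but its memory goes back to PyTorch's pool rather than to the GPU. Only after the cache is emptied does the number drop to 586. Two granularities are involved here: the pool requests memory from the driver in larger chunks (2 MB in this run), and inside the pool every tensor gets a multiple of 512 bytes, as the next example shows.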

import torch


def test02():
    # 0
    print(torch.cuda.memory_allocated())

    a = torch.tensor(0.0, device='cuda')
    # 512
    print(torch.cuda.memory_allocated())

    b = torch.tensor(0.0, device='cuda')
    # 1024
    print(torch.cuda.memory_allocated())

    c = torch.tensor(0.0, device='cuda')
    # 1536
    print(torch.cuda.memory_allocated())


if __name__ == '__main__':
    test02()

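Program output: the values match the comments in the code (0, 512, 1024, 1536).

torch.cuda.memory_allocated returns the number of bytes currently occupied by tensors.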

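2. The model's fixed parameters

This part is easy to understand: loading a model means loading its parameters, so the parameters occupy some GPU memory. By default, PyTorch parameters are float32. Consider the following code: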

import torch
import torch.nn as nn


def test01():
    print(torch.cuda.memory_allocated())
    linear = nn.Linear(in_features=1, out_features=1, bias=False).cuda()
    print(torch.cuda.memory_allocated())

if __name__ == '__main__':
    test01()

Program output:

0
512

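The linear layer above has no bias and a single parameter, so it should need only 4 bytes. Why 512? Because PyTorch allocates GPU memory in multiples of 512 bytes, that is, in blocks. Doesn't that waste memory? It is an efficiency trade-off: allocating in blocks makes memory management simpler and helps avoid fragmentation.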

import torch
import torch.nn as nn

def test02():
    print(torch.cuda.memory_allocated())
    linear = nn.Linear(in_features=128, out_features=1, bias=False).cuda()
    print(torch.cuda.memory_allocated())


if __name__ == '__main__':
    test02()

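The output is still 512 bytes, because 128 parameters × 4 bytes = 512 bytes, which fits exactly in one block. Change in_features from 128 to 129 and the 129 × 4 = 516 bytes no longer fit, so 1024 bytes are allocated.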

Exercise: how much GPU memory does the following model take?

import torch
import torch.nn as nn


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(1, 1, bias=False)
        self.linear2 = nn.Linear(1, 1, bias=False)

    def forward(self, inputs):
        inputs = self.linear1(inputs)
        inputs = self.linear2(inputs)
        return inputs


def test03():
    print(torch.cuda.memory_allocated())
    model = Net().cuda()
    print(torch.cuda.memory_allocated())


if __name__ == '__main__':
    test03()

Program output:

0
1024
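The numbers in this section can be reproduced with plain arithmetic. The sketch below assumes what the examples suggest: every parameter tensor is a separate float32 allocation, rounded up to a multiple of 512 bytes on its own. That is why two 4-byte weights cost 1024 bytes rather than 512. The helpers `rounded_bytes` and `model_bytes` are hypothetical names used for illustration, not PyTorch functions.

```python
import math

BLOCK = 512  # allocation granularity observed in the examples above


def rounded_bytes(shape, itemsize=4):
    """Bytes a float32 tensor of `shape` occupies after rounding up to BLOCK."""
    raw = math.prod(shape) * itemsize
    return (raw + BLOCK - 1) // BLOCK * BLOCK


def model_bytes(param_shapes):
    # Each parameter tensor is its own allocation, so each is rounded separately
    return sum(rounded_bytes(shape) for shape in param_shapes)


print(rounded_bytes((1, 1)))          # 512
print(rounded_bytes((1, 128)))        # 512
print(rounded_bytes((1, 129)))        # 1024
print(model_bytes([(1, 1), (1, 1)]))  # 1024
```

The last line corresponds to the two-layer Net: two weights of shape (1, 1), 512 bytes each.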

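3. Forward and backward computation

During the forward pass the network saves intermediate results. Why? Because the backward pass needs them to compute gradients. The gradients produced by the backward pass also need GPU memory to store, so both the forward and the backward pass consume memory.

In addition, the larger the input batch_size, the more GPU memory is used.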

import torch
import torch.nn as nn


def test():

    print(torch.cuda.memory_allocated())

    # 1024: weight and bias each take one 512-byte block
    model = nn.Linear(1, 1).cuda()
    print(torch.cuda.memory_allocated())

    # Forward pass
    # 5120 = 1024 + 4096 (1024 float32 inputs)
    inputs = torch.randn(size=(1024, 1)).cuda()
    print(torch.cuda.memory_allocated())

    # The forward pass has to keep its intermediate result (outputs)
    # Note: holding the result in a variable keeps it alive
    # 9216 = 5120 + 4096 (1024 cached results)
    outputs = model(inputs)
    print(torch.cuda.memory_allocated())

    # Compute the loss
    # 9728 = 9216 + 512 for the loss value
    loss = torch.mean(outputs)
    print(torch.cuda.memory_allocated())

    # Backward pass
    # 10752 = 9728 + 1024 for the gradients (weight and bias, 512 each)
    loss.backward()
    print(torch.cuda.memory_allocated())


if __name__ == '__main__':
    test()

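Program output: the values follow the running totals in the comments (0 at the start, 1024 once the model is created, then 5120, 9216, 9728, and 10752).

After backpropagation, variables such as outputs and loss can be released.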
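The running totals in the example above can be checked with a small ledger. This sketch assumes float32 tensors and the 512-byte rounding seen earlier; the helper `rounded` and the step labels are illustrative names, not part of PyTorch.

```python
BLOCK = 512


def rounded(numel, itemsize=4):
    """Bytes for `numel` float32 values, rounded up to a multiple of BLOCK."""
    raw = numel * itemsize
    return (raw + BLOCK - 1) // BLOCK * BLOCK


# (what is allocated at this step, bytes it adds)
steps = [
    ("weight + bias", rounded(1) + rounded(1)),
    ("inputs (1024 x 1)", rounded(1024)),
    ("outputs (1024 x 1)", rounded(1024)),
    ("loss (scalar)", rounded(1)),
    ("weight.grad + bias.grad", rounded(1) + rounded(1)),
]

total = 0
totals = []
for name, nbytes in steps:
    total += nbytes
    totals.append(total)
    print(f"{name:<24} +{nbytes:>5} -> {total}")
```

The final total is 10752 bytes. The backward pass adds 1024 bytes rather than 512 because the layer has two parameters, and each gradient gets its own block.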

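4. Optimizer statistics

Optimizers keep their own statistics. For example, SGD with momentum keeps a momentum buffer (a running accumulation of past gradients) for every parameter, and Adam keeps first-moment and second-moment estimates of the gradient for every parameter. These also occupy GPU memory during training, and the more parameters the model has, the more memory the optimizer state takes.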

import torch
import torch.nn as nn
import torch.optim as optim


def test():

    # 0
    print(torch.cuda.memory_allocated())

    # 512
    model = nn.Linear(1, 1, bias=False).cuda()
    print(torch.cuda.memory_allocated())

    # 1024
    inputs = torch.randn(size=(1, 1)).cuda()
    print(torch.cuda.memory_allocated())

    # 1536
    outputs = model(inputs)
    print(torch.cuda.memory_allocated())

    # 2048
    loss = torch.mean(outputs)
    print(torch.cuda.memory_allocated())

    # 2560
    loss.backward()
    print(torch.cuda.memory_allocated())

    # 3584
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    optimizer.step()
    print(torch.cuda.memory_allocated())


if __name__ == '__main__':
    test()

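Program output: the values match the comments in the code (0, 512, 1024, 1536, 2048, 2560, 3584). The last step adds 1024 bytes: Adam's two statistics for the single parameter, 512 bytes each.

With momentum set, SGD keeps one buffer of past gradients per parameter. Adam keeps more per parameter, so it uses more GPU memory.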
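As a rough estimate, optimizer state can be counted as a number of parameter-sized buffers per parameter. The sketch below makes that assumption explicit: 0 buffers for plain SGD, 1 for SGD with momentum, 2 for Adam. It ignores anything else an optimizer may store, such as step counters. The table `STATE_BUFFERS` and the function `optimizer_state_bytes` are hypothetical names for illustration.

```python
BLOCK = 512

# Assumed number of parameter-sized state tensors kept per parameter
STATE_BUFFERS = {"sgd": 0, "sgd_momentum": 1, "adam": 2}


def rounded(numel, itemsize=4):
    """Bytes for `numel` float32 values, rounded up to a multiple of BLOCK."""
    raw = numel * itemsize
    return (raw + BLOCK - 1) // BLOCK * BLOCK


def optimizer_state_bytes(param_numels, optimizer):
    # Every state tensor has the same shape as its parameter
    per_param = STATE_BUFFERS[optimizer]
    return sum(per_param * rounded(n) for n in param_numels)


# The one-parameter model from the example above
print(optimizer_state_bytes([1], "sgd"))           # 0
print(optimizer_state_bytes([1], "sgd_momentum"))  # 512
print(optimizer_state_bytes([1], "adam"))          # 1024
print(2560 + optimizer_state_bytes([1], "adam"))   # 3584
```

The last line reproduces the jump from 2560 to 3584 bytes after optimizer.step() in the Adam example.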
