A pruning technique for LLMs (merging LayerNorm into Linear)

Preface

In LLM models you frequently find a LayerNorm followed directly by a Linear layer. The LayerNorm's weight and bias can be folded into the Linear layer, so the forward pass no longer depends on the LayerNorm's own weight and bias at all.
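
Spelled out, this is just the algebra the test code below implements. Writing gamma/beta for the LayerNorm weight/bias, W/b for the Linear weight/bias, and x_hat for the normalized input before LayerNorm's affine step:

    W @ (gamma * x_hat + beta) + b  ==  (W * gamma) @ x_hat + (W @ beta + b)

So the merged Linear gets weight W * gamma (each input column of W scaled by the matching gamma entry) and bias W @ beta + b, and only a plain, non-affine layer norm remains in front of it.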

Test code

import torch
from torch import nn
import time


shape = (512, 1024)  # (in_features, out_features)


norm = nn.LayerNorm(shape[0])
linear = nn.Linear(shape[0],shape[1])

# Initialize the parameters
nn.init.normal_(norm.weight,std=0.1)
nn.init.normal_(norm.bias,std=0.1)
nn.init.normal_(linear.weight,std=0.1)
nn.init.normal_(linear.bias,std=0.1)




# Merge norm and linear into a single new Linear:
#   new weight = linear.weight * norm.weight           (each input column scaled by gamma)
#   new bias   = linear.weight @ norm.bias + linear.bias
linear1 = nn.Linear(shape[0],shape[1])
linear1.weight = torch.nn.Parameter(torch.mul(linear.weight,norm.weight))
linear1.bias = torch.nn.Parameter(torch.add(torch.matmul(linear.weight,norm.bias),linear.bias))


norm = norm.eval()
linear = linear.eval()
linear1 = linear1.eval()
if torch.cuda.is_available():
    norm = norm.cuda()
    linear = linear.cuda()
    linear1 = linear1.cuda()




# Sanity check: a plain (non-affine) layer norm followed by the merged Linear
# should match the original norm + linear pipeline
input = torch.rand(20,shape[0])
if torch.cuda.is_available():
    input = input.cuda()

out1 = linear1(torch.layer_norm(input,(shape[0],)))

out = linear(norm(input))
print("out1:",out1)
print("out:",out)


test_count = 10000
time0 = 0
time1 = 0


# Time the original norm + linear pipeline
# (on CUDA, calling torch.cuda.synchronize() before each time.time() would give more precise numbers)
for i in range(test_count):
    input = torch.rand(20,shape[0])
    if torch.cuda.is_available():
        input = input.cuda()
    start = time.time()
    out = linear(norm(input))
    time0 = time0 + (time.time()-start)

# Time the merged version: plain layer norm + merged Linear
for i in range(test_count):
    input = torch.rand(20,shape[0])
    if torch.cuda.is_available():
        input = input.cuda()
    start = time.time()
    out1 = linear1(torch.layer_norm(input,(shape[0],)))
    time1 = time1 + (time.time()-start)

# Compare the accumulated times
print(time0,time1)

The code above shows that out1 and out are essentially identical (up to floating-point rounding), and time1 comes out somewhat smaller than time0. The reason is simple: the layer norm in the merged version no longer carries a weight and bias, so one elementwise tensor multiply and one add are saved on every forward pass.
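
For reuse, the merge can be wrapped in a small helper. This is only a sketch: the function name merge_layernorm_linear is my own, and it assumes the LayerNorm was created with elementwise_affine=True (the default), so that norm.weight and norm.bias exist.

import torch
from torch import nn

def merge_layernorm_linear(norm: nn.LayerNorm, linear: nn.Linear) -> nn.Linear:
    # Fold the LayerNorm's affine parameters (gamma, beta) into the following Linear.
    # The returned Linear is meant to be used after a plain, non-affine layer norm,
    # e.g. torch.layer_norm(x, norm.normalized_shape).
    merged = nn.Linear(linear.in_features, linear.out_features, bias=True)
    with torch.no_grad():
        # W' = W * gamma : each input column of W is scaled by the matching gamma entry
        merged.weight.copy_(linear.weight * norm.weight)
        # b' = W @ beta (+ the original bias, if the Linear had one)
        new_bias = linear.weight @ norm.bias
        if linear.bias is not None:
            new_bias = new_bias + linear.bias
        merged.bias.copy_(new_bias)
    return merged

Calling merge_layernorm_linear(norm, linear) on the modules from the test code would reproduce linear1 above.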

Postscript

When implementing an LLM with something like ggml, this situation shows up very clearly at graph-construction time.

