Adapting a deterministic Triton mean operator to the NPU
Author: 昇腾实战派
1、Background
In deep learning, deterministic operator computation is essential for reproducible training and inference. However, many operators (such as the mean operator) can behave non-deterministically in their default implementations, especially for float32 data: floating-point addition is not associative, so differences in the accumulation order across parallel cores introduce numerical variation. Taking the mean operator as an example, this article reproduces its non-deterministic behavior on both GPU and NPU, compares the performance of deterministic and non-deterministic implementations, and explores adaptation and optimization strategies on the NPU via the Triton framework. The experiments show that a deterministic implementation eliminates run-to-run variation but incurs a significant performance cost, a trade-off that must be weighed in practice.
Repository: https://github.com/thinking-machines-lab/batch_invariant_ops
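To make the non-associativity concrete, here is a small float32 sketch (not from the repo): the same values summed strictly left to right versus summed in chunks, as a parallel reduction would do, generally do not agree to the last bit.

```python
import torch

torch.manual_seed(0)
x = torch.rand(4096, dtype=torch.float32) * 200 - 100  # values in [-100, 100)

# Same elements, two different accumulation orders:
s_sequential = torch.zeros((), dtype=torch.float32)
for v in x:                                      # strictly left-to-right accumulation
    s_sequential += v
s_chunked = x.reshape(64, 64).sum(dim=1).sum()   # partial sums per chunk, then combined

print(s_sequential.item(), s_chunked.item())
print((s_sequential - s_chunked).item())         # usually a small non-zero difference
```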
2、Reproducing the non-determinism
First, reproduce the operator's non-deterministic behavior by following the steps given in the official repository.
a) Install the batch_invariant_ops library
git clone https://github.com/thinking-machines-lab/batch_invariant_ops
cd batch_invariant_ops
pip install -e .
b) Modify the test script
The test case test_batch_invariance.py in the original repository demonstrates deterministic results for the matmul operator; to reproduce the issue for the mean operator, it needs a few changes. The code is as follows:
import torch
from batch_invariant_ops import set_batch_invariant_mode

device_type = getattr(torch.accelerator.current_accelerator(), "type", "cpu")
print(device_type)
torch.set_default_device(device_type)

# Just to get the logging out of the way
with set_batch_invariant_mode(True):
    pass

def test_batch_invariance(dtype=torch.float32):
    B, D, X = 2048, 4096, 16
    a = torch.linspace(-100, 100, B * D * X, dtype=dtype).reshape(B, D, X)
    # Method 1: reduce a single batch element (batch size 1)
    out1 = torch.mean(a[:1], dim=1, keepdim=False)
    # Method 2: reduce the full batch, then slice
    out2 = torch.mean(a, dim=1, keepdim=False)[:1]
    # Check if results are identical
    diff = (out1 - out2).abs().max()
    return diff.item() == 0, diff

def run_iters(iters=10):
    for dtype in [torch.float32, torch.bfloat16]:
        is_deterministic = True
        difflist = []
        for i in range(iters):
            isd, df = test_batch_invariance(dtype)
            is_deterministic = is_deterministic and isd
            difflist.append(df)
        print(f"Batch Deterministic: {is_deterministic} "
              f"run-to-run max/min/diff {max(difflist)}/{min(difflist)}/{max(difflist) - min(difflist)} "
              f"for {dtype} in {iters} iterations")

# Test with standard PyTorch: the two methods may produce different results
print("Standard PyTorch:")
with set_batch_invariant_mode(False):
    run_iters()

# Test with batch-invariant operations: the results are identical
print("\nBatch-Invariant Mode:")
with set_batch_invariant_mode(True):
    run_iters()
This example uses an input of shape (2048, 4096, 16); a thorough performance evaluation would need to cover more input shapes. The output is as follows:

As shown, in standard PyTorch mode the float32 results are non-deterministic, whereas in batch-invariant mode the float32 results are deterministic.
Notably, for the bfloat16 data type the results are deterministic in both modes.
c) Performance comparison
In general, deterministic computation comes at some performance cost. So how large is the gap between the deterministic and non-deterministic implementations of this operator?
import torch
import triton
import triton.language as tl


@triton.jit
def mean_kernel(
    input_ptr,
    output_ptr,
    input_stride0,
    input_stride1,
    input_stride2,
    output_stride0,
    output_stride1,
    M,  # size before reduction dim
    N,  # size of reduction dim
    K,  # size after reduction dim
    BLOCK_SIZE: tl.constexpr,
):
    """
    Kernel for computing mean along a single dimension.
    Input is viewed as (M, N, K) where N is the dimension being reduced.
    """
    # Program ID gives us which output element we're computing
    pid = tl.program_id(0)

    # Compute output indices
    m_idx = pid // K
    k_idx = pid % K

    # Bounds check
    if m_idx >= M or k_idx >= K:
        return

    # Accumulate sum across reduction dimension
    acc = 0.0
    for n_start in range(0, N, BLOCK_SIZE):
        n_offsets = n_start + tl.arange(0, BLOCK_SIZE)
        mask = n_offsets < N

        # Calculate input indices
        input_idx = m_idx * input_stride0 + n_offsets * input_stride1 + k_idx * input_stride2

        # Load and accumulate
        vals = tl.load(input_ptr + input_idx, mask=mask, other=0.0)
        acc += tl.sum(vals)

    # Compute mean and store
    mean_val = acc / N
    output_idx = m_idx * output_stride0 + k_idx * output_stride1
    tl.store(output_ptr + output_idx, mean_val)


def mean_dim(
    input: torch.Tensor, dim: int, keepdim: bool = False, dtype: torch.dtype | None = None
) -> torch.Tensor:
    """
    Triton implementation of torch.mean with single dimension reduction.

    Args:
        input: Input tensor
        dim: Single dimension along which to compute mean
        keepdim: Whether to keep the reduced dimension
        dtype: Output dtype. If None, uses input dtype (or float32 for integer inputs)

    Returns:
        Tensor with mean values along specified dimension
    """
    # Validate inputs
    assert input.is_cuda, "Input must be a CUDA tensor"
    assert -input.ndim <= dim < input.ndim, (
        f"Invalid dimension {dim} for tensor with {input.ndim} dimensions"
    )

    # Handle negative dim
    if dim < 0:
        dim = dim + input.ndim

    # Handle dtype
    if dtype is None:
        if input.dtype in [torch.int8, torch.int16, torch.int32, torch.int64]:
            dtype = torch.float32
        else:
            dtype = input.dtype

    # Convert input to appropriate dtype if needed
    if input.dtype != dtype:
        input = input.to(dtype)

    # Get input shape
    shape = list(input.shape)

    # Calculate dimensions for kernel
    M = 1
    for i in range(dim):
        M *= shape[i]
    N = shape[dim]
    K = 1
    for i in range(dim + 1, len(shape)):
        K *= shape[i]

    # Reshape input to 3D view (M, N, K)
    input_3d = input.reshape(M, N, K)

    # Create output shape
    if keepdim:
        output_shape = shape.copy()
        output_shape[dim] = 1
    else:
        output_shape = shape[:dim] + shape[dim + 1:]

    # Create output tensor
    output = torch.empty(output_shape, dtype=dtype, device=input.device)

    # Reshape output for kernel
    if keepdim:
        output_2d = output.reshape(M, 1, K).squeeze(1)
    else:
        output_2d = output.reshape(M, K)

    # Launch kernel (note: on the NPU, the number of grids must be limited to the physical core count)
    grid = (M * K,)
    BLOCK_SIZE = 1024
    mean_kernel[grid](
        input_3d,
        output_2d,
        input_3d.stride(0),
        input_3d.stride(1),
        input_3d.stride(2),
        output_2d.stride(0),
        output_2d.stride(1) if output_2d.ndim > 1 else 0,
        M,
        N,
        K,
        BLOCK_SIZE,
    )

    return output
if __name__ == "__main__":
    import time

    torch.manual_seed(42)
    device_type = getattr(torch.accelerator.current_accelerator(), "type", "cpu")
    print(device_type)
    torch.set_default_device(device_type)

    repeat_num = 100
    warmup_num = 3

    # 1. Construct the input
    B, D, X = 2048, 4096, 16
    input = torch.linspace(-100, 100, B * D * X, dtype=torch.float32).reshape(B, D, X)

    # 2. Time the Triton (deterministic) implementation
    for i in range(warmup_num):
        output = mean_dim(input, dim=1, keepdim=False)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(repeat_num):
        output = mean_dim(input, dim=1, keepdim=False)
    torch.cuda.synchronize()
    end = time.time()
    print("Triton mean_dim: shape ({}, {}, {}) took {:.3f} ms".format(B, D, X, (end - start) * 1000 / repeat_num))

    # 3. Compare with the torch.mean implementation
    for i in range(warmup_num):
        output = torch.mean(input, dim=1, keepdim=False)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(repeat_num):
        output = torch.mean(input, dim=1, keepdim=False)
    torch.cuda.synchronize()
    end = time.time()
    print("torch.mean: shape ({}, {}, {}) took {:.3f} ms".format(B, D, X, (end - start) * 1000 / repeat_num))

As shown, for an input of shape (2048, 4096, 16) the deterministic mean implementation is much slower, running at roughly 1/3 the speed of the non-deterministic implementation. The GPU used is an H20 (96 GB).
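For reference, timing with CUDA events avoids mixing host-side overhead into the measurement. A minimal sketch that could replace the time.time() loops above (time_gpu_ms is an illustrative helper, not part of the repo):

```python
import torch

def time_gpu_ms(fn, iters=100, warmup=3):
    """Time a GPU callable with CUDA events, returning average milliseconds per call."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Usage with the tensors from the benchmark above:
# print(time_gpu_ms(lambda: mean_dim(input, dim=1, keepdim=False)))
# print(time_gpu_ms(lambda: torch.mean(input, dim=1, keepdim=False)))
```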
d) Does the problem exist on the NPU?
Does a similar problem exist on the NPU, i.e., does the Ascend-C operator compute deterministically? The experiment is as follows:
import torch
from batch_invariant_ops import set_batch_invariant_mode

device_type = getattr(torch.accelerator.current_accelerator(), "type", "cpu")
print(device_type)
torch.set_default_device(device_type)

def test_batch_invariance(dtype=torch.float32):
    B, D, X = 2048, 4096, 16
    a = torch.linspace(-100, 100, B * D * X, dtype=dtype).reshape(B, D, X)
    # Method 1: reduce a single batch element (batch size 1)
    out1 = torch.mean(a[:1], dim=1, keepdim=False)
    # Method 2: reduce the full batch, then slice
    out2 = torch.mean(a, dim=1, keepdim=False)[:1]
    # Check if results are identical
    diff = (out1 - out2).abs().max()
    return diff.item() == 0, diff

def run_iters(iters=10):
    for dtype in [torch.float32, torch.bfloat16]:
        is_deterministic = True
        difflist = []
        for i in range(iters):
            isd, df = test_batch_invariance(dtype)
            is_deterministic = is_deterministic and isd
            difflist.append(df)
        print(f"Batch Deterministic: {is_deterministic} "
              f"run-to-run max/min/diff {max(difflist)}/{min(difflist)}/{max(difflist) - min(difflist)} "
              f"for {dtype} in {iters} iterations")

# Test with standard PyTorch: the two methods may produce different results
print("Standard PyTorch:")
with set_batch_invariant_mode(False):
    run_iters()

As shown, the non-determinism problem also exists on the NPU, and again only for the float32 data type.
3、NPU adaptation
All operators in this repository are implemented in Triton, so the NPU adaptation can follow the triton-ascend path. Alternatively, a deterministic mean could be implemented in Ascend-C, but that approach is outside the scope of this article.
a) Install triton-ascend
Daily packages are currently provided officially; install from a daily package using the corresponding command:
pip install -i https://test.pypi.org/simple/ triton-ascend==3.2.0.dev2025112001
Daily package index: https://test.pypi.org/project/triton-ascend/3.2.0.dev2025112001/#history
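A quick sanity check after installation (assuming the wheel exposes the usual triton.__version__ attribute):

```python
import triton
print(triton.__version__)  # should report the installed triton-ascend daily build
```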
b) Getting the adaptation running
After installing triton-ascend, subsequent Triton operations run on the NPU by default; there is no need to additionally import triton-ascend.
The goal of triton-ascend is that only a small amount of code needs to change to port an operator to the NPU. For the code above, the following changes are needed (a minimal sketch of the changes follows the list):
- Import torch_npu

- Replace torch.cuda calls with torch.npu

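Concretely, the changes amount to something like the following minimal sketch (assuming a torch_npu environment; the kernel and mean_dim definitions stay as they are):

```python
import torch
import torch_npu  # registers the Ascend NPU backend with PyTorch

# ... mean_kernel and mean_dim definitions unchanged ...

device_type = getattr(torch.accelerator.current_accelerator(), "type", "cpu")
torch.set_default_device(device_type)  # resolves to "npu" on an Ascend machine

# In the timing code, replace the CUDA synchronization calls:
torch.npu.synchronize()  # instead of torch.cuda.synchronize()

# (Depending on the mean_dim version used, the `input.is_cuda` assertion may also
# need to be relaxed so that NPU tensors are accepted.)
```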
Run the test script; the output is as follows:

The operator runs correctly, but a warning reports that the number of grids exceeds the physical limit of 40 cores, which may hurt performance. The measured times are as follows:

| Triton implementation (deterministic) | Ascend-C implementation (non-deterministic) | Time ratio |
| --- | --- | --- |
| 77.31 ms | 0.46 ms | 168x |
c) Performance optimization
The idea behind the deterministic implementation is to keep the reduction on a single core as far as possible, avoiding the non-determinism introduced by inconsistent multi-core execution order combined with the non-associativity of floating-point arithmetic. Judging from the warning and the implementation logic, setting the grid far larger than the number of physical cores introduces significant overhead. Based on this operator's implementation, the following optimization ideas can be explored (a kernel sketch of the most effective ones follows the list):
- Map each grid (program) to a row of the M axis;
- Map each grid to a column of the K axis;
- If M is large, fix the grid at 40 and let each program handle multiple rows along M;
- If K is large, fix the grid at 40 and let each program handle multiple columns along K;
- Reshape the input to (M, K, N) so that the data being reduced lies on the last (contiguous) axis;
- Tune BLOCK_SIZE.
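The two biggest wins in the table below, moving the reduced axis to the innermost position and enlarging BLOCK_SIZE, can be sketched as follows. This is illustrative only: mean_last_dim_kernel and mean_dim1_via_transpose are not the repo's code, the layout change is done here with an explicit permute().contiguous(), and it launches one program per output element rather than the 2048 grids used in the table.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def mean_last_dim_kernel(x_ptr, out_ptr, N, BLOCK_SIZE: tl.constexpr):
    # One program per output element; its row of length N is contiguous in memory.
    row = tl.program_id(0)
    base = row * N
    acc = 0.0
    for n_start in range(0, N, BLOCK_SIZE):
        offs = n_start + tl.arange(0, BLOCK_SIZE)
        vals = tl.load(x_ptr + base + offs, mask=offs < N, other=0.0)
        acc += tl.sum(vals)          # fixed accumulation order within a single program
    tl.store(out_ptr + row, acc / N)

def mean_dim1_via_transpose(x: torch.Tensor) -> torch.Tensor:
    """Mean over dim=1 of an (M, N, K) tensor, reducing along the last axis after a transpose."""
    M, N, K = x.shape
    x_t = x.permute(0, 2, 1).contiguous().reshape(M * K, N)  # reduced axis is now innermost
    out = torch.empty(M * K, dtype=x.dtype, device=x.device)
    mean_last_dim_kernel[(M * K,)](x_t, out, N, BLOCK_SIZE=8192)
    return out.reshape(M, K)
```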
The detailed experimental results are shown below:
| Optimization | Triton deterministic (ms) | Ascend-C non-deterministic (ms) | Grid count | Time ratio | Analysis |
| --- | --- | --- | --- | --- | --- |
| One program per M row | 45.69 | 0.467 | 2048 | 97.83x | Logical grid count > physical core count introduces noticeable overhead; room for further optimization remains, but the grid count is already far smaller than the original 32768 |
| One program per K column | 226.12 | 0.468 | 16 | 484.20x | 1. The physical cores are not fully utilized; 2. data along the K axis is not stored contiguously, so reads are costly |
| Grid fixed at 40, each program handles multiple non-contiguous M rows | 46.17 | 0.468 | 40 | 98.65x | Logical grid reduced to 40, yet the time rises slightly; the cause is still unclear |
| Grid fixed at 40, each program handles multiple contiguous M rows | 46.07 | 0.468 | 40 | 98.44x | Logical grid reduced to 40, yet the time rises slightly; the cause is still unclear |
| Input reshaped to (M, K, N) | 2.75 | 0.468 | 2048 | 5.87x | The reduction now works on the last, contiguous axis, so read/write overhead drops sharply |
| BLOCK_SIZE tuned (512 → 8192) | 1.78 | 0.468 | 2048 | 3.8x | Use as much of the UB as possible to reduce the number of loop iterations |
| 2048 programs → 40 programs | 1.80 | 0.468 | 40 | 3.84x | Reducing the logical grid to 40 slightly increases the time |
| Rows merged into M * K and split evenly across 40 programs | 1.83 | 0.468 | 40 | 3.91x | Introduces a fair amount of scalar computation |
4、Summary
- On both GPU and NPU, the official PyTorch mean API is non-deterministic for the float32 data type; bfloat16 does not show the problem;
- The same non-determinism is therefore present on the NPU as well;
- On the GPU, the Triton deterministic mean implementation runs at roughly 1/3 the speed of the CUDA (torch.mean) implementation;
- After adapting the mean operator with triton-ascend and optimizing it, the performance loss is of a similar magnitude to that on the GPU.
5、References
https://lmsys.org/blog/2025-09-22-sglang-deterministic/
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
https://zhuanlan.zhihu.com/p/1956415532606665583
https://gitcode.com/Ascend/triton-ascend/blob/master/docs/HighPerformanceGuide.md