Integrating Custom HCCL Operators in MindSpore

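This article walks through the complete workflow for integrating a custom HCCL communication operator into the MindSpore 2.3.0 framework on an Ascend 910B server. Using an AlltoAllV operator as the example, it covers operator compilation, registration in MindSpore, and deployment in distributed training, including environment preparation, parameter configuration, and implementation. The reported experiments show the custom operator performing on par with the native HCCL operator (about 1.1 ms for 64 MB of data) and delivering a 10% performance improvement in MoE model training, which supports the feasibility and effectiveness of the approach.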

昇思MindSpore · Published 2025-12-05 11:36:09

Chapter 1 Overview

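As the all-scenario AI framework of the Ascend ecosystem, MindSpore supports integrating custom HCCL operators to meet specific communication needs. Using a custom AlltoAllV operator as the example, this article details the full workflow from operator compilation and MindSpore registration to distributed training deployment, helping developers quickly extend the communication capabilities of the Ascend platform.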

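Chapter 2 Environment Preparation

Hardware: Ascend 910B server (2 cards).

Software dependencies:

MindSpore 2.3.0: pip install mindspore-ascend==2.3.0;

Custom HCCL operator: a dynamic library compiled from cann-hccl;

Distributed training example: MindSpore's resnet50_distributed.

Environment variables: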

export MS_ENABLE_HCCL=1
export HCCL_CUSTOM_OP_PATH=/path/to/custom_hccl_op
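
Before launching a job, it can help to confirm that both variables are actually visible to the Python process. The following is a minimal pre-flight sketch, not part of MindSpore: the function name `check_hccl_env` is hypothetical, and it assumes `HCCL_CUSTOM_OP_PATH` should point at an existing location on disk, which the article does not state explicitly.

```python
import os

# Variables the article exports before training
REQUIRED_VARS = ("MS_ENABLE_HCCL", "HCCL_CUSTOM_OP_PATH")


def check_hccl_env(env=None):
    """Return a list of problems found; an empty list means the check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    # The article sets MS_ENABLE_HCCL to 1; flag any other non-empty value
    if env.get("MS_ENABLE_HCCL") not in (None, "", "1"):
        problems.append("MS_ENABLE_HCCL should be 1")
    # Assumption: the custom operator path must exist on this machine
    path = env.get("HCCL_CUSTOM_OP_PATH")
    if path and not os.path.exists(path):
        problems.append(f"HCCL_CUSTOM_OP_PATH does not exist: {path}")
    return problems


if __name__ == "__main__":
    for problem in check_hccl_env():
        print("WARNING:", problem)
```

Running the script on each node before training prints one warning per missing or suspicious setting and prints nothing when the environment looks consistent.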

Chapter 3 Configuration Parameters

Custom operator registration parameters:

Chapter 4 Procedure

1. Compile the custom HCCL operator:

# Build the operator's dynamic library (g++ so that the C++ standard library is linked for the .cc source)
g++ -fPIC -shared custom_alltoallv.cc -o libcustom_hccl_op.so -I$ASCEND_HOME/hccl/include -lhccl
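
After the build, a quick sanity check is to load the resulting library from Python, which surfaces loader problems (for example, `libhccl.so` not being on the library search path) before any training job is started. This sketch uses only the standard-library `ctypes` module; the helper name `try_load` is hypothetical, and the file name is the output name used in the compile command above.

```python
import ctypes


def try_load(path):
    """Try to dlopen a shared library; return (ok, message)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # Raised when the file is missing or a dependency cannot be resolved
        return False, str(exc)
    return True, "loaded"


if __name__ == "__main__":
    ok, message = try_load("./libcustom_hccl_op.so")
    print("load ok" if ok else f"load failed: {message}")
```

A failure message that names another library usually points at a missing entry in the loader search path rather than at the operator code itself.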

2. Register the operator in MindSpore:

# custom_op.py
from mindspore.ops import CustomRegOp, DataType, custom_op_attr_register

# Registration info: two inputs, one output, and the communication group as an attribute
custom_alltoallv_op = CustomRegOp("CustomAlltoAllV") \
    .input(0, "send_buf", "required") \
    .input(1, "send_counts", "required") \
    .output(0, "recv_buf", "required") \
    .attr("comm_group", "required", "str") \
    .target("Ascend") \
    .dtype_format(DataType.F32_Default, DataType.I32_Default, DataType.F32_Default) \
    .get_op_info()

@custom_op_attr_register(op_info=custom_alltoallv_op)
def custom_alltoallv_impl(send_buf, send_counts, comm_group):
    from mindspore.communication import hccl
    # Call the custom HCCL operator and return the received buffer
    return hccl.custom_op("CustomAlltoAllV", send_buf, send_counts, comm_group)
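
The `send_counts` input determines how the send buffer is split among peers, so validating it on the host before the operator is invoked can catch mistakes early. The sketch below is pure Python and independent of MindSpore. The helper name `send_displacements` is hypothetical, and it assumes counts are expressed in rows along the first dimension of `send_buf`; the article does not specify the unit.

```python
def send_displacements(send_counts, group_size, buf_rows):
    """Validate per-peer send counts and return each peer's start offset.

    send_counts: number of rows sent to each rank, in rank order
    group_size:  number of ranks in the communication group
    buf_rows:    number of rows available in the send buffer
    """
    if len(send_counts) != group_size:
        raise ValueError("send_counts needs one entry per rank")
    if any(count < 0 for count in send_counts):
        raise ValueError("send_counts entries must be non-negative")
    if sum(send_counts) > buf_rows:
        raise ValueError("send_counts exceeds the send buffer size")
    # Exclusive prefix sum: where each peer's chunk begins in the buffer
    offsets = []
    start = 0
    for count in send_counts:
        offsets.append(start)
        start += count
    return offsets


if __name__ == "__main__":
    # Two ranks, 1024 rows, 512 rows for each peer
    print(send_displacements([512, 512], group_size=2, buf_rows=1024))
```

With two ranks and an even split, the first peer's chunk starts at row 0 and the second at row 512.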

3. Integrate into distributed training:

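# train.py
import numpy as np
import mindspore as ms
from mindspore import Tensor
from mindspore.communication import init
from custom_op import custom_alltoallv_impl

if __name__ == "__main__":
    # Initialize the communication backend
    init()
    # Build the input data
    send_buf = Tensor(np.random.rand(1024, 1024), dtype=ms.float32)
    # One entry per rank (2 cards)
    send_counts = Tensor([512, 512], dtype=ms.int32)
    # Call the custom operator
    recv_buf = custom_alltoallv_impl(send_buf, send_counts, "hccl_world_group")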
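
To check that `recv_buf` holds what is expected, it helps to have a host-side reference for AlltoAllV semantics. The following single-process sketch simulates every rank with plain Python lists. It is an illustrative model of the collective, not the article's operator or an HCCL API, and the name `alltoallv_reference` is hypothetical. Each rank's buffer is split into consecutive chunks, one per destination, and each destination concatenates the chunks it receives in source-rank order.

```python
def alltoallv_reference(send_bufs, send_counts):
    """Simulate AlltoAllV across all ranks in one process.

    send_bufs[r]:      list of items held by rank r
    send_counts[r][p]: number of items rank r sends to rank p
    Returns recv_bufs, where recv_bufs[p] is what rank p receives.
    """
    n = len(send_bufs)
    recv_bufs = [[] for _ in range(n)]
    for src in range(n):
        offset = 0
        for dst in range(n):
            count = send_counts[src][dst]
            # Chunk for dst is the next `count` items of src's buffer
            recv_bufs[dst].extend(send_bufs[src][offset:offset + count])
            offset += count
    return recv_bufs


if __name__ == "__main__":
    bufs = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
    counts = [[1, 2], [2, 1]]
    for rank, received in enumerate(alltoallv_reference(bufs, counts)):
        print(rank, received)
```

In this small example rank 0 receives `a0` from itself followed by `b0, b1` from rank 1, while rank 1 receives `a1, a2` followed by `b2`. The same comparison can be applied to the 1024-row case by gathering each rank's buffers to the host.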