[MindSpeed + vLLM-Ascend] Qwen3-Coder-Next Is Live: A Fast-Track Guide to Running It on Ascend

zzzliupi


zzzliupi · published 2026-06-14 14:21:49

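On February 4, 2026, Qwen released Qwen3-Coder-Next, an open-weight language model designed for coding agents and local development. Ascend has consistently supported Qwen-series models in step with their releases, and Qwen3-Coder-Next was adapted in MindSpeed and vLLM Ascend as soon as it was open-sourced, so developers can try it right away. The adapted model and weights are available on both the Modelers community (魔乐社区) and the AtomGit AI community. Download them and give them a try!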

I. Qwen3-Coder-Next Highlights

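The model is built on Qwen3-Next-80B-A3B-Base and uses a new architecture that combines hybrid attention with Mixture-of-Experts (MoE). Rather than relying on parameter scaling alone, it focuses on scaling agentic training signals: it is trained on large-scale verifiable coding tasks paired with executable environments, so the model can learn directly from environment feedback. The training process includes: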

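Continued pretraining on code- and agent-centric data
Supervised fine-tuning on data containing high-quality agent trajectories
Domain-specialized expert training (e.g., software engineering, QA, Web/UX)
Distilling the expert capabilities into a single, deployable model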

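This recipe emphasizes long-horizon reasoning, tool use, and recovery from execution failures, all of which are essential for real-world coding agents. Despite its small number of activated parameters, the model matches or outperforms several larger open-source models on multiple agent benchmarks.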
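To make the "small activated parameters" point concrete: going by its name, Qwen3-Next-80B-A3B has roughly 80B total parameters but activates only about 3B per token (about 3.75%), because an MoE layer runs only a few selected experts for each token. The sketch below is a generic top-k MoE routing example in plain Python, not Qwen3-Coder-Next's actual implementation; the four toy scalar "experts", the router logits, and k=2 are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits, k):
    """Run only the routed experts for one token and mix their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical toy setup: 4 experts, each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # in a real model these come from a learned router

print(top_k_route(router_logits, k=2))  # experts 1 and 3 are selected
print(moe_forward(1.0, experts, router_logits, k=2))
print(f"active fraction ~ {3 / 80:.2%}")  # ~3B active out of ~80B total
```

Only 2 of the 4 toy experts do any work for this token; a real MoE layer applies the same idea with far more experts, which is how total and activated parameter counts can differ so much.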

II. Getting Started with Qwen3-Coder-Next on Ascend

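This tutorial walks you step by step through training Qwen3-Coder-Next and deploying it for inference, with detailed instructions and best practices to help you get up and running quickly.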

Training Quick Start with MindSpeed

1. Environment Setup

For hardware requirements and MindSpeed-LLM repository setup, see the environment setup section of the MindSpeed Qwen3-Coder-Next documentation.

2. Weight Conversion

1) Download the weights

Source	Link
Hugging Face	Qwen3-Coder-Next model page
Modelers community	Qwen3-Coder-Next model page
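After downloading, it is worth checking that no shard is missing before spending time on conversion. The sketch below assumes the weights use the standard Hugging Face sharded safetensors layout (a `config.json` plus a `model.safetensors.index.json` whose `weight_map` names each shard file); `missing_shards` is a hypothetical helper, not part of MindSpeed-LLM.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """List files the checkpoint needs but that are absent from model_dir.

    Assumes the Hugging Face sharded-safetensors layout: config.json plus
    model.safetensors.index.json, whose "weight_map" maps tensor names to
    shard file names. Falls back to a single model.safetensors otherwise.
    """
    model_dir = Path(model_dir)
    required = {"config.json"}
    index_path = model_dir / "model.safetensors.index.json"
    if index_path.exists():
        weight_map = json.loads(index_path.read_text())["weight_map"]
        required.update(weight_map.values())
    else:
        required.add("model.safetensors")
    return sorted(name for name in required if not (model_dir / name).exists())


# Example: print(missing_shards("/path/to/model/Qwen3-Coder-Next/"))
# An empty list means every file referenced by the index is present.
```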

2) Convert the weights

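MindSpeed-LLM provides a script that converts the open-source Hugging Face weights into mcore (Megatron-Core) weights for training, inference, evaluation, and other tasks. Usage is shown below; edit the conversion script to set the TP/PP (tensor/pipeline parallel) partitioning and the weight paths you actually need.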

cd MindSpeed-LLM 
bash examples/mcore/qwen3_coder_next/ckpt_convert_qwen3_coder_next_80b_hf2mcore.sh
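Before editing the TP/PP values in the conversion script, it helps to sanity-check the split against the number of NPUs you will train on: in Megatron-style training the devices are divided into tensor-parallel × pipeline-parallel × data-parallel groups. The helper below is a hypothetical sketch of those checks. It assumes layers are split evenly across pipeline stages and the common Megatron-Core layout where expert-parallel (EP) groups are formed from data-parallel ranks; the 16-device / 48-layer numbers in the example are illustrative only, not values from the MindSpeed-LLM scripts.

```python
def data_parallel_size(world_size, tp, pp, num_layers, ep=1):
    """Return the implied data-parallel (DP) size for a TP/PP/EP split.

    Raises ValueError if the split cannot be laid out, under two assumptions:
    layers are divided evenly across pipeline stages, and expert-parallel
    groups are formed from data-parallel ranks (so EP must divide DP).
    """
    if world_size % (tp * pp):
        raise ValueError(f"world_size {world_size} not divisible by TP*PP={tp * pp}")
    if num_layers % pp:
        raise ValueError(f"{num_layers} layers cannot be split evenly into {pp} stages")
    dp = world_size // (tp * pp)
    if dp % ep:
        raise ValueError(f"EP={ep} does not divide DP={dp}")
    return dp


# Illustrative only: 16 NPUs, TP=2, PP=4, a hypothetical 48-layer model, EP=2.
print(data_parallel_size(16, tp=2, pp=4, num_layers=48, ep=2))  # -> 2
```

Because the mcore checkpoint is saved already partitioned, the TP/PP chosen at conversion time generally needs to match what the pretraining or fine-tuning script launches with.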

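3. Data Preprocessing

1) Pretraining data preprocessing

MindSpeed-LLM provides a pretraining data preprocessing script: data_convert_qwen3_coder_next_pretrain.sh

Usage is as follows; adjust the parameters in the script to your needs: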

cd MindSpeed-LLM 
bash examples/mcore/qwen3_coder_next/data_convert_qwen3_coder_next_pretrain.sh
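Conceptually, Megatron-style pretraining preprocessing tokenizes each document in a JSONL file, optionally appends an end-of-document (EOD) token, and packs everything into one flat token buffer plus an index of document boundaries (the `.bin`/`.idx` file pair). The pure-Python sketch below only mirrors that idea in memory. The `"text"` field name, the character-level toy tokenizer, and the EOD id are assumptions for illustration; the real script uses the model's own tokenizer and writes binary files.

```python
import json


def pack_documents(jsonl_lines, tokenize, eod_id, json_key="text"):
    """Tokenize JSONL documents into one flat buffer plus start offsets.

    Loosely mirrors the .bin (token buffer) / .idx (document index) split;
    everything stays in memory here.
    """
    tokens, doc_starts = [], []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        text = json.loads(line)[json_key]
        doc_starts.append(len(tokens))
        tokens.extend(tokenize(text))
        tokens.append(eod_id)  # mark the document boundary
    return tokens, doc_starts


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one id per character."""
    return [ord(ch) for ch in text]


lines = ['{"text": "ab"}', '{"text": "c"}']
print(pack_documents(lines, toy_tokenize, eod_id=0))  # -> ([97, 98, 0, 99, 0], [0, 3])
```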

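2) Fine-tuning data preprocessing

MindSpeed-LLM provides a fine-tuning (instruction) data preprocessing script: data_convert_qwen3_coder_next_instruction.sh

Usage is as follows; adjust the parameters in the script to your needs: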

cd MindSpeed-LLM 
bash examples/mcore/qwen3_coder_next/data_convert_qwen3_coder_next_instruction.sh
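Instruction (fine-tuning) data differs from pretraining data mainly in that the loss is usually computed on the response only. A common approach, sketched below under the assumption of Alpaca-style records (`instruction`, `input`, `output`), is to concatenate prompt and response and set the prompt positions of the label sequence to -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. This is a hypothetical illustration, not the MindSpeed-LLM script's logic: the real pipeline applies the model's chat template and tokenizer, which are omitted here.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_sft_example(record, tokenize, eod_id):
    """Return (input_ids, labels) with the prompt part masked out of the loss."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(record["output"]) + [eod_id]
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one id per character."""
    return [ord(ch) for ch in text]


example = {"instruction": "hi", "input": "", "output": "ok"}
print(build_sft_example(example, toy_tokenize, eod_id=0))
# -> ([104, 105, 111, 107, 0], [-100, -100, 111, 107, 0])
```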

4. Pretraining

cd MindSpeed-LLM 
bash examples/mcore/qwen3_coder_next/pretrain_qwen3_coder_next_80b_4K_A3_ptd.sh
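The main knobs in a pretraining script are usually the parallel sizes and the batch sizes. In Megatron-style trainers the global batch size is reached through gradient accumulation: each data-parallel rank runs micro-batches until micro_batch × DP × accumulation_steps equals the global batch size. The `4K` in the script name presumably refers to a 4096-token sequence length; the batch values below are hypothetical, so substitute the ones in your copy of the script.

```python
def grad_accum_steps(global_batch, micro_batch, dp):
    """Micro-batches each data-parallel rank runs per optimizer step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass:
        raise ValueError(f"global batch {global_batch} not divisible by MBS*DP={per_pass}")
    return global_batch // per_pass


def tokens_per_step(global_batch, seq_len):
    """Tokens consumed by one optimizer step."""
    return global_batch * seq_len


# Hypothetical values: GBS=64, MBS=1, DP=2, sequence length 4096.
print(grad_accum_steps(64, 1, 2))  # -> 32
print(tokens_per_step(64, 4096))   # -> 262144
```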

5. Fine-tuning

cd MindSpeed-LLM 
bash examples/mcore/qwen3_coder_next/tune_qwen3_coder_next_80b_4K_full_ptd.sh
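The `full` in the fine-tuning script name suggests full-parameter fine-tuning: even though only about 3B parameters are active per token, all ~80B parameters must be stored and updated, which is why the model has to be split across many NPUs. A common rule of thumb (from the ZeRO analysis of mixed-precision Adam) is about 16 bytes of model state per parameter: 2 for BF16/FP16 weights, 2 for gradients, and 12 for the FP32 master weights plus the two Adam moments. The arithmetic below applies that estimate; it excludes activations and framework overhead and assumes perfectly even sharding, so treat it as a rough lower bound, not a MindSpeed measurement. The 64-device count is hypothetical.

```python
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # weights, grads, FP32 master, Adam m, Adam v


def model_state_bytes(num_params):
    """Approximate weight + gradient + optimizer memory for mixed-precision Adam."""
    return num_params * BYTES_PER_PARAM


def per_device_gb(num_params, num_devices):
    """Model-state memory per device in GB (10**9 bytes), assuming even sharding."""
    return model_state_bytes(num_params) / num_devices / 1e9


total_params = 80_000_000_000  # ~80B total parameters
print(model_state_bytes(total_params) / 1e9)  # -> 1280.0 (GB in total)
print(per_device_gb(total_params, 64))        # -> 20.0 (GB per device on 64 devices)
```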

6. Inference

cd MindSpeed-LLM 
bash examples/mcore/qwen3_coder_next/generate_qwen3_coder_next_80b_ptd.sh

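Inference Quick Start with vLLM Ascend

1. Get the Weights

The model weights can be downloaded quickly from the Modelers community: Qwen3-Coder-Next model page

Qwen3-Coder-Next is supported in the vllm-ascend:v0.14.0rc1 image.

2. Deploy the Model

Start the Docker container: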

# Update the vllm-ascend image
# Replace |vllm_ascend_version| with the vllm-ascend release tag, e.g. v0.14.0rc1
# For Atlas A2 machines:
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
# For Atlas A3 machines:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--shm-size=1g \
--name qwen3-coder-next \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
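The `docker run` command above exposes `/dev/davinci0` through `/dev/davinci3`, i.e. four NPUs, which lines up with the tensor-parallel size of 4 used in the inference commands. If you change the tensor-parallel size or want to use different cards, the device list has to change with it. The hypothetical helper below builds the `--device` arguments for a chosen set of card IDs, plus the three management devices from the command above; it does not check which cards actually exist on the host.

```python
import shlex

# Management devices passed in the docker run command above.
MANAGEMENT_DEVICES = ("/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc")


def npu_device_args(card_ids):
    """Return docker --device arguments for the given NPU card IDs."""
    args = []
    for card in card_ids:
        args += ["--device", f"/dev/davinci{card}"]
    for dev in MANAGEMENT_DEVICES:
        args += ["--device", dev]
    return args


# Four cards for a tensor-parallel size of 4, matching the command above.
print(shlex.join(npu_device_args(range(4))))
```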

Make sure Triton Ascend is installed in your environment; it is required to run this model:

pip install triton-ascend==3.2.0

3. Inference

Offline inference

Run the following offline script, which feeds four prompts to the model:

import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
    # Create an LLM.
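    llm = LLM(
        model="/path/to/model/Qwen3-Coder-Next/",
        tensor_parallel_size=4,        # shard the model across 4 NPUs
        trust_remote_code=True,
        max_model_len=10000,           # max context length (prompt + output tokens)
        gpu_memory_utilization=0.8,    # fraction of device memory vLLM may use
        max_num_seqs=4,
        max_num_batched_tokens=4096,
        # Capture full graphs only for decode batches.
        compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
    )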

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()

Online inference

Run the following command to start an online inference service:

vllm serve /path/to/model/Qwen3-Coder-Next/ \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 4096 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
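Loading an 80B checkpoint can take a while, so the server may not accept requests immediately after `vllm serve` starts. vLLM's OpenAI-compatible server exposes a `GET /health` endpoint that returns 200 once it is up; the sketch below polls it with the Python standard library before any requests are sent. The host/port, timeout, and polling interval are assumptions to adjust, and `fetch`/`sleep` are injectable only so the function can be exercised without a live server.

```python
import time
import urllib.request


def _http_status(url):
    """GET url and return the HTTP status code (raises OSError on failure)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status


def wait_until_ready(base_url="http://localhost:8000", timeout_s=1800.0,
                     interval_s=10.0, fetch=_http_status, sleep=time.sleep):
    """Poll base_url/health until it returns 200; return the attempt count."""
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            if fetch(base_url + "/health") == 200:
                return attempts
        except OSError:
            pass  # connection refused, HTTP error, or timeout: not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} not ready after {attempts} attempts")
        sleep(interval_s)


# Example: call wait_until_ready() before sending the first request.
```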

Then send a request to the model:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/path/to/model/Qwen3-Coder-Next/",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0
      }'
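The same request can be sent from Python using only the standard library. The sketch below posts to `/v1/completions` and reads `choices[0].text` from the OpenAI-style JSON response. It assumes the server was started without `--served-model-name`, in which case vLLM uses the path given to `vllm serve` as the model name; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             max_tokens=100, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_text(response_json):
    """Extract the generated text of the first choice."""
    return response_json["choices"][0]["text"]


def complete(prompt, model, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, model, **kwargs)) as resp:
        return completion_text(json.load(resp))


# Example (server must be running):
# print(complete("The future of AI is", "/path/to/model/Qwen3-Coder-Next/"))
```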

When it finishes, you should see a model response like the following:

Prompt: 'The future of AI is', Generated text: ' not just about building smarter machines, but about creating systems that can collaborate with humans in meaningful, ethical, and sustainable ways. As AI continues to evolve, it will increasingly shape how we live, work, and interact — and the decisions we make today will determine whether this future is one of shared prosperity or deepening inequality.\n\nThe rise of generative AI, for example, has already begun to transform creative industries, education, and scientific research. Tools like ChatGPT, Midjourney, and'

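This is currently an early-access preview, and performance optimization is ongoing. If you run into any problems during deployment (including but not limited to functional or compliance issues), please file an issue in the model's code repository; the developers will review and respond promptly.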