Qwen3.5 goes open source again! On March 2, Qwen officially open-sourced four small-size models in the Qwen3.5 series: Qwen3.5-0.8B/2B/4B/9B. These models inherit the strengths of the Qwen3.5 family, combining native multimodal training with the latest model architecture to deliver excellent performance across scenarios ranging from extremely resource-constrained devices to high-performance lightweight applications. Remarkably, the 9B model performs on par with gpt-oss-120B.

The Modelers community has published all four models along with step-by-step Ascend-based deployment tutorials, so developers can try the models on Day 0. You are welcome to download and experiment! After completing a deployment, you can also submit a write-up to the community's New Year model essay contest!

Model and tutorial links

BF16 model weights:

➡️https://modelers.cn/models/Qwen-AI/Qwen3.5-0.8B

➡️https://modelers.cn/models/Qwen-AI/Qwen3.5-2B

➡️https://modelers.cn/models/Qwen-AI/Qwen3.5-4B

➡️https://modelers.cn/models/Qwen-AI/Qwen3.5-9B

vLLM Ascend deployment tutorials

The Modelers community provides a tailored deployment tutorial for each model size:

➡️https://modelers.cn/models/vLLM_Ascend/Qwen3.5-0.8B

➡️https://modelers.cn/models/vLLM_Ascend/Qwen3.5-2B

➡️https://modelers.cn/models/vLLM_Ascend/Qwen3.5-4B

➡️https://modelers.cn/models/vLLM_Ascend/Qwen3.5-9B

SGLang deployment tutorials:

The Modelers community provides a tailored deployment tutorial for each model size:

➡️https://modelers.cn/models/SGLangAscend/Qwen3.5-0.8B

➡️https://modelers.cn/models/SGLangAscend/Qwen3.5-2B

➡️https://modelers.cn/models/SGLangAscend/Qwen3.5-4B

➡️https://modelers.cn/models/SGLangAscend/Qwen3.5-9B

01 Model Introduction

  • 0.8B / 2B: ultra-lightweight, the first choice for on-device use

    Highlights: extremely small footprint, very fast inference.
    Scenarios: ideal for mobile devices, IoT edge deployment, and low-latency real-time interaction.

  • 4B: a strong base for lightweight agents

    Highlights: strong performance and a multimodal base model, well suited to agent workloads.
    Scenarios: serves as the core brain of a lightweight agent, balancing performance against resource consumption.

  • 9B: compact size, above-class performance

    Highlights: a compact architecture whose performance nonetheless rivals gpt-oss-120B.
    Scenarios: server-side deployments that need strong capability under tight memory budgets; a highly cost-effective general-purpose choice.

With this release, the open-source Qwen3.5 family now covers the large-size Qwen3.5-397B-A17B, three mid-size models (Qwen3.5-122B-A10B, Qwen3.5-35B-A3B, and Qwen3.5-27B), and the four small-size models released this time: Qwen3.5-9B/4B/2B/0.8B.

The following walks through detailed vLLM- and SGLang-based deployment steps using Qwen3.5-9B as the example; the other sizes can follow the same process with minor adaptation.

Deploying on Ascend with vLLM Ascend

1. Model weights

Qwen3.5-9B (BF16):

https://modelers.cn/models/Qwen-AI/Qwen3.5-9B

Note: we recommend downloading the model weights to a directory shared across nodes, e.g. /root/.cache/.
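If you prefer scripting the download, the sketch below maps a repo id onto that shared cache directory. The `openmind_hub` import and its `snapshot_download` call are assumptions based on the Modelers Python SDK; verify the exact API against the Modelers documentation before relying on it.

```python
from pathlib import Path

def shared_weight_dir(model_id: str, root: str = "/root/.cache") -> str:
    """Map a Modelers repo id to a path under the shared cache,
    e.g. "Qwen-AI/Qwen3.5-9B" -> "/root/.cache/Qwen-AI/Qwen3.5-9B"."""
    return str(Path(root) / model_id)

def download_weights(model_id: str, root: str = "/root/.cache") -> str:
    """Fetch weights from modelers.cn into the shared cache.
    NOTE: openmind_hub / snapshot_download is a hypothetical usage of the
    Modelers SDK; check its docs for the actual signature."""
    from openmind_hub import snapshot_download  # hypothetical import, not verified
    target = shared_weight_dir(model_id, root)
    snapshot_download(repo_id=model_id, local_dir=target)
    return target
```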

2. Installation

  1) Official Docker image

You can download the image tarball from the image link and deploy as follows:

# Load the downloaded image tarball with docker
# Update the vllm-ascend image tarball name for your environment; the example below uses A3 arm:
docker load -i Vllm-ascend-Qwen3_5-A3-Ubuntu-v0.tar
# Update --device for your hardware (Atlas A3: /dev/davinci[0-15]).

# Note: download the weights to /root/.cache in advance.
# Update the vllm-ascend image and set the image name accordingly
export IMAGE=vllm-ascend:qwen3_5-v0-a3
export NAME=vllm-ascend

# Run the container with the variables defined above
# Note: with Docker bridge networking, open the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=100g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

  2) Build from source

If you prefer not to use the Docker image above, you can also build everything from source:

  • Make sure CANN 8.5.0 is installed in your environment
  • Install vllm-ascend from source; see the installation guide.

After installing vllm-ascend from source, upgrade vllm, vllm-ascend, and transformers to the main branch:

# Upgrade vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout a75a5b54c7f76bc2e15d3025d6
git fetch origin pull/34521/head:pr-34521
git merge pr-34521
VLLM_TARGET_DEVICE=empty pip install -v .

# Upgrade vllm-ascend
pip uninstall vllm-ascend -y
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git checkout c63b7a11888e9e1caeeff8
git fetch origin pull/6742/head:pr-6742
git merge pr-6742
pip install -v .

# Reinstall transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
git reset --hard fc9137225880a9d03f130634c20f9dbe36a7b8bf
pip install .

For a multi-node deployment, complete the environment setup on every node.

3. Deployment

Single-node deployment

Using the A3 series as an example, run the following script to start online inference.

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=1
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/Qwen3.5-9B/ \
    --served-model-name "qwen3.5" \
    --host 0.0.0.0 \
    --port 8010 \
    --data-parallel-size 1 \
    --tensor-parallel-size 4 \
    --max-model-len 5000 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.8 \
    --skip-mm-profiling \
    --trust-remote-code \
    --async-scheduling \
    --allowed-local-media-path / \
    --mm-processor-cache-gb 0 \
    --enforce-eager \
    --additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'
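Weight loading can take a while, so it helps to wait for the server before sending requests. A minimal sketch, assuming vLLM's OpenAI-compatible server exposes `GET /health` (it does in recent releases) and the host/port used above:

```python
import time
import urllib.error
import urllib.request

def health_url(host: str, port: int) -> str:
    # vLLM's OpenAI-compatible server answers GET /health once it is ready
    return f"http://{host}:{port}/health"

def wait_for_server(host: str = "localhost", port: int = 8010,
                    timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll /health until the server responds 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url(host, port), timeout=5) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(poll_s)
    return False
```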

Run the following to send a request to the model:

curl http://localhost:8010/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0
      }'

When the request completes, the model responds along these lines:

Prompt: 'The future of AI is', Generated text: ' not just about building smarter machines, but about creating systems that can collaborate with humans in meaningful, ethical, and sustainable ways. As AI continues to evolve, it will increasingly shape how we live, work, and interact — and the decisions we make today will determine whether this future is one of shared prosperity or deepening inequality.\n\nThe rise of generative AI, for example, has already begun to transform creative industries, education, and scientific research. Tools like ChatGPT, Midjourney, and'
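For scripted use, the same request can be sent from Python with only the standard library. The sketch below assumes the server launched above (`--served-model-name qwen3.5`, port 8010):

```python
import json
import urllib.request

def completion_payload(prompt: str, max_tokens: int = 100,
                       temperature: float = 0.0) -> dict:
    # Body for the OpenAI-compatible /v1/completions endpoint;
    # "model" must match the --served-model-name used at launch.
    return {"model": "qwen3.5", "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(prompt: str, base: str = "http://localhost:8010") -> str:
    """POST a completion request and return the generated text."""
    req = urllib.request.Request(
        base + "/v1/completions",
        data=json.dumps(completion_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```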

You can also send a multimodal request to the model:

curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustration?"}
            ]}
        ]
      }'

When the request completes, you should see a response like:

{"id":"chatcmpl-9dab99d55addd8c0","object":"chat.completion","created":1771060145,"model":"qwen3.5","choices":[{"index":0,"message":{"role":"assistant","content":"TONGYI Qwen","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":119,"completion_tokens":7,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
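Responses follow the OpenAI `chat.completion` schema, so extracting the answer is a one-liner; the sample body below is trimmed from the response above:

```python
import json

def assistant_text(response_body: str) -> str:
    """Extract the assistant message from an OpenAI-style chat.completion body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]

# trimmed from the sample response above
sample = ('{"id":"chatcmpl-9dab99d55addd8c0","object":"chat.completion",'
          '"model":"qwen3.5","choices":[{"index":0,"message":{"role":"assistant",'
          '"content":"TONGYI Qwen"},"finish_reason":"stop"}]}')
```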

Deploying on Ascend with SGLang

1. Installation

The dependencies required by the NPU runtime environment are integrated into Docker images uploaded to Huawei Cloud; you can pull them directly.

# Atlas 800 A3
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.5.0-a3
# Atlas 800 A2
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.5.0-910b
# Start the container (point IMAGE at the image pulled above, pick a container name)
export IMAGE=swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.5.0-a3
export NAME=sglang
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0  \
--device=/dev/davinci1:/dev/davinci1  \
--device=/dev/davinci2:/dev/davinci2  \
--device=/dev/davinci3:/dev/davinci3  \
--device=/dev/davinci4:/dev/davinci4  \
--device=/dev/davinci5:/dev/davinci5  \
--device=/dev/davinci6:/dev/davinci6  \
--device=/dev/davinci7:/dev/davinci7  \
--device=/dev/davinci8:/dev/davinci8  \
--device=/dev/davinci9:/dev/davinci9  \
--device=/dev/davinci10:/dev/davinci10  \
--device=/dev/davinci11:/dev/davinci11  \
--device=/dev/davinci12:/dev/davinci12  \
--device=/dev/davinci13:/dev/davinci13  \
--device=/dev/davinci14:/dev/davinci14  \
--device=/dev/davinci15:/dev/davinci15  \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
${IMAGE}

2. Download the weights

Qwen3.5-9B (BF16):

https://modelers.cn/models/Qwen-AI/Qwen3.5-9B

3. Deployment

This walkthrough uses single-node deployment: one script configures environment parameters and launches the inference service, followed by a curl example to quickly verify the deployment.

Single-node deployment

Run the following script to start online inference.

# pin CPUs to the performance governor
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind CPU affinity
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# source the CANN environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
# set MODEL_PATH to the local directory of the downloaded Qwen3.5-9B weights
python3 -m sglang.launch_server \
        --model-path $MODEL_PATH \
        --attention-backend ascend \
        --device npu \
        --tp-size 1 \
        --chunked-prefill-size -1 --max-prefill-tokens 120000 \
        --disable-radix-cache \
        --trust-remote-code \
        --host 127.0.0.1 \
        --mem-fraction-static 0.8 \
        --port 8000 \
        --cuda-graph-bs 16 \
        --enable-multimodal \
        --mm-attention-backend ascend_attn

Send a test request:

curl --location http://127.0.0.1:8000/v1/chat/completions --header 'Content-Type: application/json' --data '{
  "model": "qwen3.5",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": "/image_path/qwen.png"} 
        },
        {"type": "text", "text": "What is the text in the illustration?"}
      ]
    }
  ]
}'
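A local path in `image_url` only works when the server itself can read that file. A more portable variant, sketched below, embeds the image as a base64 data URL before posting to the same `/v1/chat/completions` endpoint (the payload mirrors the curl body above; the helper names are ours):

```python
import base64

def image_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL the server can decode."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/png;base64,{b64}"

def vqa_payload(image_url: str, question: str) -> dict:
    # Same shape as the curl body above for /v1/chat/completions
    return {
        "model": "qwen3.5",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }
```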

The curl request returns a response like:

{"id":"cdcd6d14645846e69cc486554f198154","object":"chat.completion","created":1772098465,"model":"qwen3.5","choices":[{"index":0,"message":{"role":"assistant","content":"The user is asking about the text present in the image. I will analyze the image to identify the text.\n</think>\n\nThe text in the image is \"TONGyi Qwen\".","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":248044}],"usage":{"prompt_tokens":98,"total_tokens":138,"completion_tokens":40,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
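Note that the returned `content` can include the model's reasoning trace terminated by a `</think>` tag, as in the response above. A small helper to separate the reasoning from the final answer:

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split an assistant message into (reasoning, answer).
    Qwen3.5 may emit a reasoning trace terminated by a </think> tag."""
    marker = "</think>"
    if marker in content:
        reasoning, answer = content.split(marker, 1)
        return reasoning.strip(), answer.strip()
    return "", content.strip()
```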

We look forward to your feedback!

The adapted models in this tutorial are currently an early preview, with performance optimization in progress. If you encounter any issue (functional, compliance, or otherwise), please file an issue in the model's code repository; developers will review and respond promptly.

🔗https://modelers.cn/models/vLLM_Ascend/Qwen3.5-9B
🔗https://modelers.cn/models/SGLangAscend/Qwen3.5-9B

Join the community New Year essay contest and win prizes!

The Modelers community New Year model-deployment essay contest is in full swing. Developers are welcome to deploy the Qwen3.5 series with this tutorial and share technical practice, deployment experience, troubleshooting tips, and more. High-quality submissions win exclusive community gifts, so come join in!

📺 Livestream preview: March 5 (this Thursday) at 19:30, community technical experts will host a dedicated livestream walking you through the full Qwen3.5 deployment workflow on domestic compute. Book now to unlock hands-on tips!


The Kunpeng & Ascend developer community is open to everyone, "connecting global computing developers and aggregating the Huawei-plus ecosystem." It covers Kunpeng and Ascend resources, helping developers quickly obtain the knowledge, experience, software, tools, and compute they need, so that development is easy to learn, pleasant to use, and successful, on the path to becoming a core developer.
