小参数・大码力・易部署 | Qwen3.6-27B上线魔乐社区，基于昇腾的部署教程来了

继一周前模型开源发布后，千问再度开源Qwen3.6-27B —— 一个拥有270亿参数的稠密多模态模型，也是社区呼声最高的模型规格。Qwen3.6-27B 依然支持多模态思考与非思考模式，在智能体编程方面达到了旗舰级表现，全面超越前代开源旗舰 Qwen3.5-397B-A17B（总参数397B / 激活参数17B的MoE模型）。作为稠密架构，它无需MoE路由即可部署，是开发者在实用、可广泛部署规模

魔乐社区

116人浏览 · 2026-05-18 16:23:19

魔乐社区 · 2026-05-18 16:23:19 发布

继一周前Qwen3.6-35B-A3B模型开源发布后，千问再度开源Qwen3.6-27B —— 一个拥有270亿参数的稠密多模态模型，也是社区呼声最高的模型规格。Qwen3.6-27B 依然支持多模态思考与非思考模式，在智能体编程方面达到了旗舰级表现，全面超越前代开源旗舰 Qwen3.5-397B-A17B（总参数397B / 激活参数17B的MoE模型）。作为稠密架构，它无需MoE路由即可部署，是开发者在实用、可广泛部署规模上获取顶尖编程能力的理想选择。

Qwen3.6-27B和35B模型开源权重均已上线魔乐社区。昇腾同步支持基于vLLM和SGLang在昇腾全系列产品上高效推理部署。社区上线适配和量化模型以及部署教程，供开发者按需选择，欢迎下载体验！

🔗 开源权重：

https://modelers.cn/models/Qwen-AI/Qwen3.6-27B
https://modelers.cn/models/Qwen-AI/Qwen3.6-35B-A3B

🔗 W8A8量化版（NPU适配）：

https://modelers.cn/models/Eco-Tech/Qwen3.6-35B-A3B-w8a8
27B量化版（Coming soon~)

🔗 vLLM Ascend部署教程：

https://modelers.cn/models/vLLM_Ascend/Qwen3.6-27B
https://modelers.cn/models/vLLM_Ascend/Qwen3.6-35B-A3B

🔗 SGLang部署教程：

- https://modelers.cn/models/SGLangAscend/Qwen3.6-27B
- https://modelers.cn/models/SGLangAscend/Qwen3.6-35B-A3B

🔗 FlagOS适配版：

https://modelers.cn/user/FlagRelease?model_name=qwen3.6

模型亮点

旗舰级智能体编程能力，全面超越 Qwen3.5-397B-A17B

Qwen3.6-27B 在稠密模型的智能体编程能力上实现了突破。仅凭270亿参数，它在所有主要编程基准上全面超越了 Qwen3.5-397B-A17B（总参数397B / 激活参数17B）——包括 SWE-bench Verified（77.2 vs. 76.2）、SWE-bench Pro（53.5 vs. 50.9）、Terminal-Bench 2.0（59.3 vs. 52.5）以及 SkillsBench（48.2 vs. 30.0）。同时，它也大幅领先于同规模的稠密模型。在推理任务上，Qwen3.6-27B 在 GPQA Diamond 上取得了87.8的成绩，可与数倍于其规模的模型相媲美。

强大的文本和多模态推理能力

Qwen3.6-27B 原生支持多模态，支持视觉语言思考与非思考模式——与 Qwen3.6-35B-A3B 相同。它能够处理图像、视频与文本的多模态理解，支持视觉推理、文档理解和视觉问答等任务。

基于vLLM Ascend的部署推理教程

以下以Qwen3.6-27B基于vLLM Ascend在昇腾上部署为例，手把手教你如何部署推理。

1. 下载模型权重

Qwen3.6-27B：

https://modelers.cn/models/Qwen-AI/Qwen3.6-27B

注：建议将模型权重下载至多节点共享目录，例如 /root/.cache/。

2. 安装

1）官方 Docker 镜像

可以通过镜像链接下载镜像压缩包来进行部署。

镜像链接：https://quay.io/repository/ascend/vllm-ascend?tab=tags&tag=latest
具体流程如下：

# 拉取0.18.0rc1镜像，以A3 openeuler为例
docker pull quay.io/ascend/vllm-ascend:v0.18.0rc1-a3-openeuler
# 配置对应的Image名
export IMAGE=vllm-ascend:v0.18.0rc1-a3-openeuler
export NAME=vllm-ascend
# 使用定义的变量运行容器
# 注意：若使用 Docker 桥接网络，请提前开放可供多节点通信的端口
# 根据您的设备更新 --device（Atlas A3：/dev/davinci[0-15]）。
docker run --rm \
--name $NAME \
--net=host \
--shm-size=100g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

2）源码构建

如果不希望使用上述 Docker 镜像，也可通过源码完整构建：

确保你的环境成功安装了CANN 8.5.0
从源码安装 vllm-ascend，请参考安装指南（https://docs.vllm.ai/projects/ascend/en/latest/installation.html）

从源码安装 vllm-ascend后，需要将 vllm、vllm-ascend、transformers 升级至主分支：

# 升级 vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout bcf2be96120005e9aea171927f85055a6a5c0cf6
VLLM_TARGET_DEVICE=empty pip install -v .
# 升级 vllm-ascend
pip uninstall vllm-ascend -y
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git checkout 99e1ea0fe685e93f53ee5adfe4b41cdd42fb809f
pip install -v .
# 重新安装 transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
git reset --hard fc9137225880a9d03f130634c20f9dbe36a7b8bf
pip install .

如需部署多节点环境，您需要在每个节点上分别完成环境配置。

3. 部署

以下单节点部署为例。

A2 系列

执行以下脚本进行在线推理。

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /root/.cache/Qwen3.6-27B \
    --served-model-name "qwen3.6" \
    --host 0.0.0.0 \
    --port 8010 \
    --data-parallel-size 1 \
    --tensor-parallel-size 2 \
    --max-model-len 262144 \
    --max-num-batched-tokens 25600 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.94 \
    --compilation-config '{"cudagraph_capture_sizes":[1,2,3,4,8,12,16,24,32,48], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --trust-remote-code \
    --async-scheduling \
    --allowed-local-media-path / \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --additional-config '{"enable_cpu_binding":true}'

A3 系列

执行以下脚本进行在线推理：

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
vllm serve /root/.cache/Qwen3.6-27B \
    --served-model-name "qwen3.6" \
    --host 0.0.0.0 \
    --port 8010 \
    --data-parallel-size 1 \
    --tensor-parallel-size 2 \
    --max-model-len 262144 \
    --max-num-batched-tokens 25600 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.94 \
    --compilation-config '{"cudagraph_capture_sizes":[1,2,3,4,8,12,16,24,32,48], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --trust-remote-code \
    --async-scheduling \
    --allowed-local-media-path / \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --additional-config '{"enable_cpu_binding":true}'

执行以下脚本向模型发送一条请求：

curl http://localhost:8010/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "The future of AI is",
        "path": "/path/to/model/Qwen3.6-27B/",
        "max_tokens": 100,
        "temperature": 0
        }'

执行结束后，您可以看到模型回答如下：

Prompt: 'The future of AI is', Generated text: ' not just about building smarter machines, but about creating systems that can collaborate with humans in meaningful, ethical, and sustainable ways. As AI continues to evolve, it will increasingly shape how we live, work, and interact — and the decisions we make today will determine whether this future is one of shared prosperity or deepening inequality.\n\nThe rise of generative AI, for example, has already begun to transform creative industries, education, and scientific research. Tools like ChatGPT, Midjourney, and'

也可执行以下脚本向模型发送一条多模态请求：


curl http://localhost:8010/v1/completions \
  -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3.6",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustrate?"}
            ]}
        ]
    }'

执行结束后，您可以看到模型回答如下：

{"id":"chatcmpl-9dab99d55addd8c0","object":"chat.completion","created":1771060145,"model":"qwen3.6","choices":[{"index":0,"message":{"role":"assistant","content":"TONGYI Qwen","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":119,"completion_tokens":7,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

4. 精度评估

可使用AISBench或者语言模型评估工具（Language Model Evaluation Harness）进行精度评估。

AISBench链接：https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html

5. 性能评估

可使用 AISBench进行评估：https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation

或使用 vLLM 基准测试工具：https://docs.vllm.ai/en/latest/contributing/benchmarks.html

注：以上仅为尝鲜体验，性能优化中。如您在部署过程中，发现任何问题（包括但不限于功能问题、合规问题），请在推理指南的代码仓提交issue，开发者将及时审视并解答。

🔗 https://modelers.cn/models/vLLM_Ascend/Qwen3.6-27B

基于SGLang的部署推理教程

以下以Qwen3.6-27B基于SGLang在昇腾上部署为例，手把手教你如何部署推理。

1. 准备环境

NPU运行时环境所需的依赖已集成到Docker镜像中，并上传至华为云平台，用户可直接拉取该镜像。

#Atlas 800 A3
docker pull quay.io/ascend/sglang:v0.5.10-npu.rc1-a3
#Atlas 800 A2
docker pull quay.io/ascend/sglang:v0.5.10-npu.rc1-910b

#start container
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--privileged=true --net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0  \
--device=/dev/davinci1:/dev/davinci1  \
--device=/dev/davinci2:/dev/davinci2  \
--device=/dev/davinci3:/dev/davinci3  \
--device=/dev/davinci4:/dev/davinci4  \
--device=/dev/davinci5:/dev/davinci5  \
--device=/dev/davinci6:/dev/davinci6  \
--device=/dev/davinci7:/dev/davinci7  \
--device=/dev/davinci8:/dev/davinci8  \
--device=/dev/davinci9:/dev/davinci9  \
--device=/dev/davinci10:/dev/davinci10  \
--device=/dev/davinci11:/dev/davinci11  \
--device=/dev/davinci12:/dev/davinci12  \
--device=/dev/davinci13:/dev/davinci13  \
--device=/dev/davinci14:/dev/davinci14  \
--device=/dev/davinci15:/dev/davinci15  \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:${tag}

2. 下载模型权重

Qwen3.6-27B：

https://modelers.cn/models/Qwen-AI/Qwen3.6-27B

3. 部署

单节点部署，执行以下脚本进行在线推理。

# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1

unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING

# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo

export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0
export SGLANG_SCHEDULER_DECREASE_PREFILL_IDLE=1
export SGLANG_PREFILL_DELAYER_MAX_DELAY_PASSES=100

python3 -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --attention-backend ascend \
    --device npu \
    --tp-size 4 --nnodes 1 --node-rank 0\
    --chunked-prefill-size -1 --max-prefill-tokens 60000\
    --disable-radix-cache \
    --trust-remote-code \
    --host 127.0.0.1 --max-running-requests 48 --max-mamba-cache-size 60\
    --mem-fraction-static0.7\
    --port 8000\
    --cuda-graph-bs 28163248\
    --enable-multimodal \
    --mm-attention-backend ascend_attn \
    --dtype bfloat16 --mamba-ssm-dtype bfloat16 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3\
    --speculative-eagle-topk 1\
    --speculative-num-draft-tokens 4

发送请求测试：

curl --location http://127.0.0.1:8000/v1/chat/completions --header 'Content-Type: application/json' --data '{
  "model": "qwen3.6",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": "/image_path/qwen.png"} 
        },
        {"type": "text", "text": "What is the text in the illustrate?"}
      ]
    }
  ]
}'

结果返回如下：

{"id":"cdcd6d14645846e69cc486554f198154","object":"chat.completion","created":1772098465,"model":"qwen3.6","choices":[{"index":0,"message":{"role":"assistant","content":"The user is asking about the text present in the image. I will analyze the image to identify the text.\n</think>\n\nThe text in the image is \"TONGyi Qwen\".","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":248044}],"usage":{"prompt_tokens":98,"total_tokens":138,"completion_tokens":40,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

欢迎体验

在魔乐社区，你可以轻松开启Qwen3.6探索之旅：

极速下载通过社区专属高速通道，快速获取模型权重和适配版本，即刻上手体验，进行部署和微调开发。
在线创建 AI 应用进入“体验空间”，无需复杂部署即可在线推理、创建专属 AI 应用，快速验证想法。
交流共创共成长前往社区“博客”板块，分享实测效果、微调方案与应用案例，与开发者共同挖掘Qwen3.6的无限潜力。

鲲鹏昇腾开发者社区是面向全社会开放的“联接全球计算开发者，聚合华为+生态”的社区，内容涵盖鲲鹏、昇腾资源，帮助开发者快速获取所需的知识、经验、软件、工具、算力，支撑开发者易学、好用、成功，成为核心开发者。

更多推荐

Verl Full Async架构昇腾实践

在大规模语言模型的强化学习（RL）训练中，高效利用计算资源是提升训练效率的核心挑战。传统RL框架普遍采用共卡共进程方案，即每张NPU在训练中仅执行单一任务，导致训练流程严格串行执行（Rollout→Train→Sync）。这种设计在实际开发中面临显著瓶颈：当处理长尾序列时，部分NPU的推理延迟会引发其他NPU的空闲等待，无法通过增加资源缓解，造成整体训练效率下降。为解决这一问题，我们设计了Full