【Ascend】Multi-Card Distributed Training: Official Example
Before launching, netstat can be used to check which ports are already in use (for example, when picking a free master_port for the launcher):
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp | grep 29500
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 10.234.91.164:8888 0.0.0.0:* LISTEN 367/python
tcp6 0 0 10.234.91.164:2222 :::* LISTEN 210/java
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp | grep 8888
tcp 0 0 10.234.91.164:8888 0.0.0.0:* LISTEN 367/python
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp | grep 2222
tcp6 0 0 10.234.91.164:2222 :::* LISTEN 210/java
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp | grep localhost
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp | grep 65536
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ netstat -tulnp | grep 'localhost'
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$
01 Single-Card Training
After configuring the environment variables, run the single-card training script to start training. An example command is shown below (the parameter values are only examples; adjust them to your actual setup):
python3 main.py --batch-size 128 \ # training batch size; preferably a multiple of the number of processor cores for better performance
--data_path /home/data/resnet50/imagenet \ # dataset path
--lr 0.1 \ # learning rate
--epochs 90 \ # number of training epochs
--arch resnet50 \ # model architecture
--world-size 1 \
--rank 0 \
--workers 40 \ # number of data-loading worker processes
--momentum 0.9 \ # momentum
--weight-decay 1e-4 \ # weight decay
--gpu 0 # device ID; the argument is still named gpu, but after migration the actual training device is defined as npu in the code
02 Multi-Card Distributed Training
https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/PT_LMTMOG_0080.html
Multi-card training covers single-node multi-card and multi-node multi-card training. In both cases the single-card training script must be modified into a distributed training script. The configuration flow is as follows.
(1) Single-node multi-card training
- First, follow the section "Modifying the single-card script into a multi-card script".
- Then, follow the section "Launching multi-card distributed training", choose a suitable launch method, make the necessary modifications, and run the corresponding launch command.
Modifying the single-card script into a multi-card script
1. Add the following code to the main function.
local_rank = int(os.environ["LOCAL_RANK"])  # local rank injected by the launcher
device = torch.device('npu', local_rank)  # bind this process to one NPU device
torch.distributed.init_process_group(backend="hccl", rank=local_rank)  # HCCL is the collective communication backend on Ascend
2. After obtaining the training dataset, create train_sampler.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
3. After defining the model, enable DDP mode.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
4. Combine the data loader train_dataloader with train_sampler (a consolidated sketch of all four changes follows the code below).
train_dataloader = DataLoader(dataset=train_data, batch_size=batch_size, sampler=train_sampler)
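Putting the four changes together, a minimal end-to-end sketch might look like this (the Linear model, random dataset, optimizer, and loss are placeholders rather than part of the official example; it assumes the script is launched by torchrun or another launcher that sets LOCAL_RANK, MASTER_ADDR, MASTER_PORT, and WORLD_SIZE in the environment):

import os

import torch
import torch_npu  # Ascend adapter; registers the 'npu' device type
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset


def main():
    # Step 1: bind this process to one NPU and initialize the HCCL process group.
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device('npu', local_rank)
    torch_npu.npu.set_device(local_rank)
    dist.init_process_group(backend="hccl", rank=local_rank)

    # Placeholder model and dataset, only to keep the sketch self-contained.
    model = torch.nn.Linear(10, 1).to(device)
    train_data = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # Step 2: distributed sampler so each rank sees a different shard of the data.
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)

    # Step 3: wrap the model with DDP.
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank)

    # Step 4: hand the sampler to the DataLoader (do not also pass shuffle=True).
    train_dataloader = DataLoader(dataset=train_data, batch_size=32, sampler=train_sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(2):
        train_sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in train_dataloader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()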
Launching multi-card distributed training
In both single-node and multi-node scenarios, there are five ways to launch distributed training:
- shell script (recommended)
- mp.spawn
- Python (torch.distributed.launch)
- torchrun: supported only on PyTorch 1.11.0 and later.
- torch_npu_run (recommended for cluster scenarios): an improved version of torchrun for large cluster scenarios that speeds up link setup across the cluster; supported only on PyTorch 1.11.0.
1. Shell script method
A shell script is a program written in a shell scripting language: a sequence of commands that the shell executes. Shell scripts automate tasks, so repetitive work can be done quickly and consistently.
export HCCL_WHITELIST_DISABLE=1   # disable the HCCL communication whitelist
RANK_ID_START=0
WORLD_SIZE=8                      # total number of processes, one per NPU
for((RANK_ID=$RANK_ID_START;RANK_ID<$((WORLD_SIZE+RANK_ID_START));RANK_ID++));
do
    echo "Device ID: $RANK_ID"
    export LOCAL_RANK=$RANK_ID    # each background process reads its own LOCAL_RANK
    python3 ddp_test_shell.py &
done
wait                              # block until all ranks have finished
Note: for 8-card training set WORLD_SIZE=8; if training on a single card, set WORLD_SIZE=1.
world_size (total number of processes): in distributed training or parallel computing, world_size is the total number of processes taking part in the computation. Each process typically runs on a different CPU core, GPU, or entire compute node.
- Save the code above into a .sh file.
- Grant execute permission:
chmod +x run.sh
- Run the script:
./run.sh
2. torchrun method
export HCCL_WHITELIST_DISABLE=1
torchrun --standalone --nnodes=1 --nproc_per_node=8 ddp_test_shell.py
3. Python method
# master_addr and master_port must be set according to the actual environment
export HCCL_WHITELIST_DISABLE=1
python3 -m torch.distributed.launch --nproc_per_node 8 --master_addr localhost --master_port *** ddp_test.py
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ python3 -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port "29500" ddp_test.py
/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=0
[2024-10-28 23:26:37,024] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 24010) of binary: /usr1/python/python/target/checkout/package/python/python-24.10.217/bin/python3
Traceback (most recent call last):
File "/usr1/build/lib/python3.9/runpy.py", line 197, in _run_module_as_main
File "/usr1/build/lib/python3.9/runpy.py", line 87, in _run_code
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
ddp_test.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-28_23:26:37
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 24010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ ./run_python.sh
sh: ./run_python.sh: Permission denied
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ chmod +x run_python.sh
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ ./run_python.sh
/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2024-10-28 23:39:53,660] torch.distributed.run: [WARNING]
[2024-10-28 23:39:53,660] torch.distributed.run: [WARNING] *****************************************
[2024-10-28 23:39:53,660] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-28 23:39:53,660] torch.distributed.run: [WARNING] *****************************************
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=0
ddp_test.py: error: unrecognized arguments: --local-rank=1
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=3
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=4
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=5
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=6
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=2
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=7
[2024-10-28 23:40:03,752] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 24696) of binary: /usr1/python/python/target/checkout/package/python/python-24.10.217/bin/python3
Traceback (most recent call last):
File "/usr1/build/lib/python3.9/runpy.py", line 197, in _run_module_as_main
File "/usr1/build/lib/python3.9/runpy.py", line 87, in _run_code
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
ddp_test.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 24697)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 2 (local_rank: 2)
exitcode : 2 (pid: 24698)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 3 (local_rank: 3)
exitcode : 2 (pid: 24699)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 4 (local_rank: 4)
exitcode : 2 (pid: 24700)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 5 (local_rank: 5)
exitcode : 2 (pid: 24701)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 6 (local_rank: 6)
exitcode : 2 (pid: 24702)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 7 (local_rank: 7)
exitcode : 2 (pid: 24703)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-28_23:40:03
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 24696)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[naie@notebook-npu-bde28a8c-568c5bcb58-mlj9c multi_GPU_training]$ ./run_python.sh
/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/naie/.local/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2.2/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
warnings.warn(f"Warning: The {path} owner does not match the current user.")
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU]
ddp_test.py: error: unrecognized arguments: --local-rank=0
[2024-10-28 23:41:28,244] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 24925) of binary: /usr1/python/python/target/checkout/package/python/python-24.10.217/bin/python3
Traceback (most recent call last):
File "/usr1/build/lib/python3.9/runpy.py", line 197, in _run_module_as_main
File "/usr1/build/lib/python3.9/runpy.py", line 87, in _run_code
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/naie/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
ddp_test.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-28_23:41:28
host : notebook-npu-bde28a8c-568c5bcb58-mlj9c
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 24925)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The Python launch method kept failing with the errors above. I finally focused on this line of the error output:
usage: ddp_test.py [-h] [--batch_size BATCH_SIZE] [--gpu GPU] [--local_rank LOCAL_RANK]
ddp_test.py: error: unrecognized arguments: --local-rank=0
Looking closely, the unrecognized argument is --local-rank (hyphen), not --local_rank (underscore), so that is indeed where the problem lies. I went through the shell script but could not find where this --local-rank was being passed in; it is in fact injected by torch.distributed.launch itself, which in newer PyTorch versions passes --local-rank instead of --local_rank. In the end the only workable fix was to change the argument handling in ddp_test.py, after which the run succeeded.
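A minimal sketch of the kind of change that resolves this error (not necessarily the exact edit used) is shown below: either let argparse accept the hyphenated --local-rank, or, as the deprecation warning recommends, read LOCAL_RANK from the environment. The --batch_size and --gpu arguments are taken from the usage line above; the rest of ddp_test.py is assumed, not reproduced.

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--gpu", type=int, default=0)
# Accept the flag injected by torch.distributed.launch. Listing both spellings
# keeps the script compatible with older launchers; argparse stores the value
# under args.local_rank either way.
parser.add_argument("--local-rank", "--local_rank", dest="local_rank", type=int, default=-1)
args = parser.parse_args()

# More robust (and what torchrun expects): ignore the CLI flag and read the
# environment variable set by the launcher.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))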
(2) Multi-node multi-card training
- First, follow the section "Modifying the single-card script into a multi-card script".
- Then, follow "Preparing the multi-node multi-card training environment" and complete the necessary configuration.
- Finally, follow the section "Launching multi-card distributed training", choose a suitable launch method, make the necessary modifications, and run the corresponding launch command.