1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend

MindSpore version: 2.1

Execution mode (PyNative/Graph): Either

2 Error Information

2.1 Problem Description

After reinstalling the Ascend + MindSpore 2.1 environment on a new machine, training with the parallel setting dp:mp:pp = 1:1:2 fails with the error below (any configuration with pp > 1 produces the same error):
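
For context, dp:mp:pp = 1:1:2 is the data-parallel : model-parallel : pipeline-parallel split in the training configuration. A minimal sketch of that setting is shown below; the key names (data_parallel, model_parallel, pipeline_stage) are the ones commonly used in MindFormers run configs and are an assumption here, not the author's exact file.

# Sketch only: the dp:mp:pp = 1:1:2 split expressed with the usual MindFormers
# parallel_config keys. Any setting with pipeline_stage (pp) > 1 reproduces the error.
parallel_config = {
    "data_parallel": 1,   # dp
    "model_parallel": 1,  # mp
    "pipeline_stage": 2,  # pp
}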

2.2 Error Message

Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/trainer.py", line 424, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/base_trainer.py", line 631, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 620, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 637, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 961, in compile_and_run
    self.compile(*args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 939, in compile
    jit_config_dict=self._jit_config_dict, *compile_args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/common/api.py", line 1623, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())

3 Root Cause Analysis

The cell-reuse (cell sharing) environment variable is most likely enabled, which is why this pipeline-parallel error is reported.
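
Before applying either fix, it can help to confirm what the variable is currently set to. A minimal check, assuming nothing beyond the MS_DEV_CELL_REUSE variable named in the fix below:

# Print the current value of the cell-reuse switch in this environment.
import os
print("MS_DEV_CELL_REUSE =", os.environ.get("MS_DEV_CELL_REUSE", "<not set>"))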

4 Solution

There are two ways to fix this:

  1. Disable the cell-reuse environment variable with the command export MS_DEV_CELL_REUSE=0. (Cell reuse is an optimization that speeds up graph compilation.)
  2. If you keep cell reuse enabled, add the corresponding decorator to the model, as in the excerpt below.
from mindformers.models.utils import cell_reuse
from mindformers.modules.transformer.moe import default_moe_config
from mindformers.modules.layers import LayerNorm, Dropout
from mindformers.core.loss import CrossEntropyLoss

@MindFormerRegister.register(MindFormerModuleType.MODELS)
class WizardCoderLMHeadModel(BaseModel):
    r"""..."""
    ...
    @cell_reuse()
    def __init__(self, config: WizardCoderConfig = None):
        config = config if config is not None else WizardCoderConfig()
        super(WizardCoderLMHeadModel, self).__init__(config, auto_prefix=True)

[Note]: In MindSpore 2.2.0 the cell_reuse decorator is written differently: the trailing parentheses are no longer needed.

@MindFormerRegister.register(MindFormerModuleType.MODELS)
class WizardCoderLMHeadModel(BaseModel):
    r"""..."""
    ...
    @cell_reuse
    def __init__(self, config: WizardCoderConfig = None):
        config = config if config is not None else WizardCoderConfig()
        super(WizardCoderLMHeadModel, self).__init__(config, auto_prefix=True)