
Ascend 910B: XTuner fine-tuning of InternVL3.5-1B fails #1407

@JeffDing

Description

When fine-tuning InternVL3.5-1B with XTuner on an Ascend 910B, the run fails with the error below. Is this caused by a CANN version that is too old, or by some other environment problem?

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/xtuner_config/vl.py", line 72, in <module>
[rank0]:     trainer = Trainer.from_config(trainer)
[rank0]:   File "/root/xtuner/xtuner/v1/train/trainer.py", line 407, in from_config
[rank0]:     self = cls(
[rank0]:   File "/root/xtuner/xtuner/v1/train/trainer.py", line 313, in __init__
[rank0]:     self._init_dist(backend)
[rank0]:   File "/root/xtuner/xtuner/v1/train/trainer.py", line 902, in _init_dist
[rank0]:     torch.accelerator.set_device_index(int(os.environ["LOCAL_RANK"]))
[rank0]:   File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/accelerator/__init__.py", line 133, in set_device_index
[rank0]:     torch._C._accelerator_setDeviceIndex(device_index)
[rank0]:   File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch_npu/npu/__init__.py", line 251, in _lazy_init
[rank0]:     torch_npu._C._npu_init()
[rank0]: RuntimeError: SetPrecisionMode:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:175 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
[rank0]: [ERROR] 2026-01-03-18:55:33 (PID:8511, Device:0, RankID:0) ERR00100 PTA call acl api failed
[rank0]: [Error]: The internal ACL of the system is incorrect.
[rank0]:         Rectify the fault based on the error information in the ascend log.
[rank0]: EC0010: [PID: 8511] 2026-01-03-18:55:33.631.880 Failed to import Python module [AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead..].
[rank0]:         Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
[rank0]:         TraceBack (most recent call last):
[rank0]:         AOE Failed to call InitCannKB[FUNC:Initialize][FILE:python_adapter_manager.cc][LINE:47]
[rank0]:         Failed to initialize TeConfigInfo.
[rank0]:         [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeTeFusion][FILE:tbe_op_store_adapter.cc][LINE:1921]
[rank0]:         [GraphOpt][InitializeInner][InitTeFusion]: Failed to initialize TeFusion.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1888]
[rank0]:         [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:79]
[rank0]:         [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:120]
[rank0]:         [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:115]
[rank0]:         PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:83]
[rank0]:         OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:239]
[rank0]:         GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:164]
[rank0]:         GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:382]
[rank0]:         [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[rank0]:         [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[rank0]:         [Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]

E0103 18:55:39.548000 8415 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 8511) of binary: /root/.conda/envs/xtuner_npu/bin/python3.10
Traceback (most recent call last):
  File "/root/.conda/envs/xtuner_npu/bin/torchrun", line 7, in <module>
    sys.exit(main())
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
vl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-01-03_18:55:39
  host      : aide-20251118-e18b94f-0005395-84b6c4d979-tvvbf
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8511)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
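The root-cause hint in the log is the EC0010 line: a Python module imported during CANN initialization (InitCannKB) still references `np.float_`, an alias that NumPy 2.0 removed, so the TBE toolchain fails to start before training begins. A minimal sketch to check whether the environment hits this incompatibility (it only tests the NumPy version and the alias, not the CANN internals):

```python
import numpy as np

# NumPy 2.0 removed the np.float_ alias that older CANN/TBE Python code imports.
major = int(np.__version__.split(".")[0])
print(f"NumPy version: {np.__version__}")

if major >= 2:
    print("np.float_ is gone in this NumPy; CANN components that import it will fail")
else:
    print("np.float_ alias still present:", np.float_)
```

If the check reports NumPy >= 2, pinning it below 2.0 in the training environment (e.g. `pip install "numpy<2"`) is a likely workaround, independent of upgrading CANN.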

Configuration code

from xtuner.v1.model import InternVL3P5Dense1BConfig
from xtuner.v1.train import Trainer, TrainerConfig
from xtuner.v1.config import AdamWConfig, LRConfig
from xtuner.v1.datasets import InternS1VLTokenizeFnConfig, DataloaderConfig, DatasetConfig
from xtuner.v1.loss import CELossConfig
import sys
# model config - enable gradient checkpointing
model_cfg = InternVL3P5Dense1BConfig(
    use_gradient_checkpointing=True, freeze_vision=True, freeze_projector=False, freeze_language=False
)
# dataset and dataloader config
sample_max_length = 8000
pack_max_length = 8000

dataset_config = [
    {
        "dataset": DatasetConfig(
            name="formula_recognition",
            anno_path="/home/ma-user/work/dataset/VLM-formula-recognition-dataset_intern_camp/train/train_mini_xt.jsonl",
            media_root="/home/ma-user/work/dataset/VLM-formula-recognition-dataset_intern_camp/train/",
            sample_ratio=1.0,
            class_name="VLMJsonlDataset",
        ),
        # Use the InternVL3.5 template so the prompt aligns with the vision tokens
        "tokenize_fn": InternS1VLTokenizeFnConfig(
            model_cfg=model_cfg,
            max_length=sample_max_length,
            template_name="internvl-3.5",
        ),
    }
]
dataloader_config = DataloaderConfig(
    dataset_config_list=dataset_config,
    pack_max_length=pack_max_length,
    num_workers=16,
    pack_level="soft",
    collator="intern_s1_vl_sft_collator",
)

# optimizer and learning-rate config - raise the learning rate for faster convergence
optim_cfg = AdamWConfig(
    lr=3e-5,
    weight_decay=0.01, # weight decay to curb overfitting
    betas=(0.9, 0.95), # tuned Adam betas
    foreach=False
)
lr_cfg = LRConfig(
    lr_type="cosine",
    warmup_ratio=0.1,  # larger warmup ratio so training starts more stably
    min_lr_ratio=0.1   # minimum learning-rate ratio
)

load_from = "/home/ma-user/work/model/InternVL3_5-1B-HF"
tokenizer = "/home/ma-user/work/model/InternVL3_5-1B-HF"

# trainer config
trainer = TrainerConfig(
    load_from=load_from,
    model_cfg=model_cfg,
    optim_cfg=optim_cfg,
    dataloader_cfg=dataloader_config,
    lr_cfg=lr_cfg,
    tokenizer_path=tokenizer,
    global_batch_size=8,
    gradient_accumulation_steps=4,
    total_epoch=5,
    work_dir="/root/data/xtuner_workdir/vl_1031/",
    loss_cfg=CELossConfig(mode="chunk", chunk_size=1024),
    hf_interval=50,
    hf_max_keep=2,
)
trainer = Trainer.from_config(trainer)
# Check that the model loaded the pretrained weights correctly
print(f"Model device: {next(trainer._engine.model.parameters()).device}")
print(f"Model dtype: {next(trainer._engine.model.parameters()).dtype}")
# sys.exit(0)
trainer.fit()

Environment versions

CANN=8.2.RC2
torch==2.8.0
torch-npu==2.8.0
transformers==4.57.0

XTuner installation command

git clone https://gh.llkk.cc/https://github.com/InternLM/xtuner.git
cd xtuner
git checkout 4990d05c5a5416fbfd51fee9e6cf502c66947099
pip install -e .
