Fine-tuning InternVL3.5-1B with XTuner on Ascend 910B fails with the error below. Is this caused by a CANN version that is too low, or by some other environment problem?
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/xtuner_config/vl.py", line 72, in <module>
[rank0]: trainer = Trainer.from_config(trainer)
[rank0]: File "/root/xtuner/xtuner/v1/train/trainer.py", line 407, in from_config
[rank0]: self = cls(
[rank0]: File "/root/xtuner/xtuner/v1/train/trainer.py", line 313, in __init__
[rank0]: self._init_dist(backend)
[rank0]: File "/root/xtuner/xtuner/v1/train/trainer.py", line 902, in _init_dist
[rank0]: torch.accelerator.set_device_index(int(os.environ["LOCAL_RANK"]))
[rank0]: File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/accelerator/__init__.py", line 133, in set_device_index
[rank0]: torch._C._accelerator_setDeviceIndex(device_index)
[rank0]: File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch_npu/npu/__init__.py", line 251, in _lazy_init
[rank0]: torch_npu._C._npu_init()
[rank0]: RuntimeError: SetPrecisionMode:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:175 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
[rank0]: [ERROR] 2026-01-03-18:55:33 (PID:8511, Device:0, RankID:0) ERR00100 PTA call acl api failed
[rank0]: [Error]: The internal ACL of the system is incorrect.
[rank0]: Rectify the fault based on the error information in the ascend log.
[rank0]: EC0010: [PID: 8511] 2026-01-03-18:55:33.631.880 Failed to import Python module [AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead..].
[rank0]: Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
[rank0]: TraceBack (most recent call last):
[rank0]: AOE Failed to call InitCannKB[FUNC:Initialize][FILE:python_adapter_manager.cc][LINE:47]
[rank0]: Failed to initialize TeConfigInfo.
[rank0]: [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeTeFusion][FILE:tbe_op_store_adapter.cc][LINE:1921]
[rank0]: [GraphOpt][InitializeInner][InitTeFusion]: Failed to initialize TeFusion.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1888]
[rank0]: [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:79]
[rank0]: [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:120]
[rank0]: [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:115]
[rank0]: PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:83]
[rank0]: OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:239]
[rank0]: GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:164]
[rank0]: GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:382]
[rank0]: [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[rank0]: [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[rank0]: [Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
E0103 18:55:39.548000 8415 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 8511) of binary: /root/.conda/envs/xtuner_npu/bin/python3.10
Traceback (most recent call last):
File "/root/.conda/envs/xtuner_npu/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
vl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-01-03_18:55:39
host : aide-20251118-e18b94f-0005395-84b6c4d979-tvvbf
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 8511)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
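For what it's worth, the EC0010 line in the log points at a likely root cause: CANN's Python components still import `np.float_`, which NumPy 2.0 removed. A minimal check (assuming only that `numpy` is importable in the same conda env) to confirm whether the environment hit this:

```python
import numpy as np

# NumPy 2.0 removed the np.float_ alias; the EC0010 error above shows CANN's
# Python adapter still importing it, which aborts TBE/GE initialization.
major = int(np.__version__.split(".")[0])
if major >= 2:
    print(f"numpy {np.__version__}: np.float_ is gone; CANN's Python adapter may fail to import")
else:
    print(f"numpy {np.__version__}: np.float_ present ->", hasattr(np, "float_"))
```

If the check flags NumPy 2.x, downgrading with `pip install 'numpy<2'` in the same env usually restores the alias that CANN's Python adapter expects.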
Config code (vl.py)
from xtuner.v1.model import InternVL3P5Dense1BConfig
from xtuner.v1.train import Trainer, TrainerConfig
from xtuner.v1.config import AdamWConfig, LRConfig
from xtuner.v1.datasets import InternS1VLTokenizeFnConfig, DataloaderConfig, DatasetConfig
from xtuner.v1.loss import CELossConfig
import sys
# model config - enable gradient checkpointing
model_cfg = InternVL3P5Dense1BConfig(
    use_gradient_checkpointing=True, freeze_vision=True, freeze_projector=False, freeze_language=False
)
# dataset and dataloader config
sample_max_length = 8000
pack_max_length = 8000
dataset_config = [
    {
        "dataset": DatasetConfig(
            name="formula_recognition",
            anno_path="/home/ma-user/work/dataset/VLM-formula-recognition-dataset_intern_camp/train/train_mini_xt.jsonl",
            media_root="/home/ma-user/work/dataset/VLM-formula-recognition-dataset_intern_camp/train/",
            sample_ratio=1.0,
            class_name="VLMJsonlDataset",
        ),
        # use the InternVL3.5 template so the prompt stays aligned with the vision tokens
        "tokenize_fn": InternS1VLTokenizeFnConfig(
            model_cfg=model_cfg,
            max_length=sample_max_length,
            template_name="internvl-3.5",
        ),
    }
]
dataloader_config = DataloaderConfig(
    dataset_config_list=dataset_config,
    pack_max_length=pack_max_length,
    num_workers=16,
    pack_level="soft",
    collator="intern_s1_vl_sft_collator",
)
# optimizer config - raise the learning rate for faster convergence
optim_cfg = AdamWConfig(
    lr=3e-5,
    weight_decay=0.01,  # weight decay to curb overfitting
    betas=(0.9, 0.95),  # tuned Adam betas
    foreach=False
)
lr_cfg = LRConfig(
    lr_type="cosine",
    warmup_ratio=0.1,  # larger warmup ratio so training starts more stably
    min_lr_ratio=0.1   # minimum learning-rate ratio
)
load_from = "/home/ma-user/work/model/InternVL3_5-1B-HF"
tokenizer = "/home/ma-user/work/model/InternVL3_5-1B-HF"
# trainer config
trainer = TrainerConfig(
    load_from=load_from,
    model_cfg=model_cfg,
    optim_cfg=optim_cfg,
    dataloader_cfg=dataloader_config,
    lr_cfg=lr_cfg,
    tokenizer_path=tokenizer,
    global_batch_size=8,
    gradient_accumulation_steps=4,
    total_epoch=5,
    work_dir="/root/data/xtuner_workdir/vl_1031/",
    loss_cfg=CELossConfig(mode="chunk", chunk_size=1024),
    hf_interval=50,
    hf_max_keep=2,
)
trainer = Trainer.from_config(trainer)
# check that the pretrained weights actually loaded
print(f"Model device: {next(trainer._engine.model.parameters()).device}")
print(f"Model dtype: {next(trainer._engine.model.parameters()).dtype}")
# sys.exit(0)
trainer.fit()

Environment versions
CANN=8.2.RC2
torch==2.8.0
torch-npu==2.8.0
transformers==4.57.0
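One quick way to cross-check the versions listed above against what is actually installed in the conda env (a stdlib-only sketch; the package names are the pip names from this issue):

```python
from importlib import metadata

# Print installed versions of the packages implicated in the failure.
for pkg in ("numpy", "torch", "torch-npu", "transformers"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```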
XTuner install commands
git clone https://gh.llkk.cc/https://github.com/InternLM/xtuner.git
cd xtuner
git checkout 4990d05c5a5416fbfd51fee9e6cf502c66947099
pip install -e .
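After installing, a small smoke test independent of XTuner can isolate the NPU-init failure (a sketch under assumptions: torch and torch-npu are installed on the Ascend host, and importing `torch_npu` patches torch with the `npu` namespace, as torch-npu 2.8.0 does):

```python
# Minimal NPU smoke test, independent of XTuner.
try:
    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with torch)

    print("npu available:", torch.npu.is_available())
except (ImportError, RuntimeError) as exc:
    # An ImportError means torch/torch_npu is missing; a RuntimeError here
    # reproduces the same ACL init failure seen in the traceback above.
    print("NPU init failed:", exc)
```

If this minimal script raises the same `_npu_init` RuntimeError, the problem is in the CANN/torch-npu environment rather than in the XTuner config.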