Project Page | Models | Datasets
🌎English | 🇨🇳中文
UnifoLM-VLA-0 is a Vision–Language–Action (VLA) large model in the UnifoLM series, designed for general-purpose humanoid robot manipulation. It moves beyond the limitations of conventional Vision–Language Models (VLMs) in physical interaction: through continued pre-training on robot manipulation data, the model evolves from "vision-language understanding" into an "embodied brain" equipped with physical common sense.
| Spatial Semantic Enhancement | Manipulation Generalization |
|---|---|
| To address the requirements for instruction comprehension and spatial understanding in manipulation tasks, the model deeply integrates textual instructions with 2D/3D spatial details through continued pre-training, substantially strengthening its spatial perception and geometric understanding capabilities. | By leveraging full dynamics prediction data, the model achieves strong generalization across diverse manipulation tasks. In real-robot validation, it can complete 12 categories of complex manipulation tasks with high quality using only a single policy. |
- Jan 29, 2026: 🚀 We released the training and inference code, along with the model weights for UnifoLM-VLA-0.
- Training
- Inference
- Checkpoints
This project is built on CUDA 12.4, and using the same version is strongly recommended to ensure compatibility.
```bash
conda create -n unifolm-vla python==3.10.18
conda activate unifolm-vla

git clone https://github.com/unitreerobotics/unifolm-vla.git
# If you already downloaded the repo:
cd unifolm-vla
pip install --no-deps "lerobot @ git+https://github.com/huggingface/lerobot.git@0878c68"
pip install -e .

# Install FlashAttention2
pip install "flash-attn==2.5.6" --no-build-isolation
```
| Model | Description | Link |
|---|---|---|
| UnifoLM-VLM-Base | Fine-tuned on general-purpose image–text VQA data and open-source robot datasets. | HuggingFace |
| UnifoLM-VLA-Base | Fine-tuned on the Unitree open-source dataset. | HuggingFace |
| UnifoLM-VLA-LIBERO | Fine-tuned on the LIBERO dataset. | HuggingFace |
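If you prefer to fetch a checkpoint ahead of time (for example, on a machine without interactive access), `huggingface_hub`'s `snapshot_download` can be used. The repository id below is a placeholder assumption; use the actual id from the corresponding model link in the table above.

```python
# Sketch: pre-download a checkpoint with huggingface_hub.
# NOTE: "unitreerobotics/UnifoLM-VLA-Base" is a placeholder repo id;
# substitute the actual id from the model link in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="unitreerobotics/UnifoLM-VLA-Base",       # placeholder
    local_dir="./checkpoints/UnifoLM-VLA-Base",
)
print("Checkpoint downloaded to:", local_dir)
```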
In our experiments, we consider the following twelve open-source datasets:
| Dataset | Robot | Link |
|---|---|---|
| G1_Stack_Block | Unitree G1 | Huggingface |
| G1_Bag_Insert | Unitree G1 | Huggingface |
| G1_Erase_Board | Unitree G1 | Huggingface |
| G1_Clean_Table | Unitree G1 | Huggingface |
| G1_Pack_PencilBox | Unitree G1 | Huggingface |
| G1_Pour_Medicine | Unitree G1 | Huggingface |
| G1_Pack_PingPong | Unitree G1 | Huggingface |
| G1_Prepare_Fruit | Unitree G1 | Huggingface |
| G1_Organize_Tools | Unitree G1 | Huggingface |
| G1_Fold_Towel | Unitree G1 | Huggingface |
| G1_Wipe_Table | Unitree G1 | Huggingface |
| G1_DualRobot_Clean_Table | Unitree G1 | Huggingface |
To train on your own dataset, ensure the data follows the Huggingface LeRobot V2.1 dataset format. Assume the source directory structure of the dataset is as follows:
```
source_dir/
├── dataset1_name
├── dataset2_name
├── dataset3_name
└── ...
```
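If you are unsure whether a dataset is already in LeRobot V2.1 format, a lightweight check is to read its `meta/info.json` and inspect the `codebase_version` field. The sketch below assumes the standard LeRobot on-disk layout, with a `meta/info.json` inside every dataset directory.

```python
# Sketch: check that each dataset under source_dir declares the LeRobot v2.1 format.
# Assumes the standard LeRobot layout with meta/info.json in every dataset directory.
import json
from pathlib import Path

source_dir = Path("/path/to/your/source_dir")
for dataset_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
    info_path = dataset_dir / "meta" / "info.json"
    if not info_path.exists():
        print(f"{dataset_dir.name}: no meta/info.json found — not a LeRobot dataset?")
        continue
    info = json.loads(info_path.read_text())
    print(f"{dataset_dir.name}: codebase_version = {info.get('codebase_version')}")  # expect "v2.1"
```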
Then, run the following command to convert the dataset from LeRobot format to HDF5 format:
```bash
cd prepare_data
python convert_lerobot_to_hdf5.py \
    --data_path /path/to/your/source_dir/dataset1_name \
    --target_path /path/to/save/the/converted/data/directory
```
Finally, run the following command to convert the HDF5 format into the RLDS dataset format required for training. Be sure to update the path (here) to the correct location of the HDF5 data.
```bash
cd prepare_data/hdf5_to_rlds/rlds_dataset
tfds build --data_dir /path/to/save/the/converted/data/directory
```
The directory structure of the converted RLDS dataset is as follows:
```
source_dir/
├── downloads
├── rlds_dataset
└── 1.0.0
```
The `1.0.0` directory is the final RLDS dataset version that can be used for training. The final directory should be kept as `source_dir/1.0.0` (e.g., `g1_stack_block/1.0.0`).
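To confirm the build succeeded, the resulting dataset can be opened directly from the version directory with TFDS and inspected. This is a minimal sketch, assuming TensorFlow Datasets is installed (it is required by the `tfds build` step above); point it at the `1.0.0` directory described above.

```python
# Sketch: open the built RLDS dataset and inspect one episode.
# Replace the path with the 1.0.0 version directory produced by `tfds build`.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("/path/to/the/converted/data/1.0.0")
ds = builder.as_dataset(split="train")
for episode in ds.take(1):
    print("Episode keys:", list(episode.keys()))
```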
To train on a single dataset or multiple datasets, follow the steps below:
- Step 1: Assuming you have already prepared the RLDS dataset, register the dataset (e.g., the Unitree open-source dataset `G1_StackBox`) with our dataloader by adding an entry for it in `configs.py` (here), `transforms.py` (here), `mixtures.py` (here), and `datasets.py` (here). For reference, each of these files contains sample entries for the G1 datasets that we used in our experiments.
- Step 2: Before starting fine-tuning, configure the size of the action chunks predicted by the model, the action and state degrees of freedom in the dataset, and the data normalization scheme in `constants.py` (here). Refer to `NUM_ACTIONS_CHUNK`, `ACTION_DIM`, `PROPRIO_DIM`, and `ACTION_PROPRIO_NORMALIZATION_TYPE` in `G1_CONSTANTS` (an illustrative sketch of these fields appears after this list).
- Step 3: Complete the configuration in the following order (see here):
  - Model Initialization: Set `base_vlm` to the local path or the corresponding model weight URL of UnifoLM-VLM-Base, which will be used to initialize the vision–language backbone model.
  - Dataset Path Configuration: After configuring the model path, set `oxe_data_root` to the root directory of the dataset to ensure that the training script can correctly load the RLDS data.
  - Dataset Mixture Specification: Based on the configured data root, set `data_mix` to the name of the dataset(s) to be used for training or to the desired dataset mixture.
  - Model Checkpoint Saving: Specify the paths for saving model checkpoints and logs, which will store the model weights and training states generated during fine-tuning for later recovery, evaluation, and inference.
  - Parallelism Configuration: Finally, adjust `num_processes` according to the number of available GPUs to match the scale of distributed training.
- Step 4: You can now start fine-tuning. Before running the script `run_unifolm_vla_train.sh`, make sure the configurations from the previous steps are complete.
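For orientation, the constants named in Step 2 typically look like the following. This is only an illustrative sketch: the exact structure of `G1_CONSTANTS` and the valid normalization types are defined in `constants.py`, so use the existing G1 entries there as the ground truth.

```python
# Illustrative sketch of the fields referenced in Step 2 (see constants.py for the
# authoritative definitions; the literal values below are examples, not defaults).
G1_CONSTANTS = {
    "NUM_ACTIONS_CHUNK": 16,   # number of future actions predicted per forward pass
    "ACTION_DIM": 14,          # degrees of freedom of the action vector in your dataset
    "PROPRIO_DIM": 14,         # degrees of freedom of the proprioceptive state
    "ACTION_PROPRIO_NORMALIZATION_TYPE": "bounds_q99",  # normalization scheme, as named in constants.py
}
```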
To evaluate the UnifoLM-VLA-Libero model in the LIBERO simulation environment (here), follow the steps below:
- Step 1: Install the LIBERO simulation environment and its dependencies:
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/LIBERO/libero_requirements.txt # Run from the UnifoLM-VLA project root directory
- Step 2: In `run_eval_libero.sh` (here), modify the following fields: `your_ckpt`, `task_suite_name`, `unnorm_key`, `LIBERO_HOME`, and `vlm_pretrained_path` (valid values for `task_suite_name` can be listed with the sketch after this list).
- Step 3: Launch the evaluation:
```bash
conda activate unifolm-vla
cd unifolm-vla
bash scripts/eval_scripts/run_eval_libero.sh
```
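The value of `task_suite_name` must match one of the suites registered by LIBERO. A quick way to list them (and to confirm the LIBERO installation works) is shown below; it assumes the benchmark module layout of the upstream LIBERO repository.

```python
# Sketch: list the task suites registered by LIBERO; these names are the valid
# values for task_suite_name in run_eval_libero.sh (e.g., libero_spatial, libero_10).
from libero.libero import benchmark

benchmark_dict = benchmark.get_benchmark_dict()
print("Available LIBERO task suites:", sorted(benchmark_dict.keys()))
```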
In our system, inference is executed on the server side. The robot client collects observations from the real robot and sends them to the server for action inference. The full pipeline can be completed by following the steps below.
- Step 1: In `run_real_eval_server.sh` (here), modify the following fields: `ckpt_path`, `port`, `unnorm_key`, and `vlm_pretrained_path`.
- Step 2: Launch the server:
```bash
conda activate unifolm-vla
cd unifolm-vla
bash scripts/eval_scripts/run_real_eval_server.sh
```
- Step 1: Refer to unitree_deploy/README.md to create the `unitree_deploy` conda environment, install the required dependencies, and start the controller or service on the real robot.
- Step 2: Open a new terminal and establish a tunnel connection from the client to the server:
  ```bash
  ssh user_name@remote_server_IP -CNg -L port:127.0.0.1:port
  ```
- Step 3: Modify and run the client script, using `unitree_deploy/robot_client.py` as a reference. A minimal, purely illustrative client sketch follows this list.
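The actual observation/action message format is defined by the deployment server code and `unitree_deploy/robot_client.py`; as a generic illustration of the client–server loop through the SSH tunnel only, a sketch might look like the following. The `/act` endpoint, the payload keys, and the port value are all hypothetical — follow `robot_client.py` for the real interface.

```python
# Hypothetical sketch of the client-side loop: collect an observation, send it
# through the SSH tunnel to the inference server, and receive an action chunk.
# The endpoint name ("/act") and the payload keys are illustrative only;
# the real message format is defined in unitree_deploy/robot_client.py.
import numpy as np
import requests

SERVER_URL = "http://127.0.0.1:8000"  # replace 8000 with the port forwarded by `ssh -L`

def infer_action(image: np.ndarray, state: np.ndarray, instruction: str) -> np.ndarray:
    payload = {
        "image": image.tolist(),        # camera observation
        "state": state.tolist(),        # proprioceptive state
        "instruction": instruction,     # language command
    }
    response = requests.post(f"{SERVER_URL}/act", json=payload, timeout=10.0)
    response.raise_for_status()
    return np.asarray(response.json()["actions"])  # predicted action chunk
```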
Here's a high-level overview of the project's code structure and core components:
```
unifolm-vla/
├── assets               # Media assets such as GIFs
├── experiments          # Libero datasets for running inference
├── deployment           # Deployment server code
├── prepare_data         # Scripts for dataset preprocessing and format conversion
├── scripts              # Main scripts for training, evaluation, and deployment
├── src
│   ├── unifolm_vla          # Core Python package for the UnifoLM-VLA model
│   │   ├── config           # Configuration files for training
│   │   ├── model            # Model architectures and backbone definitions
│   │   ├── rlds_dataloader  # Dataset loading, transformations, and dataloaders
│   │   └── training         # Model training code
```
Much of the code is adapted from Qwen2.5-VL, Isaac-GR00T, Open-X, openvla-oft, and InternVLA-M1.
```bibtex
@misc{unifolm-vla-0,
    author = {Unitree},
    title  = {UnifoLM-VLA-0: A Vision-Language-Action (VLA) Framework under UnifoLM Family},
    year   = {2026},
}
```
