UnifoLM-VLA-0: A Vision-Language-Action (VLA) Framework under UnifoLM Family

Project Page | Models | Datasets

🌎English | 🇨🇳中文

UnifoLM-VLA-0 is a Vision–Language–Action (VLA) large model in the UnifoLM series, designed for general-purpose humanoid robot manipulation. It overcomes the limitations of conventional Vision–Language Models (VLMs) in physical interaction: through continued pre-training on robot manipulation data, the model evolves from "vision–language understanding" into an "embodied brain" equipped with physical common sense.

Spatial Semantic Enhancement

To meet the instruction-comprehension and spatial-understanding requirements of manipulation tasks, continued pre-training deeply integrates textual instructions with 2D/3D spatial details, substantially strengthening the model's spatial perception and geometric understanding.

Manipulation Generalization

By leveraging full dynamics prediction data, the model achieves strong generalization across diverse manipulation tasks. In real-robot validation, a single policy completes 12 categories of complex manipulation tasks with high quality.

🔥 News

  • Jan 29, 2026: 🚀 We released the training and inference code, along with the model weights for UnifoLM-VLA-0.

📑 Open-Source Plan

  • Training
  • Inference
  • Checkpoints

⚙️ Installation

This project is built on CUDA 12.4, and using the same version is strongly recommended to ensure compatibility.

conda create -n unifolm-vla python==3.10.18
conda activate unifolm-vla

git clone https://github.com/unitreerobotics/unifolm-vla.git

# If you already downloaded the repo:
cd unifolm-vla
pip install --no-deps "lerobot @ git+https://github.com/huggingface/lerobot.git@0878c68"
pip install -e .

# Install FlashAttention2
pip install "flash-attn==2.5.6" --no-build-isolation

🧰 Model Checkpoints

| Model | Description | Link |
| --- | --- | --- |
| UnifoLM-VLM-Base | Fine-tuned on general-purpose image–text VQA data and open-source robot datasets. | HuggingFace |
| UnifoLM-VLA-Base | Fine-tuned on the Unitree open-source datasets. | HuggingFace |
| UnifoLM-VLA-LIBERO | Fine-tuned on the LIBERO dataset. | HuggingFace |

🛢️ Dataset

In our experiments, we consider the following twelve open-source datasets:

| Dataset | Robot | Link |
| --- | --- | --- |
| G1_Stack_Block | Unitree G1 | Huggingface |
| G1_Bag_Insert | Unitree G1 | Huggingface |
| G1_Erase_Board | Unitree G1 | Huggingface |
| G1_Clean_Table | Unitree G1 | Huggingface |
| G1_Pack_PencilBox | Unitree G1 | Huggingface |
| G1_Pour_Medicine | Unitree G1 | Huggingface |
| G1_Pack_PingPong | Unitree G1 | Huggingface |
| G1_Prepare_Fruit | Unitree G1 | Huggingface |
| G1_Organize_Tools | Unitree G1 | Huggingface |
| G1_Fold_Towel | Unitree G1 | Huggingface |
| G1_Wipe_Table | Unitree G1 | Huggingface |
| G1_DualRobot_Clean_Table | Unitree G1 | Huggingface |

To train on your own dataset, ensure the data follows the Huggingface LeRobot V2.1 dataset format. Assume the source directory structure of the dataset is as follows:

source_dir/
    ├── dataset1_name
    ├── dataset2_name
    ├── dataset3_name
    └── ...

Then, run the following command to convert the dataset from LeRobot format to HDF5 format:

cd prepare_data
python convert_lerobot_to_hdf5.py \
    --data_path /path/to/your/source_dir/dataset1_name \
    --target_path /path/to/save/the/converted/data/directory
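The command above converts a single dataset; with several datasets under source_dir, it has to be repeated per subdirectory. As a minimal sketch (the helper name below is ours, not part of the repository; it assumes Python 3.9+ and that convert_lerobot_to_hdf5.py is invoked from prepare_data/ as shown):

```python
import sys
from pathlib import Path

def build_conversion_commands(source_dir: str, target_dir: str) -> list[list[str]]:
    """Build one convert_lerobot_to_hdf5.py invocation per dataset directory."""
    commands = []
    for dataset in sorted(Path(source_dir).iterdir()):
        if dataset.is_dir():  # skip stray files; each subdirectory is one dataset
            commands.append([
                sys.executable, "convert_lerobot_to_hdf5.py",
                "--data_path", str(dataset),
                "--target_path", target_dir,
            ])
    return commands

# Each returned command can then be executed with subprocess.run(cmd, check=True).
```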

Finally, run the following command to convert the HDF5 format into the RLDS dataset format required for training. Be sure to update the path (here) to the correct location of the HDF5 data.

cd prepare_data/hdf5_to_rlds/rlds_dataset
tfds build  --data_dir  /path/to/save/the/converted/data/directory

The directory structure of the converted RLDS dataset is as follows:

source_dir/
    ├── downloads
    └── rlds_dataset
            └── 1.0.0

The 1.0.0 directory is the final RLDS dataset version that can be used for training. The final directory should be kept as source_dir/1.0.0 (e.g., g1_stack_block/1.0.0).
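A misplaced version directory is an easy mistake at this step. A small sanity check, assuming the standard TFDS output layout (where `tfds build` writes a dataset_info.json next to the tfrecord shards inside 1.0.0/), can catch it before training; the function name is illustrative, not from the repository:

```python
from pathlib import Path

def check_rlds_layout(dataset_root: str, version: str = "1.0.0") -> Path:
    """Verify dataset_root contains <version>/ with TFDS metadata and
    return the version directory to point the training config at."""
    version_dir = Path(dataset_root) / version
    if not version_dir.is_dir():
        raise FileNotFoundError(f"missing {version_dir}; did `tfds build` finish?")
    # TFDS normally writes dataset_info.json alongside the tfrecord shards.
    if not (version_dir / "dataset_info.json").is_file():
        raise FileNotFoundError(f"{version_dir} lacks dataset_info.json")
    return version_dir
```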

🚴‍♂️ Training

To train on a single dataset or multiple datasets, follow the steps below:

  • Step 1: Assuming you have already prepared the RLDS dataset, register it (e.g., the Unitree open-source dataset G1_StackBox) with our dataloader by adding an entry for it in configs.py (here), transforms.py (here), mixtures.py (here), and datasets.py (here). For reference, each of these files contains sample entries for the G1 datasets used in our experiments.
  • Step 2: Before starting fine-tuning, configure the size of the action chunks predicted by the model, the action and state degrees of freedom in the dataset, and the data normalization scheme in constants.py (here). Refer to NUM_ACTIONS_CHUNK, ACTION_DIM, PROPRIO_DIM, and ACTION_PROPRIO_NORMALIZATION_TYPE in G1_CONSTANTS.
  • Step 3: Complete the configuration in the following order (see here):
  1. Model Initialization: Set base_vlm to the local path or the corresponding model weight URL of UnifoLM-VLM-Base, which will be used to initialize the vision–language backbone model.
  2. Dataset Path Configuration: After configuring the model path, set oxe_data_root to the root directory of the dataset to ensure that the training script can correctly load the RLDS data.
  3. Dataset Mixture Specification: Based on the configured data root, set data_mix to the name of the dataset(s) to be used for training or to the desired dataset mixture.
  4. Model Checkpoint Saving: Specify the paths for saving model checkpoints and logs, which will store the model weights and training states generated during fine-tuning for later recovery, evaluation, and inference.
  5. Parallelism Configuration: Finally, adjust num_processes according to the number of available GPUs to match the scale of distributed training.
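The five steps above can be summarized as a small configuration sketch. The field names base_vlm, oxe_data_root, data_mix, and num_processes come directly from the steps; run_root_dir is a hypothetical name we use here for the checkpoint/log path, so check the actual config file for the real field name:

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    base_vlm: str          # Step 1: local path or URL of UnifoLM-VLM-Base weights
    oxe_data_root: str     # Step 2: root directory of the RLDS data
    data_mix: str          # Step 3: dataset name or dataset-mixture name
    run_root_dir: str      # Step 4: where checkpoints and logs are written
    num_processes: int     # Step 5: one process per available GPU

# Illustrative values only; paths here are placeholders.
config = FinetuneConfig(
    base_vlm="/path/to/UnifoLM-VLM-Base",
    oxe_data_root="/path/to/rlds_root",
    data_mix="g1_stack_block",
    run_root_dir="/path/to/checkpoints",
    num_processes=8,
)
```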

🌏 Simulation Inference Evaluation

To evaluate the UnifoLM-VLA-Libero model in the LIBERO simulation environment (here), follow the steps below:

  • Step 1: Install the LIBERO simulation environment and its dependencies:
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/LIBERO/libero_requirements.txt  # Run from the UnifoLM-VLA project root directory
  • Step 2: In run_eval_libero.sh (here), modify the following fields: your_ckpt, task_suite_name, unnorm_key, LIBERO_HOME, and vlm_pretrained_path.
  • Step 3: Launch the evaluation:
conda activate unifolm-vla
cd unifolm-vla
bash scripts/eval_scripts/run_eval_libero.sh

🤖 Real-World Inference Evaluation

In our system, inference is executed on the server side. The robot client collects observations from the real robot and sends them to the server for action inference. The full pipeline can be completed by following the steps below.

Server Setup

  • Step 1: In run_real_eval_server.sh (here), modify the following fields: ckpt_path, port, unnorm_key, and vlm_pretrained_path.
  • Step 2: Launch the server:
conda activate unifolm-vla
cd unifolm-vla
bash scripts/eval_scripts/run_real_eval_server.sh

Client Setup

  • Step 1: Refer to unitree_deploy/README.md to create the unitree_deploy conda environment, install the required dependencies, and start the controller or service on the real robot.

  • Step 2: Open a new terminal and establish a tunnel connection from the client to the server:

ssh user_name@remote_server_IP -CNg -L port:127.0.0.1:port
  • Step 3: Using unitree_deploy/robot_client.py as a reference, adapt the client script to your setup and run it.
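For orientation, here is a minimal sketch of the client-side message handling: one observation is serialized for the server and an action chunk is parsed from the reply. The JSON schema (image/state/instruction keys, an "actions" key in the response) is our illustration, not the repository's actual wire format; see robot_client.py for the real protocol.

```python
import base64
import json

def encode_observation(rgb_jpeg: bytes, joint_positions: list[float],
                       instruction: str) -> str:
    """Serialize one observation into a JSON message for the inference server."""
    payload = {
        "image": base64.b64encode(rgb_jpeg).decode("ascii"),  # camera frame (JPEG bytes)
        "state": joint_positions,                             # proprioceptive state
        "instruction": instruction,                           # language command
    }
    return json.dumps(payload)

def decode_actions(response_text: str) -> list[list[float]]:
    """Parse the server reply into an action chunk (a list of action vectors)."""
    return json.loads(response_text)["actions"]
```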

📝 Codebase Architecture

Here's a high-level overview of the project's code structure and core components:

unifolm-vla/
    ├── assets                      # Media assets such as GIFs
    ├── experiments                 # LIBERO assets for running inference
    ├── deployment                  # Deployment server code
    ├── prepare_data                # Scripts for dataset preprocessing and format conversion
    ├── scripts                     # Main scripts for training, evaluation, and deployment
    └── src
         └── unifolm_vla            # Core Python package for UnifoLM-VLA
                ├── config          # Configuration files for training
                ├── model           # Model architectures and backbone definitions
                ├── rlds_dataloader # Dataset loading, transformations, and dataloaders
                └── training        # Model training code

🙏 Acknowledgement

Much of the code is adapted from Qwen2.5-VL, Isaac-GR00T, Open-X, openvla-oft, and InternVLA-M1.

📝 Citation

@misc{unifolm-vla-0,
  author       = {Unitree},
  title        = {UnifoLM-VLA-0: A Vision-Language-Action (VLA) Framework under UnifoLM Family},
  year         = {2026},
}
