- 2026-05-23: Released inference.sh for real-time policy deployment.
- 2026-04-27: ViTacFormer accepted to RSS 2026!
Our hardware setup consists of an active vision system, a bi-manual robot arm, and two high-DoF dexterous hands (SharpaWave) equipped with high-resolution tactile sensors.
To enable rich data collection, we use a teleoperation system with a precision exoskeleton for fine-grained arm and finger control, plus a VR interface for controlling the camera and guiding demonstrations.
conda create -n vitacformer python=3.8.10
conda activate vitacformer
pip install torchvision
pip install torch
pip install opencv-python
pip install matplotlib
pip install tqdm
pip install einops
pip install h5py
pip install ipython
pip install transforms3d
pip install zarr
pip install transformers
pip install pyzmq
pip install pytorch-kinematics
(cd dataset/ha_data && pip install -e .)
(cd detr && pip install -e .)
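After installing, a quick import check can confirm that the environment is complete. The script below is a minimal sketch, not part of the repository: the helper name find_missing is hypothetical, and the list uses the import names that correspond to the pip packages above (for example, opencv-python imports as cv2 and pyzmq as zmq). The two locally installed packages are left out because their import names are not stated here.

```python
import importlib.util

# Import names for the pip packages listed above.
# Note that several differ from the package name on PyPI.
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "cv2",                 # opencv-python
    "matplotlib",
    "tqdm",
    "einops",
    "h5py",
    "IPython",             # ipython
    "transforms3d",
    "zarr",
    "transformers",
    "zmq",                 # pyzmq
    "pytorch_kinematics",  # pytorch-kinematics
]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated vitacformer environment; any name it reports can be installed with the matching pip command above.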
Please download and unzip the example data here. To train ViTacFormer, run:
conda activate vitacformer
bash train.sh
inference.py runs the trained policy as a ZMQ client that connects to a robot or simulator backend on tcp://127.0.0.1:7778. Edit the checkpoint paths in inference.sh, start your backend server, then:
bash inference.sh
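ZeroMQ connects asynchronously, so a client usually does not fail right away when nothing is listening on the target address. Before launching inference, it can help to confirm that the backend is actually up. The snippet below is a small sketch using only the Python standard library; the function name backend_reachable is hypothetical and not part of the repository. It assumes the backend binds a TCP port at the address given above.

```python
import socket


def backend_reachable(host="127.0.0.1", port=7778, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # Open and immediately close a plain TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if backend_reachable():
        print("Backend is listening on tcp://127.0.0.1:7778")
    else:
        print("No backend found on tcp://127.0.0.1:7778; start the server first.")
```

This only checks that the port is open. It does not verify that the server speaks the message format inference.py expects.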
The backend must publish per-frame observations (images, joint angles, 20 tactile sensor channels) and consume per-frame absolute joint-angle action targets. Action chunking and temporal ensemble are handled inside inference.py.
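To illustrate what action chunking with a temporal ensemble means, here is a minimal sketch of the scheme popularized by ACT: at every step the policy predicts a chunk of future actions, and the action actually sent is a weighted average of all predictions made for the current step, with exponentially decaying weights that favor older predictions. This is an assumption about the general technique, not a copy of inference.py; the class name TemporalEnsembler, the decay value, and the plain-list action format are all hypothetical.

```python
import math
from collections import deque


class TemporalEnsembler:
    """Average overlapping action-chunk predictions for the current step."""

    def __init__(self, chunk_size, decay=0.01):
        self.chunk_size = chunk_size
        self.decay = decay
        self.step = 0
        # Keep only chunks that still cover the current step.
        # Each entry is (step at which the chunk was predicted, chunk).
        self.chunks = deque(maxlen=chunk_size)

    def update(self, chunk):
        """Add a new chunk and return the ensembled action for this step.

        chunk: list of chunk_size actions, each a list of floats;
        chunk[0] is the prediction for the current step.
        """
        if len(chunk) != self.chunk_size:
            raise ValueError("chunk must contain exactly chunk_size actions")
        self.chunks.append((self.step, chunk))

        # Predictions for the current step, oldest chunk first.
        candidates = [c[self.step - start] for start, c in self.chunks]
        # Weight exp(-decay * i): index 0 (the oldest) gets the largest weight.
        weights = [math.exp(-self.decay * i) for i in range(len(candidates))]
        total = sum(weights)

        dim = len(candidates[0])
        action = [
            sum(w * a[j] for w, a in zip(weights, candidates)) / total
            for j in range(dim)
        ]
        self.step += 1
        return action


if __name__ == "__main__":
    ens = TemporalEnsembler(chunk_size=2, decay=0.0)
    print(ens.update([[0.0], [2.0]]))  # only one prediction exists for step 0
    print(ens.update([[4.0], [6.0]]))  # averages 2.0 (old chunk) and 4.0 (new chunk)
```

Because this happens on the policy side, the backend only ever sees one absolute joint-angle target per frame.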
If you find ViTacFormer useful, please consider citing it:
@misc{heng2026vitacformerlearningcrossmodalrepresentation,
title={ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation},
author={Liang Heng and Haoran Geng and Kaifeng Zhang and Pieter Abbeel and Jitendra Malik},
year={2026},
eprint={2506.15953},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2506.15953},
}
