ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

News

2026-05-23: Released inference.sh for real-time policy deployment.
2026-04-27: ViTacFormer accepted to RSS 2026!

Hardware Setup

Our hardware setup consists of an active vision system, a bi-manual robot arm, and two high-DoF dexterous hands (SharpaWave) equipped with high-resolution tactile sensors.

To enable rich data collection, we use a teleoperation system with a precision exoskeleton for fine-grained arm and finger control, plus a VR interface to control the camera and guide demonstrations.

Installation

conda create -n vitacformer python=3.8.10
conda activate vitacformer
pip install torchvision
pip install torch
pip install opencv-python
pip install matplotlib
pip install tqdm
pip install einops
pip install h5py
pip install ipython
pip install transforms3d
pip install zarr
pip install transformers
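pip install pyzmq
pip install pytorch-kinematics
# Run the two editable installs from the repository root; the subshells keep the working directory unchanged.
(cd dataset/ha_data && pip install -e .)
(cd detr && pip install -e .)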
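
After installation, a quick import check can confirm that the environment is complete before training. The script below is a minimal sketch and not part of the repository: the helper name find_missing and the list REQUIRED_MODULES are hypothetical, and the list uses the import names that correspond to the pip packages above (for example, opencv-python imports as cv2 and pyzmq as zmq). The two editable packages under dataset/ha_data and detr are left out because their import names are not stated here.

```python
import importlib.util

# Import names for the pip packages listed in the installation steps.
# Some differ from the pip name: opencv-python -> cv2, pyzmq -> zmq,
# ipython -> IPython, pytorch-kinematics -> pytorch_kinematics.
REQUIRED_MODULES = [
    "torch", "torchvision", "cv2", "matplotlib", "tqdm", "einops", "h5py",
    "IPython", "transforms3d", "zarr", "transformers", "zmq",
    "pytorch_kinematics",
]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated vitacformer environment; any name it reports can be reinstalled with the matching pip command above.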

Example Usages

Please download and unzip the example data here. To train ViTacFormer, run:

conda activate vitacformer
bash train.sh

Inference

inference.py runs the trained policy as a ZMQ client that connects to a robot or simulator backend on tcp://127.0.0.1:7778. Edit the checkpoint paths in inference.sh, start your backend server, then:

bash inference.sh

The backend must publish per-frame observations (images, joint angles, and 20 tactile sensor channels) and consume per-frame absolute joint-angle action targets. Action chunking and temporal ensembling are handled inside inference.py, so the backend only ever receives one action per frame.
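
To illustrate what action chunking with temporal ensembling means, here is a minimal sketch in plain Python. It assumes the exponential weighting scheme popularized by ACT, where each timestep's action is a weighted average of all chunk predictions that cover it and older predictions receive higher weight. Whether inference.py uses exactly this scheme is an assumption; the class name TemporalEnsembler and the parameters chunk_size and decay are hypothetical.

```python
import math


class TemporalEnsembler:
    """Blend overlapping action-chunk predictions into one action per timestep."""

    def __init__(self, chunk_size, decay=0.01):
        self.chunk_size = chunk_size
        self.decay = decay  # assumed weighting constant; 0 gives a plain average
        self.pending = {}   # timestep -> predictions for it, oldest first

    def add_chunk(self, t, chunk):
        """Register a chunk predicted at timestep t; chunk[i] targets t + i."""
        if len(chunk) != self.chunk_size:
            raise ValueError("chunk length must equal chunk_size")
        for i, action in enumerate(chunk):
            self.pending.setdefault(t + i, []).append(list(action))

    def action_for(self, t):
        """Return the weighted average of every prediction made for timestep t."""
        preds = self.pending.pop(t)
        # Index 0 is the oldest prediction and gets the largest weight.
        weights = [math.exp(-self.decay * i) for i in range(len(preds))]
        total = sum(weights)
        dim = len(preds[0])
        return [
            sum(w * p[j] for w, p in zip(weights, preds)) / total
            for j in range(dim)
        ]


if __name__ == "__main__":
    ens = TemporalEnsembler(chunk_size=2, decay=0.0)
    ens.add_chunk(0, [[0.0], [2.0]])  # predicted at t=0 for t=0 and t=1
    first = ens.action_for(0)         # only one prediction covers t=0
    ens.add_chunk(1, [[4.0], [6.0]])  # predicted at t=1 for t=1 and t=2
    blended = ens.action_for(1)       # averages the two predictions for t=1
    print(first, blended)
```

In a control loop, the policy would be queried every frame, its chunk passed to add_chunk, and the result of action_for sent to the backend as that frame's joint-angle target. Averaging over overlapping chunks smooths the commanded trajectory at the cost of reacting slightly more slowly to new observations.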

Citation

If you find ViTacFormer useful, please consider citing it:

@misc{heng2026vitacformerlearningcrossmodalrepresentation,
      title={ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation}, 
      author={Liang Heng and Haoran Geng and Kaifeng Zhang and Pieter Abbeel and Jitendra Malik},
      year={2026},
      eprint={2506.15953},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.15953}, 
}
