Ali Athar, Xueqing Deng, Liang-Chieh Chen
This is the official baseline implementation for the ViCaS dataset (CVPR'25). The main project GitHub repo is here.
The trained model is uploaded to HuggingFace.
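
The checkpoint can be fetched from HuggingFace either through the web UI or from the command line. The snippet below is only an illustrative example using huggingface-cli (shipped with the huggingface_hub package); the repo ID is a placeholder, so substitute the actual model repo linked from the project page:

```bash
# Requires huggingface_hub (pip install -U huggingface_hub).
# <hf-model-repo-id> is a placeholder: use the actual HuggingFace repo ID of the trained model.
huggingface-cli download <hf-model-repo-id> --local-dir /path/to/model/directory
```

The resulting local directory can then be passed as the model directory to the inference and training commands below.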
Create a new conda environment with Python 3.9.2 and activate it:
conda create -n videollavaseg python==3.9.2
conda activate videollavaseg

Install the correct version of PyTorch. We used CUDA 12.1 in our setup:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121

Install Flash Attention:

pip install flash-attn==2.6.3 --no-build-isolation

Create a datasets directory and place the ViCaS dataset in it. For instructions on downloading the dataset, refer to the main dataset GitHub repo. The repo folder should then look like this:
$REPO_DIR
├── llava
│   └── ...
└── datasets
    └── ViCaS
        ├── splits
        │   ├── v0.1
        │   └── v1.0
        ├── annotations
        │   ├── v0.1
        │   └── v1.0
        ├── videos
        └── video_frames
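
Before running inference or training, a quick sanity check along these lines (not part of the repo, just an illustrative snippet) can confirm that the CUDA build of PyTorch and flash-attn import correctly and that the dataset folders are in place:

```bash
# Illustrative sanity check, not part of the repo.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
ls datasets/ViCaS/splits datasets/ViCaS/annotations datasets/ViCaS/video_frames
```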
To run inference, use the following command:

python llava/inference/main.py -i /path/to/model/directory -o /path/to/output --dataset_split {val,test}

If you're on a multi-GPU setup, you can parallelize inference by running the wrapper script scripts/infer.sh with the same arguments:

bash scripts/infer.sh -i /path/to/model/directory -o /path/to/output --dataset_split {val,test}

To evaluate the results, refer to the instructions in the main repo.
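
For reference, a concrete single-GPU run might look like the following; the model and output paths are placeholders for your own setup:

```bash
# Example invocation on the validation split (paths are placeholders).
python llava/inference/main.py \
    -i ./checkpoints/video_llava_seg \
    -o ./outputs/vicas_val \
    --dataset_split val
```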
We provide a pretrained model on HuggingFace that has been optimized for video captioning on a subset of WebVid10M and Panda70M. To finetune this model on ViCaS for captioning and LG-VIS, run the following training script:

bash scripts/train/llama3/stage3_with_seg.sh --output_dir /path/to/output/directory --restore_weights /path/to/pretrained/model

Before executing it, please fill in some of the variables in this script according to your setup (number of GPUs, number of nodes, rdzv_endpoint, etc.). Our hardware setup consists of 4 nodes, each with 8 NVIDIA A100 (80G) GPUs, i.e. 32 GPUs in total.
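
For orientation, the variables mentioned above correspond to the knobs of a standard torchrun multi-node launch. The sketch below is illustrative only, with the training entry point left as a placeholder (the actual script may structure its launch differently); the values reflect the 4-node, 8-GPU-per-node configuration described above:

```bash
# Illustrative mapping of the script variables to a torchrun launch; adjust to your cluster.
NNODES=4                          # number of nodes
GPUS_PER_NODE=8                   # GPUs per node (A100 80G in our setup)
NODE_RANK=0                       # rank of this node, 0..NNODES-1
RDZV_ENDPOINT=master-host:29500   # hostname:port of the rendezvous node

torchrun \
    --nnodes $NNODES \
    --nproc_per_node $GPUS_PER_NODE \
    --node_rank $NODE_RANK \
    --rdzv_backend c10d \
    --rdzv_endpoint $RDZV_ENDPOINT \
    <training_entry_point_and_args>   # placeholder: see stage3_with_seg.sh for the actual command
```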
- This model cannot be used for commercial purposes. It has been created for research purposes only.
- This is not an official ByteDance product.
@article{athar2024vicas,
author = {Ali Athar and Xueqing Deng and Liang-Chieh Chen},
title = {ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation},
journal = {CVPR},
year = {2025}
}