Ali Athar, Xueqing Deng, Liang-Chieh Chen
This is the official baseline implementation for the ViCaS dataset (CVPR'25). The main project GitHub repo is here.
The trained model is uploaded to HuggingFace.
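
The checkpoint can be fetched from HuggingFace either through the web UI or from the command line. The snippet below is only an illustrative example using huggingface-cli (shipped with the huggingface_hub package); the repo ID is a placeholder, so substitute the actual model repo linked from the project page:

```bash
# Requires huggingface_hub (pip install -U huggingface_hub).
# <hf-model-repo-id> is a placeholder: use the actual HuggingFace repo ID of the trained model.
huggingface-cli download <hf-model-repo-id> --local-dir /path/to/model/directory
```

The resulting local directory can then be passed as the model directory to the inference and training commands below.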
Create a new conda environment with Python 3.9.2 and activate it:
conda create -n videollavaseg python==3.9.2
conda activate videollavaseg

Install the correct version of PyTorch. We used CUDA 12.1 in our setup:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121

Install Flash Attention:

pip install flash-attn==2.6.3 --no-build-isolation

Create a datasets directory and place the ViCaS dataset in it. For instructions on downloading the dataset, refer to the main dataset GitHub repo. The repo folder should then look like this:
$REPO_DIR
├── llava
│   └── ...
└── datasets
    └── ViCaS
        ├── splits
        │   ├── v0.1
        │   └── v1.0
        ├── annotations
        │   ├── v0.1
        │   └── v1.0
        ├── videos
        └── video_frames
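
Before running inference or training, a quick sanity check along these lines (not part of the repo, just an illustrative snippet) can confirm that the CUDA build of PyTorch and flash-attn import correctly and that the dataset folders are in place:

```bash
# Illustrative sanity check, not part of the repo.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
ls datasets/ViCaS/splits datasets/ViCaS/annotations datasets/ViCaS/video_frames
```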
To run inference, use the following command:

python llava/inference/main.py -i /path/to/model/directory -o /path/to/output --dataset_split {val,test}

If you're on a multi-GPU setup, you can parallelize inference by running the wrapper script scripts/infer.sh with the same arguments:

bash scripts/infer.sh -i /path/to/model/directory -o /path/to/output --dataset_split {val,test}

To evaluate the results, refer to the instructions in the main repo.
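
For reference, a concrete single-GPU run might look like the following; the model and output paths are placeholders for your own setup:

```bash
# Example invocation on the validation split (paths are placeholders).
python llava/inference/main.py \
    -i ./checkpoints/video_llava_seg \
    -o ./outputs/vicas_val \
    --dataset_split val
```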
We provide a pretrained model on HuggingFace that has been optimized for video captioning on a subset of WebVid10M and Panda70M. To finetune this model on ViCaS for captioning and LG-VIS, run the following training script:

bash scripts/train/llama3/stage3_with_seg.sh --output_dir /path/to/output/directory --restore_weights /path/to/pretrained/model

Before executing it, please fill in some of the variables in this script according to your setup (number of GPUs, number of nodes, rdzv_endpoint, etc.). Our hardware setup consists of 4 nodes, each with 8 NVIDIA A100 (80G) GPUs, i.e. 32 GPUs in total.
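
For orientation, the variables mentioned above correspond to the knobs of a standard torchrun multi-node launch. The sketch below is illustrative only, with the training entry point left as a placeholder (the actual script may structure its launch differently); the values reflect the 4-node, 8-GPU-per-node configuration described above:

```bash
# Illustrative mapping of the script variables to a torchrun launch; adjust to your cluster.
NNODES=4                          # number of nodes
GPUS_PER_NODE=8                   # GPUs per node (A100 80G in our setup)
NODE_RANK=0                       # rank of this node, 0..NNODES-1
RDZV_ENDPOINT=master-host:29500   # hostname:port of the rendezvous node

torchrun \
    --nnodes $NNODES \
    --nproc_per_node $GPUS_PER_NODE \
    --node_rank $NODE_RANK \
    --rdzv_backend c10d \
    --rdzv_endpoint $RDZV_ENDPOINT \
    <training_entry_point_and_args>   # placeholder: see stage3_with_seg.sh for the actual command
```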
- This model cannot be used for commercial purposes. It has been created for research purposes only.
- This is not an official ByteDance product.
@article{athar2024vicas,
author = {Ali Athar and Xueqing Deng and Liang-Chieh Chen},
title = {ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation},
journal = {CVPR},
year = {2025}
}