
Video-LLaVA-Seg

Ali Athar, Xueqing Deng, Liang-Chieh Chen

Website | Dataset Paper | Full Paper

This is the official baseline implementation for the ViCaS dataset (CVPR'25). The main project GitHub repo is here.

The trained model is available on HuggingFace.

Environment

Create a new conda environment with Python 3.9.2 and activate it:

conda create -n videollavaseg python==3.9.2
conda activate videollavaseg

Install the correct version of PyTorch. We used CUDA 12.1 in our setup:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
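
As an optional sanity check (not part of the original instructions), you can confirm that the CUDA-enabled build was installed and that a GPU is visible:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"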

Install Flash Attention:

pip install flash-attn==2.6.3 --no-build-isolation
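
To verify that Flash Attention built correctly against the installed PyTorch, a simple import check such as the following should succeed:

python -c "import flash_attn; print(flash_attn.__version__)"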

Dataset Structure

Create a datasets directory and place the ViCaS dataset in it. For instructions on downloading the dataset, refer to the main dataset GitHub repo. The repo folder should then look like this:

$REPO_DIR
├── llava
│   ├── ...
├── datasets
│   ├── ViCaS
│   │   ├── splits
│   │   │   ├── v0.1
│   │   │   └── v1.0
│   │   ├── annotations
│   │   │   ├── v0.1
│   │   │   └── v1.0
│   │   ├── videos
│   │   └── video_frames

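A quick way to confirm the layout matches what the code expects (directory names taken from the tree above; adjust if you only downloaded one annotation version):

ls datasets/ViCaS/splits
ls datasets/ViCaS/annotations
ls datasets/ViCaS/video_frames | head
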
Inference

Run the following command:

python llava/inference/main.py -i /path/to/model/directory -o /path/to/output --dataset_split {val,test}
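
For example, assuming the HuggingFace checkpoint has been downloaded to a local folder (the paths below are placeholders; substitute your own):

python llava/inference/main.py -i ./checkpoints/video_llava_seg -o ./outputs/vicas_val --dataset_split val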

If you are on a multi-GPU setup, you can parallelize inference by running the provided wrapper script with the same arguments:

bash scripts/infer.sh -i /path/to/model/directory -o /path/to/output --dataset_split {val,test}

To evaluate the results, refer to the instructions in the main repo.

Training

We provide a pretrained model on HuggingFace that has been optimized for video captioning on a subset of WebVid10M and Panda70M. To finetune this model on ViCaS for captioning and LG-VIS, run the following training script:

bash scripts/train/llama3/stage3_with_seg.sh --output_dir /path/to/output/directory --restore_weights /path/to/pretrained/model

Before executing it, please fill in the variables in this script according to your setup (number of GPUs, number of nodes, rdzv_endpoint, etc.). Our hardware setup consists of 4 nodes, each with 8 Nvidia A100 (80G) GPUs, i.e. 32 GPUs in total.
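
For a single-node run, the distributed-launch settings to edit will look roughly like the following (the variable names here are illustrative; use the ones actually defined in scripts/train/llama3/stage3_with_seg.sh):

NNODES=1
GPUS_PER_NODE=8
RDZV_ENDPOINT=localhost:29500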

⚠️ Terms of use

  • This model cannot be used for commercial purposes. It has been created for research purposes only.
  • This is not an official ByteDance product.

BibTeX

@inproceedings{athar2024vicas,
  author    = {Ali Athar and Xueqing Deng and Liang-Chieh Chen},
  title     = {ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation},
  booktitle = {CVPR},
  year      = {2025}
}
