Official codebase for the AAAI 2025 paper:
"HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning"
[Paper]
This repository provides the implementation of HiCM², a memory-efficient, hierarchy-aware video-language modeling framework designed to enhance dense video captioning through compact temporal memory. We introduce a hierarchical memory construction scheme and retrieval-augmented reasoning based on temporal clustering and CLIP alignment.
```
📁 Root
├── args.py
├── dataset/                  # Dataset loaders (YouCook2, VideoCaption, etc.)
├── dvc_eval/                 # Captioning evaluation (SODA, METEOR, CIDEr, etc.)
├── model/                    # HiCM² model, backbone, and T5-related modules
├── util/                     # Utility functions (metrics, dist, t5, etc.)
├── presave/                  # Pretrained checkpoints (to be downloaded)
│   ├── vid2seq_htmchapters.pth
│   ├── vid2seq_htm.pth
│   ├── vitt/best_model.pth
│   └── yc2/best_model.pth
├── data/                     # YouCook2, ViTT data (to be downloaded)
│   ├── vitt/*
│   └── yc2/*
├── finch-llama_hier.py       # Hierarchical memory constructor (our main contribution)
├── finch-llama_hier.sh       # Shell script to run the memory constructor
├── train_ret_yc2_hier.sh     # Training script for YC2 with HiCM²
├── train_ret_vitt_hier.sh    # Training script for ViTT with HiCM²
├── eval_ret_yc2_hier.sh      # Evaluation script for YC2 with HiCM²
├── eval_ret_vitt_hier.sh     # Evaluation script for ViTT with HiCM²
├── hierarchical_clustering_results_yc2_70B.pkl   # Hierarchical memory for YC2
├── hierarchical_clustering_results_vitt_70B.pkl  # Hierarchical memory for ViTT
├── requirements.txt
└── README.md
```
We recommend using a Conda environment:

```shell
conda create --name HiCM2 python=3.7
conda activate HiCM2
pip install -r requirements.txt
```

| Dataset | Download Link | Save Path |
|---|---|---|
| YC2 | Hugging Face | data/yc2/* |
| VITT | Hugging Face | data/vitt/* |
📌 Please download the corresponding dataset files and place them in the above directories to match the training/evaluation scripts.
Due to large file sizes, pretrained weights are provided via external download:
| Model Type | Dataset | Download Link | Save Path |
|---|---|---|---|
| Ours | YC2 | Hugging Face | presave/yc2/best_model.pth |
| Ours | VITT | Hugging Face | presave/vitt/best_model.pth |
| Vid2Seq Baseline | HTM-Chapters | Hugging Face | presave/vid2seq_htmchapters.pth |
Make sure to place downloaded .pth files in the correct subdirectories as shown above.
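As a convenience, the expected layout can be verified before launching any script. The helper below is hypothetical (it is not part of this repo); it simply checks that every checkpoint listed in the table above exists under the repository root.

```python
# Hypothetical helper (not part of this repo): verify that downloaded
# checkpoints sit in the paths the training/evaluation scripts expect.
from pathlib import Path

EXPECTED_CHECKPOINTS = [
    "presave/yc2/best_model.pth",
    "presave/vitt/best_model.pth",
    "presave/vid2seq_htmchapters.pth",
]

def missing_checkpoints(root="."):
    """Return the expected .pth files that are not present under `root`."""
    return [p for p in EXPECTED_CHECKPOINTS if not (Path(root) / p).exists()]
```

Run it from the repository root; an empty list means all checkpoints are in place.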
We provide a script to construct Hierarchical Compact Memory with clustering and representation selection:
```shell
bash finch-llama_hier.sh
```

This will generate clustering outputs such as:
- `hierarchical_clustering_results_yc2_8B.pkl`
- `hierarchical_clustering_results_vitt_70B.pkl`
You can modify the backbone, dataset, or levels within finch-llama_hier.py.
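The generated `.pkl` files are ordinary pickle archives, so they can be inspected directly. The loader below is a minimal sketch (the exact structure of the stored object is produced by `finch-llama_hier.py` and depends on your dataset/backbone settings):

```python
import pickle

def load_memory(path):
    """Load a hierarchical clustering result pickled by the
    memory-construction step (structure depends on your settings)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```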
Note: This code is based on Hugging Face's LLaMA 3 model. You must first obtain access to LLaMA 3 from Hugging Face, and then insert your Hugging Face token into finch-llama_hier.py.
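For intuition only: the actual construction in `finch-llama_hier.py` clusters LLaMA-3-derived caption embeddings with FINCH, but the core idea (multiple levels, each more compact than the last, with one representative per cluster) can be mimicked with plain NumPy agglomeration. Everything below is an illustrative sketch, not the repo's implementation.

```python
import numpy as np

def merge_closest(centroids):
    """Merge the two closest centroids (Euclidean) into their mean."""
    n = len(centroids)
    d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    d[np.arange(n), np.arange(n)] = np.inf  # ignore self-distances
    i, j = np.unravel_index(np.argmin(d), d.shape)
    merged = (centroids[i] + centroids[j]) / 2
    keep = [k for k in range(n) if k not in (i, j)]
    return np.vstack([centroids[keep], merged])

def build_hierarchy(embeddings, levels=3):
    """Return a fine-to-coarse list of memory levels; each level
    keeps roughly half as many entries as the one before it."""
    level = np.asarray(embeddings, dtype=float)
    hierarchy = [level]
    for _ in range(levels - 1):
        target = max(1, len(level) // 2)
        while len(level) > target:
            level = merge_closest(level)
        hierarchy.append(level)
    return hierarchy
```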
Example (YouCook2 + Hierarchical Memory):
```shell
bash train_ret_yc2_hier.sh
```

Evaluate on YouCook2:

```shell
bash eval_ret_yc2_hier.sh
```

We support standard dense video captioning metrics:
- SODA
- METEOR
- CIDEr
- ROUGE-L
- BLEU-4
Evaluation is handled via the scripts in dvc_eval/.
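The full SODA/METEOR/CIDEr/ROUGE-L/BLEU-4 implementations live in `dvc_eval/`; purely as a self-contained illustration of the simplest ingredient of these metrics, the clipped n-gram precision inside BLEU can be sketched as:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

Clipping is what keeps a degenerate caption like "the the the" from scoring well against "the cat": it earns credit for "the" only once.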
If you find our work useful, please consider citing:
```bibtex
@inproceedings{kim2025hicm2,
  title={HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning},
  author={Kim, Minkuk and Kim, Hyeon Bae and Moon, Jinyoung and Choi, Jinwoo and Kim, Seong Tae},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={4},
  pages={4293--4301},
  year={2025}
}
```

This project is licensed under the MIT License.
Our framework builds upon prior work from Vid2Seq, CLIP, and others. We thank the open-source community! This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155911, Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University)).