Official codebase for the AAAI 2025 paper:
"HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning"
[Paper]
This repository provides the implementation of HiCM², a memory-efficient, hierarchy-aware video-language modeling framework designed to enhance dense video captioning through compact temporal memory. We introduce a hierarchical memory construction scheme and retrieval-augmented reasoning based on temporal clustering and CLIP alignment.
```
📁 Root
├── args.py
├── dataset/                  # Dataset loaders (YouCook2, VideoCaption, etc.)
├── dvc_eval/                 # Captioning evaluation (SODA, METEOR, CIDEr, etc.)
├── model/                    # HiCM² model, backbone, and T5-related modules
├── util/                     # Utility functions (metrics, dist, t5, etc.)
├── presave/                  # Pretrained checkpoints (to be downloaded)
│   ├── vid2seq_htmchapters.pth
│   ├── vid2seq_htm.pth
│   ├── vitt/best_model.pth
│   └── yc2/best_model.pth
├── data/                     # YouCook2, ViTT data (to be downloaded)
│   ├── vitt/*
│   └── yc2/*
├── finch-llama_hier.py       # Hierarchical memory constructor (our main contribution)
├── finch-llama_hier.sh       # Shell script to run the memory constructor
├── train_ret_yc2_hier.sh     # Training script for YC2 with HiCM²
├── train_ret_vitt_hier.sh    # Training script for ViTT with HiCM²
├── eval_ret_yc2_hier.sh      # Evaluation script for YC2 with HiCM²
├── eval_ret_vitt_hier.sh     # Evaluation script for ViTT with HiCM²
├── hierarchical_clustering_results_yc2_70B.pkl   # Hierarchical memory for YC2
├── hierarchical_clustering_results_vitt_70B.pkl  # Hierarchical memory for ViTT
├── requirements.txt
└── README.md
```
We recommend using a Conda environment:

```shell
conda create --name HiCM2 python=3.7
conda activate HiCM2
pip install -r requirements.txt
```

| Dataset | Download Link | Save Path |
|---|---|---|
| YC2 | Hugging Face | data/yc2/* |
| VITT | Hugging Face | data/vitt/* |
📌 Please download the corresponding dataset files and place them in the above directories to match the training/evaluation scripts.
Due to large file sizes, pretrained weights are provided via external download:
| Model Type | Dataset | Download Link | Save Path |
|---|---|---|---|
| Ours | YC2 | Hugging Face | presave/yc2/best_model.pth |
| Ours | VITT | Hugging Face | presave/vitt/best_model.pth |
| Vid2Seq Baseline | HTM-Chapters | Hugging Face | presave/vid2seq_htmchapters.pth |
Make sure to place downloaded .pth files in the correct subdirectories as shown above.
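As a convenience, the expected layout can be verified before launching any script. The helper below is hypothetical (it is not part of this repo); it simply checks that every checkpoint listed in the table above exists under the repository root.

```python
# Hypothetical helper (not part of this repo): verify that downloaded
# checkpoints sit in the paths the training/evaluation scripts expect.
from pathlib import Path

EXPECTED_CHECKPOINTS = [
    "presave/yc2/best_model.pth",
    "presave/vitt/best_model.pth",
    "presave/vid2seq_htmchapters.pth",
]

def missing_checkpoints(root="."):
    """Return the expected .pth files that are not present under `root`."""
    return [p for p in EXPECTED_CHECKPOINTS if not (Path(root) / p).exists()]
```

Run it from the repository root; an empty list means all checkpoints are in place.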
We provide a script to construct Hierarchical Compact Memory with clustering and representation selection:
```shell
bash finch-llama_hier.sh
```

This will generate clustering outputs such as:
- `hierarchical_clustering_results_yc2_8B.pkl`
- `hierarchical_clustering_results_vitt_70B.pkl`
You can modify the backbone, dataset, or levels within finch-llama_hier.py.
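The generated `.pkl` files are ordinary pickle archives, so they can be inspected directly. The loader below is a minimal sketch (the exact structure of the stored object is produced by `finch-llama_hier.py` and depends on your dataset/backbone settings):

```python
import pickle

def load_memory(path):
    """Load a hierarchical clustering result pickled by the
    memory-construction step (structure depends on your settings)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```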
Note: This code is based on Hugging Face's LLaMA 3 model. You must first obtain access to LLaMA 3 from Hugging Face, and then insert your Hugging Face token into finch-llama_hier.py.
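For intuition only: the actual construction in `finch-llama_hier.py` clusters LLaMA-3-derived caption embeddings with FINCH, but the core idea (multiple levels, each more compact than the last, with one representative per cluster) can be mimicked with plain NumPy agglomeration. Everything below is an illustrative sketch, not the repo's implementation.

```python
import numpy as np

def merge_closest(centroids):
    """Merge the two closest centroids (Euclidean) into their mean."""
    n = len(centroids)
    d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    d[np.arange(n), np.arange(n)] = np.inf  # ignore self-distances
    i, j = np.unravel_index(np.argmin(d), d.shape)
    merged = (centroids[i] + centroids[j]) / 2
    keep = [k for k in range(n) if k not in (i, j)]
    return np.vstack([centroids[keep], merged])

def build_hierarchy(embeddings, levels=3):
    """Return a fine-to-coarse list of memory levels; each level
    keeps roughly half as many entries as the one before it."""
    level = np.asarray(embeddings, dtype=float)
    hierarchy = [level]
    for _ in range(levels - 1):
        target = max(1, len(level) // 2)
        while len(level) > target:
            level = merge_closest(level)
        hierarchy.append(level)
    return hierarchy
```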
Example (YouCook2 + Hierarchical Memory):
```shell
bash train_ret_yc2_hier.sh
```

Evaluate on YouCook2:

```shell
bash eval_ret_yc2_hier.sh
```

We support standard dense video captioning metrics:
- SODA
- METEOR
- CIDEr
- ROUGE-L
- BLEU-4
Evaluation is handled via the scripts in dvc_eval/.
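The full SODA/METEOR/CIDEr/ROUGE-L/BLEU-4 implementations live in `dvc_eval/`; purely as a self-contained illustration of the simplest ingredient of these metrics, the clipped n-gram precision inside BLEU can be sketched as:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

Clipping is what keeps a degenerate caption like "the the the" from scoring well against "the cat": it earns credit for "the" only once.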
If you find our work useful, please consider citing:
```bibtex
@inproceedings{kim2025hicm2,
  title={HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning},
  author={Kim, Minkuk and Kim, Hyeon Bae and Moon, Jinyoung and Choi, Jinwoo and Kim, Seong Tae},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={4},
  pages={4293--4301},
  year={2025}
}
```

This project is licensed under the MIT License.
Our framework builds upon prior work from Vid2Seq, CLIP, and others. We thank the open-source community! This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155911, Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University)).