ReTAG: Retrieval-Enhanced, Topic-Augmented Graph-Based Global Sensemaking

Introduction

ReTAG is a graph-based system for global sensemaking that synthesizes information and relationships across large-scale document collections. It leverages topic-augmented summarization and retrieval techniques to generate precise, context-rich answers to complex queries.

Installation

CUDA 11.8
Python 3.8.10

To install the required libraries, run:

pip install -r requirements.txt

Data Preparation

Use the following files as the corpus:

data/corpus/corpus_news_article.json
data/corpus/corpus_podcast.json

Use the following files as the queries:

data/query/news_article_total_questions.json
data/query/podcast_total_questions.json
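
Before running the pipeline, it can help to confirm that the corpus and query files load. The schema of these files is not documented here, so the sketch below only reports the top-level JSON type and item count. `describe_json` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def describe_json(path):
    """Return (top-level type name, number of items) for a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Lists and dicts have a length; any other JSON value counts as one item.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    for name in (
        "data/corpus/corpus_news_article.json",
        "data/corpus/corpus_podcast.json",
        "data/query/news_article_total_questions.json",
        "data/query/podcast_total_questions.json",
    ):
        if Path(name).exists():
            print(name, *describe_json(name))
        else:
            print(name, "not found")
```

Run it from the repository root so that the relative paths resolve.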

Corpus Preprocessing (Chunking)

python data/process_corpus.py \
    --base_dir "YOUR_EXPERIMENT_DIRECTORY" \
    --input_path "data/corpus/corpus_podcast.json" \
    --output_path "retag/data/chunks/podcast_chunks.json"
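
The chunking strategy used by `data/process_corpus.py` is not described here. As a generic illustration of what a chunking step does, the sketch below splits text into fixed-size word windows with overlap. The function name `chunk_words` and the default sizes are assumptions for illustration only and may differ from the script's actual behavior.

```python
def chunk_words(text, chunk_size=600, overlap=100):
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

Overlap keeps entities that straddle a chunk boundary visible in at least one chunk, at the cost of some duplicated text.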

Baseline

1. Entity-Relation Graph Construction

(1) Entity/Relation Extraction

./baseline_entity_relation_graph.sh

(2) Graph Construction

# --input_path_gleaning is the gleaning output; --save_path is the output pickle file
python entity_relation_graph/make_graph.py \
    --base_dir "YOUR_EXPERIMENT_DIRECTORY" \
    --input_path_reflect "output/entity_relation_graph/baseline/podcast/reflect.json" \
    --input_path_gleaning "output/entity_relation_graph/baseline/podcast/gleaning.json" \
    --save_path "output/entity_relation_graph/baseline/podcast/graph.pkl"
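
The formats of `reflect.json`, `gleaning.json`, and the pickled graph are not shown here. As a rough picture of this step, the sketch below assumes extraction yields (source, relation, target) triples and stores them as an undirected adjacency map that is then pickled. All names are hypothetical, and `make_graph.py` may use a different structure or a graph library.

```python
import pickle


def build_graph(triples):
    """Build an undirected adjacency map; each edge keeps its relation labels."""
    graph = {}
    for source, relation, target in triples:
        graph.setdefault(source, {}).setdefault(target, []).append(relation)
        if source != target:  # avoid recording a self-loop twice
            graph.setdefault(target, {}).setdefault(source, []).append(relation)
    return graph


def save_graph(graph, path):
    """Serialize the graph to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(graph, f)


def load_graph(path):
    """Load a graph previously written by save_graph."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Plain dictionaries are used so that the pickle loads without any third-party dependency.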

2. Community-based Summarization

(1) Hierarchical Graph Clustering

python ./community_summarization/make_graph_hierarchy.py \
    --base_dir "YOUR_EXPERIMENT_DIRECTORY" \
    --graph_path "output/entity_relation_graph/baseline/podcast/graph.pkl" \
    --hierarchical_graph_save_path "output/community_summarization/baseline/podcast/hierarchy.json" \
    --children_dict_save_path "output/community_summarization/baseline/podcast/children_dict.json" \
    --final_dict_save_path "output/community_summarization/baseline/podcast/final_dict.json"
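
The contents of the hierarchy, children, and final dictionaries are not specified here. To illustrate the kind of parent-to-child mapping a hierarchical clustering produces, the sketch below assumes each node carries a list of community ids, one per level from coarsest to finest. That layout and the name `children_from_paths` are assumptions, not the repository's format.

```python
def children_from_paths(paths):
    """Map each (level, community) to the sorted communities nested directly under it."""
    children = {}
    for path in paths.values():
        # Consecutive entries in a path are a parent community and its child.
        for level in range(len(path) - 1):
            parent = (level, path[level])
            children.setdefault(parent, set()).add(path[level + 1])
    return {parent: sorted(kids) for parent, kids in children.items()}
```

For example, `children_from_paths({"A": [0, 0], "B": [0, 1], "C": [1, 2]})` maps level-0 community 0 to children `[0, 1]` and level-0 community 1 to `[2]`.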

(2) Community Summarization

./baseline_community_summarization.sh

3. Response generation

./baseline_response_generation.sh

ReTAG

1. Topic mining

./retag_topic_mining.sh  # extract topics
python topic_mining/make_topic_total.py \
    --base_dir "YOUR_EXPERIMENT_DIRECTORY" \
    --input_path_prefix "output/topic_mining/response/podcast/topic" \
    --max_level 5 \
    --save_path "output/topic_mining/total_topic/podcast.json"

./retag_dataset_desc.sh

2. Topic-augmented entity-relation graph construction

./retag_entity_relation_graph.sh
# --input_path_gleaning is the gleaning output; --save_path_prefix is the prefix for the output pickle file(s)
python entity_relation_graph/make_graph_retag.py \
    --base_dir "YOUR_EXPERIMENT_DIRECTORY" \
    --input_path_reflect "output/entity_relation_graph/retag/podcast/reflect.json" \
    --input_path_gleaning "output/entity_relation_graph/retag/podcast/gleaning.json" \
    --save_path_prefix "output/entity_relation_graph/retag/podcast/graph/graph"

3. Topic-augmented community summarization

python community_summarization/make_graph_hierarchy_retag.py \
    --base_dir "YOUR_EXPERIMENT_DIRECTORY" \
    --topic_data_path "output/topic_mining/total_topic/podcast.json" \
    --graph_path_prefix "output/entity_relation_graph/retag/podcast/graph/graph" \
    --hierarchical_graph_save_path_prefix "output/community_summarization/retag/podcast/hierarchy/hierarchy" \
    --children_dict_save_path_prefix "output/community_summarization/retag/podcast/children_dict/children" \
    --final_dict_save_path_prefix "output/community_summarization/retag/podcast/final_dict/final"
./retag_community_summarization.sh

4. Response generation

# Topic classification
python response_generation/topic_select_retag.py \
    --model_id "Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic" \
    --download_dir "/home/work/graph_construction/llama_3.3_70b_instruct_fp8" \
    --num_gpus 2 \
    --batch_size 256 \
    --base_dir "path/to/your/exp/directory" \
    --topic_data_path "output/topic_mining/total_topic/podcast.json" \
    --question_data_path "data/query/podcast_total_questions.json" \
    --dataset_desc_path "output/topic_mining/response/podcast/desc_0.json" \
    --save_prefix "output/response_generation/retag/topic_select/podcast"

# Query keyword expansion
python response_generation/keyword_expansion_retag.py \
    --model_id "Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic" \
    --download_dir "/home/work/graph_construction/llama_3.3_70b_instruct_fp8" \
    --num_gpus 2 \
    --batch_size 256 \
    --base_dir "path/to/your/exp/directory" \
    --question_data_path "data/query/podcast_total_questions.json" \
    --dataset_description_path "output/topic_mining/response/podcast/desc_0.json" \
    --save_path_prefix "output/response_generation/retag/podcast/keywords"

# Retrieve community reports (--max_level may differ from 5; check the depth of your graph hierarchy)
python response_generation/make_local_reports_retag.py \
    --base_dir "path/to/your/exp/directory" \
    --topic_data_path "output/topic_mining/total_topic/podcast.json" \
    --community_report_prefix "output/community_summarization/retag/podcast/community_summary/community_summary" \
    --hierarchy_prefix "output/community_summarization/retag/podcast/hierarchy/hierarchy" \
    --final_dict_path_prefix "output/community_summarization/retag/podcast/final_dict/final" \
    --local_reports_prefix "output/community_summarization/retag/podcast/local_reports/local_reports" \
    --max_level 5
python response_generation/make_local_reports_baseline.py \
    --base_dir "path/to/your/exp/directory" \
    --community_report_prefix "output/community_summarization/baseline/podcast/community_summary" \
    --hierarchy_path "output/community_summarization/baseline/podcast/hierarchy.json" \
    --final_dict_path "output/community_summarization/baseline/podcast/final_dict.json" \
    --local_reports_prefix "output/community_summarization/retag/podcast/local_reports/local_reports" \
    --max_level 5
python response_generation/retrieval_retag.py \
    --base_dir "path/to/your/exp/directory" \
    --topic_data_path "output/response_generation/retag/topic_select/podcast.json" \
    --keywords_data_path "output/response_generation/retag/podcast/keywords.json" \
    --local_reports_prefix "output/community_summarization/retag/podcast/local_reports/local_reports" \
    --hierarchy_prefix "output/community_summarization/retag/podcast/hierarchy/hierarchy" \
    --hierarchy_backup "output/community_summarization/baseline/podcast/hierarchy.json" \
    --save_prefix "output/community_summarization/retag/podcast/retrieved_community_summary/community_summary" \
    --max_level 5
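
The scoring used by `retrieval_retag.py` is not documented here. As a simplified picture of keyword-driven retrieval over community reports, the sketch below ranks reports by how many expanded query keywords they contain. The function `rank_reports`, the substring matching, and the input shapes are assumptions for illustration only.

```python
def rank_reports(reports, keywords, top_k=3):
    """Rank report ids by how many distinct keywords appear in the report text."""
    lowered = {kw.lower() for kw in keywords}
    scored = []
    for report_id, text in reports.items():
        body = text.lower()
        score = sum(1 for kw in lowered if kw in body)
        if score > 0:
            scored.append((score, report_id))
    # Highest score first; ties are broken by report id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [report_id for _, report_id in scored[:top_k]]
```

Reports that match no keyword are dropped rather than padded into the result, so fewer than `top_k` ids may be returned.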

# Response generation
./retag_response_generation.sh

Evaluation

python evaluation/eval_gpt.py \
    --model "gpt-4o-mini" \
    --api_key "YOUR_API_KEY" \
    --batch_size 64 \
    --base_dir "path/to/your/exp/directory" \
    --data1_prefix "output/response_generation/retag/podcast/final_response" \
    --data1_max_level 5 \
    --data2_prefix "output/response_generation/baseline/podcast/final_response" \
    --data2_max_level 5 \
    --question_data_path "data/query/podcast_total_questions.json" \
    --response_prefix "output/evaluation/podcast"

# The win-rate results are printed to the terminal
python evaluation/eval_parsing.py \
    --base_dir "path/to/your/exp/directory" \
    --response_path "output/evaluation/podcast.json"
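
The output format of `eval_gpt.py` is not shown here. As an illustration of how a win rate can be derived from pairwise judgments, the sketch below assumes each judgment has already been reduced to a single label such as "retag", "baseline", or "tie". Those labels and the function `win_rates` are hypothetical.

```python
from collections import Counter


def win_rates(verdicts):
    """Return the fraction of judgments assigned to each label."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in sorted(counts)}
```

Keeping ties as their own label avoids inflating either system's rate.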
