This is the official implementation of our paper Scalable Graph Generative Modeling via Substructure Sequences, a self-supervised extension of our ICML'25 work GPM. G2PM addresses the fundamental scalability challenges in Graph Neural Networks (GNNs) by introducing a novel approach that goes beyond traditional message-passing architectures.
- Breakthrough scalability with continuous performance gains up to 60M parameters
- Novel sequence-based representation replacing traditional message passing
- Versatile performance across node, graph, and transfer learning tasks
- Optimized architecture design for maximum generalization capability
Traditional message-passing GNNs face several critical limitations:
- Constrained expressiveness
- Over-smoothing of node representations
- Over-squashing of information
- Limited capacity to model long-range dependencies
These issues particularly affect scalability, as increasing model size or data volume often fails to improve performance, limiting GNNs' potential as graph foundation models.
G2PM introduces a generative Transformer pre-training framework that:
- Represents graph instances (nodes, edges, or entire graphs) as sequences of substructures
- Employs generative pre-training over these sequences
- Learns generalizable and transferable representations without relying on traditional message-passing
- Demonstrates exceptional scalability on the ogbn-arxiv benchmark
- Continues performance improvement up to 60M parameters
- Significantly outperforms previous approaches that plateau at ~3M parameters
- Shows strong performance across node classification, graph classification, and transfer learning tasks
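To make the substructure-sequence idea concrete, here is a minimal, illustrative Python sketch — not the repository's implementation, and `random_walk`/`node_to_sequence` are hypothetical names — of representing a node as a sequence of random-walk patterns, mirroring the `--num_patterns` and `--pattern_size` options described below:

```python
# Illustrative sketch (not the repo's implementation): represent a node as a
# sequence of substructures sampled via random walks. `random_walk` and
# `node_to_sequence` are hypothetical helper names.
import random

def random_walk(adj, start, length, rng):
    """Sample one random-walk pattern of at most `length` nodes from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return tuple(walk)

def node_to_sequence(adj, node, num_patterns, pattern_size, seed=0):
    """Build the substructure sequence for `node`, analogous to the
    --num_patterns and --pattern_size options."""
    rng = random.Random(seed)
    return [random_walk(adj, node, pattern_size, rng) for _ in range(num_patterns)]

# Toy graph: a 4-cycle.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
seq = node_to_sequence(adj, node=0, num_patterns=4, pattern_size=3)
print(seq)  # four walks, each starting at node 0, at most 3 nodes long
```

Each instance thus becomes a token sequence that a standard Transformer can consume, with no message passing involved.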
- CUDA-compatible GPU (24GB memory minimum, 48GB recommended)
- CUDA 12.1
- Python 3.9+
```bash
# Create and activate the conda environment
conda env create -f environment.yml
conda activate GPM

# Install DGL
pip install dgl -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html

# Install PyG dependencies
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
```

The code of G2PM is in the `G2PM/` folder. Run `pretrain.py` with any supported dataset to reproduce our experiments. To ensure reproducibility, we provide tuned hyperparameters in `config/pretrain.yaml`; pass `--use_params` to apply them.
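After installation, a quick stdlib-only sanity check can confirm that the core dependencies are importable (a minimal sketch; note that `torch_geometric` is the import name of the PyG package):

```python
# Stdlib-only sanity check that the core dependencies installed above are
# importable; `torch_geometric` is the import name of the PyG package.
import importlib.util

def check(packages):
    """Map each package name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

status = check(["torch", "dgl", "torch_geometric"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```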
```bash
# Run with tuned parameters
python G2PM/pretrain.py --dataset computers --use_params
```
- Node classification: `pubmed`, `photo`, `computers`, `arxiv`, `products`, `wikics`, `flickr`
- Graph classification: `imdb-b`, `reddit-m12k`, `hiv`, `pcba`, `sider`, `clintox`, `muv`
We also provide interfaces for other widely used datasets from GPM. See `G2PM/data/pyg_data_loader.py` for details.
- `--use_params`: Use tuned hyperparameters
- `--dataset`: Target dataset name
- `--epochs`: Number of training epochs
- `--batch_size`: Batch size
- `--lr`: Learning rate

- `--pre_sample_pattern_num`: Total number of patterns per instance (used for pattern extraction)
- `--num_patterns`: Number of patterns per instance during training (used for pattern encoding)
- `--pattern_size`: Pattern size (random walk length)
- `--mask_token`: Mask token type (`learnable`, `random`, `fixed`, `replace`)
- `--architecture`: Reconstruction architecture (`mae`, `simmim`)

- `--hidden_dim`: Hidden layer dimension
- `--num_heads`: Number of attention heads
- `--num_enc_layers`: Number of Transformer layers in the encoder
- `--num_dec_layers`: Number of Transformer layers in the decoder
- `--dropout`: Dropout rate

- `--mix_aug`: Mix the augmentation strategies
- `--mask_node`: Mask node features
- `--mask_pattern`: Mask graph patterns
For complete configuration options, please refer to our code documentation.
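As a conceptual illustration of what `--mask_pattern` with `--architecture mae` does, here is a hedged, self-contained sketch of masking a fraction of pattern tokens for reconstruction; `mask_patterns` and the `MASK` token are illustrative names, not the repo's API:

```python
# Conceptual sketch of MAE-style pattern masking (the idea behind
# --mask_pattern and --architecture mae); `mask_patterns` and `MASK` are
# illustrative, not the repo's API. The real model masks pattern embeddings
# and reconstructs them with a Transformer decoder.
import random

MASK = "<MASK>"  # stand-in for a learnable mask token (--mask_token learnable)

def mask_patterns(sequence, mask_ratio, seed=0):
    """Replace a random fraction of patterns with the mask token.
    Returns the corrupted sequence and the masked indices to reconstruct."""
    rng = random.Random(seed)
    k = max(1, int(len(sequence) * mask_ratio))
    masked_idx = set(rng.sample(range(len(sequence)), k))
    corrupted = [MASK if i in masked_idx else p for i, p in enumerate(sequence)]
    return corrupted, sorted(masked_idx)

patterns = [("walk", i) for i in range(8)]
corrupted, idx = mask_patterns(patterns, mask_ratio=0.5)
print(corrupted, idx)  # half of the 8 patterns replaced by <MASK>
```

The pre-training objective then scores how well the decoder reconstructs the patterns at the masked positions.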
```
G2PM/
├── G2PM/               # Main package directory
│   ├── data/           # Data loading and preprocessing
│   ├── model/          # Model architectures
│   ├── task/           # Task implementations
│   ├── utils/          # Utility functions
│   └── pretrain.py     # Pretraining script
├── config/             # Configuration files
├── assets/             # Images and assets
├── data/               # Dataset storage
├── patterns/           # Extracted graph patterns
└── environment.yml     # Conda environment spec
```
If you find this work useful, please cite our paper:
@article{wang2025scalable,
title={Scalable Graph Generative Modeling via Substructure Sequences},
author={Wang, Zehong and Zhang, Zheyuan and Ma, Tianyi and Zhang, Chuxu and Ye, Yanfang},
journal={arXiv preprint arXiv:2505.16130},
year={2025}
}
@inproceedings{wang2025generative,
title={Generative Graph Pattern Machine},
author={Wang, Zehong and Zhang, Zheyuan and Ma, Tianyi and Zhang, Chuxu and Ye, Yanfang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=tdMWo3jB21}
}

For questions, please contact zwang43@nd.edu or open an issue.
This repository builds upon the excellent work from:
We thank these projects for their valuable contributions to the field.


