This repository contains the code for the paper Towards AI Analyst: Querying Costly Features for Fraud and Money Laundering Detection.
Source code lives in the src folder. All scripts are included in the scripts folder.
Install the requirements from the requirements.txt file. The code has been run on Python 3.12.
Public data and models live in the local_caches folder. All the necessary data is available on Google Drive: towards-ai-analyst-data. The drive is organized into a data folder and a results folder.
The original Amaretto dataset can be found at https://github.com/necst/amaretto_dataset. Download the following files from the data subfolder on Google Drive:
- amaretto_dataset_anon.csv.zip: the original Amaretto dataset processed into one .csv file
- amaretto.pq: full dataframe with features extracted via feature engineering (as in scripts/data_prep/feature_engineering.py)
- amaretto.tar.gz: unpacks to an amaretto folder with
  - prior.pq, val_prior.pq, and test_prior.pq, a time split of the prior features
  - costly_features.pq, val_costly_features.pq, and test_costly_features.pq, a time split of the costly features
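The time-split files listed above can be located and loaded with a small helper. This is a minimal sketch, not code from the repository: the names split_files and load_split are hypothetical, the folder local_caches/data/amaretto is an assumed unpack location, and treating the files without a prefix as the training part is an assumption inferred from the val_ and test_ prefixes. load_split additionally assumes that pandas and a Parquet engine such as pyarrow are installed.

```python
from __future__ import annotations

from pathlib import Path

# Assumed unpack location of amaretto.tar.gz; adjust to your setup.
DATA_DIR = Path("local_caches") / "data" / "amaretto"


def split_files(split: str, data_dir: Path = DATA_DIR) -> dict[str, Path]:
    """Return the paths of the prior and costly feature files for one split."""
    if split not in {"train", "val", "test"}:
        raise ValueError(f"unknown split: {split!r}")
    # Assumption: files without a prefix hold the training part of the time split.
    prefix = "" if split == "train" else f"{split}_"
    return {
        "prior": data_dir / f"{prefix}prior.pq",
        "costly": data_dir / f"{prefix}costly_features.pq",
    }


def load_split(split: str, data_dir: Path = DATA_DIR):
    """Read both feature tables of a split into pandas DataFrames."""
    import pandas as pd  # imported lazily; requires a Parquet engine

    return {
        name: pd.read_parquet(path)
        for name, path in split_files(split, data_dir).items()
    }


if __name__ == "__main__":
    # Report which validation files are present without loading them.
    for name, path in split_files("val").items():
        print(name, path, "(found)" if path.exists() else "(missing)")
```

Calling load_split("test") would then return a dictionary with the keys "prior" and "costly", provided the files exist at the assumed location.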
Unpack the results from the file amaretto_results.tar.gz into the local_caches/results folder. The results include calculated scores and other data from the best trained models in all settings. If you'd like to get the trained models themselves, don't hesitate to write to us and we can provide them.
❯ tar -tzf amaretto_results.tar.gz
amaretto/
amaretto/dime/
amaretto/nn_classifier_prior_probabilities.pkl
amaretto/nn_classifier_full_probabilities.pkl
amaretto/nn_classifier_2f_probabilities.pkl
amaretto/dime/test.npz
amaretto/dime/val.npz
You can load the data and the model results and try them out yourself using the notebook notebooks/costly_features_results.ipynb.
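Unpacking the results can also be done from Python. The following is a minimal sketch that uses only the standard library tarfile module; the helper name unpack_results and its default paths are illustrative assumptions (the archive is assumed to sit in the current directory), not part of the repository.

```python
import tarfile
from pathlib import Path


def unpack_results(archive="amaretto_results.tar.gz", dest="local_caches/results"):
    """Extract the results archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names


if __name__ == "__main__":
    archive = Path("amaretto_results.tar.gz")
    if archive.exists():
        for name in unpack_results(archive):
            print(name)
    else:
        print(f"{archive} not found; download it from the Google Drive first")
```

After extraction, the .pkl files can be opened with pickle.load and the .npz files with numpy.load. The layout of their contents is not described here, so the notebook is the better guide to what each file holds.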
To cite the paper, please use the following BibTeX:
@INPROCEEDINGS{11126734,
author={Mašková, Michaela and Šmídl, Václav},
booktitle={2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC)},
title={Towards AI Analyst: Querying Costly Features for Fraud and Money Laundering Detection},
year={2025},
volume={},
number={},
pages={1905-1910},
keywords={Analytical models;Computational modeling;Machine learning;Production;Feature extraction;Data processing;Data models;Software;Fraud;Monitoring;fraud detection;anti-money laundering;costly features;machine learning},
doi={10.1109/COMPSAC65507.2025.00262}}
Gadgil, S., Covert, I.C. and Lee, S.I., Estimating Conditional Mutual Information for Dynamic Feature Selection. In The Twelfth International Conference on Learning Representations.