Data sources include:
- egrul folder with `.csv` files listing all Russian companies. Downloaded for each region from Nalog.ru;
- rosstat folder with `.xlsx` files aggregating regional statistics from Rosstat.gov.ru. Version 2022;
- msp folder with `.xlsx` files from Nalog.ru. Includes small and medium-sized companies only. HERE edit the version date used (see this parameter on the mentioned page);
- msp_xml folder with `.xml` files from Nalog.ru. The data is very similar to the previous source, but in a different representation and with a few extra features included.
- features.xlsx file created by hand. It lists all regional features used in the analysis.
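Before running any processing, it may help to check that the sources listed above are actually in place. The snippet below is a minimal sketch, not part of the repository: the `inventory` helper is hypothetical, and it assumes all source folders sit under one root folder (`data_raw` is used here only as an example argument).

```python
from pathlib import Path

# Folder names and extensions follow the list of data sources above.
# Keeping them under a single root folder is an assumption of this sketch.
EXPECTED_SOURCES = {
    "egrul": ".csv",
    "rosstat": ".xlsx",
    "msp": ".xlsx",
    "msp_xml": ".xml",
}
FEATURES_FILE = "features.xlsx"


def inventory(root):
    """Return {source: file count}, with None for anything that is missing."""
    root = Path(root)
    report = {}
    for folder, suffix in EXPECTED_SOURCES.items():
        path = root / folder
        if path.is_dir():
            # Count files with the expected extension, case-insensitively.
            report[folder] = sum(
                1
                for f in path.rglob("*")
                if f.is_file() and f.suffix.lower() == suffix
            )
        else:
            report[folder] = None
    report[FEATURES_FILE] = 1 if (root / FEATURES_FILE).is_file() else None
    return report


if __name__ == "__main__":
    # "data_raw" is an example root; pass the folder that holds your sources.
    for name, count in inventory("data_raw").items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"{name}: {status}")
```

A `None` entry means the folder or file was not found under the given root, and `0` means the folder exists but holds no files with the expected extension.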
Further processing saves intermediate files to the data folder.
The config folder stores configurations for data and model parameters. These parameters were obtained with Optuna optimization (not included).
From the repo folder, run:

    docker build -t stat .
    docker run -it -v <CODE FOLDER>:/workdir -v <DATA FOLDER>:/workdir/ -m 16000m --cpus=4 -w="/workdir" stat
There are 2 options:
- Inside the container, run `.sh` (not implemented yet) to download raw company data (`.xls`, `.xlsx`, `.csv` file formats) from Google Drive, unpack it, and delete the archived data.
- Inside the container, run `sh download.sh` to download preprocessed company data (`.parquet` file format) from Google Drive, unpack it, and delete the archived data.
- preprocess.py handles raw data, so it works only after loading data with option 1. The output file `data/parquet/companies_feat.parquet` contains all companies that are listed in the MSP registry and have closed to date (i.e. companies with a finite 'lifetime' feature, which serves as the target variable). This step requires the `data_raw` folder. However, you may skip it and use the `.parquet` files loaded into the `data` folder with option 2.
- train.py performs the regression analysis with several algorithms and writes the trained models and their metrics to the `data/models/metrics.parquet` file.
- run.py (not implemented yet) predicts the lifetime of a company with the parameters listed in `config/predict.yaml`.
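The selection rule used for the training set (only companies with a finite 'lifetime') can be illustrated with a small sketch. This is not the code from preprocess.py: the `closed_companies` helper, the record layout, the sample values, and the idea that still-open companies carry an infinite, NaN, or missing lifetime are all assumptions made for illustration.

```python
import math


def closed_companies(records, target="lifetime"):
    """Keep only records whose target value is a finite number."""
    kept = []
    for rec in records:
        value = rec.get(target)
        # Missing, NaN, and infinite lifetimes are treated as "not closed yet".
        if isinstance(value, (int, float)) and math.isfinite(value):
            kept.append(rec)
    return kept


# Hypothetical records; the real columns in companies_feat.parquet may differ.
sample = [
    {"id": "A", "lifetime": 730.0},
    {"id": "B", "lifetime": math.inf},
    {"id": "C", "lifetime": float("nan")},
    {"id": "D", "lifetime": None},
]

print([rec["id"] for rec in closed_companies(sample)])  # prints ['A']
```

Restricting the data this way gives every training row a well-defined regression target, at the cost of leaving out companies that are still operating.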