PRTR transfers

1. Overview

1.1. Project tree

The project tree below shows only the files most relevant to a developer. Browse the folders in full as well, including the ancillary folder, which contains important information for the data processing.

PRTR_transfers
├── ancillary
│    
└── data_engineering
    ├── __init__.py
    ├── main.py
    ├── extract
    │   ├── __init__.py
    │   ├── config.yaml
    │   ├── main.py
    │   ├── common.py
    │   ├── npi_scraper.py
    │   ├── npri_scraper.py
    │   ├── tri_scraper.py
    │   ├── srs_scraper.py
    │   ├── nlm_scraper.py
    │   ├── pubchem_scraper.py
    │   └── output
    │ 
    ├── transform
    │   ├── __init__.py
    │   ├── main.py
    │   ├── common.py
    │   ├── industry_sector_standardizing.py
    │   ├── chemical_standardizing.py
    │   ├── naics_normalization.py
    │   ├── npi_transformer.py
    │   ├── npri_transformer.py
    │   ├── tri_transformer.py
    │   ├── database_normalization.py
    │   └── output
    │ 
    └── load
        ├── __init__.py
        ├── main.py
        ├── industry_sector.py
        ├── facility.py
        ├── prtr_system.py
        ├── record.py
        ├── substance.py
        ├── transfer.py
        ├── chemical.py
        ├── base.py
        └── output

1.2. Enhanced entity-relationship diagram (EERD) for the PRTR_transfers database

The EERD model in the following figure represents the PRTR_transfers database schema created after data engineering. The prtr_system table is shown without any explicit relationship to the other tables in the database because its columns were not set as foreign keys. However, its columns can still be used to join it to other tables, such as the national_substance table, to find out which PRTR system a report comes from.
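A join on matching column values does not need a declared foreign key. The sketch below illustrates this with an in-memory SQLite database. The column names (prtr_system, country, national_substance_id, name) and the two-table layout are assumptions made for illustration only; the real PRTR_transfers schema has more tables and columns and may name them differently.

```python
import sqlite3

# Hypothetical, simplified schema: column names are assumptions for
# illustration and are not taken from the PRTR_transfers database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prtr_system (
        prtr_system TEXT PRIMARY KEY,
        country     TEXT
    );
    CREATE TABLE national_substance (
        national_substance_id INTEGER PRIMARY KEY,
        name                  TEXT,
        prtr_system           TEXT  -- plain column, no FOREIGN KEY constraint
    );
    INSERT INTO prtr_system VALUES
        ('TRI', 'USA'), ('NPRI', 'Canada'), ('NPI', 'Australia');
    INSERT INTO national_substance VALUES
        (1, 'Benzene', 'TRI'), (2, 'Benzene', 'NPRI');
    """
)

# The join relies only on equal values in the two columns.
rows = conn.execute(
    """
    SELECT ns.national_substance_id, ns.name, ps.prtr_system, ps.country
    FROM national_substance AS ns
    JOIN prtr_system AS ps ON ps.prtr_system = ns.prtr_system
    ORDER BY ns.national_substance_id
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Without a foreign key the database does not check that every value in national_substance.prtr_system exists in prtr_system, so unmatched rows are silently dropped by an inner join.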

2. Requirements

2.1. Developers

2.1.1. Creating conda environment

A conda environment can be created by executing the following command:

conda env create -n PRTR -f environment.yml

The above command assumes that you are in the folder containing the environment.yml file, i.e., the root folder PRTR_transfers.
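Once the environment exists, it has to be activated (conda activate PRTR) before running the project. A minimal sketch for checking this from Python is shown below. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the helper names are hypothetical and not part of the project.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if there is none."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def is_prtr_env_active(environ=None, expected="PRTR"):
    """Return True if the active conda environment is the one created above."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    if is_prtr_env_active():
        print("The PRTR environment is active.")
    else:
        print("Run 'conda activate PRTR' first.")
```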

2.1.2. Avoiding ModuleNotFoundError and ImportError¹

If you are working as a Python developer, you need the project on the Python import path; otherwise imports of its modules fail with ModuleNotFoundError or ImportError. Follow the steps below to avoid both errors:

Run the following command to find the location of the PRTR_transfers project and save its path in the variable PACKAGE
```
PACKAGE=$(locate -br '^PRTR_transfers$')
```
Check the PACKAGE value by running the following command
```
echo $PACKAGE
```
Run the following command to add the PRTR_transfers project to the system paths
```
export PYTHONPATH="${PYTHONPATH}:$PACKAGE"
```
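To confirm that the export took effect, you can check from Python whether the project folder is among the interpreter's import search paths. The following is a minimal sketch; the function name project_on_path is hypothetical and not part of the project.

```python
import os
import sys
from pathlib import Path


def project_on_path(project_dir, search_paths=None):
    """Return True if project_dir is one of the import search paths."""
    search_paths = sys.path if search_paths is None else search_paths
    target = Path(project_dir).resolve()
    # Resolve each entry so that trailing slashes and relative paths compare equal.
    return any(Path(entry).resolve() == target for entry in search_paths)


if __name__ == "__main__":
    # PACKAGE is only visible here if it was exported in the shell.
    package = os.environ.get("PACKAGE")
    if package is None:
        print("PACKAGE is not exported in this shell.")
    else:
        print(f"{package} on import path: {project_on_path(package)}")
```

Note that PACKAGE=$(...) without export defines a shell variable only, so a Python process will see it only after export PACKAGE or when the .bashrc variant is used.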

If you prefer to save the path to the PRTR_transfers project folder as a permanent environment variable, follow these steps:

Open the .bashrc file with the text editor of your preference (e.g., Visual Studio Code)
```
code ~/.bashrc
```

Scroll to the bottom of the file and add the following lines

export PACKAGE=$(locate -br '^PRTR_transfers$')
export PYTHONPATH="${PYTHONPATH}:$PACKAGE"

Save the file with the changes
You can open another terminal to verify that the variable has been successfully saved by running the following command
```
echo $PYTHONPATH
```

2.1.3. Installation of Relational Database Management System (RDBMS)

The Extract, Transform, Load (ETL) procedure uses an Object-Relational Mapping (ORM) for data persistence in an RDBMS. PostgreSQL and MySQL are the RDBMSs currently supported by the ETL. Thus, you must have either of them installed to run the data engineering pipeline or the data-driven modeling module.
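ORM libraries usually reach the RDBMS through a connection URL. The sketch below assembles one in the SQLAlchemy style (dialect://user:password@host:port/database). This README does not name the ORM library, so both the URL style and the function build_db_url are assumptions for illustration, not the project's actual code.

```python
from urllib.parse import quote_plus

# The two RDBMSs the ETL is stated to support.
SUPPORTED_RDBMS = ("mysql", "postgresql")


def build_db_url(rdbms, username, password, host, port, db_name):
    """Assemble a SQLAlchemy-style connection URL (assumed format)."""
    if rdbms not in SUPPORTED_RDBMS:
        raise ValueError(f"Unsupported RDBMS: {rdbms!r}")
    # Percent-encode the credentials so characters such as '@' or ':'
    # do not break the URL structure.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{rdbms}://{user}:{secret}@{host}:{port}/{db_name}"


if __name__ == "__main__":
    print(build_db_url("postgresql", "postgres", "p@ss", "127.0.0.1", 5432, "PRTR_transfers"))
```

A matching database driver for the chosen RDBMS must also be installed in the environment for such a URL to be usable.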

3. How to use

3.1. Data engineering module

You can run each .py file in the data engineering module separately. However, the module also lets you run the whole ETL pipeline through the main.py inside the data_engineering folder. To do so, follow the steps below:

In your terminal or command line, navigate to the data_engineering folder
Run the following command
```
python main.py --help
```

You will see the following help menu

usage: main.py [-h] [--rdbms RDBMS] [--password PASSWORD] [--username USERNAME] [--host HOST] [--port PORT] [--db_name DB_NAME]
               [--sql_file SQL_FILE]

optional arguments:
      -h, --help           show this help message and exit
      --rdbms {mysql,postgresql}
                           The Relational Database Management System (RDBMS) you would like to use
      --password PASSWORD  The password for using the RDBMS
      --username USERNAME  The username for using the RDBMS
      --host HOST          The computer hosting for the database
      --port PORT          Port used by the database engine
      --db_name DB_NAME    Database name
      --sql_file {True,False}
                           Would you like to obtain .SQL file

Pass a value for each parameter you want to set; e.g., to name your database PRTR, write --db_name PRTR. Every argument except --password has a default value (see the table below).

Argument	Default	Comment
rdbms	mysql	Only two options: MySQL and PostgreSQL
username	root	root is the default username for MySQL. For PostgreSQL, it is postgres
host	127.0.0.1	127.0.0.1 (localhost) is the default host for both MySQL and PostgreSQL
port	3306	3306 is the default port for MySQL. For PostgreSQL, it is 5432
db_name	PRTR_transfers	You are free to choose a name for the database
sql_file	False	Only two options: True and False
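The help menu and the defaults in the table can be reproduced with argparse. The following is a sketch reconstructed from that information only; it is not the project's actual main.py, and the int type for --port is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--rdbms", choices=["mysql", "postgresql"], default="mysql",
                        help="The Relational Database Management System (RDBMS) you would like to use")
    # --password is the only argument without a default.
    parser.add_argument("--password", help="The password for using the RDBMS")
    parser.add_argument("--username", default="root",
                        help="The username for using the RDBMS")
    parser.add_argument("--host", default="127.0.0.1",
                        help="The computer hosting the database")
    # Assumption: the port is parsed as an integer.
    parser.add_argument("--port", type=int, default=3306,
                        help="Port used by the database engine")
    parser.add_argument("--db_name", default="PRTR_transfers", help="Database name")
    parser.add_argument("--sql_file", choices=["True", "False"], default="False",
                        help="Would you like to obtain .SQL file")
    return parser


if __name__ == "__main__":
    print(vars(build_parser().parse_args()))
```

Going by the table, the username and port defaults are the MySQL ones, so a PostgreSQL run would need them passed explicitly, e.g. `python main.py --rdbms postgresql --username postgres --port 5432 --password <your password>`.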

4. Notes

1: If you have trouble with this step, refresh the database used by locate by running sudo updatedb.
