nafig

Do you want to visualize missing values in your data? There are plenty of great tools for that (check missingno, for example), but their plots get bulky when your data has too many columns. nafig will help you build a perfect NA figure!

Installation

$ pip install -U nafig

or install with Poetry

$ poetry add nafig

Usage

Here are some usage examples for both simulated and real-world data. Check this notebook to play with the code yourself!

First, let's import the core function and other useful things:

>>> from nafig.plots import na_text_barplot  # The core function
>>> from nafig.utils import create_example_data  # To simulate data
>>> import pandas as pd  # To work with tables

>>> df, feature_types = create_example_data()

df is just a pandas dataframe with missing values. feature_types is an array containing a data type description for each column. This is just an example, so the labels don't correspond to the actual data types.

>>> feature_types[:10]
array(['Categorical', 'Categorical', 'Binary', 'Continuous', 'Continuous',
       'Continuous', 'Binary', 'Continuous', 'Continuous', 'Binary'],
      dtype='<U11')

This toy dataframe contains 300 columns. A heatmap of its missing values would unfortunately be too bulky. How can you explore the distribution of missing data in this dataset? Try the NA text barplot!

>>> na_text_barplot(df, hue=feature_types, line_height=1.5)

Columns of the dataset are binned by the percentage of missing data in them. Coloring by feature type helps you understand which types of data are missing. The Y-axis shows the number of features in each group.
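
To make the binning idea concrete, here is a minimal pandas-only sketch. It is not nafig's actual implementation: the helper name bin_columns_by_missingness, the equal-width bin edges and the toy table are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def bin_columns_by_missingness(df: pd.DataFrame, num_bins: int = 10) -> pd.Series:
    """Assign each column to a bin by its percentage of missing values."""
    # Share of missing values per column, in percent (0..100)
    na_percent = df.isna().mean() * 100
    # Equal-width bin edges over [0, 100] (an assumption of this sketch);
    # include_lowest keeps fully complete columns (0%) in the first bin
    edges = np.linspace(0, 100, num_bins + 1)
    return pd.cut(na_percent, bins=edges, include_lowest=True)


# Hypothetical toy table: "a" is complete, "b" is half missing, "c" is all missing
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [1.0, np.nan, 3.0, np.nan],
        "c": [np.nan] * 4,
    }
)

bins = bin_columns_by_missingness(toy, num_bins=4)
# Number of columns per bin, which would correspond to the bar height in such a plot
counts = bins.value_counts(sort=False)
print(counts)
```

Each column lands in exactly one bin, so the counts always sum to the number of columns in the table.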

You can vary the number of bins using the num_bins parameter:

>>> na_text_barplot(df, hue=feature_types, line_height=1.5, num_bins=20)

>>> na_text_barplot(df, hue=feature_types, line_height=2, num_bins=2, fig_width=8, font_size=3)

Now let's see some real data examples!

House prices missing data visualization

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv

>>> DATA_PATH = "data/house-prices/train.csv"
>>> house_prices_df = pd.read_csv(DATA_PATH, index_col=0)

This is a reasonably clean dataset with most of the values present. But thanks to this plot, we can see which features are the bad guys!

>>> na_text_barplot(house_prices_df, fig_width=17, num_bins=20, line_height=1.5)
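
If you also want the "bad guys" as numbers, a plain pandas ranking works alongside the plot. This is a sketch that does not use nafig; the helper most_incomplete_columns and the small stand-in table with its column names are hypothetical.

```python
import numpy as np
import pandas as pd


def most_incomplete_columns(df: pd.DataFrame, top: int = 5) -> pd.Series:
    """Return the `top` columns with the highest share of missing values."""
    na_share = df.isna().mean()
    # Keep only columns that actually have gaps, most incomplete first
    return na_share[na_share > 0].sort_values(ascending=False).head(top)


# Hypothetical stand-in for a real table such as house_prices_df
example = pd.DataFrame(
    {
        "pool_quality": [np.nan, np.nan, np.nan, "Gd"],
        "lot_frontage": [65.0, np.nan, 80.0, 70.0],
        "sale_price": [208500, 181500, 223500, 140000],
    }
)

result = most_incomplete_columns(example)
print(result)
```

The same call should work on any dataframe, for example most_incomplete_columns(house_prices_df, top=10).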

Note that if you don't pass the hue parameter, features will be colored by the data type of the column. If you don't want to colorize features at all, set hue to False.
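
If the raw column dtypes are too fine-grained for your plot, you can derive your own labels. The sketch below builds one label per column, in the same shape as the feature_types array from the simulated example. The helper coarse_type_labels and its three groups are assumptions, and so is the idea that hue accepts any array with one label per column, as it does for feature_types.

```python
import numpy as np
import pandas as pd
from pandas.api import types as pdt


def coarse_type_labels(df: pd.DataFrame) -> np.ndarray:
    """Build one label per column, in column order, from the column dtypes."""
    labels = []
    for dtype in df.dtypes:
        # Check bool first: pandas treats bool as a numeric dtype as well
        if pdt.is_bool_dtype(dtype):
            labels.append("Binary")
        elif pdt.is_numeric_dtype(dtype):
            labels.append("Continuous")
        else:
            labels.append("Categorical")
    return np.array(labels)


# Hypothetical table with one column of each kind
example = pd.DataFrame(
    {
        "has_garage": [True, False, True],
        "area": [54.0, np.nan, 71.5],
        "district": ["north", None, "south"],
    }
)

labels = coarse_type_labels(example)
print(labels)
```

Under that assumption, the result could be passed as na_text_barplot(example, hue=labels).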

By setting remove_empty_bins to True, you can remove the empty bins. This requires the reader to pay more attention to the X-axis but saves you some space.

>>> na_text_barplot(house_prices_df, fig_width=10, num_bins=20, 
                    line_height=1.5, remove_empty_bins=True)

Seattle Airbnb dataset missing values visualization

Data source: https://www.kaggle.com/datasets/airbnb/seattle

>>> airbnb_df = pd.read_csv("data/airbnb/listings.csv")

This dataset has a bit more missing data. On the plot we can see that all integer features are almost complete, while some object and floating-point columns contain missing values.

>>> na_text_barplot(airbnb_df, fig_width=18, line_height=1.8, font_size=9, remove_empty_bins=True)

Feel free to explore the other parameters! There are more of them to help you create a perfect missing values visualization.

Developers section

🚀 Features

Development features

Supports Python 3.9 and higher.
Poetry as the dependency manager. See configuration in pyproject.toml and setup.cfg.
Automatic codestyle with black, isort and pyupgrade.
Ready-to-use pre-commit hooks with code-formatting.
Type checks with mypy; docstring checks with darglint; security checks with safety and bandit.
Testing with pytest.
Ready-to-use .editorconfig, .dockerignore, and .gitignore. You don't have to worry about those things.

Deployment features

GitHub integration: issue and PR templates.
GitHub Actions with a predefined build workflow as the default CI/CD.
Everything is already set up for security checks, codestyle checks, code formatting, testing, linting, docker builds, etc. with the Makefile. More details in the Makefile usage section.
Dockerfile for your package.
Always up-to-date dependencies with @dependabot. You only need to enable it.
Automatic drafts of new releases with Release Drafter. You can see the list of labels in release-drafter.yml. Works perfectly with the Semantic Versions specification.

Makefile usage

The Makefile contains many targets for faster development.

1. Download and remove Poetry

To download and install Poetry run:

make poetry-download

To uninstall it, run:

make poetry-remove

2. Install all dependencies and pre-commit hooks

Install requirements:

make install

Pre-commit hooks can be installed after git init via

make pre-commit-install

3. Codestyle

Automatic formatting uses pyupgrade, isort and black.

make codestyle

# or use synonym
make formatting

Codestyle checks only, without rewriting files:

make check-codestyle

Note: check-codestyle uses the isort, black and darglint libraries.

Update all dev libraries to the latest version with one command:

make update-dev-deps

4. Code security

make check-safety

This command runs Poetry integrity checks and identifies security issues with Safety and Bandit.

5. Type checks

Run the mypy static type checker:

make mypy

6. Tests with coverage badges

Run pytest

make test

7. All linters

Of course there is a command to ~~rule~~ run all linters in one:

make lint

This is the same as:

make test && make check-codestyle && make mypy && make check-safety

8. Docker

make docker-build

which is equivalent to:

make docker-build VERSION=latest

Remove the Docker image with:

make docker-remove

More information about docker.

9. Cleanup

Delete pycache files

make pycache-remove

Remove package build

make build-remove

Delete .DS_Store files

make dsstore-remove

Remove .mypy_cache

make mypycache-remove

Or, to remove all of the above, run:

make cleanup

📈 Releases

You can see the list of available releases on the GitHub Releases page.

We follow the Semantic Versions specification.

We use Release Drafter. As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you’re ready. With the categories option, you can categorize pull requests in release notes using labels.

List of labels and corresponding titles

| Label | Title in Releases |
| --- | --- |
| `enhancement`, `feature` | 🚀 Features |
| `bug`, `refactoring`, `bugfix`, `fix` | 🔧 Fixes & Refactoring |
| `build`, `ci`, `testing` | 📦 Build System & CI/CD |
| `breaking` | 💥 Breaking Changes |
| `documentation` | 📝 Documentation |
| `dependencies` | ⬆️ Dependencies updates |

You can update it in release-drafter.yml.

GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository when you need them.
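
The label list above can also be read as a simple lookup from a pull request's labels to release-note sections. The sketch below only illustrates that mapping; it is not how Release Drafter is implemented, the function release_sections is hypothetical, and how Release Drafter itself treats a pull request with several labels is not covered here.

```python
# Label-to-title mapping copied from the list of labels above
SECTION_BY_LABEL = {
    "enhancement": "🚀 Features",
    "feature": "🚀 Features",
    "bug": "🔧 Fixes & Refactoring",
    "refactoring": "🔧 Fixes & Refactoring",
    "bugfix": "🔧 Fixes & Refactoring",
    "fix": "🔧 Fixes & Refactoring",
    "build": "📦 Build System & CI/CD",
    "ci": "📦 Build System & CI/CD",
    "testing": "📦 Build System & CI/CD",
    "breaking": "💥 Breaking Changes",
    "documentation": "📝 Documentation",
    "dependencies": "⬆️ Dependencies updates",
}


def release_sections(pr_labels):
    """Return the section titles matching a pull request's labels, without duplicates."""
    sections = []
    for label in pr_labels:
        title = SECTION_BY_LABEL.get(label)
        # Skip unknown labels and titles that were already collected
        if title is not None and title not in sections:
            sections.append(title)
    return sections


print(release_sections(["bug", "fix", "ci"]))
```

Labels that are not in the list, such as a hypothetical question label, simply yield no section in this sketch.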

🛡 License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

📃 Citation

@misc{nafig,
  author = {VladimirShitov},
  title = {Package for plotting figures with NA data distribution},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VladimirShitov/nafig}}
}

Credits

This project was generated with python-package-template.
