Commit de9be2f

Small cleanups to README (#990)

* clean up readme
* do use local parquet file
1 parent 8d6dab6 commit de9be2f

1 file changed: README.md (7 additions & 21 deletions)
@@ -11,18 +11,19 @@
 
 [Documentation](https://glum.readthedocs.io/en/latest/)
 
-Generalized linear models (GLM) are a core statistical tool that include many common methods like least-squares regression, Poisson regression and logistic regression as special cases. At QuantCo, we have used GLMs in e-commerce pricing, insurance claims prediction and more. We have developed `glum`, a fast Python-first GLM library. The development was based on [a fork of scikit-learn](https://github.com/scikit-learn/scikit-learn/pull/9405), so it has a scikit-learn-like API. We are thankful for the starting point provided by Christian Lorentzen in that PR!
+Generalized linear models (GLM) are a core statistical tool that include many common methods like least-squares regression, Poisson regression, and logistic regression as special cases. At QuantCo, we have used GLMs in e-commerce pricing, insurance claims prediction, and more. We have developed `glum`, a fast Python-first GLM library. The development was based on [a fork of scikit-learn](https://github.com/scikit-learn/scikit-learn/pull/9405), so it has a scikit-learn-like API. We are thankful for the starting point provided by Christian Lorentzen in that PR!
 
 We believe that for GLM development, broad support for distributions, regularization, and statistical inference, along with fast formula-based specification, is key. `glum` supports
 
-* Built-in cross validation for optimal regularization, efficiently exploiting a “regularization path”
+* Built-in cross-validation for optimal regularization, efficiently exploiting a “regularization path”
 * L1 regularization, which produces sparse and easily interpretable solutions
 * L2 regularization, including variable matrix-valued (Tikhonov) penalties, which are useful in modeling correlated effects
 * Elastic net regularization
-* Normal, Poisson, logistic, gamma, and Tweedie distributions, plus varied and customizable link functions
+* Normal, Poisson, binomial, gamma, inverse Gaussian, negative binomial, and Tweedie distributions, plus varied and customizable link functions
 * Built-in formula-based model specification using `formulaic`
 * Classical statistical inference for unregularized models
 * Box constraints, linear inequality constraints, sample weights, offsets
+* Support for multiple dataframe backends (pandas, polars, and more) via `narwhals`
 
 Performance also matters, so we conducted extensive benchmarks against other modern libraries. Although performance depends on the specific problem, we find that when N >> K (there are more observations than predictors), `glum` is consistently much faster for a wide range of problems. This repo includes the benchmarking tools in the `glum_benchmarks` module. For details, [see here](glum_benchmarks/README.md).
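
For context on the hunk above: the kind of fit a GLM library performs can be sketched in a few lines of plain numpy. The snippet below fits an L2-penalized logistic (binomial) model by gradient descent on synthetic data. This is an illustrative toy, not `glum`'s actual solver, and all names in it are made up for the example.

```python
import numpy as np

# Tiny synthetic binary-classification problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(X @ true_coef)))).astype(float)

# Gradient descent on the L2-penalized mean logistic deviance:
# grad = X^T (p - y) / n + alpha * coef
alpha = 0.1        # ridge penalty strength (illustrative choice)
coef = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ coef)))
    coef -= 0.5 * (X.T @ (p - y) / len(y) + alpha * coef)

print(coef)  # roughly recovers true_coef, shrunk toward zero by the penalty
```

Real solvers (including glum's) use second-order methods and active-set tricks for speed; the objective being minimized is the same shape.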

@@ -33,19 +34,17 @@ Performance also matters, so we conducted extensive benchmarks against other mod
 
 For more information on `glum`, including tutorials and API reference, please see [the documentation](https://glum.readthedocs.io/en/latest/).
 
-Why did we choose the name `glum`? We wanted a name that had the letters GLM and wasn't easily confused with any existing implementation. And we thought glum sounded like a funny name (and not glum at all!). If you need a more professional sounding name, feel free to pronounce it as G-L-um. Or maybe it stands for "Generalized linear... ummm... modeling?"
+Why did we choose the name `glum`? We wanted a name that had the letters GLM and wasn't easily confused with any existing implementation. And we thought glum sounded like a funny name (and not glum at all!). If you need a more professional-sounding name, feel free to pronounce it as G-L-um. Or maybe it stands for "Generalized linear... ummm... modeling?"
 
 # A classic example predicting housing prices
 
 ```python
 >>> import pandas as pd
->>> from sklearn.datasets import fetch_openml
 >>> from glum import GeneralizedLinearRegressor
 >>>
 >>> # This dataset contains house sale prices for King County, which includes
 >>> # Seattle. It includes homes sold between May 2014 and May 2015.
->>> # The full version of this dataset can be found at:
->>> # https://www.openml.org/search?type=data&status=active&id=42092
+>>> # To download, use: sklearn.datasets.fetch_openml(name="house_sales", version=3)
 >>> house_data = pd.read_parquet("data/housing.parquet")
 >>>
 >>> # Use only select features
@@ -64,7 +63,6 @@ Why did we choose the name `glum`? We wanted a name that had the letters GLM and
 ... ]
 ... ].copy()
 >>>
->>>
 >>> # Model whether a house had an above or below median price via a Binomial
 >>> # distribution. We'll be doing L1-regularized logistic regression.
 >>> price = house_data["price"]
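
The "above or below median price" target described in the comments above is just a thresholded comparison. A minimal sketch with hypothetical numbers (the hunk does not show the actual transformation the README applies to `price`):

```python
import pandas as pd

# Hypothetical sale prices standing in for house_data["price"].
prices = pd.Series([250_000, 399_000, 410_000, 525_000, 760_000], name="price")

# 1 if the sale price is strictly above the median, else 0 -- the binary
# outcome that a binomial-family model can then predict.
y = (prices > prices.median()).astype(int)
print(y.tolist())  # [0, 0, 0, 1, 1]
```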
@@ -77,18 +75,6 @@ Why did we choose the name `glum`? We wanted a name that had the letters GLM and
 >>>
 >>> _ = model.fit(X=X, y=y)
 >>>
->>> # .report_diagnostics shows details about the steps taken by the iterative solver.
->>> diags = model.get_formatted_diagnostics(full_report=True)
->>> diags[['objective_fct']]
-        objective_fct
-n_iter
-0            0.693091
-1            0.489500
-2            0.449585
-3            0.443681
-4            0.443498
-5            0.443497
->>>
 >>> # Models can also be built with formulas from formulaic.
 >>> model_formula = GeneralizedLinearRegressor(
 ...     family='binomial',
@@ -111,4 +97,4 @@ conda install glum -c conda-forge
 
 For optimal performance on an x86_64 architecture, we recommend using the MKL library
 (`conda install mkl`). By default, conda usually installs the openblas version, which
-is slower, but supported on all major architecture and OS.
+is slower, but supported on all major architectures and operating systems.
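
To check whether an environment actually ended up with MKL or OpenBLAS after installing, one quick option (an illustrative check, not part of the README) is numpy's build-configuration report:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against
# (e.g. MKL vs. OpenBLAS) to stdout.
np.show_config()
```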
