Commit 8d6dab6

Update formula and french motor tutorials (#986)
* update formula and french motor tutorial
* fix
1 parent 3601b87 commit 8d6dab6

2 files changed

Lines changed: 384 additions & 157 deletions

docs/tutorials/formula_interface/formula_interface.ipynb

Lines changed: 151 additions & 35 deletions
Original file line number / Diff line number / Diff line change
@@ -18,7 +18,7 @@
1818
"\n",
1919
"This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the [GLM tutorial](tutorials/glm_french_motor_tutorial/glm_french_motor.html). If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.\n",
2020
"\n",
21-
"**Sneak Peak**\n",
21+
"**Sneak Peek**\n",
2222
"\n",
2323
"Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:\n",
2424
"\n",
@@ -52,13 +52,19 @@
5252
{
5353
"cell_type": "code",
5454
"execution_count": 1,
55-
"metadata": {},
55+
"metadata": {
56+
"execution": {
57+
"iopub.execute_input": "2026-03-16T16:10:11.302339Z",
58+
"iopub.status.busy": "2026-03-16T16:10:11.302010Z",
59+
"iopub.status.idle": "2026-03-16T16:10:12.926504Z",
60+
"shell.execute_reply": "2026-03-16T16:10:12.926009Z"
61+
}
62+
},
5663
"outputs": [],
5764
"source": [
5865
"import matplotlib.pyplot as plt\n",
5966
"import numpy as np\n",
6067
"import pandas as pd\n",
61-
"import pytest\n",
6268
"import scipy.optimize as optimize\n",
6369
"import scipy.stats\n",
6470
"from dask_ml.preprocessing import Categorizer\n",
@@ -83,7 +89,14 @@
8389
{
8490
"cell_type": "code",
8591
"execution_count": 2,
86-
"metadata": {},
92+
"metadata": {
93+
"execution": {
94+
"iopub.execute_input": "2026-03-16T16:10:12.928178Z",
95+
"iopub.status.busy": "2026-03-16T16:10:12.927994Z",
96+
"iopub.status.idle": "2026-03-16T16:10:15.244947Z",
97+
"shell.execute_reply": "2026-03-16T16:10:15.244591Z"
98+
}
99+
},
87100
"outputs": [
88101
{
89102
"data": {
@@ -365,7 +378,7 @@
365378
"cell_type": "markdown",
366379
"metadata": {},
367380
"source": [
368-
"## 2. Reproducing the Model From the GLM Turorial<a class=\"anchor\"></a>\n",
381+
"## 2. Reproducing the Model From the GLM Tutorial<a class=\"anchor\"></a>\n",
369382
"\n",
370383
"Now, let us start by fitting a very simple model. As usual, let's divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate `y` and `X` data frames for our label and features – the formula will take care of that for us.\n",
371384
"\n",
@@ -379,7 +392,14 @@
379392
{
380393
"cell_type": "code",
381394
"execution_count": 3,
382-
"metadata": {},
395+
"metadata": {
396+
"execution": {
397+
"iopub.execute_input": "2026-03-16T16:10:15.259806Z",
398+
"iopub.status.busy": "2026-03-16T16:10:15.259694Z",
399+
"iopub.status.idle": "2026-03-16T16:10:15.471707Z",
400+
"shell.execute_reply": "2026-03-16T16:10:15.470808Z"
401+
}
402+
},
383403
"outputs": [],
384404
"source": [
385405
"ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n",
@@ -398,13 +418,20 @@
398418
"cell_type": "markdown",
399419
"metadata": {},
400420
"source": [
401-
"This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different prefictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand."
421+
"This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different predictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand."
402422
]
403423
},
404424
{
405425
"cell_type": "code",
406426
"execution_count": 4,
407-
"metadata": {},
427+
"metadata": {
428+
"execution": {
429+
"iopub.execute_input": "2026-03-16T16:10:15.473697Z",
430+
"iopub.status.busy": "2026-03-16T16:10:15.473584Z",
431+
"iopub.status.idle": "2026-03-16T16:10:22.769294Z",
432+
"shell.execute_reply": "2026-03-16T16:10:22.768802Z"
433+
}
434+
},
408435
"outputs": [
409436
{
410437
"data": {
@@ -539,22 +566,29 @@
539566
{
540567
"cell_type": "code",
541568
"execution_count": 5,
542-
"metadata": {},
569+
"metadata": {
570+
"execution": {
571+
"iopub.execute_input": "2026-03-16T16:10:22.770437Z",
572+
"iopub.status.busy": "2026-03-16T16:10:22.770347Z",
573+
"iopub.status.idle": "2026-03-16T16:10:22.819876Z",
574+
"shell.execute_reply": "2026-03-16T16:10:22.819390Z"
575+
}
576+
},
543577
"outputs": [
544578
{
545579
"data": {
546580
"text/plain": [
547581
"ClaimNb int64\n",
548582
"Exposure float64\n",
549-
"Area object\n",
583+
"Area str\n",
550584
"VehPower int64\n",
551585
"VehAge int64\n",
552586
"DrivAge int64\n",
553587
"BonusMalus int64\n",
554-
"VehBrand object\n",
555-
"VehGas object\n",
588+
"VehBrand str\n",
589+
"VehGas str\n",
556590
"Density int64\n",
557-
"Region object\n",
591+
"Region str\n",
558592
"ClaimAmount float64\n",
559593
"ClaimAmountCut float64\n",
560594
"PurePremium float64\n",
@@ -577,13 +611,20 @@
577611
"cell_type": "markdown",
578612
"metadata": {},
579613
"source": [
580-
"Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a caetgorical variable, it does not have any effect outside of the feature name."
614+
"Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a categorical variable, it does not have any effect outside of the feature name."
581615
]
582616
},
583617
{
584618
"cell_type": "code",
585619
"execution_count": 6,
586-
"metadata": {},
620+
"metadata": {
621+
"execution": {
622+
"iopub.execute_input": "2026-03-16T16:10:22.821622Z",
623+
"iopub.status.busy": "2026-03-16T16:10:22.821528Z",
624+
"iopub.status.idle": "2026-03-16T16:10:30.355393Z",
625+
"shell.execute_reply": "2026-03-16T16:10:30.355061Z"
626+
}
627+
},
587628
"outputs": [
588629
{
589630
"data": {
@@ -719,13 +760,20 @@
719760
{
720761
"cell_type": "code",
721762
"execution_count": 7,
722-
"metadata": {},
763+
"metadata": {
764+
"execution": {
765+
"iopub.execute_input": "2026-03-16T16:10:30.356740Z",
766+
"iopub.status.busy": "2026-03-16T16:10:30.356657Z",
767+
"iopub.status.idle": "2026-03-16T16:10:30.401890Z",
768+
"shell.execute_reply": "2026-03-16T16:10:30.401512Z"
769+
}
770+
},
723771
"outputs": [
724772
{
725773
"data": {
726774
"text/plain": [
727775
"array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,\n",
728-
" 67.98332028, 297.21717383])"
776+
" 67.98332028, 297.21717383], shape=(67802,))"
729777
]
730778
},
731779
"execution_count": 7,
@@ -743,15 +791,22 @@
743791
"source": [
744792
"## 4. Interactions and Structural Full-Rankness<a class=\"anchor\"></a>\n",
745793
"\n",
746-
"One of the biggest strengths of Wilkinson-formuals lie in their ability of concisely specifying interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In general, just as with `glum`'s categorical handling in general, you can be assured that `glum` you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n",
794+
"One of the biggest strengths of Wilkinson-formulas lie in their ability of concisely specifying interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In general, just as with `glum`'s categorical handling in general, you can be assured that you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n",
747795
"\n",
748796
"Let's see how that looks like on the insurance example! Suppose that we expect `VehPower` to have a different effect depending on `DrivAge` (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows."
749797
]
750798
},
751799
{
752800
"cell_type": "code",
753801
"execution_count": 8,
754-
"metadata": {},
802+
"metadata": {
803+
"execution": {
804+
"iopub.execute_input": "2026-03-16T16:10:30.403259Z",
805+
"iopub.status.busy": "2026-03-16T16:10:30.403170Z",
806+
"iopub.status.idle": "2026-03-16T16:10:38.273339Z",
807+
"shell.execute_reply": "2026-03-16T16:10:38.272842Z"
808+
}
809+
},
755810
"outputs": [
756811
{
757812
"data": {
@@ -891,7 +946,14 @@
891946
{
892947
"cell_type": "code",
893948
"execution_count": 9,
894-
"metadata": {},
949+
"metadata": {
950+
"execution": {
951+
"iopub.execute_input": "2026-03-16T16:10:38.274479Z",
952+
"iopub.status.busy": "2026-03-16T16:10:38.274413Z",
953+
"iopub.status.idle": "2026-03-16T16:10:38.276592Z",
954+
"shell.execute_reply": "2026-03-16T16:10:38.276208Z"
955+
}
956+
},
895957
"outputs": [
896958
{
897959
"data": {
@@ -976,15 +1038,22 @@
9761038
" 2. Include the logarithm of a certain variable in the model.\n",
9771039
" 3. Include a basis spline interpolation of a variable to capture non-linearities in its effect.\n",
9781040
"\n",
979-
"1\\. works because because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n",
1041+
"1\\. works because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n",
9801042
"\n",
9811043
"Let's try it out!"
9821044
]
9831045
},
9841046
{
9851047
"cell_type": "code",
9861048
"execution_count": 10,
987-
"metadata": {},
1049+
"metadata": {
1050+
"execution": {
1051+
"iopub.execute_input": "2026-03-16T16:10:38.277724Z",
1052+
"iopub.status.busy": "2026-03-16T16:10:38.277661Z",
1053+
"iopub.status.idle": "2026-03-16T16:10:47.196866Z",
1054+
"shell.execute_reply": "2026-03-16T16:10:47.196455Z"
1055+
}
1056+
},
9881057
"outputs": [
9891058
{
9901059
"data": {
@@ -1118,7 +1187,14 @@
11181187
{
11191188
"cell_type": "code",
11201189
"execution_count": 11,
1121-
"metadata": {},
1190+
"metadata": {
1191+
"execution": {
1192+
"iopub.execute_input": "2026-03-16T16:10:47.197980Z",
1193+
"iopub.status.busy": "2026-03-16T16:10:47.197903Z",
1194+
"iopub.status.idle": "2026-03-16T16:10:47.783163Z",
1195+
"shell.execute_reply": "2026-03-16T16:10:47.782765Z"
1196+
}
1197+
},
11221198
"outputs": [
11231199
{
11241200
"data": {
@@ -1201,13 +1277,20 @@
12011277
"source": [
12021278
"### Variable Names\n",
12031279
"\n",
1204-
"`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` paremeters."
1280+
"`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` parameters."
12051281
]
12061282
},
12071283
{
12081284
"cell_type": "code",
12091285
"execution_count": 12,
1210-
"metadata": {},
1286+
"metadata": {
1287+
"execution": {
1288+
"iopub.execute_input": "2026-03-16T16:10:47.784411Z",
1289+
"iopub.status.busy": "2026-03-16T16:10:47.784343Z",
1290+
"iopub.status.idle": "2026-03-16T16:10:50.289682Z",
1291+
"shell.execute_reply": "2026-03-16T16:10:50.289219Z"
1292+
}
1293+
},
12111294
"outputs": [
12121295
{
12131296
"data": {
@@ -1345,12 +1428,27 @@
13451428
{
13461429
"cell_type": "code",
13471430
"execution_count": 13,
1348-
"metadata": {},
1349-
"outputs": [],
1431+
"metadata": {
1432+
"execution": {
1433+
"iopub.execute_input": "2026-03-16T16:10:50.290873Z",
1434+
"iopub.status.busy": "2026-03-16T16:10:50.290798Z",
1435+
"iopub.status.idle": "2026-03-16T16:10:50.305162Z",
1436+
"shell.execute_reply": "2026-03-16T16:10:50.304698Z"
1437+
}
1438+
},
1439+
"outputs": [
1440+
{
1441+
"name": "stdout",
1442+
"output_type": "stream",
1443+
"text": [
1444+
"Caught expected ValueError: The formula sets the intercept to False, contradicting fit_intercept=True. You should use fit_intercept to specify the intercept.\n"
1445+
]
1446+
}
1447+
],
13501448
"source": [
13511449
"formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n",
13521450
"\n",
1353-
"with pytest.raises(ValueError, match=\"The formula sets the intercept to False\"):\n",
1451+
"try:\n",
13541452
" t_glm8 = GeneralizedLinearRegressor(\n",
13551453
" family=TweedieDist,\n",
13561454
" alpha_search=True,\n",
@@ -1359,7 +1457,11 @@
13591457
" formula=formula_noint,\n",
13601458
" interaction_separator=\"__x__\",\n",
13611459
" categorical_format=\"{name}__{category}\",\n",
1362-
" ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])"
1460+
" ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n",
1461+
" raise AssertionError(\"Expected ValueError was not raised\")\n",
1462+
"except ValueError as e:\n",
1463+
" assert \"The formula sets the intercept to False\" in str(e)\n",
1464+
" print(f\"Caught expected ValueError: {e}\")"
13631465
]
13641466
},
13651467
{
@@ -1374,7 +1476,14 @@
13741476
{
13751477
"cell_type": "code",
13761478
"execution_count": 14,
1377-
"metadata": {},
1479+
"metadata": {
1480+
"execution": {
1481+
"iopub.execute_input": "2026-03-16T16:10:50.306335Z",
1482+
"iopub.status.busy": "2026-03-16T16:10:50.306261Z",
1483+
"iopub.status.idle": "2026-03-16T16:10:52.806289Z",
1484+
"shell.execute_reply": "2026-03-16T16:10:52.805883Z"
1485+
}
1486+
},
13781487
"outputs": [
13791488
{
13801489
"data": {
@@ -1515,8 +1624,15 @@
15151624
},
15161625
{
15171626
"cell_type": "code",
1518-
"execution_count": 15,
1519-
"metadata": {},
1627+
"execution_count": null,
1628+
"metadata": {
1629+
"execution": {
1630+
"iopub.execute_input": "2026-03-16T16:10:52.807428Z",
1631+
"iopub.status.busy": "2026-03-16T16:10:52.807354Z",
1632+
"iopub.status.idle": "2026-03-16T16:10:54.748684Z",
1633+
"shell.execute_reply": "2026-03-16T16:10:54.748153Z"
1634+
}
1635+
},
15201636
"outputs": [
15211637
{
15221638
"data": {
@@ -1628,6 +1744,7 @@
16281744
"\n",
16291745
"t_glm9 = GeneralizedLinearRegressor(\n",
16301746
" family=TweedieDist,\n",
1747+
" \n",
16311748
" alpha_search=True,\n",
16321749
" l1_ratio=1,\n",
16331750
" fit_intercept=False,\n",
@@ -1661,9 +1778,8 @@
16611778
"name": "python",
16621779
"nbconvert_exporter": "python",
16631780
"pygments_lexer": "ipython3",
1664-
"version": "3.12.2"
1665-
},
1666-
"orig_nbformat": 4
1781+
"version": "3.14.3"
1782+
}
16671783
},
16681784
"nbformat": 4,
16691785
"nbformat_minor": 2
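
The hunk at execution count 13 drops the `pytest` import and replaces `pytest.raises(...)` with a plain `try`/`except` that asserts on the error message and prints it. A minimal, self-contained sketch of that pattern follows. `fit_model` and `check_expected_error` are hypothetical stand-ins invented for illustration, not part of `glum`, and the intercept check is a simplification of whatever the library does internally.

```python
def fit_model(formula: str, fit_intercept: bool = True) -> str:
    """Hypothetical stand-in for a model fit; this is NOT glum's API."""
    # Simplifying assumption: a trailing "- 1" means the formula drops the intercept.
    formula_drops_intercept = formula.replace(" ", "").endswith("-1")
    if formula_drops_intercept and fit_intercept:
        raise ValueError(
            "The formula sets the intercept to False, "
            "contradicting fit_intercept=True."
        )
    return "fitted"


def check_expected_error(formula: str) -> str:
    """Mirror the notebook's try/except replacement for pytest.raises."""
    try:
        fit_model(formula, fit_intercept=True)
        # Reached only if no exception was raised. AssertionError is not a
        # ValueError, so the except clause below does not swallow it.
        raise AssertionError("Expected ValueError was not raised")
    except ValueError as e:
        assert "The formula sets the intercept to False" in str(e)
        return f"Caught expected ValueError: {e}"


message = check_expected_error("PurePremium ~ DrivAge * VehPower - 1")
print(message)
```

Compared with `pytest.raises`, this form needs no test dependency at notebook run time, and the caught message appears in the cell output, which matches the new `stdout` stream recorded in the diff.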
