|
18 | 18 | "\n", |
19 | 19 | "This tutorial reimplements and extends the combined frequency-severity model from Chapter 4 of the [GLM tutorial](tutorials/glm_french_motor_tutorial/glm_french_motor.html). If you would like to know more about the setting, the data, or GLM modeling in general, please check that out first.\n", |
20 | 20 | "\n", |
21 | | - "**Sneak Peak**\n", |
| 21 | + "**Sneak Peek**\n", |
22 | 22 | "\n", |
23 | 23 | "Formulas can provide a concise and convenient way to specify many of the usual pre-processing steps, such as converting to categorical types, creating interactions, applying transformations, or even spline interpolation. As an example, consider the following formula:\n", |
24 | 24 | "\n", |
|
52 | 52 | { |
53 | 53 | "cell_type": "code", |
54 | 54 | "execution_count": 1, |
55 | | - "metadata": {}, |
| 55 | + "metadata": { |
| 56 | + "execution": { |
| 57 | + "iopub.execute_input": "2026-03-16T16:10:11.302339Z", |
| 58 | + "iopub.status.busy": "2026-03-16T16:10:11.302010Z", |
| 59 | + "iopub.status.idle": "2026-03-16T16:10:12.926504Z", |
| 60 | + "shell.execute_reply": "2026-03-16T16:10:12.926009Z" |
| 61 | + } |
| 62 | + }, |
56 | 63 | "outputs": [], |
57 | 64 | "source": [ |
58 | 65 | "import matplotlib.pyplot as plt\n", |
59 | 66 | "import numpy as np\n", |
60 | 67 | "import pandas as pd\n", |
61 | | - "import pytest\n", |
62 | 68 | "import scipy.optimize as optimize\n", |
63 | 69 | "import scipy.stats\n", |
64 | 70 | "from dask_ml.preprocessing import Categorizer\n", |
|
83 | 89 | { |
84 | 90 | "cell_type": "code", |
85 | 91 | "execution_count": 2, |
86 | | - "metadata": {}, |
| 92 | + "metadata": { |
| 93 | + "execution": { |
| 94 | + "iopub.execute_input": "2026-03-16T16:10:12.928178Z", |
| 95 | + "iopub.status.busy": "2026-03-16T16:10:12.927994Z", |
| 96 | + "iopub.status.idle": "2026-03-16T16:10:15.244947Z", |
| 97 | + "shell.execute_reply": "2026-03-16T16:10:15.244591Z" |
| 98 | + } |
| 99 | + }, |
87 | 100 | "outputs": [ |
88 | 101 | { |
89 | 102 | "data": { |
|
365 | 378 | "cell_type": "markdown", |
366 | 379 | "metadata": {}, |
367 | 380 | "source": [ |
368 | | - "## 2. Reproducing the Model From the GLM Turorial<a class=\"anchor\"></a>\n", |
| 381 | + "## 2. Reproducing the Model From the GLM Tutorial<a class=\"anchor\"></a>\n", |
369 | 382 | "\n", |
370 | 383 | "Now, let us start by fitting a very simple model. As usual, let's divide our samples into a training and a test set so that we get valid out-of-sample goodness-of-fit measures. Perhaps less usually, we do not create separate `y` and `X` data frames for our label and features – the formula will take care of that for us.\n", |
371 | 384 | "\n", |
|
379 | 392 | { |
380 | 393 | "cell_type": "code", |
381 | 394 | "execution_count": 3, |
382 | | - "metadata": {}, |
| 395 | + "metadata": { |
| 396 | + "execution": { |
| 397 | + "iopub.execute_input": "2026-03-16T16:10:15.259806Z", |
| 398 | + "iopub.status.busy": "2026-03-16T16:10:15.259694Z", |
| 399 | + "iopub.status.idle": "2026-03-16T16:10:15.471707Z", |
| 400 | + "shell.execute_reply": "2026-03-16T16:10:15.470808Z" |
| 401 | + } |
| 402 | + }, |
383 | 403 | "outputs": [], |
384 | 404 | "source": [ |
385 | 405 | "ss = ShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n", |
|
398 | 418 | "cell_type": "markdown", |
399 | 419 | "metadata": {}, |
400 | 420 | "source": [ |
401 | | - "This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different prefictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand." |
| 421 | + "This example demonstrates the basic idea behind formulas: the outcome variable and the predictors are separated by a tilde (`~`), and different predictors are separated by plus signs (`+`). Thus, formulas provide a concise way of specifying a model without the need to create dataframes by hand." |
402 | 422 | ] |
403 | 423 | }, |
404 | 424 | { |
405 | 425 | "cell_type": "code", |
406 | 426 | "execution_count": 4, |
407 | | - "metadata": {}, |
| 427 | + "metadata": { |
| 428 | + "execution": { |
| 429 | + "iopub.execute_input": "2026-03-16T16:10:15.473697Z", |
| 430 | + "iopub.status.busy": "2026-03-16T16:10:15.473584Z", |
| 431 | + "iopub.status.idle": "2026-03-16T16:10:22.769294Z", |
| 432 | + "shell.execute_reply": "2026-03-16T16:10:22.768802Z" |
| 433 | + } |
| 434 | + }, |
408 | 435 | "outputs": [ |
409 | 436 | { |
410 | 437 | "data": { |
|
539 | 566 | { |
540 | 567 | "cell_type": "code", |
541 | 568 | "execution_count": 5, |
542 | | - "metadata": {}, |
| 569 | + "metadata": { |
| 570 | + "execution": { |
| 571 | + "iopub.execute_input": "2026-03-16T16:10:22.770437Z", |
| 572 | + "iopub.status.busy": "2026-03-16T16:10:22.770347Z", |
| 573 | + "iopub.status.idle": "2026-03-16T16:10:22.819876Z", |
| 574 | + "shell.execute_reply": "2026-03-16T16:10:22.819390Z" |
| 575 | + } |
| 576 | + }, |
543 | 577 | "outputs": [ |
544 | 578 | { |
545 | 579 | "data": { |
546 | 580 | "text/plain": [ |
547 | 581 | "ClaimNb int64\n", |
548 | 582 | "Exposure float64\n", |
549 | | - "Area object\n", |
| 583 | + "Area str\n", |
550 | 584 | "VehPower int64\n", |
551 | 585 | "VehAge int64\n", |
552 | 586 | "DrivAge int64\n", |
553 | 587 | "BonusMalus int64\n", |
554 | | - "VehBrand object\n", |
555 | | - "VehGas object\n", |
| 588 | + "VehBrand str\n", |
| 589 | + "VehGas str\n", |
556 | 590 | "Density int64\n", |
557 | | - "Region object\n", |
| 591 | + "Region str\n", |
558 | 592 | "ClaimAmount float64\n", |
559 | 593 | "ClaimAmountCut float64\n", |
560 | 594 | "PurePremium float64\n", |
|
577 | 611 | "cell_type": "markdown", |
578 | 612 | "metadata": {}, |
579 | 613 | "source": [ |
580 | | - "Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas` would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a caetgorical variable, it does not have any effect outside of the feature name." |
| 614 | + "Even though some of the variables are integers in this dataset, they are handled as categoricals thanks to the `C()` function. Strings, such as `VehBrand` or `VehGas`, would have been handled as categorical by default anyway, but using the `C()` function never hurts: if applied to something that is already a categorical variable, it does not have any effect outside of the feature name." |
581 | 615 | ] |
582 | 616 | }, |
583 | 617 | { |
584 | 618 | "cell_type": "code", |
585 | 619 | "execution_count": 6, |
586 | | - "metadata": {}, |
| 620 | + "metadata": { |
| 621 | + "execution": { |
| 622 | + "iopub.execute_input": "2026-03-16T16:10:22.821622Z", |
| 623 | + "iopub.status.busy": "2026-03-16T16:10:22.821528Z", |
| 624 | + "iopub.status.idle": "2026-03-16T16:10:30.355393Z", |
| 625 | + "shell.execute_reply": "2026-03-16T16:10:30.355061Z" |
| 626 | + } |
| 627 | + }, |
587 | 628 | "outputs": [ |
588 | 629 | { |
589 | 630 | "data": { |
|
719 | 760 | { |
720 | 761 | "cell_type": "code", |
721 | 762 | "execution_count": 7, |
722 | | - "metadata": {}, |
| 763 | + "metadata": { |
| 764 | + "execution": { |
| 765 | + "iopub.execute_input": "2026-03-16T16:10:30.356740Z", |
| 766 | + "iopub.status.busy": "2026-03-16T16:10:30.356657Z", |
| 767 | + "iopub.status.idle": "2026-03-16T16:10:30.401890Z", |
| 768 | + "shell.execute_reply": "2026-03-16T16:10:30.401512Z" |
| 769 | + } |
| 770 | + }, |
723 | 771 | "outputs": [ |
724 | 772 | { |
725 | 773 | "data": { |
726 | 774 | "text/plain": [ |
727 | 775 | "array([303.77443311, 548.47789523, 244.34438579, ..., 109.81572865,\n", |
728 | | - " 67.98332028, 297.21717383])" |
| 776 | + " 67.98332028, 297.21717383], shape=(67802,))" |
729 | 777 | ] |
730 | 778 | }, |
731 | 779 | "execution_count": 7, |
|
743 | 791 | "source": [ |
744 | 792 | "## 4. Interactions and Structural Full-Rankness<a class=\"anchor\"></a>\n", |
745 | 793 | "\n", |
746 | | - "One of the biggest strengths of Wilkinson-formuals lie in their ability of concisely specifying interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. In general, just as with `glum`'s categorical handling in general, you can be assured that `glum` you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n", |
| 794 | + "One of the biggest strengths of Wilkinson formulas lies in their ability to concisely specify interactions between terms. `glum` implements this as well, and in a very efficient way: the interactions of categorical features are encoded as a new categorical feature, making it possible to interact high-cardinality categoricals with each other. If this is not possible, because, for example, a categorical is interacted with a numeric variable, sparse representations are used when appropriate. As with `glum`'s categorical handling in general, you don't have to worry too much about the actual implementation, and can expect that `glum` will do the most efficient thing behind the scenes.\n", |
747 | 795 | "\n", |
748 | 796 | "Let's see what that looks like on the insurance example! Suppose that we expect `VehPower` to have a different effect depending on `DrivAge` (e.g. performance cars might not be great for new drivers, but may be less problematic for more experienced ones). We can include the interaction of these variables as follows." |
749 | 797 | ] |
750 | 798 | }, |
751 | 799 | { |
752 | 800 | "cell_type": "code", |
753 | 801 | "execution_count": 8, |
754 | | - "metadata": {}, |
| 802 | + "metadata": { |
| 803 | + "execution": { |
| 804 | + "iopub.execute_input": "2026-03-16T16:10:30.403259Z", |
| 805 | + "iopub.status.busy": "2026-03-16T16:10:30.403170Z", |
| 806 | + "iopub.status.idle": "2026-03-16T16:10:38.273339Z", |
| 807 | + "shell.execute_reply": "2026-03-16T16:10:38.272842Z" |
| 808 | + } |
| 809 | + }, |
755 | 810 | "outputs": [ |
756 | 811 | { |
757 | 812 | "data": { |
|
891 | 946 | { |
892 | 947 | "cell_type": "code", |
893 | 948 | "execution_count": 9, |
894 | | - "metadata": {}, |
| 949 | + "metadata": { |
| 950 | + "execution": { |
| 951 | + "iopub.execute_input": "2026-03-16T16:10:38.274479Z", |
| 952 | + "iopub.status.busy": "2026-03-16T16:10:38.274413Z", |
| 953 | + "iopub.status.idle": "2026-03-16T16:10:38.276592Z", |
| 954 | + "shell.execute_reply": "2026-03-16T16:10:38.276208Z" |
| 955 | + } |
| 956 | + }, |
895 | 957 | "outputs": [ |
896 | 958 | { |
897 | 959 | "data": { |
|
976 | 1038 | " 2. Include the logarithm of a certain variable in the model.\n", |
977 | 1039 | " 3. Include a basis spline interpolation of a variable to capture non-linearities in its effect.\n", |
978 | 1040 | "\n", |
979 | | - "1\\. works because because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n", |
| 1041 | + "1\\. works because formulas can contain [Python operations](https://matthewwardrop.github.io/formulaic/guides/grammar/). 2. and 3. work because formulas are evaluated within a context that is aware of a number of [transforms](https://matthewwardrop.github.io/formulaic/guides/transforms/). To be precise, 2. is a regular transform and 3. is a stateful transform.\n", |
980 | 1042 | "\n", |
981 | 1043 | "Let's try it out!" |
982 | 1044 | ] |
983 | 1045 | }, |
984 | 1046 | { |
985 | 1047 | "cell_type": "code", |
986 | 1048 | "execution_count": 10, |
987 | | - "metadata": {}, |
| 1049 | + "metadata": { |
| 1050 | + "execution": { |
| 1051 | + "iopub.execute_input": "2026-03-16T16:10:38.277724Z", |
| 1052 | + "iopub.status.busy": "2026-03-16T16:10:38.277661Z", |
| 1053 | + "iopub.status.idle": "2026-03-16T16:10:47.196866Z", |
| 1054 | + "shell.execute_reply": "2026-03-16T16:10:47.196455Z" |
| 1055 | + } |
| 1056 | + }, |
988 | 1057 | "outputs": [ |
989 | 1058 | { |
990 | 1059 | "data": { |
|
1118 | 1187 | { |
1119 | 1188 | "cell_type": "code", |
1120 | 1189 | "execution_count": 11, |
1121 | | - "metadata": {}, |
| 1190 | + "metadata": { |
| 1191 | + "execution": { |
| 1192 | + "iopub.execute_input": "2026-03-16T16:10:47.197980Z", |
| 1193 | + "iopub.status.busy": "2026-03-16T16:10:47.197903Z", |
| 1194 | + "iopub.status.idle": "2026-03-16T16:10:47.783163Z", |
| 1195 | + "shell.execute_reply": "2026-03-16T16:10:47.782765Z" |
| 1196 | + } |
| 1197 | + }, |
1122 | 1198 | "outputs": [ |
1123 | 1199 | { |
1124 | 1200 | "data": { |
|
1201 | 1277 | "source": [ |
1202 | 1278 | "### Variable Names\n", |
1203 | 1279 | "\n", |
1204 | | - "`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` paremeters." |
| 1280 | + "`glum`'s formula interface provides a lot of control over how the resulting features are named. By default, it follows `formulaic`'s standards, but it can be customized by setting the `interaction_separator` and `categorical_format` parameters." |
1205 | 1281 | ] |
1206 | 1282 | }, |
1207 | 1283 | { |
1208 | 1284 | "cell_type": "code", |
1209 | 1285 | "execution_count": 12, |
1210 | | - "metadata": {}, |
| 1286 | + "metadata": { |
| 1287 | + "execution": { |
| 1288 | + "iopub.execute_input": "2026-03-16T16:10:47.784411Z", |
| 1289 | + "iopub.status.busy": "2026-03-16T16:10:47.784343Z", |
| 1290 | + "iopub.status.idle": "2026-03-16T16:10:50.289682Z", |
| 1291 | + "shell.execute_reply": "2026-03-16T16:10:50.289219Z" |
| 1292 | + } |
| 1293 | + }, |
1211 | 1294 | "outputs": [ |
1212 | 1295 | { |
1213 | 1296 | "data": { |
|
1345 | 1428 | { |
1346 | 1429 | "cell_type": "code", |
1347 | 1430 | "execution_count": 13, |
1348 | | - "metadata": {}, |
1349 | | - "outputs": [], |
| 1431 | + "metadata": { |
| 1432 | + "execution": { |
| 1433 | + "iopub.execute_input": "2026-03-16T16:10:50.290873Z", |
| 1434 | + "iopub.status.busy": "2026-03-16T16:10:50.290798Z", |
| 1435 | + "iopub.status.idle": "2026-03-16T16:10:50.305162Z", |
| 1436 | + "shell.execute_reply": "2026-03-16T16:10:50.304698Z" |
| 1437 | + } |
| 1438 | + }, |
| 1439 | + "outputs": [ |
| 1440 | + { |
| 1441 | + "name": "stdout", |
| 1442 | + "output_type": "stream", |
| 1443 | + "text": [ |
| 1444 | + "Caught expected ValueError: The formula sets the intercept to False, contradicting fit_intercept=True. You should use fit_intercept to specify the intercept.\n" |
| 1445 | + ] |
| 1446 | + } |
| 1447 | + ], |
1350 | 1448 | "source": [ |
1351 | 1449 | "formula_noint = \"PurePremium ~ DrivAge * VehPower - 1\"\n", |
1352 | 1450 | "\n", |
1353 | | - "with pytest.raises(ValueError, match=\"The formula sets the intercept to False\"):\n", |
| 1451 | + "try:\n", |
1354 | 1452 | " t_glm8 = GeneralizedLinearRegressor(\n", |
1355 | 1453 | " family=TweedieDist,\n", |
1356 | 1454 | " alpha_search=True,\n", |
|
1359 | 1457 | " formula=formula_noint,\n", |
1360 | 1458 | " interaction_separator=\"__x__\",\n", |
1361 | 1459 | " categorical_format=\"{name}__{category}\",\n", |
1362 | | - " ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])" |
| 1460 | + " ).fit(df_train, sample_weight=df[\"Exposure\"].values[train])\n", |
| 1461 | + " raise AssertionError(\"Expected ValueError was not raised\")\n", |
| 1462 | + "except ValueError as e:\n", |
| 1463 | + " assert \"The formula sets the intercept to False\" in str(e)\n", |
| 1464 | + " print(f\"Caught expected ValueError: {e}\")" |
1363 | 1465 | ] |
1364 | 1466 | }, |
1365 | 1467 | { |
|
1374 | 1476 | { |
1375 | 1477 | "cell_type": "code", |
1376 | 1478 | "execution_count": 14, |
1377 | | - "metadata": {}, |
| 1479 | + "metadata": { |
| 1480 | + "execution": { |
| 1481 | + "iopub.execute_input": "2026-03-16T16:10:50.306335Z", |
| 1482 | + "iopub.status.busy": "2026-03-16T16:10:50.306261Z", |
| 1483 | + "iopub.status.idle": "2026-03-16T16:10:52.806289Z", |
| 1484 | + "shell.execute_reply": "2026-03-16T16:10:52.805883Z" |
| 1485 | + } |
| 1486 | + }, |
1378 | 1487 | "outputs": [ |
1379 | 1488 | { |
1380 | 1489 | "data": { |
|
1515 | 1624 | }, |
1516 | 1625 | { |
1517 | 1626 | "cell_type": "code", |
1518 | | - "execution_count": 15, |
1519 | | - "metadata": {}, |
| 1627 | + "execution_count": null, |
| 1628 | + "metadata": { |
| 1629 | + "execution": { |
| 1630 | + "iopub.execute_input": "2026-03-16T16:10:52.807428Z", |
| 1631 | + "iopub.status.busy": "2026-03-16T16:10:52.807354Z", |
| 1632 | + "iopub.status.idle": "2026-03-16T16:10:54.748684Z", |
| 1633 | + "shell.execute_reply": "2026-03-16T16:10:54.748153Z" |
| 1634 | + } |
| 1635 | + }, |
1520 | 1636 | "outputs": [ |
1521 | 1637 | { |
1522 | 1638 | "data": { |
|
1628 | 1744 | "\n", |
1629 | 1745 | "t_glm9 = GeneralizedLinearRegressor(\n", |
1630 | 1746 | " family=TweedieDist,\n", |
| 1747 | + " \n", |
1631 | 1748 | " alpha_search=True,\n", |
1632 | 1749 | " l1_ratio=1,\n", |
1633 | 1750 | " fit_intercept=False,\n", |
|
1661 | 1778 | "name": "python", |
1662 | 1779 | "nbconvert_exporter": "python", |
1663 | 1780 | "pygments_lexer": "ipython3", |
1664 | | - "version": "3.12.2" |
1665 | | - }, |
1666 | | - "orig_nbformat": 4 |
| 1781 | + "version": "3.14.3" |
| 1782 | + } |
1667 | 1783 | }, |
1668 | 1784 | "nbformat": 4, |
1669 | 1785 | "nbformat_minor": 2 |
|