[MRG] EHN refactoring of the ratio argument.#413
glemaitre merged 24 commits into scikit-learn-contrib:master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #413 +/- ##
==========================================
- Coverage 98.77% 98.71% -0.07%
==========================================
Files 68 70 +2
Lines 4014 4188 +174
==========================================
+ Hits 3965 4134 +169
- Misses 49 54 +5
Continue to review full report at Codecov.
Hello @glemaitre! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 27, 2018 at 15:54 UTC
@jorisvandenbossche I would like to have your feedback as well. You can check the documentation of those three classes, which is representative of the full PR:
    behaviour.
    The parameter ``sampling_target`` control which sample of the link will be
    removed. For instance, the default (i.e., ``sampling_target='auto'``) will
    remove the sample from the majority class. Both samples from the majority and
I would change "remove the sample from" to "remove samples from". Also, I don't really understand what "control which sample of the link will be removed" means; "link" confuses me.
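For context, the "link" in the quoted documentation refers to a Tomek link: a pair of mutual nearest neighbours that belong to different classes. A minimal sketch of the idea (not imbalanced-learn's implementation; the function name and data are illustrative):

```python
def find_tomek_links(X, y):
    # A Tomek link is a pair (i, j) of mutual nearest neighbours whose
    # class labels differ; cleaning methods remove one or both samples.
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    n = len(X)
    # nearest neighbour index of every sample
    nn = [min((j for j in range(n) if j != i),
              key=lambda j, i=i: dist2(X[i], X[j]))
          for i in range(n)]
    return sorted({tuple(sorted((i, nn[i])))
                   for i in range(n)
                   if nn[nn[i]] == i and y[i] != y[nn[i]]})

X = [[0.0], [1.0], [5.0], [6.0]]
y = [0, 1, 0, 1]
print(find_tomek_links(X, y))  # [(0, 1), (2, 3)]
```

With ``sampling_target='auto'``, only the sample of the pair belonging to the majority class would be removed.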
    @@ -0,0 +1,241 @@
    """
    ======================================================================
    Usage of the ``sampling_target`` parameter for the different algorithm
How to use the sampling_target parameter (depending on the sampling strategy)
jorisvandenbossche left a comment
Big diff so didn't yet look at everything, but:

- given how many times you repeat the explanation in the docstring, it might be worth looking at a way to share this to avoid repetition
- I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: ``sampling_strategy``, ``sampling_protocol``
    minority class after resampling and the number of samples in the
    majority class, respectively.

    .. warning::
if you indent this two spaces, then it is included in the list (which is better, I think)
    sampling_target : float, str, dict or callable, (default='auto')
        Sampling information to resample the data set.

        - When ``float``, it correspond to the ratio :math:`\\alpha_{os}`

        ``'minority'``: resample only the minority class;

        ``'majority'``: resample only the majority class;
Since this is a RandomOversampler, does 'majority' make any sense?
    ``'auto'``: equivalent to ``'not minority'``.

    - When ``list``, the list contains the targeted classes.
It is not clear to me what this does.
    - If ``dict``, the keys correspond to the targeted classes. The values
      correspond to the desired number of samples.
    - If callable, function taking ``y`` and returns a ``dict``. The keys

    sampling_target : float, str, dict, callable, (default='auto')
    plot_pie(y)

    ###########################################################################
    # Using ``sampling_target`` in resampling algorithm

    print('Information of the iris data set after making it'
          ' imbalanced using a callable: \n sampling_target={} \n y: {}'
          .format(sampling_target, Counter(y)))
sampling_target is from the previous example
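The callable form used in that example can be sketched as follows; the factory function and the multiplier values are illustrative, not taken from the PR:

```python
from collections import Counter

def make_ratio_callable(multiplier):
    # Hypothetical factory: builds a callable usable as ``sampling_target``.
    # Given ``y``, it returns a dict asking to keep ``multiplier[c]`` times
    # the original count of each class ``c``.
    def ratio(y):
        return {cls: int(multiplier[cls] * n)
                for cls, n in Counter(y).items()}
    return ratio

sampling_target = make_ratio_callable({0: 0.5, 1: 0.8})
y = [0] * 10 + [1] * 10
print(sampling_target(y))  # {0: 5, 1: 8}
```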
    binary_mask = np.bitwise_or(y == 0, y == 2)
    binary_y = y[binary_mask]
    binary_X = X[binary_mask]
Can you show the counter of the data? Then you can compare the numbers after resampling.
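A before/after comparison along the lines the reviewer asks for could look like this (the class labels and counts are stand-ins for the binary iris subset built above):

```python
from collections import Counter

y_binary = [0] * 50 + [2] * 50   # stand-in for the binary iris subset
print('Before resampling:', sorted(Counter(y_binary).items()))
# -> Before resampling: [(0, 50), (2, 50)]
# After fitting a sampler, printing Counter(y_resampled) the same way
# makes the effect of ``sampling_target`` directly comparable.
```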
    #
    # ``sampling_target`` can be given as a string which specify the class targeted
    # by the resampling. With under- and over-sampling, the number of samples will
    # be equalized.
Emphasize that you are no longer using the binary data.
    fig, ax = plt.subplots()
    ax.pie(sizes, explode=explode, labels=labels, shadow=True,
           autopct='%1.1f%%')
Would it be possible to add both absolute numbers and percentages?
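One common way to do this with matplotlib is to pass a function as ``autopct`` instead of a format string; ``make_autopct`` below is a hypothetical helper, not part of the example under review:

```python
def make_autopct(sizes):
    # Hypothetical helper: returns an ``autopct`` formatter for ax.pie()
    # that shows both the percentage and the absolute count of each wedge.
    total = sum(sizes)

    def autopct(pct):
        count = int(round(pct * total / 100.0))
        return '{0:.1f}%\n({1:d})'.format(pct, count)

    return autopct

# usage sketch:
#   ax.pie(sizes, explode=explode, labels=labels, shadow=True,
#          autopct=make_autopct(sizes))
print(make_autopct([25, 75])(25.0))  # 25.0% and (25) over two lines
```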
Thanks @jorisvandenbossche. Regarding the repetition, do you have something in mind? If it comes back to injecting the proper docstring, I am not sure how to do that. If you are thinking about a glossary, it could be cool, but the issue is that we would have a docstring which is generic for all over-, under-, and cleaning-samplers. What I mean is: somehow the user needs to know what they are using and select the proper explanation, which was something that I wanted to avoid from the previous things that we had.
Lack of time here to review this PR, but I have two comments to make.
I had the same idea in #241.
I believe that words like strategy and protocol are very nice, even on their own.
I think that pre-adding sampling is not harmful. I can imagine the case of a meta-estimator using an estimator from scikit-learn which uses the
I still have the concern that it makes it a bit more difficult to contribute at first, but in the end we ensure documentation quality. So I am inclined to admit that I was wrong :)
Ok, so @massich @jorisvandenbossche @chkoar, if you have any other remarks regarding the API, it would be nice to hear them. Regarding the examples, I want to make them better in a follow-up PR.
Sorry @glemaitre, my bad. I never referred to the class docstrings; maybe I haven't said that explicitly. I actually said that for the
Actually, good point. We can do that in another PR when the substitution class is merged.
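The docstring-sharing idea discussed above can be sketched with a small substitution decorator. This is an illustration written for this review, not the class that was eventually merged; the names are hypothetical:

```python
class Substitution:
    # Minimal sketch of a docstring-substitution decorator: placeholders in
    # a docstring are filled in from keyword arguments, so the shared
    # ``sampling_target`` description is written only once.
    def __init__(self, **kwargs):
        self.params = kwargs

    def __call__(self, obj):
        if obj.__doc__:
            obj.__doc__ = obj.__doc__.format(**self.params)
        return obj

_sampling_target_doc = (
    "sampling_target : float, str, dict or callable, (default='auto')\n"
    "        Sampling information to resample the data set.")

@Substitution(sampling_target=_sampling_target_doc)
def some_sampler():
    """Illustrative sampler.

    Parameters
    ----------
    {sampling_target}
    """

print('Sampling information' in some_sampler.__doc__)  # True
```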
Reference Issue
closes #411
closes #406
What does this implement/fix? Explain your changes.
TODO

- ``float``: check that the ratio will not over-sample or under-sample when it should not.
- ``sampling_strategy``

Any other comments?