Update dtmmodel.py#806
Adds a check for empty documents in the supplied corpus. An empty document breaks the DIM model without giving the user a usable error description.
Yeah, this is a documented problem with the DIM code. This should be useful.
lencorpus = sum(1 for _ in corpus)
if lencorpus == 0:
    raise ValueError("cannot compute DTM over an empty corpus")
if any([i == 0 for i in [len(text) for text in corpus.get_texts()]]):
Should the condition also include and model == 'fixed'?
@Eickho Thanks for the PR. Please add a changelog and a check to only affect
@Eickho will you be taking this up?
Adds a check for empty documents (containing not a single word) in the corpus supplied to the DTM implementation. The check runs only if the DIM mode (model = "fixed") is used.
lencorpus = sum(1 for _ in corpus)
if lencorpus == 0:
    raise ValueError("cannot compute DTM over an empty corpus")
if any([i == 0 for i in [len(text) for text in corpus.get_texts()]]) and model == "fixed":
@Eickho, wouldn't it be better to write if model == "fixed" and any([i == 0 for i in [len(text) for text in corpus.get_texts()]]):
so that the corpus is not scanned for empty documents at all when the model isn't in fixed mode?
Yes that would be better, updating.
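The reordering relies on Python's short-circuit evaluation of `and`: when the left operand is false, the right operand is never evaluated. Below is a minimal sketch of that behaviour. The helper `has_empty_document` and the `calls` counter are hypothetical names used only for illustration and are not part of the PR.

```python
def has_empty_document(texts, calls):
    """Return True if any document has no tokens.

    `calls` is a list used only to record that this (potentially
    expensive) scan was actually executed.
    """
    calls.append(1)
    return any(not text for text in texts)


# Toy corpus: the second document is empty.
texts = [["topic", "model"], [], ["dynamic"]]
calls = []

# Not in fixed mode: the left operand is False, so the scan is skipped.
model = "dtm"
flag_dtm = model == "fixed" and has_empty_document(texts, calls)
calls_after_dtm = len(calls)

# In fixed mode: the left operand is True, so the scan runs once.
model = "fixed"
flag_fixed = model == "fixed" and has_empty_document(texts, calls)
calls_after_fixed = len(calls)
```

With the original ordering the scan would run in both cases, even though its result is discarded whenever the model is not "fixed".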
Reversed the order of the conditions as proposed by @bhargavvader
Added description of check for empty (no words) documents in the DIM mode of the DTM wrapper.
Added issue no.
lencorpus = sum(1 for _ in corpus)
if lencorpus == 0:
    raise ValueError("cannot compute DTM over an empty corpus")
if model == "fixed" and any([i == 0 for i in [len(text) for text in corpus.get_texts()]]):
Replace with any(not text for text in corpus.get_texts()) (more Pythonic). This looks unnecessarily complicated.
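To illustrate the suggestion, here is a small sketch comparing the two forms. It assumes each document is a sized sequence such as a list of tokens, in which case `not text` and `len(text) == 0` agree. The function names and the `stream` helper are made up for this example.

```python
def has_empty_listcomp(texts):
    # Form from the diff: builds two full intermediate lists.
    return any([i == 0 for i in [len(text) for text in texts]])


def has_empty_generator(texts):
    # Suggested form: a generator expression, no intermediate lists.
    return any(not text for text in texts)


samples = [
    [["a", "b"], ["c"]],   # no empty document
    [["a"], [], ["b"]],    # one empty document
    [[]],                  # only an empty document
    [],                    # no documents at all
]
# Each pair holds (list-comprehension result, generator result).
results = [(has_empty_listcomp(s), has_empty_generator(s)) for s in samples]

# The generator form also stops at the first empty document.
consumed = []


def stream(texts):
    """Yield documents one by one, recording each one that is read."""
    for text in texts:
        consumed.append(text)
        yield text


found = has_empty_generator(stream([["a"], [], ["b"], ["c"]]))
```

Because `any` returns as soon as it sees a true value, the generator form reads only the first two documents of the last stream, whereas the list-based form always materialises the whole corpus first.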
Adds a check for empty documents in the supplied corpus. An empty document breaks the DIM model without giving the user a usable error description. Note that this only seems to cause errors with the DIM model, not with the DTM model. However, I don't think there is any upside to having empty documents in the DTM anyway?
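Put together, the behaviour described above can be sketched as a standalone function. This is an illustration only, not the wrapper's actual code: `TinyCorpus`, `validate_corpus`, and the second error message are assumptions made for this example, and only the first error message is taken from the diff.

```python
class TinyCorpus:
    """Hypothetical stand-in for a corpus that exposes get_texts()."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        return iter(self.texts)

    def get_texts(self):
        return iter(self.texts)


def validate_corpus(corpus, model):
    """Raise ValueError early instead of letting the model fail obscurely."""
    lencorpus = sum(1 for _ in corpus)
    if lencorpus == 0:
        raise ValueError("cannot compute DTM over an empty corpus")
    # Only the DIM mode (model == "fixed") is reported to break on
    # empty documents, so the scan is skipped otherwise.
    if model == "fixed" and any(not text for text in corpus.get_texts()):
        # Assumed wording; the PR's actual message is not shown here.
        raise ValueError("DIM mode requires every document to contain at least one word")
    return lencorpus
```

Calling `validate_corpus(TinyCorpus([["a"], []]), "dtm")` returns the document count, while the same corpus with `model="fixed"` raises a `ValueError` that names the problem.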