Update word2vec.ipynb#2423
Conversation
|
Good catch! Congrats on your first contribution to gensim 👍 |
| " def __iter__(self):\n", | ||
| " for fname in os.listdir(self.dirname):\n", | ||
| " for line in smart_open(os.path.join(self.dirname, fname), 'rb'):\n", | ||
| " for line in smart_open(os.path.join(self.dirname, fname), 'r'):\n", |
There was a problem hiding this comment.
Please use an explicit encoding. Opening a file in "text mode", without specifying the encoding, should always be considered an error.
There are two ways to do this:
- either open in binary mode and then decode the strings; or
- open in text mode but provide an explicit
encodingparameter.
The latter is only available in Python 3, IIRC.
There was a problem hiding this comment.
Opening a file in "text mode", without specifying the encoding, should always be considered an error.
Why should it be considered an error? There is a default encoding that works the vast majority of the time.
The latter is only available in Python 3, IIRC.
We're using smart_open here, so it doesn't matter which Python version we're running.
There was a problem hiding this comment.
Oh, we back-ported the encoding parameter into python2, in smart_open? That's cool, I didn't know (or forgot).
Why should it be considered an error? There is a default encoding that works the vast majority of the time.
It's an error because it consistently leads to errors and subtle bugs (in addition to being unnecessarily vague about the code intent). We never want to rely on any "default encodings", nor encourage such engineering patterns in our examples or own code.
If a piece of code is expecting utf8 (or ascii or …), it should say so.
There was a problem hiding this comment.
There are dozens of places where we invoke smart_open without specifying the encoding to open text. Should we go through the docs and explicitly specify the encoding in each case?
There was a problem hiding this comment.
Definitely! Same goes for Gensim or any other of our code or tutorials. I'm not sure when / how such bugs slipped through, but if you find any, please squash them without mercy!
Is it possible that they're somehow related to the previous behaviour of smart_open as "binary by default", which was changed only recently to match the built-in open?
There was a problem hiding this comment.
No, because we never updated the documentation to use the new open function.
This code sample doesn't work with 'rb' mode. The error would be "TypeError: can't concat str to bytes".