Skip to content

Update word2vec.ipynb#2423

Merged
mpenkov merged 1 commit into
piskvorky:developfrom
asyabo:patch-1
Apr 15, 2019
Merged

Update word2vec.ipynb#2423
mpenkov merged 1 commit into
piskvorky:developfrom
asyabo:patch-1

Conversation

@asyabo
Copy link
Copy Markdown
Contributor

@asyabo asyabo commented Mar 19, 2019

This code sample doesn't work with 'rb' mode. The error would be "TypeError: can't concat str to bytes".

@mpenkov
Copy link
Copy Markdown
Collaborator

mpenkov commented Apr 15, 2019

Good catch! Congrats on your first contribution to gensim 👍

@mpenkov mpenkov merged commit ff107d6 into piskvorky:develop Apr 15, 2019
" def __iter__(self):\n",
" for fname in os.listdir(self.dirname):\n",
" for line in smart_open(os.path.join(self.dirname, fname), 'rb'):\n",
" for line in smart_open(os.path.join(self.dirname, fname), 'r'):\n",
Copy link
Copy Markdown
Owner

@piskvorky piskvorky Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use an explicit encoding. Opening a file in "text mode", without specifying the encoding, should always be considered an error.

There are two ways to do this:

  1. either open in binary mode and then decode the strings; or
  2. open in text mode but provide an explicit encoding parameter.

The latter is only available in Python 3, IIRC.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening a file in "text mode", without specifying the encoding, should always be considered an error.

Why should it be considered an error? There is a default encoding that works the vast majority of the time.

The latter is only available in Python 3, IIRC.

We're using smart_open here, so it doesn't matter which Python version we're running.

Copy link
Copy Markdown
Owner

@piskvorky piskvorky Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, we back-ported the encoding parameter into python2, in smart_open? That's cool, I didn't know (or forgot).

Why should it be considered an error? There is a default encoding that works the vast majority of the time.

It's an error because it consistently leads to errors and subtle bugs (in addition to being unnecessarily vague about the code intent). We never want to rely on any "default encodings", nor encourage such engineering patterns in our examples or own code.
If a piece of code is expecting utf8 (or ascii or …), it should say so.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are dozens of places where we invoke smart_open without specifying the encoding to open text. Should we go through the docs and explicitly specify the encoding in each case?

Copy link
Copy Markdown
Owner

@piskvorky piskvorky Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! Same goes for Gensim or any other of our code or tutorials. I'm not sure when / how such bugs slipped through, but if you find any, please squash them without mercy!

Is it possible that they're somehow related to the previous behaviour of smart_open as "binary by default", which was changed only recently to match the built-in open?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because we never updated the documentation to use the new open function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants