feat: report file name of file that chardet fails to read#3524
feat: report file name of file that chardet fails to read#3524corneliusroemer wants to merge 8 commits intocodespell-project:mainfrom
Conversation
resolves codespell-project#3519 Tested and it works now, reporting the file name: ``` codespell --write-changes -i3 -C 5 -H -f -e --count -s --builtin clear,rare,names Failed to decode file ./pep_sphinx_extensions/tests/pep_lint/test_pep_number.py using detected encoding Windows-1254. Traceback (most recent call last): File "/Users/corneliusromer/micromamba/envs/codespell/bin/codespell", line 8, in <module> sys.exit(_script_main()) ^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 1103, in _script_main return main(*sys.argv[1:]) ^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 1300, in main bad_count += parse_file( ^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 945, in parse_file lines, encoding = file_opener.open(filename) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 232, in open return self.open_with_chardet(filename) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 246, in open_with_chardet lines = self.get_lines(f) ^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 303, in get_lines lines = f.readlines() ^^^^^^^^^^^^^ File "/Users/corneliusromer/micromamba/envs/codespell/lib/python3.12/encodings/cp1254.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1349: character maps to <undefined> ```
| raise | ||
| else: | ||
| lines = self.get_lines(f) | ||
| f.close() |
There was a problem hiding this comment.
To minimize changes, I would suggest:
try:
f = open(filename, encoding=encoding, newline="")
except LookupError:
print(
f"ERROR: Don't know how to handle encoding {encoding}: {filename}",
file=sys.stderr,
)
raise
else:
try:
lines = f.readlines()
except UnicodeDecodeError:
print(f"ERROR: Could not detect encoding: {filename}", file=sys.stderr)
raise
finally:
f.close()There was a problem hiding this comment.
Why do we need to minimize changes? I'm happy to go with whatever you want but let me try to convince you (with Zen of Python):
- "There should be an obvious way to do it": context managers are the way one should open files. Not use finally, it's messy
- "Flat is better than nested": Your suggestion has loads of try/else/try/finally nesting, it's hard to grok
There was a problem hiding this comment.
Do you really think this is not much more readable?
try:
with open(filename, encoding=encoding, newline="") as f:
lines = self.get_lines(f)
except LookupError: # Raised by open() if encoding is unknown
error_msg = f"ERROR: Chardet returned unknown encoding for: {filename}."
print(error_msg, file=sys.stderr)
raise
except UnicodeDecodeError: # Raised by self.get_lines() if decoding fails
error_msg = f"ERROR: Failed decoding file: {filename}"
print(error_msg, file=sys.stderr)
raiseAlso note that you introduced a bug by replacing self.get_lines(f) with f.readlines() in your suggestion 🙈
There was a problem hiding this comment.
- I agree that in general flat is better, but I would also like to limit the code under
tryto the strict minimum, as a way to document which exceptions each piece of code is expected to raise. Some linters do enforce that.openonly raisesLookupErrorreadlinesonly raisesUnicodeDecodeError
- I'd like to keep the codebase consistent. See Fix uncaught exception on empty files #2195.
There was a problem hiding this comment.
I do agree with using context managers to open files. I just didn't know how to make it compatible with try.
There was a problem hiding this comment.
I agree, nice to keep try local, but it's a tradeoff with nesting. I commented instead to make clear what raises what.
There was a problem hiding this comment.
Can you be more specific regarding your "I would like to keep codebase consistent"? I don't know what exactly you would like me to change. Is anything inconsistent?
There was a problem hiding this comment.
open_with_chardet and open_with_internal (already fixed in #2195) should be kept as similar as possible and even share code if possible at all.
There was a problem hiding this comment.
Thanks for the clarification!
…nges. We require the type info, otherwise mypy fails
|
I've added tests because codecov failed otherwise. Not sure these are super important but what's done is done! Learned something about testing on the way - and mocking! |
|
I am happy with codecov failing on exceptions, but I don't have rights to merge when CI tests fail. With that said, adding tests looks like the best option. |
|
Indeed, I will make an issue there if I remember
…On Tue, Aug 20, 2024, 09:50 Dimitri Papadopoulos Orfanos < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In codespell_lib/_codespell.py
<#3524 (comment)>
:
>
- return lines, f.encoding
+ return lines, encoding
Indeed. The chardet documentation should describe the return values in
more detail.
—
Reply to this email directly, view it on GitHub
<#3524 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF77AQNAITVX26F5UWGKDP3ZSLYM5AVCNFSM6AAAAABMWRLYDWVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENBXGIZDSNBYG4>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
Could you rebase? |
resolves #3519
Tested and it works now, reporting the file name: