Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
|
Hi @crisely09 , thank you for your contribution! |
I have added the example metadata into the datasets folder. I am not sure how to generate the output folder. |
|
Thanks! I'll review later on today. |
|
I have noticed that the way the json is loaded is suuuuper slow, I am trying something to accelerate the Reading of Json files when jsonPath is used. |
|
OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows. |
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Outdated
Show resolved
Hide resolved
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Outdated
Show resolved
Hide resolved
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
Yeah, mypy is annoying :S So the logs point to: For the formatting error, have you tried runnin black with the same specifications ( |
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
| """ | ||
| # raw JSON fallback: one‐cell DataFrame | ||
| fh.seek(0) | ||
| content = json.load(fh) |
There was a problem hiding this comment.
I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
|
Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great! I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed? |
I went back to the logs, and the errors seem to be related to files I haven't modified, |
Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look. |
|
Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look? |
Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace! |
|
Hi @crisely09 , the mypy errors were due to a new version of mypy, and were unrelated to your changes, as you already pointed out (the CI was failing since a few weeks anyways https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) ) |
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Outdated
Show resolved
Hide resolved
|
|
||
| # Load entire JSON file (could be a list or a single dict). | ||
| raw = fh.read() | ||
| data = orjson.loads(raw) |
There was a problem hiding this comment.
You can see here an example of how to lazily load a library: 4fbd358
I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic. |
Thank you for the review!! |
|
Hi @ccl-core , There is something I would like to discuss with you, I made a change in the way the bounding boxes are loaded, in a way that one RecordSet contains all bounding boxes from the same contentUrl, this makes more sense to me than having one RecordSet per bounding box. If we can have a chat on zoom/meet/teams it would be much better. |
marcenacp
left a comment
There was a problem hiding this comment.
Thanks for your contribution! I have a few clarification questions about why we need to introduce new dependencies and how we could simplify the parsers/readers.
| # simple JSONPath → JMESPath | ||
| jm = json_path.lstrip("$.") # drop the "$." | ||
| expr = jmespath.compile(jm) | ||
| engine = "jmespath" |
There was a problem hiding this comment.
OoC, why do we need both JSONPath and JMESPath? Why not use JSONPath everywhere?
There was a problem hiding this comment.
It's a performance optimization. Here's the reasoning:
JMESPath is much faster for simple paths like
jsonpath_rw supports the full JSONPath spec, including recursive descent (..) and complex filters, but it's significantly slower because it walks the entire JSON tree.
The code at line 166 routes between them:
Simple paths ($.foo.bar, no ..) → JMESPath (fast)
Complex paths with .. or filters → jsonpath_rw (full-featured, slower)
For FHIR files which can have thousands of NDJSON lines, using JMESPath for simple extractions makes a big difference in throughput. jsonpath_rw is kept as a fallback for paths that JMESPath can't express.
| EncodingFormat.FHIR, | ||
| ): | ||
| # JSON_LINES and FHIR do the same thing | ||
| reader = JsonlReader(self.fields) |
There was a problem hiding this comment.
Implementing our own readers/parsers looks scary because it can quickly become a rabbit hole of bugs! So I have a few questions:
- Is there an existing reader for FHIR files or do we really have to do it manually? Is fhir.resources an option?
- Can we handle it as nested JSON, e.g. using
pd.json_normalize?
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Outdated
Show resolved
Hide resolved
| """Module to manage "bounding boxes" annotations on images.""" | ||
|
|
||
| from typing import Any | ||
| from typing import Any, List, Union |
There was a problem hiding this comment.
Can you please split the PR into 2 PRs:
- One for FHIR files (this one)
- One for bounding boxes
?
There was a problem hiding this comment.
Done, I kept only the FHIR related changes
… handling libraries
We would like to use Croissant recordsets to read FHIR (nested JSON Lines), wildly used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.