Add EncodingFormat for FHIR files by crisely09 · Pull Request #883 · mlcommons/croissant

crisely09 · 2025-05-28T09:54:25Z

We would like to use Croissant recordsets to read FHIR (nested JSON Lines), widely used in the medical sector.
This PR is an "easy" approach to enable support for the FHIR (application/fhir+json) encoding format.

github-actions · 2025-05-28T09:54:38Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ccl-core · 2025-05-28T10:05:48Z

Hi @crisely09 , thank you for your contribution!
To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

crisely09 · 2025-05-28T10:31:39Z

I have added the example metadata into the datasets folder. I am not sure how to generate the output folder.
Also, I don't know what the format error I am getting in the read.py file is.
Thanks a lot for your help.

ccl-core · 2025-05-28T10:36:50Z

Thanks! I'll review later on today.
You can generate the output records using this script: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/mlcroissant/scripts/load.py

crisely09 · 2025-05-28T10:57:57Z

I have noticed that the way the JSON is loaded is super slow, so I am trying something to speed up the reading of JSON files when jsonPath is used.

crisely09 · 2025-05-28T13:32:02Z

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

ccl-core · 2025-05-29T21:52:32Z

Yeah, mypy is annoying :S So the logs point to:
mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment]
So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...
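
A minimal sketch of the first suggestion, assuming stand-in JsonReader and JsonlReader classes (the real ones live in parse_json.py and are not reproduced here) and a hypothetical make_reader helper with illustrative format strings. Declaring the union type before the conditional block keeps mypy from inferring the narrower type from the first assignment:

```python
from typing import Sequence, Union


class JsonReader:
    """Stand-in for the PR's JsonReader (hypothetical minimal version)."""

    def __init__(self, fields: Sequence[str]) -> None:
        self.fields = fields


class JsonlReader:
    """Stand-in for the PR's JsonlReader (hypothetical minimal version)."""

    def __init__(self, fields: Sequence[str]) -> None:
        self.fields = fields


def make_reader(
    encoding_format: str, fields: Sequence[str]
) -> Union[JsonReader, JsonlReader]:
    # Declare every type the variable may hold *before* the first assignment,
    # so mypy does not pin `reader` to the type of whichever branch comes first.
    reader: Union[JsonReader, JsonlReader]
    if encoding_format in ("application/jsonlines", "application/fhir+json"):
        reader = JsonlReader(fields)
    else:
        reader = JsonReader(fields)
    return reader
```

On Python 3.10 and later, the annotation can equally be written as reader: JsonReader | JsonlReader.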

For the formatting error, have you tried running black with the same specifications (--check --line-length 88 --preview etc) as we do in the tests? This should hopefully fix the tests.

ccl-core · 2025-05-29T21:22:14Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+        """
+        # raw JSON fallback: one‐cell DataFrame
+        fh.seek(0)
+        content = json.load(fh)


I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?

Yes, makes total sense.
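
For illustration, a sketch of what the raw JSON fallback could look like with orjson.loads, assuming the file handle is opened in binary mode. The load_raw_json name and the fallback to the standard json module are assumptions made so the snippet also runs where orjson is not installed; they are not taken from the PR.

```python
import io
import json
from typing import Any, BinaryIO

try:
    import orjson  # optional third-party dependency

    _loads = orjson.loads
except ImportError:
    # Assumption: fall back to the standard library when orjson is missing.
    _loads = json.loads


def load_raw_json(fh: BinaryIO) -> Any:
    """Rewinds the handle and parses the whole file as one JSON document."""
    fh.seek(0)
    return _loads(fh.read())


# Usage: the handle has already been consumed once, hence the seek(0) above.
fh = io.BytesIO(b'{"resourceType": "Patient", "id": "p1"}')
fh.read()
content = load_raw_json(fh)
```

Both orjson.loads and json.loads accept bytes, so the same call works with either parser.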

ccl-core · 2025-05-29T21:55:45Z

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the required approvals?

crisely09 · 2025-05-30T15:24:53Z

I went back to the logs, and the errors seem to be related to files I haven't modified, base_node.py for instance.

crisely09 · 2025-05-30T15:27:02Z

Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look.

crisely09 · 2025-06-06T09:43:48Z

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

ccl-core · 2025-06-11T08:59:38Z

Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace!

ccl-core · 2025-06-12T15:00:52Z

Hi @crisely09, the mypy errors were due to a new version of mypy and were unrelated to your changes, as you already pointed out (the CI had been failing for a few weeks anyway: https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) )
I sent #890, which should hopefully fix the issue.

ccl-core · 2025-06-12T15:11:51Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+
+        # Load entire JSON file (could be a list or a single dict).
+        raw = fh.read()
+        data = orjson.loads(raw)


You can see here an example of how to lazily load a library: 4fbd358
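
The linked commit is not reproduced here. As a generic sketch of lazy loading in Python (the lazy_import helper and its error message are hypothetical, not mlcroissant's API), the import can be deferred until the library is first needed:

```python
import functools
import importlib
from types import ModuleType


@functools.lru_cache(maxsize=None)
def lazy_import(name: str) -> ModuleType:
    """Imports `name` on first call and returns the cached module afterwards."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        # Raise a clearer error only when the optional library is really used.
        raise ImportError(
            f"Optional dependency {name!r} is required for this operation."
        ) from e


def parse(raw: str):
    # The module is resolved here, at call time, not when this file is imported.
    # The standard json module stands in for an optional library such as orjson.
    json_module = lazy_import("json")
    return json_module.loads(raw)
```

With this pattern, users who never touch the JSON code path do not need the optional library installed.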

ccl-core · 2025-06-12T15:17:56Z

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

crisely09 · 2025-06-23T07:18:55Z

Thank you for the review!!
I will have a look at the parsing logic, to keep the expected behavior for this type of file.

crisely09 · 2025-07-23T15:09:30Z

Hi @ccl-core ,
I managed to fix things to make the tests pass. Could you have another look, please?

There is something I would like to discuss with you: I changed the way the bounding boxes are loaded so that one RecordSet contains all bounding boxes from the same contentUrl. This makes more sense to me than having one RecordSet per bounding box. It would be much better if we could have a chat on Zoom/Meet/Teams.
Thank you!

marcenacp

Thanks for your contribution! I have a few clarification questions about why we need to introduce new dependencies and how we could simplify the parsers/readers.

marcenacp · 2025-07-25T10:03:28Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+                # simple JSONPath → JMESPath
+                jm = json_path.lstrip("$.")  # drop the "$."
+                expr = jmespath.compile(jm)
+                engine = "jmespath"


OoC, why do we need both JSONPath and JMESPath? Why not use JSONPath everywhere?

It's a performance optimization. Here's the reasoning:

JMESPath is much faster for simple paths like $.resourceType or $.entry[*].resource.id — it's a lightweight lookup, essentially a dict key traversal.
jsonpath_rw supports the full JSONPath spec, including recursive descent (..) and complex filters, but it's significantly slower because it walks the entire JSON tree.
The code at line 166 routes between them:

Simple paths ($.foo.bar, no ..) → JMESPath (fast)
Complex paths with .. or filters → jsonpath_rw (full-featured, slower)
For FHIR files which can have thousands of NDJSON lines, using JMESPath for simple extractions makes a big difference in throughput. jsonpath_rw is kept as a fallback for paths that JMESPath can't express.
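
The routing described above can be sketched without either library installed. This is an illustration under assumptions, not the PR's code: the choose_engine name and the checks for ".." and "?(" are hypothetical. Note that str.lstrip("$.") strips any leading run of the characters $ and . rather than the literal prefix, so the sketch slices the prefix off explicitly.

```python
from typing import Tuple

_PREFIX = "$."


def choose_engine(json_path: str) -> Tuple[str, str]:
    """Returns (engine name, expression to compile with that engine)."""
    # Recursive descent ("..") and filter expressions ("?(") need full JSONPath.
    needs_full_jsonpath = ".." in json_path or "?(" in json_path
    if needs_full_jsonpath or not json_path.startswith(_PREFIX):
        return "jsonpath", json_path
    # Simple path: drop exactly the leading "$." and hand the rest to JMESPath.
    return "jmespath", json_path[len(_PREFIX):]


simple = choose_engine("$.entry[*].resource.id")
recursive = choose_engine("$..id")
```

In real code the returned expression would then be compiled once with the chosen library and reused for every line of the file.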

marcenacp · 2025-07-25T10:04:58Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py

+                    EncodingFormat.FHIR,
+                ):
+                    # JSON_LINES and FHIR do the same thing
+                    reader = JsonlReader(self.fields)


Implementing our own readers/parsers looks scary because it can quickly become a rabbit hole of bugs! So I have a few questions:

Is there an existing reader for FHIR files or do we really have to do it manually? Is fhir.resources an option?

Can we handle it as nested JSON, e.g. using pd.json_normalize?
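
To make the nested-JSON question concrete, here is a standard-library sketch that reads NDJSON line by line and flattens nested objects into dotted keys, similar in spirit to the dotted column names pd.json_normalize produces for nested dicts. The iter_ndjson and flatten helpers and the sample records are hypothetical; lists are left untouched.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_ndjson(fh: TextIO) -> Iterator[Dict[str, Any]]:
    """Yields one parsed object per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flattens nested dicts into dotted keys; other values are kept as-is."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Made-up sample data shaped like FHIR resources, with one blank line.
sample = io.StringIO(
    '{"resourceType": "Patient", "id": "p1", "meta": {"versionId": "1"}}\n'
    "\n"
    '{"resourceType": "Observation", "id": "o1",'
    ' "subject": {"reference": "Patient/p1"}}\n'
)
rows = [flatten(record) for record in iter_ndjson(sample)]
```

Each row is then a flat mapping that could be turned into a DataFrame or matched against field names.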

marcenacp · 2025-07-25T10:09:37Z

python/mlcroissant/mlcroissant/_src/core/ml/bounding_box.py

 """Module to manage "bounding boxes" annotations on images."""

-from typing import Any
+from typing import Any, List, Union


Can you please split the PR into 2 PRs: one for FHIR files (this one) and one for bounding boxes?

Done, I kept only the FHIR-related changes.

crisely09 requested a review from a team as a code owner May 28, 2025 09:54

ccl-core self-requested a review May 28, 2025 10:01

ccl-core requested a review from marcenacp July 25, 2025 10:02

stefanches7 mentioned this pull request Oct 9, 2025

Adding support for reading medical images #862

Open

add reading option for fhir

4a7aba3

crisely09 added 17 commits February 16, 2026 14:19

fix isort

9aff86a

fix test expectations

3fbfac9

fix format

8fe4af4

fix flakes

9df8a42

fix expectation of tests

1aed383

if not replaced to if is None

515f346

read bounding boxes all at once

f477e8f

lazy load orjson

3fdf02d

remove imports of orjson

2344db9

fix python format black

ace9c0c

run black again

0d61acd

update bounding_box parsing to pass the test

cd83e47

trying to include all cases for bounding boxes

827fcf4

fix format and pytype

c2ba57f

trying to fix format errors

52a2532

reverted box changes, added fhir resources library and lazy load JSON…

d642034

… handling libraries

cleaning

bd91ce6

crisely09 force-pushed the read-fhir branch from 15c49b0 to bd91ce6 Compare February 16, 2026 15:11

crisely09 added 5 commits February 16, 2026 16:16

revert notebook unintentional changes

bb1d109

fix metadata example

e75993e

More fixes

fb7a7a3

fix mypy

acf8e14

Remove bounding box changes (moved to separate PR)

e804f29

crisely09 force-pushed the read-fhir branch from 95e6575 to e804f29 Compare February 16, 2026 15:50

crisely09 added 3 commits February 23, 2026 09:01

restore notebooks

a9d98a3

missing docs

a762804

really restore notebooks

b20e9b6

crisely09 force-pushed the read-fhir branch from 5c923a8 to b20e9b6 Compare February 23, 2026 14:25

crisely09 added 2 commits February 23, 2026 15:37

notebook restore

60acc8f

make sure implementation does't break previous behaviour

b097456
