Automatically generate examples documentation adv (#19294) #19750

cj-zhukov · 2026-01-11T12:52:01Z

Which issue does this PR close?

Closes #19294 (Automatically generate examples documentation and add CI sync check).

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

cj-zhukov · 2026-01-11T13:03:59Z

High-Level Overview

In previous PRs (#19371 and #19491), a basic script was introduced that only parsed example group names.

This PR implements a more comprehensive solution for keeping the examples README up-to-date:

- Introduces a new examples.toml file, which contains metadata for examples, including:
  - subcommand: the command used with cargo run --example [group] -- [subcommand]
  - file: the actual Rust source file
  - desc: a short description of the example
- Updates the generate_examples_docs.sh script to:
  - Parse examples.toml
  - Check that all examples in the filesystem exist and are listed in the TOML
  - Generate a new README file as README-NEW.md
- Integrates CI validation:
  - README-NEW.md is formatted with DataFusion’s Prettier
  - Compared against the committed README.md
  - If there are differences, the CI check fails and prints a clear diff, along with instructions for updating the README after verifying examples.toml

This ensures that the README for examples stays in sync with the actual examples and encourages maintainers to keep examples.toml accurate.
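As a rough illustration of the two checks described above (every example file on disk is listed in the metadata, and the generated README matches the committed one), here is a minimal Rust sketch. The function names `missing_from_metadata` and `readme_drift` are hypothetical and are not taken from this PR. The comparison is a simple positional line comparison, not a real diff algorithm, and the file collection step is left out.

```rust
use std::collections::BTreeSet;

/// Returns example files that exist on disk but are absent from the metadata.
/// Both inputs are plain file names; how they are collected is out of scope.
fn missing_from_metadata<'a>(on_disk: &[&'a str], in_metadata: &[&str]) -> Vec<&'a str> {
    let listed: BTreeSet<&str> = in_metadata.iter().copied().collect();
    on_disk
        .iter()
        .copied()
        .filter(|f| !listed.contains(*f))
        .collect()
}

/// Returns (1-based line number, committed line, generated line) for every
/// position where the two documents differ. A missing line is reported as "".
/// This is a positional comparison, not a true diff.
fn readme_drift(committed: &str, generated: &str) -> Vec<(usize, String, String)> {
    let old: Vec<&str> = committed.lines().collect();
    let new: Vec<&str> = generated.lines().collect();
    let mut drift = Vec::new();
    for i in 0..old.len().max(new.len()) {
        let left = old.get(i).copied().unwrap_or("");
        let right = new.get(i).copied().unwrap_or("");
        if left != right {
            drift.push((i + 1, left.to_string(), right.to_string()));
        }
    }
    drift
}
```

A CI wrapper would presumably exit with a non-zero status when either function returns a non-empty list, printing the differing lines before it does so.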

cj-zhukov · 2026-01-11T13:05:31Z

@Jefffrey since you helped with previous PRs related to example documentation, it would be great if you could take a look at this one as well. Your feedback on the CI check approach or any improvements would be much appreciated.

Jefffrey

Sorry for the late review. I haven't looked too in-depth but I have some concerns with this approach:

- A lot of bash scripting; I know we have other scripts like this, but I wonder if this amount of logic is better served as Python or even Rust scripts
  - Some of the Bash scripting behaviour is not ideal too, like how it doesn't capitalize SQL and IO properly
- Needing to have a separate metadata toml file doesn't seem ideal, but I can't think of another way to handle this other than manually parsing the Rust code of the examples (which is worse)
  - We managed to use a Rust based approach for function docs because we used a Rust binary to access the functions which are exposed as a lib; that won't work for examples 🤔

I don't have too many ideas myself unfortunately, would love if anyone else can chime in 😅

cj-zhukov · 2026-01-17T06:49:34Z

@Jefffrey Thanks for the review and for raising these points - they’re all fair concerns.

I agree this is pushing Bash beyond simple glue logic. I started there to avoid introducing new dependencies and to stay consistent with existing CI scripts, but if the overall approach makes sense, I’d be happy to follow up with a Rust or Python implementation to improve maintainability.

Regarding examples.toml: I also don’t love having a separate metadata file, but I couldn’t find a reliable single source of truth for example subcommands and descriptions. Parsing Rust source felt more brittle, and unlike function docs, examples aren’t exposed via a library API that we can introspect.

The capitalization issues you noticed (e.g. SQL / IO) are artifacts of the current heuristics and can be fixed - either via normalization or explicit metadata.

I’m definitely open to alternative ideas here and would love more input if others have suggestions.

cj-zhukov · 2026-01-21T14:06:32Z

Based on this discussion, I’m working on a new Rust-based implementation to replace generate_examples_docs.sh.
At the same time, I’m experimenting with a code-driven approach that removes the need for examples.toml, so the README can stay in sync directly with the examples themselves.

Once this new implementation is ready, I’ll share it here so we can review the updated approach and decide whether it’s a better fit.

cj-zhukov · 2026-01-23T11:25:24Z

I replaced the shell-based examples documentation generator with a Rust implementation.
What changed:

Replaced generate_examples_docs.sh with a Rust binary (examples-docs)
Removed examples.toml
Documentation is now generated directly from structured doc comments in each example group’s main.rs
CI script was updated to call the Rust generator instead of the shell script

Design change:

Before:

Files lived in examples/<group>/*.rs
Metadata (subcommand, description) lived in examples.toml
A shell script merged the two

Now:

Each example group’s main.rs contains the authoritative documentation for that group
The Rust generator parses this information and renders the README directly
There is a single source of truth
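To make the doc-comment-driven flow more concrete, here is a small Rust sketch of a lightweight parser and a row renderer. The doc comment layout it accepts (``//! - `subcommand` (file: name.rs, desc: text)``) is an assumption made for illustration, as are the names `ExampleEntry`, `parse_entries` and `render_row`; the actual format used in each group's main.rs may differ. The rendered row follows the shape of the README table rows, leaving column padding to Prettier.

```rust
/// One README table row. Field names mirror the metadata discussed in this PR.
#[derive(Debug, PartialEq)]
struct ExampleEntry {
    subcommand: String,
    file: String,
    desc: String,
}

/// Parses a single line of the assumed form:
/// //! - `subcommand` (file: name.rs, desc: Some text)
/// Any line that does not match is ignored by returning None.
fn parse_line(line: &str) -> Option<ExampleEntry> {
    let rest = line.trim().strip_prefix("//! - `")?;
    let (subcommand, rest) = rest.split_once('`')?;
    let rest = rest.trim().strip_prefix("(file:")?.strip_suffix(')')?;
    let (file, desc) = rest.split_once(", desc:")?;
    Some(ExampleEntry {
        subcommand: subcommand.to_string(),
        file: file.trim().to_string(),
        desc: desc.trim().to_string(),
    })
}

/// Collects every entry found in the source text of a group's main.rs.
fn parse_entries(source: &str) -> Vec<ExampleEntry> {
    source.lines().filter_map(parse_line).collect()
}

/// Renders one Markdown table row for the given example group directory.
fn render_row(group: &str, entry: &ExampleEntry) -> String {
    let ExampleEntry { subcommand, file, desc } = entry;
    format!("| {subcommand} | [`{group}/{file}`](examples/{group}/{file}) | {desc} |")
}
```

Keeping the parser to a handful of `strip_prefix` and `split_once` calls is what makes it dependency-free, at the cost of being strict about the exact comment layout.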

Personally, I find this approach cleaner and more “Rust-native” than maintaining a complex shell script and a separate TOML file. A follow-up PR could switch this to a dedicated parsing crate (e.g. nom) if the parsing rules become more complex or harder to maintain. That said, I’m very open to feedback and alternative designs if there are concerns about this direction.

Jefffrey

I prefer this Rust solution 👍

I think we can merge this PR and do any improvements in further PRs 🚀

Jefffrey · 2026-01-26T05:22:28Z

ci/scripts/check_examples_docs.sh

-        continue
-    fi
+echo "▶ Formatting generated README with Prettier…"
+npx [email protected] \


Something to look at in a followup issue is unifying our prettier versions somehow 🤔

datafusion/dev/update_config_docs.sh

Lines 241 to 242 in 8023947

echo "Running prettier"

npx [email protected] --write "$TARGET_FILE"

datafusion/dev/update_function_docs.sh

Lines 116 to 117 in 4d63f8c

echo "Running prettier"

npx [email protected] --write "$TARGET_FILE"

datafusion/ci/scripts/doc_prettier_check.sh

Lines 23 to 27 in 4d63f8c

SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"

PRETTIER_VERSION="2.7.1"

PRETTIER_TARGETS=(

'{datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md'

'!datafusion/CHANGELOG.md'

I agree with you, it's a good point

Jefffrey · 2026-01-26T05:26:09Z

datafusion-examples/README.md

 | file_stream_provider  | [`custom_data_source/file_stream_provider.rs`](examples/custom_data_source/file_stream_provider.rs)   | Read/write via FileStreamProvider for streams |

-## Data IO Examples
+## Data Io Examples


Would be nice if we could fix these capitalization cases

I'm going to work on this

cj-zhukov · 2026-01-26T14:07:38Z

@Jefffrey Thanks for the review!

I’ve fixed the capitalization issues for group titles (e.g. Data IO, SQL Ops, UDF, etc.) by introducing explicit handling for common abbreviations instead of naive Title Case.
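For readers following along, the difference between naive Title Case and abbreviation-aware casing can be sketched as below. This is an illustration only: the `ABBREVIATIONS` list, the `group_title` name, and the assumption that group directories use underscore-separated names such as `data_io` are not taken from the PR's code.

```rust
/// Words that should be rendered fully upper-case. Illustrative list only.
const ABBREVIATIONS: &[&str] = &["io", "sql", "udf"];

/// Turns an underscore-separated group name into a heading title,
/// upper-casing known abbreviations and title-casing everything else.
fn group_title(group: &str) -> String {
    group
        .split('_')
        .filter(|word| !word.is_empty())
        .map(|word| {
            let lower = word.to_ascii_lowercase();
            if ABBREVIATIONS.iter().any(|abbr| *abbr == lower) {
                lower.to_ascii_uppercase()
            } else {
                // Upper-case only the first character of an ordinary word.
                let mut chars = lower.chars();
                match chars.next() {
                    Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
                    None => String::new(),
                }
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

With this sketch, `group_title("data_io")` yields `Data IO` rather than `Data Io`, and `group_title("sql_ops")` yields `SQL Ops`.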

For now, I kept the existing lightweight parser for extracting metadata from main.rs doc comments and just improved it where needed. As a follow-up PR, would it make sense to explore using a small parsing crate (e.g. nom) to make this logic more robust and easier to evolve, or is the preference to keep this part dependency-free?

Happy to follow whatever direction makes the most sense here.

Jefffrey · 2026-01-26T14:40:26Z

> For now, I kept the existing lightweight parser for extracting metadata from main.rs doc comments and just improved it where needed. As a follow-up PR, would it make sense to explore using a small parsing crate (e.g. nom) to make this logic more robust and easier to evolve, or is the preference to keep this part dependency-free?

I'm not overly concerned about dependency footprint for this; we can always feature gate it so people running datafusion examples won't have it by default. My only concern is complexity in the code in the DataFusion repo. If we pull in nom but it leads to a more complex bin here then it's probably not worth the tradeoff; but if it leads to simpler/more robust code here then that would be great.

cj-zhukov · 2026-01-27T05:13:20Z

@Jefffrey Thanks, that makes sense.

I’ll keep this PR focused on improving the current implementation and avoid adding extra complexity here. For now, I think it’s best to keep the existing lightweight parser as-is and only make targeted improvements where needed.

As follow-ups, I’m planning:

a small PR to unify the Prettier versions used in CI
a separate exploratory PR to evaluate a nom-based parser only if it clearly simplifies the parsing logic and improves robustness; otherwise, I agree it’s not worth the tradeoff

Happy to adjust direction if you’d prefer a different approach.

Jefffrey · 2026-01-29T03:03:57Z

Should be good to merge once conflicts are resolved @cj-zhukov

Automatically generate examples documentation adv (apache#19294)

2071da7

github-actions bot added the development-process Related to development process of DataFusion label Jan 11, 2026

Sergey Zhukov added 4 commits January 14, 2026 13:41

fix dubious ownership error in repository

513f1f9

run doc_prettier_check.sh with allow-dirty

70c96d6

doc_prettier_check.sh uses npx to run prettier

7704287

use prettier to format README-NEW

15cc69c

Jefffrey reviewed Jan 16, 2026

rust docs generator instead of bash

cc5a9db

Sergey Zhukov added 3 commits January 23, 2026 14:51

rust docs generator

e0f9a64

clippy uninlined_format_args

c93a165

refactor: impl trait and add comments

4b8ee9e

Jefffrey approved these changes Jan 26, 2026

fix capitalization cases

ec6c924

This was referenced Jan 27, 2026

Unify the Prettier versions #20024

Open

Explore replacing ad-hoc parsing logic in datafusion-examples with a nom-based parser #20025

Open
