fix: speed up generating ProjectionPlan for full schema#4743

Merged
westonpace merged 2 commits into
lance-format:mainfrom
rerun-io:tsaucer/full-schema-projection-speedup
Sep 19, 2025

@timsaucer
Contributor

PR #4478 added the `ProjectionPlan::full()` function. It takes the entire schema of a dataset and turns every column into a SQL expression in order to call `from_expressions`. This causes a significant performance hit for datasets with very large schemas. Since we know for certain that this function selects all of the columns, we should be able to short-circuit the work of parsing strings into columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`.

Running one of our benchmarks (time in seconds):

  • On commit 2b774a4 (before PR 4478): 0.4338
  • On commit eeea03c (after PR 4478): 0.7486
  • eeea03c with my update in this PR: 0.4274
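
To illustrate the idea behind the change, here is a minimal, self-contained sketch of the two code paths. It is a toy model, not Lance's implementation: `Schema`, `Expr`, `parse_and_resolve`, `full_via_sql`, and `full_direct` are all hypothetical names invented for this example, and the "parser" is only a stand-in for real SQL parsing and name resolution.

```rust
// Toy model only: every type and function here is hypothetical and is NOT
// part of the Lance or DataFusion APIs.

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A reference to a column by name.
    Column(String),
}

struct Schema {
    fields: Vec<String>,
}

/// Stand-in for SQL parsing plus name resolution: strips the quoting and
/// scans the schema for a matching field.
fn parse_and_resolve(sql: &str, schema: &Schema) -> Result<Expr, String> {
    let name = sql.trim().trim_matches('`');
    schema
        .fields
        .iter()
        .find(|f| f.as_str() == name)
        .map(|f| Expr::Column(f.clone()))
        .ok_or_else(|| format!("unknown column: {name}"))
}

/// Slow path: render each column as SQL text, then parse it back and
/// resolve it against the schema. The per-column lookup makes this
/// quadratic in the schema width in this toy model.
fn full_via_sql(schema: &Schema) -> Result<Vec<(String, Expr)>, String> {
    schema
        .fields
        .iter()
        .map(|name| {
            let sql = format!("`{name}`");
            parse_and_resolve(&sql, schema).map(|expr| (name.clone(), expr))
        })
        .collect()
}

/// Fast path: every column is known to exist, so build the column
/// expression directly without a string round trip.
fn full_direct(schema: &Schema) -> Vec<(String, Expr)> {
    schema
        .fields
        .iter()
        .map(|name| (name.clone(), Expr::Column(name.clone())))
        .collect()
}

fn main() {
    // A deliberately wide schema, as in the benchmark described above.
    let schema = Schema {
        fields: (0..5000).map(|i| format!("col_{i}")).collect(),
    };
    let slow = full_via_sql(&schema).expect("every column exists in the schema");
    let fast = full_direct(&schema);
    // Both paths yield the same projection; only the amount of work differs.
    assert_eq!(slow, fast);
    println!("both paths produced {} projected columns", fast.len());
}
```

The point of the sketch is that the direct path can skip parsing and lookup entirely because a full-schema projection cannot reference a missing column, so the error branch of the slow path is never needed.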

@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@codecov-commenter

codecov-commenter commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.74%. Comparing base (8b9ad8c) to head (e335448).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-datafusion/src/projection.rs | 93.75% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4743      +/-   ##
==========================================
+ Coverage   80.72%   80.74%   +0.01%     
==========================================
  Files         321      321              
  Lines      124068   124885     +817     
  Branches   124068   124885     +817     
==========================================
+ Hits       100154   100834     +680     
- Misses      20341    20452     +111     
- Partials     3573     3599      +26     
| Flag | Coverage Δ |
|---|---|
| unittests | 80.74% <93.75%> (+0.01%) ⬆️ |


@timsaucer timsaucer changed the title Speed up generating ProjectionPlan for full schema fix: speed up generating ProjectionPlan for full schema Sep 16, 2025
@github-actions github-actions Bot added the bug Something isn't working label Sep 16, 2025
@timsaucer
Contributor Author

The one failing test does not appear related to this PR, but I don't have the ability to rerun failing jobs.

@LuQQiu LuQQiu requested a review from westonpace September 18, 2025 22:24
Member

@westonpace westonpace left a comment

Nice catch.

We definitely need more benchmarks like this helping us regress PRs. What was the benchmark doing? Just a simple query? Was the schema unusually wide?

@westonpace westonpace merged commit 2c1c359 into lance-format:main Sep 19, 2025
36 of 37 checks passed
timsaucer added a commit to rerun-io/lance that referenced this pull request Sep 19, 2025
@timsaucer
Contributor Author

> Nice catch.
>
> We definitely need more benchmarks like this helping us regress PRs. What was the benchmark doing? Just a simple query? Was the schema unusually wide?

Yes, this table was very wide and our benchmark is currently running 500-2000 queries, so the time spent on this became noticeable.

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026