fix: speed up generating ProjectionPlan for full schema by timsaucer · Pull Request #4743 · lance-format/lance

timsaucer · 2025-09-16T12:39:45Z

#4478 added the `ProjectionPlan::full()` function, which takes the entire schema of a dataset and turns every column into a SQL expression in order to call `from_expressions`. This causes a significant performance hit on datasets with very large schemas. Since we know for certain that this function selects all of the columns, we can short-circuit the work of parsing strings into columns against the schema and instead jump straight to treating each column in the schema as an `Expr`.
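The difference between the two approaches can be sketched roughly as follows. This is a toy model and not the Lance or DataFusion code: `Schema`, `Expr`, `parse_column`, `full_via_parsing`, and `full_direct` are hypothetical names invented for illustration, and the "parsing" step is only a stand-in for real SQL expression parsing.

```rust
// Toy stand-ins for illustration only; these are NOT the Lance or DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
}

struct Schema {
    fields: Vec<String>,
}

/// Slow path (hypothetical): treat the input as expression text, strip the
/// quoting, and resolve the name against the schema with a linear lookup.
fn parse_column(schema: &Schema, text: &str) -> Result<Expr, String> {
    let name = text.trim().trim_matches('`');
    if schema.fields.iter().any(|f| f.as_str() == name) {
        Ok(Expr::Column(name.to_string()))
    } else {
        Err(format!("unknown column: {name}"))
    }
}

/// Full projection built by rendering every column as text and parsing it back.
/// Each lookup scans the schema, so the cost grows quickly with schema width.
fn full_via_parsing(schema: &Schema) -> Result<Vec<(String, Expr)>, String> {
    schema
        .fields
        .iter()
        .map(|f| {
            let sql = format!("`{f}`");
            parse_column(schema, &sql).map(|expr| (f.clone(), expr))
        })
        .collect()
}

/// Full projection built directly from the schema: every name is known to be
/// valid, so no text rendering, parsing, or lookup is needed.
fn full_direct(schema: &Schema) -> Vec<(String, Expr)> {
    schema
        .fields
        .iter()
        .map(|f| (f.clone(), Expr::Column(f.clone())))
        .collect()
}
```

Both functions produce the same projection for a valid schema; the direct version simply skips the round trip through text, and it keeps each column name exactly as the schema spells it.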

Running one of our benchmarks (time in seconds):

- On commit 2b774a4 (before PR 4478): 0.4338
- On commit eeea03c (after PR 4478): 0.7486
- eeea03c with my update in this PR: 0.4274
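For reference, the relative changes implied by these timings can be worked out with a few lines of Rust. The helper names `relative_change` and `report` are made up for this illustration; the figures are the three timings quoted above.

```rust
/// Fractional change from `before` to `after`; a negative value means faster.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before
}

/// Returns (regression introduced by PR 4478, change achieved by this PR).
fn report() -> (f64, f64) {
    let baseline = 0.4338; // commit 2b774a4, before PR 4478
    let regressed = 0.7486; // commit eeea03c, after PR 4478
    let fixed = 0.4274; // eeea03c with this PR applied
    (
        relative_change(baseline, regressed),
        relative_change(regressed, fixed),
    )
}
```

With the quoted timings this works out to a slowdown of roughly 73% after #4478 and a reduction of roughly 43% with this PR, which puts the result slightly below the original baseline.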

github-actions · 2025-09-16T12:40:04Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

codecov-commenter · 2025-09-16T13:17:48Z

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.74%. Comparing base (8b9ad8c) to head (e335448).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-datafusion/src/projection.rs | 93.75% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4743      +/-   ##
==========================================
+ Coverage   80.72%   80.74%   +0.01%     
==========================================
  Files         321      321              
  Lines      124068   124885     +817     
  Branches   124068   124885     +817     
==========================================
+ Hits       100154   100834     +680     
- Misses      20341    20452     +111     
- Partials     3573     3599      +26

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `80.74% <93.75%> (+0.01%)` | ⬆️ |

timsaucer · 2025-09-17T16:55:51Z

The one failing test does not appear related to this PR, but I don't have the ability to rerun failing jobs.

westonpace

Nice catch.

We definitely need more benchmarks like this helping us regress PRs. What was the benchmark doing? Just a simple query? Was the schema unusually wide?

timsaucer · 2025-09-20T00:42:26Z

> Nice catch.
>
> We definitely need more benchmarks like this helping us regress PRs. What was the benchmark doing? Just a simple query? Was the schema unusually wide?

Yes, this table was very wide and our benchmark is currently running 500-2000 queries, so the time spent on this became noticeable.

Apply correction to 0.36 for performance of full projection

6d2d1eb

timsaucer mentioned this pull request Sep 7, 2025

Rerun PRs rerun-io/opensource#2

Open

timsaucer changed the title ~~Speed up generating ProjectionPlan for full schema~~ fix: speed up generating ProjectionPlan for full schema Sep 16, 2025

github-actions Bot added the bug Something isn't working label Sep 16, 2025

Preserve column casing

e335448

LuQQiu requested a review from westonpace September 18, 2025 22:24

westonpace approved these changes Sep 18, 2025

westonpace merged commit 2c1c359 into lance-format:main Sep 19, 2025
36 of 37 checks passed

westonpace merged 2 commits into lance-format:main from rerun-io:tsaucer/full-schema-projection-speedup
