fix: speed up generating ProjectionPlan for full schema#4743
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4743 +/- ##
==========================================
+ Coverage 80.72% 80.74% +0.01%
==========================================
Files 321 321
Lines 124068 124885 +817
Branches 124068 124885 +817
==========================================
+ Hits 100154 100834 +680
- Misses 20341 20452 +111
- Partials 3573 3599 +26
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
The one failing test does not appear related to this PR, but I don't have the ability to rerun failing jobs. |
westonpace
left a comment
There was a problem hiding this comment.
Nice catch.
We definitely need more benchmarks like this helping us regress PRs. What was the benchmark doing? Just a simple query? Was the schema unusually wide?
…#4743) In lance-format#4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274
Yes, this table was very wide and our benchmark is currently running 500-2000 queries, so the time spent on this became noticeable. |
…#4743) In lance-format#4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274
In #4478 a change was introduced that added the
ProjectionPlan::full()function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to callfrom_expressions. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as anExpr.Running one of our benchmarks (time in seconds):