feat: allow reading the _rowoffset#4478
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4478 +/- ##
==========================================
+ Coverage 81.88% 82.21% +0.32%
==========================================
Files 304 307 +3
Lines 124071 125502 +1431
Branches 124071 125502 +1431
==========================================
+ Hits 101601 103186 +1585
+ Misses 18652 18468 -184
- Partials 3818 3848 +30
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| disable_scoring_autoprojection: bool, default False | ||
| Currently, when a search (vector or full text) is performed, the scoring | ||
| column (_distance, _score) is added to the end of the output even when a | ||
| projection is specified. In the future, this will change. The columns will |
There was a problem hiding this comment.
not sure who the consumer of these docs is, but it might be useful to explain why it's added (presumably to help with sort operations higher up in the exec plan?)
There was a problem hiding this comment.
The consumer is the user here. This only affects the final projection to the user. So if a sort exec needs the column then it will be there for the sort exec and then dropped later.
There is no real good reason why it is added except when we coded it up originally we thought it might be an interesting column for the user to see and didn't make a way for them to not see it.
| /// a count query) | ||
| pub desires_row_addr: bool, | ||
| /// Needs the row address converted into a row offset | ||
| pub needs_row_offset: bool, |
There was a problem hiding this comment.
is it right to understand this as "must project row offset"?
There was a problem hiding this comment.
Yes...maybe?
The whole concept of "projection plan" is a little sketchy. We have the physical projection, which is the columns that we need to grab from the fragment readers. row id and row offset are considered part of the physical projection because they are delivered by the fragment reader.
Then we have the requested_output_expr which is the set of columns the user is actually asking for. This includes columns that don't exist yet (because they need to be dynamically calculated for example)
Then we have the row offset which, currently, is kind of "in between". It gets added by the scanner (not the fragment reader) but it isn't exactly a UDF (I suppose it could be but it needs the dataset metadata which would make it a bit weird for a UDF) so we can't just throw it into the requested_output_expr.
As a compromise, we have this field, whose whole purpose is to inform the scanner that it needs to add the "calculate row offset" exec node.
There was a problem hiding this comment.
I've renamed the field must_add_row_offset
| Self::internal_new(input, Arc::new(frag_id_to_offset)) | ||
| } | ||
|
|
||
| // A dummy placeholder to use in emergency when we can't find the fragment info (better than panic) |
There was a problem hiding this comment.
can you spell out the consequences/implications of this?
There was a problem hiding this comment.
I'll update the comment. The consequence will be a nonsense row offset that is really large.
There was a problem hiding this comment.
I ended up getting rid of this and returning an error instead as that's probably better than a garbage value that might seem accurate (not sure why I thought panic was the only alternative when I wrote the comment)
Xuanwo
left a comment
There was a problem hiding this comment.
Mostly LGTM, only have a naming thing.
| desires_row_addr: false, | ||
| desires_row_id: false, | ||
| } | ||
| fn add_system_columns(schema: &ArrowSchema) -> ArrowSchema { |
There was a problem hiding this comment.
I find Self::add_system_columns(&full_schema) is a bit confusing. Is it a good idea to move it out of ProjectionPlan as a standalone funcation with new like add_system_columns_in_schema?
There was a problem hiding this comment.
Hmm...I'm not entirely sure what your concern is. Can you elaborate?
This function is supposed to take a schema and then add a bunch of columns to the end of the schema. So we start with a schema [X, Y, Z] and we end with the schema [X, Y, Z, _rowid, _rowaddr, _rowoffset].
At the moment it is static (no &self) because the set of system columns is a constant. In the future, maybe that won't be the case (maybe system columns can be determined by UDFs or some other pluggable mechanism?) Once that happens this method would probably move over to whatever "system column registry" we have.
The row offset (_rowoffset) is similar to the row id and the row address but slightly different. The row offset is always a number from 0 to len(dataset). The row offset is highly unstable (it can change from rows being deleted or from compaction, updates, or anything else that modifies a dataset) but it has one useful property. The user does not need to query the list of available row offset up front for sampling purposes. This makes it useful in some cases for sampling a dataset. In addition, this PR reworks the way we do projection. It generalizes things like _rowaddr and _rowid as "system columns". We can now read _rowid and _rowaddr without requiring the user to call with_row_id or with_row_address. If they are specified in a projection then they will show up in the output in the order in which they were projected. However, if the user calls with_row_id or with_row_offset then we will still always place them at at the end. In addition, scoring columns (e.g. `_distance`, `_score`) can be specified in the projection. To avoid breaking changes we still include those columns even if they are not included in a projection. However, this behavior will change in the future. This PR adds a warning that will be emitted and a new flag (`disable_scoring_autoprojection`) that can be used to opt in to the new behavior early to avoid the upcoming breaking change.
In #4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274
…#4743) In lance-format#4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274
…#4743) In lance-format#4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274
The row offset (_rowoffset) is similar to the row id and the row address but slightly different. The row offset is always a number from 0 to len(dataset).
The row offset is highly unstable (it can change from rows being deleted or from compaction, updates, or anything else that modifies a dataset) but it has one useful property. The user does not need to query the list of available row offset up front for sampling purposes. This makes it useful in some cases for sampling a dataset.
In addition, this PR reworks the way we do projection. It generalizes things like _rowaddr and _rowid as "system columns". We can now read _rowid and _rowaddr without requiring the user to call with_row_id or with_row_address. If they are specified in a projection then they will show up in the output in the order in which they were projected. However, if the user calls with_row_id or with_row_offset then we will still always place them at at the end.
In addition, scoring columns (e.g.
_distance,_score) can be specified in the projection. To avoid breaking changes we still include those columns even if they are not included in a projection. However, this behavior will change in the future. This PR adds a warning that will be emitted and a new flag (disable_scoring_autoprojection) that can be used to opt in to the new behavior early to avoid the upcoming breaking change.