feat: allow reading the _rowoffset by westonpace · Pull Request #4478 · lance-format/lance

westonpace · 2025-08-14T22:28:32Z

The row offset (_rowoffset) is similar to the row id and the row address but slightly different. The row offset is always a number from 0 to len(dataset).

The row offset is highly unstable (it can change from rows being deleted or from compaction, updates, or anything else that modifies a dataset) but it has one useful property. The user does not need to query the list of available row offset up front for sampling purposes. This makes it useful in some cases for sampling a dataset.

In addition, this PR reworks the way we do projection. It generalizes things like _rowaddr and _rowid as "system columns". We can now read _rowid and _rowaddr without requiring the user to call with_row_id or with_row_address. If they are specified in a projection then they will show up in the output in the order in which they were projected. However, if the user calls with_row_id or with_row_offset then we will still always place them at at the end.

In addition, scoring columns (e.g. _distance, _score) can be specified in the projection. To avoid breaking changes we still include those columns even if they are not included in a projection. However, this behavior will change in the future. This PR adds a warning that will be emitted and a new flag (disable_scoring_autoprojection) that can be used to opt in to the new behavior early to avoid the upcoming breaking change.

codecov-commenter · 2025-08-14T23:05:23Z

Codecov Report

❌ Patch coverage is 78.04296% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.21%. Comparing base (db14186) to head (bfd8c6c).
⚠️ Report is 21 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance/src/io/exec/rowids.rs	66.86%	48 Missing and 8 partials ⚠️
rust/lance/src/dataset/scanner.rs	85.10%	12 Missing and 9 partials ⚠️
rust/lance-datafusion/src/projection.rs	85.98%	14 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4478      +/-   ##
==========================================
+ Coverage   81.88%   82.21%   +0.32%     
==========================================
  Files         304      307       +3     
  Lines      124071   125502    +1431     
  Branches   124071   125502    +1431     
==========================================
+ Hits       101601   103186    +1585     
+ Misses      18652    18468     -184     
- Partials     3818     3848      +30

Flag	Coverage Δ
unittests	`82.21% <78.04%> (+0.32%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

wkalt · 2025-08-19T14:37:26Z

+        disable_scoring_autoprojection: bool, default False
+            Currently, when a search (vector or full text) is performed, the scoring
+            column (_distance, _score) is added to the end of the output even when a
+            projection is specified.  In the future, this will change.  The columns will


not sure who the consumer of these docs is, but it might be useful to explain why it's added (presumably to help with sort operations higher up in the exec plan?)

The consumer is the user here. This only affects the final projection to the user. So if a sort exec needs the column then it will be there for the sort exec and then dropped later.

There is no real good reason why it is added except when we coded it up originally we thought it might be an interesting column for the user to see and didn't make a way for them to not see it.

wkalt · 2025-08-19T14:40:08Z

-    /// a count query)
-    pub desires_row_addr: bool,
+    /// Needs the row address converted into a row offset
+    pub needs_row_offset: bool,


is it right to understand this as "must project row offset"?

Yes...maybe?

The whole concept of "projection plan" is a little sketchy. We have the physical projection, which is the columns that we need to grab from the fragment readers. row id and row offset are considered part of the physical projection because they are delivered by the fragment reader.

Then we have the requested_output_expr which is the set of columns the user is actually asking for. This includes columns that don't exist yet (because they need to be dynamically calculated for example)

Then we have the row offset which, currently, is kind of "in between". It gets added by the scanner (not the fragment reader) but it isn't exactly a UDF (I suppose it could be but it needs the dataset metadata which would make it a bit weird for a UDF) so we can't just throw it into the requested_output_expr.

As a compromise, we have this field, whose whole purpose is to inform the scanner that it needs to add the "calculate row offset" exec node.

I've renamed the field must_add_row_offset

wkalt · 2025-08-19T14:45:40Z

+        Self::internal_new(input, Arc::new(frag_id_to_offset))
+    }
+
+    // A dummy placeholder to use in emergency when we can't find the fragment info (better than panic)


can you spell out the consequences/implications of this?

I'll update the comment. The consequence will be a nonsense row offset that is really large.

I ended up getting rid of this and returning an error instead as that's probably better than a garbage value that might seem accurate (not sure why I thought panic was the only alternative when I wrote the comment)

Xuanwo

Mostly LGTM, only have a naming thing.

Xuanwo · 2025-08-19T14:44:02Z

-            desires_row_addr: false,
-            desires_row_id: false,
-        }
+    fn add_system_columns(schema: &ArrowSchema) -> ArrowSchema {


I find Self::add_system_columns(&full_schema) is a bit confusing. Is it a good idea to move it out of ProjectionPlan as a standalone funcation with new like add_system_columns_in_schema?

Hmm...I'm not entirely sure what your concern is. Can you elaborate?

This function is supposed to take a schema and then add a bunch of columns to the end of the schema. So we start with a schema [X, Y, Z] and we end with the schema [X, Y, Z, _rowid, _rowaddr, _rowoffset].

At the moment it is static (no &self) because the set of system columns is a constant. In the future, maybe that won't be the case (maybe system columns can be determined by UDFs or some other pluggable mechanism?) Once that happens this method would probably move over to whatever "system column registry" we have.

The row offset (_rowoffset) is similar to the row id and the row address but slightly different. The row offset is always a number from 0 to len(dataset). The row offset is highly unstable (it can change from rows being deleted or from compaction, updates, or anything else that modifies a dataset) but it has one useful property. The user does not need to query the list of available row offset up front for sampling purposes. This makes it useful in some cases for sampling a dataset. In addition, this PR reworks the way we do projection. It generalizes things like _rowaddr and _rowid as "system columns". We can now read _rowid and _rowaddr without requiring the user to call with_row_id or with_row_address. If they are specified in a projection then they will show up in the output in the order in which they were projected. However, if the user calls with_row_id or with_row_offset then we will still always place them at at the end. In addition, scoring columns (e.g. `_distance`, `_score`) can be specified in the projection. To avoid breaking changes we still include those columns even if they are not included in a projection. However, this behavior will change in the future. This PR adds a warning that will be emitted and a new flag (`disable_scoring_autoprojection`) that can be used to opt in to the new behavior early to avoid the upcoming breaking change.

In #4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274

…#4743) In lance-format#4478 a change was introduced that added the `ProjectionPlan::full()` function. This will get the entire schema for a dataset and turn every column into a SQL expression in order to call `from_expressions`. This is causing us a significant performance hit when we have datasets that have very large schemas. Since we know for certain that this function is getting all of the columns, we should be able to short circuit the work being done in parsing strings to columns against the schema and instead jump straight to evaluating each column in the schema as an `Expr`. Running one of our benchmarks (time in seconds): - On commit 2b774a4 (before PR 4478): 0.4338 - On commit eeea03c (after PR 4478): 0.7486 - eeea03c with my update in this PR: 0.4274

allow reading the row offset

aa7923c

github-actions Bot added enhancement New feature or request python labels Aug 14, 2025

westonpace added 4 commits August 15, 2025 06:52

Address clippy suggestions

efa709e

Change the way system columns are specified

8e69cb2

Fix broken tests, clippy suggestions

fdf00b0

Fix broken unit test

3fbfe2f

This was referenced Aug 18, 2025

Rework data loaders #3727

Open

feat: add __getitems__ method impl for torch integration lancedb/lancedb#2596

Merged

wkalt reviewed Aug 19, 2025

View reviewed changes

Comment thread rust/lance/src/io/exec/rowids.rs Outdated

wkalt reviewed Aug 19, 2025

View reviewed changes

Xuanwo approved these changes Aug 19, 2025

View reviewed changes

Address PR suggestions

bfd8c6c

westonpace merged commit eeea03c into lance-format:main Aug 20, 2025
26 checks passed

timsaucer mentioned this pull request Sep 16, 2025

fix: speed up generating ProjectionPlan for full schema #4743

Merged

westonpace mentioned this pull request Oct 15, 2025

No easy way to obtain row index from queries? #3140

Closed

Conversation

westonpace commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Xuanwo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

westonpace commented Aug 14, 2025 •

edited

Loading

codecov-commenter commented Aug 14, 2025 •

edited

Loading