Skip to content

feat: allow reading the _rowoffset#4478

Merged
westonpace merged 6 commits into
lance-format:mainfrom
westonpace:feat/read-row-offset
Aug 20, 2025
Merged

feat: allow reading the _rowoffset#4478
westonpace merged 6 commits into
lance-format:mainfrom
westonpace:feat/read-row-offset

Conversation

@westonpace
Copy link
Copy Markdown
Member

@westonpace westonpace commented Aug 14, 2025

The row offset (_rowoffset) is similar to the row id and the row address but slightly different. The row offset is always a number from 0 to len(dataset).

The row offset is highly unstable (it can change from rows being deleted or from compaction, updates, or anything else that modifies a dataset) but it has one useful property. The user does not need to query the list of available row offset up front for sampling purposes. This makes it useful in some cases for sampling a dataset.

In addition, this PR reworks the way we do projection. It generalizes things like _rowaddr and _rowid as "system columns". We can now read _rowid and _rowaddr without requiring the user to call with_row_id or with_row_address. If they are specified in a projection then they will show up in the output in the order in which they were projected. However, if the user calls with_row_id or with_row_offset then we will still always place them at at the end.

In addition, scoring columns (e.g. _distance, _score) can be specified in the projection. To avoid breaking changes we still include those columns even if they are not included in a projection. However, this behavior will change in the future. This PR adds a warning that will be emitted and a new flag (disable_scoring_autoprojection) that can be used to opt in to the new behavior early to avoid the upcoming breaking change.

@github-actions github-actions Bot added enhancement New feature or request python labels Aug 14, 2025
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 78.04296% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.21%. Comparing base (db14186) to head (bfd8c6c).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/io/exec/rowids.rs 66.86% 48 Missing and 8 partials ⚠️
rust/lance/src/dataset/scanner.rs 85.10% 12 Missing and 9 partials ⚠️
rust/lance-datafusion/src/projection.rs 85.98% 14 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4478      +/-   ##
==========================================
+ Coverage   81.88%   82.21%   +0.32%     
==========================================
  Files         304      307       +3     
  Lines      124071   125502    +1431     
  Branches   124071   125502    +1431     
==========================================
+ Hits       101601   103186    +1585     
+ Misses      18652    18468     -184     
- Partials     3818     3848      +30     
Flag Coverage Δ
unittests 82.21% <78.04%> (+0.32%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

disable_scoring_autoprojection: bool, default False
Currently, when a search (vector or full text) is performed, the scoring
column (_distance, _score) is added to the end of the output even when a
projection is specified. In the future, this will change. The columns will
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure who the consumer of these docs is, but it might be useful to explain why it's added (presumably to help with sort operations higher up in the exec plan?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consumer is the user here. This only affects the final projection to the user. So if a sort exec needs the column then it will be there for the sort exec and then dropped later.

There is no real good reason why it is added except when we coded it up originally we thought it might be an interesting column for the user to see and didn't make a way for them to not see it.

Comment thread rust/lance-datafusion/src/projection.rs Outdated
/// a count query)
pub desires_row_addr: bool,
/// Needs the row address converted into a row offset
pub needs_row_offset: bool,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it right to understand this as "must project row offset"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes...maybe?

The whole concept of "projection plan" is a little sketchy. We have the physical projection, which is the columns that we need to grab from the fragment readers. row id and row offset are considered part of the physical projection because they are delivered by the fragment reader.

Then we have the requested_output_expr which is the set of columns the user is actually asking for. This includes columns that don't exist yet (because they need to be dynamically calculated for example)

Then we have the row offset which, currently, is kind of "in between". It gets added by the scanner (not the fragment reader) but it isn't exactly a UDF (I suppose it could be but it needs the dataset metadata which would make it a bit weird for a UDF) so we can't just throw it into the requested_output_expr.

As a compromise, we have this field, whose whole purpose is to inform the scanner that it needs to add the "calculate row offset" exec node.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the field must_add_row_offset

Comment thread rust/lance/src/io/exec/rowids.rs Outdated
Comment thread rust/lance/src/io/exec/rowids.rs Outdated
Self::internal_new(input, Arc::new(frag_id_to_offset))
}

// A dummy placeholder to use in emergency when we can't find the fragment info (better than panic)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you spell out the consequences/implications of this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment. The consequence will be a nonsense row offset that is really large.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up getting rid of this and returning an error instead as that's probably better than a garbage value that might seem accurate (not sure why I thought panic was the only alternative when I wrote the comment)

Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, only have a naming thing.

desires_row_addr: false,
desires_row_id: false,
}
fn add_system_columns(schema: &ArrowSchema) -> ArrowSchema {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find Self::add_system_columns(&full_schema) is a bit confusing. Is it a good idea to move it out of ProjectionPlan as a standalone funcation with new like add_system_columns_in_schema?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I'm not entirely sure what your concern is. Can you elaborate?

This function is supposed to take a schema and then add a bunch of columns to the end of the schema. So we start with a schema [X, Y, Z] and we end with the schema [X, Y, Z, _rowid, _rowaddr, _rowoffset].

At the moment it is static (no &self) because the set of system columns is a constant. In the future, maybe that won't be the case (maybe system columns can be determined by UDFs or some other pluggable mechanism?) Once that happens this method would probably move over to whatever "system column registry" we have.

@westonpace westonpace merged commit eeea03c into lance-format:main Aug 20, 2025
26 checks passed
fangbo pushed a commit to fangbo/lance that referenced this pull request Aug 21, 2025
The row offset (_rowoffset) is similar to the row id and the row address
but slightly different. The row offset is always a number from 0 to
len(dataset).

The row offset is highly unstable (it can change from rows being deleted
or from compaction, updates, or anything else that modifies a dataset)
but it has one useful property. The user does not need to query the list
of available row offset up front for sampling purposes. This makes it
useful in some cases for sampling a dataset.

In addition, this PR reworks the way we do projection. It generalizes
things like _rowaddr and _rowid as "system columns". We can now read
_rowid and _rowaddr without requiring the user to call with_row_id or
with_row_address. If they are specified in a projection then they will
show up in the output in the order in which they were projected.
However, if the user calls with_row_id or with_row_offset then we will
still always place them at at the end.

In addition, scoring columns (e.g. `_distance`, `_score`) can be
specified in the projection. To avoid breaking changes we still include
those columns even if they are not included in a projection. However,
this behavior will change in the future. This PR adds a warning that
will be emitted and a new flag (`disable_scoring_autoprojection`) that
can be used to opt in to the new behavior early to avoid the upcoming
breaking change.
westonpace pushed a commit that referenced this pull request Sep 19, 2025
In #4478 a change was introduced
that added the `ProjectionPlan::full()` function. This will get the
entire schema for a dataset and turn every column into a SQL expression
in order to call `from_expressions`. This is causing us a significant
performance hit when we have datasets that have very large schemas.
Since we know for certain that this function is getting all of the
columns, we should be able to short circuit the work being done in
parsing strings to columns against the schema and instead jump straight
to evaluating each column in the schema as an `Expr`.

Running one of our benchmarks (time in seconds):

- On commit 2b774a4 (before PR 4478): 0.4338
- On commit eeea03c (after PR 4478): 0.7486
- eeea03c with my update in this PR: 0.4274
timsaucer added a commit to rerun-io/lance that referenced this pull request Sep 19, 2025
…#4743)

In lance-format#4478 a change was introduced
that added the `ProjectionPlan::full()` function. This will get the
entire schema for a dataset and turn every column into a SQL expression
in order to call `from_expressions`. This is causing us a significant
performance hit when we have datasets that have very large schemas.
Since we know for certain that this function is getting all of the
columns, we should be able to short circuit the work being done in
parsing strings to columns against the schema and instead jump straight
to evaluating each column in the schema as an `Expr`.

Running one of our benchmarks (time in seconds):

- On commit 2b774a4 (before PR 4478): 0.4338
- On commit eeea03c (after PR 4478): 0.7486
- eeea03c with my update in this PR: 0.4274
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…#4743)

In lance-format#4478 a change was introduced
that added the `ProjectionPlan::full()` function. This will get the
entire schema for a dataset and turn every column into a SQL expression
in order to call `from_expressions`. This is causing us a significant
performance hit when we have datasets that have very large schemas.
Since we know for certain that this function is getting all of the
columns, we should be able to short circuit the work being done in
parsing strings to columns against the schema and instead jump straight
to evaluating each column in the schema as an `Expr`.

Running one of our benchmarks (time in seconds):

- On commit 2b774a4 (before PR 4478): 0.4338
- On commit eeea03c (after PR 4478): 0.7486
- eeea03c with my update in this PR: 0.4274
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants