feat: introduce GenerateRel for lateral view and unnest operations by EpsilonPrime · Pull Request #917 · substrait-io/substrait

EpsilonPrime · 2025-12-13T00:46:31Z

Add GenerateRel, a new relational operator that applies table functions to produce zero or more output rows per input row. This enables SQL LATERAL VIEW, EXPLODE, and UNNEST operations for flattening nested data structures.

Changes

Adds TableFunction message based on original plans for such a message
Adds GenerateRel message which is the first use of table functions

Key Design Decisions

Dedicated TableFunction message: Provides type safety and clear semantics for multi-row producing functions, following the pattern of ScalarFunction, WindowFunction, and AggregateFunction.
preserve_on_empty flag: Controls NULL handling when generators produce zero rows. Named to reflect behavior (row preservation) rather than SQL join terminology ("outer"). Equivalent to Spark's LATERAL VIEW OUTER or standard SQL's LEFT JOIN LATERAL.
No automatic ordinal column: Unlike ExpandRel which appends an i32 ordinal, GenerateRel relies on table functions to include position explicitly (e.g., posexplode includes pos field in output) when needed.
TableFunction not in Expression: Table functions produce multiple rows, incompatible with Expression semantics (single value). Only used in relational operators like GenerateRel.
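The row-level behaviour these decisions describe can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Substrait implementation: `generate_rel`, `explode`, and `posexplode` are hypothetical helpers, rows are plain tuples, and a table function is modelled as a callable that returns a list of tuples.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[object, ...]


def explode(values) -> List[Row]:
    """Hypothetical table function: one single-field row per element."""
    return [(v,) for v in (values or [])]


def posexplode(values) -> List[Row]:
    """Hypothetical table function that carries the position explicitly."""
    return [(i, v) for i, v in enumerate(values or [])]


def generate_rel(
    rows: Iterable[Row],
    generator: Callable[[object], List[Row]],
    arg_index: int,
    generated_width: int,
    preserve_on_empty: bool = False,
) -> Iterator[Row]:
    """Apply `generator` to one field of each input row.

    Output rows are the input fields followed by the generated fields.
    No ordinal column is added by the relation itself.
    """
    for row in rows:
        produced = generator(row[arg_index])
        if not produced and preserve_on_empty:
            # Keep the input row and pad the generated fields with NULLs.
            yield row + (None,) * generated_width
            continue
        for generated in produced:
            yield row + generated
```

With `preserve_on_empty=False` an input row whose array is empty simply disappears from the output; with `True` it survives with `None` in the generated columns.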

Comparison to ExpandRel

ExpandRel: Fixed cardinality (plan time), for grouping sets/cube
GenerateRel: Variable cardinality (runtime), for lateral view/unnest

Related Work

Will be used to unfork the changes made in Apache Gluten PR #574.


benbellick

Very excited for this one, as it would be useful for modeling Unnest at my job.

Couple things that I think are high priority:

Could we introduce the schema changes for table functions in simple_extension_schema.yaml?
Could we introduce some proper examples into something like functions_explode.yaml to show these definitions? This also helps ensure that downstream libraries are properly handling these things.
For my particular use case, I would love to be able to represent queries like SELECT UNNEST(ARRAY[1,2]), UNNEST(ARRAY[4,5,6]) which has a zipping behavior rather than cartesian product. What would be the best way to represent these? (One way I can think of is as variadic unnest call which has zipping behavior.)

All in all this will be great and I'm happy to help implement them in the downstream libraries.
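To make the zipping-versus-cartesian distinction concrete, here is a small Python sketch. The helper names are hypothetical, and padding the shorter array with NULLs is an assumption made here for illustration; the thread does not specify what should happen when the arrays differ in length.

```python
from itertools import product, zip_longest


def unnest_zip(*arrays):
    """Variadic unnest with zipping: the i-th output row holds the i-th
    element of every array. Shorter arrays are padded with None (assumed)."""
    return list(zip_longest(*arrays, fillvalue=None))


def unnest_cross(*arrays):
    """What chaining independent single-array unnests would give instead:
    every combination of elements."""
    return list(product(*arrays))
```

For `[1, 2]` and `[4, 5, 6]` the zipped form yields three rows while the cartesian form yields six, which is why a single variadic call is a natural way to express the zipped semantics.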

benbellick · 2025-12-13T19:26:49Z

proto/substrait/algebra.proto

+  //   standard SQL via LEFT JOIN LATERAL.
+  //
+  // Optional; defaults to false.
+  bool preserve_on_empty = 4;


This is a similar idea to this PR here from @yongchul: #890

I wonder if there is some way to unify these ideas?

An alternative would be to make this part of TableFunction but if there are other table expressions they might need this functionality.

Based on my other comment on ApplyRel, this can be actually an enum of CROSS, OUTER, SEMI, and ANTISEMI, default to CROSS and stay in GenerateRel.
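A rough Python sketch of what such an enum could mean per input row follows. The names `ApplyType` and `apply_generator` are hypothetical, and the SEMI/ANTISEMI behaviour shown (keep the input row only when the generator does, or does not, produce rows) is the conventional reading of those terms rather than anything fixed by this PR.

```python
from enum import Enum


class ApplyType(Enum):
    CROSS = "cross"        # drop input rows that generate nothing
    OUTER = "outer"        # keep them, NULL-padded (preserve_on_empty)
    SEMI = "semi"          # emit the input row once if anything is generated
    ANTISEMI = "antisemi"  # emit the input row only if nothing is generated


def apply_generator(rows, generator, generated_width, apply_type=ApplyType.CROSS):
    out = []
    for row in rows:
        produced = generator(row)
        if apply_type is ApplyType.SEMI:
            if produced:
                out.append(row)
        elif apply_type is ApplyType.ANTISEMI:
            if not produced:
                out.append(row)
        elif produced:
            out.extend(row + generated for generated in produced)
        elif apply_type is ApplyType.OUTER:
            out.append(row + (None,) * generated_width)
    return out
```

Note that SEMI and ANTISEMI change the output schema (input fields only), which is one reason an enum is a larger change than a boolean flag.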

benbellick · 2025-12-13T19:30:02Z

proto/substrait/algebra.proto

+  // For compatibility and optimization hints.
+  substrait.extensions.AdvancedExtension advanced_extension = 10;


Suggested change

-  // For compatibility and optimization hints.
-  substrait.extensions.AdvancedExtension advanced_extension = 10;
+  substrait.extensions.AdvancedExtension advanced_extension = 10;

None of the other usages of advanced_extension has a comment, so I think let's just drop it to be consistent.

Also, I understand wanting to set the field number to 10 for consistency, but there are actually usages of advanced_extension which are 10, 9, 5, and 4 in this file. Any reason not to just make it 5 so that it is the next field? TBH though this is not particularly important to me.

Having them all the same makes it easier for some proto uses (but I can't name any at the moment). As part of Substrait 1.0 I'd want to see these moved up to 100 or higher. Lower field numbers compress better so it'd be nice to reserve some space (not that we really need 100 reserved field numbers).
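The "lower field numbers compress better" point follows from how protobuf encodes field tags: the tag is the varint encoding of `(field_number << 3) | wire_type`, so field numbers 1 through 15 fit in one byte and 16 through 2047 need two. A small sketch (the helper name `tag_size` is made up for illustration):

```python
def tag_size(field_number: int, wire_type: int = 2) -> int:
    """Number of bytes protobuf uses for a field's tag.

    The tag is a varint of (field_number << 3) | wire_type; each varint
    byte carries 7 payload bits. Wire type 2 is length-delimited, which
    is what an embedded message such as AdvancedExtension uses.
    """
    key = (field_number << 3) | wire_type
    size = 1
    while key >= 0x80:
        key >>= 7
        size += 1
    return size
```

So moving `advanced_extension` from 10 to 100 would cost one extra byte per occurrence, while field number 5 costs the same as 10.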

proto/substrait/algebra.proto

benbellick · 2025-12-13T19:35:37Z

proto/substrait/algebra.proto

+    //
+    // Required.
+    Type output_type = 4;
+  }


FYI this PR has a lot of overlap with this one I wrote before: #876

Though what you have here is more general than what I did, and so I think your approach is better. I just wanted to share in case any of the discussion there is useful. One thing in particular that was discussed there: have you at all thought about variadic handling for table functions?

The output needs to be determinable from the input. So for more complicated structures we likely need to upgrade the grammar. For now we can achieve most of what we need with function overloading.

site/docs/expressions/table_functions.md

site/docs/relations/logical_relations.md

westonpace

I have a general question. There are (at least) two ways in which table functions are utilized. I'm not entirely sure if these are two different concepts or the same concept.

Explode / lateral join / cross apply

I believe GenerateRel captures these cases. The input to the table function is the output of another relation. The arguments will (typically) include field references (or scalar functions applied to field references).

From

The other way in which table functions are applied is by using them directly in the FROM clause, for example SELECT * FROM myfunc(7); see postgres for an example. In this case the arguments are (typically? always?) all literals.

Is the TableFunction message here intended to eventually be usable in both contexts? This PR itself only adds support for the first (which is fine). Still, it seems like the message itself wouldn't have to change. I think the only change I can imagine is adding a new source rel that has no input rels and specifies the table function invocation.

Add GenerateRel, a new relational operator that applies table functions to produce zero or more output rows per input row. This enables SQL LATERAL VIEW, EXPLODE, and UNNEST operations for flattening nested data structures.

- Add TableFunction message (lines 1222-1262)
  - New expression type for functions that produce multiple rows
  - Fields: function_reference, arguments, options, output_type
  - Output type must be a Struct defining generated columns
  - Used exclusively in relational contexts (not in Expression.rex_type)
- Add GenerateRel message (lines 559-639)
  - Applies table function to each input row
  - Fields: common, input, generator, preserve_on_empty, advanced_extension
  - Comprehensive inline documentation with examples
  - Property maintenance: distribution and orderedness not maintained
  - Output schema: [input_fields..., generated_fields...]
  - Comparison to ExpandRel included in comments
- Add GenerateRel to Rel union (line 677)
  - Field number 23 in Rel.rel_type oneof
- site/docs/relations/logical_relations.md
  - Complete Generate Operation section with signature table
  - Common table functions reference (explode, posexplode, unnest)
  - Empty collection handling (preserve_on_empty flag)
  - Detailed examples: array explode, posexplode, map explode
  - Comparison table with ExpandRel
- site/docs/expressions/table_functions.md
  - Comprehensive rewrite of table functions documentation
  - TableFunction message field descriptions
  - Differences from scalar/window/aggregate functions
  - Usage examples with GenerateRel
  - Extension mechanism explanation

1. **Dedicated TableFunction message**: Provides type safety and clear semantics for multi-row producing functions, following the pattern of ScalarFunction, WindowFunction, and AggregateFunction.
2. **preserve_on_empty flag**: Controls NULL handling when generators produce zero rows. Named to reflect behavior (row preservation) rather than SQL join terminology ("outer"). Equivalent to Spark's LATERAL VIEW OUTER or standard SQL's LEFT JOIN LATERAL.
3. **No automatic ordinal column**: Unlike ExpandRel which appends an i32 ordinal, GenerateRel relies on table functions to include position explicitly (e.g., posexplode includes pos field in output).
4. **TableFunction not in Expression**: Table functions produce multiple rows, incompatible with Expression semantics (single value). Only used in relational operators like GenerateRel.

- ExpandRel: Fixed cardinality (plan time), for grouping sets/cube
- GenerateRel: Variable cardinality (runtime), for lateral view/unnest

Based on Apache Gluten PR substrait-io#574 but with significant improvements:

- Well-documented fields (Gluten had minimal documentation)
- Dedicated TableFunction type (vs generic Expression)
- preserve_on_empty flag for outer semantics
- Comprehensive examples and property maintenance rules

None. This is a purely additive change using a new field number in Rel.

- Protobuf compilation verified with protoc
- Follows all Substrait conventions (common field 1, input field 2, advanced_extension field 10)
- No trailing whitespace, proper line endings

EpsilonPrime · 2025-12-22T07:35:13Z

I have a general question. There are (at least) two ways in which table functions are utilized. I'm not entirely sure if these are two different concepts or the same concept.

...

Is the TableFunction message here intended to eventually be usable in both contexts? This PR itself only adds support for the first (which is fine). Still, it seems like the message itself wouldn't have to change. I think the only change I can imagine is adding a new source rel that has no input rels and specifies the table function invocation.

I believe it would work as long as you had a relation that worked just fine without requiring an input (which is currently the case here). I've expanded the implementation to create a TableExpression which gives us even more future capability but we will still have the same requirement -- we need to have a relation that allows us to provide no input.

json_decode is one such function. Getting a little meta, I could also see a function (say substrait_execute) that takes a substrait plan as an argument using this. :)
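The two usages discussed here can be pictured with one function invoked two ways. This is a sketch only: `generate_series`, `table_function_source`, and `generate_over` are hypothetical names, and the "source" form stands in for a relation that takes no input.

```python
def generate_series(start, stop):
    """Hypothetical table function: one row per integer in [start, stop]."""
    return [(i,) for i in range(start, stop + 1)]


def table_function_source(func, *literal_args):
    """FROM-clause style: no input relation, arguments are literals."""
    return func(*literal_args)


def generate_over(rows, func, args_of_row):
    """Lateral style: arguments are computed from each input row, and the
    generated fields are appended to that row."""
    out = []
    for row in rows:
        for generated in func(*args_of_row(row)):
            out.append(row + generated)
    return out
```

The function definition is identical in both cases; only the surrounding relation differs, which matches the observation that the message itself would not need to change.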

benbellick · 2026-01-13T14:58:41Z

@EpsilonPrime the link in your description to #574 is linking to an unrelated substrait issue. I think the intention is to link to apache/gluten#574. Is that correct?

benbellick

I haven't done a complete review yet but just dumping my comments so far. Hopefully will get a chance to do rest later. Thanks!

extensions/functions_table_generic.yaml

benbellick · 2026-01-13T15:18:03Z

extensions/functions_table_generic.yaml

@@ -0,0 +1,109 @@
+%YAML 1.2
+---
+urn: extension:io.substrait:functions_table_generic


Very minor nit, but IMO adding _generic on the end doesn't really provide any extra information. Might as well just keep the URN shorter. Same goes with the file name. Feel free to ignore this one though

Suggested change

urn: extension:io.substrait:functions_table_generic

urn: extension:io.substrait:functions_table

This is mirroring the aggregate functions. I would go with table_functions over functions_table. One could argue that functions is also devoid of information, but table by itself fails to have enough information to make it clear what it is on its own.

benbellick · 2026-01-13T15:20:46Z

extensions/functions_table_generic.yaml

+---
+urn: extension:io.substrait:functions_table_generic
+table_functions:
+  - name: explode


You have two explodes in the yaml file here. I don't believe that this is technically incorrect, though I opened up an issue (#931) proposing making it the case that function names must be unique.

Is it possible / sensible to make this one function with two impls? And then just have the description describe both? I think this is simpler to handle for both implementers and easier for humans to read.
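As an analogy for "one function with two impls", a single Python entry point can dispatch on the argument type. This is purely illustrative; in the YAML the two impls would be distinguished by their declared argument types rather than by runtime checks.

```python
def explode(value):
    """One name, two implementations selected by argument type."""
    if isinstance(value, dict):
        # Map impl: each entry becomes a (key, value) row.
        return [(k, v) for k, v in value.items()]
    if isinstance(value, (list, tuple)):
        # List impl: each element becomes a single-field row.
        return [(v,) for v in value]
    raise TypeError(f"explode expects a list or a map, got {type(value).__name__}")
```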

benbellick · 2026-01-13T15:25:33Z

extensions/functions_table_generic.yaml

+          - name: input4
+            value: list<T4>
+            description: Fourth array to unnest.
+        return: struct<T1, T2, T3, T4>


This was kind of the exact difficulty I was dealing with in my version of the unnest PR. I think this is a fine solution for now, but it will be great in the future to figure out a more flexible grammar for describing this behavior for an arbitrarily large number of inputs.

tests/cases/table_generic/explode_array.test

benbellick · 2026-01-13T16:04:49Z

proto/substrait/algebra.proto

  }
 }

+// TableExpression represents expressions that produce zero or more rows,


If TableExpression only makes sense in the context of GenerateRel, should we move this proto definition to be within the scope of GenerateRel?

Table expressions are definitely possible elsewhere. I could see read relations being redesigned using table expressions.

benbellick · 2026-01-13T16:05:36Z

proto/substrait/algebra.proto

+// TableExpression expressions cannot be composed with other expressions.
+message TableExpression {
+  oneof table_expr_type {
+    TableFunction table_function = 1;


What are the other things here that you anticipate adding in the future?

It's possible that table expressions could have their own operators (perhaps implementing joins, projects, or fetches). I'm using Expression as a guide here.

I plan to send a proposal to add ApplyRel by next week, which will behave like a join at a high level but take a RelOp tree on the subquery part. We could consolidate that with this GenerateRel, with the following extensions to GenerateRel.

TableExpression should have RelOp subquery in its oneof.

subquery should reference the input field by scoped field reference (outer reference).

There are join-type-equivalent options (cross, outer, semi, and anti-semi). These will determine the behavior of the apply as well as the output schema.

@EpsilonPrime do you see us needing a separate ApplyRel, or extending GenerateRel?

One needs preserve on empty, the other RelOp. That feels different enough to me. Fortunately there's enough room for ApplyRel in TableExpression.

yongchul · 2026-01-28T22:11:38Z

extensions/functions_table_generic.yaml

+
+      Some systems call this unnest.


This also applies to the other explode

yongchul · 2026-01-28T22:14:34Z

extensions/functions_table_generic.yaml

+
+      For each input row, this function generates one output row for each element
+      in the input array. The output is a struct with two fields:
+      - 'pos': 0-indexed position of the element in the array (i64)


silly thing. some systems do use 1-indexed. do we want to have it as a function option?

Could conversion from 0-indexed to 1-indexed be something that is handled by their Substrait layer? If so, then we always canonicalize one way.
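A sketch of what canonicalizing in the Substrait layer could look like: the plan always carries 0-indexed positions, and a system whose native function is 1-indexed shifts at the boundary. All three helper names are hypothetical.

```python
def posexplode(values):
    """Canonical form assumed here: 0-indexed positions."""
    return [(pos, value) for pos, value in enumerate(values)]


def to_one_based(rows):
    """Consumer side: present canonical rows to a 1-indexed engine."""
    return [(pos + 1, value) for pos, value in rows]


def from_one_based(rows):
    """Producer side: translate a 1-indexed engine's rows to canonical form."""
    return [(pos - 1, value) for pos, value in rows]
```

The two translations are inverses, so round-tripping through the canonical form loses nothing.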

yongchul · 2026-01-28T22:18:23Z

extensions/functions_table_generic.yaml

+    impls:
+      - args:
+          - name: input
+            value: list<any1>
+            description: The array to unnest.
+        return: struct<any1>
+      - args:
+          - name: input1
+            value: list<any1>
+            description: First array to unnest.
+          - name: input2
+            value: list<any2>
+            description: Second array to unnest.
+        return: struct<any1, any2>
+      - args:
+          - name: input1
+            value: list<any1>
+            description: First array to unnest.
+          - name: input2
+            value: list<any2>
+            description: Second array to unnest.
+          - name: input3
+            value: list<any3>
+            description: Third array to unnest.
+        return: struct<any1, any2, any3>
+      - args:
+          - name: input1
+            value: list<any1>
+            description: First array to unnest.
+          - name: input2
+            value: list<any2>
+            description: Second array to unnest.
+          - name: input3
+            value: list<any3>
+            description: Third array to unnest.
+          - name: input4
+            value: list<any4>
+            description: Fourth array to unnest.
+        return: struct<any1, any2, any3, any4>


can this be expressed using inconsistent variadic arguments? or we need to extend the type grammar?

We need to extend the type grammar. I left that to a future project.
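The limitation can be seen by writing the return-type rule as code. The rule itself is trivial for any arity, but without a grammar to express it each arity has to be listed as a separate impl, which is why the YAML stops at four inputs. The function below is a hypothetical sketch operating on type strings, not part of any Substrait library.

```python
MAX_UNNEST_ARITY = 4  # number of overloads enumerated in the YAML above


def unnest_return_type(arg_types):
    """Derive struct<...> from list<...> arguments, one overload per arity."""
    arity = len(arg_types)
    if not 1 <= arity <= MAX_UNNEST_ARITY:
        raise ValueError(f"no unnest overload takes {arity} arguments")
    element_types = []
    for arg_type in arg_types:
        if not (arg_type.startswith("list<") and arg_type.endswith(">")):
            raise TypeError(f"expected a list type, got {arg_type}")
        element_types.append(arg_type[len("list<"):-1])
    return "struct<" + ", ".join(element_types) + ">"
```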

yongchul · 2026-01-28T22:30:08Z

proto/substrait/algebra.proto

+  //   This is equivalent to Spark's "LATERAL VIEW OUTER" or can be achieved in
+  //   standard SQL via LEFT JOIN LATERAL.
+  //
+  // Optional; defaults to false.


nit: Optional:?

yongchul · 2026-01-28T22:44:20Z

proto/substrait/algebra.proto

+// TableExpression expressions cannot be composed with other expressions.
+message TableExpression {
+  oneof table_expr_type {
+    TableFunction table_function = 1;


I plan to send a proposal to add ApplyRel by next week, which will behave like a join at a high level but take a RelOp tree on the subquery part. We could consolidate that with this GenerateRel, with the following extensions to GenerateRel.

TableExpression should have RelOp subquery in its oneof.

subquery should reference the input field by scoped field reference (outer reference).

There are join-type-equivalent options (cross, outer, semi, and anti-semi). These will determine the behavior of the apply as well as the output schema.

@EpsilonPrime do you see us needing a separate ApplyRel, or extending GenerateRel?

yongchul · 2026-01-28T23:13:28Z

site/docs/expressions/table_functions.md

+- `explode(array<i32>)` → `Struct{value: i32}`
+- `posexplode(array<string>)` → `Struct{pos: i64, value: string}`
+- `explode(map<string, i32>)` → `Struct{key: string, value: i32}`


wait.. do we have named struct? 😄

Section on named structs here

Some more info here

My understanding is that they are actually just STRUCTS always, but you can use them when it is convenient to pretend something has names 🤷

yongchul · 2026-01-28T23:18:09Z

site/docs/expressions/table_functions.md

+| **Input scope** | Single row | Window of rows | Group of rows | Single row |
+| **Output cardinality** | Exactly 1 value | Exactly 1 value | Exactly 1 value | 0 or more rows |
+| **Output type** | Scalar type | Scalar type | Scalar type | Struct type |
+| **Can appear in Expression** | Yes | Yes (WindowFunction) | No | No |


it could appear in the make_array or make_struct like functions if we have.. no?

yongchul · 2026-01-28T23:38:03Z

site/docs/relations/logical_relations.md

+|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Inputs               | 1                                                                                                                                                              |
+| Outputs              | 1                                                                                                                                                              |
+| Property Maintenance | Distribution and orderedness are not maintained. The variable number of output rows per input breaks distribution properties, and row multiplication breaks ordering. |


input distribution is preserved as long as the input fields are intact and available -- which should be the case by definition.
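A small experiment illustrates the claim. It assumes a simple modulo partitioner on the first field and hypothetical helper names; the point is only that exploding rows inside a partition never moves a row to a different partition, because the key field is carried through unchanged.

```python
def partition_by_key(rows, num_partitions):
    """Distribute rows by their first field (stand-in for a hash partitioner)."""
    partitions = {p: [] for p in range(num_partitions)}
    for row in rows:
        partitions[row[0] % num_partitions].append(row)
    return partitions


def explode_rows(rows):
    """Explode the array in field 1, keeping all input fields."""
    return [row + (value,) for row in rows for value in row[1]]


def still_distributed(partitions, num_partitions):
    """True if every row is in the partition its key maps to."""
    return all(
        row[0] % num_partitions == p
        for p, rows in partitions.items()
        for row in rows
    )
```

Ordering is a separate question and is not addressed by this sketch.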

yongchul · 2026-01-28T23:39:33Z

site/docs/relations/logical_relations.md

+| Inputs               | 1                                                                                                                                                              |
+| Outputs              | 1                                                                                                                                                              |
+| Property Maintenance | Distribution and orderedness are not maintained. The variable number of output rows per input breaks distribution properties, and row multiplication breaks ordering. |
+| Direct Output Order  | Input fields followed by generated fields. The output schema is: `[input_field_1, ..., input_field_N, generated_field_1, ..., generated_field_M]` where generated fields come from the table function's output type. Unlike ExpandRel which appends an i32 ordinal column, GenerateRel does NOT automatically add an ordinal column. Table functions that need position information (like posexplode) include it explicitly in their output_type. The RelCommon.emit field can be used to reorder columns or project out unwanted fields. |


Isn't the output type a STRUCT? So we automatically flatten the struct in direct output order?

Returning a struct is possible but it will be harder to use. We have columns, so we might as well use them.
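The flattening and the role of emit can be sketched as follows. The helpers are hypothetical and work on field names only; `output_mapping` mirrors the list of column indices carried by RelCommon.emit.

```python
def direct_output_schema(input_fields, generated_struct_fields):
    """Input fields followed by the struct's fields, flattened into columns."""
    return list(input_fields) + list(generated_struct_fields)


def apply_emit(schema, output_mapping):
    """Reorder or drop columns by index, as an emit mapping would."""
    return [schema[i] for i in output_mapping]
```

For example, exploding `tags` with a position-aware function and then dropping the original array column is a single emit mapping over the flattened schema.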

yongchul · 2026-01-28T23:43:39Z

site/docs/relations/logical_relations.md

+| Outputs              | 1                                                                                                                                                              |
+| Property Maintenance | Distribution and orderedness are not maintained. The variable number of output rows per input breaks distribution properties, and row multiplication breaks ordering. |
+| Direct Output Order  | Input fields followed by generated fields. The output schema is: `[input_field_1, ..., input_field_N, generated_field_1, ..., generated_field_M]` where generated fields come from the table function's output type. Unlike ExpandRel which appends an i32 ordinal column, GenerateRel does NOT automatically add an ordinal column. Table functions that need position information (like posexplode) include it explicitly in their output_type. The RelCommon.emit field can be used to reorder columns or project out unwanted fields. |
+


can the position information be part of optional behavior of generator rel? In that way, we don't need extra function and implementation can opt to pick which mode it wants... something like

enum OffsetField {
  NONE,        // not generating offset value (default)
  ZERO_BASED,  // 0-based index
  ONE_BASED    // 1-based index
}

when the option is there, the last field is the offset.
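A sketch of how this proposal would behave, modelled in Python with a hypothetical `generate_with_offset` helper: the relation, not the table function, appends the offset as the last field when the option is set.

```python
from enum import Enum


class OffsetField(Enum):
    NONE = 0        # no offset column (default)
    ZERO_BASED = 1  # offset starts at 0
    ONE_BASED = 2   # offset starts at 1


def generate_with_offset(row, produced, offset_field=OffsetField.NONE):
    """Combine one input row with its generated rows, optionally appending
    the generated row's offset as the last field."""
    out = []
    for index, generated in enumerate(produced):
        new_row = row + generated
        if offset_field is OffsetField.ZERO_BASED:
            new_row += (index,)
        elif offset_field is OffsetField.ONE_BASED:
            new_row += (index + 1,)
        out.append(new_row)
    return out
```

Under this design a separate posexplode function would be unnecessary, at the cost of GenerateRel carrying ordinal logic that the PR currently leaves to table functions.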

EpsilonPrime force-pushed the generaterel branch from 7553b03 to c1d33fc Compare December 13, 2025 00:56

EpsilonPrime marked this pull request as ready for review December 13, 2025 00:57

EpsilonPrime requested review from cpcloud, jacques-n, vbarua, westonpace and yongchul as code owners December 13, 2025 00:57

benbellick reviewed Dec 13, 2025

westonpace reviewed Dec 19, 2025

EpsilonPrime force-pushed the generaterel branch from 7c02259 to bbed1e8 Compare December 22, 2025 02:21

EpsilonPrime force-pushed the generaterel branch 2 times, most recently from 773d5a7 to b08bb10 Compare December 22, 2025 05:38

review notes

afdef63

EpsilonPrime force-pushed the generaterel branch from b08bb10 to afdef63 Compare December 22, 2025 05:42

EpsilonPrime added 3 commits December 21, 2025 21:49

examples

28d923c

non-variadic unnest

4d5ab68

syntax fixes

1d8b067

EpsilonPrime force-pushed the generaterel branch from 3892139 to 1d8b067 Compare December 22, 2025 06:25

fix coverage test

e53b43c

table function tests

b250ed0

EpsilonPrime force-pushed the generaterel branch from a79c08f to b250ed0 Compare December 22, 2025 07:41

better test case parser error messages

4152410

EpsilonPrime requested review from benbellick and westonpace January 12, 2026 23:40

benbellick reviewed Jan 13, 2026

EpsilonPrime added 2 commits January 13, 2026 17:52

added the description of table function test format

c640f8b

use any1 instead of convenient parameter names

bac2fd6

benbellick mentioned this pull request Jan 21, 2026

Substrait: Round-trip generate_series table function scans via ReadRel advanced_extension apache/datafusion#19407

Open

Merge branch 'main' into generaterel

12e7143

EpsilonPrime requested a review from benbellick January 23, 2026 05:53

yongchul reviewed Jan 28, 2026

benbellick mentioned this pull request Jan 30, 2026

feat: introduction of simple table functions #876

Closed

yongchul mentioned this pull request Mar 25, 2026

feat(extensions): add subscript_operator and index_of functions #1020

Open


EpsilonPrime commented Dec 13, 2025 • edited by jacques-n
