Support json_object pushdown #10744
Conversation
📝 Walkthrough

Adds TiFlash pushdown and execution support for `json_object`: registers and implements a variadic `FunctionJsonObject`, maps the TiDB `JsonObjectSig` to `"json_object"` for DAG pushdown, implements binary JSON object encoding, and adds unit and fullstack tests covering semantics and error cases.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Planner as TiDB Planner
    participant DAGUtils as DAG Mapper
    participant TiFlash as TiFlash Executor
    participant Func as FunctionJsonObject
    participant JsonBin as JsonBinary Encoder
    participant Output as Result Buffer
    Planner->>DAGUtils: build DAG with JsonObjectSig
    DAGUtils->>TiFlash: map to "json_object" and send DAG
    TiFlash->>Func: invoke json_object per row
    Func->>Func: validate args (even count, string keys)
    Func->>Func: collect key/value pairs
    Func->>JsonBin: buildBinaryJsonObjectInBuffer(keys, values)
    JsonBin->>JsonBin: write type, counts, offsets, keys, values, total size
    JsonBin->>Output: emit binary JSON per row
    Output-->>Planner: return results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dbms/src/Functions/FunctionsJson.h`:
- Around line 1110-1116: The two new throw Exception(...) calls in
FunctionsJson.h omit ErrorCodes; update them to use DB::Exception with an
appropriate ErrorCodes value: for the NULL member name throw use throw
Exception("JSON documents may not contain NULL member names.",
ErrorCodes::BAD_ARGUMENTS) and for the oversized key throw use throw
Exception("TiDB/TiFlash does not yet support JSON objects with the key length >=
65536", ErrorCodes::TOO_LARGE_STRING) so downstream error handling can match on
ErrorCodes.
In `@dbms/src/Functions/tests/gtest_json_object.cpp`:
- Line 15: Remove the unnecessary include of <Columns/ColumnNullable.h> from
gtest_json_object.cpp since the test does not reference ColumnNullable; delete
the `#include` "<Columns/ColumnNullable.h>" directive (or replace it only if a
different, actually-used header is required by tests in this file) and then
rebuild/run tests to verify the unresolved include warning is gone.
In `@dbms/src/TiDB/Decode/JsonBinary.cpp`:
- Around line 1123-1125: The throw for oversized JSON key should include a
concrete ErrorCodes value; update the validation branch that checks key.size
(before UInt16 key_len = key.size) to throw DB::Exception (or Exception) with an
explicit ErrorCodes constant, e.g. replace throw Exception("TiDB/TiFlash does
not yet support JSON objects with the key length >= 65536"); with throw
Exception("TiDB/TiFlash does not yet support JSON objects with the key length >=
65536", ErrorCodes::ARGUMENT_OUT_OF_BOUND); so the Exception construction
includes ErrorCodes::ARGUMENT_OUT_OF_BOUND to classify the error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 03906700-8e79-4001-be11-30024edbffa7
📒 Files selected for processing (7)
dbms/src/Flash/Coprocessor/DAGUtils.cpp
dbms/src/Functions/FunctionsJson.cpp
dbms/src/Functions/FunctionsJson.h
dbms/src/Functions/tests/gtest_json_object.cpp
dbms/src/TiDB/Decode/JsonBinary.cpp
dbms/src/TiDB/Decode/JsonBinary.h
tests/fullstack-test/expr/json_object.test
♻️ Duplicate comments (2)
dbms/src/Functions/tests/gtest_json_object.cpp (1)
15-15: ⚠️ Potential issue | 🔴 Critical

Remove unresolved header include before merge.

Line 15 includes `<Columns/ColumnNullable.h>`, and static analysis reports it as missing; this blocks compilation.

🛠️ Proposed fix

```diff
-#include <Columns/ColumnNullable.h>
 #include <TestUtils/FunctionTestUtils.h>
 #include <TestUtils/TiFlashTestBasic.h>
```

dbms/src/Functions/FunctionsJson.h (1)

1110-1116: ⚠️ Potential issue | 🟡 Minor

Add `ErrorCodes` to new exceptions.

At line 1110 and line 1116, the new `Exception` throws omit `ErrorCodes`, which makes error handling less consistent.

🛠️ Proposed fix

```diff
-        throw Exception("JSON documents may not contain NULL member names.");
+        throw Exception(
+            "JSON documents may not contain NULL member names.",
+            ErrorCodes::ILLEGAL_COLUMN);
-        throw Exception("TiDB/TiFlash does not yet support JSON objects with the key length >= 65536");
+        throw Exception(
+            "TiDB/TiFlash does not yet support JSON objects with the key length >= 65536",
+            ErrorCodes::ILLEGAL_COLUMN);
```

As per coding guidelines, use `DB::Exception` for error handling in C++. Pattern: `throw Exception("Message", ErrorCodes::SOME_CODE);`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@dbms/src/Functions/FunctionsJson.h`:
- Around line 1110-1116: The two new throws in FunctionsJson.h (the NULL
member-name check and the key length >= 65536 check) must use DB::Exception with
an ErrorCodes enum value; update both throw sites to throw Exception("JSON
documents may not contain NULL member names.", ErrorCodes::BAD_ARGUMENTS) and
throw Exception("TiDB/TiFlash does not yet support JSON objects with the key
length >= 65536", ErrorCodes::BAD_ARGUMENTS) (or another appropriate ErrorCodes
constant), and add/include the ErrorCodes declaration if not present so the
identifiers resolve.
In `@dbms/src/Functions/tests/gtest_json_object.cpp`:
- Line 15: Remove the unresolved include <Columns/ColumnNullable.h> from the
test file (gtest_json_object.cpp); locate the include line that references
ColumnNullable.h and delete it (or replace it with the correct existing header
if the test actually needs nullable column types—verify usage of ColumnNullable
or ColumnNullable::Ptr in the file and include the proper header name instead).
Ensure compilation passes by either removing unused dependency or adding the
correct header that exists in the codebase.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a7c32392-2130-416a-9655-2e3c3a5f255c
📒 Files selected for processing (2)
dbms/src/Functions/FunctionsJson.h
dbms/src/Functions/tests/gtest_json_object.cpp
dbms/src/Functions/FunctionsJson.h (outdated)

```cpp
{
    const auto * key_nullmap = nullmaps[col];
    if (!sources[col] || (key_nullmap && (*key_nullmap)[i]))
        throw Exception("JSON documents may not contain NULL member names.");
```
[P2] Missing ErrorCodes on json_object execution-time errors
Why: DB::Exception defaults to code 0 when no error code is provided, which can misclassify user-facing errors and violate project error-handling guidance. The error code 0 is propagated into DAG responses, impacting observability and error classification downstream.
Evidence: throw Exception("JSON documents may not contain NULL member names."); at dbms/src/Functions/FunctionsJson.h:1110
The Exception constructor defaults to code 0 (dbms/src/Common/Exception.h:48), and this code is then observable in tipb::Error responses (dbms/src/Flash/Coprocessor/DAGDriver.cpp:307, 335). Note that other execution-path exceptions in FunctionsJson.h also lack error codes (lines 1116, 1696).
Issue addressed - ErrorCodes added to json_object error handling.
dbms/src/Functions/FunctionsJson.h (outdated)

```cpp
    reserve_size += source ? source->getSizeForReserve() : rows;
JsonBinary::JsonBinaryWriteBuffer write_buffer(data_to, reserve_size);

std::map<String, JsonBinary> key_value_map;
```
[P2] json_object per-row std::map + key copies (allocation-heavy)
Why: Repeated map node allocations + String key copies per output row can add avoidable CPU/allocator overhead in large scans. The function constructs an ordered std::map<String, JsonBinary> for every row and materializes each key into a new std::string, causing repeated node allocations/deallocations in a tight loop.
Evidence: std::map<String, JsonBinary> key_value_map; ... key_value_map.clear(); ... String key(...); at dbms/src/Functions/FunctionsJson.h:1095
The map is required to maintain key ordering (as JsonBinary lookup assumes ascending order per dbms/src/TiDB/Decode/JsonBinary.cpp:924), but the current implementation is allocation-intensive. A lower-allocation approach would need to preserve ordering and "last wins" duplicate key semantics.
Issue addressed - allocation optimizations applied to reduce per-row overhead.
🧹 Nitpick comments (3)

dbms/src/Functions/tests/gtest_json_object.cpp (2)

104-122: Cover the max accepted key length too.

The error case proves 65536 is rejected, but the off-by-one risk is at 65535. A single success case at `std::numeric_limits<UInt16>::max()` would guard both the function-level and `JsonBinary`-level checks against drifting to `>=`.

67-100: Exercise a lookup path in the new tests.

These cases only validate the result via `cast_json_as_string()`, which exercises sequential decoding but not the search path that `json_extract()`/`searchObjectKey()` depend on. Adding one assertion against `json_extract(json_object(...), '$.a')` or `json_keys(json_object(...))` would pin down the new binary-object invariants.

dbms/src/TiDB/Decode/JsonBinary.cpp (1)

1104-1174: Make the key-ordering contract explicit in the builder.

`searchObjectKey()` later assumes strictly ascending keys, but `buildBinaryJsonObjectInBuffer()` will currently serialize whatever order it gets. Today that only works because `FunctionJsonObject` sorts and de-dupes first, so the invariant is easy to violate from a future call site.

One defensive option:

```diff
 void JsonBinary::buildBinaryJsonObjectInBuffer(
     const std::vector<StringRef> & keys,
     const std::vector<JsonBinary> & values,
     JsonBinaryWriteBuffer & write_buffer)
 {
     RUNTIME_CHECK(keys.size() == values.size());
+    for (size_t i = 1; i < keys.size(); ++i)
+        RUNTIME_CHECK(keys[i - 1] < keys[i]);
     write_buffer.write(TYPE_CODE_OBJECT);
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@dbms/src/Functions/tests/gtest_json_object.cpp`:
- Around line 104-122: Add a positive test that uses a key of length
std::numeric_limits<UInt16>::max() to ensure the boundary is accepted: within
the TEST_F(TestJsonObject, TestErrors) block after the existing too-long-key
assertion, create a key string of length std::numeric_limits<UInt16>::max(),
call executeFunction("json_object", {createColumn<String>({that_key}), value})
and assert it does not throw (or assert successful result) so the max accepted
key length is covered; reference executeFunction, json_object,
createColumn<String>, and std::numeric_limits<UInt16>::max() when locating where
to add this case.
- Around line 67-100: In TestJsonObject::TestBasicSemantics, add an assertion
that exercises the lookup/search path (not just cast_json_as_string) by calling
json_extract or json_keys on the json_object result produced by
executeFunctionWithCast; e.g., after the block that builds inputs with keys "a"
and "b" (the one using executeFunctionWithCast({0,1,2,3}, inputs)), call
executeFunctionWithCast or the test helper to run json_extract(json_object(...),
'$.a') (or json_keys(json_object(...))) and ASSERT_COLUMN_EQ the expected values
so that json_extract/searchObjectKey invariants are validated in addition to
sequential decoding. Ensure you reference the same inputs/res variable and place
the assertion immediately after that block.
In `@dbms/src/TiDB/Decode/JsonBinary.cpp`:
- Around line 1104-1174: buildBinaryJsonObjectInBuffer currently writes keys in
whatever order it receives but searchObjectKey expects strictly ascending keys;
update buildBinaryJsonObjectInBuffer to enforce the contract by either (a)
sorting the keys (and reordering values in parallel) and removing duplicates
before encoding, or (b) adding a clear runtime check/assert that validates keys
are strictly increasing and throws a descriptive Exception if not; reference the
functions buildBinaryJsonObjectInBuffer and searchObjectKey and the caller
FunctionJsonObject to locate where to apply the sort+dedupe or the assertion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9c386c3b-3e7a-479f-91df-d0ab12a11750
📒 Files selected for processing (3)
dbms/src/Functions/FunctionsJson.h
dbms/src/Functions/tests/gtest_json_object.cpp
dbms/src/TiDB/Decode/JsonBinary.cpp
Co-authored-by: Liqi Geng <gengliqiii@gmail.com>
```cpp
}
for (const auto arg_idx : ext::range(0, arguments.size()))
{
    if (!arguments[arg_idx]->onlyNull())
```
When `arg_idx % 2 == 0` and `arguments[arg_idx]->onlyNull() == true`, I think we can throw an exception here, as JSON documents may not contain NULL member names.
[APPROVAL NOTIFIER] This PR is APPROVED.

This pull request has been approved by: gengliqi, xzhangxian1008. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
What problem does this PR solve?
Issue Number: close #10743
Problem Summary:
`json_object` can already be planned for TiFlash pushdown on the TiDB side, but TiFlash execution on `master` was still missing the function implementation and DAG mapping. This PR adds the missing TiFlash support so pushed-down `json_object` expressions can run end-to-end.

What is changed and how it works?

- `json_object` execution support in the JSON function implementation
- `json_object` in the DAG/function mapping used by pushdown execution
- `EXPLAIN` assertion commented out temporarily until the TiDB-side explain output change is merged

Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Serialization
Bug Fixes
Tests