[CALCITE-4460] Support custom delimiter when parsing CSV tables#4893
[CALCITE-4460] Support custom delimiter when parsing CSV tables#4893Diveyam-Mishra wants to merge 2 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for configuring a custom field separator when parsing CSV tables in Calcite’s file adapter, enabling non-comma delimited files (e.g., |, \t) via model JSON while preserving comma as the default.
Changes:
- Add
"separator"operand parsing and validation inCsvTableFactory, defaulting toCSVParser.DEFAULT_SEPARATORwhen omitted. - Thread the separator through
CsvTable/CsvTranslatableTableintoCsvEnumerator, including row-type deduction and CSV reader creation. - Add tests and fixtures covering pipe delimiter support, invalid multi-character separators, and default behavior compatibility.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| file/src/main/java/org/apache/calcite/adapter/file/CsvTableFactory.java | Parses/validates optional separator operand and passes it into table construction. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvTable.java | Stores separator and uses it for row-type deduction and field type inference. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvTranslatableTable.java | Passes separator into CsvEnumerator during projection execution. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvEnumerator.java | Adds separator-aware constructors and uses it in openCsv and deduceRowType. |
| file/src/main/java/org/apache/calcite/adapter/file/CsvStreamReader.java | Adds a separator-accepting constructor for streaming reads. |
| file/src/test/java/org/apache/calcite/adapter/file/FileAdapterTest.java | Adds coverage for custom separator, invalid separator, and default behavior. |
| file/src/test/resources/custom-separator.json | New model demonstrating `"separator": " |
| file/src/test/resources/sales-csv/PIPE_DELIMITED.csv | New pipe-delimited CSV fixture for tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
18bf557 to
9165ba4
Compare
|
It would be worth documenting this new feature. |
| DEPTNO:int|NAME:string | ||
| 10|"Sales" | ||
| 20|"Marketing" | ||
| 30|"Accounts" |
There was a problem hiding this comment.
Could you add a test record containing an escape character? I'd like to verify whether the custom delimiter and escape character function correctly in combination. Additionally, I had another thought: regarding support for custom escape characters—I'm not sure if Calcite supports this, though naturally, this doesn't necessarily need to be implemented in the current PR. Overall, this PR looks good.
|
I notice that you added duplicate constructors for You should probably change user documentation. Make sure you test escaping/quoting with the custom separator. +1 to merge when the above issues are fixed. |
9165ba4 to
06d9ccd
Compare
|
I have approved the CI, you can take a look at the detail. |
2435a40 to
14096ae
Compare
|
I think this is done from my end now U can check @xiedeyantu |
| @Override public Enumerator<@Nullable Object[]> enumerator() { | ||
| return new CsvEnumerator<>(source, cancelFlag, false, null, | ||
| CsvEnumerator.arrayConverter(fieldTypes, fields, false)); | ||
| CsvEnumerator.arrayConverter(fieldTypes, fields, false), ','); |
There was a problem hiding this comment.
Can we add a default parameter of ',' without modifying so many constructors, and have the constructors without the "char separator" also call the constructors with the "char separator"?
There was a problem hiding this comment.
@xiedeyantu In my opinion it is better to modify the constructors (and calls to them) than to create several overloaded constructors. This is allowed in this case because the constructors are package-protected.
There was a problem hiding this comment.
If you prefer not to add the constructor, I'm fine with that as long as the functionality is met. I was just trying to say that adding a constructor could save us from a lot of changes (although in this PR, even though a new constructor was added, some unnecessary changes weren't removed). I think this PR is ready if we revert it to the previous version(no new constructor). What do you think? @julianhyde
There was a problem hiding this comment.
I can make the necessary changes and revert to the previous approach if that’s preferred. Otherwise, I’ll wait for confirmation before updating the PR further.
| * [Elasticsearch adapter](elasticsearch_adapter.html) | ||
| (<a href="{{ site.apiRoot }}/org/apache/calcite/adapter/elasticsearch/package-summary.html">calcite-elasticsearch</a>) | ||
| * [File adapter](file_adapter.html) (<a href="{{ site.apiRoot }}/org/apache/calcite/adapter/file/package-summary.html">calcite-file</a>) | ||
| * [File adapter](file_adapter.html) (<a href="{{ site.apiRoot }}/org/apache/calcite/adapter/file/package-summary.html">calcite-file</a>), including support for custom CSV delimiters |
There was a problem hiding this comment.
I think we can remove ", including support for custom CSV delimiters".
14096ae to
331defd
Compare
|



Jira Link
CALCITE-4460
Changes Proposed
This PR adds support for custom field delimiters (separators) in the Calcite file adapter. Previously, the adapter hardcoded comma (
,) as the only separator, preventing use of other common delimiters like pipes (|) or tabs.What changed
CsvTableFactory: Parses an optional"separator"operand from the model JSON. Validates it is exactly one character; rejects multi-character values withIllegalArgumentException.CsvTable: Addedseparatorfield and a new constructor that accepts it. Passes separator toCsvEnumerator.deduceRowType().CsvTranslatableTable: Added separator constructor; passes separator toCsvEnumeratorinproject().CsvEnumerator: Added separator parameter to constructors,openCsv(), anddeduceRowType(). UsesCSVReader(reader, separator).CsvStreamReader: Added package-private constructorCsvStreamReader(Source, char).All existing constructors default separator to
CSVParser.DEFAULT_SEPARATOR(comma), preserving full backward compatibility.Scope
Per guidance from @julianhyde, changes are limited to the file adapter only. The CSV adapter (example code) is kept simple and unchanged.
Example configuration
{ "version": "1.0", "defaultSchema": "TEST", "schemas": [ { "name": "TEST", "tables": [ { "name": "MY_TABLE", "type": "custom", "factory": "org.apache.calcite.adapter.file.CsvTableFactory", "operand": { "file": "data.psv", "separator": "|" } } ] } ] }Tests added
testCsvCustomSeparatorPipe — reads pipe-delimited data, verifies correct parsing
testCsvCustomSeparatorInvalidMultiChar — verifies multi-character separator is rejected
testCsvDefaultSeparatorBackwardCompat — verifies omitting separator defaults to comma