Description
I have a table that reads correctly using Spark and the Delta Lake libraries, but I'm having trouble reading it via pv.
Do you know which downstream dependency could be giving me this error?
Error: ArrowError(ExternalError(Execution("Failed to map column projection for field mycolumn. Incompatible data types List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })")))
I checked the schema from the delta transaction log and didn't see a hardcoded item or element:
❯ aws s3 cp s3://mybucket/year=2022/month=6/day=9/myprefix/_delta_log/00000000000000000000.json - | head -n 3 | tail -n 1 | jq '.metaData.schemaString | fromjson | .fields[] | select(.name == "mycolumn")'
{
  "name": "mycolumn",
  "type": {
    "type": "array",
    "elementType": "string",
    "containsNull": true
  },
  "nullable": true,
  "metadata": {}
}
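The same lookup can be sketched in Python, which makes the double JSON decoding explicit. This is a minimal sketch, not part of any Delta tooling: `find_field` is a hypothetical helper, and `log_line` is a stand-in built from the output above rather than the real log entry.

```python
import json

# Stand-in for the metaData action line of the transaction log, trimmed to
# the one field shown above (assumption: the real log holds the full schema).
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {
            "name": "mycolumn",
            "type": {"type": "array", "elementType": "string", "containsNull": True},
            "nullable": True,
            "metadata": {},
        }
    ],
})
log_line = json.dumps({"metaData": {"schemaString": schema_string}})


def find_field(line, name):
    """Return the schema entry for `name` from a metaData log line, or None."""
    action = json.loads(line)
    # schemaString is JSON encoded as a string, so it needs a second parse
    # (this is what `fromjson` does in the jq filter).
    schema = json.loads(action["metaData"]["schemaString"])
    for field in schema["fields"]:
        if field["name"] == name:
            return field
    return None


field = find_field(log_line, "mycolumn")
print(json.dumps(field, indent=2))
# Keys of the array type; none of them names the list's child field.
print(sorted(field["type"]))
```

The array type only carries `type`, `elementType`, and `containsNull`, so the child field name has to come from whatever reads or writes the data, not from the log.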
When I look at the schema of a sample parquet file on s3, I do indeed see that the item in the list is called element:
pqrs schema =(s5cmd cat s3://mybucket/year=2022/month=6/day=9/myprefix/_partition=00001/part-00037-cb2e71c3-4f26-4de0-9e9a-18298489ccdc.c000.snappy.parquet)
...
message spark_schema {
  ...
  OPTIONAL group mycolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}
I see this exact error is from here: https://github.com/apache/arrow-datafusion/blob/aad82fbb32dc1bb4d03e8b36297f8c9a3148df89/datafusion/core/src/physical_plan/file_format/mod.rs#L253
And I also see that element is hardcoded in delta-rs here:
https://github.com/delta-io/delta-rs/blob/83b8296fa5d55ebe050b022ed583dc57152221fe/rust/src/delta_arrow.rs#L38-L48 (pr: delta-io/delta-rs#228)
But I can't seem to find where the schema mismatch is coming from.
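To illustrate the mismatch itself: the two types in the error differ only in the name of the list's child field (element vs item). Below is a minimal plain-Python model of that comparison. `Field`, `ListType`, and `same_ignoring_child_name` are hypothetical stand-ins written for this sketch; they are not Arrow, DataFusion, or delta-rs code, and the actual check in those libraries may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Simplified stand-in for an Arrow field (hypothetical model)."""
    name: str
    data_type: object
    nullable: bool = True


@dataclass(frozen=True)
class ListType:
    """Simplified stand-in for Arrow's List(Field) type (hypothetical model)."""
    child: Field


def same_ignoring_child_name(a, b):
    """Compare two types while ignoring the names of list child fields."""
    if isinstance(a, ListType) and isinstance(b, ListType):
        return (
            a.child.nullable == b.child.nullable
            and same_ignoring_child_name(a.child.data_type, b.child.data_type)
        )
    return a == b


# The two sides of the error message, in the order they appear there.
left = ListType(Field("element", "Utf8"))
right = ListType(Field("item", "Utf8"))

# Strict structural equality includes the child name, so the types differ.
print(left == right)  # False
# Ignoring the child name, the two list types describe the same data.
print(same_ignoring_child_name(left, right))  # True
```

Under this model, a strict equality check reports the types as incompatible even though both are nullable lists of nullable Utf8 values, which matches the shape of the error above.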