Support for historical data validation causes error in CSV files without headers #1018
Description
The recent changes introduced in #1006 use INTERSECT to get the names of the columns that exist in both the data contract and the data files (Parquet or CSV). However, when a CSV file has no header and the names option is not used, DuckDB assigns default column names. Except in very rare cases, the intersection between those default names and the data contract names is empty, which causes the INSERT in the next step to throw an error (its SELECT statement has no columns).
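To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the code from #1006: the helper names (default_csv_column_names, build_insert_sql), the "source" relation, and the generated SQL shape are all hypothetical, and the column0, column1, ... naming is an assumption that mirrors DuckDB's defaults for a headerless CSV with a handful of columns. It only illustrates how an empty intersection could be detected before the INSERT is built.

```python
def default_csv_column_names(n):
    # Assumption: mimics the default names DuckDB gives to a headerless CSV
    # with a small number of columns (column0, column1, ...).
    return [f"column{i}" for i in range(n)]


def build_insert_sql(table, contract_columns, file_columns):
    # Hypothetical helper: keep only the contract columns that also exist
    # in the file, preserving the order defined in the contract.
    available = set(file_columns)
    common = [c for c in contract_columns if c in available]
    if not common:
        # Fail early with a readable message instead of emitting
        # an INSERT whose SELECT list is empty.
        raise ValueError(
            "No column names shared between data contract and data file; "
            "does the CSV have a header row?"
        )
    cols = ", ".join(f'"{c}"' for c in common)
    return f'INSERT INTO "{table}" ({cols}) SELECT {cols} FROM source'


contract = ["id", "name", "amount"]

# File with headers: the shared columns are selected.
print(build_insert_sql("orders", contract, ["id", "amount", "extra"]))

# Headerless file: default names never match the contract.
try:
    build_insert_sql("orders", contract, default_csv_column_names(3))
except ValueError as exc:
    print(f"error: {exc}")
```

A guard like this would only surface the problem more clearly; whether headerless CSVs should instead be mapped to contract fields by position is a separate design decision.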
Apart from this error, these changes also introduced an inconsistency: the SQL object created by the create_view_with_schema_union function will be either a table (if converted_types) or a view (fallback). I suspect the use of a table instead of a view may have further implications in terms of performance, as data will automatically be loaded into memory.
Finally, I would question whether it makes sense to treat non-required fields as optional rather than nullable. According to ODCS, the required key "indicates if the element may contain Null values", not whether it can be entirely absent from the data. Perhaps it would be useful to have historical data support as an option, but I am not sure that it makes sense to have it as the default behavior.
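The distinction between nullable and optional can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: validate_rows and the allow_missing_optional flag are hypothetical names, and rows are modelled as plain dictionaries. By default a non-required field may contain nulls but must still exist as a column; tolerating absent columns is opt-in.

```python
def validate_rows(fields, columns, rows, allow_missing_optional=False):
    """Check columns against a contract.

    fields: dict mapping field name -> required (bool).
    columns: column names actually present in the data.
    rows: list of dicts, one per record.
    allow_missing_optional: hypothetical opt-in for historical data, where
        non-required fields may be absent from the file entirely.
    """
    errors = []
    for name, required in fields.items():
        if name not in columns:
            # Absence is an error unless the field is non-required
            # AND the caller opted in to tolerating missing columns.
            if required or not allow_missing_optional:
                errors.append(f"column '{name}' is missing")
            continue
        # Presence established: 'required' now only governs nulls.
        if required and any(row.get(name) is None for row in rows):
            errors.append(f"required column '{name}' contains null values")
    return errors


fields = {"id": True, "note": False}

# Nullable: 'note' exists but holds a null, which is allowed.
print(validate_rows(fields, ["id", "note"], [{"id": 1, "note": None}]))

# Absent: 'note' is missing, which is rejected by default...
print(validate_rows(fields, ["id"], [{"id": 1}]))

# ...and accepted only when historical data support is switched on.
print(validate_rows(fields, ["id"], [{"id": 1}], allow_missing_optional=True))
```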