Support for historical data validation causes error in CSV files without headers #1018

@jfpascoal

Description

The recent changes introduced in #1006 use INTERSECT to get the names of the columns that exist in both the data contract and the data files (Parquet or CSV). However, when a CSV file has no header row and the names option is not used, DuckDB assigns default column names. Except in very rare cases, the intersection between those default names and the data contract names is an empty array, so the INSERT in the next step throws an error because its SELECT list is empty.
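To make the failure mode concrete, here is a minimal pure-Python sketch of the column intersection and a guard against the empty case. The helper names `common_columns` and `build_insert` are hypothetical and are not taken from the project's code, and the default names `column0`, `column1`, ... are my understanding of what DuckDB assigns to headerless CSV files. It only illustrates where an explicit check could go, not how #1006 is actually implemented.

```python
def common_columns(contract_fields, file_columns):
    """Return the contract fields that also exist in the file, in contract order."""
    present = set(file_columns)
    return [name for name in contract_fields if name in present]


def build_insert(table, source, contract_fields, file_columns):
    """Build an INSERT ... SELECT over the shared columns (hypothetical helper).

    Raises ValueError instead of emitting a statement with an empty SELECT list.
    """
    cols = common_columns(contract_fields, file_columns)
    if not cols:
        raise ValueError(
            "no columns in common between the data contract and the file; "
            "the CSV may have no header row"
        )
    col_list = ", ".join(f'"{c}"' for c in cols)
    return f'INSERT INTO "{table}" ({col_list}) SELECT {col_list} FROM {source}'


contract = ["id", "name", "created_at"]
# Assumed default names for a three-column CSV read without a header.
headerless = ["column0", "column1", "column2"]

print(common_columns(contract, headerless))  # empty intersection
print(build_insert("target", "src", contract, ["id", "name"]))
```

With a guard like this, the headerless case could either fail with a clear message or fall back to mapping columns by position, whichever the maintainers prefer.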

Apart from this error, these changes also introduced an inconsistency: the SQL object created by the create_view_with_schema_union function will be either a table (if converted_types) or a view (fallback). I suspect that using a table instead of a view may have further performance implications, as the data will automatically be loaded into memory.

Finally, I would raise the question of whether it makes sense to treat non-required fields as optional rather than nullable. According to ODCS, the required key "indicates if the element may contain Null values", not whether it can be entirely absent from the data. Perhaps it would be useful to have historical data support as an option, but I am not sure that it makes sense to have it as the default behavior.
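As an illustration of the distinction, the sketch below checks a single field under the "nullable" reading: a column missing from the data is always reported, while nulls are reported only when the field is required. The function `check_field` and its arguments are hypothetical and exist only for this example; they are not part of the tool or of ODCS.

```python
def check_field(name, required, rows, columns):
    """Return a list of problems for one contract field (nullable reading)."""
    if name not in columns:
        # An absent column is an error regardless of `required`.
        return [f"column '{name}' is missing from the data"]
    if required and any(row.get(name) is None for row in rows):
        # Nulls are only a problem when the field is required.
        return [f"column '{name}' is required but contains nulls"]
    return []


rows = [{"id": 1, "note": None}]
print(check_field("note", False, rows, ["id", "note"]))  # nullable column, present
print(check_field("note", False, [{"id": 1}], ["id"]))   # column absent entirely
```

Under the "optional" reading, which is the current default as I understand it, the second call would report no problem.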
