Thanks for putting this together, this is a great improvement on the state of JSON handling in datafusion! I've been trying this out in Arroyo (ArroyoSystems/arroyo#640), and it's definitely a big perf improvement on what we've been doing.
One limitation I've encountered is dealing with arrays of objects. For example, something like
{
"a": [
{"b": 5},
{"b": 9},
{"b": 3}
]
}
A common pattern in Arroyo is to extract an array from JSON data, then call our unnest operator on it (which unrolls a list into individual rows). For that, we need a way to get an Arrow list out of the data. But there's no function that's capable of returning an array.
Two options I've considered:
- Some syntax to "map" over a list in the json path (like
json_get_int(s, 'a', *, 'b')), where the * represents an array that we're mapping over; this would return a SQL array/arrow list [5, 9, 3]
- A
get_json_array function, which returns an array of JsonUnion objects, which could then be queried without needing to be deserialized again (like get_json_int(unnest(get_json_array(s, 'a')), 'b)).
I've got most of a working prototype of the second one, but I'd be interested in thoughts on what approach makes sense (or whether it's within the interest of this project to support more complex JSON structures like this).
Thanks for putting this together, this is a great improvement on the state of JSON handling in datafusion! I've been trying this out in Arroyo (ArroyoSystems/arroyo#640), and it's definitely a big perf improvement on what we've been doing.
One limitation I've encountered is dealing with arrays of objects. For example, something like
{ "a": [ {"b": 5}, {"b": 9}, {"b": 3} ] }A common pattern in Arroyo is to extract an array from JSON data, then call our
unnestoperator on it (which unrolls a list into individual rows). For that, we need a way to get an Arrow list out of the data. But there's no function that's capable of returning an array.Two options I've considered:
json_get_int(s, 'a', *, 'b')), where the*represents an array that we're mapping over; this would return a SQL array/arrow list[5, 9, 3]get_json_arrayfunction, which returns an array of JsonUnion objects, which could then be queried without needing to be deserialized again (likeget_json_int(unnest(get_json_array(s, 'a')), 'b)).I've got most of a working prototype of the second one, but I'd be interested in thoughts on what approach makes sense (or whether it's within the interest of this project to support more complex JSON structures like this).