Description
Blaze fails to read ORC files: columns are missing.
Error log:
java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Shuffle] error: Execution error: output_with_sender[Limit] error: Execution error: output_with_sender[Limit]: output() returns error: Execution error: Execution error: output_with_sender[Project]: output() returns error: Execution error: Execution error: index out of bounds: the len is 31 but the index is 31
at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:25)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
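Looking at the corresponding code, the problem is in the `open` function of `FileOpener` in orc_exec.rs:

`ProjectionMask::roots(builder.file_metadata().root_data_type(), projection);`

The generated `projection` does not match the indices that the ORC mask expects.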
ORC organizes its data as follows:

For example, if the Hive schema is:
```
biz_col_name_list: List<String>, column_index 0
dist_scene_list: List<String>, column_index 1
entry_name_1st: String, column_index 2
entry_name_2nd: String, column_index 3
```
then the ORC metadata is:
`RootDataType {
  children: [
    NamedColumn { name: "biz_col_name_list", data_type: List { column_index: 1, child: String { column_index: 2 } } },
    NamedColumn { name: "dist_scene_list", data_type: List { column_index: 3, child: String { column_index: 4 } } },
    NamedColumn { name: "entry_name_1st", data_type: String { column_index: 5 } },
    NamedColumn { name: "entry_name_2nd", data_type: String { column_index: 6 } }
  ]
}`
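As this shows, the column indices in the Hive schema differ from the column indices in the ORC metadata: nested types such as `List` take an extra index for each child, so every later column is shifted.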
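To illustrate the mismatch, here is a minimal, self-contained sketch of how top-level field positions could be translated into ORC column indices. It assumes column index 0 is the root struct and that indices are assigned in pre-order, which is consistent with the metadata dump above. The `OrcType` enum and the functions `root_column_ids`, `to_orc_projection`, and `example_schema` are hypothetical names for this illustration only. They are not part of Blaze or of the ORC reader crate, and this is not necessarily how the fix should be implemented.

```rust
/// Simplified stand-in for an ORC type tree (hypothetical; not the reader crate's types).
#[derive(Debug, Clone)]
enum OrcType {
    String,
    List(Box<OrcType>),
}

impl OrcType {
    /// Number of ORC column indices this type occupies: itself plus all descendants.
    fn node_count(&self) -> usize {
        match self {
            OrcType::String => 1,
            OrcType::List(child) => 1 + child.node_count(),
        }
    }
}

/// For each top-level field, return the ORC column index of the field itself.
/// Assumption: index 0 is the root struct, so the first field starts at 1.
fn root_column_ids(fields: &[(&str, OrcType)]) -> Vec<usize> {
    let mut next_id = 1;
    fields
        .iter()
        .map(|(_, ty)| {
            let id = next_id;
            // Skip over the indices consumed by this field's children.
            next_id += ty.node_count();
            id
        })
        .collect()
}

/// Translate a projection of top-level field positions (Hive-style, 0-based)
/// into ORC column indices. Returns None if any position is out of range.
fn to_orc_projection(fields: &[(&str, OrcType)], positions: &[usize]) -> Option<Vec<usize>> {
    let ids = root_column_ids(fields);
    positions.iter().map(|&p| ids.get(p).copied()).collect()
}

/// The four-column schema from the example above.
fn example_schema() -> Vec<(&'static str, OrcType)> {
    vec![
        ("biz_col_name_list", OrcType::List(Box::new(OrcType::String))),
        ("dist_scene_list", OrcType::List(Box::new(OrcType::String))),
        ("entry_name_1st", OrcType::String),
        ("entry_name_2nd", OrcType::String),
    ]
}
```

With this schema, `root_column_ids(&example_schema())` yields `[1, 3, 5, 6]`, matching the `column_index` values in the ORC metadata, while the Hive-side positions are `0..=3`. Passing the Hive positions to the mask unchanged would therefore select the wrong columns or none at all.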