
orcScan reads missing data column #715

@ASiegeLion

Blaze is missing columns when it reads ORC files.
Error log:
java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Shuffle] error: Execution error: output_with_sender[Limit] error: Execution error: output_with_sender[Limit]: output() returns error: Execution error: Execution error: output_with_sender[Project]: output() returns error: Execution error: Execution error: index out of bounds: the len is 31 but the index is 31
at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:25)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)

Looking at the corresponding code, I traced this to the `open` function of `FileOpener` in orc_exec.rs:
image

`ProjectionMask::roots(builder.file_metadata().root_data_type(), projection);` The generated `projection` does not match the indices that the ORC mask expects.

ORC organizes its data as follows:
image
For example:
```
Given the Hive schema:

biz_col_name_list: List<String>, column_index 0
dist_scene_list: List<String>, column_index 1
entry_name_1st: String, column_index 2
entry_name_2nd: String, column_index 3
```

the ORC metadata is then:

`RootDataType {
 children: [
 NamedColumn { name: "biz_col_name_list", data_type: List { column_index: 1, child: String { column_index: 2 } } },
 NamedColumn { name: "dist_scene_list", data_type: List { column_index: 3, child: String { column_index: 4 } } },
 NamedColumn { name: "entry_name_1st", data_type: String { column_index: 5 } },
 NamedColumn { name: "entry_name_2nd", data_type: String { column_index: 6 } }] }`

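As this shows, the column indices in the Hive schema differ from those in the ORC metadata: ORC also assigns an index to the root struct and to each nested child type, so the four top-level fields land on 1, 3, 5 and 6 rather than 0, 1, 2 and 3.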
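
To illustrate the mismatch, here is a minimal sketch of how positional field indices could be translated into ORC column indices. It assumes ORC numbers columns by pre-order traversal of the type tree with the root struct at index 0, which is consistent with the metadata dump in this issue. `OrcType`, `root_column_ids`, `to_orc_projection` and `example_schema` are hypothetical names used only for illustration; they are not part of orc-rust or Blaze, and this is not a proposed patch.

```rust
/// Simplified stand-in for an ORC type tree (hypothetical, not the orc-rust types).
enum OrcType {
    String,
    List(Box<OrcType>),
}

impl OrcType {
    /// Number of ORC column indices this type occupies: itself plus all descendants.
    fn width(&self) -> usize {
        match self {
            OrcType::String => 1,
            OrcType::List(child) => 1 + child.width(),
        }
    }
}

/// Returns the ORC column index of each top-level field, in schema order.
/// Index 0 is assumed to be the root struct, so the first field starts at 1.
fn root_column_ids(fields: &[(&str, OrcType)]) -> Vec<usize> {
    let mut next_id = 1;
    let mut ids = Vec::with_capacity(fields.len());
    for (_, ty) in fields {
        ids.push(next_id);
        next_id += ty.width();
    }
    ids
}

/// Translates positional (Hive-style) field indices into ORC column indices.
/// Positions outside the schema are skipped.
fn to_orc_projection(fields: &[(&str, OrcType)], positions: &[usize]) -> Vec<usize> {
    let ids = root_column_ids(fields);
    positions.iter().filter_map(|&p| ids.get(p).copied()).collect()
}

/// The four-column schema from this issue.
fn example_schema() -> Vec<(&'static str, OrcType)> {
    vec![
        ("biz_col_name_list", OrcType::List(Box::new(OrcType::String))),
        ("dist_scene_list", OrcType::List(Box::new(OrcType::String))),
        ("entry_name_1st", OrcType::String),
        ("entry_name_2nd", OrcType::String),
    ]
}

fn main() {
    let schema = example_schema();
    // ORC column index of every top-level field.
    println!("{:?}", root_column_ids(&schema));
    // Positions 2 and 3 (entry_name_1st, entry_name_2nd) mapped to ORC indices.
    println!("{:?}", to_orc_projection(&schema, &[2, 3]));
}
```

With the schema above, positions 2 and 3 map to ORC indices 5 and 6. Passing 2 and 3 unchanged would instead point at the child of the first list and at the second list.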
 





