
Hive table is case-insensitive and its location contains only parquet files (schema with mixed-case field names, e.g. componentId, userName); after enabling Blaze, a Spark SQL query with a mixed-case filter condition returns no data.  #670

@frencopei

Description


Describe the bug
The Hive table is case-insensitive and its location contains only parquet files (schema with mixed-case field names, e.g. componentId, userName). After enabling Blaze, a Spark SQL query with a mixed-case filter condition returns no data.

To Reproduce
Steps to reproduce the behavior:

  1. spark.sql("set spark.sql.caseSensitive=false")
    
  2. val executSql = """
        select  dnum
                 from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                 where dt between '2024-11-20' and   '2024-11-27'
                 and componentId='255'    limit 50
     """
    
  3. val df = spark.sql(executSql) 
     println(df.schema)
     df.show(10)
    
  4. Package the Scala code into a jar.

  5. spark-submit --class com.***.myapp.Test --master yarn --conf spark.sql.hive.convertMetastoreParquet=true --conf spark.blaze.enable=true --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager --conf spark.sql.caseSensitive=false cosn://dc-sh-prod-03-1323003688/tasklibs/spark3.2.2_myapp.jar

  6. executor logs:

Test SQL:

       select  dnum, 3680 as moneys
                from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                where to_date_udf(year,month,day) between  date_sub('2024-11-27',7) and  '2024-11-27'
                and componentId='255' limit 50

userGroupInfo.getUserField : dnum

StructType(StructField(dnum,StringType,true), StructField(moneys,IntegerType,false))
+----+------+
|dnum|moneys|
+----+------+
+----+------+

Obviously, it returns no data. The filter condition on componentId is the cause.
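The symptom is consistent with the native reader resolving column names case-sensitively even though spark.sql.caseSensitive=false: the query's componentId never matches a field during the scan, so the pushed-down filter eliminates every row. A minimal, hypothetical sketch of the resolution step (the helper and field list are illustrative, not Blaze's actual code):

```scala
// Hypothetical sketch of field resolution in a parquet reader.
// The field names mirror this issue: the files use camelCase names.
object FieldResolution {
  // Physical schema of the parquet files at the table location.
  val parquetFields = Seq("componentId", "userName", "dnum")

  // Resolve a SQL column name against the physical parquet schema.
  def resolve(column: String, caseSensitive: Boolean): Option[String] =
    if (caseSensitive) parquetFields.find(_ == column)
    else parquetFields.find(_.equalsIgnoreCase(column))

  def main(args: Array[String]): Unit = {
    // With caseSensitive=false the field is found, as vanilla Spark does...
    println(resolve("componentid", caseSensitive = false)) // Some(componentId)
    // ...but a case-sensitive lookup (the suspected Blaze behaviour) fails,
    // so the pushed-down filter matches nothing and no rows come back.
    println(resolve("componentid", caseSensitive = true))  // None
  }
}
```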

Expected behavior (when spark.blaze.enable=false):

SQL imported by the automated analysis task:

       select  dnum, 3680 as moneys
                from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                where to_date_udf(year,month,day) between  date_sub('2024-11-27',7) and  '2024-11-27'
                and componentId='255' limit 50

dataframe schema:

StructType(StructField(dnum,StringType,true), StructField(moneys,IntegerType,false))
+---------+------+
|     dnum|moneys|
+---------+------+
|649409512|  3680|
|666687060|  3680|
|667198577|  3680|
|672462560|  3680|
|668511291|  3680|
|661643626|  3680|
|669103964|  3680|
|660927197|  3680|
|671793888|  3680|
|637719401|  3680|
+---------+------+
only showing top 10 rows

append:
A: hive table create scripts :

CREATE EXTERNAL TABLE report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549(
android_id string,
systempid string,
appnm string,
appversion string,
appversioncode string,
biversion string,
cardstyleid string,
city string,
clientdatetime string,
componentcontentid string,
componentid string,
componentname string,
componentposition string,
componenttypeid string,
componentversion string,
datasource string,
dateofweek string,
datetime string,
dayofquarter string,
dayofyear string,
deviceid string,
devicetype string,
dnum string,
hour string,
id string,
imei string,
ip string,
launcherversionname string,
launcherdnum string,
launchervercode string,
mac string,
minute string,
nation string,
networktype string,
packagenm string,
phonetype string,
postconfigversion string,
projectid string,
province string,
region string,
remote_addr string,
scenetemplateid string,
scenetemplatename string,
second string,
sendtime string,
signature string,
systype string,
sysversion string,
systemvercode string,
tabposition string,
tclosversion string,
type string,
userid string,
weekofyear string,
wlanmac string,
xforwarded string,
packagename string,
componentstatus string,
musicstatus string,
componenttitle string,
vid string,
receipttime string)
PARTITIONED BY (
year bigint,
month bigint,
day bigint,
cleanhour bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://xxxxxx/data/report/584f9c5bab31fb1d59e138e1/39e85e2e76e444e195c6db2df728751e/34B7DFE549'
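Note that the DDL above declares every column in lowercase (e.g. componentid), while the parquet files at the LOCATION use camelCase names. With spark.sql.hive.convertMetastoreParquet=true and spark.sql.caseSensitive=false, the reader has to reconcile the metastore names with the physical file names. A minimal sketch of that reconciliation (hypothetical helper, not Spark's internal code; column lists are illustrative):

```scala
// Hypothetical sketch of metastore/parquet schema reconciliation under
// spark.sql.caseSensitive=false. The names mirror this issue: the Hive DDL
// declares lowercase columns while the parquet files use camelCase.
object SchemaReconcile {
  val metastoreColumns = Seq("componentid", "dnum", "userid")
  val parquetFields    = Seq("componentId", "dnum", "userId")

  // For each metastore column, pick the physical parquet field name that
  // matches it case-insensitively; fall back to the metastore name.
  def reconcile(meta: Seq[String], physical: Seq[String]): Seq[String] =
    meta.map(c => physical.find(_.equalsIgnoreCase(c)).getOrElse(c))

  def main(args: Array[String]): Unit = {
    // The scan must use the physical (camelCase) names when reading files.
    println(reconcile(metastoreColumns, parquetFields))
    // List(componentId, dnum, userId)
  }
}
```

If a reader skips this step and pushes filters down using the lowercase metastore names, the filter references a field that does not exist in the files, which matches the empty result observed above.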

B: parquet schema at the table location:

[screenshot: the parquet files' schema uses mixed-case field names, e.g. componentId, userName]

Labels: bug (Something isn't working)
