@ghkang98
Contributor

Fix inconsistent filter data in the Doris eq-delete scenario.

Problem Summary:
When Doris reads Iceberg data that has equality-deleted (eq-delete) rows, the filter can contain dirty (stale) data because resize_fill is used incorrectly: it fills only the newly added elements and leaves existing ones unchanged. As a result, Doris filters the data incorrectly.
doris-iceberg-eq-delete-bug.pdf

  • Behavior changed:
    Yes.
    Added a reinit function that resizes the array to the given size and fills every element with the given value.
  • Does this need documentation?
    No.
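
The stale-filter problem described above can be sketched as follows. This is a minimal illustration, not Doris code: `Filter` is a hypothetical stand-in built on `std::vector`, the free function `resize_fill` only mimics the semantics described in this PR (fill newly added elements, leave existing ones alone), and `stale_filter_after_reuse` is an invented helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for IColumn::Filter, for illustration only.
using Filter = std::vector<std::uint8_t>;

// Assumed semantics: resize to n and write `value` only into elements
// that did not exist before. std::vector::resize(n, value) behaves this way.
void resize_fill(Filter& f, std::size_t n, std::uint8_t value) {
    f.resize(n, value);
}

// Simulates two consecutive batches that share one filter buffer and
// returns the filter as the second batch would see it.
Filter stale_filter_after_reuse() {
    Filter filter(4, 0);
    filter[1] = 1;  // first batch marks row 1
    filter[3] = 1;  // first batch marks row 3

    // The second batch has the same row count, so no element is "new"
    // and nothing is reset: the marks from the first batch survive.
    resize_fill(filter, 4, 0);
    assert(filter.size() == 4);
    return filter;
}
```

Calling `stale_filter_after_reuse()` yields a filter whose entries 1 and 3 are still set, even though the caller asked for a filter filled with 0.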

@Thearas
Contributor

Thearas commented May 26, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@ghkang98
Contributor Author

run buildall


/// reset the array capacity
/// fill the new additional elements using the value
void resize_fill(size_t n, const T& value) {
Contributor

Please add some unit tests for resize_fill; we should make sure it resets only the values of the newly added elements.
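
The cases such a unit test might cover can be sketched as below. This is only an assumption-laden outline: it checks a hypothetical `std::vector`-based stand-in rather than the real Doris array type, uses plain `assert` instead of the project's test framework, and `check_resize_fill` is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the array type under test.
using Filter = std::vector<std::uint8_t>;

// Assumed semantics of resize_fill: only newly added elements get `value`.
void resize_fill(Filter& f, std::size_t n, std::uint8_t value) {
    f.resize(n, value);
}

// Exercises the three size transitions a test should cover.
void check_resize_fill() {
    // Growing: old elements are kept, only the new tail is filled.
    Filter grow = {1, 1};
    resize_fill(grow, 4, 0);
    assert((grow == Filter{1, 1, 0, 0}));

    // Shrinking: the surviving prefix is left untouched.
    Filter shrink = {1, 0, 1, 1};
    resize_fill(shrink, 2, 0);
    assert((shrink == Filter{1, 0}));

    // Same size: nothing changes at all.
    Filter same = {1, 0, 1};
    resize_fill(same, 3, 0);
    assert((same == Filter{1, 0, 1}));
}
```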


/// reset the array capacity
/// fill all elements using the value
void reinit(size_t n, const T& value) {
Contributor

Do not add this method. Implement this logic at the call site instead.

      _filter = std::make_unique<IColumn::Filter>(rows, 0);
  } else {
-     _filter->resize_fill(rows, 0);
+     _filter->reinit(rows, 0);
Contributor

Call the assign method instead?

Contributor Author

OK, I will use the assign function instead.
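
A minimal sketch of the agreed approach, resetting the filter with assign at the call site. The names here are assumptions for illustration: `Filter` is a `std::vector` stand-in for `IColumn::Filter`, `reset_filter` is a hypothetical helper, and the null check is a guess at the condition that is not shown in the quoted diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for IColumn::Filter.
using Filter = std::vector<std::uint8_t>;

// Prepares the filter for a batch of `rows` rows with every entry set to 0.
void reset_filter(std::unique_ptr<Filter>& filter, std::size_t rows) {
    if (!filter) {
        // First use: allocate a zero-filled filter.
        filter = std::make_unique<Filter>(rows, std::uint8_t{0});
    } else {
        // Reuse: assign replaces the entire contents, so no value written
        // for an earlier batch can survive.
        filter->assign(rows, std::uint8_t{0});
    }
    assert(filter->size() == rows);
}
```

Unlike a resize that fills only new elements, assign overwrites every element, so no separate reinit method is needed.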

@doris-robot

TPC-H: Total hot run time: 33701 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e42dd8f3c872dc8988deef67abd317b7cbdb00b5, data reload: false

------ Round 1 ----------------------------------
q1	26803	5000	5032	5000
q2	2086	294	193	193
q3	10452	1281	715	715
q4	10222	1018	521	521
q5	7543	2414	2368	2368
q6	188	166	143	143
q7	933	751	597	597
q8	9314	1344	1052	1052
q9	6832	4989	5166	4989
q10	6831	2324	1882	1882
q11	484	287	282	282
q12	348	356	216	216
q13	17802	3700	3090	3090
q14	234	239	211	211
q15	527	494	482	482
q16	432	428	379	379
q17	603	866	352	352
q18	7562	7172	7210	7172
q19	2665	980	557	557
q20	327	327	218	218
q21	3677	2560	2311	2311
q22	1023	1019	971	971
Total cold run time: 116888 ms
Total hot run time: 33701 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5214	5070	5047	5047
q2	246	325	231	231
q3	2146	2671	2275	2275
q4	1388	1816	1393	1393
q5	4450	4465	4407	4407
q6	210	171	129	129
q7	2032	1971	1767	1767
q8	2633	2632	2544	2544
q9	7263	7172	7152	7152
q10	3058	3172	2764	2764
q11	567	507	478	478
q12	663	755	585	585
q13	3561	3903	3263	3263
q14	281	327	312	312
q15	556	478	471	471
q16	446	488	432	432
q17	1162	1570	1389	1389
q18	7841	7570	7518	7518
q19	806	913	885	885
q20	2015	1984	1849	1849
q21	4816	4507	4428	4428
q22	1106	1048	1025	1025
Total cold run time: 52460 ms
Total hot run time: 50344 ms

@ghkang98 force-pushed the iceberg-eqdelete-filter branch from e42dd8f to b79c533 on May 26, 2025 11:47
@doris-robot

TPC-DS: Total hot run time: 192497 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e42dd8f3c872dc8988deef67abd317b7cbdb00b5, data reload: false

query1	1399	1105	1091	1091
query2	6255	1814	1815	1814
query3	11013	4479	4551	4479
query4	53655	25594	23079	23079
query5	5182	511	464	464
query6	336	221	203	203
query7	4898	502	299	299
query8	314	295	240	240
query9	5773	2648	2673	2648
query10	453	345	281	281
query11	15052	14981	15115	14981
query12	172	114	110	110
query13	1029	534	426	426
query14	10164	6371	6224	6224
query15	207	208	174	174
query16	7224	638	517	517
query17	1140	735	621	621
query18	1782	392	306	306
query19	221	192	167	167
query20	134	133	125	125
query21	202	125	106	106
query22	4370	4476	4227	4227
query23	34547	33643	33688	33643
query24	6759	2424	2419	2419
query25	464	485	404	404
query26	723	274	154	154
query27	2877	512	349	349
query28	3096	2156	2184	2156
query29	561	552	427	427
query30	273	248	192	192
query31	835	864	786	786
query32	72	73	63	63
query33	476	346	314	314
query34	776	856	522	522
query35	807	847	731	731
query36	926	972	923	923
query37	106	98	77	77
query38	4183	4330	4160	4160
query39	1715	1495	1458	1458
query40	212	128	113	113
query41	57	52	57	52
query42	127	111	116	111
query43	508	539	482	482
query44	1350	863	854	854
query45	180	172	167	167
query46	846	1022	664	664
query47	1846	1845	1805	1805
query48	427	459	337	337
query49	688	535	442	442
query50	693	731	425	425
query51	4229	4261	4268	4261
query52	117	117	100	100
query53	236	268	189	189
query54	615	586	527	527
query55	87	90	94	90
query56	335	336	316	316
query57	1161	1144	1099	1099
query58	278	273	263	263
query59	2551	2594	2458	2458
query60	323	323	300	300
query61	130	124	126	124
query62	728	715	654	654
query63	224	192	190	190
query64	1826	996	671	671
query65	4291	4234	4234	4234
query66	748	388	299	299
query67	16113	15514	15340	15340
query68	7381	881	525	525
query69	545	327	280	280
query70	1248	1107	1120	1107
query71	489	338	306	306
query72	5486	4735	4842	4735
query73	1465	654	374	374
query74	8877	8901	8976	8901
query75	3796	3210	2703	2703
query76	4173	1199	769	769
query77	625	360	278	278
query78	10079	10199	9338	9338
query79	2342	814	587	587
query80	693	512	452	452
query81	495	251	224	224
query82	446	127	96	96
query83	357	250	238	238
query84	293	97	96	96
query85	788	352	363	352
query86	392	329	291	291
query87	4426	4454	4370	4370
query88	3584	2289	2283	2283
query89	405	314	278	278
query90	1770	206	205	205
query91	147	142	112	112
query92	70	66	54	54
query93	1870	917	582	582
query94	654	384	289	289
query95	374	291	284	284
query96	496	564	281	281
query97	2724	2773	2649	2649
query98	227	208	206	206
query99	1443	1394	1325	1325
Total cold run time: 299257 ms
Total hot run time: 192497 ms

@doris-robot

ClickBench: Total hot run time: 29.03 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e42dd8f3c872dc8988deef67abd317b7cbdb00b5, data reload: false

query1	0.04	0.04	0.04
query2	0.13	0.10	0.12
query3	0.25	0.20	0.19
query4	1.59	0.20	0.11
query5	0.45	0.43	0.44
query6	1.19	0.66	0.66
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.59	0.51	0.54
query10	0.57	0.57	0.57
query11	0.15	0.11	0.11
query12	0.15	0.12	0.11
query13	0.62	0.59	0.59
query14	0.78	0.80	0.82
query15	0.89	0.85	0.85
query16	0.38	0.38	0.39
query17	1.09	1.01	1.01
query18	0.23	0.22	0.21
query19	2.01	1.84	1.83
query20	0.02	0.01	0.01
query21	15.39	0.88	0.56
query22	0.77	1.06	0.74
query23	14.93	1.40	0.59
query24	7.61	0.78	0.77
query25	0.49	0.26	0.08
query26	0.57	0.16	0.14
query27	0.06	0.05	0.05
query28	9.33	0.91	0.45
query29	12.60	4.04	3.36
query30	0.26	0.09	0.07
query31	2.81	0.60	0.39
query32	3.22	0.55	0.47
query33	3.03	3.00	3.08
query34	15.73	5.10	4.51
query35	4.53	4.51	4.46
query36	0.66	0.50	0.49
query37	0.09	0.07	0.07
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.18	0.15	0.13
query41	0.09	0.03	0.03
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 103.7 s
Total hot run time: 29.03 s

@ghkang98
Contributor Author

run buildall

yiguolei previously approved these changes May 27, 2025
Contributor

@yiguolei yiguolei left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on May 27, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@ghkang98 force-pushed the iceberg-eqdelete-filter branch from b79c533 to b1e2401 on May 27, 2025 03:13
@github-actions github-actions bot removed the "approved" label (Indicates a PR has been approved by one committer.) on May 27, 2025
@ghkang98
Contributor Author

run buildall

@yiguolei yiguolei added the "usercase" (Important user case type label), "dev/2.1.x", and "dev/3.0.x" labels on May 27, 2025
@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on May 27, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

Contributor

@suxiaogang223 suxiaogang223 left a comment

+1

@doris-robot

TPC-H: Total hot run time: 33911 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b1e24014e8110e8d77cfd369f39ae063b2e70e04, data reload: false

------ Round 1 ----------------------------------
q1	26481	5054	4969	4969
q2	2071	281	191	191
q3	10501	1245	715	715
q4	10236	1018	533	533
q5	7643	2357	2387	2357
q6	186	170	135	135
q7	903	745	609	609
q8	9303	1293	1151	1151
q9	6890	5140	5167	5140
q10	6869	2329	1912	1912
q11	482	287	269	269
q12	339	358	213	213
q13	17781	3652	3061	3061
q14	229	230	223	223
q15	541	500	485	485
q16	440	428	372	372
q17	616	869	381	381
q18	7727	7112	7156	7112
q19	1794	992	559	559
q20	335	340	221	221
q21	3879	2627	2330	2330
q22	1061	1039	973	973
Total cold run time: 116307 ms
Total hot run time: 33911 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5239	5093	5115	5093
q2	239	323	230	230
q3	2144	2673	2312	2312
q4	1365	1791	1367	1367
q5	4452	4392	4434	4392
q6	217	170	123	123
q7	2020	1919	1725	1725
q8	2589	2697	2614	2614
q9	7284	7316	7009	7009
q10	3061	3206	2759	2759
q11	583	513	494	494
q12	677	787	658	658
q13	3539	3916	3324	3324
q14	298	336	285	285
q15	525	469	476	469
q16	439	493	439	439
q17	1162	1576	1407	1407
q18	7699	7753	7432	7432
q19	813	785	836	785
q20	2080	1981	1830	1830
q21	4863	4493	4410	4410
q22	1109	1061	1036	1036
Total cold run time: 52397 ms
Total hot run time: 50193 ms

@doris-robot

TPC-DS: Total hot run time: 193364 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b1e24014e8110e8d77cfd369f39ae063b2e70e04, data reload: false

query1	1401	1086	1057	1057
query2	6216	1853	1843	1843
query3	11173	4693	4817	4693
query4	25569	23608	23322	23322
query5	4792	598	451	451
query6	317	200	199	199
query7	3993	489	284	284
query8	289	233	223	223
query9	8529	2621	2626	2621
query10	489	327	268	268
query11	15233	15066	14877	14877
query12	171	121	102	102
query13	1549	516	415	415
query14	8934	6056	6141	6056
query15	201	189	176	176
query16	7229	646	493	493
query17	1161	725	609	609
query18	1962	413	310	310
query19	196	198	167	167
query20	124	116	127	116
query21	220	124	112	112
query22	4554	4552	4354	4354
query23	34716	33817	33614	33614
query24	8585	2429	2458	2429
query25	519	494	409	409
query26	1226	275	153	153
query27	2878	522	354	354
query28	4677	2198	2167	2167
query29	740	556	453	453
query30	271	224	196	196
query31	911	872	793	793
query32	80	63	63	63
query33	540	384	321	321
query34	865	862	529	529
query35	787	856	744	744
query36	996	1007	927	927
query37	116	103	76	76
query38	4279	4375	4227	4227
query39	1527	1463	1444	1444
query40	224	125	105	105
query41	85	53	56	53
query42	127	114	113	113
query43	507	523	482	482
query44	1369	851	833	833
query45	188	179	175	175
query46	852	1043	648	648
query47	1829	1876	1831	1831
query48	420	437	332	332
query49	780	507	451	451
query50	668	701	437	437
query51	4248	4366	4225	4225
query52	107	114	102	102
query53	225	259	184	184
query54	602	579	512	512
query55	92	84	84	84
query56	322	337	311	311
query57	1215	1207	1161	1161
query58	292	274	271	271
query59	2738	2856	2762	2762
query60	366	334	319	319
query61	154	148	147	147
query62	768	748	686	686
query63	231	195	203	195
query64	4215	1134	684	684
query65	4417	4336	4223	4223
query66	1004	395	302	302
query67	15837	15630	15589	15589
query68	8650	893	527	527
query69	506	306	261	261
query70	1221	1088	1133	1088
query71	465	324	299	299
query72	5582	4880	4898	4880
query73	759	691	352	352
query74	9265	8843	9018	8843
query75	3900	3185	2676	2676
query76	3707	1198	794	794
query77	777	365	273	273
query78	10130	10339	9276	9276
query79	1914	881	575	575
query80	599	508	513	508
query81	476	256	213	213
query82	449	124	95	95
query83	258	253	231	231
query84	243	95	88	88
query85	810	353	314	314
query86	343	293	279	279
query87	4459	4422	4327	4327
query88	3454	2293	2266	2266
query89	405	320	290	290
query90	1892	205	204	204
query91	147	139	114	114
query92	74	61	60	60
query93	1429	965	585	585
query94	668	399	310	310
query95	371	289	278	278
query96	514	572	336	336
query97	2702	2734	2652	2652
query98	229	209	210	209
query99	1314	1408	1297	1297
Total cold run time: 279609 ms
Total hot run time: 193364 ms

@doris-robot

ClickBench: Total hot run time: 29.16 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b1e24014e8110e8d77cfd369f39ae063b2e70e04, data reload: false

query1	0.04	0.03	0.03
query2	0.12	0.10	0.12
query3	0.26	0.19	0.19
query4	1.59	0.20	0.19
query5	0.45	0.45	0.45
query6	1.16	0.66	0.67
query7	0.02	0.02	0.02
query8	0.05	0.04	0.04
query9	0.59	0.52	0.51
query10	0.57	0.58	0.56
query11	0.16	0.10	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.60
query14	0.78	0.79	0.81
query15	0.89	0.84	0.88
query16	0.38	0.39	0.38
query17	1.06	1.02	1.05
query18	0.22	0.22	0.21
query19	1.90	1.81	1.83
query20	0.01	0.01	0.01
query21	15.40	0.90	0.55
query22	0.76	1.23	0.64
query23	15.28	1.37	0.60
query24	6.74	0.90	1.77
query25	0.54	0.22	0.09
query26	0.60	0.15	0.14
query27	0.05	0.05	0.05
query28	10.05	0.79	0.45
query29	12.54	4.02	3.27
query30	0.26	0.09	0.06
query31	2.84	0.59	0.39
query32	3.22	0.55	0.47
query33	3.04	3.08	3.11
query34	15.79	5.12	4.51
query35	4.50	4.59	4.54
query36	0.66	0.50	0.47
query37	0.09	0.06	0.06
query38	0.05	0.04	0.03
query39	0.04	0.03	0.03
query40	0.17	0.14	0.13
query41	0.08	0.04	0.02
query42	0.04	0.03	0.02
query43	0.04	0.04	0.03
Total cold run time: 103.8 s
Total hot run time: 29.16 s

@yiguolei yiguolei merged commit ae2dcaf into apache:master May 28, 2025
29 of 32 checks passed
github-actions bot pushed a commit that referenced this pull request May 28, 2025

Fix inconsistent filter data in the Doris eq-delete scenario.

Problem Summary:
When Doris reads Iceberg data that has equality-deleted (eq-delete) rows,
the filter can contain dirty (stale) data because resize_fill is used
incorrectly: it fills only the newly added elements and leaves existing
ones unchanged. As a result, Doris filters the data incorrectly.

[doris-iceberg-eq-delete-bug.pdf](https://github.com/user-attachments/files/20439214/doris-iceberg-eq-delete-bug.pdf)
github-actions bot pushed a commit that referenced this pull request May 28, 2025
yiguolei pushed a commit that referenced this pull request May 28, 2025
dataroaring pushed a commit that referenced this pull request May 29, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…che#51253)
Labels

approved (Indicates a PR has been approved by one committer.), dev/2.1.11-merged, dev/3.0.6-merged, reviewed, usercase (Important user case type label)

7 participants