[CELEBORN-1701] Support stage rerun for shuffle data lost#2894

Closed
FMX wants to merge 2 commits into apache:main from FMX:b1701

FMX commented Nov 8, 2024

What changes were proposed in this pull request?

If shuffle data is lost and throwing fetch failures is enabled, trigger a stage rerun.

Why are the changes needed?

To rerun the stage when shuffle data is lost.

Does this PR introduce any user-facing change?

NO.

How was this patch tested?

GA.
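
For readers unfamiliar with the mechanism, the behavior described above can be sketched roughly as follows. This is a toy model in Python, not the code in this PR: `ClientConf`, `FetchFailure`, `ShuffleDataLostError`, `report_shuffle_data_lost`, and `read_with_stage_rerun` are all hypothetical names made up for illustration. The `throws_fetch_failure` flag stands in for the "throw fetch failures" switch, which is assumed here to be the `celeborn.client.spark.fetch.throwsFetchFailure` setting, since the PR text does not name it.

```python
from dataclasses import dataclass


class ShuffleDataLostError(IOError):
    """Hypothetical plain I/O error: the read fails and no stage is rerun."""


class FetchFailure(Exception):
    """Hypothetical stand-in for a fetch failure that a scheduler
    interprets as a request to rerun the stage that produced the shuffle."""

    def __init__(self, shuffle_id):
        super().__init__(f"fetch failed for shuffle {shuffle_id}")
        self.shuffle_id = shuffle_id


@dataclass
class ClientConf:
    # Assumed to mirror the "throw fetch failures" switch; off by default here.
    throws_fetch_failure: bool = False


def report_shuffle_data_lost(conf, shuffle_id):
    """Always raises: a fetch failure if enabled, otherwise a plain I/O error."""
    if conf.throws_fetch_failure:
        raise FetchFailure(shuffle_id)
    raise ShuffleDataLostError(f"shuffle {shuffle_id} data lost")


def read_with_stage_rerun(conf, shuffle_id, fetch, rerun_stage, max_attempts=2):
    """Toy scheduler loop.

    fetch(shuffle_id) returns the data, or None when the data is lost.
    rerun_stage(shuffle_id) regenerates the shuffle output.
    """
    for attempt in range(1, max_attempts + 1):
        data = fetch(shuffle_id)
        if data is not None:
            return data
        try:
            report_shuffle_data_lost(conf, shuffle_id)
        except FetchFailure:
            if attempt == max_attempts:
                raise  # out of attempts, surface the fetch failure
            rerun_stage(shuffle_id)  # regenerate the data, then retry the read
    raise ValueError("max_attempts must be at least 1")
```

With the flag on, a lost shuffle leads to one call to `rerun_stage` followed by a second read; with the flag off, `ShuffleDataLostError` propagates immediately and nothing is rerun.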

turboFei commented Nov 8, 2024

Thank you @FMX.
It works as expected: stage 0 was rerun.

image

turboFei left a comment

Thanks, I have tested this improvement in our Celeborn cluster.

@RexXiong RexXiong closed this in 42d5d42 Nov 12, 2024
RexXiong pushed a commit that referenced this pull request Nov 12, 2024
### What changes were proposed in this pull request?
If shuffle data is lost and throwing fetch failures is enabled, trigger a stage rerun.

### Why are the changes needed?
To rerun the stage when shuffle data is lost.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #2894 from FMX/b1701.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
(cherry picked from commit 42d5d42)
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
RexXiong changed the title from "[CELEBORN-1071] Support stage rerun for shuffle data lost" to "[CELEBORN-1701] Support stage rerun for shuffle data lost" Nov 12, 2024
@RexXiong

Merged to main (v0.6.0) and branch-0.5 (v0.5.2).
