Description
Using PBS Pro on Derecho, we are experiencing an issue in the UFS-SRW where tasks are reported as dead due to hitting the max unknown count:
12/29/25 08:04:22 MST :: FV3LAM_wflow.xml :: Cycle 202311100000, Task make_grid, jobid=4479944, in state DEAD (Unknown), giving up because job state could not be determined 3 consecutive times, try=2 (of 2)
However, the jobs (and job logs) report an exit status of 0. I could not determine from the rocoto source where the dead state is originating. It seems that the job state is already dead when the exit status is checked: https://github.com/christopherwharrop/rocoto/blob/79304a1c47a18ee68c45a52799935c481d0d6d56/lib/workflowmgr/pbsprobatchsystem.rb#L299C22-L299C27. (I am a newbie with the rocoto source code FYI.)
Here is a db snippet of the DEAD task make_grid with an Exit_status of 0 (other tasks are similar). I’m not sure why the make_grid task entry is duplicated.
> sqlite3 FV3LAM_wflow.db ".mode column" ".headers on" "select * from jobs where taskname = 'make_grid'"
id  jobid    taskname   cycle       cores  state  native_state  exit_status  tries  nunknowns  duration
--  -------  ---------  ----------  -----  -----  ------------  -----------  -----  ---------  --------
1   4479944  make_grid  1699574400  24     DEAD   Unknown       0            2      3          0.0
10  4479944  make_grid  1699574400  24     DEAD   Unknown       0            2      3          0.0
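For anyone trying to reproduce this, the same check can be scripted across all tasks. This is a minimal sketch, assuming the `jobs` table has the column names shown in the output above; the helper name `find_suspect_jobs` is hypothetical and not part of rocoto.

```python
import sqlite3


def find_suspect_jobs(conn):
    """Return jobs marked DEAD even though the recorded exit status is 0.

    Assumes the column names from the sqlite3 output in this issue.
    Rows are grouped so duplicated entries show up as n_rows > 1.
    """
    query = """
        SELECT jobid, taskname, cycle,
               COUNT(*) AS n_rows,
               MAX(nunknowns) AS nunknowns
        FROM jobs
        WHERE state = 'DEAD' AND exit_status = 0
        GROUP BY jobid, taskname, cycle
        ORDER BY taskname, cycle
    """
    return conn.execute(query).fetchall()


# Example usage against the workflow database (path is an assumption):
#   conn = sqlite3.connect("FV3LAM_wflow.db")
#   for row in find_suspect_jobs(conn):
#       print(row)
```

Grouping by `jobid`, `taskname`, and `cycle` also surfaces the duplicated rows, since any group with `n_rows` greater than 1 has more than one entry for the same job.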
A couple other items of note:
- A rewind/reboot will often address the status issue but not always.
- The UFS-WM does not experience the same polling issue.
- Increasing the interval between `rocotorun` calls also seems to address the issue.
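To make the last observation easier to test, the spacing between `rocotorun` calls can be controlled from a small driver. This is a rough sketch, assuming `rocotorun` is on the `PATH` and is invoked with its `-w` (workflow XML) and `-d` (database) options; the helper name, the 300-second default, and the injectable `runner`/`sleeper` arguments are illustrative choices, not recommended settings.

```python
import subprocess
import time


def run_rocoto_loop(workflow_xml, database, interval_s=300, max_iterations=None,
                    runner=subprocess.run, sleeper=time.sleep):
    """Call rocotorun repeatedly, waiting interval_s seconds between calls.

    Returns the number of rocotorun invocations made. With
    max_iterations=None the loop runs until interrupted.
    """
    count = 0
    while max_iterations is None or count < max_iterations:
        # check=False: a non-zero exit from rocotorun does not stop the loop.
        runner(["rocotorun", "-w", workflow_xml, "-d", database], check=False)
        count += 1
        # Sleep only if another iteration will follow.
        if max_iterations is None or count < max_iterations:
            sleeper(interval_s)
    return count
```

Running this with a few different `interval_s` values would show whether the DEAD (Unknown) reports disappear above some polling interval.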
Are there any recommendations for additional troubleshooting or knobs we can check? Thank you!