PBS Pro rocoto issue #121

@benkozi

Using PBS Pro on Derecho, we are seeing tasks in the UFS-SRW reported as DEAD after hitting the maximum unknown-state count:

12/29/25 08:04:22 MST :: FV3LAM_wflow.xml :: Cycle 202311100000, Task make_grid, jobid=4479944, in state DEAD (Unknown), giving up because job state could not be determined 3 consecutive times, try=2 (of 2)

However, the jobs (and job logs) report an exit status of 0. I could not determine from the rocoto source where the DEAD state originates. It seems that the job state is already DEAD by the time the exit status is checked: https://github.com/christopherwharrop/rocoto/blob/79304a1c47a18ee68c45a52799935c481d0d6d56/lib/workflowmgr/pbsprobatchsystem.rb#L299C22-L299C27. (FYI, I am new to the rocoto source code.)

Here is a database snippet showing the DEAD task make_grid with an exit_status of 0 (other tasks look similar). I'm not sure why the make_grid entry is duplicated.

> sqlite3 FV3LAM_wflow.db ".mode column" ".headers on" "select * from jobs where taskname = 'make_grid'"
id  jobid    taskname   cycle       cores  state  native_state  exit_status  tries  nunknowns  duration
--  -------  ---------  ----------  -----  -----  ------------  -----------  -----  ---------  --------
1   4479944  make_grid  1699574400  24     DEAD   Unknown       0            2      3          0.0    
10  4479944  make_grid  1699574400  24     DEAD   Unknown       0            2      3          0.0
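
For anyone who wants to scan the whole database for the same pattern, here is a minimal Python sketch built on the standard-library `sqlite3` module. It uses only the `jobs` table and column names visible in the output above. The helper names (`cycle_to_str`, `suspicious_jobs`, `duplicated_jobs`) are made up for illustration and are not part of rocoto, and the column types are assumed to compare cleanly against the literal values shown.

```python
import sqlite3
from datetime import datetime, timezone


def cycle_to_str(epoch):
    """Convert the epoch-seconds cycle column to the YYYYMMDDHHMM form used in the log."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y%m%d%H%M")


def suspicious_jobs(conn):
    """Return rows marked DEAD/Unknown even though the recorded exit_status is 0."""
    rows = conn.execute(
        "SELECT id, jobid, taskname, cycle, tries, nunknowns FROM jobs "
        "WHERE state = 'DEAD' AND native_state = 'Unknown' AND exit_status = 0 "
        "ORDER BY id"
    ).fetchall()
    return [
        {
            "id": r[0],
            "jobid": r[1],
            "taskname": r[2],
            "cycle": cycle_to_str(r[3]),
            "tries": r[4],
            "nunknowns": r[5],
        }
        for r in rows
    ]


def duplicated_jobs(conn):
    """Return (jobid, taskname, cycle, count) for entries that appear more than once."""
    return conn.execute(
        "SELECT jobid, taskname, cycle, COUNT(*) FROM jobs "
        "GROUP BY jobid, taskname, cycle HAVING COUNT(*) > 1 "
        "ORDER BY taskname, cycle"
    ).fetchall()
```

To try it against the real file, open it with `sqlite3.connect("FV3LAM_wflow.db")` and pass the connection to either function. As a sanity check, the cycle value 1699574400 from the rows above converts to 202311100000, which matches the cycle in the log message.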

A few other items of note:

  • A rewind/reboot often, but not always, clears the incorrect status.
  • The UFS-WM does not experience the same polling issue.
  • Increasing the interval between rocotorun calls also seems to address the issue.
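
To make the last workaround concrete, here is a rough Python sketch of a driver loop that spaces out rocotorun calls. It assumes rocotorun is on the PATH and is invoked with its `-w` (workflow XML) and `-d` (database) options. The function name `poll_workflow`, the 300-second default, and the `run`/`sleep` parameters are illustrative choices only, not values taken from the SRW configuration.

```python
import subprocess
import time


def poll_workflow(xml_path, db_path, interval_s=300, max_calls=3,
                  run=subprocess.run, sleep=time.sleep):
    """Call rocotorun max_calls times, waiting interval_s seconds between calls.

    run and sleep are injectable so the loop can be exercised without
    launching rocotorun or actually waiting.
    """
    cmd = ["rocotorun", "-w", xml_path, "-d", db_path]
    return_codes = []
    for i in range(max_calls):
        result = run(cmd, check=False)
        return_codes.append(result.returncode)
        # No need to wait after the final call.
        if i < max_calls - 1:
            sleep(interval_s)
    return return_codes


# Example (not executed here):
# poll_workflow("FV3LAM_wflow.xml", "FV3LAM_wflow.db", interval_s=600, max_calls=10)
```

In practice the same effect comes from lengthening whatever schedule (cron entry or launch script) currently drives rocotorun. The loop is just a convenient way to experiment with different intervals.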

Are there any recommendations for additional troubleshooting or knobs we can check? Thank you!

cc @MichaelLueken
