Skip to content

prov/psm3: Internal polling must use timeout to prevent infinite loop…#11916

Open
tatarintsevsv wants to merge 1 commit intoofiwg:mainfrom
tatarintsevsv:psm3_res_lack
Open

prov/psm3: Internal polling must use timeout to prevent infinite loop…#11916
tatarintsevsv wants to merge 1 commit intoofiwg:mainfrom
tatarintsevsv:psm3_res_lack

Conversation

@tatarintsevsv
Copy link
Copy Markdown

@tatarintsevsv tatarintsevsv commented Feb 25, 2026

… during resources acquiring

Due to a lack of resources, the AMSH_POLL_UNTIL() and PSMI_BLOCKUNTIL() may never complete. If resources polling was not finished within PSMI_MIN_EP_CONNECT_TIMEOUT, an interrupt will occur and error will be returned.

Fixes #11893

Copy link
Copy Markdown
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comment below, note that the return value of psm3_amsh_am_short_request and psm3_amsh_generic_inner is not handled by the upper layer in its call chain. As a result, in case of resource exhaustion, the application is able to break out from the infinite loop but may silently treat this error as success.

@tatarintsevsv
Copy link
Copy Markdown
Author

In addition to the comment below, note that the return value of psm3_amsh_am_short_request and psm3_amsh_generic_inner is not handled by the upper layer in its call chain. As a result, in case of resource exhaustion, the application is able to break out from the infinite loop but may silently treat this error as success.

Error hadling added.
In some cases i'm not sure how to handle error properly, so app will just aborted

@j-xiong
Copy link
Copy Markdown
Contributor

j-xiong commented Feb 26, 2026

Thanks. I do see a few other places in the call chain the error code is still ignored (e.g. psm3_amsh_short_reply, psm3_amsh_long_reply, psm3_am_reqq_drain). However, those are probably fine since they either lead to send side timeout due to not getting the reply or is already in the error processing path. @acgoldma Do you have any comment?

The review process here is different from some other repos that do auto squash on merging. Please squash the commits and do a force push. For this specific case, it is fine to keep the commits separate, but then you will need to add proper commit message and "signed-off" line.

@tatarintsevsv
Copy link
Copy Markdown
Author

Thanks! Commits squashed

@acgoldma
Copy link
Copy Markdown
Contributor

acgoldma commented Mar 2, 2026

I will review the code and get back to you.
That being said, the change in PSMI_BLOCKUNTIL() does not seem right. psm3_poll_internal already uses timers and will return when then expire with PSM2_EP_NO_RESOURCES.
Can you maybe dump the psm3 stack to show more about where you hang is?

@tatarintsevsv
Copy link
Copy Markdown
Author

I will review the code and get back to you. That being said, the change in PSMI_BLOCKUNTIL() does not seem right. psm3_poll_internal already uses timers and will return when then expire with PSM2_EP_NO_RESOURCES. Can you maybe dump the psm3 stack to show more about where you hang is?

Thanks, i will wait for your review!
My test case i'm described here
There little example described the problem.
I think problem is not that psm3_poll_internal returns some incorrect. The problem is that the loop doesn't check for a oveall timeout.
I will give you call stack as soon as possible. Thanks!

Copy link
Copy Markdown
Contributor

@acgoldma acgoldma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think there is a better solution as we should already have timers waiting for ACKs. the point of the yield to receive thread is that the rec thread should check timers.

@tatarintsevsv tatarintsevsv requested a review from acgoldma March 4, 2026 05:26
@tatarintsevsv
Copy link
Copy Markdown
Author

some fixes pushed

Copy link
Copy Markdown
Contributor

@acgoldma acgoldma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@j-xiong
Copy link
Copy Markdown
Contributor

j-xiong commented Mar 26, 2026

Please remove the second commit. Or use rebase instead of merge if you want the CI to pick up all the changes from main.

… during resources acquiring

Due to a lack of resources, the AMSH_POLL_UNTIL() and PSMI_BLOCKUNTIL() may never complete.
If resources polling was not finished within timeout, an interrupt will occur
and error will be returned.

Signed-off-by: Sergey Tatarintsev <s.tatarintsev@postgrespro.ru>
@tatarintsevsv
Copy link
Copy Markdown
Author

Please remove the second commit. Or use rebase instead of merge if you want the CI to pick up all the changes from main.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

prov/psm3: freezing due to lack of resources

3 participants