Scale Set Listener Stops Responding #3204

@jameshounshell

Description

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install both the ARC controller and the scale set Helm charts into a Kubernetes cluster (roughly as sketched below).
2. Run GitHub Actions workflows/jobs normally with a self-hosted runner.

Beyond this we do not have specific steps to reproduce, and we would very much appreciate suggestions on how to gather more information or reliably trigger this bug.
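
For reference, the two charts were installed roughly like this; the release names, namespaces, secret name, and chart version below are illustrative rather than our exact values:

# Illustrative install commands; adjust names and versions to your environment.
helm install arc \
  --namespace arc-systems --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.7.0

helm install self-hosted \
  --namespace arc-runners --create-namespace \
  --set githubConfigUrl="https://github.com/bigcorp" \
  --set githubConfigSecret=github-app-secret \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set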

Describe the bug

Symptom:

  • Developers reported that jobs were pending for over 20 minutes (in the GitHub workflow job view)
  • We restarted the controller and nothing happened
  • We deleted the listener pod; once a new listener pod started, new runner pods started as well (see the sketch below). (Alternatively, we have since discovered that after about an hour the listener begins responding again on its own and the logs say "refreshing token".)
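
In practice, the manual recovery was simply deleting the listener pod and letting the controller recreate it; the namespace and pod name here are illustrative:

# Locate the stuck listener pod (it is named after the runner scale set).
kubectl --namespace arc-systems get pods
# Delete it; the controller recreates the listener and new runner pods then start.
kubectl --namespace arc-systems delete pod self-hosted-754b578d-listener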

Diagnostics:

  • The listener stops logging
  • The only signal we can alert on is an expression like (gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0), built from the metrics the listener exposes (see the sketch after this list).
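
As a rough sketch, that expression can be checked against the Prometheus HTTP query API before wiring it into an alert; the Prometheus address below is a placeholder:

# Returns series where jobs are assigned but the gauge has not changed over 10 minutes.
curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=(gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)'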

Another symptom I've noticed: when the controller pod is updated (for instance, from 0.7.0 to 0.8.0), two inconsistencies happen:

  • If there are old job pods running, they will block the new listener from starting. I need to delete the pods and do a rollout restart of the controller to get the listener going (see the sketch after this list).
  • Just updating the deployment doesn't seem to consistently start the listener either. There is always a chance that a rollout restart is needed to get it going.
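
The workaround for the upgrade case boils down to the following; the namespaces and controller deployment name are illustrative and depend on the Helm release names:

# Remove the leftover runner/job pods (assumes the namespace only holds runner pods).
kubectl --namespace arc-runners delete pods --all
# Restart the controller so it recreates the listener.
kubectl --namespace arc-systems rollout restart deployment arc-gha-rs-controller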

A separate instance where we observed the issue

We were running listener gha-runner-scale-set-controller:0.7.0.
No logs were being generated, and we had found a workflow job stuck in a pending state (identified from the listener logs).
10 runners were online, but the runner group showed as offline in our org's Actions runner group settings (a URL like https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).

We verified that the listener metrics had flatlined.

Then the queued job we were watching was assigned with no intervention.
Below is the section of logs where the listener had stopped and then resumed working on its own.

2023-12-28T17:10:14Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13462, "batchSize": 4}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026947, "RunnerId": 1062477}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-dkmf2", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / sql-lint / lint-migrations", "requestId": 1026947}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / sql-lint / lint-migrations\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026947,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026948, "RunnerId": 1062478}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-gl8pq", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / build / code build", "requestId": 1026948}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / build / code build\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026948,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026947}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026948}
2023-12-28T17:10:14Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}
2023-12-28T17:10:15Z	INFO	auto_scaler	deleted message.	{"messageId": 13462}
2023-12-28T17:10:15Z	INFO	service	waiting for message...

2023-12-28T18:10:15Z	INFO	refreshing_client	message queue token is expired during GetNextMessage, refreshing...
2023-12-28T18:10:15Z	INFO	refreshing token	{"githubConfigUrl": "https://github.com/bigcorp"}
2023-12-28T18:10:15Z	INFO	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/43625644/access_tokens"}
2023-12-28T18:10:15Z	INFO	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/bigcorp/actions/runners/registration-token"}
2023-12-28T18:10:15Z	INFO	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-12-28T18:10:16Z	INFO	service	process message.	{"messageId": 13463, "messageType": "RunnerScaleSetJobMessages"}
2023-12-28T18:10:16Z	INFO	service	current runner scale set statistics.	{"available jobs": 26, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 10, "busy runners": 0, "idle runners": 10}
2023-12-28T18:10:16Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13463, "batchSize": 1}
2023-12-28T18:10:16Z	INFO	service	job completed message received.	{"RequestId": 1026947, "Result": "succeeded", "RunnerId": 1062477, "RunnerName": "self-hosted-pqbsj-runner-dkmf2"}
2023-12-28T18:10:16Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}

Describe the expected behavior

Ideally the listener would never stop responding.

Additional Context

The only additional thing we tried was the opt-out button in our GitHub App's advanced features. This was kind of a Hail Mary, since we had seen the logs related to refreshing the token.
It seems to have helped but maybe that's just a fluke.

Controller Logs

Here are the logs from 10 minutes before through 10 minutes after the listener stops responding.

https://gist.github.com/jameshounshell/597358f9b0b624d1f80f98057ddddcf0

Runner Pod Logs

N/A

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)
