Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.7.0
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Install both the ARC and scale set Helm charts into a Kubernetes cluster (a sketch of the install commands is below).
2. Run GitHub Actions workflows/jobs normally with a self-hosted runner.
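For reference, step 1 was roughly the standard two-chart Helm install. The OCI chart paths are the published ones; the release names, namespaces, and secret reference are illustrative, and in our case the scale set authenticates with a GitHub App rather than a PAT.

```bash
# Step 1a: install the controller chart (version pinned to the one we run).
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.7.0

# Step 1b: install a runner scale set for the org.
# githubConfigSecret points at a pre-created secret holding our GitHub App credentials
# (the secret name here is a placeholder).
helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  --set githubConfigUrl="https://github.com/bigcorp" \
  --set githubConfigSecret=arc-github-app-secret \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.7.0
```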
Beyond this we do not have specific steps to reproduce, and we would very much appreciate suggestions on how we could gather more information or trigger this bug.
Describe the bug
Symptom:
- Developers reported that jobs had been pending for over 20 minutes (in the GitHub workflow job view)
- Restarted the controller and nothing happened
- Deleted the listener; once a new listener pod started, new runner pods started (manual recovery sketched below). (Alternatively, we have since discovered that after about an hour the listener begins responding again and the logs say "refreshing token".)
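The manual recovery in the last bullet was roughly the following. The namespace and pod name are illustrative, not our real ones; the listener pod is the one the controller creates alongside the scale set (it carries a "-listener" suffix).

```bash
# Find the listener pod the controller created for the scale set.
kubectl -n arc-systems get pods

# Delete it; the controller recreates the listener, after which new runner pods start.
kubectl -n arc-systems delete pod self-hosted-754b578d-listener
```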
Diagnostics:
- The listener stops logging
- The only signal we have found to alert on is an expression like ((gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)) over the metrics exposed by the listener; a sketch of a rule using it is below.
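For anyone wanting to alert on this, here is a minimal sketch of a Prometheus rule file built around that expression. The group/alert names, severity label, 15m hold, and file path are placeholders to tune.

```bash
# Write an assumed Prometheus alerting-rule file around the listener metric above.
cat <<'EOF' > arc-listener-stalled.rules.yaml
groups:
  - name: arc-listener
    rules:
      - alert: ARCListenerStalled
        # Jobs are assigned, but the listener's gauge has not moved in 10 minutes.
        expr: (gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: ARC listener appears to have stopped processing messages
EOF
```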

Another symptom I've noticed: when the controller pod is updated (for instance, from 0.7.0 to 0.8.0), two inconsistencies happen:
- If there are old job pods running, they block the new listener from starting. I need to delete the pods and do a rollout restart of the controller to get the listener going (commands sketched below).
- Just updating the deployment doesn't seem to consistently start the listener either. There is always a chance that a rollout restart is needed to get it going.
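Concretely, the workaround from those bullets looks something like this. The namespace, deployment name, and label selector are assumptions for illustration; in practice we deleted the leftover runner pods by name and restarted whatever deployment the controller chart created in your install.

```bash
# Remove the leftover runner pods from the old version that block the new listener
# (the scale-set label selector is an assumption; deleting by pod name also works).
kubectl -n arc-runners delete pods -l actions.github.com/scale-set-name=self-hosted

# Bounce the controller so it recreates the listener.
kubectl -n arc-systems rollout restart deployment arc-gha-rs-controller
```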
A separate instance where we observed the issue
Currently running the listener from gha-runner-scale-set-controller:0.7.0.
No logs were being generated, and we had found a workflow job stuck in a pending state (via the listener logs).
10 runners were online, but the runner group showed as offline in our org's Actions runner group settings (URL like https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).
Verified that the listener metrics had flatlined.

Then the queued job being watched was assigned with no intervention.
Below is the section of logs where the listener had stopped and then started working again with no intervention.
2023-12-28T17:10:14Z INFO service process batched runner scale set job messages. {"messageId": 13462, "batchSize": 4}
2023-12-28T17:10:14Z INFO service job started message received. {"RequestId": 1026947, "RunnerId": 1062477}
2023-12-28T17:10:14Z INFO service update job info for runner {"runnerName": "self-hosted-pqbsj-runner-dkmf2", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / sql-lint / lint-migrations", "requestId": 1026947}
2023-12-28T17:10:14Z INFO KubernetesManager Created merge patch json for EphemeralRunner status update {"json": "{\"status\":{\"jobDisplayName\":\"build / sql-lint / lint-migrations\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026947,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z INFO service job started message received. {"RequestId": 1026948, "RunnerId": 1062478}
2023-12-28T17:10:14Z INFO service update job info for runner {"runnerName": "self-hosted-pqbsj-runner-gl8pq", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / build / code build", "requestId": 1026948}
2023-12-28T17:10:14Z INFO KubernetesManager Created merge patch json for EphemeralRunner status update {"json": "{\"status\":{\"jobDisplayName\":\"build / build / code build\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026948,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z INFO service job assigned message received. {"RequestId": 1026947}
2023-12-28T17:10:14Z INFO service job assigned message received. {"RequestId": 1026948}
2023-12-28T17:10:14Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-12-28T17:10:15Z INFO auto_scaler deleted message. {"messageId": 13462}
2023-12-28T17:10:15Z INFO service waiting for message...
2023-12-28T18:10:15Z INFO refreshing_client message queue token is expired during GetNextMessage, refreshing...
2023-12-28T18:10:15Z INFO refreshing token {"githubConfigUrl": "https://github.com/bigcorp"}
2023-12-28T18:10:15Z INFO getting access token for GitHub App auth {"accessTokenURL": "https://api.github.com/app/installations/43625644/access_tokens"}
2023-12-28T18:10:15Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/orgs/bigcorp/actions/runners/registration-token"}
2023-12-28T18:10:15Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-12-28T18:10:16Z INFO service process message. {"messageId": 13463, "messageType": "RunnerScaleSetJobMessages"}
2023-12-28T18:10:16Z INFO service current runner scale set statistics. {"available jobs": 26, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 10, "busy runners": 0, "idle runners": 10}
2023-12-28T18:10:16Z INFO service process batched runner scale set job messages. {"messageId": 13463, "batchSize": 1}
2023-12-28T18:10:16Z INFO service job completed message received. {"RequestId": 1026947, "Result": "succeeded", "RunnerId": 1062477, "RunnerName": "self-hosted-pqbsj-runner-dkmf2"}
2023-12-28T18:10:16Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
Describe the expected behavior
Ideally the listener would never stop responding.
Additional Context
The only additional thing we tried was using the opt-out button under our GitHub App's advanced features. This was something of a Hail Mary, since we had seen the logs about refreshing the token.

It seems to have helped but maybe that's just a fluke.
Controller Logs
Here are the logs from 10 minutes before through 10 minutes after the listener stops responding.
https://gist.github.com/jameshounshell/597358f9b0b624d1f80f98057ddddcf0
Runner Pod Logs
N/A