Scale Set Listener Stops Responding #3204

@jameshounshell

Description

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install both the ARC controller and the scale set Helm charts into a Kubernetes cluster (roughly as sketched below).
2. Run GitHub Actions workflows/jobs normally with a self-hosted runner.

Beyond this we do not have specific steps to reproduce, and we would very much appreciate suggestions on how to gather more information or reliably trigger this bug.
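
For reference, the two charts were installed roughly like this; the release names, namespaces, secret name, and chart version below are illustrative rather than our exact values:

# Illustrative install commands; adjust names and versions to your environment.
helm install arc \
  --namespace arc-systems --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.7.0

helm install self-hosted \
  --namespace arc-runners --create-namespace \
  --set githubConfigUrl="https://github.com/bigcorp" \
  --set githubConfigSecret=github-app-secret \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set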

Describe the bug

Symptom:

  • Developers reported that jobs were pending for over 20 minutes (in the GitHub workflow job view)
  • We restarted the controller and nothing happened
  • We deleted the listener pod; once a new listener pod started, new runner pods started as well (see the sketch below). (Alternatively, we have since discovered that after about an hour the listener begins responding again on its own and the logs say "refreshing token".)
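
In practice, the manual recovery was simply deleting the listener pod and letting the controller recreate it; the namespace and pod name here are illustrative:

# Locate the stuck listener pod (it is named after the runner scale set).
kubectl --namespace arc-systems get pods
# Delete it; the controller recreates the listener and new runner pods then start.
kubectl --namespace arc-systems delete pod self-hosted-754b578d-listener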

Diagnostics:

  • The listener stops logging
  • The only signal we can alert on is an expression like (gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0), built from the metrics the listener exposes (see the sketch after this list).
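
As a rough sketch, that expression can be checked against the Prometheus HTTP query API before wiring it into an alert; the Prometheus address below is a placeholder:

# Returns series where jobs are assigned but the gauge has not changed over 10 minutes.
curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=(gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)'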

Another symptom I've noticed: when the controller pod is updated (for instance, from 0.7.0 to 0.8.0), two inconsistencies happen:

  • If there are old job pods running, they will block the new listener from starting. I need to delete the pods and do a rollout restart of the controller to get the listener going (see the sketch after this list).
  • Just updating the deployment doesn't seem to consistently start the listener either. There is always a chance that a rollout restart is needed to get it going.
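
The workaround for the upgrade case boils down to the following; the namespaces and controller deployment name are illustrative and depend on the Helm release names:

# Remove the leftover runner/job pods (assumes the namespace only holds runner pods).
kubectl --namespace arc-runners delete pods --all
# Restart the controller so it recreates the listener.
kubectl --namespace arc-systems rollout restart deployment arc-gha-rs-controller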

A separate instance where we observed the issue

We were running listener gha-runner-scale-set-controller:0.7.0.
No logs were being generated, and we had found a workflow job stuck in a pending state (identified from the listener logs).
10 runners were online, but the runner group showed as offline in our org's Actions runner group settings (a URL like https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).

We verified that the listener metrics had flatlined.

Then the queued job we were watching was assigned with no intervention.
Below is the section of logs where the listener had stopped and then resumed working on its own.

2023-12-28T17:10:14Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13462, "batchSize": 4}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026947, "RunnerId": 1062477}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-dkmf2", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / sql-lint / lint-migrations", "requestId": 1026947}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / sql-lint / lint-migrations\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026947,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026948, "RunnerId": 1062478}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-gl8pq", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / build / code build", "requestId": 1026948}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / build / code build\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026948,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026947}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026948}
2023-12-28T17:10:14Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}
2023-12-28T17:10:15Z	INFO	auto_scaler	deleted message.	{"messageId": 13462}
2023-12-28T17:10:15Z	INFO	service	waiting for message...

2023-12-28T18:10:15Z	INFO	refreshing_client	message queue token is expired during GetNextMessage, refreshing...
2023-12-28T18:10:15Z	INFO	refreshing token	{"githubConfigUrl": "https://github.com/bigcorp"}
2023-12-28T18:10:15Z	INFO	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/43625644/access_tokens"}
2023-12-28T18:10:15Z	INFO	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/bigcorp/actions/runners/registration-token"}
2023-12-28T18:10:15Z	INFO	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-12-28T18:10:16Z	INFO	service	process message.	{"messageId": 13463, "messageType": "RunnerScaleSetJobMessages"}
2023-12-28T18:10:16Z	INFO	service	current runner scale set statistics.	{"available jobs": 26, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 10, "busy runners": 0, "idle runners": 10}
2023-12-28T18:10:16Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13463, "batchSize": 1}
2023-12-28T18:10:16Z	INFO	service	job completed message received.	{"RequestId": 1026947, "Result": "succeeded", "RunnerId": 1062477, "RunnerName": "self-hosted-pqbsj-runner-dkmf2"}
2023-12-28T18:10:16Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}

Describe the expected behavior

Ideally the listener would never stop responding.

Additional Context

The only additional thing we tried was the opt-out button in our GitHub App's advanced features. This was kind of a Hail Mary, since we had seen the logs related to refreshing the token.
It seems to have helped but maybe that's just a fluke.

Controller Logs

Here are the logs from 10 minutes before through 10 minutes after the listener stops responding.

https://gist.github.com/jameshounshell/597358f9b0b624d1f80f98057ddddcf0

Runner Pod Logs

N/A

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)
