Checks
Controller Version
0.27.1
Helm Chart Version
0.22.1
CertManager Version
1.9.1
Deployment Method
Helm
cert-manager installation
Checks
Resource Definitions
Name:         prod-1-sandbox
Namespace:    runners
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: runners
              meta.helm.sh/release-namespace: runners
API Version:  actions.summerwind.dev/v1alpha1
Kind:         RunnerSet
Metadata:
  Creation Timestamp: 2023-04-01T12:24:22Z
  Generation: 1
  Managed Fields:
    API Version: actions.summerwind.dev/v1alpha1
    Fields Type: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:dockerdWithinRunnerContainer:
        f:ephemeral:
        f:organization:
        f:replicas:
        f:selector:
          .:
          f:matchLabels:
            .:
            f:app:
        f:serviceName:
        f:template:
          .:
          f:metadata:
            .:
            f:labels:
              .:
              f:app:
          f:spec:
            .:
            f:containers:
            f:securityContext:
              .:
              f:fsGroup:
            f:volumes:
    Manager: helm
    Operation: Update
    Time: 2023-04-01T12:24:22Z
    API Version: actions.summerwind.dev/v1alpha1
    Fields Type: FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableReplicas:
        f:desiredReplicas:
        f:readyReplicas:
        f:replicas:
        f:updatedReplicas:
    Manager: manager
    Operation: Update
    Subresource: status
    Time: 2023-04-01T13:02:57Z
  Resource Version: 39471951
  UID: f1997923-966c-4385-83ef-8a794b3e378c
Spec:
  Dockerd Within Runner Container: true
  Ephemeral: true
  Organization: sandbox
  Replicas: 2
  Selector:
    Match Labels:
      App: prod-1-sandbox
  Service Name: prod-1-sandbox
  Template:
    Metadata:
      Labels:
        App: prod-1-sandbox
    Spec:
      Containers:
        Env:
          Name: RUNNER_GRACEFUL_STOP_TIMEOUT
          Value: 120
        Image: <>/gha-runner:0.0.57
        Name: runner
        Resources:
          Limits:
            kvm: 1
          Requests:
            Cpu: 500m
            Memory: 2G
        Volume Mounts:
          Mount Path: /etc/var
          Name: cert
          Read Only: true
      Security Context:
        Fs Group: 1000
      Volumes:
        Name: cert
        Secret:
          Optional: false
          Secret Name: cert
Status:
  Available Replicas: 90
  Desired Replicas: 2
  Ready Replicas: 88
  Replicas: 90
  Updated Replicas: 90
Events: <none>
To Reproduce
Helm chart changes done on the controller side (ref: #2465):
  runner:
    statusUpdateHook:
      enabled: true
Describe the bug
I tried enabling statusUpdateHook (#1268 by @fgalind1 👏), but unfortunately it is not working for me as expected. I do not see any status changes via kubectl get pods -A -w, and instead I run into #288.
kubectl get runnerset -n runners
NAME            DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
prod-1-name1    1        48       48          48         3d18h
prod-1-name2    1        32       32          32         11d
prod-1-name2    2        42       42          42         11d
prod-1-name4    10       110      110         110        11d
prod-1-sandbox  2        102      102         102        11d
prod-1-name5    6        125      125         125        2d11h
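To make the gap between DESIRED and CURRENT easier to track over time, the table above can be reduced to just the over-provisioned RunnerSets. This is a minimal sketch that assumes the column order shown above (NAME, DESIRED, CURRENT, ...); the function name surplus_runners is made up for illustration and is not part of ARC or kubectl.

```shell
# Hypothetical helper: reads `kubectl get runnerset` table output on stdin and
# prints every RunnerSet whose CURRENT count exceeds its DESIRED count.
surplus_runners() {
  # NR > 1 skips the header row; "+0" forces a numeric comparison.
  awk 'NR > 1 && ($3 + 0) > ($2 + 0) {
    printf "%s desired=%d current=%d surplus=%d\n", $1, $2, $3, $3 - $2
  }'
}

# Usage against a live cluster (assumes kubectl access to the namespace):
#   kubectl get runnerset -n runners | surplus_runners
```

Run periodically, this would show whether the surplus keeps growing after a configuration change or levels off.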
We use a custom runner image. I have updated it to include the update-status script and the other job hook scripts that invoke the API, and to set the two env vars that enable the job-started and job-completed hooks:
export ACTIONS_RUNNER_HOOK_JOB_STARTED=/hooks/job-started.sh
export ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/hooks/job-completed.sh
After upgrading ARC to 0.27.1, I also updated the scripts to include repo info, as done by @Moser-ss in #2093.


Helm chart changes done on the controller side:
  runner:
    statusUpdateHook:
      enabled: true
I did not set the other flag, for Kubernetes container mode, so it remains at its default:
  rbac:
    allowGrantingKubernetesContainerModePermissions: false
I even tried calling the same API from within the runner container, and it seems the pod does not have the correct privileges. It looks like the ServiceAccount, Role, and RoleBinding are not created successfully during processRunnerCreation(), but I could not find anything related to role creation in the controller logs.
curl --cacert ${serviceaccount}/ca.crt --header "Content-Type: application/merge-patch+json" --header "Authorization: Bearer ${token}" --show-error "${apiserver}/apis/actions.summerwind.dev/v1alpha1/namespaces/${namespace}/runners/${HOSTNAME}/status"
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "runners.actions.summerwind.dev \"runner-ztqh2-0\" is forbidden: User \"system:serviceaccount:runners:default\" cannot get resource \"runners/status\" in API group \"actions.summerwind.dev\" in the namespace \"runners\"",
  "reason": "Forbidden",
  "details": {
    "name": "runner-ztqh2-0",
    "group": "actions.summerwind.dev",
    "kind": "runners"
  },
  "code": 403
}
I also do not see any Roles or RoleBindings created with this policy. Is this expected?
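One detail worth noting: the curl call above passes no --request option and no body, so it issues a GET, which matches the "cannot get resource" wording in the 403. An actual status update would be a PATCH with a merge-patch body. The following is a minimal sketch of such a call, assuming the standard in-pod service account mount and the KUBERNETES_SERVICE_HOST/KUBERNETES_SERVICE_PORT env vars; the function names and the "Running" phase value are illustrative and may differ from what ARC's own update-status script sends.

```shell
# Standard in-pod service account mount; overridable here for local testing.
SA_DIR="${SA_DIR:-/var/run/secrets/kubernetes.io/serviceaccount}"

# Hypothetical helper: URL of the status subresource of one Runner.
runner_status_url() {
  printf '%s/apis/actions.summerwind.dev/v1alpha1/namespaces/%s/runners/%s/status' \
    "$1" "$2" "$3"
}

# Hypothetical helper: JSON merge-patch body that sets only .status.phase.
runner_status_patch() {
  printf '{"status":{"phase":"%s"}}' "$1"
}

# Hypothetical helper: send the PATCH using the pod's own service account.
patch_runner_status() {
  apiserver="https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"
  namespace="$(cat "${SA_DIR}/namespace")"
  token="$(cat "${SA_DIR}/token")"
  curl --cacert "${SA_DIR}/ca.crt" \
    --request PATCH \
    --header "Content-Type: application/merge-patch+json" \
    --header "Authorization: Bearer ${token}" \
    --show-error --silent \
    --data "$(runner_status_patch "$1")" \
    "$(runner_status_url "$apiserver" "$namespace" "$HOSTNAME")"
}

# Usage from inside a runner pod:
#   patch_runner_status Running
```

Whether the pod's service account is allowed to do this can also be checked from outside the pod, for example with kubectl auth can-i patch runners.actions.summerwind.dev --subresource=status --as=system:serviceaccount:runners:default -n runners, assuming the caller is permitted to impersonate that account.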
I do see that the controller logs started showing a lot of the errors described in #288:
https://gist.github.com/dhawalseth/38f7b56b50f74a0f6f43b78c120deac9
This led us into the unlimited-number-of-pods issue, which might be a symptom of using the statusUpdateHook flag.
To remediate, we took the following steps, based on the discussions and comments from @mumoshu in the parent issue #288 and in #1646:
- increased the capacity of the node pool,
- checked that runner auto-update is turned off,
- confirmed there is no firewall between the apiserver and the nodes,
- cleaned up a lot of offline runners using the GH API to reduce pagination of the API response data,
- confirmed the /runner/config file is present,
- confirmed runner registration was successful as per the _diag logs.
Instead, we observed the controller bringing up more and more runners until it could not provision any more within the new capacity. It appears to consider the runners not properly registered, and so it keeps trying to provision more.
However, the runner _diag logs do not show a failed registration of the kind seen in #1646, and the runner appears to be able to register itself with GH: https://gist.github.com/dhawalseth/d352c969ba7364aaccf2d964e24a0f76
Any insights into this issue would be really helpful. Please let me know if there is anything that I may be missing here or could have misconfigured.
Describe the expected behavior
Runners update status and get registered correctly.
Whole Controller Logs
https://gist.github.com/dhawalseth/38f7b56b50f74a0f6f43b78c120deac9
Whole Runner Pod Logs
https://gist.github.com/dhawalseth/d352c969ba7364aaccf2d964e24a0f76
Additional Context
No response