Runner pod statusUpdateHook not working and unlimited runners start provisioning #2468

@dhawalseth

Description

Checks

Controller Version

0.27.1

Helm Chart Version

0.22.1

CertManager Version

1.9.1

Deployment Method

Helm

cert-manager installation

  • Yes
  • Yes

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support.)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

Name:         prod-1-sandbox
Namespace:    runners
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: runners
              meta.helm.sh/release-namespace: runners
API Version:  actions.summerwind.dev/v1alpha1
Kind:         RunnerSet
Metadata:
  Creation Timestamp:  2023-04-01T12:24:22Z
  Generation:          1
  Managed Fields:
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:dockerdWithinRunnerContainer:
        f:ephemeral:
        f:organization:
        f:replicas:
        f:selector:
          .:
          f:matchLabels:
            .:
            f:app:
        f:serviceName:
        f:template:
          .:
          f:metadata:
            .:
            f:labels:
              .:
              f:app:
          f:spec:
            .:
            f:containers:
            f:securityContext:
              .:
              f:fsGroup:
            f:volumes:
    Manager:      helm
    Operation:    Update
    Time:         2023-04-01T12:24:22Z
    API Version:  actions.summerwind.dev/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableReplicas:
        f:desiredReplicas:
        f:readyReplicas:
        f:replicas:
        f:updatedReplicas:
    Manager:         manager
    Operation:       Update
    Subresource:     status
    Time:            2023-04-01T13:02:57Z
  Resource Version:  39471951
  UID:               f1997923-966c-4385-83ef-8a794b3e378c
Spec:
  Dockerd Within Runner Container:  true
  Ephemeral:                        true
  Organization:                     sandbox
  Replicas:                         2
  Selector:
    Match Labels:
      App:       prod-1-sandbox
  Service Name:  prod-1-sandbox
  Template:
    Metadata:
      Labels:
        App:  prod-1-sandbox
    Spec:
      Containers:
        Env:
          Name:   RUNNER_GRACEFUL_STOP_TIMEOUT
          Value:  120
        Image:    <>/gha-runner:0.0.57
        Name:     runner
        Resources:
          Limits:
            kvm:  1
          Requests:
            Cpu:     500m
            Memory:  2G
        Volume Mounts:
          Mount Path:  /etc/var
          Name:        cert
          Read Only:   true
      Security Context:
        Fs Group:  1000
      Volumes:
        Name:  cert
        Secret:
          Optional:     false
          Secret Name:  cert
Status:
  Available Replicas:  90
  Desired Replicas:    2
  Ready Replicas:      88
  Replicas:            90
  Updated Replicas:    90
Events:                <none>

To Reproduce

Helm chart changes done on the controller side:

runner:
  statusUpdateHook:
    enabled: true

ref: #2465

Describe the bug

I tried enabling statusUpdateHook (#1268, by @fgalind1 👏), but it is not working as expected. I do not see any status changes when watching with kubectl get pods -A -w; instead I run into #288.

kubectl get runnerset -n runners

NAME             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
prod-1-name1     1         48        48           48          3d18h
prod-1-name2     1         32        32           32          11d
prod-1-name2     2         42        42           42          11d
prod-1-name4     10        110       110          110         11d
prod-1-sandbox   2         102       102          102         11d
prod-1-name5     6         125       125          125         2d11h
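
To make the gap between DESIRED and CURRENT easier to track, here is a minimal sketch that filters the table output for RunnerSets running more replicas than requested. The function name overprovisioned_runnersets is hypothetical, and the sketch assumes the column order shown above (NAME, DESIRED, CURRENT, ...).

```shell
# Print RunnerSets whose CURRENT count exceeds DESIRED.
# Reads `kubectl get runnerset` table output on stdin.
# ASSUMPTION: columns are NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE.
overprovisioned_runnersets() {
  awk 'NR > 1 && NF >= 3 && $3 + 0 > $2 + 0 {
         printf "%s desired=%s current=%s excess=%d\n", $1, $2, $3, $3 - $2
       }'
}

# Example with two rows copied from the output above.
overprovisioned_runnersets <<'EOF'
NAME             DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
prod-1-name1     1         48        48           48          3d18h
prod-1-sandbox   2         102       102          102         11d
EOF
```

Against a live cluster this could be used as kubectl get runnerset -n runners | overprovisioned_runnersets.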

We use a custom runner image. I have updated it to include the update-status script and the other job hook scripts that invoke the API, and I set the two environment variables that enable the job-started and job-completed hooks:

export ACTIONS_RUNNER_HOOK_JOB_STARTED=/hooks/job-started.sh
export ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/hooks/job-completed.sh

Later, after upgrading ARC to 0.27.1, I also updated the scripts to include repository information, as done by @Moser-ss in #2093.
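
For context, a hook script of this kind has to send a JSON merge patch to the Runner's status subresource. The sketch below only builds such a body; it is not the script shipped with ARC. The function names are hypothetical, and the status field names phase and message are an assumption.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a merge-patch body for the Runner status subresource.
# ASSUMPTION: the status fields are named "phase" and "message".
build_status_patch() {
  printf '{"status":{"phase":"%s","message":"%s"}}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Example: a body that a job-started hook might send.
build_status_patch "Running" "job started"
```

The body would then be piped to curl with --request PATCH, --data @- and the application/merge-patch+json content type, against the runners/${HOSTNAME}/status URL used elsewhere in this report.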

Screenshot 2023-04-01 at 5 57 15 AM

Screenshot 2023-04-01 at 5 57 20 AM

Helm chart changes done on the controller side:

runner:
  statusUpdateHook:
    enabled: true

I did not set the other flag, for Kubernetes container mode, so it remains at its default:

rbac:
    allowGrantingKubernetesContainerModePermissions: false

I even tried calling the same API from within the runner container, and it seems that it does not have the correct privileges. It looks as though the ServiceAccount, Role, and RoleBinding are not created successfully during processRunnerCreation(), but I could not find anything related to role creation in the controller logs.

 curl --cacert ${serviceaccount}/ca.crt  --header "Content-Type: application/merge-patch+json" --header "Authorization: Bearer ${token}" --show-error  "${apiserver}/apis/actions.summerwind.dev/v1alpha1/namespaces/${namespace}/runners/${HOSTNAME}/status"
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "runners.actions.summerwind.dev \"runner-ztqh2-0\" is forbidden: User \"system:serviceaccount:runners:default\" cannot get resource \"runners/status\" in API group \"actions.summerwind.dev\" in the namespace \"runners\"",
  "reason": "Forbidden",
  "details": {
    "name": "runner-ztqh2-0",
    "group": "actions.summerwind.dev",
    "kind": "runners"
  },
  "code": 403
}

I also do not see any Roles or RoleBindings created with this policy. Is this expected?
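
The 403 response names the identity the request ran as, which here is the namespace's default service account. A minimal sketch for pulling that identity out of the error and asking the API server what it may do follows. Both function names are hypothetical, and the kubectl wrapper needs cluster access plus impersonation rights, so it is only defined here, not run.

```shell
# Extract the first service account identity from a "Forbidden" message.
# Reads the message text (or the whole JSON response) on stdin.
forbidden_service_account() {
  grep -o 'system:serviceaccount:[A-Za-z0-9.:-]*' | head -n 1
}

# Ask the API server whether an identity may perform a verb on runners/status.
# usage: can_touch_runner_status <verb> <user> <namespace>
can_touch_runner_status() {
  kubectl auth can-i "$1" runners.actions.summerwind.dev \
    --subresource=status --as="$2" --namespace="$3"
}

# Example using the message fragment from the response above.
printf '%s\n' 'is forbidden: User \"system:serviceaccount:runners:default\" cannot get resource' \
  | forbidden_service_account
```

With cluster access, can_touch_runner_status patch system:serviceaccount:runners:default runners should answer yes or no, and kubectl get serviceaccount,role,rolebinding -n runners lists whatever RBAC objects exist in the namespace.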

The controller logs did start showing many of the messages described in #288:
https://gist.github.com/dhawalseth/38f7b56b50f74a0f6f43b78c120deac9
This led to an unbounded number of pods being created, which might be a symptom of enabling the statusUpdateHook flag.

To remediate, we took the following steps, based on the discussions and comments from @mumoshu in the parent issue #288 and in #1646:

  1. Increased the capacity of the node pool.
  2. Checked that runner auto-update is turned off.
  3. Confirmed that there is no firewall between the API server and the nodes.
  4. Cleaned up many offline runners using the GitHub API, to reduce pagination of the API response data.
  5. Confirmed that the /runner/config file is present.
  6. Confirmed that runner registration was successful according to the _diag logs.

None of this helped. Instead, the controller kept bringing up more and more runners until it could not provision any more within the new capacity. It appears the runners are not being registered properly from the controller's point of view, so it keeps trying to provision more.
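
The offline-runner cleanup in step 4 can be sketched as a small filter. The function name offline_runner_ids is hypothetical, and the tab-separated input format (id, name, status) is an assumption chosen for this example; the report does not say which tool was used to call the GitHub API.

```shell
# Print the IDs of runners whose status is "offline".
# Reads tab-separated "id<TAB>name<TAB>status" lines on stdin.
# ASSUMPTION: the input was flattened into this three-column form beforehand.
offline_runner_ids() {
  awk -F '\t' '$3 == "offline" { print $1 }'
}

# Example with made-up rows.
printf '101\trunner-a\tonline\n102\trunner-b\toffline\n' | offline_runner_ids
```

Assuming the gh CLI is installed, the rows could come from gh api --paginate "orgs/ORG/actions/runners" --jq '.runners[] | [.id, .name, .status] | @tsv', and each printed ID could then be passed to gh api --method DELETE "orgs/ORG/actions/runners/ID", where ORG and ID are placeholders.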

Unlike in #1646, however, the runner _diag logs do not show a failed registration; the runner appears to be able to register itself with GitHub: https://gist.github.com/dhawalseth/d352c969ba7364aaccf2d964e24a0f76

Any insights into this issue would be really helpful. Please let me know if there is anything that I may be missing here or could have misconfigured.

Describe the expected behavior

Runners update status and get registered correctly.

Whole Controller Logs

https://gist.github.com/dhawalseth/38f7b56b50f74a0f6f43b78c120deac9

Whole Runner Pod Logs

https://gist.github.com/dhawalseth/d352c969ba7364aaccf2d964e24a0f76

Additional Context

No response

Labels: bug (Something isn't working), needs triage (Requires review from the maintainers)
