
Conversation

supertetelman (Contributor) commented Mar 10, 2022

Our Helm installs were a mix of running from localhost and from kube-master[0]. This was causing issues in nfs-client-provisioner because the CentOS Kubespray installer was not properly installing kubectl on the kube-master nodes.

For now I am aligning everything with what we did for the GPU Operator. In the future it would make sense to use the now-functional helm Ansible module and run everything from localhost (the provisioning node) instead of kube-master[0]. That would let us install fewer binaries on the management nodes, but beyond that it is not a necessary change.
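As a rough sketch of that future direction, the install could move to the kubernetes.core.helm module running against localhost. This is illustrative only: the repository URL and value layout are assumptions, while the chart name, version, and values mirror the existing nfs-client-provisioner install.

# Sketch: install the nfs-client-provisioner chart from the provisioning node
# using the Helm Ansible modules instead of shelling out on kube-master[0].
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Add the chart repository
      kubernetes.core.helm_repository:
        name: nfs-subdir-external-provisioner
        # Assumed upstream repo URL for the chart
        repo_url: https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner

    - name: Install nfs-client-provisioner
      kubernetes.core.helm:
        name: nfs-subdir-external-provisioner
        chart_ref: nfs-subdir-external-provisioner/nfs-subdir-external-provisioner
        chart_version: "4.0.13"
        release_namespace: deepops-nfs-client-provisioner
        create_namespace: true
        wait: true
        values:
          # Example values only; in practice these would come from group_vars
          nfs:
            server: 127.0.0.1
            path: /export/deepops_nfs
          storageClass:
            defaultClass: true

With this approach only the helm binary and the kubernetes.core collection need to be present on the provisioning node, and nothing extra on the management nodes.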

Also added the standard proxy settings to a few places in the Helm installs where they were missing.
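Where those proxy settings matter, the usual pattern is an environment block on the task that shells out to helm. A minimal sketch, assuming a proxy_env dictionary is defined in group_vars (the variable name and values are assumptions):

# Sketch: pass proxy settings to a task that runs helm, falling back to an
# empty environment when no proxy is configured.
# proxy_env is assumed to look like:
#   proxy_env:
#     http_proxy: http://proxy.example.com:3128
#     https_proxy: http://proxy.example.com:3128
#     no_proxy: localhost,127.0.0.1
- name: Update helm repositories
  command: helm repo update
  environment: "{{ proxy_env | default({}) }}"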

Additionally, I moved the block that runs helm/kubectl commands so that it comes after the block that actually installs the kubectl/helm binaries. The old ordering caused edge-case failures on CentOS because the relevant software is installed differently on Ubuntu and CentOS.
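The resulting shape of k8s-cluster.yml is roughly as follows; the role names are placeholders to illustrate the ordering, not the exact ones used in the playbook.

# Illustrative ordering only: install the kubectl/helm binaries before any
# play that shells out to them.
- hosts: kube-master[0]
  roles:
    - install-kubectl          # placeholder name: ensure kubectl is present
    - install-helm             # placeholder name: ensure helm is present

- hosts: kube-master[0]
  roles:
    - nfs-client-provisioner   # runs helm, so it must come after the installs
    - gpu-operator             # likewise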

The automated tests already cover all the code paths this change touches.

- include: ../bootstrap/bootstrap-openshift.yml

# GPU operator
- hosts: kube-master[0]
supertetelman (Contributor, Author) commented:

The expectation is that Helm commands are run from the provisioning node. No need to install Helm and run it on the management systems.
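A minimal sketch of what that looks like for a chart install, with the helm command delegated to the provisioning node; the chart reference and namespace below are illustrative, not necessarily what the playbook uses.

# Sketch: run the helm command from the provisioning node so nothing extra
# has to be installed on the management systems.
- name: Install GPU Operator chart
  command: helm upgrade --install gpu-operator nvidia/gpu-operator --create-namespace --namespace gpu-operator --wait
  delegate_to: localhost
  run_once: true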

supertetelman changed the title from "[WIP] Debugging CentOS test failures related to missing kubectl" to "Fix some ordering in k8s-cluster.yml to install Helm properly and run all commands from localhost - fixing CentOS install" on Mar 23, 2022
supertetelman marked this pull request as ready for review March 23, 2022 02:31
supertetelman added the next-release (Critical for the next release) label Mar 23, 2022
supertetelman requested a review from ajdecon March 23, 2022 02:31
ajdecon (Contributor) left a comment:

Two issues to address:

  1. ansible-lint failed with a minor spacing issue (the one-line fix is sketched after the test output below):
Linting ./nfs-client-provisioner
WARNING  Listing 1 violation(s) that are fatal
tasks/main.yml:4: [var-spacing] [LOW] Variables should have spaces before and after: "{{k8s_nfs_client_repo_name}}"
Warning: var-spacing Variables should have spaces before and after: "{{k8s_nfs_client_repo_name}}"
You can skip specific rules or tags by adding them to your configuration file:
# .ansible-lint
warn_list:  # or 'skip_list' to silence them completely
  - var-spacing  # Variables should have spaces before and after:  {{ var_name }}

Finished with 1 failure(s), 0 warning(s) on 2 files.
  2. The Jenkins end-to-end test failed. This might be a transient failure, so it's worth re-running, then debugging if it repeats.
TASK [install nfs-client-provisioner] ******************************************
fatal: [localhost]: FAILED! => changed=false 
  cmd:
  - /usr/local/bin/helm
  - upgrade
  - --install
  - nfs-subdir-external-provisioner
  - nfs-subdir-external-provisioner/nfs-subdir-external-provisioner
  - --create-namespace
  - --namespace
  - deepops-nfs-client-provisioner
  - --version
  - 4.0.13
  - --set
  - nfs.server=127.0.0.1
  - --set
  - nfs.path=/export/deepops_nfs
  - --set
  - storageClass.defaultClass=true
  - --wait
  delta: '0:00:00.060010'
  end: '2022-03-23 03:29:00.175032'
  msg: non-zero return code
  rc: 1
  start: '2022-03-23 03:29:00.115022'
  stderr: |-
    Error: Kubernetes cluster unreachable: <html><head><meta http-equiv='refresh' content='1;url=/login?from=%2Fversion%3Ftimeout%3D32s'/><script>window.location.replace('/login?from=%2Fversion%3Ftimeout%3D32s');</script></head><body style='background-color:white; color:white;'>
  
  
    Authentication required
    <!--
    You are authenticated as: anonymous
    Groups that you are in:
  
    Permission you need to have (but didn't): hudson.model.Hudson.Read
     ... which is implied by: hudson.security.Permission.GenericRead
     ... which is implied by: hudson.model.Hudson.Administer
    -->
  
    </body></html>
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
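For the var-spacing violation above, the fix is simply adding spaces inside the Jinja2 delimiters in nfs-client-provisioner/tasks/main.yml. A sketch of the change follows; the surrounding task and the second variable name are illustrative, and only the flagged expression comes from the lint output.

# Before (fails ansible-lint var-spacing):
- name: Add nfs-client-provisioner chart repo
  command: helm repo add "{{k8s_nfs_client_repo_name}}" "{{ k8s_nfs_client_repo_url }}"

# After:
- name: Add nfs-client-provisioner chart repo
  command: helm repo add "{{ k8s_nfs_client_repo_name }}" "{{ k8s_nfs_client_repo_url }}"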

supertetelman changed the title from "Fix some ordering in k8s-cluster.yml to install Helm properly and run all commands from localhost - fixing CentOS install" to "Fix some ordering in k8s-cluster.yml to install Helm properly and run all commands from kube-master[0] - fixing CentOS install" on Mar 23, 2022
supertetelman requested a review from ajdecon March 24, 2022 15:50
ajdecon merged commit c01e64f into NVIDIA:master Mar 24, 2022
ajdecon mentioned this pull request Apr 26, 2022