Shutting down the worker nodes and master nodes

Procedure
  1. Create temporary authentication information in case the OAuth feature is not working.

    This account is used during the startup process once all the nodes are up. If you cannot connect with the default kubeadmin account, use the tmp-admin account to log in to the cluster.

    1. Create a temporary service account named tmp-admin.
      oc create sa tmp-admin -n default
    2. Bind the cluster-admin role to the tmp-admin service account.
      oc adm policy add-cluster-role-to-user cluster-admin -z tmp-admin -n default
    3. Get the temporary token for the tmp-admin service account.
      token=$(oc get secret -o jsonpath="{.data.token}" $(oc get sa tmp-admin -o yaml -n default  | grep " tmp-admin-token-" | awk '{ print $3 }') -n default | base64 -d)
    4. Verify that the token has been generated.
      echo $token
      
      Sample:
            eyJhbGciOi...
      
    5. Generate a kubeconfig file for the temporary admin (tmp-admin).
      env KUBECONFIG=$(pwd)/tmpadmin-kubeconfig oc login --token=$token https://api.<your domain>:6443/
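
      Optionally, verify that the temporary kubeconfig works before you continue. This is a minimal check; tmpadmin-kubeconfig is the file generated by the previous command, and the expected output is the service account identity.
      env KUBECONFIG=$(pwd)/tmpadmin-kubeconfig oc whoami

      Sample:
            system:serviceaccount:default:tmp-admin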
  2. To make the worker nodes unschedulable and evict the pods, perform the following substeps:
    1. Select a worker node in the cluster and mark it unschedulable using the following command:
      oc adm cordon <worker-node-n>
    2. Drain the node.
      oc adm drain <worker-node> --force --ignore-daemonsets --delete-local-data
      NOTE:
      When the last worker node is drained, ignore the following error messages. They are expected because the other worker nodes have already been drained. Press CTRL+C to exit the command and return to the prompt.
      oc adm drain worker-01.ocp4.qac.com --force --ignore-daemonsets --delete-local-data
      node/worker-01.ocp4.qac.com already cordoned
      WARNING: ignoring DaemonSet-managed Pods: kube-system/hpe-csi-node-ppkg4, openshift-cluster-node-tuning-operator/tuned-5ccg6, openshift-dns/dns-default-7cf2g, openshift-image-registry/node-ca-dtglx, openshift-machine-config-operator/machine-config-daemon-xhlbz, openshift-monitoring/node-exporter-gqfd8, openshift-multus/multus-ns8zb, openshift-sdn/ovs-s4jjm, openshift-sdn/sdn-bfddj; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: sanity/simple
      evicting pod "simple"
      evicting pod "alertmanager-main-1"
      evicting pod "alertmanager-main-0"
      evicting pod "csp-service-7bcb94744d-w7g8l"
      evicting pod "alertmanager-main-2"
      evicting pod "grafana-56df66bfd5-784lk"
      evicting pod "kube-state-metrics-77b94cbbff-nh8cz"
      evicting pod "openshift-state-metrics-76bdf54b5f-79q9m"
      evicting pod "prometheus-adapter-548f8c485d-8tvgz"
      evicting pod "hpe-csi-controller-8f4485ccb-zsqw9"
      evicting pod "prometheus-adapter-548f8c485d-h6hpk"
      evicting pod "benchmark-operator-fd9997496-x4zwp"
      evicting pod "prometheus-k8s-0"
      evicting pod "prometheus-k8s-1"
      evicting pod "telemeter-client-6f8fcdf6d7-trgnp"
      evicting pod "image-registry-6bb8dc7445-hm4pw"
      evicting pod "router-default-7f748fbd5f-n96g2"
      evicting pod "certified-operators-5db975968d-4bnlf"
      evicting pod "community-operators-86dc5b948c-k7cxq"
      evicting pod "redhat-operators-544fc4b9b5-cd8c2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      pod/grafana-56df66bfd5-784lk evicted
      pod/openshift-state-metrics-76bdf54b5f-79q9m evicted
      pod/alertmanager-main-2 evicted
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      pod/telemeter-client-6f8fcdf6d7-trgnp evicted
      pod/hpe-csi-controller-8f4485ccb-zsqw9 evicted
      pod/benchmark-operator-fd9997496-x4zwp evicted
      pod/prometheus-adapter-548f8c485d-8tvgz evicted
      pod/image-registry-6bb8dc7445-hm4pw evicted
      pod/prometheus-k8s-0 evicted
      pod/certified-operators-5db975968d-4bnlf evicted
      pod/simple evicted
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      pod/alertmanager-main-1 evicted
      pod/kube-state-metrics-77b94cbbff-nh8cz evicted
      pod/community-operators-86dc5b948c-k7cxq evicted
      pod/redhat-operators-544fc4b9b5-cd8c2 evicted
      pod/alertmanager-main-0 evicted
      pod/prometheus-adapter-548f8c485d-h6hpk evicted
      pod/prometheus-k8s-1 evicted
      evicting pod "router-default-7f748fbd5f-n96g2"
      error when evicting pod "router-default-7f748fbd5f-n96g2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    NOTE:

    Repeat steps 2a and 2b for all the worker nodes in the cluster (an optional scripted version follows this note). Ensure that you have shut down all the worker nodes in the cluster before proceeding to shut down the master nodes.
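
    Optionally, the loop below sketches steps 2a and 2b for every worker node in one pass. It reuses the worker node selector from step 3 and the drain options shown above; adjust both for your environment.
    workers=($(oc get nodes -l node-role.kubernetes.io/worker=,node-role.kubernetes.io/infra!= --no-headers -o custom-columns=NAME:.metadata.name))
    for worker in ${workers[@]}
    do
      echo "==== Cordon and drain $worker ===="
      # Step 2a: mark the node unschedulable.
      oc adm cordon $worker
      # Step 2b: evict the pods from the node.
      oc adm drain $worker --force --ignore-daemonsets --delete-local-data
    done

    As noted above, the drain of the last worker node can block on the router pod disruption budget; press CTRL+C when that happens and continue with step 3.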

  3. Shut down all worker nodes.
    workers=($(oc get nodes -l node-role.kubernetes.io/worker=,node-role.kubernetes.io/infra!= --no-headers -o custom-columns=CONTAINER:.metadata.name))
    for worker in ${workers[@]}
    do
      echo "==== Shutdown $worker ===="
      ssh core@$worker sudo shutdown -h now
    done
    
    NOTE:

    Ensure that all the worker nodes are shut down before proceeding to the master nodes. You can verify the worker node status from the iLO console, or use the SSH check shown below.
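
    As an alternative to the iLO console, the sketch below reports whether each worker node still answers over SSH; it assumes the workers array from the previous command is still set in your shell.
    for worker in ${workers[@]}
    do
      # A worker that has powered off does not accept the SSH connection within the timeout.
      ssh -o ConnectTimeout=5 core@$worker true 2>/dev/null && echo "$worker is still up" || echo "$worker is down"
    done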

  4. Stop static pods on each master node and shut down all master nodes.
    1. Stop static pods on each master node.
      masters=($(oc get nodes -l node-role.kubernetes.io/master --no-headers -o custom-columns=CONTAINER:.metadata.name))
      for master in ${masters[@]}
      do
        echo "==== Stop static pods on $master ===="
        # Move the static pod manifests aside so that kubelet stops the control plane pods.
        ssh core@$master 'sudo mkdir -p /etc/kubernetes/manifests.stop && sudo mv -v $(ls /etc/kubernetes/manifests/*) /etc/kubernetes/manifests.stop'
        # Wait until the etcd, kube-apiserver, kube-controller-manager, and scheduler containers have stopped.
        while :;
        do
          ssh core@$master sudo crictl ps | grep -v -e operator -e cert | grep -qw -e etcd-member -e kube-apiserver-[0-9]* -e kube-controller-manager-[0-9]* -e scheduler || break
          sleep 5
        done
      done
      
    2. Shut down all master nodes.
      for master in ${masters[@]}
      do
          echo "==== Shutdown $master ===="
          ssh core@$master sudo shutdown -h now
      done
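
      To confirm that the master nodes have powered off without opening the iLO console, the same kind of SSH check used for the worker nodes can be applied; it assumes the masters array from step 4a is still set in your shell.
      for master in ${masters[@]}
      do
        # A master that has powered off does not accept the SSH connection within the timeout.
        ssh -o ConnectTimeout=5 core@$master true 2>/dev/null && echo "$master is still up" || echo "$master is down"
      done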
  5. Shut down all the Nimble storage nodes.
    1. In the NimbleOS GUI, go to Administration > Shut down.
      NOTE:

      Make sure all hosts are disconnected from the array to avoid an unnecessary data service outage.

    2. Click Shut Down Array.
    3. Enter the administrator password and click Shut down.
  6. Shut down the registry VM (applicable only for the disconnected mode of deployment).
    • If the VM was created using virt-manager, run the following command (a verification check follows this step):
      virsh shutdown <registry_vm_name>

      OR

    • Log in to the registry VM and run the following command:
      sudo shutdown 
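
    If you shut down the VM with virsh, the following check confirms that it has powered off; <registry_vm_name> is the same placeholder used above.
      virsh domstate <registry_vm_name>

      Sample:
            shut off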
  7. Shut down all the infrastructure services, such as the DHCP server, DNS server, and load balancer.