SR-IOV configuration fails

Symptom
The nps show --data vim command displays SRIOV_CONFIGURE_FAILED as the status.
[root@npsvm logs]# nps show --data vim
+-------------------+-----------------------------------------------+
|  Key              | Value                                         |
+-------------------+-----------------------------------------------+
| cluster_info      | /nps/v2/computes/vim/rhocp/cluster_info/      |
| compute_plane     | /nps/v2/computes/vim/rhocp/compute_plane/     |
| control_plane     | /nps/v2/computes/vim/rhocp/control_plane/     |
| persistent_volume | /nps/v2/computes/vim/rhocp/persistent_volume/ |
| sriov             | /nps/v2/computes/vim/rhocp/sriov/             |
| state             | SRIOV_CONFIGURE_FAILED                        |
| storage_backend   | /nps/v2/computes/vim/rhocp/storage_backend/   |
+-------------------+-----------------------------------------------+
To troubleshoot SR-IOV configuration failures, analyze the log files to identify the root cause of the issue. The installer log files are created in the following locations:
  • /var/nps/logs/rhocp/rhocp-cli-<date>.log
  • /var/nps/logs/rhocp/nps-rhocp-cli-ansible.log
The SR-IOV configuration files are created in the following locations:
  • Configuration files are generated in the /var/nps/vim/sriov-data/ directory.

  • Profile configurations are generated in the /var/nps/vim/sriov-data/profiles/ directory.
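To locate the failure quickly, you can filter the installer logs for error lines. A minimal sketch follows; the log excerpt is simulated, so on the NPS toolkit VM point grep at the real files under /var/nps/logs/rhocp/ instead:

```shell
# Simulated excerpt standing in for /var/nps/logs/rhocp/rhocp-cli-<date>.log;
# the message text is illustrative, not actual installer output.
cat > /tmp/rhocp-cli-sample.log <<'EOF'
2020-05-18 12:01:03 INFO  applying sriov profile 4portsriov
2020-05-18 12:01:09 ERROR SriovNetworkNodePolicy rejected: numVfs 16 exceeds limit
EOF

# Keep only the failure lines, as you would on the real log
errors=$(grep -iE 'error|failed' /tmp/rhocp-cli-sample.log)
echo "$errors"
```

The same grep works directly on the real files, for example `grep -iE 'error|failed' /var/nps/logs/rhocp/nps-rhocp-cli-ansible.log`.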

Solution 1
Cause
The nps show --data vim command displays SRIOV_CONFIGURE_FAILED as the status due to one of the following causes:
  • Unsupported VF count (maximum supported for Mellanox is 8).

  • Incorrect interface name entries in the input JSON file.

Action
  1. Navigate to the /var/nps/vim/sriov-data/profiles directory and check the configuration file <profile_name>.yaml for the SR-IOV profile configuration details.
    vim /var/nps/vim/sriov-data/profiles/<profile_name>.yaml

    Ensure that the interface names and VF counts are correct. For example, the interface names for DL325 with Mellanox NICs must be ens1f0, ens1f1, ens3f0, and ens3f1. Note that Mellanox NICs support a maximum of 8 VFs.

  2. If there is any mismatch in the interface names or the VF counts, download, modify, and reupload the JSON file.
    1. Download the working configuration file as a JSON and modify your inputs in the downloaded JSON file. See Rectifying an incorrect user input in the JSON file.
    2. Reupload the JSON file. See Uploading Input JSON file.
  3. Rerun the nps autodeploy command.
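The VF-count check in step 1 can be scripted. A minimal sketch, assuming a simplified profile excerpt; the field names (`interfaces`, `vf_count`) are illustrative only, so match them to the keys in your generated profile YAML:

```shell
# Hypothetical profile excerpt -- not the documented schema
cat > /tmp/profile-sample.yaml <<'EOF'
interfaces: ens1f0,ens1f1,ens3f0,ens3f1
vf_count: 16
EOF

# Flag a VF count above the Mellanox limit of 8
vf=$(awk -F': ' '/^vf_count/ {print $2}' /tmp/profile-sample.yaml)
if [ "$vf" -gt 8 ]; then
  echo "vf_count $vf exceeds the Mellanox limit of 8"
fi
```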
Solution 2
Action
  1. Log in to the NPS toolkit VM and run the following command before running any oc command:
    export KUBECONFIG=/var/nps/ISO/ign_config/auth/kubeconfig
  2. Verify that the namespace is created for the SR-IOV network operator.
    1. Run the following command to get the namespace status:
      oc get namespace openshift-sriov-network-operator
      A sample output is as follows:
      NAME                                STATUS    AGE
      openshift-sriov-network-operator    Active    4d9h
      Verify that the namespace status is Active.
    2. If the namespace creation failed, verify the health of the cluster operators.
    3. Run the following command to verify the cluster operators:
      oc get co

      Ensure all the cluster operators are AVAILABLE.
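The AVAILABLE check can be automated by filtering the third column of the listing. A minimal sketch, using a simulated `oc get co` listing (on the cluster, pipe the real command output instead of the heredoc):

```shell
# Simulated `oc get co` output; operator names and versions are illustrative
cat > /tmp/co.txt <<'EOF'
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication   4.3.0     True        False         False
marketplace      4.3.0     False       True          False
EOF

# Print the names of cluster operators that are not AVAILABLE (column 3)
not_avail=$(awk 'NR>1 && $3 != "True" {print $1}' /tmp/co.txt)
echo "$not_avail"
```

On a live cluster the equivalent is `oc get co --no-headers | awk '$3 != "True" {print $1}'`.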

  3. Verify that the operator group is created in the namespace.
    1. Run the following command to get the operator group status:
      oc project openshift-sriov-network-operator 
      oc get OperatorGroup 
      A sample output is as follows:
      NAME                                AGE
      sriov-network-operators             4d11h
      Verify that the sriov-network-operators operator group is listed.
    2. If the operator group creation failed, then verify the namespace status.
  4. Verify that the Subscription CR is created in the namespace.
    1. Run the following command to get the Subscription CR status:
      oc project openshift-sriov-network-operator
      oc get Subscription
      For online mode of installation, an example output is as follows:
      NAME                                 PACKAGE                 SOURCE            CHANNEL
      sriov-network-operator-subscription  sriov-network-operator  redhat-operators  4.3
      For offline mode of installation, an example output is as follows:
      NAME                                 PACKAGE                 SOURCE                   CHANNEL
      sriov-network-operator-subscription  sriov-network-operator  redhat-operator-catalog  4.3
    2. Ensure that the SOURCE field is not none. If the SOURCE field is none, verify the catalog sources.
      Run the following command to list the catalog sources:
      oc get catalogsource -n openshift-marketplace
      For online mode of installation, an example catalog source listing is as follows:
      NAME                   DISPLAY                TYPE    PUBLISHER    AGE
      certified-operators    Certified Operators    grpc    Red Hat      13d
      community-operators    Community Operators    grpc    Red Hat      13d
      redhat-operators       Red Hat Operators      grpc    Red Hat      13d
      If catalog source is not available, verify the cluster operator status.
      For offline mode of installation, an example catalog source listing is as follows:
      NAME                      DISPLAY                TYPE    PUBLISHER    AGE
      redhat-operator-catalog   RH-OPERATOR-CATALOG    grpc    grpc         6d3h
      If the catalog source is not available, ensure that OLM (Operator Lifecycle Manager) is configured properly.
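The SOURCE check can likewise be scripted against the Subscription listing. A minimal sketch, using a simulated listing (the literal string none follows the wording in this document; verify how your cluster renders an empty SOURCE):

```shell
# Simulated `oc get Subscription` output with a missing catalog source
cat > /tmp/subs.txt <<'EOF'
NAME                                  PACKAGE                  SOURCE   CHANNEL
sriov-network-operator-subscription   sriov-network-operator   none     4.3
EOF

# Print subscriptions whose SOURCE (column 3) is none
bad=$(awk 'NR>1 && $3 == "none" {print $1}' /tmp/subs.txt)
echo "$bad"
```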
    3. Verify that the sriov-network-config-daemon pod is running on all the worker nodes. Run the following command to list the pods:
      oc get pods -o wide
      A sample output is as follows:
      NAME                                    READY STATUS  RESTARTS AGE    IP           NODE
      network-resources-injector-6clfd        1/1   Running 0        4d23h  XX.XX.XX.XX  master-1.ocp4.lab.com
      network-resources-injector-bpl9w        1/1   Running 0        4d23h  XX.XX.XX.XX  master-0.ocp4.lab.com
      network-resources-injector-tddph        1/1   Running 0        4d23h  XX.XX.XX.XX  master-2.ocp4.lab.com
      operator-webhook-2wmdl                  1/1   Running 0        4d23h  XX.XX.XX.XX  master-1.ocp4.lab.com
      operator-webhook-6xv9x                  1/1   Running 0        4d23h  XX.XX.XX.XX  master-0.ocp4.lab.com
      operator-webhook-7z2z8                  1/1   Running 0        4d23h  XX.XX.XX.XX  master-2.ocp4.lab.com
      sriov-network-config-daemon-7sdnj       1/1   Running 1        4d23h  XX.XX.XX.XX  worker-0.ocp4.lab.com
      sriov-network-config-daemon-h5hqx       1/1   Running 0        4d23h  XX.XX.XX.XX  worker-2.ocp4.lab.com
      sriov-network-config-daemon-jq9cx       1/1   Running 0        4d23h  XX.XX.XX.XX  worker-1.ocp4.lab.com
      sriov-network-operator-594b797ffb-6vwkg 1/1   Running 0        4d23h  XX.XX.XX.XX  master-1.ocp4.lab.com
    4. Verify that the worker nodes are labeled based on the SR-IOV profile assigned to each node.
      Run the following command to get the node label:
      oc get nodes --show-labels
    5. Ensure that the node is labeled properly based on the label provided in node_label for that profile.
      "<profile_name>": { "node_label": "4-port-sriov"}
    6. If the worker node is not labeled, verify the server VIM state. If the server VIM state is not REGISTERED, that node is skipped because it is not ready for SR-IOV configuration.
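The label check can be reduced to a one-line filter over the node listing. A minimal sketch, using a simulated `oc get nodes --show-labels` listing with the example node_label 4-port-sriov from above (substitute your profile's label and the real command output):

```shell
# Simulated, trimmed `oc get nodes --show-labels` output
cat > /tmp/nodes.txt <<'EOF'
NAME                    STATUS   LABELS
worker-0.ocp4.lab.com   Ready    node-role.kubernetes.io/worker=,4-port-sriov=
worker-1.ocp4.lab.com   Ready    node-role.kubernetes.io/worker=
EOF

# Print workers whose LABELS column (column 3) lacks the profile label
unlabeled=$(awk 'NR>1 && index($3, "4-port-sriov") == 0 {print $1}' /tmp/nodes.txt)
echo "$unlabeled"
```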
  5. Verify that the SR-IOV policy is applied based on the node labels.
    1. Run the following command to list the available sriov network node policies:
      oc project openshift-sriov-network-operator
      oc get SriovNetworkNodePolicy
      A sample output is as follows:
      NAME              AGE
      4portsriov        58s
      default           4d23h

      The SR-IOV network node policy is created based on the profile name.

      Sample profile data: "4portsriov": {"<profile config values>"}

    2. Once SR-IOV network node policy is applied, default pods are created on that worker node.
    3. Run the following command to verify the pods:
      oc project openshift-sriov-network-operator
      oc get pods -o wide
      A sample output is as follows:
      NAME                                    READY STATUS   RESTARTS AGE   IP            NODE
      network-resources-injector-6clfd        1/1   Running  0        4d23h XX.XX.XX.XX   master-1.ocp4.lab.com
      network-resources-injector-bpl9w        1/1   Running  0        4d23h XX.XX.XX.XX   master-0.ocp4.lab.com
      network-resources-injector-tddph        1/1   Running  0        4d23h XX.XX.XX.XX   master-2.ocp4.lab.com
      operator-webhook-2wmdl                  1/1   Running  0        4d23h XX.XX.XX.XX   master-1.ocp4.lab.com
      operator-webhook-6xv9x                  1/1   Running  0        4d23h XX.XX.XX.XX   master-0.ocp4.lab.com
      operator-webhook-7z2z8                  1/1   Running  0        4d23h XX.XX.XX.XX   master-2.ocp4.lab.com
      sriov-cni-wwfb6                         1/1   Running  0        3m11s XX.XX.XX.XX   worker-0.ocp4.lab.com
      sriov-device-plugin-zqq41               1/1   Running  0        2m30s XX.XX.XX.XX   worker-0.ocp4.lab.com
      sriov-network-config-daemon-7sdnj       1/1   Running  1        4d23h XX.XX.XX.XX   worker-0.ocp4.lab.com
      sriov-network-config-daemon-h5hqx       1/1   Running  0        4d23h XX.XX.XX.XX   worker-2.ocp4.lab.com
      sriov-network-config-daemon-jq9cx       1/1   Running  0        4d23h XX.XX.XX.XX   worker-1.ocp4.lab.com
      sriov-network-operator-594b797ffb-6vwkg 1/1   Running  0        4d23h XX.XX.XX.XX   master-1.ocp4.lab.com

      Verify that the sriov-cni and sriov-device-plugin pods are running on that worker node. If these pods are not running, the SR-IOV policy configuration has failed.
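The presence check for those two pods can be scripted over the pod listing. A minimal sketch, using a simulated excerpt of the `oc get pods -o wide` output (on the cluster, substitute the real output):

```shell
# Simulated, trimmed pod listing for the worker node being checked
cat > /tmp/pods.txt <<'EOF'
NAME                        READY   STATUS    NODE
sriov-cni-wwfb6             1/1     Running   worker-0.ocp4.lab.com
sriov-device-plugin-zqq41   1/1     Running   worker-0.ocp4.lab.com
EOF

# For each expected pod prefix, confirm a Running instance exists
status=""
for p in sriov-cni sriov-device-plugin; do
  if awk -v pod="$p" '$1 ~ ("^" pod "-") && $3 == "Running" {found=1} END {exit !found}' /tmp/pods.txt; then
    status="$status $p=Running"
  else
    status="$status $p=MISSING"
  fi
done
echo "$status"
```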

    4. If the SR-IOV network policy configuration failed due to the VF count limit, patch the profile with the correct vf_count value.
    5. If the SR-IOV network policy configuration failed due to an interface name mismatch:
      • Verify that the interface names provided in the SR-IOV profile match the interfaces available in the worker nodes.
        • Run the following command from the bastion host to log in to the CoreOS worker node:

          ssh core@<worker_node_hostname>
          # ssh core@worker-0.ocp4.nfv.com
          Warning: the ECDSA host key for 'worker-0.ocp4.nfv.com' differs from the key 
          for the IP address 'XX.XX.XX.XX'
          Offending key for IP in /root/.ssh/known_hosts:1
          Matching host key in /root/.ssh/known_hosts:17
          Are you sure you want to continue connecting (yes/no)? yes
          Red Hat Enterprise Linux CoreOS 43.81.202004130853.0
            Part of OpenShift 4.3, RHCOS is a Kubernetes native operating system
            managed by the Machine Config operator ('clusteroperator/machine-config').
          
          WARNING: Direct SSH access to machines is not recommended; instead,
          make configuration changes via 'machineconfig' objects:
            https://docs.openshift.com/container-platform/4.3/architecture/architecture-rhcos.html
          
          ---
          Last login: Mon May 18 12:19:28 2020 from XX.XX.XX.XX
          [systemd]
          Failed Units: 1
            NetworkManager-wait-online.service
          [core@worker-0 ~] $
          [core@worker-0 ~] $
        • Run the following command to display the interfaces available in the worker node:

          ip link show
      • If there is a mismatch in the interfaces, patch the profile with the correct interface names.
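The comparison between the profile and the node can be done with a small loop. A minimal sketch; both interface lists below are simulated, so populate them from your profile and from `ip link show` on the worker:

```shell
# Interfaces declared in the SR-IOV profile (example values from this document)
profile_ifaces="ens1f0 ens1f1 ens3f0 ens3f1"
# Interfaces present on the node; on the worker you would derive this from
# the output of `ip link show` (here one interface is deliberately absent)
node_ifaces="lo ens1f0 ens1f1 ens3f0"

# Collect profile interfaces that the node does not have
missing=""
for i in $profile_ifaces; do
  case " $node_ifaces " in
    *" $i "*) ;;                       # present on the node
    *) missing="$missing $i" ;;        # declared in profile but absent
  esac
done
echo "missing:$missing"
```

Any name reported as missing indicates the profile must be patched before the policy can apply cleanly.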