AMD GPU Operator
GPU Operator Documentation Site: https://instinct.docs.amd.com/projects/gpu-operator
Introduction
AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications.
Components
- AMD GPU Operator Controller
- K8s Device Plugin
- K8s Node Labeller
- Device Metrics Exporter
- Device Test Runner
- Node Feature Discovery Operator
- Kernel Module Management Operator
Features
- Streamlined GPU driver installation and management
- Comprehensive metrics collection and export
- Easy deployment of AMD GPU device plugin for Kubernetes
- Automated labeling of nodes with AMD GPU capabilities
- Compatibility with standard Kubernetes environments
- Efficient GPU resource allocation for containerized workloads
- GPU health monitoring and troubleshooting
Compatibility
- ROCm DKMS compatibility: refer to the official ROCm website for the ROCm driver compatibility matrix.
- Kubernetes: 1.29.0+
Prerequisites
- Kubernetes v1.29.0+
- Helm v3.2.0+
- `kubectl` CLI tool configured to access your cluster
- Cert Manager. If it is not already installed in the cluster, install it by running these commands:
helm repo add jetstack https://charts.jetstack.io --force-update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.15.1 \
--set crds.enabled=true
Quick Start
1. Add the AMD Helm Repository
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update
2. Install the Operator
Basic installation:
helm install amd-gpu-operator rocm/gpu-operator-charts \
--namespace kube-amd-gpu \
--create-namespace \
--version=v1.4.0
Installation Options
- Skip NFD installation: `--set node-feature-discovery.enabled=false`
- Skip KMM installation: `--set kmm.enabled=false`
It is strongly recommended to use AMD-optimized KMM images included in the operator release.
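
The two `--set` flags above can also be kept in a values file, which is easier to review and reuse than a long command line. The following is a minimal sketch; the file name `values-override.yaml` is an assumption, while the keys are the ones used by the flags above.

```yaml
# values-override.yaml (hypothetical file name)
node-feature-discovery:
  enabled: false   # skip NFD installation
kmm:
  enabled: false   # skip KMM installation
```

To use it, append `-f values-override.yaml` to the `helm install` command from the basic installation. Leave `kmm.enabled` at its default if you want the AMD-optimized KMM images shipped with the operator release.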
3. Install Custom Resource
After installing the AMD GPU Operator:
- By default, a default `DeviceConfig` is installed. If you are using the default `DeviceConfig`, you can modify it to suit your use case: `kubectl edit deviceconfigs -n kube-amd-gpu default`
- If you installed without the default `DeviceConfig` (either by using `--set crds.defaultCR.install=false` or by installing a chart prior to v1.3.0), you need to create a `DeviceConfig` custom resource to trigger the operator to start working. After preparing the `DeviceConfig` in a YAML file, create the resource by running `kubectl apply -f deviceconfigs.yaml`.
- For the custom resource definition and more detailed information, refer to the Custom Resource Installation Guide.
- Potential failures with the default `DeviceConfig`:
  - a. Operand pods are stuck in the `Init:0/1` state: your GPU worker node does not have an inbox GPU driver loaded. Check the Driver Installation Guide, then modify the default `DeviceConfig` so that the operator installs the out-of-tree GPU driver on your worker nodes: `kubectl edit deviceconfigs -n kube-amd-gpu default`
  - b. No operand pods show up: the default `DeviceConfig` selector `feature.node.kubernetes.io/amd-gpu: "true"` may not match any node.
    - Check the node labels: `kubectl get node -oyaml | grep -e "amd-gpu:" -e "amd-vgpu:"`
    - If you are using GPUs in a VM, you may need to change the default `DeviceConfig` selector to `feature.node.kubernetes.io/amd-vgpu: "true"`.
    - You can always customize the node selector of the `DeviceConfig`.
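
For orientation, a `DeviceConfig` manifest for the `kubectl apply -f deviceconfigs.yaml` step might look like the sketch below. It rests on two assumptions that should be checked against your installed CRD: that the resource is served as `amd.com/v1alpha1` (confirm with `kubectl api-resources | grep -i deviceconfig`), and that the `spec` fields mirror the `deviceConfig.spec.*` keys of the Helm chart. The resource name is hypothetical and the driver image is a placeholder.

```yaml
# deviceconfigs.yaml: a minimal sketch, not a complete reference.
# Assumption: the CRD is served as amd.com/v1alpha1.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: example-deviceconfig      # hypothetical name
  namespace: kube-amd-gpu
spec:
  # Nodes managed by this DeviceConfig. For GPUs used inside a VM, the
  # label may be feature.node.kubernetes.io/amd-vgpu instead.
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
  driver:
    # false uses the inbox driver; true asks the operator to install the
    # out-of-tree driver (the remedy suggested for pods stuck in Init:0/1).
    enable: true
    # Repository for the driver image, without a tag (placeholder value).
    image: docker.io/myUserName/driverImage
    version: "6.4"
  devicePlugin:
    enableNodeLabeller: true
  metricsExporter:
    enable: true
```

The same `selector` and `driver` fields are the ones to change when editing the default `DeviceConfig` with `kubectl edit` in the troubleshooting cases above.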
Grafana Dashboards
The following dashboards are provided for visualizing GPU metrics collected by the device-metrics-exporter:
- Overview Dashboard: Provides a comprehensive view of the GPU cluster.
- GPU Detail Dashboard: Offers a detailed look at individual GPUs.
- Job Detail Dashboard: Presents detailed GPU usage for specific jobs in SLURM and Kubernetes environments.
- Node Detail Dashboard: Displays detailed GPU usage at the host level.
Support
For bugs and feature requests, please file an issue on our GitHub Issues page.
License
The AMD GPU Operator is licensed under the Apache License 2.0.
gpu-operator-charts
AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters.
Homepage: https://github.com/ROCm/gpu-operator
Maintainers
| Name | Email | Url |
|---|---|---|
| Yan Sun | Yan.Sun3@amd.com | |
Source Code
Requirements
Kubernetes: >= 1.29.0-0
| Repository | Name | Version |
|---|---|---|
| file://./charts/kmm | kmm | v1.0.0 |
| https://kubernetes-sigs.github.io/node-feature-discovery/charts | node-feature-discovery | v0.16.1 |
Values
| Key | Type | Default | Description |
|---|---|---|---|
| controllerManager.affinity | object | {"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists"}]},"weight":1}]}} | Deployment affinity configs for controller manager |
| controllerManager.manager.image.repository | string | "docker.io/rocm/gpu-operator" | AMD GPU operator controller manager image repository |
| controllerManager.manager.image.tag | string | "v1.4.0" | AMD GPU operator controller manager image tag |
| controllerManager.manager.imagePullPolicy | string | "Always" | Image pull policy for AMD GPU operator controller manager pod |
| controllerManager.manager.imagePullSecrets | string | "" | Image pull secret name for pulling AMD GPU operator controller manager image if registry needs credential to pull image |
| controllerManager.nodeSelector | object | {} | Node selector for AMD GPU operator controller manager deployment |
| crds.defaultCR.install | bool | true | Deploy default DeviceConfig during helm chart installation |
| crds.defaultCR.upgrade | bool | false | Deploy / Patch default DeviceConfig during helm chart upgrade. Be careful about this option: 1. Your customized change on default DeviceConfig may be overwritten 2. Your existing DeviceConfig may conflict with upgraded default DeviceConfig |
| deviceConfig.spec.commonConfig.initContainerImage | string | "busybox:1.36" | init container image |
| deviceConfig.spec.commonConfig.utilsContainer.image | string | "docker.io/rocm/gpu-operator-utils:v1.4.0" | gpu operator utility container image |
| deviceConfig.spec.commonConfig.utilsContainer.imagePullPolicy | string | "IfNotPresent" | utility container image pull policy |
| deviceConfig.spec.commonConfig.utilsContainer.imageRegistrySecret | object | {} | utility container image pull secret, e.g. {"name": "mySecretName"} |
| deviceConfig.spec.configManager.config | object | {} | config map for config manager, e.g. {"name": "myConfigMap"} |
| deviceConfig.spec.configManager.configManagerTolerations | list | [] | config manager tolerations |
| deviceConfig.spec.configManager.enable | bool | false | enable/disable the config manager |
| deviceConfig.spec.configManager.image | string | "docker.io/rocm/device-config-manager:v1.4.0" | config manager image |
| deviceConfig.spec.configManager.imagePullPolicy | string | "IfNotPresent" | image pull policy for config manager image |
| deviceConfig.spec.configManager.imageRegistrySecret | object | {} | image pull secret for config manager image, e.g. {"name": "myPullSecret"} |
| deviceConfig.spec.configManager.selector | object | {} | node selector for config manager, if not specified it will reuse spec.selector |
| deviceConfig.spec.configManager.upgradePolicy.maxUnavailable | int | 1 | the maximum number of Pods that can be unavailable during the update process |
| deviceConfig.spec.configManager.upgradePolicy.upgradeStrategy | string | "RollingUpdate" | the type of daemonset upgrade, RollingUpdate or OnDelete |
| deviceConfig.spec.devicePlugin.devicePluginArguments | object | {} | pass supported flags and their values while starting device plugin daemonset, e.g. {"resource_naming_strategy": "single"} or {"resource_naming_strategy": "mixed"} |
| deviceConfig.spec.devicePlugin.devicePluginImage | string | "rocm/k8s-device-plugin:latest" | device plugin image |
| deviceConfig.spec.devicePlugin.devicePluginImagePullPolicy | string | "IfNotPresent" | device plugin image pull policy |
| deviceConfig.spec.devicePlugin.devicePluginTolerations | list | [] | device plugin tolerations |
| deviceConfig.spec.devicePlugin.enableNodeLabeller | bool | true | enable / disable node labeller |
| deviceConfig.spec.devicePlugin.imageRegistrySecret | object | {} | image pull secret for device plugin and node labeller, e.g. {"name": "mySecretName"} |
| deviceConfig.spec.devicePlugin.nodeLabellerArguments | list | [] | pass supported labels while starting node labeller daemonset, default ["vram", "cu-count", "simd-count", "device-id", "family", "product-name", "driver-version"], also support ["compute-memory-partition", "compute-partitioning-supported", "memory-partitioning-supported"] |
| deviceConfig.spec.devicePlugin.nodeLabellerImage | string | "rocm/k8s-device-plugin:labeller-latest" | node labeller image |
| deviceConfig.spec.devicePlugin.nodeLabellerImagePullPolicy | string | "IfNotPresent" | node labeller image pull policy |
| deviceConfig.spec.devicePlugin.nodeLabellerTolerations | list | [] | node labeller tolerations |
| deviceConfig.spec.devicePlugin.upgradePolicy.maxUnavailable | int | 1 | the maximum number of Pods that can be unavailable during the update process |
| deviceConfig.spec.devicePlugin.upgradePolicy.upgradeStrategy | string | "RollingUpdate" | the type of daemonset upgrade, RollingUpdate or OnDelete |
| deviceConfig.spec.driver.blacklist | bool | false | enable/disable putting a blacklist amdgpu entry in modprobe config, which requires node labeller to run |
| deviceConfig.spec.driver.enable | bool | false | enable/disable out-of-tree driver management, set to false to use inbox driver |
| deviceConfig.spec.driver.image | string | "docker.io/myUserName/driverImage" | image repository to store out-of-tree driver image, DO NOT put an image tag since the operator automatically manages it for users |
| deviceConfig.spec.driver.imageBuild | object | {} | configure the out-of-tree driver image build within the cluster, e.g. {"baseImageRegistry":"docker.io","baseImageRegistryTLS":{"insecure":"false","insecureSkipTLSVerify":"false"}} |
| deviceConfig.spec.driver.imageRegistrySecret | object | {} | image pull secret for pull/push access of the driver image repository, input secret name like {"name": "mysecret"} |
| deviceConfig.spec.driver.imageRegistryTLS.insecure | bool | false | set to true to use plain HTTP for driver image repository |
| deviceConfig.spec.driver.imageRegistryTLS.insecureSkipTLSVerify | bool | false | set to true to skip TLS validation for driver image repository |
| deviceConfig.spec.driver.imageSign | object | {} | specify the secrets to sign the out-of-tree kernel module inside driver image for secure boot, e.g. input private / public key secret {"keySecret":{"name":"privateKeySecret"},"certSecret":{"name":"publicKeySecret"}} |
| deviceConfig.spec.driver.tolerations | list | [] | configure driver tolerations so that operator can manage out-of-tree drivers on tainted nodes |
| deviceConfig.spec.driver.upgradePolicy.enable | bool | true | enable/disable automatic driver upgrade feature |
| deviceConfig.spec.driver.upgradePolicy.maxParallelUpgrades | int | 3 | how many nodes can be upgraded in parallel |
| deviceConfig.spec.driver.upgradePolicy.maxUnavailableNodes | string | "25%" | maximum number of nodes that can be in a failed upgrade state, beyond which upgrades stop to keep the cluster in a minimal healthy state |
| deviceConfig.spec.driver.upgradePolicy.nodeDrainPolicy.force | bool | true | whether force draining is allowed or not |
| deviceConfig.spec.driver.upgradePolicy.nodeDrainPolicy.gracePeriodSeconds | int | -1 | the time kubernetes waits for a pod to shut down gracefully after receiving a termination signal, zero means immediate, a negative value means follow the pod-defined grace period |
| deviceConfig.spec.driver.upgradePolicy.nodeDrainPolicy.timeoutSeconds | int | 300 | the length of time in seconds to wait before giving up drain, zero means infinite |
| deviceConfig.spec.driver.upgradePolicy.podDeletionPolicy.force | bool | true | whether force deletion is allowed or not |
| deviceConfig.spec.driver.upgradePolicy.podDeletionPolicy.gracePeriodSeconds | int | -1 | the time kubernetes waits for a pod to shut down gracefully after receiving a termination signal, zero means immediate, a negative value means follow the pod-defined grace period |
| deviceConfig.spec.driver.upgradePolicy.podDeletionPolicy.timeoutSeconds | int | 300 | the length of time in seconds to wait before giving up on pod deletion, zero means infinite |
| deviceConfig.spec.driver.upgradePolicy.rebootRequired | bool | true | whether to reboot each worker node during the driver upgrade |
| deviceConfig.spec.driver.version | string | "6.4" | specify an out-of-tree driver version to install |
| deviceConfig.spec.metricsExporter.config | object | {} | name of the metrics exporter config map, e.g. {"name": "metricConfigMapName"} |
| deviceConfig.spec.metricsExporter.enable | bool | true | enable / disable device metrics exporter |
| deviceConfig.spec.metricsExporter.image | string | "docker.io/rocm/device-metrics-exporter:v1.4.0" | metrics exporter image |
| deviceConfig.spec.metricsExporter.imagePullPolicy | string | "IfNotPresent" | metrics exporter image pull policy |
| deviceConfig.spec.metricsExporter.imageRegistrySecret | object | {} | metrics exporter image pull secret, e.g. {"name": "pullSecretName"} |
| deviceConfig.spec.metricsExporter.nodePort | int | 32500 | external port for pulling metrics from outside the cluster for NodePort service, in the range 30000-32767 (assigned automatically by default) |
| deviceConfig.spec.metricsExporter.port | int | 5000 | internal port used for in-cluster and node access to pull metrics from the metrics-exporter (default 5000). |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.attachMetadata | object | {} | define if Prometheus should attach node metadata to the target, e.g. {"node": "true"} |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.authorization | object | {} | optional Prometheus authorization configuration for accessing the endpoint |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.enable | bool | false | enable or disable ServiceMonitor creation |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.honorLabels | bool | true | choose the metric's labels on collisions with target labels |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.honorTimestamps | bool | false | control whether the scrape endpoints honor timestamps |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.interval | string | "30s" | frequency to scrape metrics. Accepts values with time unit suffix: "30s", "1m", "2h", "500ms" |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.labels | object | {} | additional labels to add to the ServiceMonitor |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.metricRelabelings | list | [] | relabeling rules applied to individual scraped metrics |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.relabelings | list | [] | relabelConfigs to apply to samples before ingestion |
| deviceConfig.spec.metricsExporter.prometheus.serviceMonitor.tlsConfig | object | {} | TLS settings used by Prometheus to connect to the metrics endpoint |
| deviceConfig.spec.metricsExporter.rbacConfig.clientCAConfigMap | object | {} | reference to a configmap containing the client CA (key: ca.crt) for mTLS client validation, e.g. {"name": "configMapName"} |
| deviceConfig.spec.metricsExporter.rbacConfig.disableHttps | bool | false | disable https protecting the proxy endpoint |
| deviceConfig.spec.metricsExporter.rbacConfig.enable | bool | false | enable/disable kube rbac proxy |
| deviceConfig.spec.metricsExporter.rbacConfig.image | string | "quay.io/brancz/kube-rbac-proxy:v0.18.1" | kube rbac proxy side car container image |
| deviceConfig.spec.metricsExporter.rbacConfig.secret | object | {} | certificate secret to mount in kube-rbac container for TLS, self signed certificates will be generated by default, e.g. {"name": "secretName"} |
| deviceConfig.spec.metricsExporter.rbacConfig.staticAuthorization.clientName | string | "" | expected CN (Common Name) from client cert (e.g., Prometheus SA identity) |
| deviceConfig.spec.metricsExporter.rbacConfig.staticAuthorization.enable | bool | false | enables static authorization using client certificate CN |
| deviceConfig.spec.metricsExporter.selector | object | {} | metrics exporter node selector, if not specified it will reuse spec.selector |
| deviceConfig.spec.metricsExporter.serviceType | string | "ClusterIP" | type of service for exposing metrics endpoint, ClusterIP or NodePort |
| deviceConfig.spec.metricsExporter.tolerations | list | [] | metrics exporter tolerations |
| deviceConfig.spec.metricsExporter.upgradePolicy.maxUnavailable | int | 1 | the maximum number of Pods that can be unavailable during the update process |
| deviceConfig.spec.metricsExporter.upgradePolicy.upgradeStrategy | string | "RollingUpdate" | the type of daemonset upgrade, RollingUpdate or OnDelete |
| deviceConfig.spec.selector | object | {"feature.node.kubernetes.io/amd-gpu":"true"} | Set node selector for the default DeviceConfig |
| deviceConfig.spec.testRunner.config | object | {} | test runner config map, e.g. {"name": "myConfigMap"} |
| deviceConfig.spec.testRunner.enable | bool | false | enable / disable test runner |
| deviceConfig.spec.testRunner.image | string | "docker.io/rocm/test-runner:v1.4.0" | test runner image |
| deviceConfig.spec.testRunner.imagePullPolicy | string | "IfNotPresent" | test runner image pull policy |
| deviceConfig.spec.testRunner.imageRegistrySecret | object | {} | test runner image pull secret |
| deviceConfig.spec.testRunner.logsLocation.hostPath | string | "/var/log/amd-test-runner" | host directory to save test run logs |
| deviceConfig.spec.testRunner.logsLocation.logsExportSecrets | list | [] | a list of secrets that contain connectivity info to multiple cloud providers |
| deviceConfig.spec.testRunner.logsLocation.mountPath | string | "/var/log/amd-test-runner" | test runner internal mounted directory to save test run logs |
| deviceConfig.spec.testRunner.selector | object | {} | test runner node selector, if not specified it will reuse spec.selector |
| deviceConfig.spec.testRunner.tolerations | list | [] | test runner tolerations |
| deviceConfig.spec.testRunner.upgradePolicy.maxUnavailable | int | 1 | the maximum number of Pods that can be unavailable during the update process |
| deviceConfig.spec.testRunner.upgradePolicy.upgradeStrategy | string | "RollingUpdate" | the type of daemonset upgrade, RollingUpdate or OnDelete |
| installdefaultNFDRule | bool | true | Default NFD rule will detect amd gpu based on pci vendor ID |
| kmm.enabled | bool | true | Set to true/false to enable/disable the installation of kernel module management (KMM) operator |
| node-feature-discovery.enabled | bool | true | Set to true/false to enable/disable the installation of node feature discovery (NFD) operator |
| node-feature-discovery.worker.nodeSelector | object | {} | Set nodeSelector for NFD worker daemonset |
| node-feature-discovery.worker.tolerations | list | [{"effect":"NoExecute","key":"amd-dcm","operator":"Equal","value":"up"}] | Set tolerations for NFD worker daemonset |
| upgradeCRD | bool | true | CRD will be patched as pre-upgrade/pre-rollback hook when doing helm upgrade/rollback to current helm chart |
| kmm.controller.affinity | object | {"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists"}]},"weight":1}]}} | Affinity for the KMM controller manager deployment |
| kmm.controller.manager.args[0] | string | "--config=controller_config.yaml" | |
| kmm.controller.manager.containerSecurityContext.allowPrivilegeEscalation | bool | false | |
| kmm.controller.manager.env.relatedImageBuild | string | "gcr.io/kaniko-project/executor:v1.23.2" | KMM kaniko builder image for building driver image within cluster |
| kmm.controller.manager.env.relatedImageBuildPullSecret | string | "" | Image pull secret name for pulling KMM kaniko builder image if registry needs credential to pull image |
| kmm.controller.manager.env.relatedImageSign | string | "docker.io/rocm/kernel-module-management-signimage:v1.4.0" | KMM signer image for signing driver image's kernel module with given key pairs within cluster |
| kmm.controller.manager.env.relatedImageSignPullSecret | string | "" | Image pull secret name for pulling KMM signer image if registry needs credential to pull image |
| kmm.controller.manager.env.relatedImageWorker | string | "docker.io/rocm/kernel-module-management-worker:v1.4.0" | KMM worker image for loading / unloading driver kernel module on worker nodes |
| kmm.controller.manager.env.relatedImageWorkerPullSecret | string | "" | Image pull secret name for pulling KMM worker image if registry needs credential to pull image |
| kmm.controller.manager.image.repository | string | "docker.io/rocm/kernel-module-management-operator" | KMM controller manager image repository |
| kmm.controller.manager.image.tag | string | "v1.4.0" | KMM controller manager image tag |
| kmm.controller.manager.imagePullPolicy | string | "Always" | Image pull policy for KMM controller manager pod |
| kmm.controller.manager.imagePullSecrets | string | "" | Image pull secret name for pulling KMM controller manager image if registry needs credential to pull image |
| kmm.controller.manager.resources.limits.cpu | string | "500m" | |
| kmm.controller.manager.resources.limits.memory | string | "384Mi" | |
| kmm.controller.manager.resources.requests.cpu | string | "10m" | |
| kmm.controller.manager.resources.requests.memory | string | "64Mi" | |
| kmm.controller.manager.tolerations[0].effect | string | "NoSchedule" | |
| kmm.controller.manager.tolerations[0].key | string | "node-role.kubernetes.io/master" | |
| kmm.controller.manager.tolerations[0].operator | string | "Equal" | |
| kmm.controller.manager.tolerations[0].value | string | "" | |
| kmm.controller.manager.tolerations[1].effect | string | "NoSchedule" | |
| kmm.controller.manager.tolerations[1].key | string | "node-role.kubernetes.io/control-plane" | |
| kmm.controller.manager.tolerations[1].operator | string | "Equal" | |
| kmm.controller.manager.tolerations[1].value | string | "" | |
| kmm.controller.nodeSelector | object | {} | Node selector for the KMM controller manager deployment |
| kmm.controller.replicas | int | 1 | |
| kmm.controller.serviceAccount.annotations | object | {} | |
| kmm.controllerMetricsService.ports[0].name | string | "https" | |
| kmm.controllerMetricsService.ports[0].port | int | 8443 | |
| kmm.controllerMetricsService.ports[0].protocol | string | "TCP" | |
| kmm.controllerMetricsService.ports[0].targetPort | string | "https" | |
| kmm.controllerMetricsService.type | string | "ClusterIP" | |
| kmm.kubernetesClusterDomain | string | "cluster.local" | |
| kmm.managerConfig.controllerConfigYaml | string | "healthProbeBindAddress: :8081\nwebhookPort: 9443\nleaderElection:\n enabled: true\n resourceID: kmm.sigs.x-k8s.io\nmetrics:\n enableAuthnAuthz: true\n bindAddress: 0.0.0.0:8443\n secureServing: true\nworker:\n runAsUser: 0\n seLinuxType: spc_t\n firmwareHostPath: /var/lib/firmware" | |
| kmm.webhookServer.affinity | object | {"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists"}]},"weight":1}]}} | KMM webhook's deployment affinity configs |
| kmm.webhookServer.nodeSelector | object | {} | KMM webhook's deployment node selector |
| kmm.webhookServer.replicas | int | 1 | |
| kmm.webhookServer.webhookServer.args[0] | string | "--config=controller_config.yaml" | |
| kmm.webhookServer.webhookServer.args[1] | string | "--enable-module" | |
| kmm.webhookServer.webhookServer.args[2] | string | "--enable-namespace" | |
| kmm.webhookServer.webhookServer.args[3] | string | "--enable-preflightvalidation" | |
| kmm.webhookServer.webhookServer.containerSecurityContext.allowPrivilegeEscalation | bool | false | |
| kmm.webhookServer.webhookServer.image.repository | string | "docker.io/rocm/kernel-module-management-webhook-server" | KMM webhook image repository |
| kmm.webhookServer.webhookServer.image.tag | string | "v1.4.0" | KMM webhook image tag |
| kmm.webhookServer.webhookServer.imagePullPolicy | string | "Always" | Image pull policy for KMM webhook pod |
| kmm.webhookServer.webhookServer.imagePullSecrets | string | "" | Image pull secret name for pulling KMM webhook image if registry needs credential to pull image |
| kmm.webhookServer.webhookServer.resources.limits.cpu | string | "500m" | |
| kmm.webhookServer.webhookServer.resources.limits.memory | string | "384Mi" | |
| kmm.webhookServer.webhookServer.resources.requests.cpu | string | "10m" | |
| kmm.webhookServer.webhookServer.resources.requests.memory | string | "64Mi" | |
| kmm.webhookServer.webhookServer.tolerations[0].effect | string | "NoSchedule" | |
| kmm.webhookServer.webhookServer.tolerations[0].key | string | "node-role.kubernetes.io/master" | |
| kmm.webhookServer.webhookServer.tolerations[0].operator | string | "Equal" | |
| kmm.webhookServer.webhookServer.tolerations[0].value | string | "" | |
| kmm.webhookServer.webhookServer.tolerations[1].effect | string | "NoSchedule" | |
| kmm.webhookServer.webhookServer.tolerations[1].key | string | "node-role.kubernetes.io/control-plane" | |
| kmm.webhookServer.webhookServer.tolerations[1].operator | string | "Equal" | |
| kmm.webhookServer.webhookServer.tolerations[1].value | string | "" | |
| kmm.webhookService.ports[0].port | int | 443 | |
| kmm.webhookService.ports[0].protocol | string | "TCP" | |
| kmm.webhookService.ports[0].targetPort | int | 9443 | |
| kmm.webhookService.type | string | "ClusterIP" | |
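
The dotted keys in the Values table map onto nested YAML in a values file. As an illustration, the sketch below exposes the metrics exporter through a NodePort service and asks the chart to create a ServiceMonitor. Every key and default comes from the table above; the file name and the `release: prometheus` label are assumptions chosen for the example.

```yaml
# metrics-values.yaml (hypothetical file name)
deviceConfig:
  spec:
    metricsExporter:
      enable: true
      serviceType: NodePort     # ClusterIP or NodePort
      nodePort: 32500           # must fall in the range 30000-32767
      port: 5000                # internal port of the metrics exporter
      prometheus:
        serviceMonitor:
          enable: true
          interval: "30s"
          labels:
            release: prometheus # hypothetical label matched by your Prometheus
```

Two caveats apply. The `deviceConfig.spec.*` values configure the default `DeviceConfig`, so they take effect at install time only while `crds.defaultCR.install` is true, and on `helm upgrade` only if `crds.defaultCR.upgrade` is enabled. A ServiceMonitor is a Prometheus Operator resource, so the Prometheus Operator CRDs need to be present in the cluster before enabling it.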