Operating system: Ubuntu 22.04 (x86 architecture)

NPU: 910B

1. Driver Installation

1.1 Create the user and user group

groupadd HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
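
Re-running the command above fails if the group or user already exists. The following is a minimal sketch of an idempotent check, assuming `getent` is available (as it is on stock Ubuntu). The helper name `plan_account` is illustrative; it only prints the commands that are still needed and changes nothing itself.

```shell
#!/usr/bin/env bash
# Print the groupadd/useradd commands still needed for an account.
# plan_account is a hypothetical helper name, not part of any Ascend tool.
plan_account() {
    local name="$1"
    # getent exits non-zero when the entry is missing
    getent group "$name" >/dev/null 2>&1 || echo "groupadd $name"
    getent passwd "$name" >/dev/null 2>&1 || echo "useradd -g $name -d /home/$name -m $name -s /bin/bash"
}

plan_account HwHiAiUser
```

If nothing is printed, both the group and the user are already present.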

1.2 Download the driver and firmware

Official site: Ascend Community, Community Edition, Firmware and Drivers

wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK/Ascend%20HDK%2023.0.3/Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK/Ascend%20HDK%2023.0.3/Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run
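
Both URLs share the same layout: a base address, a directory named after the HDK release, and the package file name. As a rough sketch, the pattern can be captured in a small helper. Whether other HDK releases follow the same directory layout is an assumption, and `hdk_url` is an illustrative name.

```shell
#!/usr/bin/env bash
# Build a download URL that follows the pattern of the two wget commands above.
# Assumption: other HDK releases use the same directory layout.
HDK_BASE="https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend%20HDK"

hdk_url() {
    local hdk_version="$1" file="$2"
    # %20 is a URL-encoded space: the directory is "Ascend HDK <version>"
    printf '%s/Ascend%%20HDK%%20%s/%s\n' "$HDK_BASE" "$hdk_version" "$file"
}

hdk_url 23.0.3 Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
hdk_url 23.0.3 Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run
```

The output can be passed to `wget -c "$(hdk_url ...)"` so that an interrupted download resumes instead of restarting.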

1.3 Install the driver

chmod +x Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run
./Ascend-hdk-910b-npu-driver_23.0.3_linux-x86-64.run --full --install-for-all

1.4 Install the firmware

chmod +x Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
./Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run --full

1.5 Verify the installation

npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.3                   Version: 23.0.3                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2C              | OK            | 87.2        33                0    / 0             |
| 0                         | 0000:5A:00.0  | 0           0    / 0          3160 / 65536         |
+===========================+===============+====================================================+
| 1     910B2C              | OK            | 89.2        35                0    / 0             |
| 0                         | 0000:19:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 2     910B2C              | OK            | 87.8        35                0    / 0             |
| 0                         | 0000:49:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 3     910B2C              | OK            | 89.8        35                0    / 0             |
| 0                         | 0000:39:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 4     910B2C              | OK            | 87.9        34                0    / 0             |
| 0                         | 0000:DA:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 5     910B2C              | OK            | 99.5        35                0    / 0             |
| 0                         | 0000:99:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 6     910B2C              | OK            | 90.8        36                0    / 0             |
| 0                         | 0000:B8:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 7     910B2C              | OK            | 90.4        35                0    / 0             |
| 0                         | 0000:C8:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 8     910B2C              | OK            | 87.0        34                0    / 0             |
| 0                         | 0000:59:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 9     910B2C              | OK            | 89.4        34                0    / 0             |
| 0                         | 0000:18:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 10    910B2C              | OK            | 91.0        33                0    / 0             |
| 0                         | 0000:48:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 11    910B2C              | OK            | 93.6        36                0    / 0             |
| 0                         | 0000:38:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 12    910B2C              | OK            | 91.5        36                0    / 0             |
| 0                         | 0000:D9:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 13    910B2C              | OK            | 99.7        36                0    / 0             |
| 0                         | 0000:98:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 14    910B2C              | OK            | 88.6        33                0    / 0             |
| 0                         | 0000:B9:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
| 15    910B2C              | OK            | 96.5        36                0    / 0             |
| 0                         | 0000:C9:00.0  | 0           0    / 0          3159 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 8                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 9                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 10                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 11                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 12                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 13                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 14                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 15                                                           |
+===========================+===============+====================================================+
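
With 16 devices the table is long to check by eye. A small sketch that counts the NPUs whose Health column reads OK, assuming the table layout shown above (other npu-smi versions may format it differently); `count_healthy_npus` is an illustrative name.

```shell
#!/usr/bin/env bash
# Count NPUs whose Health column reads "OK" in `npu-smi info` output on stdin.
# Assumes the table layout shown above; other npu-smi versions may differ.
count_healthy_npus() {
    # Split rows on "|": device rows have the health status in the third field
    awk -F'|' 'NF >= 4 && $3 ~ /^ *OK *$/ { n++ } END { print n + 0 }'
}

# On a host with the driver installed:
#   npu-smi info | count_healthy_npus
```

For the output above, every one of NPU 0 through 15 reports OK, so the count would be 16.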

2. Connecting Ascend NPUs to Kubernetes 1.25

2.1 Create the NPU device plugin

Source: mind-cluster (the mind-cluster component code repository on Gitee.com)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ascend-device-plugin-sa-910
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pods-node-ascend-device-plugin-role-910
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "watch", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
  - apiGroups: [ "" ]
    resources: [ "nodes/proxy" ]
    verbs: [ "get" ]
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "update", "list", "watch"]
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "create" ]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pods-node-ascend-device-plugin-rolebinding-910
subjects:
  - kind: ServiceAccount
    name: ascend-device-plugin-sa-910
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: pods-node-ascend-device-plugin-role-910
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ascend-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ascend-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: ascend-device-plugin-ds
    spec:
      # Kubernetes 1.19 and later (including the 1.25 cluster used here) configure seccomp through securityContext.seccompProfile rather than the deprecated pod annotation.
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - key: huawei.com/Ascend910
          operator: Exists
          effect: NoSchedule
        - key: "device-plugin"
          operator: "Equal"
          value: "v2"
          effect: NoSchedule
      priorityClassName: "system-node-critical"
      nodeSelector:
        accelerator: huawei-Ascend910
      serviceAccountName: ascend-device-plugin-sa-910
      containers:
      - image: ascend-k8sdeviceplugin:v3.0.0
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin  -useAscendDocker=true
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        imagePullPolicy: Never
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: pod-resource
            mountPath: /var/lib/kubelet/pod-resources
          - name: hiai-driver
            mountPath: /usr/local/Ascend/driver
            readOnly: true
          - name: log-path
            mountPath: /var/log/mindx-dl/devicePlugin
          - name: tmp
            mountPath: /tmp
          - name: lingqu-log
            mountPath: /var/log/lingqu
          - name: containerd
            mountPath: /run/containerd
            readOnly: true
          - name: localtime
            mountPath: /etc/localtime
            readOnly: true
          - name: data-trace-file-dir
            mountPath: /user/cluster-info/datatrace-config
        env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          - name: HOST_IP
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: pod-resource
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: log-path
          hostPath:
            path: /var/log/mindx-dl/devicePlugin
            type: Directory
        - name: data-trace-file-dir
          hostPath:
            path: /user/cluster-info/datatrace-config
            type: DirectoryOrCreate
        - name: tmp
          hostPath:
            path: /tmp
        - name: lingqu-log
          hostPath:
            path: /var/log/lingqu
            type: DirectoryOrCreate
        - name: containerd
          hostPath:
            path: /run/containerd # when using an older version of Docker, change this to the directory that contains containerd.sock
        - name: localtime
          hostPath:
            path: /etc/localtime
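
The manifest implies a few prerequisites: the nodeSelector only matches nodes labeled `accelerator=huawei-Ascend910`, the `log-path` hostPath has type `Directory` and so must already exist on the node, and `imagePullPolicy: Never` means the image `ascend-k8sdeviceplugin:v3.0.0` must already be loaded on the node. The sketch below prints the corresponding commands for one node. The helper name, the node name and the manifest file name are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Print the commands implied by the manifest above for one node.
# device_plugin_steps and both arguments are illustrative names; the manifest
# file name is whatever you saved the YAML as.
device_plugin_steps() {
    local node="$1" manifest="$2"
    # nodeSelector requires this label on every NPU node
    echo "kubectl label node $node accelerator=huawei-Ascend910 --overwrite"
    # the log-path hostPath has type Directory, so it must exist (run on the node)
    echo "mkdir -p /var/log/mindx-dl/devicePlugin"
    echo "kubectl apply -f $manifest"
    # confirm the DaemonSet was created and its pods are scheduled
    echo "kubectl -n kube-system get daemonset ascend-device-plugin-daemonset"
}

device_plugin_steps "<node-name>" device-plugin.yaml
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without cluster access; review the output and run each line where it applies.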

2.2 Download and install Ascend Docker Runtime

2.2.1 Download link

wget -c https://gitee.com/ascend/ascend-docker-runtime/releases/download/v6.0.0-RC3/Ascend-docker-runtime_{version}_linux-{arch}.run
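
`{version}` and `{arch}` in the link are placeholders to replace with real values. A small sketch for filling them in, assuming `{arch}` matches the output of `uname -m` (for example `x86_64` or `aarch64`); `runtime_pkg_name` is an illustrative name, and the version string has to be taken from the release page.

```shell
#!/usr/bin/env bash
# Fill in the {version} and {arch} placeholders of the package file name.
# Assumption: {arch} matches the output of `uname -m` (e.g. x86_64 or aarch64).
runtime_pkg_name() {
    local version="$1" arch="${2:-$(uname -m)}"
    printf 'Ascend-docker-runtime_%s_linux-%s.run\n' "$version" "$arch"
}

# Usage: pass the version string shown on the release page.
#   runtime_pkg_name "<version>"
```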

2.2.2 Installation steps for the containerd runtime

  1. Add execute permission (the package cannot be run with ./ until this is done)
chmod u+x Ascend-docker-runtime_{version}_linux-{arch}.run
  2. Verify the integrity of the installation package
./Ascend-docker-runtime_{version}_linux-{arch}.run --check

Example output:

[WARNING]: --check is meaningless...
Verifying archive integrity... All good.
  3. Install the runtime
  • Install to the default path
./Ascend-docker-runtime_{version}_linux-{arch}.run --install
  • Install to a custom path
./Ascend-docker-runtime_{version}_linux-{arch}.run --install --install-path=<custom-path>

Success message:

[INFO] Ascend Docker Runtime install success
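
Beyond the success message, it can help to confirm that the runtime binary is actually in place. A minimal sketch: the default directory below is the one used in the containerd configuration in section 2.2.3, and `runtime_installed` is an illustrative name. If a custom install path was used, pass the directory that contains the binary.

```shell
#!/usr/bin/env bash
# Check that the runtime binary exists and is executable in a directory.
# The default directory is the one referenced by the containerd configuration
# in this guide; pass another directory if you used --install-path.
runtime_installed() {
    local dir="${1:-/usr/local/Ascend/Ascend-Docker-Runtime}"
    [ -x "$dir/ascend-docker-runtime" ]
}

if runtime_installed; then
    echo "ascend-docker-runtime found"
else
    echo "ascend-docker-runtime not found in the default path"
fi
```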

2.2.3 Modify the containerd configuration

  • If no configuration file exists
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
  • If a configuration file already exists
vim /etc/containerd/config.toml
  1. Update the runtime path
[plugins."io.containerd.runtime.v1.linux"]
  runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"  # replace with the actual path
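
containerd only reads its configuration at startup, so the edit takes effect after a restart. The sketch below first checks that the file really contains the expected `runtime` line and restarts the service only if it does. `config_uses_runtime` is an illustrative helper, not a containerd command.

```shell
#!/usr/bin/env bash
# Return success if a containerd config file sets `runtime` to the given binary.
# config_uses_runtime is an illustrative helper, not a containerd command.
config_uses_runtime() {
    local config="$1" runtime="$2"
    # -F: fixed string, so dots and slashes in the path are matched literally
    grep -F -q "runtime = \"$runtime\"" "$config"
}

# After editing the file, restart containerd to pick up the change:
#   config_uses_runtime /etc/containerd/config.toml \
#       /usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime \
#     && systemctl restart containerd
```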
