Huawei Cloud AI Development Platform ModelArts: Running a Distributed Training Job with Snt9B in a Lite Resource Pool

Scenario

This case describes how to run a distributed training job on Snt9B. The Volcano scheduler is installed by default in a Lite resource pool, and training jobs are delivered to the Lite pool cluster as Volcano jobs by default. The test case uses the BERT NLP model; for detailed code and instructions, see Bert.

Procedure

Pull the image. The test image is bert_pretrain_mindspore:v1, which already contains the test data and code.

docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1
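To confirm that the image was pulled and tagged successfully, you can optionally list the local images:

# Optional check: the image should appear under both the SWR address and the local tag
docker images | grep bert_pretrain_mindspore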

Create a config.yaml file on the host.

The config.yaml file is used to configure the pod. In this example, the pod is started with a sleep command so that you can enter the pod for debugging. You can also change command to your job's startup command (for example, "python train.py"), which will then be executed after the container starts; see the sketch after the YAML below.

The content of config.yaml is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap1980-yourvcjobname     # Keep the "configmap1980-" prefix unchanged; append the vcjob name
  namespace: default                      # Any namespace, but it must be the same as that of the vcjob below
  labels:
    ring-controller.cce: ascend-1980   # Do not change
data:                    # Do not change the data content; after initialization it is updated automatically by the Volcano plugin
  jobstart_hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The volcano API must be used.
kind: Job                               # Only the job type is supported at present.
metadata:
  name: yourvcjobname                  # Job name; must match the suffix of the ConfigMap name above
  namespace: default                      # Must be the same as that of the ConfigMap
  labels:
    ring-controller.cce: ascend-1980   # Do not change
    fault-scheduling: "force"
spec:
  minAvailable: 1                       # The value of minAvailable is 1 in a single-node scenario and N in an N-node distributed scenario.
  schedulerName: volcano                # Do not change; the Volcano scheduler must be used to schedule jobs
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    configmap1980:
    - --rank-table-version=v2  # Do not change; generates a v2 rank table file
    env: []
    svc:
    - --publish-not-ready-addresses=true
  maxRetry: 3
  queue: default
  tasks:
  - name: "yourvcjobname-1"
    replicas: 1                              # The value of replicas is 1 in a single-node scenario and N in an N-node scenario. The number of NPUs in the requests field is 8 in an N-node scenario.
    template:
      metadata:
        labels:
          app: mindspore
          ring-controller.cce: ascend-1980  # Do not change; the value must be the same as the label in the ConfigMap
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: volcano.sh/job-name
                      operator: In
                      values:
                        - yourvcjobname
                topologyKey: kubernetes.io/hostname
        containers:
        - image: bert_pretrain_mindspore:v1               # Training framework image address, which can be modified
          imagePullPolicy: IfNotPresent
          name: mindspore
          env:
          - name: name                               # The value must be the same as that of Jobname.
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ip                                       # IP address of the physical node, which is used to identify the node where the pod is running
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "MindSpore"
          command:
          - "sleep"
          - "1000000000000000000"
          resources:
            requests:
              huawei.com/ascend-1980: "1"                 # Number of requested NPUs; do not change the key. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
            limits:
              huawei.com/ascend-1980: "1"                 # NPU limit; do not change the key. The value must be consistent with that in requests.
          volumeMounts:
          - name: ascend-driver               # Driver mount; do not change
            mountPath: /usr/local/Ascend/driver
          - name: ascend-add-ons           # Driver add-ons mount; do not change
            mountPath: /usr/local/Ascend/add-ons
          - name: localtime
            mountPath: /etc/localtime
          - name: hccn                             # HCCN driver configuration; do not change
            mountPath: /etc/hccn.conf
          - name: npu-smi                             # npu-smi tool
            mountPath: /usr/local/bin/npu-smi
        nodeSelector:
          accelerator/huawei-npu: ascend-1980
        volumes:
        - name: ascend-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: ascend-add-ons
          hostPath:
            path: /usr/local/Ascend/add-ons
        - name: localtime
          hostPath:
            path: /etc/localtime                      # Configure the Docker time.
        - name: hccn
          hostPath:
            path: /etc/hccn.conf
        - name: npu-smi
          hostPath:
            path: /usr/local/bin/npu-smi
        restartPolicy: OnFailure
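If you want the job to start automatically instead of entering the pod for debugging, the command section above can be replaced along the following lines. This is a minimal sketch; "train.py" is a placeholder for your own startup script and is not shipped in the test image.

          # Sketch: replace the sleep command with your own startup command
          command:
          - "/bin/bash"
          - "-c"
          - "python train.py"   # placeholder script name, adjust to your actual entry point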

Create the pod based on config.yaml.

kubectl apply -f config.yaml

Run the following command to check whether the pod has started. If the pod is in the "1/1 Running" state, it has started successfully.

kubectl get pod -A
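If the pod is still starting, you can optionally wait for it to become Ready instead of polling manually (replace {pod_name} with the name shown by get pod and adjust the namespace):

# Optional: block until the pod is Ready, with a 5-minute timeout
kubectl wait --for=condition=Ready pod/{pod_name} -n default --timeout=300s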

Enter the container. Replace {pod_name} with your pod name (as shown by get pod) and {namespace} with your namespace ("default" by default).

kubectl exec -it {pod_name} bash -n {namespace}
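Newer kubectl versions print a deprecation warning for this form; the equivalent command with an explicit argument separator is:

kubectl exec -it {pod_name} -n {namespace} -- bash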

Run the following command to view the NPU information.

npu-smi info

Kubernetes allocates NPUs to the pod based on the number configured in config.yaml. As shown in the following figure, only one NPU is visible in the container because one NPU was requested, which indicates that the configuration has taken effect.

Figure 1 Viewing the NPU information

Modify the number of NPUs used by the pod. Because this case runs distributed training, change the number of NPUs to 8.

Delete the created pod.

kubectl delete -f config.yaml

Change the NPU count under "requests" and "limits" in config.yaml to 8.

vi config.yaml

Figure 2 Modifying the number of NPUs
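After the change, the resources section of config.yaml should look as follows (only the NPU count changes; the key stays the same):

          resources:
            requests:
              huawei.com/ascend-1980: "8"
            limits:
              huawei.com/ascend-1980: "8"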

Re-create the pod.

kubectl apply -f config.yaml

Enter the container and view the NPU information. Replace {pod_name} with your pod name and {namespace} with your namespace ("default" by default).

kubectl exec -it {pod_name} bash -n {namespace}
npu-smi info

As shown in the figure, eight NPUs are displayed, indicating that the pod is configured successfully.

Figure 3 Viewing the NPU information

Run the following command to view the inter-NPU communication configuration file.

cat /user/config/jobstart_hccl.json

Multi-NPU training relies on the "rank_table_file" for inter-NPU communication configuration. This file is generated automatically; after the pod starts, it is located at "/user/config/jobstart_hccl.json". Generating the file takes some time, so the training process must wait until the "status" field in "/user/config/jobstart_hccl.json" becomes "completed" before the inter-NPU communication information is available, as shown in the following figure.

Figure 4 Inter-NPU communication configuration file
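If you script the training launch, you can optionally poll the file until it is ready, for example with a simple shell loop (a sketch, not part of the original steps):

# Wait until the rank table file reports the completed status before launching training
while ! grep -q "completed" /user/config/jobstart_hccl.json 2>/dev/null; do
    echo "jobstart_hccl.json not ready yet, waiting..."
    sleep 10
done
cat /user/config/jobstart_hccl.json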

Start the training job.

cd /home/ma-user/modelarts/user-job-dir/code/bert/
export MS_ENABLE_GE=1
export MS_GE_TRAIN=1
python scripts/ascend_distributed_launcher/get_distribute_pretrain_cmd.py --run_script_dir ./scripts/run_distributed_pretrain_ascend.sh --hyper_parameter_config_dir ./scripts/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ --hccl_config /user/config/jobstart_hccl.json --cmd_file ./distributed_cmd.sh
bash scripts/run_distributed_pretrain_ascend.sh /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ /user/config/jobstart_hccl.json

Figure 5 Starting the training job

Loading the training job takes some time. After waiting several minutes, run the following command to view the NPU information. As shown in the figure below, all eight NPUs are occupied, indicating that the training job is running.

npu-smi info

Figure 6 Viewing the NPU information

To stop the training job, run the following commands to kill the processes. Querying the processes afterwards shows that no python processes are running.

pkill -9 python
ps -ef
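To check specifically for leftover training processes, you can filter the process list (the bracket pattern keeps grep from matching its own process):

ps -ef | grep "[p]ython"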

Figure 7 Stopping the training processes

Configure CPU and memory in "requests" and "limits" as well. A single Snt9B node provides 8 Snt9B NPUs, 192 vCPUs, and 1536 GB of memory. Plan these resources properly to prevent the job from failing because the CPU or memory limits are too small; a sketch follows below.
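For example, for an 8-NPU pod that uses most of a single node, the resources section could be extended as follows. The CPU and memory values are illustrative assumptions that leave headroom below the node totals, not required values:

          resources:
            requests:
              huawei.com/ascend-1980: "8"
              cpu: "160"          # assumption: below the 192-vCPU node total
              memory: "1200Gi"    # assumption: below the 1536-GB node total
            limits:
              huawei.com/ascend-1980: "8"
              cpu: "160"
              memory: "1200Gi"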


