Huawei Cloud AI Development Platform ModelArts: Running Distributed Training Jobs on Snt9B in a Lite Resource Pool


December 5, 2023

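Scenario

This example shows how to run a distributed training job on Snt9B. The Volcano scheduler is preinstalled in lite resource pools, and training jobs are submitted to the lite pool cluster as Volcano jobs by default. The test case trains the NLP model BERT; for detailed code and instructions, see Bert.

Procedure

Pull the image. The test image is bert_pretrain_mindspore:v1, which already contains the test data and code.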

docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1

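Create a config.yaml file on the host.

The config.yaml file configures the pod. This example starts the pod with the sleep command so that you can enter the pod for debugging. You can also change command to the startup command of your job (for example, "python train.py"); the job then runs once the container has started.

The content of config.yaml is as follows: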

apiVersion: v1
kind: ConfigMap
metadata:
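  name: configmap1980-yourvcjobname     # Keep the prefix "configmap1980-" and append the vcjob name.
  namespace: default                      # Any namespace, but it must be the same as that of the vcjob below.
  labels:
    ring-controller.cce: ascend-1980   # Do not change.
data:                    # Do not change. After initialization, the Volcano plugin updates this content automatically.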
  jobstart_hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The volcano API must be used.
kind: Job                               # Only the job type is supported at present.
metadata:
  name: yourvcjobname                  # Job name. It must match the name suffix used in the ConfigMap.
  namespace: default                      # Must be the same as that of the ConfigMap.
  labels:
    ring-controller.cce: ascend-1980   # Do not change.
    fault-scheduling: "force"
spec:
  minAvailable: 1                       # The value of minAvailable is 1 in a single-node scenario and N in an N-node distributed scenario.
  schedulerName: volcano                # Do not change. Use the Volcano scheduler to schedule jobs.
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    configmap1980:
    - --rank-table-version=v2  # Do not change. Generates a v2 rank table file.
    env: []
    svc:
    - --publish-not-ready-addresses=true
  maxRetry: 3
  queue: default
  tasks:
  - name: "yourvcjobname-1"
    replicas: 1                              # The value of replicas is 1 in a single-node scenario and N in an N-node scenario. The number of NPUs in the requests field is 8 in an N-node scenario.
    template:
      metadata:
        labels:
          app: mindspore
          ring-controller.cce: ascend-1980  # Do not change. The value must be the same as the label in ConfigMap.
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: volcano.sh/job-name
                      operator: In
                      values:
                        - yourvcjobname
                topologyKey: kubernetes.io/hostname
        containers:
        - image: bert_pretrain_mindspore:v1               # Image address. Training framework image, which can be modified.
          imagePullPolicy: IfNotPresent
          name: mindspore
          env:
          - name: name                               # The value must be the same as that of Jobname.
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ip                                       # IP address of the physical node, which is used to identify the node where the pod is running
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "MindSpore"
          command:
          - "sleep"
          - "1000000000000000000"
          resources:
            requests:
              huawei.com/ascend-1980: "1"                 # Number of required NPUs. Do not change the key. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
            limits:
              huawei.com/ascend-1980: "1"                 # NPU limit. Do not change the key. The value must be consistent with that in requests.
          volumeMounts:
          - name: ascend-driver               # Driver mount. Do not change.
            mountPath: /usr/local/Ascend/driver
          - name: ascend-add-ons           # Driver mount. Do not change.
            mountPath: /usr/local/Ascend/add-ons
          - name: localtime
            mountPath: /etc/localtime
          - name: hccn                             # Driver hccn configuration. Do not change.
            mountPath: /etc/hccn.conf
          - name: npu-smi                             #npu-smi
            mountPath: /usr/local/bin/npu-smi
        nodeSelector:
          accelerator/huawei-npu: ascend-1980
        volumes:
        - name: ascend-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: ascend-add-ons
          hostPath:
            path: /usr/local/Ascend/add-ons
        - name: localtime
          hostPath:
            path: /etc/localtime                      # Configure the Docker time.
        - name: hccn
          hostPath:
            path: /etc/hccn.conf
        - name: npu-smi
          hostPath:
            path: /usr/local/bin/npu-smi
        restartPolicy: OnFailure
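
The comments in config.yaml tie several fields together: the ConfigMap name is the fixed prefix plus the vcjob name, minAvailable and replicas both equal the node count, and multi-node jobs request 8 NPUs per pod. As a minimal sketch of those rules, the hypothetical helper below (vcjob_settings is not part of ModelArts or Volcano) derives the values to write into the file for a given job name and node count.

```python
def vcjob_settings(job_name: str, nodes: int, single_node_npus: int = 1) -> dict:
    """Derive the config.yaml fields that must stay in sync (illustrative helper)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Per the YAML comments: 8 NPUs per pod in an N-node scenario;
    # in the single-node case the caller picks the count.
    npus = single_node_npus if nodes == 1 else 8
    return {
        "configmap_name": f"configmap1980-{job_name}",  # fixed prefix + vcjob name
        "job_name": job_name,
        "anti_affinity_value": job_name,  # matched against volcano.sh/job-name
        "min_available": nodes,           # 1 for one node, N for N nodes
        "replicas": nodes,                # same rule as minAvailable
        "npus_per_pod": str(npus),        # same value under requests and limits
    }


settings = vcjob_settings("yourvcjobname", nodes=2)
for key, value in settings.items():
    print(f"{key}: {value}")
```

Checking these values against the file before running kubectl apply can catch a mismatched ConfigMap name early.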

Create the pod from config.yaml.

kubectl apply -f config.yaml

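Run the following command to check whether the pod has started. The pod has started successfully if READY shows "1/1" and STATUS shows "Running".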

kubectl get pod -A

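Enter the container. Replace {pod_name} with your pod name (the name shown by get pod) and {namespace} with your namespace (default unless you changed it).

kubectl exec -it {pod_name} -n {namespace} -- bash

Run the following command to view the NPU information.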

npu-smi info

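Kubernetes allocates resources to the pod according to the number of NPUs configured in config.yaml. As shown in the following figure, only one NPU is visible in the container because one NPU was configured, which confirms that the configuration has taken effect.

Figure 1 Viewing NPU information

Change the number of NPUs for the pod. Because this example is a distributed training job, change the required number of NPUs to 8.

Delete the pod that you created.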

kubectl delete -f config.yaml

In config.yaml, change the huawei.com/ascend-1980 value under both "limits" and "requests" to 8.

vi config.yaml

Figure 2 Changing the number of NPUs
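
If you prefer to script this edit instead of using vi, the following is a minimal sketch. It assumes the NPU count appears only on lines of the form huawei.com/ascend-1980: "N", as in the config.yaml above; set_npu_count is a hypothetical helper name, and the sample text is a shortened excerpt of that file.

```python
import re

# Matches the NPU resource key followed by a quoted integer. The label and
# nodeSelector lines are left alone because they lack this key.
NPU_LINE = re.compile(r'(huawei\.com/ascend-1980:\s*)"\d+"')


def set_npu_count(yaml_text: str, count: int) -> str:
    """Return yaml_text with every NPU requests/limits value set to count."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return NPU_LINE.sub(lambda m: f'{m.group(1)}"{count}"', yaml_text)


sample = '''\
          resources:
            requests:
              huawei.com/ascend-1980: "1"
            limits:
              huawei.com/ascend-1980: "1"
        nodeSelector:
          accelerator/huawei-npu: ascend-1980
'''
print(set_npu_count(sample, 8))
```

To apply it to the real file, read config.yaml into a string, pass it through set_npu_count, and write the result back before running kubectl apply again.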

Create the pod again.

kubectl apply -f config.yaml

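Enter the container and view the NPU information. Replace {pod_name} with your pod name and {namespace} with your namespace (default unless you changed it).

kubectl exec -it {pod_name} -n {namespace} -- bash
npu-smi info

As shown in the figure, eight NPUs are visible, so the pod is configured correctly.

Figure 3 Viewing NPU information

Run the following command to view the inter-card communication configuration file.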

cat /user/config/jobstart_hccl.json

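Multi-card training depends on a "rank_table_file" as the configuration file for inter-card communication. The file is generated automatically, and after the pod starts it is located at "/user/config/jobstart_hccl.json". Generating the file takes some time, so the service process must wait until the "status" field in "/user/config/jobstart_hccl.json" is "completed" before the inter-card communication information is available, as shown in the following figure.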

Figure 4 Inter-card communication configuration file
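
The wait described above can be automated. The following is a minimal sketch, not an official ModelArts utility: wait_for_rank_table is a hypothetical helper, and the timeout and polling interval are assumed values. It polls the file until the "status" field reads "completed" and tolerates the file being missing or only partly written.

```python
import json
import time
from pathlib import Path


def wait_for_rank_table(path="/user/config/jobstart_hccl.json",
                        timeout_s=600.0, poll_s=5.0):
    """Block until the rank table reports status "completed", then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            table = json.loads(Path(path).read_text())
            if isinstance(table, dict) and table.get("status") == "completed":
                return table
        except (OSError, json.JSONDecodeError):
            pass  # file not there yet or caught mid-write; try again
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not reach status 'completed'")
        time.sleep(poll_s)
```

A launch script could call wait_for_rank_table() before starting the training command, so that the job does not read a rank table that is still in the "initializing" state.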

Start the training job.

cd /home/ma-user/modelarts/user-job-dir/code/bert/
export MS_ENABLE_GE=1
export MS_GE_TRAIN=1
python scripts/ascend_distributed_launcher/get_distribute_pretrain_cmd.py --run_script_dir ./scripts/run_distributed_pretrain_ascend.sh --hyper_parameter_config_dir ./scripts/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ --hccl_config /user/config/jobstart_hccl.json --cmd_file ./distributed_cmd.sh
bash scripts/run_distributed_pretrain_ascend.sh /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ /user/config/jobstart_hccl.json

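Figure 5 Starting the training job

The training job takes some time to load. After waiting several minutes, run the following command to view the NPU information. As shown in the following figure, all eight NPUs are in use, which indicates that the training job is running.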

npu-smi info

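Figure 6 Viewing NPU information

To stop the training job, run the following commands to kill the training processes and then list the remaining ones. The output should no longer show any running python processes.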

pkill -9 python
ps -ef

Figure 7 Stopping the training processes

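You can also set CPU and memory sizes under limits and requests. A single Snt9B node has 8 Snt9B cards plus 192 CPU cores and 1536 GB of memory. Plan these values carefully so that CPU or memory limits that are too small do not prevent the job from running properly.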
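
One rough way to size CPU and memory is to give a pod a share of the node proportional to the number of NPUs it requests. The sketch below uses the node figures stated above. The proportional rule and the 10% headroom held back for system components are assumptions made for illustration, not ModelArts recommendations, and proportional_share is a hypothetical helper.

```python
NODE_NPUS = 8       # Snt9B cards per node (from the text above)
NODE_CPUS = 192     # CPU cores per node
NODE_MEM_GB = 1536  # memory per node, in GB


def proportional_share(npus: int, reserve_percent: int = 10) -> dict:
    """Suggest CPU and memory for a pod in proportion to its NPU count.

    reserve_percent is an assumed headroom left for the OS and cluster agents.
    """
    if not 1 <= npus <= NODE_NPUS:
        raise ValueError(f"npus must be between 1 and {NODE_NPUS}")
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    usable = 100 - reserve_percent
    # Integer arithmetic keeps the result rounded down and reproducible.
    return {
        "cpu": NODE_CPUS * npus * usable // (NODE_NPUS * 100),
        "memory_gb": NODE_MEM_GB * npus * usable // (NODE_NPUS * 100),
    }


for n in (1, 8):
    print(n, proportional_share(n))
```

Treat the result as a starting point and adjust it to the actual needs of data loading and preprocessing in your job.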
