Launching with msrun
Multi-Node, Multi-Card
The following example launches a 2-node, 8-card training job, starting 4 workers on each machine.

The script msrun_1.sh is executed on node 1. It uses the msrun command to launch 1 Scheduler process and 4 Worker processes, sets master_addr to node 1's IP address (msrun detects that the current node's IP matches master_addr and therefore launches the Scheduler process on this node), and sets node_rank to mark the current node as node 0:

```bash
EXEC_PATH=$(pwd)

# Download and unpack the MNIST dataset if it is not already present.
if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
    fi
    unzip MNIST_Data.zip
fi
export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/

rm -rf msrun_log
mkdir msrun_log
echo "start training"

# <node_1_ip> is a placeholder: substitute node 1's actual IP address.
msrun --worker_num=8 --local_worker_num=4 --master_addr=<node_1_ip> --master_port=8118 \
      --node_rank=0 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py
```

The script msrun_2.sh is executed on node 2. It uses msrun to launch 4 Worker processes, again setting master_addr to node 1's IP address, and sets node_rank to mark the current node as node 1:

```bash
EXEC_PATH=$(pwd)

# Download and unpack the MNIST dataset if it is not already present.
if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
    fi
    unzip MNIST_Data.zip
fi
export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/

rm -rf msrun_log
mkdir msrun_log
echo "start training"

# <node_1_ip> is a placeholder: substitute node 1's actual IP address.
msrun --worker_num=8 --local_worker_num=4 --master_addr=<node_1_ip> --master_port=8118 \
      --node_rank=1 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py
```

The only difference between the node 2 and node 1 commands is the value of node_rank.

On node 1, execute:

```bash
bash msrun_1.sh
```

On node 2, execute:

```bash
bash msrun_2.sh
```

This starts the 2-node, 8-card distributed training task. Log files are saved under the ./msrun_log directory, and the results are written to ./msrun_log/worker_*.log. The loss output is as follows:

```text
epoch: 0, step: 0, loss is 2.3499548
epoch: 0, step: 10, loss is 1.6682479
epoch: 0, step: 20, loss is 1.4237018
epoch: 0, step: 30, loss is 1.0437132
epoch: 0, step: 40, loss is 1.0643986
epoch: 0, step: 50, loss is 1.1021575
epoch: 0, step: 60, loss is 0.8510884
epoch: 0, step: 70, loss is 1.0581372
epoch: 0, step: 80, loss is 1.0076828
epoch: 0, step: 90, loss is 0.88950706
...
```
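With worker_num, local_worker_num, and node_rank fixed as above, each node's workers cover one contiguous block of global ranks. The sketch below illustrates this mapping as a plain Python helper; the function name `global_ranks` is hypothetical and not part of msrun, and contiguous per-node rank blocks are an illustrative assumption (the actual assignment is coordinated by the Scheduler process):

```python
def global_ranks(node_rank: int, local_worker_num: int) -> list[int]:
    """Global ranks of the workers on one node, assuming msrun assigns
    contiguous rank blocks per node (illustrative assumption only)."""
    start = node_rank * local_worker_num
    return list(range(start, start + local_worker_num))

# Node 0 (msrun_1.sh) and node 1 (msrun_2.sh) from the example above:
print(global_ranks(0, 4))  # → [0, 1, 2, 3]
print(global_ranks(1, 4))  # → [4, 5, 6, 7]
```

Under this assumption the two 4-worker nodes together cover exactly the 8 global ranks declared by --worker_num=8, which is why only node_rank needs to differ between the two scripts.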