Because of their huge parameter counts, large models can only be trained (in full precision) on multi-GPU machines, even for finetuning. On a single machine with 8x A100 40GB, a 3B model can barely be trained once all of DeepSpeed's optimizations are enabled, and a 7B model does not fit at all; multi-node multi-GPU training is required. This post records some early exploration and the problems encountered while configuring multi-node multi-GPU training for large models on bare-metal machines.

Multi-node multi-GPU training needs an efficient communication framework to coordinate data transfer and computation across devices; common choices include MPI and NCCL. It also relies on additional techniques such as data parallelism and model parallelism to distribute compute and memory across the devices.

Although multi-node multi-GPU training can greatly speed up model training, it also brings challenges such as device failures and communication latency. Hardware and software therefore need to be chosen carefully, and the setup needs thorough testing and tuning.

Test 1

Environment setup

Both hosts join the Docker swarm as workers, and the containers are attached to an overlay network.
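
A rough sketch of this setup; the network name, image name and mount path below are placeholders rather than the exact values used here:

# on a swarm manager node: create an attachable overlay network
docker network create --driver overlay --attachable dist-net

# on each host: start the training container attached to that network
docker run -d --gpus all --network dist-net --name ds-worker \
    -v /data:/data my-deepspeed-image:latest sleep infinity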

After the container starts, sshd has to be started manually:

/etc/init.d/ssh start

Run DeepSpeed-Chat multi-node training

NCCL_DEBUG_SUBSYS=ALL NCCL_IB_DISABLE=1 NCCL_DEBUG=INFO python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --actor-zero-stage 3 --output-dir /data/opt-1.3b-multi-node/ --hostfile hostfile
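
The --hostfile argument points at a DeepSpeed hostfile; for these two 8-GPU nodes it would look roughly like the sketch below (the hostnames a182/a188 are taken from the logs further down, and slots is the number of GPUs to use on each node):

# hostfile: one line per node
a182 slots=8
a188 slots=8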

Log analysis

The error is as follows: a socket connection failure

The GPUs do not support NCCL over InfiniBand, so NCCL_IB_DISABLE=1 is set to disable it. Inter-node communication then falls back to TCP sockets, and the eth0/eth1 interfaces NCCL picks up are the container's network interfaces.
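
A quick way to confirm what NCCL actually has to work with, assuming the usual iproute2 tools and (optionally) infiniband-diags are present in the container:

# list the network interfaces visible inside the container
ip -brief addr show

# no output (or an error) here means no RDMA-capable devices are visible, so IB is not an option
ls /sys/class/infiniband 2>/dev/null || echo "no RDMA devices"

# if infiniband-diags is installed, ibstat also reports port state
ibstat || true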

ifconfig inside the container:

ifconfig on the host:

The message below shows that RDMA is not usable on either network either

Test 2

Environment setup

Both hosts have left the swarm (docker swarm leave), the containers use the host network, and after the containers start the ssh port is changed from 6000 back to 22.
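
A sketch of this variant (image name and paths are again placeholders); sshd is switched back to port 22 so that DeepSpeed's default pdsh/ssh launcher can reach the workers without extra configuration:

# start the container directly on the host network
docker run -d --gpus all --network host --name ds-worker \
    -v /data:/data my-deepspeed-image:latest sleep infinity

# inside the container: change sshd from port 6000 back to 22 and restart it
docker exec ds-worker sed -i 's/^Port 6000/Port 22/' /etc/ssh/sshd_config
docker exec ds-worker /etc/init.d/ssh restart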

Run DeepSpeed-Chat multi-node training

NCCL_DEBUG_SUBSYS=ALL NCCL_IB_DISABLE=1 NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth0 python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --actor-zero-stage 3 --output-dir /data/opt-1.3b-multi-node/ --hostfile hostfile
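
DeepSpeed's multi-node launcher drives the workers over pdsh/ssh, so passwordless ssh between the containers has to work before the command above can get anywhere; a quick check, using the a182/a188 hostnames from the hostfile:

# one-time key setup on each node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@a188

# must print the peer's hostname without prompting for a password
ssh a188 hostname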

Log analysis

Test 3

Environment setup

Both hosts have left the swarm (docker swarm leave), the containers use the host network, and after the containers start the ssh port is changed from 6000 back to 22.

Run BELLE/train multi-node training with torchrun, to rule DeepSpeed out as the cause

On node1 (182):

OMP_NUM_THREADS=8 torchrun --node_rank=0 --master_addr=192.168.0.8 --master_port=29500 --nnodes=2 --nproc_per_node=8 finetune.py --model_config_file run_config/Bloom_config.json --lora_hyperparams_file run_config/lora_hyperparams_bloom.json --use_lora

On node2 (188), pointing at the same master address as node1 (all nodes must agree on --master_addr/--master_port):

OMP_NUM_THREADS=8 torchrun --node_rank=1 --master_addr=192.168.0.8 --master_port=29500 --nnodes=2 --nproc_per_node=8 finetune.py --model_config_file run_config/Bloom_config.json --lora_hyperparams_file run_config/lora_hyperparams_bloom.json --use_lora

Test 4

Environment setup

Both hosts have left the swarm (docker swarm leave), the containers use the host network, and after the containers start the ssh port is changed from 6000 back to 22.

Run the nccl-tests benchmark

NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG=INFO NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME=eth0 mpirun --allow-run-as-root  -np 16 -H 192.168.0.8:54321,192.168.0.16:54321 ./build/all_gather_perf  -b 8 -e  128M -f 2 -g 8 2>&1 |tee ib.log
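
The all_gather_perf binary comes from NVIDIA's nccl-tests repository, built with MPI support so it can be launched through mpirun as above; roughly like this (the MPI/CUDA/NCCL paths depend on the image):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# MPI=1 builds the multi-process variant used with mpirun
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr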

Final solution

Analysis of the nccl-tests log shows that the inter-node connections are still being blocked by the firewall and iptables rules, so the simplest fix is to turn the firewall off entirely:

#!/usr/bin/env bash
# Reset the policies of the nat, mangle and filter tables to ACCEPT and
# flush/delete all rules and chains, then disable ufw, so nothing on the
# host blocks ssh or NCCL socket traffic between the nodes.

iptables -t nat -P PREROUTING ACCEPT
iptables -t nat -P POSTROUTING ACCEPT
iptables -t nat -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -P PREROUTING ACCEPT
iptables -t mangle -P INPUT ACCEPT
iptables -t mangle -P FORWARD ACCEPT
iptables -t mangle -P OUTPUT ACCEPT
iptables -t mangle -P POSTROUTING ACCEPT
iptables -t mangle -F
iptables -t mangle -X
iptables -t filter -P INPUT ACCEPT
iptables -t filter -P FORWARD ACCEPT
iptables -t filter -P OUTPUT ACCEPT
iptables -t filter -F
iptables -t filter -X

ufw disable
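
Flushing everything is heavy-handed. If the nodes sit on a trusted private network, a narrower alternative is to allow only traffic from the peer nodes and keep ufw enabled; a sketch, assuming the 192.168.0.0/24 subnet used in the commands above:

# covers ssh, the torch/deepspeed master port 29500 and NCCL's dynamically chosen ports
ufw allow from 192.168.0.0/24
ufw enable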

Two-node training now runs to completion, although the worker's ssh connection dropped partway through, so stability still needs to be verified.
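
One common mitigation for ssh sessions dropping during long runs is to enable keepalives on both ends; these are standard OpenSSH options, though whether a missing keepalive was the actual cause here was not verified:

# on the workers: have sshd probe idle clients instead of dropping them
echo "ClientAliveInterval 60" >> /etc/ssh/sshd_config
echo "ClientAliveCountMax 10" >> /etc/ssh/sshd_config
/etc/init.d/ssh restart

# on the launching node: have the ssh client send keepalives as well
echo "ServerAliveInterval 60" >> /etc/ssh/ssh_config

The full log of this run: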


[2023-04-17 17:13:37,910] [INFO] [runner.py:446:main] Using IP address of 192.168.0.8 for node a182
[2023-04-17 17:13:37,911] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: a182,a188
[2023-04-17 17:13:37,911] [INFO] [runner.py:540:main] cmd = pdsh -S -f 1024 -w a182,a188 export NCCL_VERSION=2.16.5; export NCCL_SOCKET_IFNAME=eth0; export PYTHONIOENCODING=utf-8; export NCCL_IB_DISABLE=1; export PYTHONPATH=/data/repos/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning;  cd /data/repos/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJhMTgyIjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiYTE4OCI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=192.168.0.8 --master_port=29500 main.py --data_path 'Dahoas/rm-static' 'Dahoas/full-hh-rlhf' 'Dahoas/synthetic-instruct-gptj-pairwise' 'yitingxie/rlhf-reward-datasets' 'openai/webgpt_comparisons' 'stanfordnlp/SHP' --data_split '2,4,4' --model_name_or_path 'facebook/opt-1.3b' --per_device_train_batch_size '4' --per_device_eval_batch_size '4' --max_seq_len '512' --learning_rate '1e-4' --weight_decay '0.1' --num_train_epochs '2' --gradient_accumulation_steps '1' --lr_scheduler_type 'cosine' --num_warmup_steps '0' --seed '1234' --gradient_checkpointing --zero_stage '3' --lora_dim '128' --lora_module_name 'decoder.layers.' --deepspeed --output_dir '/data/opt-1.3b-multi-node/actor-models/1.3b'
a182: [2023-04-17 17:13:43,896] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.16.5
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=eth0
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:229:main] WORLD INFO DICT: {'a182': [0, 1, 2, 3, 4, 5, 6, 7], 'a188': [0, 1, 2, 3, 4, 5, 6, 7]}
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=8, node_rank=0
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(, {'a182': [0, 1, 2, 3, 4, 5, 6, 7], 'a188': [8, 9, 10, 11, 12, 13, 14, 15]})
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:247:main] dist_world_size=16
a182: [2023-04-17 17:13:43,897] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:222:main] 1 NCCL_VERSION=2.16.5
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:222:main] 1 NCCL_SOCKET_IFNAME=eth0
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:222:main] 1 NCCL_IB_DISABLE=1
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:229:main] WORLD INFO DICT: {'a182': [0, 1, 2, 3, 4, 5, 6, 7], 'a188': [0, 1, 2, 3, 4, 5, 6, 7]}
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=8, node_rank=1
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(, {'a182': [0, 1, 2, 3, 4, 5, 6, 7], 'a188': [8, 9, 10, 11, 12, 13, 14, 15]})
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:247:main] dist_world_size=16
a188: [2023-04-17 17:13:43,896] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
a182: [2023-04-17 17:13:54,469] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
a182: 'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /facebook/opt-1.3b/resolve/main/config.json (Caused by ConnectTimeoutError(, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/facebook/opt-1.3b/resolve/main/config.json
a182: [2023-04-17 17:17:31,616] [INFO] [partition_parameters.py:436:__exit__] finished initializing model with 1.42B parameters
a188: Found cached dataset parquet (/root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
a188:
  0%|          | 0/2 [00:00
a188:     main()
a188:   File "main.py", line 218, in main
a188:     train_dataset, eval_dataset = create_prompt_dataset(
a188:   File "/data/repos/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 279, in create_prompt_dataset
a188:     train_dataset, eval_dataset = create_dataset(
a188:   File "/data/repos/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 212, in create_dataset
a188:     raw_dataset = get_raw_dataset(dataset_name, output_path, seed, local_rank)
a188:   File "/data/repos/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/data_utils.py", line 21, in get_raw_dataset
a188:     return raw_datasets.DahoasRmstaticDataset(output_path, seed,
a188:   File "/data/repos/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/data/raw_datasets.py", line 52, in __init__
a188:     self.raw_datasets = load_dataset("Dahoas/rm-static")
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1767, in load_dataset
a188:     builder_instance = load_dataset_builder(
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1498, in load_dataset_builder
a188:     dataset_module = dataset_module_factory(
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1215, in dataset_module_factory
a188:     raise e1 from None
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1192, in dataset_module_factory
a188:     return HubDatasetModuleFactoryWithoutScript(
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 825, in get_module
a188:     dataset_readme_path = cached_path(
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 183, in cached_path
a188:     output_path = get_from_cache(
a188:   File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 566, in get_from_cache
a188:     raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
a188: ConnectionError: Couldn't reach https://huggingface.co/datasets/Dahoas/rm-static/resolve/main/README.md (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=100)")))
a188: [2023-04-17 18:19:46,372] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9324
a188: [2023-04-17 18:19:46,591] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9325
a188: [2023-04-17 18:19:46,806] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9326
a188: [2023-04-17 18:19:46,980] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9327
a188: [2023-04-17 18:19:47,195] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9328
a188: [2023-04-17 18:19:47,369] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9329
a188: [2023-04-17 18:19:47,542] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9330
a188: [2023-04-17 18:19:47,543] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9332
a188: [2023-04-17 18:19:47,917] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/data/opt-1.3b-multi-node/actor-models/1.3b'] exits with return code = 1
pdsh@ecs-21075649-010: a188: ssh exited with exit code 1
pdsh@ecs-21075649-010: interrupt (one more within 1 sec to abort)
pdsh@ecs-21075649-010:  (^Z within 1 sec to cancel pending threads)
pdsh@ecs-21075649-010: a182: command in progress
sending SIGTERM to ssh a182
sending signal 15 to a182 [ssh] pid 5758
pdsh@ecs-21075649-010: interrupt, aborting.
pdsh@ecs-21075649-010: a188: ssh exited with exit code 1
  

Performance comparison

Experiment 1

4 nodes, each with 8x A100-PCIE-40GB

【step1】

deepspeed --hostfile=$HOSTFILE main.py \
--data_path all_instruction_data_for_DeepSpeedChat \
--data_split 2,4,4 \
--model_name_or_path /data/models/bigscience_bloomz-7b1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--max_seq_len 2048 \
--learning_rate 1e-4 \
--weight_decay 0.1 \
--num_train_epochs 2 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--gradient_checkpointing \
--zero_stage $ZERO_STAGE \
--lora_dim 128 \
--lora_module_name decoder.layers. \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
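
The script above refers to $HOSTFILE, $ZERO_STAGE and $OUTPUT, which come from the surrounding run script; a hypothetical preamble (the paths and values are illustrative, not the exact ones used):

HOSTFILE=/data/hostfile        # DeepSpeed hostfile listing the 4 nodes, slots=8 each
ZERO_STAGE=3                   # ZeRO stage (the training.log below shows stage 3 was used)
OUTPUT=/data/bloomz-7b1-sft    # checkpoint/log directory
mkdir -p $OUTPUT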

【step3】

deepspeed --hostfile=$HOSTFILE --master_port 12346 main.py \
--data_path all_instruction_data_for_DeepSpeedChat \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 4 \
--per_device_mini_train_batch_size 4 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 1024 \
--max_prompt_seq_len 1024 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--enable_hybrid_engine \
--inference_tp_size 8 \
--tp_gather_partition_size 4 \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--actor_gradient_checkpointing \
--actor_lora_dim 128 \
--actor_lora_module_name decoder.layers. \
--output_dir $OUTPUT \
&> $OUTPUT/training.log

【training.log】


a182: ***** Running training *****
a182: ***** Evaluating perplexity, Epoch 0/2 *****
a182: ppl: 1615.3291015625
a182: Beginning of Epoch 1/2, Total Micro Batches 844
a182: [2023-04-21 15:32:05,774] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
a182: [2023-04-21 15:32:47,044] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
a182: [2023-04-21 15:33:29,700] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
a182: [2023-04-21 15:34:11,199] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
a182: [2023-04-21 15:34:53,648] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
a182: [2023-04-21 15:35:37,038] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
a182: [2023-04-21 15:36:19,724] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
a182: [2023-04-21 15:37:02,075] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
a182: [2023-04-21 15:37:44,410] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512, reducing to 256
a182: [2023-04-21 15:38:26,336] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256, reducing to 128
a182: [2023-04-21 15:38:26,337] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=10, lr=[0.0001, 0.0001], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-21 15:38:26,338] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=3.018237674524317, CurrSamplesPerSec=3.0533144699669323, MemAllocated=8.18GB, MaxMemAllocated=27.21GB
a182: [2023-04-21 15:39:09,043] [WARNING] [stage3.py:1787:step] 17 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:39:50,772] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:40:32,341] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:41:14,049] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:41:55,981] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:42:38,949] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:44:05,257] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:44:47,426] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:45:29,589] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:45:29,590] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=10, lr=[9.999134070902207e-05, 9.999134070902207e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-21 15:45:29,591] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=3.021699438930297, CurrSamplesPerSec=3.0359636074270875, MemAllocated=8.18GB, MaxMemAllocated=27.21GB
a182: [2023-04-21 15:46:10,884] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:46:53,122] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:47:34,534] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:48:16,075] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:48:58,285] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:49:40,420] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:50:23,103] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:51:06,003] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:51:47,326] [WARNING] [stage3.py:1787:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:52:29,013] [WARNING] [stage3.py:1787:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-21 15:52:29,014] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=10, lr=[9.996536583542105e-05, 9.996536583542105e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-21 15:52:29,015] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=3.0324580151109792, CurrSamplesPerSec=3.0706646549224987, MemAllocated=8.18GB, MaxMemAllocated=27.21GB
  

【GPU Loading】

【Memory usage】

Experiment 2

6 nodes, each with 8x A100-PCIE-40GB
The training parameters are identical to Experiment 1; the only change is the two additional machines.

【Conclusion】: The high-memory-pressure warnings still appear, but no longer back to back; there is still a risk that step 3 will run out of memory. Finish step 1 first; steps 2 and 3 can be run separately afterwards.

【training.log】


a182: Time to load utils op: 0.0005230903625488281 seconds
a182: ***** Running training *****
a182: ***** Evaluating perplexity, Epoch 0/2 *****
a182: ppl: 1615.9046630859375
a182: Beginning of Epoch 1/2, Total Micro Batches 563
a182: [2023-04-22 00:20:37,845] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
a182: [2023-04-22 00:21:23,520] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
a182: [2023-04-22 00:22:08,163] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
a182: [2023-04-22 00:22:54,065] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
a182: [2023-04-22 00:23:40,058] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
a182: [2023-04-22 00:24:25,473] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
a182: [2023-04-22 00:25:12,089] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
a182: [2023-04-22 00:25:58,308] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
a182: [2023-04-22 00:26:43,708] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512, reducing to 256
a182: [2023-04-22 00:27:30,709] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256, reducing to 128
a182: [2023-04-22 00:27:30,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=10, lr=[0.0001, 0.0001], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-22 00:27:30,712] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=4.183477766906136, CurrSamplesPerSec=4.0852692074479595, MemAllocated=7.04GB, MaxMemAllocated=26.07GB
a182: [2023-04-22 00:28:17,487] [WARNING] [stage3.py:1787:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
a182: [2023-04-22 00:35:01,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=10, lr=[9.99805403600595e-05, 9.99805403600595e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-22 00:35:01,688] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=4.224439829942313, CurrSamplesPerSec=4.238250039396783, MemAllocated=7.04GB, MaxMemAllocated=26.07GB
a182: [2023-04-22 00:42:17,326] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=10, lr=[9.99221765873415e-05, 9.99221765873415e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-22 00:42:17,327] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=4.288112235651409, CurrSamplesPerSec=4.507034898076392, MemAllocated=7.04GB, MaxMemAllocated=26.07GB
a182: [2023-04-22 00:49:22,478] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=10, lr=[9.982495411136606e-05, 9.982495411136606e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
a182: [2023-04-22 00:49:22,479] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=4.345926499891864, CurrSamplesPerSec=4.512978428671682, MemAllocated=7.04GB, MaxMemAllocated=26.07GB
  

【GPU Loading】

【Memory usage】