MaxInitializationTime

Description: Checks whether the time spent on initialization exceeds a threshold percentage of the total training time. The rule waits until the first step of the training loop starts. Initialization can take longer when downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes.
Recommendation: Initialization is taking too long. If you are using File mode with the TensorFlow framework, consider switching to Pipe mode.
Default parameters:
  threshold: 20
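A minimal sketch of this check, assuming timestamps in seconds. The function name and inputs are illustrative; the actual rule reads profiler timestamps from the training job.

```python
def max_initialization_time_triggered(init_start_s, first_step_start_s,
                                      threshold_minutes=20):
    """Flag a job whose initialization phase (the time before the first
    training step starts) exceeds the threshold, given in minutes."""
    init_minutes = (first_step_start_s - init_start_s) / 60.0
    return init_minutes > threshold_minutes
```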
BatchSize

Description: Checks whether GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint and the CPU and GPU utilization.
Recommendation: The batch size is too small and the GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size.
Default parameters:
  cpu_threshold_p95: 70
  gpu_threshold_p95: 70
  gpu_memory_threshold_p95: 70
  patience: 1000
  window: 500
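The parameter names suggest a 95th-percentile check on each metric. The sketch below is an assumed interpretation (nearest-rank percentile, all three metrics must be low, `patience` and `window` ignored for simplicity), not the SageMaker Debugger implementation.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

def batch_size_triggered(cpu_util, gpu_util, gpu_mem_util,
                         cpu_threshold_p95=70, gpu_threshold_p95=70,
                         gpu_memory_threshold_p95=70):
    """Trigger when the 95th percentiles of CPU utilization, GPU
    utilization, and GPU memory footprint are all below their thresholds."""
    return (percentile(cpu_util, 95) < cpu_threshold_p95
            and percentile(gpu_util, 95) < gpu_threshold_p95
            and percentile(gpu_mem_util, 95) < gpu_memory_threshold_p95)
```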
LowGPUUtilization

Description: Checks whether GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronization, or a batch size that is too small.
Recommendation: Check for bottlenecks, minimize blocking calls, change the distributed training strategy, or increase the batch size.
Default parameters:
  threshold_p95: 70
  threshold_p5: 10
  window: 500
  patience: 1000
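One plausible reading of the two percentile thresholds, sketched below as an assumption: a low 95th percentile means the GPU is underutilized overall, while a high p95 combined with a very low 5th percentile means utilization is fluctuating. The classification labels are illustrative.

```python
import math

def nearest_rank(values, p):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(p / 100.0 * len(ordered))) - 1]

def low_gpu_utilization(gpu_util, threshold_p95=70, threshold_p5=10):
    """Classify a window of GPU utilization samples as 'low',
    'fluctuating', or 'ok' (assumed interpretation of the thresholds)."""
    p95 = nearest_rank(gpu_util, 95)
    p5 = nearest_rank(gpu_util, 5)
    if p95 < threshold_p95:
        return "low"
    if p5 < threshold_p5:
        return "fluctuating"
    return "ok"
```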
StepOutlier

Description: Detects outliers in step duration. The duration of the forward and backward passes should be roughly constant throughout training; significant outliers may indicate a system stall or a bottleneck.
Recommendation: Check for bottlenecks (CPU, I/O) that correlate with the step outliers.
Default parameters:
  threshold: 3
  mode: None
  n_outliers: 10
  stddev: 3
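A common way to detect such outliers, and a plausible reading of the `stddev` parameter, is a standard-deviation cutoff. This is a sketch under that assumption; the real rule additionally requires at least `n_outliers` (default 10) outliers before it triggers.

```python
import statistics

def step_outliers(step_durations, stddev=3):
    """Return the step durations that lie more than `stddev` standard
    deviations above the mean step duration."""
    mean = statistics.mean(step_durations)
    sd = statistics.pstdev(step_durations)
    cutoff = mean + stddev * sd
    return [d for d in step_durations if d > cutoff]
```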
Dataloader

Description: Checks how many data-loader processes are running in parallel and whether their total is equal to the number of available CPU cores. The rule triggers if the number is much smaller or much larger than the number of available cores. Too few data loaders can lead to low GPU utilization; too many can interfere with other compute-intensive operations on the CPU.
Recommendation: Change the number of data-loader processes.
Default parameters:
  min_threshold: 70
  max_threshold: 200
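The defaults (70 and 200) suggest the thresholds are percentages of the available core count. The sketch below encodes that assumption; the function name is illustrative.

```python
def dataloader_triggered(num_dataloaders, num_cores,
                         min_threshold=70, max_threshold=200):
    """Trigger when the number of data-loader processes, expressed as a
    percentage of available CPU cores, falls outside the allowed band."""
    percent_of_cores = 100.0 * num_dataloaders / num_cores
    return percent_of_cores < min_threshold or percent_of_cores > max_threshold
```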
IOBottleneck

Description: Checks whether data I/O wait time is high while GPU utilization is low. This can indicate an I/O bottleneck, where the GPUs wait for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers if the time spent in I/O bottlenecks exceeds a threshold percentage of the total training time. The default threshold is 50 percent.
Recommendation: Pre-fetch data or choose different file formats, such as binary formats, that improve I/O performance.
Default parameters:
  threshold: 50
  io_threshold: 50
  gpu_threshold: 10
  patience: 1000
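The trigger condition described above can be sketched on sampled metrics like this (a minimal illustration over paired utilization samples; `patience` is ignored for simplicity):

```python
def io_bottleneck_triggered(io_wait, gpu_util,
                            threshold=50, io_threshold=50, gpu_threshold=10):
    """Trigger when the fraction of samples with high I/O wait and low
    GPU utilization exceeds `threshold` percent of the training time."""
    stalled = sum(1 for io, gpu in zip(io_wait, gpu_util)
                  if io > io_threshold and gpu < gpu_threshold)
    return 100.0 * stalled / len(io_wait) > threshold
```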
GPUMemoryIncrease

Description: Measures the average GPU memory footprint and triggers if there is a large increase.
Recommendation: Choose a larger instance type with more memory if the footprint is close to the maximum available memory.
Default parameters:
  increase: 5
  patience: 1000
  window: 10
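A minimal sketch of one way to compare windowed averages of the memory footprint, assuming `increase` is measured in percentage points and `window` is the number of samples per window (`patience` ignored):

```python
def gpu_memory_increase_triggered(mem_footprint, increase=5, window=10):
    """Trigger when the average GPU memory footprint over the latest
    window grows by more than `increase` relative to the previous window."""
    if len(mem_footprint) < 2 * window:
        return False  # not enough samples to compare two windows
    previous = sum(mem_footprint[-2 * window:-window]) / window
    current = sum(mem_footprint[-window:]) / window
    return current - previous > increase
```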
CPUBottleneck

Description: Checks whether CPU utilization is high while GPU utilization is low. This can indicate a CPU bottleneck, where the GPUs wait for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates and triggers if the time spent in CPU bottlenecks exceeds a threshold percentage of the total training time. The default threshold is 50 percent.
Recommendation: Consider increasing the number of data loaders or applying data pre-fetching.
Default parameters:
  threshold: 50
  cpu_threshold: 90
  gpu_threshold: 10
  patience: 1000
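The condition is analogous to the I/O case and can be sketched the same way on sampled metrics (illustrative only; `patience` ignored):

```python
def cpu_bottleneck_triggered(cpu_util, gpu_util,
                             threshold=50, cpu_threshold=90, gpu_threshold=10):
    """Trigger when the fraction of samples with high CPU utilization and
    low GPU utilization exceeds `threshold` percent of the training time."""
    stalled = sum(1 for cpu, gpu in zip(cpu_util, gpu_util)
                  if cpu > cpu_threshold and gpu < gpu_threshold)
    return 100.0 * stalled / len(cpu_util) > threshold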
LoadBalancing

Description: Detects workload-balancing issues across GPUs. Workload imbalance can occur in training jobs that use data parallelism: gradients are accumulated on a primary GPU, which can become overloaded relative to the other GPUs, reducing the efficiency of data parallelization.
Recommendation: Choose a different distributed training strategy or a different distributed training framework.
Default parameters:
  threshold: 0.2
  patience: 1000
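One simple imbalance measure consistent with a fractional threshold of 0.2 is the relative gap between the busiest and least busy GPU. The sketch below is an assumed formulation, not the rule's actual metric.

```python
def load_balancing_triggered(per_gpu_load, threshold=0.2):
    """Trigger when the relative gap between the busiest and the least
    busy GPU exceeds the threshold (default 0.2, i.e. 20 percent)."""
    busiest, idlest = max(per_gpu_load), min(per_gpu_load)
    return (busiest - idlest) / busiest > threshold
```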