SageMaker Debugger Profiling Report

SageMaker Debugger automatically generated this report. You can generate similar reports for all supported training jobs. The report provides a summary of the training job, system resource usage statistics, framework metrics, a rules summary, and a detailed analysis from each rule. The graphs and tables are interactive.
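
To set up a job that produces a report like this one, the following is a minimal sketch, assuming the SageMaker Python SDK with a PyTorch estimator; the training script, IAM role, instance settings, and S3 paths are placeholders rather than the configuration used for this job.

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

# Sample system metrics every 500 ms.
profiler_config = ProfilerConfig(system_monitor_interval_millis=500)

estimator = PyTorch(
    entry_point="train.py",                               # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    framework_version="1.12",
    py_version="py38",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    profiler_config=profiler_config,
    # The ProfilerReport rule runs the built-in profiler rules and generates this report.
    rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())],
)

estimator.fit("s3://amzn-s3-demo-bucket/training-data")   # placeholder S3 input

When the job finishes, the rule runs as a processing job and uploads the report artifacts to its rule output path in S3; the sketch after the parameters cell below shows one way to locate them.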

Legal disclaimer: This report and any recommendations are provided for informational purposes only and are not definitive. You are responsible for making your own independent assessment of the information.

In [4]:
# Parameters
processing_job_arn = "arn:aws:sagemaker:us-east-1:215290792315:processing-job/pytorch-training-2023-02-1-profilerreport-6e39fdd0"
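
The parameters cell above identifies the processing job that evaluated the rules and rendered this report. As a sketch that uses only standard boto3 SageMaker calls, the report's S3 output location can be recovered from that ARN:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# The processing job name is the last component of the ARN.
job_name = processing_job_arn.split("/")[-1]
desc = sm.describe_processing_job(ProcessingJobName=job_name)

# Print each configured output and the S3 prefix it was written to.
for output in desc["ProcessingOutputConfig"]["Outputs"]:
    print(output["OutputName"], "->", output["S3Output"]["S3Uri"])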

Training job summary

System usage statistics

Framework metrics summary

Overview: CPU operators

Rules summary

The following table shows a profiling summary of the Debugger built-in rules, sorted by how frequently each rule was triggered. During your training job, the MaxInitializationTime rule was the most frequently triggered; it processed 85 datapoints and was triggered 0 times.

MaxInitializationTime
Description: Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of the training loop starts. Initialization can take longer if the entire dataset is downloaded from Amazon S3 in File mode. The default threshold is 20 minutes.
Recommendation: Initialization takes too long. If you are using File mode with the TensorFlow framework, consider switching to Pipe mode.
Number of times rule triggered: 0
Number of datapoints: 85
Rule parameters: threshold:20

BatchSize
Description: Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint and the CPU and GPU utilization.
Recommendation: The batch size is too small and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size.
Number of times rule triggered: 0
Number of datapoints: 4627
Rule parameters: cpu_threshold_p95:70, gpu_threshold_p95:70, gpu_memory_threshold_p95:70, patience:1000, window:500

LowGPUUtilization
Description: Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronization, or a batch size that is too small.
Recommendation: Check if there are bottlenecks, minimize blocking calls, change the distributed training strategy, or increase the batch size.
Number of times rule triggered: 0
Number of datapoints: 0
Rule parameters: threshold_p95:70, threshold_p5:10, window:500, patience:1000

StepOutlier
Description: Detects outliers in step duration. The duration of the forward and backward passes should be roughly the same throughout training. Significant outliers may indicate a system stall or a bottleneck.
Recommendation: Check whether any bottlenecks (CPU, I/O) are correlated with the step outliers.
Number of times rule triggered: 0
Number of datapoints: 85
Rule parameters: threshold:3, mode:None, n_outliers:10, stddev:3

Dataloader
Description: Checks how many data loaders are running in parallel and whether their total number is equal to the number of available CPU cores. The rule triggers if the number is much smaller or larger than the number of available cores. Too few data loaders can lead to low GPU utilization; too many can impact other compute-intensive operations on the CPU.
Recommendation: Change the number of data loader processes.
Number of times rule triggered: 0
Number of datapoints: 10
Rule parameters: min_threshold:70, max_threshold:200

IOBottleneck
Description: Checks if the data I/O wait time is high and the GPU utilization is low. This can indicate an I/O bottleneck where the GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers if the time spent on I/O bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent.
Recommendation: Pre-fetch data or choose different file formats, such as binary formats, that improve I/O performance.
Number of times rule triggered: 0
Number of datapoints: 4629
Rule parameters: threshold:50, io_threshold:50, gpu_threshold:10, patience:1000

GPUMemoryIncrease
Description: Measures the average GPU memory footprint and triggers if there is a large increase.
Recommendation: Choose a larger instance type with more memory if the footprint is close to the maximum available memory.
Number of times rule triggered: 0
Number of datapoints: 0
Rule parameters: increase:5, patience:1000, window:10

CPUBottleneck
Description: Checks if the CPU utilization is high and the GPU utilization is low. This can indicate a CPU bottleneck where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates and triggers if the time spent on CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent.
Recommendation: Consider increasing the number of data loaders or applying data pre-fetching.
Number of times rule triggered: 0
Number of datapoints: 4629
Rule parameters: threshold:50, cpu_threshold:90, gpu_threshold:10, patience:1000

LoadBalancing
Description: Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. Gradients are accumulated on a primary GPU, and this GPU can be overused relative to the other GPUs, reducing the efficiency of data parallelization.
Recommendation: Choose a different distributed training strategy or a different distributed training framework.
Number of times rule triggered: 0
Number of datapoints: 0
Rule parameters: threshold:0.2, patience:1000
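
The same rule results can also be read back from the training job itself. A minimal sketch, assuming boto3 and a placeholder training job name (the full job name is not included in this report):

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# DescribeTrainingJob returns ProfilerRuleEvaluationStatuses for jobs
# that ran Debugger profiler rules.
desc = sm.describe_training_job(TrainingJobName="your-training-job-name")  # placeholder

for status in desc.get("ProfilerRuleEvaluationStatuses", []):
    print(status["RuleConfigurationName"], status["RuleEvaluationStatus"])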

Analyzing the training loop

Step duration analysis

GPU utilization analysis

Usage per GPU

Workload balancing

Dataloading analysis

Batch size

CPU bottlenecks

I/O bottlenecks

GPU memory