User Guide

Monitoring Training Jobs

Understanding Job Statuses and Lifecycle

The typical statuses are as follows:

  • Pending:
    The training job has been submitted and is waiting for available resources. It has not yet started scheduling or container setup.
  • Scheduling:
    The platform is assigning the job to appropriate compute resources. During this stage, the scheduler determines which nodes or machines will run the job.
  • Running:
    The job is actively being executed. Your program is running, GPUs and CPUs are working, and logs/metrics are being generated.
  • Completed:
    The job has finished successfully, and all planned tasks have been executed. You can now review results, logs, and any generated output.
  • Failed:
    The job has ended prematurely due to errors. Review the logs for details and consider making adjustments before retrying.
  • Suspending:
    The platform is in the process of pausing the job. This may happen due to manual intervention or automated triggers.
  • Suspended:
    The job has been successfully paused. No resources are currently being consumed. It can potentially be resumed at a later time.
  • Restarting:
    The job is being restarted, often after adjustments or following a suspension. The platform will go through the necessary preparation steps again before the job returns to the Running state.
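
If you track a job from a script rather than from the listing page, the statuses above can be treated as a small set of states, of which only Completed and Failed are final. The following is a minimal sketch under stated assumptions, not the platform's API: `fetch_status` is a hypothetical callable that you supply and that returns the current status name as a string (for example, by querying whatever interface your deployment offers), and `JobStatus`, `TERMINAL`, and `wait_for_job` are illustrative names.

```python
import time
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    """Status names as listed in this guide."""
    PENDING = "Pending"
    SCHEDULING = "Scheduling"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    SUSPENDING = "Suspending"
    SUSPENDED = "Suspended"
    RESTARTING = "Restarting"


# Assumption: only Completed and Failed are treated as final here.
# A Suspended job may be resumed later, so it is not considered final.
TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED}


def wait_for_job(
    fetch_status: Callable[[], str],  # hypothetical: returns the status name
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
) -> JobStatus:
    """Poll fetch_status until the job reaches a final status."""
    for _ in range(max_polls):
        status = JobStatus(fetch_status())  # raises ValueError on unknown names
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a final status in time")
```

A caller would pass its own status lookup as `fetch_status` and then branch on the result, for example reviewing the logs when the returned value is `JobStatus.FAILED`.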

Details Page

From the listing page, click the name of a training job to view its details.

On the details page, you can find information such as the image, the attached volumes, and the machine type specified for each worker. You can also view the logs from each worker.