Bitdeer AI Cloud Python SDK
Overview
The Bitdeer AI Cloud Python SDK provides a simple interface for managing Bitdeer AI Cloud resources and services such as training jobs. It communicates with the server over gRPC and supports creating, retrieving, listing, deleting, suspending, and resuming training jobs, as well as inspecting their workers and logs.
Installation
To install the Bitdeer AI Cloud Python SDK, you can use pip:
pip install bitdeer-ai
Usage of Training Service
Initialization
To interact with the training service, initialize a TrainingClient object with the address of the target host and an API key for authentication.
from bitdeer_ai.training.client import TrainingClient
# Initialize the client
client = TrainingClient(host='api.bitdeer.ai:443', token='API-KEY')
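Hard-coding the API key in source files makes it easy to leak. The following is a minimal sketch of reading the connection settings from environment variables instead. The variable names BITDEER_AI_HOST and BITDEER_AI_API_KEY and the helper load_client_settings are assumptions made for this example; the SDK does not define them.

```python
import os

# Host taken from the initialization example above.
DEFAULT_HOST = "api.bitdeer.ai:443"


def load_client_settings(env=None):
    """Return keyword arguments for TrainingClient read from the environment.

    BITDEER_AI_HOST and BITDEER_AI_API_KEY are hypothetical variable names
    chosen for this sketch; they are not defined by the SDK.
    """
    env = os.environ if env is None else env
    token = env.get("BITDEER_AI_API_KEY")
    if not token:
        raise ValueError("BITDEER_AI_API_KEY is not set")
    return {"host": env.get("BITDEER_AI_HOST", DEFAULT_HOST), "token": token}
```

With the SDK installed, the result can be passed straight to the constructor as TrainingClient(**load_client_settings()).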
Creating a Training Job
To create a training job, use the create_training_job method. Provide parameters such as project_id, region_id, zone_id, job_name, job_type, worker_spec, and num_workers, plus optional parameters such as worker_image, working_dir, volume_name, and volume_mount_path.
from training.training_pb2 import JobType
job = client.create_training_job(
    project_id='your_project_id',
    region_id='your_region_id',
    zone_id='your_zone_id',
    job_name='example_job',
    job_type='your_job_type',
    worker_spec='spec_of_worker',
    num_workers=2,
    worker_image='worker_image_url',
    working_dir='/path/to/working/dir',
    volume_name='volume_name',
    volume_mount_path='/mount/path'
)
print(f'Training job created with ID: {job.training_job_id}')
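Because several of these parameters are optional, it can be convenient to assemble the arguments in one place and leave out any optional value that was not set. The sketch below shows one way to do that. The helper build_job_kwargs is hypothetical, and the check that num_workers is at least 1 is an assumption of this example rather than a documented SDK rule.

```python
def build_job_kwargs(project_id, region_id, zone_id, job_name, job_type,
                     worker_spec, num_workers, **optional):
    """Assemble keyword arguments for create_training_job.

    Optional parameters (worker_image, working_dir, volume_name,
    volume_mount_path, ...) are included only when they are not None.
    """
    # Assumption for this sketch: a job needs at least one worker.
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    kwargs = {
        "project_id": project_id,
        "region_id": region_id,
        "zone_id": zone_id,
        "job_name": job_name,
        "job_type": job_type,
        "worker_spec": worker_spec,
        "num_workers": num_workers,
    }
    # Drop optional parameters that were left unset.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs
```

The resulting dictionary can then be unpacked into the call, as in client.create_training_job(**kwargs).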
Retrieving a Training Job
To retrieve the details of a specific training job, use the get_training_job method with the training_job_id.
job = client.get_training_job(training_job_id='your_training_job_id')
print(f'Job Name: {job.job_name}')
Listing Training Jobs
To list all training jobs, use the list_training_jobs method.
jobs = client.list_training_jobs()
for job in jobs.training_jobs:
    print(f'Job ID: {job.training_job_id}, Job Name: {job.job_name}')
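When a project holds many jobs, the listing is usually filtered on the client side. The sketch below selects jobs by name prefix, relying only on the job_name and training_job_id attributes used in the example above. The helper find_jobs_by_name is hypothetical, and the demo uses SimpleNamespace objects as stand-ins for the entries of jobs.training_jobs.

```python
from types import SimpleNamespace


def find_jobs_by_name(training_jobs, prefix):
    """Return the jobs whose job_name starts with the given prefix."""
    return [job for job in training_jobs if job.job_name.startswith(prefix)]


if __name__ == "__main__":
    # Stand-ins for the entries of jobs.training_jobs.
    sample = [
        SimpleNamespace(training_job_id="1", job_name="example_job"),
        SimpleNamespace(training_job_id="2", job_name="other_job"),
    ]
    for job in find_jobs_by_name(sample, "example"):
        print(f"Job ID: {job.training_job_id}, Job Name: {job.job_name}")
```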
Deleting a Training Job
To delete a specific training job, use the delete_training_job method with the training_job_id.
client.delete_training_job(training_job_id='your_training_job_id')
print('Training job deleted successfully.')
Suspending a Training Job
To suspend an active training job, use the suspend_training_job method with the training_job_id.
client.suspend_training_job(training_job_id='your_training_job_id')
print('Training job suspended successfully.')
Resuming a Training Job
To resume a suspended training job, use the resume_training_job method with the training_job_id.
client.resume_training_job(training_job_id='your_training_job_id')
print('Training job resumed successfully.')
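Suspension and resumption often come in pairs, for example when a job should be paused only for the duration of some other task. One possible way to make sure the job is resumed even if that task fails is a context manager, sketched below. The helper suspended is hypothetical; it uses only the suspend_training_job and resume_training_job methods shown above and assumes both calls return once the request is accepted.

```python
from contextlib import contextmanager


@contextmanager
def suspended(client, training_job_id):
    """Suspend a training job for the duration of a with-block.

    The job is resumed when the block exits, even if it raises.
    """
    client.suspend_training_job(training_job_id=training_job_id)
    try:
        yield
    finally:
        client.resume_training_job(training_job_id=training_job_id)
```

Usage would look like: with suspended(client, 'your_training_job_id'): followed by the work to do while the job is paused.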
Getting Training Job Workers
To get the details of the workers associated with a specific training job, use the get_training_job_workers method with the training_job_id.
workers = client.get_training_job_workers(training_job_id='your_training_job_id')
for worker in workers.workers:
    print(f'Worker Name: {worker.name}')
Getting Training Job Logs
To stream the logs of a specific training job, use the get_training_job_logs method with the training_job_id, the worker_name, and the follow flag.
logs = client.get_training_job_logs(training_job_id='your_training_job_id', worker_name='worker_name', follow=True)
for log in logs:
    print(log)
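With follow=True the loop above presumably keeps running for as long as the job produces output. When only the first few entries are needed, the stream can be cut off on the client side. The sketch below does this with itertools.islice and works on any iterable, so it assumes only that get_training_job_logs returns something that can be iterated, as the loop above suggests. The helper head_logs is hypothetical.

```python
from itertools import islice


def head_logs(log_stream, max_lines):
    """Return at most max_lines entries from a possibly endless log stream."""
    # islice stops pulling from the stream once max_lines entries are read.
    return list(islice(log_stream, max_lines))
```

For example, head_logs(logs, 100) would collect the first 100 log entries and then stop reading.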
Error Handling
The SDK raises exceptions to signal errors:
- RuntimeError: Raised when creating or deleting a training job fails.
Handle these exceptions in your code so that a failed request does not terminate your program unexpectedly.
try:
    job = client.create_training_job(
        project_id='your_project_id',
        job_name='example_job',
        job_type='your_job_type',
        worker_spec='spec_of_worker',
        num_workers=2
    )
except RuntimeError as e:
    print(f'Runtime Error: {e}')
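For transient failures, a caller may prefer to retry instead of only printing the error. The following is a minimal sketch of a retry wrapper with exponential backoff around any call that raises RuntimeError. The helper call_with_retry and its defaults are assumptions made for this example and are not part of the SDK. Whether it is safe to retry a particular request (for example, whether a failed create call may still have created a job) is not stated in this document, so treat it as a pattern to adapt rather than a recommendation.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RuntimeError with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last RuntimeError is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RuntimeError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A deletion could then be wrapped as call_with_retry(lambda: client.delete_training_job(training_job_id='your_training_job_id')).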