DeepRacer-for-Cloud provides a great way for developers to train DeepRacer models on EC2 (or other cloud compute instances, or even local servers). However, many users have noticed that, unlike the official AWS console, it doesn't provide a friendly web UI showing the current state of training.
While there are some fantastic log analysis notebooks available, these can be a little tricky to set up and often require reloading vast amounts of log data to get a refreshed view of the metrics.
DeepRacer-for-Cloud v5.2.2 is now available and adds an exciting new feature: real-time metrics visualisation using Grafana.
Under the hood this involves creating three new containers for Telegraf, InfluxDB, and Grafana.
The Robomaker simulation workers send the training metrics to Telegraf, which aggregates and stores them in the InfluxDB time-series database. Grafana provides a presentation layer for interactive dashboards.
Getting started
To use this new feature you will need v5.2.2 of DeepRacer-for-Cloud (DRfC) and also the v5.2.2 Robomaker container image.
Updating DeepRacer-for-Cloud
If you're installing DRfC for the first time, it should already download the correct image and templates. If you're upgrading an existing install, you'll need to carry out a few steps:
If you installed DRfC the recommended way by cloning the GitHub repo, do a git pull on the master branch to fetch the latest updates.
To enable real-time metrics you need to add two additional lines to your system.env file:
DR_TELEGRAF_HOST=telegraf
DR_TELEGRAF_PORT=8092
In almost all cases you can paste these in without modifying the values, as the hostname refers to the telegraf container running inside Docker.
If this is your first install, these lines will already be in the file but will need to be uncommented.
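If you prefer to script this step, here is a minimal sketch. The helper name ensure_env_line is made up for this example and is not part of DRfC. It appends a setting only when no uncommented line for that key exists, so running it twice is harmless. Any commented-out copy of the line is left untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DRfC): append KEY=VALUE to an env file
# unless an uncommented KEY= line is already present.
ensure_env_line() {
    local file="$1" key="$2" value="$3"
    # Nothing to do if an uncommented KEY= line is already there.
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        return 0
    fi
    # Make sure the file ends with a newline before appending.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Example, run from the directory that contains system.env:
#   ensure_env_line system.env DR_TELEGRAF_HOST telegraf
#   ensure_env_line system.env DR_TELEGRAF_PORT 8092
```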
Updating the Robomaker container image
First pull the updated container image from DockerHub. Use the cpu or gpu tag as appropriate for your system.
docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-cpu
or
docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-gpu
Then update the DR_ROBOMAKER_IMAGE line in system.env to the tag of the image you just pulled, for example:
DR_ROBOMAKER_IMAGE=5.2.2-cpu
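The same change can be made from the shell. This is only a sketch: set_robomaker_image is a hypothetical helper, not something DRfC ships, and it assumes the setting holds just the image tag (such as 5.2.2-cpu or 5.2.2-gpu). It rewrites the existing line, or appends one if the file has none.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DRfC): set DR_ROBOMAKER_IMAGE in an env file.
set_robomaker_image() {
    local file="$1" tag="$2" tmp
    tmp="$(mktemp)" || return 1
    if grep -q '^DR_ROBOMAKER_IMAGE=' "$file"; then
        # Rewrite the existing line. Going through a temp file avoids
        # "sed -i", which behaves differently on GNU and BSD sed.
        sed "s|^DR_ROBOMAKER_IMAGE=.*|DR_ROBOMAKER_IMAGE=${tag}|" "$file" > "$tmp"
    else
        # No such line yet: copy the file and append the setting.
        cat "$file" > "$tmp"
        printf 'DR_ROBOMAKER_IMAGE=%s\n' "$tag" >> "$tmp"
    fi
    # Write back with cat so the original file keeps its permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example, run from the directory that contains system.env:
#   set_robomaker_image system.env 5.2.2-cpu
```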
Starting the metrics stack
You can then start the metrics containers using dr-start-metrics. (You might need to log in again or reload your shell to pick up the new changes in bin/activate.sh.)
This will start the three new containers. If it's the first time starting the metrics stack then Grafana will need to run some database migrations that can take 30-60 seconds before the web UI is available.
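If you are scripting the startup, you can poll Grafana until it responds rather than waiting a fixed time. The sketch below rests on a few assumptions: curl is installed, Grafana is reachable on port 3000 of the same machine, and its /api/health endpoint is used as the readiness check. wait_for_grafana is a hypothetical helper name, not a DRfC command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DRfC): poll Grafana until it answers.
# Arguments (all optional): URL, number of attempts, seconds between attempts.
wait_for_grafana() {
    local url="${1:-http://localhost:3000/api/health}"
    local tries="${2:-30}" delay="${3:-5}" i=1
    while [ "$i" -le "$tries" ]; do
        # -f makes curl fail on HTTP errors; -s keeps it quiet.
        if curl -fs -o /dev/null "$url"; then
            echo "Grafana is up after ${i} attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "Grafana did not respond at ${url}" >&2
    return 1
}

# Example:
#   dr-start-metrics && wait_for_grafana
```

With the defaults this gives up after roughly two and a half minutes, which leaves headroom over the 30-60 seconds the first-run migrations usually take.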
Collecting metrics
As long as the two Telegraf lines have been added to system.env and you have v5.2.2 of the Robomaker container, all you have to do is start training normally and the metrics will be generated automatically.
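A quick pre-flight check can confirm both conditions before you start training. This is a sketch only: check_metrics_config is a hypothetical helper, and it looks for exactly a 5.2.2 tag, so adjust the pattern if you run a different version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DRfC): verify the settings described above.
# Returns 0 when everything is in place, 1 otherwise.
check_metrics_config() {
    local file="${1:-system.env}" key missing=0
    for key in DR_TELEGRAF_HOST DR_TELEGRAF_PORT; do
        # A commented-out line does not match, so it counts as missing.
        if ! grep -q "^${key}=" "$file" 2>/dev/null; then
            echo "missing or commented out: ${key}" >&2
            missing=1
        fi
    done
    if ! grep -q '^DR_ROBOMAKER_IMAGE=5\.2\.2' "$file" 2>/dev/null; then
        echo "DR_ROBOMAKER_IMAGE is not set to a 5.2.2 tag" >&2
        missing=1
    fi
    return "$missing"
}

# Example, run from the directory that contains system.env:
#   check_metrics_config system.env && echo "ready for metrics"
```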
Using the dashboards
Once the metrics stack is running you should be able to access the Grafana web UI on port 3000 (e.g. http://localhost:3000 if running locally).
Grafana initially starts with an admin user provisioned (username admin, password admin). It will prompt you to choose a new password when you first connect, so you should do this right away.
A template dashboard is provided to show how to access basic DeepRacer training metrics. You can use this dashboard as a base to build your own more customised dashboards.
After connecting to the Grafana web UI with a browser, use the menu to browse to the Dashboards section.
The template dashboard called DeepRacer Training template should be visible, showing graphs of reward, progress, and completed lap times.
As this is an automatically provisioned dashboard you cannot save changes to it. However, you can copy it by clicking the small cog icon to open the dashboard settings page and then clicking Save as to make an editable copy.
Grafana dashboards are interactive: you can hover over data points to see more details, and you can click and drag on a graph panel to zoom in.
You can also change the time range using the selector box at the top right, and choose an auto-refresh period from the selector next to it.
A full user guide on how to work the dashboards is available on the Grafana website.
Currently we record metrics for training and evaluation sessions, such as reward, progress, and average and best lap times, but in the future we'll be adding even more metrics and dashboards.