This repository was archived by the owner on Feb 3, 2021. It is now read-only.
When provisioning a new cluster with more than 50 nodes, we start to see a large proportion of start task failures (25% or more) when pulling the Docker image from an Azure Container Registry (ACR).
error pulling image configuration: received unexpected HTTP status: 503 The server is busy
I raised a support issue with the ACR team. They said we were being throttled and recommended that we retry the image pull whenever we get a 503.
Note that our image is quite large (~7 GB), which may be why we experience this issue whilst others do not.
Here are some of the ways I think we could mitigate this:
Retry pulling the image when a 503 is returned (as recommended by the ACR team)
Retry the entire start task when any failure occurs (it looks like there is support for this in the Batch SDK?)
Configure the Docker daemon to pull fewer layers in parallel using the max-concurrent-downloads option. I looked at whether this could be done using a plugin, but I think plugins would run too late?
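To make the first option concrete, here is a rough sketch of what a retrying pull could look like. It is only an illustration, not how the start task currently works: the function names, the attempt count, and the delays are all made up, and detecting throttling by looking for "503" in the stderr of docker pull is an assumption based on the error message above.

```python
import random
import subprocess
import time


def is_throttled(stderr: str) -> bool:
    """Assumption: the throttling response is recognisable by '503' in stderr."""
    return "503" in stderr


def pull_with_retry(image, max_attempts=5, base_delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `docker pull`, retrying with exponential backoff when throttled.

    Returns the number of attempts used. `run` and `sleep` are injectable
    (hypothetical hooks) so the retry logic can be exercised without Docker.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(["docker", "pull", image], capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        # Give up on errors that are not throttling, or when out of attempts.
        if not is_throttled(result.stderr) or attempt == max_attempts:
            raise RuntimeError(
                f"docker pull failed after {attempt} attempt(s): "
                f"{result.stderr.strip()}"
            )
        # Exponential backoff plus random jitter, so that many nodes
        # throttled at the same moment do not all retry in lockstep.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay))
```

Something like pull_with_retry("myregistry.azurecr.io/myimage:tag") (a placeholder image name) would then replace the plain pull. The jitter is probably the important part here, since all the nodes start at roughly the same time.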
What do you think the best approach would be? Can you recommend any others?
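For reference, the max-concurrent-downloads option mentioned above is a key in the Docker daemon configuration file (the daemon's default is 3). A minimal sketch of setting it is below. The helper name is hypothetical, /etc/docker/daemon.json is the usual Linux location but is assumed here, and the sketch says nothing about where in node provisioning this would need to run so that it takes effect before the image pull.

```python
import json
from pathlib import Path


def set_max_concurrent_downloads(limit, path="/etc/docker/daemon.json"):
    """Set max-concurrent-downloads in daemon.json, keeping other settings.

    Hypothetical helper; the default path is an assumption for Linux nodes.
    """
    path = Path(path)
    config = {}
    if path.exists():
        text = path.read_text()
        if text.strip():
            # Preserve whatever is already configured.
            config = json.loads(text)
    config["max-concurrent-downloads"] = limit
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

The daemon would then have to be restarted (for example with systemctl restart docker) before the new value applies, and that restart has to happen before the pull.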