fix: token cache exits after exhaustive retries #4551
alvarowolfx wants to merge 16 commits into googleapis:main
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main    #4551   +/-   ##
==========================================
+ Coverage   95.01%   95.02%   +0.01%
==========================================
  Files         195      195
  Lines        7456     7458       +2
==========================================
+ Hits         7084     7087       +3
+ Misses        372      371       -1
src/auth/src/token_cache.rs (outdated)

      Err(err) => {
-         // The retry policy has been used already by the inner token provider.
-         // If it ended in an error, just quit the background task.
-         break;
+         // If it ended in an error, the background task will wait for a while
+         // and try again. This allows the task to eventually recover if the
+         // error was transient but exhausted all the retries.
This comment does not read right.
Suggested change:

      Err(err) => {
-         // The retry policy has been used already by the inner token provider.
-         // If it ended in an error, just quit the background task.
-         break;
-         // If it ended in an error, the background task will wait for a while
-         // and try again. This allows the task to eventually recover if the
-         // error was transient but exhausted all the retries.
+         // On transient errors, even if the retry policy is exhausted,
+         // we want to continue running this retry loop.
+         // This loop cannot stop because that may leave the
+         // credentials in an unrecoverable state (see #...).
+         // We considered using a notification to wake up the next time
+         // a caller wants to retrieve a token, but that seemed prone to
+         // deadlocks. We may implement this as an improvement (#....).
+         // On permanent errors, there is really no point in trying
+         // again, by definition of "permanent". If the error was misclassified
+         // as permanent, that is a bug in the retry policy and better fixed
+         // there than implemented as a workaround here.
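The behavior the suggested comment describes can be sketched as below. This is a minimal, self-contained sketch, not the crate's actual API: `FetchError`, `fetch_token`, and `refresh_loop` are illustrative names, and the `transient` flag stands in for `CredentialsError`'s is-transient classification.

```rust
use std::time::Duration;

// Illustrative error type; `transient` mirrors the is-transient flag
// carried by the real CredentialsError.
#[derive(Debug)]
struct FetchError {
    transient: bool,
}

// Simulated token fetch: transient failures on the first two attempts,
// then success. Stands in for the inner token provider, whose own retry
// policy may already be exhausted by the time we see the error.
fn fetch_token(attempt: u32) -> Result<String, FetchError> {
    if attempt < 2 {
        Err(FetchError { transient: true })
    } else {
        Ok("token".to_string())
    }
}

// The refresh loop per the suggested comment: transient errors keep the
// loop alive (wait, then try again); only a permanent error ends the task.
fn refresh_loop(backoff: Duration) -> Option<String> {
    let mut attempt = 0;
    loop {
        match fetch_token(attempt) {
            Ok(token) => return Some(token),
            Err(err) if err.transient => {
                // Keep running so the cache can recover once the
                // transient condition clears.
                std::thread::sleep(backoff);
                attempt += 1;
            }
            // Permanent: by definition, retrying cannot help.
            Err(_) => return None,
        }
    }
}

fn main() {
    assert_eq!(
        refresh_loop(Duration::from_millis(1)),
        Some("token".to_string())
    );
    println!("recovered after transient failures");
}
```

The key difference from the old code is that the transient arm no longer contains a `break`: exhausting the inner retries only delays the next refresh attempt instead of killing the background task.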
Which prompts the question: are we certain our default policy does not misclassify some errors as permanent? Can you check?
The retry loop always maps errors to gax authentication errors, so I think that part is correct, and I don't see any other gax error being raised. If it were not for that I would be worried, because https://github.com/googleapis/google-cloud-rust/pull/4551/changes#diff-43612f24f6413f8e5304d6af3e912553d7913ceea55c203c549db570b52d0d30L133 treats all non-auth errors as permanent.
On the auth side, I audited every place that creates CredentialsError::from_msg(false, ...), CredentialsError::new(false, ...), CredentialsError::from_source(...), or CredentialsError::from_http_response, and they mostly classify as non-transient the errors related to transport and decoding when fetching data from remote services, so I think that's correct.
I could not find any case of misclassification yet, but you raised a valid point.
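The classification described in this audit can be summarized in a small sketch. All names here are illustrative, not the crate's real types; the boolean mirrors the first argument of `CredentialsError::from_msg(is_transient, ...)`, and which HTTP statuses count as retryable is an assumption of this sketch, not something stated in the thread.

```rust
// Illustrative failure taxonomy for a token fetch.
#[allow(dead_code)]
#[derive(Debug)]
enum FailureKind {
    Transport,       // could not reach the token endpoint
    Decode,          // response body could not be parsed
    HttpStatus(u16), // the service answered with an error status
}

// Per the audit above: transport and decoding failures when fetching
// data from remote services are created as non-transient (permanent).
// The retryable-status list (408/429/5xx) is an assumption for this sketch.
fn is_transient(kind: &FailureKind) -> bool {
    match kind {
        FailureKind::Transport | FailureKind::Decode => false,
        FailureKind::HttpStatus(status) => matches!(*status, 408 | 429 | 500..=599),
    }
}

fn main() {
    assert!(!is_transient(&FailureKind::Decode));
    assert!(is_transient(&FailureKind::HttpStatus(503)));
    assert!(!is_transient(&FailureKind::HttpStatus(404)));
    println!("decode errors are currently permanent; 503 stays transient");
}
```

Note that the `Decode => false` arm is exactly the classification questioned later in the thread: a parse failure may be a one-off bad response rather than a permanent condition.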
I updated the comment and created an issue to track future improvements to the retry loop.
I am not sure the parsing errors are permanent. We sent a request, got a bad response (sometimes the load balancers like to return silly things). The next attempt may succeed. OTOH, maybe the endpoint is configured wrong and we got a nice cocktail recipe 🤷
The error classification is too close to the source, we should let the retry policy decide what errors are transient vs. permanent, instead of hard-coding that decision in each function.
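The decoupling suggested here can be sketched as a pluggable policy: the fetch path reports only the raw cause of a failure, and the retry policy, not each call site, decides what is retryable. All names below (`RawCause`, `TransiencePolicy`, `LenientPolicy`) are hypothetical, invented for this sketch.

```rust
// Raw failure causes reported by the fetch path; classification is
// deliberately NOT encoded here.
#[allow(dead_code)]
#[derive(Debug)]
enum RawCause {
    Transport,
    MalformedResponse,
}

// The policy, not the function that produced the error, decides
// transient vs. permanent.
trait TransiencePolicy {
    fn is_transient(&self, cause: &RawCause) -> bool;
}

// A lenient policy: even malformed responses are worth retrying, since a
// misbehaving load balancer may answer correctly on the next attempt.
struct LenientPolicy;

impl TransiencePolicy for LenientPolicy {
    fn is_transient(&self, _cause: &RawCause) -> bool {
        true
    }
}

fn main() {
    let policy = LenientPolicy;
    assert!(policy.is_transient(&RawCause::MalformedResponse));
    println!("malformed responses stay retryable under LenientPolicy");
}
```

With this shape, changing how a parse failure is treated becomes a policy swap rather than an edit to every `CredentialsError` construction site.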
coryan left a comment
I think this is better than what we have today, so approved. It could also get some improvements.
      pub(crate) const SAML2_TOKEN_TYPE: &str = "urn:ietf:params:oauth:token-type:saml2";

-     pub(crate) const RETRY_EXHAUSTED_ERROR: &str = "All retry attempts to fetch the token were exhausted. Subsequent calls with this credential will also fail.";
+     pub(crate) const RETRY_EXHAUSTED_ERROR: &str = "All retry attempts to fetch the token were exhausted with transient errors. Subsequent calls with this credential might succeed.";
I find this confusing, but I am tired. Maybe another team member can help you wordsmith this a bit?
Towards #4541