fix: token cache exits after exhaustive retries #4551

Open

alvarowolfx wants to merge 16 commits into googleapis:main from alvarowolfx:fix-token-cache-poisoned

Conversation

@alvarowolfx (Collaborator)

Towards #4541

@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.02%. Comparing base (a314eee) to head (d380f85).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines    | Patch % | Lines        |
|-----------------------------|---------|--------------|
| src/auth/src/token_cache.rs | 75.00%  | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4551      +/-   ##
==========================================
+ Coverage   95.01%   95.02%   +0.01%     
==========================================
  Files         195      195              
  Lines        7456     7458       +2     
==========================================
+ Hits         7084     7087       +3     
+ Misses        372      371       -1     


@alvarowolfx changed the title from "fix: token cache exists after exaustive retries" to "fix: token cache exits after exaustive retries" on Feb 4, 2026
@alvarowolfx changed the title from "fix: token cache exits after exaustive retries" to "fix: token cache exits after exhaustive retries" on Feb 4, 2026
@alvarowolfx marked this pull request as ready for review February 6, 2026 15:56
@alvarowolfx requested review from a team as code owners February 6, 2026 15:56
@alvarowolfx requested review from coryan and dbolduc February 6, 2026 15:57
Comment on lines 137 to 141
```diff
  Err(err) => {
-     // The retry policy has been used already by the inner token provider.
-     // If it ended in an error, just quit the background task.
-     break;
+     // If it ended in an error, the background task will wait for a while
+     // and try again. This allows the task to eventually recover if the
+     // error was transient but exhausted all the retries.
```
Collaborator

This comment does not read right.

Suggested change
```diff
-Err(err) => {
-    // The retry policy has been used already by the inner token provider.
-    // If it ended in an error, just quit the background task.
-    break;
-    // If it ended in an error, the background task will wait for a while
-    // and try again. This allows the task to eventually recover if the
-    // error was transient but exhausted all the retries.
+Err(err) => {
+    // On transient errors, even if the retry policy is exhausted,
+    // we want to continue running this retry loop.
+    // This loop cannot stop because that may leave the
+    // credentials in an unrecoverable state (see #...).
+    // We considered using a notification to wake up the next time
+    // a caller wants to retrieve a token, but that seemed prone to
+    // deadlocks. We may implement this as an improvement (#....).
+    // On permanent errors, then there is really no point in trying
+    // again, by definition of "permanent". If the error was misclassified
+    // as permanent, that is a bug in the retry policy and better fixed
+    // there than implemented as a workaround here.
```

Which prompts the question: are we certain our default policy does not misclassify some errors as permanent? Can you check?
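For context, a minimal, self-contained sketch of the loop behavior the suggested comment describes, assuming a simple sleep-based backoff: transient errors keep the background task alive, permanent errors end it. The names here (`fetch_token`, `FetchError`, `RETRY_BACKOFF`, `background_refresh_loop`) are illustrative stand-ins, not the actual `token_cache.rs` code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Illustrative error type: the `transient` flag mimics the transient vs.
// permanent classification carried by the credentials errors in this thread.
struct FetchError {
    transient: bool,
}

static ATTEMPTS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical token fetch standing in for the inner token provider, which
// has already applied its own retry policy before surfacing an error. Here
// the first two calls fail with a transient error and the third succeeds.
fn fetch_token() -> Result<String, FetchError> {
    if ATTEMPTS.fetch_add(1, Ordering::SeqCst) < 2 {
        Err(FetchError { transient: true })
    } else {
        Ok("fresh-token".to_string())
    }
}

fn background_refresh_loop() {
    // Illustrative backoff between attempts once the inner retry policy is
    // exhausted; the real code may compute this differently.
    const RETRY_BACKOFF: Duration = Duration::from_millis(100);
    loop {
        match fetch_token() {
            Ok(token) => {
                println!("cached new token: {token}");
                // A real task would sleep until the token nears expiry and
                // loop again; this sketch just stops here.
                break;
            }
            Err(err) if err.transient => {
                // Transient error: keep the loop alive so the cache can
                // recover once the outage clears, instead of exiting and
                // leaving the credentials permanently unusable.
                std::thread::sleep(RETRY_BACKOFF);
            }
            Err(_) => {
                // Permanent error: by definition another attempt cannot
                // help, so quit the background task.
                break;
            }
        }
    }
}

fn main() {
    background_refresh_loop();
}
```

Running this prints the freshly cached token after two transient failures, which is the recovery behavior the suggested comment argues for.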

@alvarowolfx (Collaborator, Author)

The retry loop always maps errors to gax authentication errors, so I think that part is correct, and I don't see any other gax error being raised. If it were not for that I would be worried, because https://github.com/googleapis/google-cloud-rust/pull/4551/changes#diff-43612f24f6413f8e5304d6af3e912553d7913ceea55c203c549db570b52d0d30L133 treats all non-auth errors as permanent.

On the auth side, I did an audit of every place that creates CredentialsError::from_msg(false, ...), CredentialsError::new(false, ...), CredentialsError::from_source(...), or CredentialsError::from_http_response, and they mostly classify as non-transient the errors related to transport and decoding when fetching data from remote services, so I think that's correct.

I could not yet find any case of misclassification, but you raised a valid point.
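For context, a tiny self-contained illustration of the convention being audited here: the boolean passed to constructors such as `CredentialsError::from_msg(false, ...)` marks the error as non-transient, and callers read that flag back to decide whether another attempt could help. `MockCredentialsError` and its `is_transient` accessor are invented stand-ins that only mirror the pattern, not the real google-cloud-rust API.

```rust
// A stand-in mirroring the CredentialsError pattern discussed above: the
// constructor takes the transience flag up front, and callers (such as the
// token-cache retry loop) read it back later. Invented for illustration
// only; not the real google-cloud-rust types.
struct MockCredentialsError {
    transient: bool,
    message: String,
}

impl MockCredentialsError {
    // Mirrors the `CredentialsError::from_msg(bool, ...)` call sites that
    // the audit above enumerates.
    fn from_msg(transient: bool, message: &str) -> Self {
        Self { transient, message: message.to_string() }
    }

    fn is_transient(&self) -> bool {
        self.transient
    }
}

fn main() {
    // One error of each classification; which real failure modes map to
    // which flag is exactly what this review thread is debating.
    let permanent = MockCredentialsError::from_msg(false, "could not decode token response");
    let transient = MockCredentialsError::from_msg(true, "HTTP 503 from the token endpoint");

    for err in [&permanent, &transient] {
        if err.is_transient() {
            println!("worth retrying later: {}", err.message);
        } else {
            println!("giving up: {}", err.message);
        }
    }
}
```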

@alvarowolfx (Collaborator, Author)

I updated the comment and created an issue to track future improvements to the retry loop.

Collaborator

I am not sure the parsing errors are permanent. We sent a request, got a bad response (sometimes the load balancers like to return silly things). The next attempt may succeed. OTOH, maybe the endpoint is configured wrong and we got a nice cocktail recipe 🤷

The error classification is too close to the source; we should let the retry policy decide which errors are transient vs. permanent, instead of hard-coding that decision in each function.
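For context, a hypothetical sketch of the direction suggested here, using invented names (`RawAuthError`, `RetryClassifier`, `DefaultClassifier`) rather than the real gax retry-policy API: the fetch path only reports what happened, and a single policy object decides whether it is transient.

```rust
// Invented types for illustration only; the real gax retry policy may look
// quite different. The point is that classification lives in one policy
// object instead of being hard-coded at every error construction site.
#[derive(Debug)]
enum RawAuthError {
    Http { status: u16 },
    Decode(String),
    Transport(String),
}

trait RetryClassifier {
    fn is_transient(&self, err: &RawAuthError) -> bool;
}

// A single place to change if, say, decode errors should become retryable
// because a load balancer occasionally returns a garbage body.
struct DefaultClassifier;

impl RetryClassifier for DefaultClassifier {
    fn is_transient(&self, err: &RawAuthError) -> bool {
        match err {
            RawAuthError::Http { status } => matches!(*status, 408 | 429 | 500..=599),
            RawAuthError::Transport(_) => true,
            RawAuthError::Decode(_) => false,
        }
    }
}

fn main() {
    let policy = DefaultClassifier;
    let samples = [
        RawAuthError::Http { status: 503 },
        RawAuthError::Decode("not JSON".into()),
        RawAuthError::Transport("connection reset".into()),
    ];
    for err in &samples {
        println!("{err:?} -> transient: {}", policy.is_transient(err));
    }
}
```

Centralizing the decision this way means a misclassification, such as treating a flaky decode error as permanent, can be fixed in one place rather than at every call site.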

@coryan (Collaborator) left a comment

I think this is better than what we have today, so approved. It could also get some improvements.

```diff
 pub(crate) const SAML2_TOKEN_TYPE: &str = "urn:ietf:params:oauth:token-type:saml2";

-pub(crate) const RETRY_EXHAUSTED_ERROR: &str = "All retry attempts to fetch the token were exhausted. Subsequent calls with this credential will also fail.";
+pub(crate) const RETRY_EXHAUSTED_ERROR: &str = "All retry attempts to fetch the token were exhausted with transient errors. Subsequent calls with this credential might succeed.";
```
Collaborator

I find this confusing, but I am tired. Maybe another team member can help you wordsmith this a bit?

