fix: token cache exits after exhaustive retries #4551

Open

alvarowolfx wants to merge 16 commits into googleapis:main from alvarowolfx:fix-token-cache-poisoned

Conversation

@alvarowolfx (Collaborator)

Towards #4541

@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.02%. Comparing base (a314eee) to head (d380f85).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines    | Patch % | Lines        |
|-----------------------------|---------|--------------|
| src/auth/src/token_cache.rs | 75.00%  | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4551      +/-   ##
==========================================
+ Coverage   95.01%   95.02%   +0.01%     
==========================================
  Files         195      195              
  Lines        7456     7458       +2     
==========================================
+ Hits         7084     7087       +3     
+ Misses        372      371       -1     


@alvarowolfx changed the title from "fix: token cache exists after exaustive retries" to "fix: token cache exits after exaustive retries" on Feb 4, 2026
@alvarowolfx changed the title from "fix: token cache exits after exaustive retries" to "fix: token cache exits after exhaustive retries" on Feb 4, 2026
@alvarowolfx marked this pull request as ready for review February 6, 2026 15:56
@alvarowolfx requested review from a team as code owners February 6, 2026 15:56
@alvarowolfx requested review from coryan and dbolduc February 6, 2026 15:57
Comment on lines 137 to 141
```diff
  Err(err) => {
-     // The retry policy has been used already by the inner token provider.
-     // If it ended in an error, just quit the background task.
-     break;
+     // If it ended in an error, the background task will wait for a while
+     // and try again. This allows the task to eventually recover if the
+     // error was transient but exhausted all the retries.
```
Collaborator

This comment does not read right.

Suggested change
```diff
-Err(err) => {
-    // The retry policy has been used already by the inner token provider.
-    // If it ended in an error, just quit the background task.
-    break;
-    // If it ended in an error, the background task will wait for a while
-    // and try again. This allows the task to eventually recover if the
-    // error was transient but exhausted all the retries.
+Err(err) => {
+    // On transient errors, even if the retry policy is exhausted,
+    // we want to continue running this retry loop.
+    // This loop cannot stop because that may leave the
+    // credentials in an unrecoverable state (see #...).
+    // We considered using a notification to wake up the next time
+    // a caller wants to retrieve a token, but that seemed prone to
+    // deadlocks. We may implement this as an improvement (#....).
+    // On permanent errors, then there is really no point in trying
+    // again, by definition of "permanent". If the error was misclassified
+    // as permanent, that is a bug in the retry policy and better fixed
+    // there than implemented as a workaround here.
```

Which prompts the question: are we certain our default policy does not misclassify some errors as permanent? Can you check?
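For context, a minimal, self-contained sketch of the loop behavior the suggested comment describes, assuming a simple sleep-based backoff: transient errors keep the background task alive, permanent errors end it. The names here (`fetch_token`, `FetchError`, `RETRY_BACKOFF`, `background_refresh_loop`) are illustrative stand-ins, not the actual `token_cache.rs` code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Illustrative error type: the `transient` flag mimics the transient vs.
// permanent classification carried by the credentials errors in this thread.
struct FetchError {
    transient: bool,
}

static ATTEMPTS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical token fetch standing in for the inner token provider, which
// has already applied its own retry policy before surfacing an error. Here
// the first two calls fail with a transient error and the third succeeds.
fn fetch_token() -> Result<String, FetchError> {
    if ATTEMPTS.fetch_add(1, Ordering::SeqCst) < 2 {
        Err(FetchError { transient: true })
    } else {
        Ok("fresh-token".to_string())
    }
}

fn background_refresh_loop() {
    // Illustrative backoff between attempts once the inner retry policy is
    // exhausted; the real code may compute this differently.
    const RETRY_BACKOFF: Duration = Duration::from_millis(100);
    loop {
        match fetch_token() {
            Ok(token) => {
                println!("cached new token: {token}");
                // A real task would sleep until the token nears expiry and
                // loop again; this sketch just stops here.
                break;
            }
            Err(err) if err.transient => {
                // Transient error: keep the loop alive so the cache can
                // recover once the outage clears, instead of exiting and
                // leaving the credentials permanently unusable.
                std::thread::sleep(RETRY_BACKOFF);
            }
            Err(_) => {
                // Permanent error: by definition another attempt cannot
                // help, so quit the background task.
                break;
            }
        }
    }
}

fn main() {
    background_refresh_loop();
}
```

Running this prints the freshly cached token after two transient failures, which is the recovery behavior the suggested comment argues for.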

@alvarowolfx (Collaborator, Author)

The retry loop always maps errors to gax authentication errors, so I think that part is correct, and I don't see any other gax error being raised. If it were not for that I would be worried, because https://github.com/googleapis/google-cloud-rust/pull/4551/changes#diff-43612f24f6413f8e5304d6af3e912553d7913ceea55c203c549db570b52d0d30L133 treats all non-auth errors as permanent.

On the auth side, I did an audit of every place that creates CredentialsError::from_msg(false, ...), CredentialsError::new(false, ...), CredentialsError::from_source(...), or CredentialsError::from_http_response, and they mostly classify as non-transient the errors related to transport and decoding when fetching data from remote services, so I think that's correct.

I could not yet find any case of misclassification, but you raised a valid point.
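For context, a tiny self-contained illustration of the convention being audited here: the boolean passed to constructors such as `CredentialsError::from_msg(false, ...)` marks the error as non-transient, and callers read that flag back to decide whether another attempt could help. `MockCredentialsError` and its `is_transient` accessor are invented stand-ins that only mirror the pattern, not the real google-cloud-rust API.

```rust
// A stand-in mirroring the CredentialsError pattern discussed above: the
// constructor takes the transience flag up front, and callers (such as the
// token-cache retry loop) read it back later. Invented for illustration
// only; not the real google-cloud-rust types.
struct MockCredentialsError {
    transient: bool,
    message: String,
}

impl MockCredentialsError {
    // Mirrors the `CredentialsError::from_msg(bool, ...)` call sites that
    // the audit above enumerates.
    fn from_msg(transient: bool, message: &str) -> Self {
        Self { transient, message: message.to_string() }
    }

    fn is_transient(&self) -> bool {
        self.transient
    }
}

fn main() {
    // One error of each classification; which real failure modes map to
    // which flag is exactly what this review thread is debating.
    let permanent = MockCredentialsError::from_msg(false, "could not decode token response");
    let transient = MockCredentialsError::from_msg(true, "HTTP 503 from the token endpoint");

    for err in [&permanent, &transient] {
        if err.is_transient() {
            println!("worth retrying later: {}", err.message);
        } else {
            println!("giving up: {}", err.message);
        }
    }
}
```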

@alvarowolfx (Collaborator, Author)

I updated the comment and created an issue to track future improvements to the retry loop.

Collaborator

I am not sure the parsing errors are permanent. We sent a request, got a bad response (sometimes the load balancers like to return silly things). The next attempt may succeed. OTOH, maybe the endpoint is configured wrong and we got a nice cocktail recipe 🤷

The error classification is too close to the source; we should let the retry policy decide which errors are transient vs. permanent, instead of hard-coding that decision in each function.
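For context, a hypothetical sketch of the direction suggested here, using invented names (`RawAuthError`, `RetryClassifier`, `DefaultClassifier`) rather than the real gax retry-policy API: the fetch path only reports what happened, and a single policy object decides whether it is transient.

```rust
// Invented types for illustration only; the real gax retry policy may look
// quite different. The point is that classification lives in one policy
// object instead of being hard-coded at every error construction site.
#[derive(Debug)]
enum RawAuthError {
    Http { status: u16 },
    Decode(String),
    Transport(String),
}

trait RetryClassifier {
    fn is_transient(&self, err: &RawAuthError) -> bool;
}

// A single place to change if, say, decode errors should become retryable
// because a load balancer occasionally returns a garbage body.
struct DefaultClassifier;

impl RetryClassifier for DefaultClassifier {
    fn is_transient(&self, err: &RawAuthError) -> bool {
        match err {
            RawAuthError::Http { status } => matches!(*status, 408 | 429 | 500..=599),
            RawAuthError::Transport(_) => true,
            RawAuthError::Decode(_) => false,
        }
    }
}

fn main() {
    let policy = DefaultClassifier;
    let samples = [
        RawAuthError::Http { status: 503 },
        RawAuthError::Decode("not JSON".into()),
        RawAuthError::Transport("connection reset".into()),
    ];
    for err in &samples {
        println!("{err:?} -> transient: {}", policy.is_transient(err));
    }
}
```

Centralizing the decision this way means a misclassification, such as treating a flaky decode error as permanent, can be fixed in one place rather than at every call site.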

@coryan (Collaborator) left a comment

I think this is better than what we have today, so approved. It could also get some improvements.

```diff
 pub(crate) const SAML2_TOKEN_TYPE: &str = "urn:ietf:params:oauth:token-type:saml2";

-pub(crate) const RETRY_EXHAUSTED_ERROR: &str = "All retry attempts to fetch the token were exhausted. Subsequent calls with this credential will also fail.";
+pub(crate) const RETRY_EXHAUSTED_ERROR: &str = "All retry attempts to fetch the token were exhausted with transient errors. Subsequent calls with this credential might succeed.";
```
Collaborator

I find this confusing, but I am tired. Maybe another team member can help you wordsmith this a bit?

