Skip to content

Conversation

@jovnc
Copy link
Collaborator

@jovnc jovnc commented Jan 14, 2026

Likely cause:

Race condition in Windows in the following lines of code:

rmtree(PROGRESS_LOCAL_FOLDER_NAME)
clone_with_custom_name(f"{username}/{fork_name}", PROGRESS_LOCAL_FOLDER_NAME)

This is because of issues with shutil.rmtree in Windows due to permissions issue, causing it to retry by changing permissions to write, and trying the delete again.

It is likely gh repo clone continues to run, which fails as progress repository already exists at that point of time. Only afterwards is shutil.rmtree deletes the repository, causing us to land in a state where progress folder is deleted from the student's folder.

Solution:

  • add check to ensure that folder is deleted before we clone, else we can fail this check (since it is guaranteed that progress folder is present, unless student deletes it themselves)
  • if clone fails, we restore local progress.json file as safety precaution

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a race condition on Windows where rmtree and gh repo clone operations can conflict. The issue occurs because shutil.rmtree on Windows may retry deletion by changing permissions, allowing the clone command to start before deletion completes, potentially leaving the system in an inconsistent state where the progress folder is deleted.

Changes:

  • Added a retry mechanism with polling to ensure the progress folder is fully deleted before cloning
  • Added verification after cloning to ensure the operation succeeded, with recovery logic to restore local progress if the clone fails

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


clone_with_custom_name(f"{username}/{fork_name}", PROGRESS_LOCAL_FOLDER_NAME)

# Verify clone succeeded by checking folder exists
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding this as an additional precaution, in case clone fails, so we can re-create the progress folder, ensuring student's are not left in the state where progress doesn't exist at all.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@woojiahao
Copy link
Member

Core reasoning behind race condition: https://bugs.python.org/issue40143 (discovered by @jovnc )

local_progress = json.load(file)

rmtree(PROGRESS_LOCAL_FOLDER_NAME)
os.makedirs(os.path.dirname(local_progress_filepath), exist_ok=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I understand the rationale behind this? Is it because rmtree might not fully delete it on Windows, and as a result, we need to re-create the folder explicitly, rather than allowing open(..., "a") to create it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is to address an existing bug for all OSes. It's more of a separate issue, but I just placed it here for convenience, since it's sort of related.

Image

Idea is because we do a rmtree at the top deletes the progress folder, but below we instantly open it

with open(local_progress_filepath, "a") as progress_file:

This line throws a FileNotFoundError. The problem is that open with "a" mode creates the file if it doesn't exist, but does NOT create the parent directory.

clone_with_custom_name(f"{username}/{fork_name}", PROGRESS_LOCAL_FOLDER_NAME)

# Verify clone succeeded, else restore local progress before failing
if not os.path.exists(PROGRESS_LOCAL_FOLDER_NAME):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, isn't it still possible for this to work even though the deletion still trails?

  1. rmtree fails, retry
  2. Clone runs
  3. Path exists, skip over if
  4. rmtree works, deletes the new folder

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If rmtree keeps failing after 20 retries, we raise an exception at rmtree already and won't proceed.

This check here is more of a safety check to check that clone succeeds rather than for rmtree.

It's more of a safety net, as if clone fails, we can still restore the existing local progress.json file for the user (otherwise if it fails, it will leave users in unexpected state without progress folder).

But I'm also ok with leaving this check out to be honest, as i don't feel its entirely necessary.

from pathlib import Path
from typing import Union

MAX_DELETE_RETRIES = 20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we might stall for 4 seconds in total? Seems like a pretty high upper bound. Do we want to lower it for now? I think a max retry of 5 is somewhat reasonable unless we hit an obvious issue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should be safer with a higher number, I increased it from 10 initially. Rationale is if we set too low a threshold, there is a higher likelihood of failure and we will need to update the app again for students, which we want to avoid.

@jovnc jovnc added the bug Something isn't working label Jan 15, 2026
@jovnc jovnc requested a review from woojiahao January 17, 2026 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants