Fix: Mirror URI in manifest /index/files response is set for MA files (#7687) #7693

Open
nadove-ucsc wants to merge 20 commits into develop from
issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files

nadove-ucsc (Contributor) commented Jan 13, 2026

Linked issues: #7687

Checklist

Author

  • PR is assigned to the author
  • Status of PR is In progress
  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • PR is linked to all issues it (partially) resolves
  • Status of linked issues is In progress
  • PR description links to linked issues
  • PR title matches¹ that of a linked issue or comment in PR explains why they're different
  • PR title references all linked issues
  • For each linked issue, there is at least one commit whose title references that issue

¹ When the issue title describes a problem, the corresponding PR title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all linked issues
  • This PR partially resolves each of the linked issues or does not have the partial label

Author (reindex)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod

Author (API changes)

  • This PR and its linked issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any linked issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues linked to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed fixups from prior reviews
  • Ran make requirements_update or this PR does not modify Dockerfile, environment, requirements*.txt, common.mk, Makefile or environment.boot
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome
  • PR is awaiting requested review from a peer
  • Status of PR is Review requested
  • PR is assigned to only the peer and the author

Peer reviewer (after approval)

Note that after requesting changes, the PR must be assigned to only the author.

  • Actually approved the PR
  • PR is not a draft
  • PR is awaiting requested review from system administrator
  • Status of PR is Review requested
  • PR is assigned to only the system administrator and the author

System administrator (after approval)

  • Actually approved the PR
  • Labeled linked issues as demo or no demo
  • Commented on linked issues about demo expectations or all linked issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Status of PR is Approved
  • PR is assigned to only the operator and the author

Operator

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all linked issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub

Operator (deploy .shared and .gitlab components)

  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator and the author or this PR is not labeled deploy:gitlab

System administrator (post-deploy of .gitlab component)

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator and the author

Operator (deploy runner image)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner

Operator (sandbox build)

  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise cause unreferenced indices in sandbox
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise cause unreferenced indices in anvilbox
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev

Operator (merge the branch)

  • All status checks passed and the PR is mergeable
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Pushed merge commit to GitHub
  • Status of PR is Merged lower
  • Status of blocked issues is Triage or no issues are blocked on the linked issues

Operator (main build)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • PR is assigned to only the operator
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev
  • Status of linked issues is Lower, or Triage, if PR is partial

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev
  • Restarted the Data Browser pipeline for the ucsc/hca/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/lungmap/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/anvil/anvildev branch on GitLab in anvildev or this PR does not require reindexing anvildev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in anvildev or this PR does not require reindexing anvildev

Operator (mirroring)

  • Started mirroring in dev or this PR does not require mirroring dev
  • Started mirroring in anvildev or this PR does not require mirroring anvildev
  • Checked for, triaged and possibly requeued messages in mirror fail queue in dev or this PR does not require mirroring dev
  • Checked for, triaged and possibly requeued messages in mirror fail queue in anvildev or this PR does not require mirroring anvildev
  • Emptied mirror fail queue in dev or this PR does not require mirroring dev
  • Emptied mirror fail queue in anvildev or this PR does not require mirroring anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@nadove-ucsc nadove-ucsc self-assigned this Jan 13, 2026
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch 7 times, most recently from 52a4ca7 to d8627ae Compare January 13, 2026 22:28

codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 94.35484% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.83%. Comparing base (fa8d30d) to head (912d50a).

Files with missing lines              Patch %   Lines
test/integration_test.py              0.00%     3 Missing ⚠️
src/azul/indexer/mirror_service.py    87.50%    2 Missing ⚠️
src/azul/plugins/repository/tdr.py    87.50%    1 Missing ⚠️
src/azul/service/source_service.py    95.45%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7693      +/-   ##
===========================================
+ Coverage    84.80%   84.83%   +0.03%     
===========================================
  Files          157      157              
  Lines        23067    23124      +57     
===========================================
+ Hits         19561    19617      +56     
- Misses        3506     3507       +1     

coveralls commented Jan 13, 2026

Coverage Status

coverage: 85.052% (+0.03%) from 85.02%
when pulling 912d50a on issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files
into fa8d30d on develop.

@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch 3 times, most recently from 39b3cd7 to 87d7dea Compare January 16, 2026 04:43
@hannes-ucsc hannes-ucsc removed the request for review from dsotirho-ucsc January 16, 2026 18:55
@hannes-ucsc hannes-ucsc marked this pull request as ready for review January 16, 2026 18:55
@hannes-ucsc hannes-ucsc self-requested a review as a code owner January 16, 2026 18:55
Comment on lines 665 to 666
Implementations should raise PermissionError if the provided
authentication is insufficient to access the repository.
Member

"Implementations" implies "implementations of this method". No authentication is provided to those implementations, so I don't understand this sentence.

Comment on lines 138 to 139
all_sources = set()
public_sources = set()
Member

We typically use tuple assignments when they correlate like this.
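
A minimal sketch of the tuple-assignment style this comment refers to, assuming the two sets are filled together in one loop. The `sources` list and its public flag are hypothetical and exist only to make the snippet runnable; they are not taken from the PR.

```python
# Hypothetical input: (source name, is_public) pairs, for illustration only
sources = [('source_a', True), ('source_b', False), ('source_c', True)]

# Two correlated accumulators initialized in a single tuple assignment.
# Each call to set() creates a distinct object, so the sets stay independent.
all_sources, public_sources = set(), set()

for name, is_public in sources:
    all_sources.add(name)
    if is_public:
        public_sources.add(name)
```

The single statement keeps the two initializations visually paired, which signals to the reader that the variables are meant to be read together.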

return int(time())

@cache
def _configured_sources(self) -> Mapping[str, AbstractSet[SourceRef]]:
Member

The return type hint is a text-book case for a typed dict.
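
As a rough illustration of this suggestion, the sketch below replaces a `Mapping[str, AbstractSet[...]]` return type with a `TypedDict`. The class name `ConfiguredSources`, the keys `all` and `public`, and the use of `str` in place of `SourceRef` are all assumptions; the actual keys of the mapping are not shown in this thread.

```python
from typing import AbstractSet, TypedDict


class ConfiguredSources(TypedDict):
    # Hypothetical keys; a type checker can now verify every key access
    all: AbstractSet[str]
    public: AbstractSet[str]


def configured_sources() -> ConfiguredSources:
    # Placeholder data standing in for the configured sources
    all_sources, public_sources = {'a', 'b'}, {'a'}
    return ConfiguredSources(all=all_sources, public=public_sources)
```

Compared with a plain `Mapping[str, ...]`, the typed dict documents which keys exist and lets a type checker flag a misspelled key.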

@hannes-ucsc hannes-ucsc added the 1 review [process] Lead requested changes once label Jan 20, 2026
hannes-ucsc (Member) left a comment

Also, please add unit test coverage for the newly uncovered lines.

https://app.codecov.io/gh/DataBiosphere/azul/pull/7693/blob/src/azul/service/source_service.py

@hannes-ucsc hannes-ucsc removed their assignment Jan 20, 2026
@nadove-ucsc
Contributor Author

The bot is right, which raises the question as to why make pep8 didn't catch this.

It wasn't caught because $(project_root)/lambdas/{indexer,service}/vendor/ is not included in the paths passed to flake8. It also wasn't detected by PyCharm (not grayed out in the editor) because those directories are explicitly marked as ignored.

@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch from 87d7dea to 7464fca Compare January 21, 2026 02:20
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch from 7464fca to 73ab983 Compare January 21, 2026 02:23
Comment on lines +471 to +473
Specifically, this means that subclasses may not add fields without a
default or modify whether a field is initialized via a keyword-only or
positional-only argument.
Member

Suggested change
- Specifically, this means that subclasses may not add fields without a
- default or modify whether a field is initialized via a keyword-only or
- positional-only argument.
+ For example, this means that subclasses may not add fields without a
+ default value or modify the constructor to accept positional arguments.

Contributor Author

Modifying the constructor to accept positional arguments would not prevent the constructor from being invoked with keyword arguments. I maintain that my original wording is more accurate and precise.

"""
List source IDs in the underlying repository that are accessible using
the provided authentication. May require a roundtrip to the underlying
repository, but results are cached in DynamoDB for up to 1 minute.
Member

Suggested change
- repository, but results are cached in DynamoDB for up to 1 minute.
+ repository, but results are cached in DynamoDB for a short time.

Implementation detail. It's actually 5 min now.

}

@property
def configured_sources(self) -> AbstractSet[SourceRef]:
Member

Let's remove this for now. Let's also remove the outer dictionary in sources.json for now, and the TypedDict, and let's rename sources.json to public_sources.json.

I think we should organize the contents of sources.json by catalog. Then we can use configured_public_sources in list_accessible_sources if authentication is None. We can also make this method protected and have the mirror service use the public list_accessible_sources instead. This would create infinite recursion outside of a Lambda context. I think that can be broken by having list_accessible_sources and _list_accessible_sources. I can elaborate in PL.

Somewhat unrelated: Please add a log statement to the auth fallback for the case when it falls back to no auth. I want to be able to tell from the logs how frequently that occurs.

I also have the feeling that a set of SourceRef instances (what this method returns) doesn't work as one would expect. I think it can contain two SourceRef instances with the same source_id.

In essence what this PR then effectively does is cache public sources for much longer and ensure that the plugin layer restricts the set of sources to those configured.

In another PR we can add back the functionality of this removed method, and use it in list_accessible_source_ids() to further subset the return value.

Another point of tension that we need to address in the long run (not in this PR) is that post_deploy and outsourcing of sources do similar things.

Contributor Author

Let's remove this for now. Let's also remove the outer dictionary in sources.json for now, and the TypedDict, and let's rename sources.json to public_sources.json.

Agreed to remove this method since it is unused in this PR. It will be trivial to re-add it later.

I think we should organize the contents of sources.json by catalog. Then we can use configured_public_sources in list_accessible_sources if authentication is None. We can also make this method protected and have the mirror service use the public list_accessible_sources instead. This would create infinite recursion outside of a Lambda context. I think that can be broken by having list_accessible_sources and _list_accessible_sources. I can elaborate in PL.

Agreed to postpone organizing by catalog and unifying list_accessible_sources with configured_public_sources.

Somewhat unrelated: Please add a log statement to the auth fallback for the case when it falls back to no auth. I want to be able to tell from the logs how frequently that occurs.

This is easy, I can do this.

I also have the feeling that a set of SourceRef instances (what this method returns) doesn't work as one would expect. I think it can contain two SourceRef instances with the same source_id.

Agreed to change return type annotation to Iterable[SourceRef] and skip the conversion to set.

In essence what this PR then effectively does is cache public sources for much longer and ensure that the plugin layer restricts the set of sources to those configured.

In another PR we can add back the functionality of this removed method, and use it in list_accessible_source_ids() to further subset the return value.

Another point of tension that we need to address in the long run (not in this PR) is that post_deploy and outsourcing of sources do similar things.

Contributor Author

Agreed to change return type annotation to Iterable[SourceRef] and skip the conversion to set.

Something I missed was that it is necessary to store the sources in a set when preparing for outsourcing to avoid duplicate entries in cases where the same source occurs in multiple catalogs (e.g. dcp55 and dcp56).

Two sources with the same ID can coexist in a set only if they differ by some other attribute (e.g. have different prefixes). Therefore using a set here is still very useful at eliminating duplicates, which would otherwise be very common.
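
The set behavior described above can be sketched as follows. The `SourceRef` class here is a hypothetical stand-in with only an `id` and a `prefix` field, assuming value-based equality and hashing; the project's real class has more to it.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the project's SourceRef. A frozen dataclass
# compares and hashes by field values.
@dataclass(frozen=True)
class SourceRef:
    id: str
    prefix: str


refs = {
    SourceRef(id='s1', prefix='42'),  # e.g. the source as listed in one catalog
    SourceRef(id='s1', prefix='42'),  # the same source listed in another catalog
    SourceRef(id='s1', prefix='4'),   # same ID, but a different prefix
}
```

The two identical references collapse into one element, while the reference with a different prefix remains, so the set holds two elements that share the same `id`.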

@hannes-ucsc hannes-ucsc removed their assignment Feb 5, 2026
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch from 3ebeec1 to c41db2c Compare February 7, 2026 02:00
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch 2 times, most recently from beb364a to e8b472a Compare February 7, 2026 02:16
@hannes-ucsc hannes-ucsc force-pushed the issues/nadove-ucsc/7687-mirror-uri-set-for-ma-files branch from e8b472a to 75108db Compare February 9, 2026 05:41
Labels

2 reviews: [process] Lead requested changes twice
reqs: [process] PR includes commit requiring ``make requirements``

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix: Mirror URI in manifest /index/files response is set for MA files

4 participants