feat(aci milestone 3): delete data in Seer when AD detector is deleted #101843
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##           master   #101843    +/-   ##
===========================================
+ Coverage   76.26%    80.67%    +4.40%
===========================================
  Files        9200      9206        +6
  Lines      393001    393451      +450
  Branches    25000     25000
===========================================
+ Hits       299740    317413    +17673
+ Misses      92836     75613    -17223
  Partials      425       425
saponifi3d left a comment
Looks like there's a bit of code we can clean up before merging; generally, the setup for the hooks etc. all looks good though!
    return True


def delete_rule_in_seer_legacy(alert_rule: "AlertRule") -> bool:
Could this method just invoke `return delete_rule_in_seer(alert_rule.id)`? (That way we can reduce code duplication.)
and a suggested change:

- def delete_rule_in_seer_legacy(alert_rule: "AlertRule") -> bool:
+ def delete_rule_in_seer_legacy(alert_rule: AlertRule) -> bool:
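Purely as a sketch, the delegation being suggested would look something like this (illustrative only; the thread below notes the two paths send slightly different payloads to Seer, and the legacy helper was eventually removed):

```python
# Illustrative sketch of the reviewer's suggestion: the legacy entry point
# delegates to the shared helper so only one code path talks to Seer.
def delete_rule_in_seer_legacy(alert_rule: AlertRule) -> bool:
    return delete_rule_in_seer(alert_rule.id)
```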
The two methods are slightly different in what they pass to Seer. I think we've done this code duplication for all anomaly detection methods so we can easily delete the legacy code.
Since we need to maintain this code in the future, it might be a good time to decompose the method and reuse what we can. That would also mean we don't need to maintain two code paths while we're waiting to remove the legacy code; instead we could take the opportunity to make this a bit easier to manage, and a lot easier to debug any issues in the meantime.
The way I tend to decompose things with two examples is to look at their differences and determine where it would make the most sense to expose new methods. For example, maybe we could wrap the send-to-Seer call and error handling in a method, and then delete_rule_in_seer_legacy could compose those shared methods and make any tweaks it might need.
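A rough sketch of that decomposition; every name below is assumed for illustration and is not the actual Sentry code:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def _delete_in_seer(payload: dict[str, Any], log_extra: dict[str, Any]) -> bool:
    """Shared "send to Seer and handle errors" step (hypothetical).

    `_post_to_seer` stands in for whatever signed Seer request helper the
    module really uses.
    """
    try:
        response = _post_to_seer(payload)
    except Exception:
        logger.exception("Error deleting alert data in Seer", extra=log_extra)
        return False
    if not response.get("success"):
        logger.error("Delete request to Seer was unsuccessful", extra=log_extra)
        return False
    return True


def delete_rule_in_seer(source_id: int) -> bool:
    # Detector-based path (payload/field names assumed).
    return _delete_in_seer({"source_id": source_id}, {"source_id": source_id})


def delete_rule_in_seer_legacy(alert_rule) -> bool:
    # Legacy path: slightly different payload, same transport and error handling.
    return _delete_in_seer(
        {"alert_rule_id": alert_rule.id}, {"rule_id": alert_rule.id}
    )
```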
Actually, I think we can delete the legacy code outright if every alert rule has a detector. The call only needs to happen once.
I see your point about decomp, will look into it.
saponifi3d left a comment
I think the biggest feedback would be to make delete_rule_in_seer a little more debug friendly; we can either decompose the method and share it between alert rule / detector deletion, or we can update the logs so we can easily differentiate (I think that'd just be changing rule to detector).
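For the logging option, a minimal sketch of what differentiating the two callers could look like (field and helper names assumed):

```python
import logging

logger = logging.getLogger(__name__)


def log_seer_delete_failure(*, detector_id: int | None = None, rule_id: int | None = None) -> None:
    """Hypothetical helper: a single log call whose message and extra data make
    it obvious whether the detector path or the legacy alert-rule path failed."""
    target = "detector" if detector_id is not None else "rule"
    logger.error(
        "Request to delete %s data from Seer was unsuccessful",
        target,
        extra={"detector_id": detector_id, "rule_id": rule_id},
    )
```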
@saponifi3d removed the legacy code. Let me know if you think we should still break down the error handling into a separate method.
src/sentry/incidents/logic.py
Outdated

    extra={
        "rule_id": alert_rule.id,
    },
    try:
When are we still hitting this code path? Is it only for people using the legacy API? If so, don't we need to call this on delete too?
I put the delete code in the detector lifecycle hook, so any time a detector is deleted the code will be called. If all alert rules are dual written, then we don't need the call in multiple places.
I didn't hook deletion into updates, however, so this is for users of the legacy API. Good callout that I should create an update hook as well.
    )


    @classmethod
    def delete_data_in_seer(cls, instance: AlertRule, **kwargs: Any) -> None:
Since we now require the source id (the query subscription id) instead of the rule id, we can't execute this in post_delete because the query subscription is deleted by that point. Instead we call it in logic.py's delete_alert_rule method.
The post_delete is used for cascade deletions (like when a project gets deleted), during which the code in logic.py will not be executed 😭 I don't think we can take this out of here
I guess it doesn't matter if Seer has a cleanup task, though. EDIT: since we're overriding the detector delete method, we don't need anything for the alert rule at all, so this is good to remove.
    incidents = Incident.objects.filter(alert_rule=alert_rule)
    if incidents.exists():
        if alert_rule.detection_type == AlertRuleDetectionType.DYNAMIC:
We're now deleting the rule in Seer regardless of whether an incident exists. Previously we had it behind that condition because we were deleting it in the post_delete signal, and due to the order of deletions there would still be incidents after the rule was deleted, so this was the only way we could make sure a dynamic alert rule with incidents sent the data to Seer.
If we're overriding the delete() method on the detector, I don't think we need any logic to do it for the alert rule (given that every alert rule has a corresponding detector). Let's take this out.
I have it here for the legacy API use case - if they hit this directly, it won't hit the detector validator code path, right? Worst case, Seer gets two requests to delete the same data, but the second one would just fail because it's already gone.
Okay, I see that we changed the validator and not the model itself, in which case this should stay.
        },
    )
    return
Bug: The delete_data_in_seer_for_detector() function incompletely deletes data for detectors with multiple data sources.
Severity: CRITICAL | Confidence: 1.00
🔍 Detailed Analysis
If a detector has multiple associated data sources, the delete_data_in_seer_for_detector() function will only clean up the first one from Seer due to its use of .first(). The remaining data sources will have orphaned data in Seer, causing data bloat and inconsistent state where some detector data remains after deletion, despite the operation appearing to succeed.
💡 Suggested Fix
Modify delete_data_in_seer_for_detector() to iterate over all DataSourceDetector objects associated with the detector, rather than using .first(), to ensure all related data in Seer is properly cleaned up.
Location: src/sentry/seer/anomaly_detection/delete_rule.py#L40
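A sketch of that suggested fix, assuming the relationships implied by the report (a DataSourceDetector row linking the detector to each data source, with a source_id that Seer keys on); the reply below explains why it wasn't needed:

```python
def delete_data_in_seer_for_detector(detector) -> bool:
    """Hypothetical variant: clean up Seer for every linked data source
    instead of only the first one."""
    success = True
    links = DataSourceDetector.objects.filter(detector=detector).select_related("data_source")
    for link in links:
        # delete_rule_in_seer is used as a stand-in for the module's
        # per-source Seer deletion; its exact signature is assumed.
        if not delete_rule_in_seer(int(link.data_source.source_id)):
            success = False
    return success
```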
There are no detectors w/ multiple data sources
    Delete accompanying data in Seer for anomaly detection rules
    """
    try:
        source_id = QuerySubscription.objects.get(snuba_query_id=snuba_query.id).id
Bug: Handle Multiple Subscriptions Gracefully
QuerySubscription.objects.get(snuba_query_id=snuba_query.id) will raise MultipleObjectsReturned when an alert rule has multiple projects, since each project has its own subscription pointing to the same SnubaQuery. The function should either use .first() instead of .get(), or iterate through all subscriptions if multiple Seer deletions are needed.
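A sketch of the kind of handling the report suggests, built on the models shown in the diff above (whether one or all subscription ids need Seer deletions is the open question; the reply below notes multi-project alert rules don't exist, so nothing was changed):

```python
def resolve_source_ids(snuba_query) -> list[int]:
    """Hypothetical helper: collect every QuerySubscription id for the query
    with .filter() instead of .get(), so an alert rule with one subscription
    per project cannot raise MultipleObjectsReturned."""
    return list(
        QuerySubscription.objects.filter(snuba_query_id=snuba_query.id).values_list(
            "id", flat=True
        )
    )
```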
no multi-project alert rules/detectors exist
Note for approver (I cannot approve since I started the PR): without the post-delete, we now have no way to clean up Seer data when a dynamic detector/alert rule is cascade deleted (such as during project or organization deletion). Seer has a cleanup task to delete data on their end for all alerts that haven't sent updates in 90 days, so it's not the end of the world, but I wanted to call out that we necessarily lose some functionality here due to the switch to the data source for the Seer key.
    if status is None or status is not True:
        logger.error(
            "Request to delete alert rule from Seer was unsuccessful",
            extra=extra_data,
When status is not True, Seer includes a message indicating what failed. It would be worth logging that here.
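Something along these lines, assuming the parsed Seer response is available as a dict (the variable and field names here are assumptions):

```python
if status is None or status is not True:
    logger.error(
        "Request to delete alert rule from Seer was unsuccessful",
        extra={
            **extra_data,
            # Surface Seer's own explanation of the failure.
            "seer_message": seer_response.get("message"),
        },
    )
```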
ram-senth left a comment
Reviewed the Seer interaction part and have one comment about logging failures. I do not have full context to review the other aspects, like retrieving the alert details.
When a detector is deleted, we need to delete the rule data in Seer. Similar implementation to the legacy alert rule code, but uses detector ID instead of alert rule ID.
Note that because we are no longer able to use the post_delete signal (the `QuerySubscription` model is deleted by the time we get to post_delete, so we cannot access the `source_id` needed), we cannot ensure Seer knows to delete data on their end during a cascade deletion. However, Seer deletes data that doesn't receive an update for 90 days, so it will eventually be cleaned up.
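In rough terms, the ordering constraint described above looks like this (a sketch only; delete_data_in_seer_for_detector is the helper discussed in the review, and how it is wired into the deletion endpoint is simplified here):

```python
def delete_detector_with_seer_cleanup(detector) -> None:
    """Sketch of why the Seer call cannot live in a post_delete signal:
    it must run while the detector's QuerySubscription (the source_id Seer
    is keyed on) still exists, i.e. before the cascade delete."""
    delete_data_in_seer_for_detector(detector)
    detector.delete()
```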