@nora-shap
Member

Problem

The LLM issue detection task was fetching full span data for every trace in Sentry, then sending truncated pieces of that telemetry to Seer in individual requests. We want to use EAPTrace instead, which includes much more data in a format better optimized for LLM analysis. This requires a significant restructuring of the request/response formats between this task and its Seer endpoint.

There was also a small bug in how we selected traces for each transaction. This PR fixes it and introduces a small amount of variation into the trace selection logic.

Solution

Changed the request/response flow so Sentry sends only trace metadata (trace IDs and normalized transaction names) to Seer in a single bundled request. Seer now fetches the full EAPTrace data itself via Sentry's existing get_trace_waterfall RPC endpoint and uses it as the input for LLM detection.

Changes to Sentry → Seer Request

Before:

  • Sentry sent truncated trace telemetry
  • Multiple fields: trace_id, project_id, transaction_name, total_spans, spans: list[Span]
  • Sent one trace at a time

After:

  • Sentry sends only trace metadata: trace_id and normalized transaction_name
  • Sends up to 50 traces in a single request
  • Seer fetches full EAPTrace data via RPC
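A minimal sketch of what the bundled request could look like, assuming a plain dataclass in place of the PR's pydantic model. The `"telemetry"` list and the `"kind": "trace"` marker come from the diff in this PR; `TraceMetadata`, `build_seer_request`, and the `organization_id` key are hypothetical names used only for illustration.

```python
from dataclasses import asdict, dataclass

# "Sends up to 50 traces in a single request" (from the PR description).
MAX_TRACES_PER_REQUEST = 50


@dataclass(frozen=True)
class TraceMetadata:
    """Hypothetical stand-in for the PR's request model: IDs only, no spans."""

    trace_id: str
    transaction_name: str  # normalized transaction name


def build_seer_request(traces, organization_id):
    """Bundle trace metadata into one request body (assumed shape)."""
    batch = traces[:MAX_TRACES_PER_REQUEST]
    return {
        # Mirrors the diff: each entry is the model's fields plus a "kind" marker.
        "telemetry": [{**asdict(trace), "kind": "trace"} for trace in batch],
        # Assumed key: Seer needs the org ID to fetch the EAPTrace over RPC.
        "organization_id": organization_id,
    }
```

Because Seer fetches the spans itself, the payload size depends only on the number of traces, not on how many spans each trace contains.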

Changes to Seer → Sentry Response

Updated DetectedIssue model to include context fields:

  • Added trace_id: str - which trace the issue was found in
  • Added transaction_name: str - normalized transaction name
  • These are pass-through fields Seer must return from the request
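A rough sketch of how the pass-through fields might be checked on the Sentry side, assuming a dataclass rather than the PR's pydantic model. The `title` field, the `"issues"` response key, and `parse_detected_issues` are assumptions for illustration; only `trace_id` and `transaction_name` come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedIssue:
    title: str  # hypothetical field; the real model has its own issue fields
    trace_id: str  # which trace the issue was found in
    transaction_name: str  # normalized transaction name, echoed back by Seer


def parse_detected_issues(payload, sent_trace_ids):
    """Validate that Seer echoed back the context fields for each issue."""
    issues = []
    for raw in payload.get("issues", []):  # "issues" key is an assumption
        for field in ("trace_id", "transaction_name"):
            if not isinstance(raw.get(field), str):
                raise ValueError(f"missing or invalid pass-through field: {field}")
        if raw["trace_id"] not in sent_trace_ids:
            raise ValueError(f"unknown trace_id in response: {raw['trace_id']}")
        issues.append(
            DetectedIssue(
                title=str(raw.get("title", "")),
                trace_id=raw["trace_id"],
                transaction_name=raw["transaction_name"],
            )
        )
    return issues
```

Checking the returned `trace_id` against the IDs that were sent is an extra guard added in this sketch, not something the PR describes.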

Trace Selection Logic

  • Query top transactions by sum(span.duration) over a 30-minute window
  • Deduplicate by normalized transaction name
  • For each unique transaction, select one representative trace using a randomized time sub-window (1-8 minute offset)
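The selection steps above could be sketched roughly as follows. This is an interpretation, not the PR's code: the normalization rule (replacing digit runs with `*`), the reading of "1-8 minute offset" as an offset from the start of the 30-minute window, and the 5-minute sub-window length are all assumptions, as are the function names.

```python
import random
import re
from datetime import datetime, timedelta, timezone

QUERY_WINDOW = timedelta(minutes=30)  # from the PR description


def normalize_transaction_name(name: str) -> str:
    """Hypothetical normalization: replace runs of digits with '*'."""
    return re.sub(r"\d+", "*", name.strip())


def dedupe_by_normalized_name(names):
    """Keep the first occurrence of each normalized name, preserving order."""
    seen = set()
    unique = []
    for name in names:
        key = normalize_transaction_name(name)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def pick_sub_window(window_end, rng, length=timedelta(minutes=5)):
    """Pick a sub-window starting 1-8 minutes after the query window opens.

    The sub-window length is an assumed parameter.
    """
    window_start = window_end - QUERY_WINDOW
    offset = timedelta(minutes=rng.randint(1, 8))  # randint is inclusive
    start = window_start + offset
    return start, start + length
```

Passing the random generator in explicitly keeps the selection reproducible in tests while still varying which trace is picked between task runs.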

Breaking Changes

This is a breaking change to the Seer integration. Deployment requires:

  1. Stop the task (issue-detection.llm-detection.enabled = false)
  2. Deploy Seer changes to handle new request format and fetch traces via RPC
  3. Deploy this Sentry change
  4. Re-enable the task

This will not impact any customers.

@linear

linear bot commented Dec 5, 2025

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 5, 2025


class EvidenceTraceData(BaseModel):
class EvidenceTraceData(BaseModel): # hate this name
Member Author

@nora-shap nora-shap Dec 5, 2025

any naming suggestions?


NUM_TRANSACTIONS_TO_PROCESS = 20
LOWER_SPAN_LIMIT = 20
UPPER_SPAN_LIMIT = 500
Member Author

these will be handled on the seer side

class EvidenceTraceData(BaseModel):  # hate this name
    trace_id: str
    project_id: int
    transaction_name: str
Member

@roggenkemper roggenkemper Dec 5, 2025

do we need the transaction name in addition to the trace_id when fetching the EAPTrace? or is this just so we still have access to the transaction name for our own things

Member Author

great q - transaction name is now just context data that we pass to seer, and seer passes back in the detected issue, because we need it to create the issue.
the EAPTrace only needs trace_id + org_id

    organization_id=organization_id,
    response_data=response.data.decode("utf-8"),
    error_message=str(e),
)
Contributor

Bug: Missing pydantic.ValidationError in exception handler

The exception handler catches (ValueError, TypeError) but IssueDetectionResponse.parse_obj() raises pydantic.ValidationError when the Seer response doesn't match the expected schema. Since the DetectedIssue model now requires trace_id and transaction_name fields that Seer must pass back, if Seer fails to return these fields or returns them with incorrect types, the pydantic.ValidationError will propagate uncaught instead of being wrapped in LLMIssueDetectionError. The codebase correctly catches pydantic.ValidationError elsewhere when using parse_obj.

spans: list[Span]


class EvidenceTraceData(BaseModel):
Member

now that we aren't passing any real trace data here, we can use whatever name we want

Comment on lines 224 to 225
if not has_access:
    return
Bug: Unhandled Project.DoesNotExist exception when fetching project from cache, leading to task crashes.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

The Project.objects.get_from_cache(id=project_id) call at src/sentry/tasks/llm_issue_detection/detection.py:224 lacks error handling for Project.DoesNotExist. If a project is deleted after run_llm_issue_detection() dispatches the subtask but before detect_llm_issues_for_project() executes, the task will crash with an unhandled Project.DoesNotExist exception. This is inconsistent with trace_data.py in the same PR, which explicitly handles this exception.

💡 Suggested Fix

Wrap the Project.objects.get_from_cache() call in a try-except Project.DoesNotExist block, log the error, and return early to gracefully handle missing projects.
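A minimal sketch of the suggested guard, written with stand-ins so it runs without Django. `ProjectDoesNotExist` and `get_project_from_cache` are hypothetical substitutes for `Project.DoesNotExist` and `Project.objects.get_from_cache(id=project_id)`; the log message key and the boolean return value are also assumptions made for the example.

```python
import logging

logger = logging.getLogger(__name__)


class ProjectDoesNotExist(Exception):
    """Stand-in for Django's Project.DoesNotExist."""


def get_project_from_cache(project_id, cache):
    """Stand-in for Project.objects.get_from_cache(id=project_id)."""
    try:
        return cache[project_id]
    except KeyError:
        raise ProjectDoesNotExist(project_id) from None


def detect_llm_issues_for_project(project_id, cache):
    """Return False early if the project was deleted after dispatch."""
    try:
        project = get_project_from_cache(project_id, cache)
    except ProjectDoesNotExist:
        # The project can disappear between dispatch and execution; log and exit.
        logger.warning(
            "llm_issue_detection.project_missing", extra={"project_id": project_id}
        )
        return False
    # Access checks and detection for `project` would continue here.
    return project is not None
```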


if processed_count >= NUM_TRANSACTIONS_TO_PROCESS:
    break
seer_request = {
    "telemetry": [{**trace.dict(), "kind": "trace"} for trace in evidence_traces],
Member

feels like we could use better variable names here since it's just the id/name instead of an actual trace now

@codecov

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 83.05085% with 10 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines                              Patch %   Lines
src/sentry/tasks/llm_issue_detection/detection.py     76.92%    6 Missing ⚠️
src/sentry/tasks/llm_issue_detection/trace_data.py    87.87%    4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #104485   +/-   ##
========================================
  Coverage   80.53%    80.53%           
========================================
  Files        9350      9350           
  Lines      400099    400078   -21     
  Branches    25660     25660           
========================================
- Hits       322216    322201   -15     
+ Misses      77415     77409    -6     
  Partials      468       468           

Labels

Scope: Backend Automatically applied to PRs that change backend components
