
Conversation

@MODSetter MODSetter (Owner) commented Oct 30, 2025

Description

feat: added file limit tracking for a user

Motivation and Context

FIX #

Screenshots

API Changes

  • This PR includes API changes

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify):

Testing Performed

  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed
  • No lint/build errors or new warnings
  • All relevant tests are passing

High-level PR Summary

This PR implements file/page limit tracking for users by adding pages_limit and pages_used columns to the user table. The system now estimates page counts from documents before processing them through ETL services (Unstructured, LlamaCloud, Docling) to prevent users from exceeding their allocated page limits. When a user attempts to process a document that would exceed their limit, the system rejects it before making expensive API calls. After successful processing, the actual page count is tracked and the user's pages_used counter is incremented. The feature includes a comprehensive page estimation service that handles various file formats (PDF, Word, Excel, images, etc.) and provides detailed error messages when limits are exceeded.
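
As a rough sketch of the flow described above (assuming PageLimitService wraps an AsyncSession, with estimate_pages and run_etl as hypothetical stand-ins for the PR's estimation and ETL helpers; this is illustrative, not the actual file_processors.py code):

```python
from app.services.page_limit_service import PageLimitService

async def process_upload(session, user_id, filename, file_bytes):
    """Sketch: pre-check the quota, run ETL, then record actual usage."""
    service = PageLimitService(session)

    # 1. Cheap page estimate before any expensive ETL API call.
    estimated_pages = estimate_pages(filename, file_bytes)  # hypothetical helper

    # 2. Reject early: check_page_limit raises PageLimitExceededError
    #    when the estimate would push the user past pages_limit.
    await service.check_page_limit(user_id, estimated_pages)

    # 3. Only now call the ETL service (Unstructured / LlamaCloud / Docling).
    docs = await run_etl(filename, file_bytes)  # hypothetical helper

    # 4. Count the pages actually produced and charge them to the user.
    #    allow_exceed=True because the document already exists at this point,
    #    so the real count is recorded even if it overshoots the estimate.
    actual_pages = estimate_pages_from_parsed(docs)  # name from the sequence diagram below
    await service.update_page_usage(user_id, actual_pages, allow_exceed=True)
    return docs
```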

⏱️ Estimated Review Time: 30-90 minutes

💡 Review Order Suggestion
| Order | File Path |
|---|---|
| 1 | surfsense_backend/app/db.py |
| 2 | surfsense_backend/alembic/versions/33_add_page_limits_to_user.py |
| 3 | surfsense_backend/app/services/page_limit_service.py |
| 4 | surfsense_backend/pyproject.toml |
| 5 | surfsense_backend/uv.lock |
| 6 | surfsense_backend/app/tasks/document_processors/file_processors.py |
| 7 | surfsense_backend/app/tasks/celery_tasks/document_tasks.py |
⚠️ Inconsistent Changes Detected
| File Path | Warning |
|---|---|
| surfsense_backend/app/db.py | Contains an unrelated change removing a Chinese comment from the LiteLLMProvider enum (line 84), which is not related to the page limit tracking feature |


Summary by CodeRabbit

  • New Features

    • Introduced user page quota system to manage ETL processing capacity with per-user limits (default 500 pages)
    • Implemented page limit enforcement with validation before and after file processing
    • Automatic page estimation for PDFs and various document types
    • Enhanced error handling with specific messaging for quota exceeded scenarios
  • Chores

    • Added pypdf library dependency
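
For reference, pypdf makes the PDF page count a one-liner; the estimation service presumably does something of this shape for PDFs (a sketch, not the PR's actual code):

```python
from io import BytesIO

from pypdf import PdfReader

def count_pdf_pages(data: bytes) -> int:
    # len(reader.pages) is pypdf's standard way to get a page count.
    return len(PdfReader(BytesIO(data)).pages)
```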

@vercel vercel bot commented Oct 30, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| surf-sense-frontend | Ready | Preview | Comment | Oct 30, 2025 9:59pm |


coderabbitai bot commented Oct 30, 2025

Walkthrough

Introduces a page limit tracking system for users. Adds database migration and model fields (pages_limit, pages_used) to track page usage. Implements PageLimitService to manage quotas and estimate pages from various document formats. Integrates checks throughout the document processing pipeline to prevent exceeding limits.
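
The error surface implied by the walkthrough is small. Here is a sketch of PageLimitExceededError, consistent with the constructor arguments quoted later in this review (message, pages_used, pages_limit, pages_to_add), though the PR's actual class body may differ:

```python
class PageLimitExceededError(Exception):
    """Raised when processing would push a user past their page quota."""

    def __init__(self, message: str, pages_used: int, pages_limit: int, pages_to_add: int):
        super().__init__(message)
        # Carried so callers (e.g. the Celery error handler) can build
        # detailed, user-facing messages from the failure.
        self.pages_used = pages_used
        self.pages_limit = pages_limit
        self.pages_to_add = pages_to_add
```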

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Database Migration | surfsense_backend/alembic/versions/33_add_page_limits_to_user.py | New Alembic migration script that conditionally adds pages_limit (default 500) and pages_used (default 0) integer columns to the user table with upgrade/downgrade support. |
| Database Model | surfsense_backend/app/db.py | Extends User model with pages_limit and pages_used integer fields (non-nullable, defaults 500 and 0 respectively) across both Google-auth and non-Google branches. |
| Page Limit Service | surfsense_backend/app/services/page_limit_service.py | New service module introducing PageLimitExceededError exception and PageLimitService class with methods for checking quotas, updating usage, fetching current usage, and estimating pages from multiple content types (PDFs, markdown, elements, file path analysis). |
| Task Error Handling | surfsense_backend/app/tasks/celery_tasks/document_tasks.py | Enhances _process_file_upload error handling with context-aware error messages based on exception type (PageLimitExceededError, HTTPException with "page limit" detail, or generic). |
| Document Processing Integration | surfsense_backend/app/tasks/document_processors/file_processors.py | Adds comprehensive page-limit workflow across all ETL paths: pre-processing page estimation with validation, post-processing actual page counting with mismatch warnings, usage updates gated on successful document creation, and HTTP 403 abort on limit exceeded. |
| Dependencies | surfsense_backend/pyproject.toml | Adds pypdf>=5.1.0 dependency for PDF page counting support. |
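
To make the migration row concrete, here is a hedged sketch of the conditional column adds (consistent with the snippets in the review comments further down, but not copied from the PR):

```python
from alembic import op
import sqlalchemy as sa

def upgrade() -> None:
    # Inspect existing columns so the migration is safe to re-run.
    inspector = sa.inspect(op.get_bind())
    user_columns = {col["name"] for col in inspector.get_columns("user")}

    if "pages_limit" not in user_columns:
        op.add_column(
            "user",
            sa.Column("pages_limit", sa.Integer(), nullable=False, server_default="500"),
        )
    if "pages_used" not in user_columns:
        op.add_column(
            "user",
            sa.Column("pages_used", sa.Integer(), nullable=False, server_default="0"),
        )

def downgrade() -> None:
    inspector = sa.inspect(op.get_bind())
    user_columns = {col["name"] for col in inspector.get_columns("user")}

    if "pages_used" in user_columns:
        op.drop_column("user", "pages_used")
    if "pages_limit" in user_columns:
        op.drop_column("user", "pages_limit")
```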

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Task as Celery Task
    participant Estimator as Page Estimator
    participant Service as PageLimitService
    participant DB as Database
    participant ETL as ETL Processor
    
    User->>Task: Upload file
    Task->>Estimator: estimate_pages_before_processing()
    Estimator-->>Task: estimated_pages
    Task->>Service: check_page_limit(user_id, estimated_pages)
    Service->>DB: Get user pages_used, pages_limit
    
    alt Limit exceeded
        Service-->>Task: PageLimitExceededError
        Task-->>User: HTTP 403
    else Within limit
        Service-->>Task: (True, pages_used, pages_limit)
        Task->>ETL: Process document
        ETL-->>Task: parsed_documents
        Task->>Estimator: estimate_pages_from_parsed()
        Estimator-->>Task: actual_pages
        rect rgb(200, 220, 240)
            Note over Task,DB: Update usage only if<br/>documents created
            Task->>Service: update_page_usage(user_id, final_pages)
            Service->>DB: Increment pages_used
            DB-->>Service: new_usage
        end
        Task-->>User: Success + metrics
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • file_processors.py requires careful review: pervasive control-flow changes across four distinct ETL paths (Unstructured, LlamaCloud, Docling, markdown/audio), conditional page-limit checks with abort logic, pre/post-estimation mismatch detection, and gated usage updates tied to document creation.
  • page_limit_service.py: Multiple estimation heuristics for different content types; validate PDF page counting fallback logic and edge cases in element/markdown estimation (a sketch of such heuristics follows this list).
  • celery_tasks/document_tasks.py: Verify context-aware error message logic correctly distinguishes between exception types and handles all error paths.
  • Migration/DB/Service wiring: Confirm AsyncSession usage and database constraint alignment.
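
Returning to the page_limit_service.py heuristics flagged above, here is a sketch of what estimators of that shape typically look like. The method names match the PR; the per-page thresholds and the dict-shaped elements are assumptions for illustration:

```python
WORDS_PER_PAGE = 500   # assumed heuristic, not the PR's actual constant
CHARS_PER_PAGE = 3000  # assumed heuristic, not the PR's actual constant

def estimate_pages_from_markdown(markdown: str) -> int:
    """Estimate pages from parsed markdown via a words-per-page heuristic."""
    words = len(markdown.split())
    return max(1, -(-words // WORDS_PER_PAGE))  # ceiling division, at least 1

def estimate_pages_from_content_length(content: str) -> int:
    """Coarse fallback when no better structure is available."""
    return max(1, -(-len(content) // CHARS_PER_PAGE))

def estimate_pages_from_elements(elements: list[dict]) -> int:
    """Count distinct page numbers in Unstructured-style element metadata,
    falling back to text length when page metadata is missing."""
    pages = {
        el.get("metadata", {}).get("page_number")
        for el in elements
        if el.get("metadata", {}).get("page_number") is not None
    }
    if pages:
        return len(pages)
    text = " ".join(el.get("text", "") for el in elements)
    return estimate_pages_from_content_length(text)
```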

Poem

🐰✨ A limit on pages, a quota so fair,
Our bunny now tallies with utmost care,
Pre-check, process, then post-count we do,
ETL pipelines march with boundaries true,
Five hundred pages? Hop wisely through! 📄

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Title Check ⚠️ Warning
Explanation: The pull request title states "feat: added file limit tracking for a user", which is partially related to the changeset but contains a notable terminology mismatch. The implementation clearly focuses on page-level limits (introducing pages_limit and pages_used fields, PageLimitService, and PageLimitExceededError), not file-level limits. While the title captures the general intent of adding user quota tracking for document processing, using "file limit" is misleading because the actual feature tracks and enforces page limits at a granular level, not file counts. This distinction matters for clarity when scanning PR history.
Resolution: Consider revising the title to "feat: added page limit tracking for a user" to accurately reflect that the implementation tracks pages_used and pages_limit rather than file-level quotas. This change will make the PR title more precise and prevent confusion about the scope of the feature when reviewed or referenced later.

Docstring Coverage ⚠️ Warning
Explanation: Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%.
Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Description Check ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.


@recurseml recurseml bot left a comment

Review by RecurseML

🔍 Review performed on 5654f6c..4be9d09

✨ No bugs found, your code is sparkling clean

✅ Files analyzed, no issues (7)

surfsense_backend/alembic/versions/33_add_page_limits_to_user.py
surfsense_backend/app/db.py
surfsense_backend/app/services/page_limit_service.py
surfsense_backend/app/tasks/celery_tasks/document_tasks.py
surfsense_backend/app/tasks/document_processors/file_processors.py
surfsense_backend/pyproject.toml
surfsense_backend/uv.lock

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
surfsense_backend/app/tasks/document_processors/file_processors.py (1)

708-875: Avoid charging duplicates against the user's page quota.

add_received_file_document_using_unstructured (and the LlamaCloud/Docling helpers) return an existing Document object when the content hash is unchanged, so result is still truthy for pure duplicates. With the new quota logic that means every re-upload of the same file now decrements the user's page allowance even though nothing new was persisted. The same pattern recurs in the LlamaCloud and Docling branches below, so the bug is systemic.

Please only call update_page_usage when we actually created or updated content. One way is to have each helper return a (document, created) tuple (or None for duplicates) and gate the quota update accordingly, e.g.:

-                result = await add_received_file_document_using_unstructured(
-                    session, filename, docs, search_space_id, user_id
-                )
-
-                if result:
+                result, created = await add_received_file_document_using_unstructured(
+                    session, filename, docs, search_space_id, user_id
+                )
+
+                if created:
                     # Update page usage after successful processing
                     # allow_exceed=True because document was already created after passing initial check
                     await page_limit_service.update_page_usage(
                         user_id, final_page_count, allow_exceed=True
                     )
 
                     await task_logger.log_task_success(
                         log_entry,
                         f"Successfully processed file with Unstructured: {filename}",
                         {
                             "document_id": result.id,
                             "content_hash": result.content_hash,
                             "file_type": "document",
                             "etl_service": "UNSTRUCTURED",
                             "pages_processed": final_page_count,
                         },
                     )
                 else:
                     await task_logger.log_task_success(
                         log_entry,
                         f"Document already exists (duplicate): {filename}",
                         {
                             "duplicate_detected": True,
                             "file_type": "document",
                             "etl_service": "UNSTRUCTURED",
                         },
                     )

And adjust the helpers to return the new shape, for example:

async def add_received_file_document_using_unstructured(...) -> tuple[Document | None, bool]:
    ...
    if existing_document and existing_document.content_hash == content_hash:
        logging.info("Document for file %s unchanged. Skipping.", file_name)
        return existing_document, False
    ...
    return document, True

Please mirror the same contract in the LlamaCloud and Docling paths so duplicates never burn quota.

🧹 Nitpick comments (1)
surfsense_backend/alembic/versions/33_add_page_limits_to_user.py (1)

36-57: Consider adding database constraints for data integrity.

The columns are added without database-level validation. This allows invalid states like pages_used > pages_limit or negative values. Adding CHECK constraints would enforce business rules at the database level.

Consider applying this diff to add constraints:

     # Add pages_limit column if it doesn't exist
     if "pages_limit" not in user_columns:
         op.add_column(
             "user",
             sa.Column(
                 "pages_limit",
                 sa.Integer(),
                 nullable=False,
                 server_default="500",
             ),
         )
+        # Add constraint to ensure pages_limit is positive
+        op.create_check_constraint(
+            "ck_user_pages_limit_positive",
+            "user",
+            "pages_limit > 0"
+        )

     # Add pages_used column if it doesn't exist
     if "pages_used" not in user_columns:
         op.add_column(
             "user",
             sa.Column(
                 "pages_used",
                 sa.Integer(),
                 nullable=False,
                 server_default="0",
             ),
         )
+        # Add constraint to ensure pages_used is non-negative
+        op.create_check_constraint(
+            "ck_user_pages_used_non_negative",
+            "user",
+            "pages_used >= 0"
+        )
+        # Add constraint to ensure pages_used doesn't exceed pages_limit
+        op.create_check_constraint(
+            "ck_user_pages_used_within_limit",
+            "user",
+            "pages_used <= pages_limit"
+        )

And update the downgrade function to drop these constraints:

     # Drop columns if they exist
     if "pages_used" in user_columns:
+        op.drop_constraint("ck_user_pages_used_within_limit", "user")
+        op.drop_constraint("ck_user_pages_used_non_negative", "user")
         op.drop_column("user", "pages_used")

     if "pages_limit" in user_columns:
+        op.drop_constraint("ck_user_pages_limit_positive", "user")
         op.drop_column("user", "pages_limit")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5654f6c and 4be9d09.

⛔ Files ignored due to path filters (1)
  • surfsense_backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • surfsense_backend/alembic/versions/33_add_page_limits_to_user.py (1 hunks)
  • surfsense_backend/app/db.py (2 hunks)
  • surfsense_backend/app/services/page_limit_service.py (1 hunks)
  • surfsense_backend/app/tasks/celery_tasks/document_tasks.py (1 hunks)
  • surfsense_backend/app/tasks/document_processors/file_processors.py (11 hunks)
  • surfsense_backend/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
surfsense_backend/app/tasks/celery_tasks/document_tasks.py (2)
surfsense_backend/app/services/page_limit_service.py (1)
  • PageLimitExceededError (12-27)
surfsense_backend/app/services/task_logging_service.py (1)
  • log_task_failure (107-162)
surfsense_backend/app/services/page_limit_service.py (1)
surfsense_backend/app/db.py (2)
  • User (392-405)
  • User (409-419)
surfsense_backend/app/tasks/document_processors/file_processors.py (2)
surfsense_backend/app/services/page_limit_service.py (7)
  • PageLimitExceededError (12-27)
  • PageLimitService (30-401)
  • check_page_limit (36-77)
  • estimate_pages_from_elements (147-179)
  • update_page_usage (79-123)
  • estimate_pages_from_markdown (181-211)
  • estimate_pages_from_content_length (213-224)
surfsense_backend/app/services/task_logging_service.py (2)
  • log_task_failure (107-162)
  • log_task_success (60-105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Python Backend Quality

Comment on lines +99 to +123
        # Get user
        result = await self.session.execute(select(User).where(User.id == user_id))
        user = result.scalar_one_or_none()

        if not user:
            raise ValueError(f"User with ID {user_id} not found")

        # Check if this would exceed limit (only if allow_exceed is False)
        new_usage = user.pages_used + pages_to_add
        if not allow_exceed and new_usage > user.pages_limit:
            raise PageLimitExceededError(
                message=f"Cannot update page usage. Would exceed limit. "
                f"Current: {user.pages_used}/{user.pages_limit}, "
                f"Trying to add: {pages_to_add}",
                pages_used=user.pages_used,
                pages_limit=user.pages_limit,
                pages_to_add=pages_to_add,
            )

        # Update usage
        user.pages_used = new_usage
        await self.session.commit()
        await self.session.refresh(user)

        return user.pages_used

⚠️ Potential issue | 🔴 Critical

Lock the user row before incrementing page usage.

update_page_usage reads pages_used, adds the delta, and writes it back with no locking. If two Celery workers finish at the same time they both see the same starting usage and the second commit overwrites the first, effectively dropping an increment and letting users bypass their limit. Please lock the row (or use an atomic UPDATE … SET pages_used = pages_used + :delta RETURNING …) before computing the new total, e.g.:

-        result = await self.session.execute(select(User).where(User.id == user_id))
+        result = await self.session.execute(
+            select(User).where(User.id == user_id).with_for_update()
+        )
         user = result.scalar_one_or_none()
@@
-        user.pages_used = new_usage
+        user.pages_used = new_usage
         await self.session.commit()
         await self.session.refresh(user)
 
         return user.pages_used

That guarantees concurrent workers serialize on the same row instead of losing updates.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

        # Get user
        result = await self.session.execute(select(User).where(User.id == user_id))
        user = result.scalar_one_or_none()

        if not user:
            raise ValueError(f"User with ID {user_id} not found")

        # Check if this would exceed limit (only if allow_exceed is False)
        new_usage = user.pages_used + pages_to_add
        if not allow_exceed and new_usage > user.pages_limit:
            raise PageLimitExceededError(
                message=f"Cannot update page usage. Would exceed limit. "
                f"Current: {user.pages_used}/{user.pages_limit}, "
                f"Trying to add: {pages_to_add}",
                pages_used=user.pages_used,
                pages_limit=user.pages_limit,
                pages_to_add=pages_to_add,
            )

        # Update usage
        user.pages_used = new_usage
        await self.session.commit()
        await self.session.refresh(user)

        return user.pages_used

After:

        # Get user
        result = await self.session.execute(
            select(User).where(User.id == user_id).with_for_update()
        )
        user = result.scalar_one_or_none()

        if not user:
            raise ValueError(f"User with ID {user_id} not found")

        # Check if this would exceed limit (only if allow_exceed is False)
        new_usage = user.pages_used + pages_to_add
        if not allow_exceed and new_usage > user.pages_limit:
            raise PageLimitExceededError(
                message=f"Cannot update page usage. Would exceed limit. "
                f"Current: {user.pages_used}/{user.pages_limit}, "
                f"Trying to add: {pages_to_add}",
                pages_used=user.pages_used,
                pages_limit=user.pages_limit,
                pages_to_add=pages_to_add,
            )

        # Update usage
        user.pages_used = new_usage
        await self.session.commit()
        await self.session.refresh(user)

        return user.pages_used
🤖 Prompt for AI Agents
In surfsense_backend/app/services/page_limit_service.py around lines 99 to 123,
the method reads pages_used, computes a new value and commits without any row
locking which allows concurrent workers to overwrite increments; change this to
perform the update atomically or lock the row: either (1) start a transaction
and re-select the user row with a FOR UPDATE (SQLAlchemy
select(...).with_for_update()) before computing and writing pages_used so
concurrent transactions serialize on the row, or (2) issue an atomic UPDATE
statement that increments pages_used (UPDATE ... SET pages_used = pages_used +
:delta RETURNING pages_used, pages_limit) and use the returned values to enforce
the limit check and return the new usage; ensure the transaction is committed
and the session refreshed only after the locked/atomic operation completes.
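
For completeness, a hedged sketch of option (2) from the prompt above: a single atomic UPDATE ... RETURNING in SQLAlchemy, with the quota check folded into the WHERE clause so no row lock is needed. Illustrative only; the committed fix may differ:

```python
from sqlalchemy import select, update

# (method of PageLimitService)
async def update_page_usage(self, user_id, pages_to_add: int, allow_exceed: bool = False) -> int:
    stmt = (
        update(User)
        .where(User.id == user_id)
        .values(pages_used=User.pages_used + pages_to_add)
        .returning(User.pages_used)
    )
    if not allow_exceed:
        # The limit check rides along in the statement, so read-check-write
        # happens atomically on the database side.
        stmt = stmt.where(User.pages_used + pages_to_add <= User.pages_limit)

    row = (await self.session.execute(stmt)).first()
    await self.session.commit()

    if row is None:
        # Nothing was updated: the user is missing, or the increment would
        # exceed the limit. Re-select to build the detailed error.
        user = (
            await self.session.execute(select(User).where(User.id == user_id))
        ).scalar_one_or_none()
        if user is None:
            raise ValueError(f"User with ID {user_id} not found")
        raise PageLimitExceededError(
            message=f"Cannot update page usage. Would exceed limit. "
            f"Current: {user.pages_used}/{user.pages_limit}, "
            f"Trying to add: {pages_to_add}",
            pages_used=user.pages_used,
            pages_limit=user.pages_limit,
            pages_to_add=pages_to_add,
        )
    return row[0]
```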

@MODSetter MODSetter merged commit a09309a into main Oct 30, 2025
7 of 9 checks passed
okapteinis pushed a commit to okapteinis/SurfSense that referenced this pull request Nov 21, 2025
feat: added file limit tracking for a user