
Conversation

@MODSetter MODSetter (Owner) commented Oct 30, 2025

Description

feat: added file limit tracking for a user

Motivation and Context

FIX #

Screenshots

API Changes

  • This PR includes API changes

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify):

Testing Performed

  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed
  • No lint/build errors or new warnings
  • All relevant tests are passing

High-level PR Summary

This PR implements file/page limit tracking for users by adding pages_limit and pages_used columns to the user table. The system now estimates page counts from documents before processing them through ETL services (Unstructured, LlamaCloud, Docling) to prevent users from exceeding their allocated page limits. When a user attempts to process a document that would exceed their limit, the system rejects it before making expensive API calls. After successful processing, the actual page count is tracked and the user's pages_used counter is incremented. The feature includes a comprehensive page estimation service that handles various file formats (PDF, Word, Excel, images, etc.) and provides detailed error messages when limits are exceeded.
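
As a rough sketch of the flow described above (assuming PageLimitService wraps an AsyncSession, with estimate_pages and run_etl as hypothetical stand-ins for the PR's estimation and ETL helpers; this is illustrative, not the actual file_processors.py code):

```python
from app.services.page_limit_service import PageLimitService

async def process_upload(session, user_id, filename, file_bytes):
    """Sketch: pre-check the quota, run ETL, then record actual usage."""
    service = PageLimitService(session)

    # 1. Cheap page estimate before any expensive ETL API call.
    estimated_pages = estimate_pages(filename, file_bytes)  # hypothetical helper

    # 2. Reject early: check_page_limit raises PageLimitExceededError
    #    when the estimate would push the user past pages_limit.
    await service.check_page_limit(user_id, estimated_pages)

    # 3. Only now call the ETL service (Unstructured / LlamaCloud / Docling).
    docs = await run_etl(filename, file_bytes)  # hypothetical helper

    # 4. Count the pages actually produced and charge them to the user.
    #    allow_exceed=True because the document already exists at this point,
    #    so the real count is recorded even if it overshoots the estimate.
    actual_pages = estimate_pages_from_parsed(docs)  # name from the sequence diagram below
    await service.update_page_usage(user_id, actual_pages, allow_exceed=True)
    return docs
```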

⏱️ Estimated Review Time: 30-90 minutes

💡 Review Order Suggestion
| Order | File Path |
|---|---|
| 1 | surfsense_backend/app/db.py |
| 2 | surfsense_backend/alembic/versions/33_add_page_limits_to_user.py |
| 3 | surfsense_backend/app/services/page_limit_service.py |
| 4 | surfsense_backend/pyproject.toml |
| 5 | surfsense_backend/uv.lock |
| 6 | surfsense_backend/app/tasks/document_processors/file_processors.py |
| 7 | surfsense_backend/app/tasks/celery_tasks/document_tasks.py |
⚠️ Inconsistent Changes Detected
| File Path | Warning |
|---|---|
| surfsense_backend/app/db.py | Contains an unrelated change removing a Chinese comment from the LiteLLMProvider enum (line 84), which is not related to the page limit tracking feature |


Summary by CodeRabbit

  • New Features

    • Introduced user page quota system to manage ETL processing capacity with per-user limits (default 500 pages)
    • Implemented page limit enforcement with validation before and after file processing
    • Automatic page estimation for PDFs and various document types
    • Enhanced error handling with specific messaging for quota exceeded scenarios
  • Chores

    • Added pypdf library dependency
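
For reference, pypdf makes the PDF page count a one-liner; the estimation service presumably does something of this shape for PDFs (a sketch, not the PR's actual code):

```python
from io import BytesIO

from pypdf import PdfReader

def count_pdf_pages(data: bytes) -> int:
    # len(reader.pages) is pypdf's standard way to get a page count.
    return len(PdfReader(BytesIO(data)).pages)
```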

@vercel vercel bot commented Oct 30, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| surf-sense-frontend | Ready | Preview | Comment | Oct 30, 2025 9:59pm |


coderabbitai bot commented Oct 30, 2025

Walkthrough

Introduces a page limit tracking system for users. Adds database migration and model fields (pages_limit, pages_used) to track page usage. Implements PageLimitService to manage quotas and estimate pages from various document formats. Integrates checks throughout the document processing pipeline to prevent exceeding limits.
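
The error surface implied by the walkthrough is small. Here is a sketch of PageLimitExceededError, consistent with the constructor arguments quoted later in this review (message, pages_used, pages_limit, pages_to_add), though the PR's actual class body may differ:

```python
class PageLimitExceededError(Exception):
    """Raised when processing would push a user past their page quota."""

    def __init__(self, message: str, pages_used: int, pages_limit: int, pages_to_add: int):
        super().__init__(message)
        # Carried so callers (e.g. the Celery error handler) can build
        # detailed, user-facing messages from the failure.
        self.pages_used = pages_used
        self.pages_limit = pages_limit
        self.pages_to_add = pages_to_add
```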

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Database Migration | surfsense_backend/alembic/versions/33_add_page_limits_to_user.py | New Alembic migration script that conditionally adds pages_limit (default 500) and pages_used (default 0) integer columns to the user table with upgrade/downgrade support. |
| Database Model | surfsense_backend/app/db.py | Extends User model with pages_limit and pages_used integer fields (non-nullable, defaults 500 and 0 respectively) across both Google-auth and non-Google branches. |
| Page Limit Service | surfsense_backend/app/services/page_limit_service.py | New service module introducing PageLimitExceededError exception and PageLimitService class with methods for checking quotas, updating usage, fetching current usage, and estimating pages from multiple content types (PDFs, markdown, elements, file path analysis). |
| Task Error Handling | surfsense_backend/app/tasks/celery_tasks/document_tasks.py | Enhances _process_file_upload error handling with context-aware error messages based on exception type (PageLimitExceededError, HTTPException with "page limit" detail, or generic). |
| Document Processing Integration | surfsense_backend/app/tasks/document_processors/file_processors.py | Adds comprehensive page-limit workflow across all ETL paths: pre-processing page estimation with validation, post-processing actual page counting with mismatch warnings, usage updates gated on successful document creation, and HTTP 403 abort on limit exceeded. |
| Dependencies | surfsense_backend/pyproject.toml | Adds pypdf>=5.1.0 dependency for PDF page counting support. |
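
To make the migration row concrete, here is a hedged sketch of the conditional column adds (consistent with the snippets in the review comments further down, but not copied from the PR):

```python
from alembic import op
import sqlalchemy as sa

def upgrade() -> None:
    # Inspect existing columns so the migration is safe to re-run.
    inspector = sa.inspect(op.get_bind())
    user_columns = {col["name"] for col in inspector.get_columns("user")}

    if "pages_limit" not in user_columns:
        op.add_column(
            "user",
            sa.Column("pages_limit", sa.Integer(), nullable=False, server_default="500"),
        )
    if "pages_used" not in user_columns:
        op.add_column(
            "user",
            sa.Column("pages_used", sa.Integer(), nullable=False, server_default="0"),
        )

def downgrade() -> None:
    inspector = sa.inspect(op.get_bind())
    user_columns = {col["name"] for col in inspector.get_columns("user")}

    if "pages_used" in user_columns:
        op.drop_column("user", "pages_used")
    if "pages_limit" in user_columns:
        op.drop_column("user", "pages_limit")
```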

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Task as Celery Task
    participant Estimator as Page Estimator
    participant Service as PageLimitService
    participant DB as Database
    participant ETL as ETL Processor
    
    User->>Task: Upload file
    Task->>Estimator: estimate_pages_before_processing()
    Estimator-->>Task: estimated_pages
    Task->>Service: check_page_limit(user_id, estimated_pages)
    Service->>DB: Get user pages_used, pages_limit
    
    alt Limit exceeded
        Service-->>Task: PageLimitExceededError
        Task-->>User: HTTP 403
    else Within limit
        Service-->>Task: (True, pages_used, pages_limit)
        Task->>ETL: Process document
        ETL-->>Task: parsed_documents
        Task->>Estimator: estimate_pages_from_parsed()
        Estimator-->>Task: actual_pages
        rect rgb(200, 220, 240)
            Note over Task,DB: Update usage only if<br/>documents created
            Task->>Service: update_page_usage(user_id, final_pages)
            Service->>DB: Increment pages_used
            DB-->>Service: new_usage
        end
        Task-->>User: Success + metrics
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • file_processors.py requires careful review: pervasive control-flow changes across four distinct ETL paths (Unstructured, LlamaCloud, Docling, markdown/audio), conditional page-limit checks with abort logic, pre/post-estimation mismatch detection, and gated usage updates tied to document creation.
  • page_limit_service.py: Multiple estimation heuristics for different content types; validate PDF page counting fallback logic and edge cases in element/markdown estimation (a sketch of such heuristics follows this list).
  • celery_tasks/document_tasks.py: Verify context-aware error message logic correctly distinguishes between exception types and handles all error paths.
  • Migration/DB/Service wiring: Confirm AsyncSession usage and database constraint alignment.
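
Returning to the page_limit_service.py heuristics flagged above, here is a sketch of what estimators of that shape typically look like. The method names match the PR; the per-page thresholds and the dict-shaped elements are assumptions for illustration:

```python
WORDS_PER_PAGE = 500   # assumed heuristic, not the PR's actual constant
CHARS_PER_PAGE = 3000  # assumed heuristic, not the PR's actual constant

def estimate_pages_from_markdown(markdown: str) -> int:
    """Estimate pages from parsed markdown via a words-per-page heuristic."""
    words = len(markdown.split())
    return max(1, -(-words // WORDS_PER_PAGE))  # ceiling division, at least 1

def estimate_pages_from_content_length(content: str) -> int:
    """Coarse fallback when no better structure is available."""
    return max(1, -(-len(content) // CHARS_PER_PAGE))

def estimate_pages_from_elements(elements: list[dict]) -> int:
    """Count distinct page numbers in Unstructured-style element metadata,
    falling back to text length when page metadata is missing."""
    pages = {
        el.get("metadata", {}).get("page_number")
        for el in elements
        if el.get("metadata", {}).get("page_number") is not None
    }
    if pages:
        return len(pages)
    text = " ".join(el.get("text", "") for el in elements)
    return estimate_pages_from_content_length(text)
```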

Poem

🐰✨ A limit on pages, a quota so fair,
Our bunny now tallies with utmost care,
Pre-check, process, then post-count we do,
ETL pipelines march with boundaries true,
Five hundred pages? Hop wisely through! 📄

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Title Check ⚠️ Warning
Explanation: The pull request title states "feat: added file limit tracking for a user", which is partially related to the changeset but contains a notable terminology mismatch. The implementation clearly focuses on page-level limits (introducing pages_limit and pages_used fields, PageLimitService, and PageLimitExceededError), not file-level limits. While the title captures the general intent of adding user quota tracking for document processing, using "file limit" is misleading because the actual feature tracks and enforces page limits at a granular level, not file counts. This distinction matters for clarity when scanning PR history.
Resolution: Consider revising the title to "feat: added page limit tracking for a user" to accurately reflect that the implementation tracks pages_used and pages_limit rather than file-level quotas. This change will make the PR title more precise and prevent confusion about the scope of the feature when reviewed or referenced later.

Docstring Coverage ⚠️ Warning
Explanation: Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%.
Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Description Check ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.


@recurseml recurseml bot left a comment

Review by RecurseML

🔍 Review performed on 5654f6c..4be9d09

✨ No bugs found, your code is sparkling clean

✅ Files analyzed, no issues (7)

surfsense_backend/alembic/versions/33_add_page_limits_to_user.py
surfsense_backend/app/db.py
surfsense_backend/app/services/page_limit_service.py
surfsense_backend/app/tasks/celery_tasks/document_tasks.py
surfsense_backend/app/tasks/document_processors/file_processors.py
surfsense_backend/pyproject.toml
surfsense_backend/uv.lock

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
surfsense_backend/app/tasks/document_processors/file_processors.py (1)

708-875: Avoid charging duplicates against the user's page quota.

add_received_file_document_using_unstructured (and the LlamaCloud/Docling helpers) return an existing Document object when the content hash is unchanged, so result is still truthy for pure duplicates. With the new quota logic that means every re-upload of the same file now decrements the user's page allowance even though nothing new was persisted. The same pattern recurs in the LlamaCloud and Docling branches below, so the bug is systemic.

Please only call update_page_usage when we actually created or updated content. One way is to have each helper return a (document, created) tuple (or None for duplicates) and gate the quota update accordingly, e.g.:

-                result = await add_received_file_document_using_unstructured(
-                    session, filename, docs, search_space_id, user_id
-                )
-
-                if result:
+                result, created = await add_received_file_document_using_unstructured(
+                    session, filename, docs, search_space_id, user_id
+                )
+
+                if created:
                     # Update page usage after successful processing
                     # allow_exceed=True because document was already created after passing initial check
                     await page_limit_service.update_page_usage(
                         user_id, final_page_count, allow_exceed=True
                     )
 
                     await task_logger.log_task_success(
                         log_entry,
                         f"Successfully processed file with Unstructured: {filename}",
                         {
                             "document_id": result.id,
                             "content_hash": result.content_hash,
                             "file_type": "document",
                             "etl_service": "UNSTRUCTURED",
                             "pages_processed": final_page_count,
                         },
                     )
                 else:
                     await task_logger.log_task_success(
                         log_entry,
                         f"Document already exists (duplicate): {filename}",
                         {
                             "duplicate_detected": True,
                             "file_type": "document",
                             "etl_service": "UNSTRUCTURED",
                         },
                     )

And adjust the helpers to return the new shape, for example:

async def add_received_file_document_using_unstructured(...) -> tuple[Document | None, bool]:
    ...
    if existing_document and existing_document.content_hash == content_hash:
        logging.info("Document for file %s unchanged. Skipping.", file_name)
        return existing_document, False
    ...
    return document, True

Please mirror the same contract in the LlamaCloud and Docling paths so duplicates never burn quota.

🧹 Nitpick comments (1)
surfsense_backend/alembic/versions/33_add_page_limits_to_user.py (1)

36-57: Consider adding database constraints for data integrity.

The columns are added without database-level validation. This allows invalid states like pages_used > pages_limit or negative values. Adding CHECK constraints would enforce business rules at the database level.

Consider applying this diff to add constraints:

     # Add pages_limit column if it doesn't exist
     if "pages_limit" not in user_columns:
         op.add_column(
             "user",
             sa.Column(
                 "pages_limit",
                 sa.Integer(),
                 nullable=False,
                 server_default="500",
             ),
         )
+        # Add constraint to ensure pages_limit is positive
+        op.create_check_constraint(
+            "ck_user_pages_limit_positive",
+            "user",
+            "pages_limit > 0"
+        )

     # Add pages_used column if it doesn't exist
     if "pages_used" not in user_columns:
         op.add_column(
             "user",
             sa.Column(
                 "pages_used",
                 sa.Integer(),
                 nullable=False,
                 server_default="0",
             ),
         )
+        # Add constraint to ensure pages_used is non-negative
+        op.create_check_constraint(
+            "ck_user_pages_used_non_negative",
+            "user",
+            "pages_used >= 0"
+        )
+        # Add constraint to ensure pages_used doesn't exceed pages_limit
+        op.create_check_constraint(
+            "ck_user_pages_used_within_limit",
+            "user",
+            "pages_used <= pages_limit"
+        )

And update the downgrade function to drop these constraints:

     # Drop columns if they exist
     if "pages_used" in user_columns:
+        op.drop_constraint("ck_user_pages_used_within_limit", "user")
+        op.drop_constraint("ck_user_pages_used_non_negative", "user")
         op.drop_column("user", "pages_used")

     if "pages_limit" in user_columns:
+        op.drop_constraint("ck_user_pages_limit_positive", "user")
         op.drop_column("user", "pages_limit")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5654f6c and 4be9d09.

⛔ Files ignored due to path filters (1)
  • surfsense_backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • surfsense_backend/alembic/versions/33_add_page_limits_to_user.py (1 hunks)
  • surfsense_backend/app/db.py (2 hunks)
  • surfsense_backend/app/services/page_limit_service.py (1 hunks)
  • surfsense_backend/app/tasks/celery_tasks/document_tasks.py (1 hunks)
  • surfsense_backend/app/tasks/document_processors/file_processors.py (11 hunks)
  • surfsense_backend/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
surfsense_backend/app/tasks/celery_tasks/document_tasks.py (2)
surfsense_backend/app/services/page_limit_service.py (1)
  • PageLimitExceededError (12-27)
surfsense_backend/app/services/task_logging_service.py (1)
  • log_task_failure (107-162)
surfsense_backend/app/services/page_limit_service.py (1)
surfsense_backend/app/db.py (2)
  • User (392-405)
  • User (409-419)
surfsense_backend/app/tasks/document_processors/file_processors.py (2)
surfsense_backend/app/services/page_limit_service.py (7)
  • PageLimitExceededError (12-27)
  • PageLimitService (30-401)
  • check_page_limit (36-77)
  • estimate_pages_from_elements (147-179)
  • update_page_usage (79-123)
  • estimate_pages_from_markdown (181-211)
  • estimate_pages_from_content_length (213-224)
surfsense_backend/app/services/task_logging_service.py (2)
  • log_task_failure (107-162)
  • log_task_success (60-105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Python Backend Quality

Comment on lines +99 to +123
        # Get user
        result = await self.session.execute(select(User).where(User.id == user_id))
        user = result.scalar_one_or_none()

        if not user:
            raise ValueError(f"User with ID {user_id} not found")

        # Check if this would exceed limit (only if allow_exceed is False)
        new_usage = user.pages_used + pages_to_add
        if not allow_exceed and new_usage > user.pages_limit:
            raise PageLimitExceededError(
                message=f"Cannot update page usage. Would exceed limit. "
                f"Current: {user.pages_used}/{user.pages_limit}, "
                f"Trying to add: {pages_to_add}",
                pages_used=user.pages_used,
                pages_limit=user.pages_limit,
                pages_to_add=pages_to_add,
            )

        # Update usage
        user.pages_used = new_usage
        await self.session.commit()
        await self.session.refresh(user)

        return user.pages_used

⚠️ Potential issue | 🔴 Critical

Lock the user row before incrementing page usage.

update_page_usage reads pages_used, adds the delta, and writes it back with no locking. If two Celery workers finish at the same time they both see the same starting usage and the second commit overwrites the first, effectively dropping an increment and letting users bypass their limit. Please lock the row (or use an atomic UPDATE … SET pages_used = pages_used + :delta RETURNING …) before computing the new total, e.g.:

-        result = await self.session.execute(select(User).where(User.id == user_id))
+        result = await self.session.execute(
+            select(User).where(User.id == user_id).with_for_update()
+        )
         user = result.scalar_one_or_none()
@@
-        user.pages_used = new_usage
+        user.pages_used = new_usage
         await self.session.commit()
         await self.session.refresh(user)
 
         return user.pages_used

That guarantees concurrent workers serialize on the same row instead of losing updates.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

        # Get user
        result = await self.session.execute(select(User).where(User.id == user_id))
        user = result.scalar_one_or_none()

        if not user:
            raise ValueError(f"User with ID {user_id} not found")

        # Check if this would exceed limit (only if allow_exceed is False)
        new_usage = user.pages_used + pages_to_add
        if not allow_exceed and new_usage > user.pages_limit:
            raise PageLimitExceededError(
                message=f"Cannot update page usage. Would exceed limit. "
                f"Current: {user.pages_used}/{user.pages_limit}, "
                f"Trying to add: {pages_to_add}",
                pages_used=user.pages_used,
                pages_limit=user.pages_limit,
                pages_to_add=pages_to_add,
            )

        # Update usage
        user.pages_used = new_usage
        await self.session.commit()
        await self.session.refresh(user)

        return user.pages_used

After:

        # Get user
        result = await self.session.execute(
            select(User).where(User.id == user_id).with_for_update()
        )
        user = result.scalar_one_or_none()

        if not user:
            raise ValueError(f"User with ID {user_id} not found")

        # Check if this would exceed limit (only if allow_exceed is False)
        new_usage = user.pages_used + pages_to_add
        if not allow_exceed and new_usage > user.pages_limit:
            raise PageLimitExceededError(
                message=f"Cannot update page usage. Would exceed limit. "
                f"Current: {user.pages_used}/{user.pages_limit}, "
                f"Trying to add: {pages_to_add}",
                pages_used=user.pages_used,
                pages_limit=user.pages_limit,
                pages_to_add=pages_to_add,
            )

        # Update usage
        user.pages_used = new_usage
        await self.session.commit()
        await self.session.refresh(user)

        return user.pages_used
🤖 Prompt for AI Agents
In surfsense_backend/app/services/page_limit_service.py around lines 99 to 123,
the method reads pages_used, computes a new value and commits without any row
locking which allows concurrent workers to overwrite increments; change this to
perform the update atomically or lock the row: either (1) start a transaction
and re-select the user row with a FOR UPDATE (SQLAlchemy
select(...).with_for_update()) before computing and writing pages_used so
concurrent transactions serialize on the row, or (2) issue an atomic UPDATE
statement that increments pages_used (UPDATE ... SET pages_used = pages_used +
:delta RETURNING pages_used, pages_limit) and use the returned values to enforce
the limit check and return the new usage; ensure the transaction is committed
and the session refreshed only after the locked/atomic operation completes.
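
For completeness, a hedged sketch of option (2) from the prompt above: a single atomic UPDATE ... RETURNING in SQLAlchemy, with the quota check folded into the WHERE clause so no row lock is needed. Illustrative only; the committed fix may differ:

```python
from sqlalchemy import select, update

# (method of PageLimitService)
async def update_page_usage(self, user_id, pages_to_add: int, allow_exceed: bool = False) -> int:
    stmt = (
        update(User)
        .where(User.id == user_id)
        .values(pages_used=User.pages_used + pages_to_add)
        .returning(User.pages_used)
    )
    if not allow_exceed:
        # The limit check rides along in the statement, so read-check-write
        # happens atomically on the database side.
        stmt = stmt.where(User.pages_used + pages_to_add <= User.pages_limit)

    row = (await self.session.execute(stmt)).first()
    await self.session.commit()

    if row is None:
        # Nothing was updated: the user is missing, or the increment would
        # exceed the limit. Re-select to build the detailed error.
        user = (
            await self.session.execute(select(User).where(User.id == user_id))
        ).scalar_one_or_none()
        if user is None:
            raise ValueError(f"User with ID {user_id} not found")
        raise PageLimitExceededError(
            message=f"Cannot update page usage. Would exceed limit. "
            f"Current: {user.pages_used}/{user.pages_limit}, "
            f"Trying to add: {pages_to_add}",
            pages_used=user.pages_used,
            pages_limit=user.pages_limit,
            pages_to_add=pages_to_add,
        )
    return row[0]
```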

@MODSetter MODSetter merged commit a09309a into main Oct 30, 2025
7 of 9 checks passed
okapteinis pushed a commit to okapteinis/SurfSense that referenced this pull request Nov 21, 2025
feat: added file limit tracking for a user