feat: added file limit tracking for a user #455
Conversation
Walkthrough

Introduces a page limit tracking system for users. Adds a database migration and model fields (pages_limit, pages_used) to track page usage. Implements PageLimitService to manage quotas and estimate page counts for various document formats. Integrates checks throughout the document processing pipeline to prevent exceeding limits.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Task as Celery Task
    participant Estimator as Page Estimator
    participant Service as PageLimitService
    participant DB as Database
    participant ETL as ETL Processor

    User->>Task: Upload file
    Task->>Estimator: estimate_pages_before_processing()
    Estimator-->>Task: estimated_pages
    Task->>Service: check_page_limit(user_id, estimated_pages)
    Service->>DB: Get user pages_used, pages_limit
    alt Limit exceeded
        Service-->>Task: PageLimitExceededError
        Task-->>User: HTTP 403
    else Within limit
        Service-->>Task: (True, pages_used, pages_limit)
        Task->>ETL: Process document
        ETL-->>Task: parsed_documents
        Task->>Estimator: estimate_pages_from_parsed()
        Estimator-->>Task: actual_pages
        rect rgb(200, 220, 240)
            Note over Task,DB: Update usage only if<br/>documents created
            Task->>Service: update_page_usage(user_id, final_pages)
            Service->>DB: Increment pages_used
            DB-->>Service: new_usage
        end
        Task-->>User: Success + metrics
    end
```
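For orientation, here is a minimal sketch (in Python) of how a task might wire these calls together, assuming the service interface shown in the diagram (`check_page_limit` raises `PageLimitExceededError`; `update_page_usage` increments and returns the new counter). The `run_etl` and `estimate_pages_before_processing` stubs are hypothetical stand-ins, not the PR's actual helpers:

```python
# A minimal sketch, assuming the PageLimitService interface shown in the
# diagram above. The estimator and ETL stubs are hypothetical stand-ins.
from app.services.page_limit_service import PageLimitService  # path as listed in this review


def estimate_pages_before_processing(file_bytes: bytes) -> int:
    # Hypothetical pre-check: a crude size-based guess before any ETL work.
    return max(1, len(file_bytes) // 50_000)


async def run_etl(file_bytes: bytes) -> list[str]:
    # Hypothetical stand-in for the Unstructured/LlamaCloud/Docling call.
    return [file_bytes.decode(errors="ignore")]


async def process_file(session, user_id, file_bytes: bytes) -> dict:
    service = PageLimitService(session)
    estimated = estimate_pages_before_processing(file_bytes)

    # Reject before any expensive ETL/API call if the quota would be exceeded;
    # raises PageLimitExceededError, which the task maps to HTTP 403.
    await service.check_page_limit(user_id, estimated)

    parsed = await run_etl(file_bytes)
    actual = len(parsed)  # stand-in for estimate_pages_from_parsed()

    # Charge quota only after content was actually processed;
    # allow_exceed=True because the upload already passed the initial check.
    new_usage = await service.update_page_usage(user_id, actual, allow_exceed=True)
    return {"pages_processed": actual, "pages_used": new_usage}
```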
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Review by RecurseML
🔍 Review performed on 5654f6c..4be9d09
✨ No bugs found, your code is sparkling clean
✅ Files analyzed, no issues (7)
• surfsense_backend/alembic/versions/33_add_page_limits_to_user.py
• surfsense_backend/app/db.py
• surfsense_backend/app/services/page_limit_service.py
• surfsense_backend/app/tasks/celery_tasks/document_tasks.py
• surfsense_backend/app/tasks/document_processors/file_processors.py
• surfsense_backend/pyproject.toml
• surfsense_backend/uv.lock
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
surfsense_backend/app/tasks/document_processors/file_processors.py (1)
708-875: Avoid charging duplicates against the user's page quota.
`add_received_file_document_using_unstructured` (and the LlamaCloud/Docling helpers) returns an existing `Document` object when the content hash is unchanged, so `result` is still truthy for pure duplicates. With the new quota logic, every re-upload of the same file now consumes the user's page allowance even though nothing new was persisted. The same pattern recurs in the LlamaCloud and Docling branches below, so the bug is systemic.

Please only call `update_page_usage` when we actually created or updated content. One way is to have each helper return a `(document, created)` tuple (or `None` for duplicates) and gate the quota update accordingly, e.g.:

```diff
-        result = await add_received_file_document_using_unstructured(
-            session, filename, docs, search_space_id, user_id
-        )
-
-        if result:
+        result, created = await add_received_file_document_using_unstructured(
+            session, filename, docs, search_space_id, user_id
+        )
+
+        if created:
             # Update page usage after successful processing
             # allow_exceed=True because document was already created after passing initial check
             await page_limit_service.update_page_usage(
                 user_id, final_page_count, allow_exceed=True
             )
             await task_logger.log_task_success(
                 log_entry,
                 f"Successfully processed file with Unstructured: {filename}",
                 {
                     "document_id": result.id,
                     "content_hash": result.content_hash,
                     "file_type": "document",
                     "etl_service": "UNSTRUCTURED",
                     "pages_processed": final_page_count,
                 },
             )
         else:
             await task_logger.log_task_success(
                 log_entry,
                 f"Document already exists (duplicate): {filename}",
                 {
                     "duplicate_detected": True,
                     "file_type": "document",
                     "etl_service": "UNSTRUCTURED",
                 },
             )
```

And adjust the helpers to return the new shape, for example:

```python
async def add_received_file_document_using_unstructured(...) -> tuple[Document | None, bool]:
    ...
    if existing_document and existing_document.content_hash == content_hash:
        logging.info("Document for file %s unchanged. Skipping.", file_name)
        return existing_document, False
    ...
    return document, True
```

Please mirror the same contract in the LlamaCloud and Docling paths so duplicates never burn quota.
🧹 Nitpick comments (1)
surfsense_backend/alembic/versions/33_add_page_limits_to_user.py (1)
36-57: Consider adding database constraints for data integrity.

The columns are added without database-level validation, which allows invalid states such as `pages_used > pages_limit` or negative values. Adding CHECK constraints would enforce the business rules at the database level. Consider applying this diff:

```diff
     # Add pages_limit column if it doesn't exist
     if "pages_limit" not in user_columns:
         op.add_column(
             "user",
             sa.Column(
                 "pages_limit",
                 sa.Integer(),
                 nullable=False,
                 server_default="500",
             ),
         )
+        # Add constraint to ensure pages_limit is positive
+        op.create_check_constraint(
+            "ck_user_pages_limit_positive",
+            "user",
+            "pages_limit > 0",
+        )

     # Add pages_used column if it doesn't exist
     if "pages_used" not in user_columns:
         op.add_column(
             "user",
             sa.Column(
                 "pages_used",
                 sa.Integer(),
                 nullable=False,
                 server_default="0",
             ),
         )
+        # Add constraint to ensure pages_used is non-negative
+        op.create_check_constraint(
+            "ck_user_pages_used_non_negative",
+            "user",
+            "pages_used >= 0",
+        )
+        # Add constraint to ensure pages_used doesn't exceed pages_limit
+        op.create_check_constraint(
+            "ck_user_pages_used_within_limit",
+            "user",
+            "pages_used <= pages_limit",
+        )
```

And update the downgrade function to drop these constraints:

```diff
     # Drop columns if they exist
     if "pages_used" in user_columns:
+        op.drop_constraint("ck_user_pages_used_within_limit", "user")
+        op.drop_constraint("ck_user_pages_used_non_negative", "user")
         op.drop_column("user", "pages_used")
     if "pages_limit" in user_columns:
+        op.drop_constraint("ck_user_pages_limit_positive", "user")
         op.drop_column("user", "pages_limit")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`surfsense_backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
• surfsense_backend/alembic/versions/33_add_page_limits_to_user.py (1 hunks)
• surfsense_backend/app/db.py (2 hunks)
• surfsense_backend/app/services/page_limit_service.py (1 hunks)
• surfsense_backend/app/tasks/celery_tasks/document_tasks.py (1 hunks)
• surfsense_backend/app/tasks/document_processors/file_processors.py (11 hunks)
• surfsense_backend/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
surfsense_backend/app/tasks/celery_tasks/document_tasks.py (2)
• surfsense_backend/app/services/page_limit_service.py (1): PageLimitExceededError (12-27)
• surfsense_backend/app/services/task_logging_service.py (1): log_task_failure (107-162)

surfsense_backend/app/services/page_limit_service.py (1)
• surfsense_backend/app/db.py (2): User (392-405), User (409-419)

surfsense_backend/app/tasks/document_processors/file_processors.py (2)
• surfsense_backend/app/services/page_limit_service.py (7): PageLimitExceededError (12-27), PageLimitService (30-401), check_page_limit (36-77), estimate_pages_from_elements (147-179), update_page_usage (79-123), estimate_pages_from_markdown (181-211), estimate_pages_from_content_length (213-224)
• surfsense_backend/app/services/task_logging_service.py (2): log_task_failure (107-162), log_task_success (60-105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Python Backend Quality
surfsense_backend/app/services/page_limit_service.py (excerpt under review):

```python
# Get user
result = await self.session.execute(select(User).where(User.id == user_id))
user = result.scalar_one_or_none()

if not user:
    raise ValueError(f"User with ID {user_id} not found")

# Check if this would exceed limit (only if allow_exceed is False)
new_usage = user.pages_used + pages_to_add
if not allow_exceed and new_usage > user.pages_limit:
    raise PageLimitExceededError(
        message=f"Cannot update page usage. Would exceed limit. "
        f"Current: {user.pages_used}/{user.pages_limit}, "
        f"Trying to add: {pages_to_add}",
        pages_used=user.pages_used,
        pages_limit=user.pages_limit,
        pages_to_add=pages_to_add,
    )

# Update usage
user.pages_used = new_usage
await self.session.commit()
await self.session.refresh(user)

return user.pages_used
```
Lock the user row before incrementing page usage.
`update_page_usage` reads `pages_used`, adds the delta, and writes it back with no locking. If two Celery workers finish at the same time, they both see the same starting usage and the second commit overwrites the first, effectively dropping an increment and letting users bypass their limit. Please lock the row (or use an atomic `UPDATE … SET pages_used = pages_used + :delta RETURNING …`) before computing the new total, e.g.:
```diff
-    result = await self.session.execute(select(User).where(User.id == user_id))
+    result = await self.session.execute(
+        select(User).where(User.id == user_id).with_for_update()
+    )
     user = result.scalar_one_or_none()
@@
-    user.pages_used = new_usage
+    user.pages_used = new_usage
     await self.session.commit()
     await self.session.refresh(user)
     return user.pages_used
```

That guarantees concurrent workers serialize on the same row instead of losing updates.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```python
# Get user
result = await self.session.execute(
    select(User).where(User.id == user_id).with_for_update()
)
user = result.scalar_one_or_none()

if not user:
    raise ValueError(f"User with ID {user_id} not found")

# Check if this would exceed limit (only if allow_exceed is False)
new_usage = user.pages_used + pages_to_add
if not allow_exceed and new_usage > user.pages_limit:
    raise PageLimitExceededError(
        message=f"Cannot update page usage. Would exceed limit. "
        f"Current: {user.pages_used}/{user.pages_limit}, "
        f"Trying to add: {pages_to_add}",
        pages_used=user.pages_used,
        pages_limit=user.pages_limit,
        pages_to_add=pages_to_add,
    )

# Update usage
user.pages_used = new_usage
await self.session.commit()
await self.session.refresh(user)

return user.pages_used
```
🤖 Prompt for AI Agents
In surfsense_backend/app/services/page_limit_service.py around lines 99 to 123,
the method reads pages_used, computes a new value and commits without any row
locking which allows concurrent workers to overwrite increments; change this to
perform the update atomically or lock the row: either (1) start a transaction
and re-select the user row with a FOR UPDATE (SQLAlchemy
select(...).with_for_update()) before computing and writing pages_used so
concurrent transactions serialize on the row, or (2) issue an atomic UPDATE
statement that increments pages_used (UPDATE ... SET pages_used = pages_used +
:delta RETURNING pages_used, pages_limit) and use the returned values to enforce
the limit check and return the new usage; ensure the transaction is committed
and the session refreshed only after the locked/atomic operation completes.
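For the second option the prompt describes, here is a minimal sketch of the atomic-UPDATE variant, assuming the `User` model from `app/db.py`; it illustrates the pattern and is not code from the PR:

```python
# Sketch of the atomic-UPDATE alternative: the database performs the
# increment, so concurrent workers cannot lose updates. Assumes the
# User model from app/db.py with pages_used/pages_limit columns.
from sqlalchemy import update
from app.db import User


async def update_page_usage_atomic(session, user_id, pages_to_add: int) -> int:
    stmt = (
        update(User)
        .where(User.id == user_id)
        # To enforce the quota in the same statement, an extra predicate
        # such as User.pages_used + pages_to_add <= User.pages_limit
        # can be added here (a missing row then means "not found or over limit").
        .values(pages_used=User.pages_used + pages_to_add)
        .returning(User.pages_used, User.pages_limit)
    )
    row = (await session.execute(stmt)).one_or_none()
    if row is None:
        raise ValueError(f"User with ID {user_id} not found")
    await session.commit()
    return row.pages_used
```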
feat: added file limit tracking for a user
Description
feat: added file limit tracking for a user
Motivation and Context
FIX #
Screenshots
API Changes
Change Type
Testing Performed
Checklist
High-level PR Summary
This PR implements file/page limit tracking for users by adding `pages_limit` and `pages_used` columns to the user table. The system now estimates page counts from documents before processing them through ETL services (Unstructured, LlamaCloud, Docling) to prevent users from exceeding their allocated page limits. When a user attempts to process a document that would exceed their limit, the system rejects it before making expensive API calls. After successful processing, the actual page count is tracked and the user's `pages_used` counter is incremented. The feature includes a comprehensive page estimation service that handles various file formats (PDF, Word, Excel, images, etc.) and provides detailed error messages when limits are exceeded.

⏱️ Estimated Review Time: 30-90 minutes
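As an illustration of what a content-length heuristic in such an estimation service might look like, here is a minimal sketch; the characters-per-page constant and the function body are assumptions for this example, not values taken from the PR:

```python
# Sketch of a content-length page heuristic. CHARS_PER_PAGE is an
# assumed density for illustration; the PR's actual thresholds may differ.
import math

CHARS_PER_PAGE = 3000  # hypothetical characters per page of extracted text


def estimate_pages_from_content_length(content: str) -> int:
    """Estimate a page count from raw text length, never less than 1."""
    if not content:
        return 1  # assumption: even an empty document counts as one page
    return max(1, math.ceil(len(content) / CHARS_PER_PAGE))
```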
💡 Review Order Suggestion

1. surfsense_backend/app/db.py
2. surfsense_backend/alembic/versions/33_add_page_limits_to_user.py
3. surfsense_backend/app/services/page_limit_service.py
4. surfsense_backend/pyproject.toml
5. surfsense_backend/uv.lock
6. surfsense_backend/app/tasks/document_processors/file_processors.py
7. surfsense_backend/app/tasks/celery_tasks/document_tasks.py

Summary by CodeRabbit
New Features
Chores