Skip to content

Conversation

@codeBunny2022
Copy link
Contributor

@codeBunny2022 codeBunny2022 commented Nov 1, 2025

Problem

PDF uploads were halting when processing files, especially around 34MB in size, due to repeated pdfminer warnings:

These warnings would spam the logs and cause the upload process to hang or fail, requiring users to delete and reupload files.

Solution

  • Added warning suppression for pdfminer warnings during Docling PDF processing
  • Suppresses specific warnings: 'Cannot set gray non-stroke color' and 'invalid float value'
  • Temporarily sets pdfminer logger to ERROR level during document processing to prevent log spam
  • Restores original logging level after processing completes

Changes

  • Modified surfsense_backend/app/tasks/document_processors/file_processors.py to wrap Docling processing with warning suppression
  • Added logging level management to suppress pdfminer warnings without affecting other logs
  • Also fixed pyproject.toml setuptools configuration for proper package discovery

Testing

  • Tested with 40MB PDF that was previously failing
  • Processing completes successfully without warning spam
  • Error logging remains intact (only warnings are suppressed)

Fixes issue where PDF files around 34MB would fail to upload due to pdfminer warning spam.

High-level PR Summary

This PR fixes an issue where PDF uploads were failing or hanging when processing files around 34MB in size due to excessive warning messages from the pdfminer library. The solution wraps the Docling PDF processing with warning suppression logic that temporarily filters out specific harmless warnings ('Cannot set gray non-stroke color' and 'invalid float value') and raises the pdfminer logger level to ERROR during processing, then restores the original logging level afterwards. The PR also includes an unrelated fix to the pyproject.toml setuptools configuration for proper package discovery.

⏱️ Estimated Review Time: 5-15 minutes

💡 Review Order Suggestion
Order File Path
1 surfsense_backend/app/tasks/document_processors/file_processors.py
2 surfsense_backend/pyproject.toml
⚠️ Inconsistent Changes Detected
File Path Warning
surfsense_backend/pyproject.toml This file contains setuptools/build system configuration changes that are unrelated to the PDF upload warning suppression fix described in the PR

Need help? Join our Discord

Analyze latest changes

Summary by CodeRabbit

  • Improvements

    • Document processing now runs with less noisy warning output and more stable logging during processing.
  • Chores

    • Updated packaging and build configuration to streamline project build and distribution.

- Added warning suppression for pdfminer warnings during Docling PDF processing
- Suppresses 'Cannot set gray non-stroke color' warnings that cause uploads to halt
- Temporarily sets pdfminer logger to ERROR level during document processing
- Fixes issue where files ~34MB would fail due to pdfminer warning spam

Resolves issue where PDF uploads would halt with repeated pdfminer warnings
@vercel
Copy link

vercel bot commented Nov 1, 2025

@codeBunny2022 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link

coderabbitai bot commented Nov 1, 2025

Walkthrough

The changes add a suppression block around Docling PDF processing to silence pdfminer warnings and temporarily set its logger to ERROR, and introduce build-system and setuptools configuration in pyproject.toml for packaging metadata.

Changes

Cohort / File(s) Summary
Document Processing Warning Suppression
surfsense_backend/app/tasks/document_processors/file_processors.py
Wraps the docling_service.process_document call in a warnings suppression context: imports warnings and getLogger, filters pdfminer-related UserWarnings, lowers the pdfminer logger to ERROR during processing, and restores the original logger level afterward.
Build System Configuration
surfsense_backend/pyproject.toml
Adds a [build-system] section with requires and build-backend; adds [tool.setuptools] with packages and include-package-data; adds [tool.setuptools.package-data] mapping app to include all files.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FileProcessor
    participant Warnings
    participant PdfminerLogger
    participant DoclingService

    Caller->>FileProcessor: process_document(...)
    rect rgb(245, 250, 255)
    Note over FileProcessor,DoclingService: Suppress pdfminer warnings & raise logger level
    FileProcessor->>Warnings: filterwarnings(ignore, category=UserWarning, message=pdfminer...)
    FileProcessor->>PdfminerLogger: setLevel(ERROR) (save old level)
    FileProcessor->>DoclingService: process_document(...)
    DoclingService-->>FileProcessor: result / error
    FileProcessor->>PdfminerLogger: restore(saved level)
    end
    FileProcessor-->>Caller: return result
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Attention points:
    • Ensure logger level is always restored (exception safety / finally block) in file_processors.py.
    • Verify the warning filter scope is narrow enough to avoid hiding unrelated warnings.
    • Confirm pyproject.toml entries match repository packaging expectations.

Poem

🐰 Hush, little pdfminer, silence your tune,

I tuck your warnings away under the moon.
Logger set gentle, packages in place,
Docs processed quietly, hopping with grace.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "fix: suppress pdfminer warnings to prevent upload halting" directly and accurately summarizes the main change in this pull request. The primary modification in file_processors.py wraps Docling document processing with warning suppression for pdfminer-related warnings to resolve the issue where PDF uploads were failing, which is precisely what the title conveys. The title is concise, specific, and avoids vague terminology—it clearly identifies both the technical action (suppress warnings) and the user-facing benefit (prevent upload halting). The secondary change to pyproject.toml is supporting configuration that doesn't alter the core purpose. A teammate reviewing the git history would immediately understand that this PR addresses pdfminer warning-related upload failures without needing additional context.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 094bdfa and b3026e4.

📒 Files selected for processing (1)
  • surfsense_backend/app/tasks/document_processors/file_processors.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
surfsense_backend/app/tasks/document_processors/file_processors.py (1)
surfsense_backend/app/services/docling_service.py (1)
  • process_document (167-223)
🔇 Additional comments (2)
surfsense_backend/app/tasks/document_processors/file_processors.py (2)

7-8: LGTM: Import additions support the warning suppression functionality.

The imports are appropriate for managing pdfminer warnings and logging levels during document processing.


896-922: Implementation is safe under default Celery configuration.

Verification confirms the logger modification is safe: Celery uses the prefork pool by default, which isolates each worker as a separate process. The example configuration in docker-compose.yml also specifies --pool=solo (single-process pool), neither of which would cause concurrent tasks to interfere with the global logger state.

The code correctly:

  • Uses try-finally to restore the original logger level, ensuring cleanup even if processing fails
  • Leverages NullPool for database connections in Celery tasks (best practice)
  • Properly suppresses pdfminer warnings that can halt processing

The implementation addresses the reported issue with appropriate error handling and process isolation.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@recurseml recurseml bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by RecurseML

🔍 Review performed on 57fd82f..094bdfa

✨ No bugs found, your code is sparkling clean

✅ Files analyzed, no issues (1)

surfsense_backend/pyproject.toml

Copy link

@recurseml recurseml bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by RecurseML

🔍 Review performed on 094bdfa..094bdfa

✨ No files to analyze

Copy link

@recurseml recurseml bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by RecurseML

🔍 Review performed on 094bdfa..094bdfa

✨ No files to analyze

@codeBunny2022 codeBunny2022 marked this pull request as draft November 3, 2025 03:27
@codeBunny2022 codeBunny2022 marked this pull request as ready for review November 3, 2025 03:27
Copy link
Owner

@MODSetter MODSetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove pyproject.toml changes and everything else looks good to me 👍

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codeBunny2022 This is causing docker build to fail. Please revert these changes.

Err Log

10.68 Downloading av (38.7MiB)
12.60   × Failed to build `surf-new-backend @ file:///app`
12.60   ├─▶ The build backend returned an error
12.60   ╰─▶ Call to `setuptools.build_meta.build_editable` failed (exit status: 1)
12.60
12.60       [stdout]
12.60       running egg_info
12.60       creating surf_new_backend.egg-info
12.60       writing surf_new_backend.egg-info/PKG-INFO
12.60       writing dependency_links to
12.60       surf_new_backend.egg-info/dependency_links.txt
12.60       writing requirements to surf_new_backend.egg-info/requires.txt
12.60       writing top-level names to surf_new_backend.egg-info/top_level.txt
12.60       writing manifest file 'surf_new_backend.egg-info/SOURCES.txt'
12.60
12.60       [stderr]
12.60       /tmp/.tmp7SLZ0s/builds-v0/.tmprY3Eyn/lib/python3.12/site-packages/setuptools/config/expand.py:128:
12.60       SetuptoolsWarning: File '/app/README.md' cannot be found
12.60         for path in _filter_existing_files(_filepaths)
12.60       error: package directory 'app' does not exist
12.60
12.60       hint: This usually indicates a problem with the package or the build
12.60       environment.
------
failed to solve: process "/bin/sh -c pip install --no-cache-dir uv &&     uv pip install --system --no-cache-dir -e ." did not complete successfully: exit code: 1

@codeBunny2022
Copy link
Contributor Author

i have reverted the changes in pyproject.toml, it should be good to go now

@MODSetter
Copy link
Owner

@codeBunny2022 Thanks for your work 👍

@MODSetter MODSetter merged commit 97ed791 into MODSetter:main Nov 4, 2025
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants