
[FEATURE] Add web crawling capability to allow users to add websites to their knowledge base #468

@samkul-swe


Feature Description

Allow users to add any website URL to their knowledge base. The system automatically crawls the website using the FireCrawl API, extracts the content in clean Markdown format, and makes it searchable.

Target Deployment

  • [ ] SurfSense Cloud (hosted version)
  • [x] Self-hosted version
  • [ ] Both

Problem Statement

Users couldn't easily add content from websites to their knowledge base. They would have to manually copy and paste content, or save pages as PDFs first. This made it hard to quickly build up their knowledge base with web resources like documentation, articles, blog posts, or research papers.

Proposed Solution

Add a URL crawling feature with the following flow (a rough code sketch follows the list):

  1. Paste any website URL into SurfSense
  2. System automatically crawls the URL using FireCrawl API
  3. Extracts clean content in markdown format (no ads, navigation, or clutter)
  4. Saves page metadata (title, description, language)
  5. Indexes the content so users can search it later
  6. Provides a fallback crawler for users without FireCrawl API access
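
A rough sketch of how the backend side of this flow could look, assuming the firecrawl-py SDK's `FirecrawlApp.scrape_url` interface (exact class and method names vary between firecrawl-py releases) and using `httpx` plus `markdownify` as a hypothetical fallback crawler; all function and field names here are illustrative, not SurfSense's actual implementation:

```python
# Hypothetical crawl step; names are illustrative, not SurfSense's actual code.
from dataclasses import dataclass

import httpx
from markdownify import markdownify  # assumed fallback HTML -> Markdown converter


@dataclass
class CrawledPage:
    url: str
    markdown: str
    title: str | None = None
    description: str | None = None
    language: str | None = None


def crawl_url(url: str, firecrawl_api_key: str | None = None) -> CrawledPage:
    """Crawl a URL with FireCrawl when a key is configured, else use the fallback."""
    if firecrawl_api_key:
        # FireCrawl path: JS-rendered crawl returning clean Markdown plus metadata.
        # NOTE: the client class / call signature differs across firecrawl-py
        # versions; this assumes the FirecrawlApp.scrape_url style.
        from firecrawl import FirecrawlApp

        app = FirecrawlApp(api_key=firecrawl_api_key)
        doc = app.scrape_url(url, formats=["markdown"])
        meta = doc.metadata or {}
        return CrawledPage(
            url=url,
            markdown=doc.markdown or "",
            title=meta.get("title"),
            description=meta.get("description"),
            language=meta.get("language"),
        )

    # Fallback path: plain HTTP fetch converted to Markdown (no JS rendering,
    # so results are rougher than the FireCrawl path).
    response = httpx.get(url, follow_redirects=True, timeout=30.0)
    response.raise_for_status()
    return CrawledPage(url=url, markdown=markdownify(response.text))
```

The crawled Markdown and metadata would then be handed off to the existing document-processing pipeline for chunking and indexing.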

Benefits

  • ✅ Quick way to add web content to knowledge base
  • ✅ Gets clean, readable content without ads or clutter
  • ✅ Works with modern websites that use JavaScript
  • ✅ Extracts useful metadata automatically (titles, descriptions)
  • ✅ No manual copying and pasting needed
  • ✅ Works for documentation, articles, blogs, research papers, etc.

Use Case Examples

  1. Developer saving documentation: A developer finds helpful API documentation at https://docs.example.com/api/quickstart, pastes the URL into SurfSense, and the content is crawled and added to their knowledge base for future reference.
  2. Student researching a topic: A student finds a useful article, adds the URL, and can then search and reference it later alongside their other materials.
  3. Team building a knowledge base: Team members can quickly add relevant blog posts, guides, and resources by just sharing URLs.

Implementation Considerations

  • This may require frontend changes - Yes (add URL input field)
  • This may require backend changes - Yes (crawling logic, processing pipeline)
  • This may require database changes - Yes (store crawled documents; see the model sketch after this list)
  • This may affect existing features - No
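
Since crawled documents need to be persisted, here is a minimal sketch of what the stored record might look like, written in SQLAlchemy 2.0 style; the table and column names are assumptions for illustration and not taken from SurfSense's actual schema:

```python
# Hypothetical shape of a stored crawled document; names are illustrative only.
from datetime import datetime, timezone

from sqlalchemy import DateTime, String, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class CrawledDocument(Base):
    __tablename__ = "crawled_documents"  # assumed table name

    id: Mapped[int] = mapped_column(primary_key=True)
    url: Mapped[str] = mapped_column(String(2048), index=True)
    title: Mapped[str | None] = mapped_column(String(512))
    description: Mapped[str | None] = mapped_column(Text)
    language: Mapped[str | None] = mapped_column(String(16))
    content_markdown: Mapped[str] = mapped_column(Text)
    created_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), default=lambda: datetime.now(timezone.utc)
    )
```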

Requirements:

  • FireCrawl API key (optional, but recommended for best results)
  • firecrawl-py library version 4.5.0+
  • Background task processing (Celery) - see the task sketch after this list
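
Because the crawl runs out-of-band, a hedged sketch of how it could be wired into a Celery task follows; the broker URL, task options, and the `save_and_index` helper are placeholders, and `crawl_url` is the function from the earlier crawl sketch:

```python
# Hypothetical Celery wiring for the background crawl; placeholders throughout.
from celery import Celery

celery_app = Celery("surfsense", broker="redis://localhost:6379/0")  # assumed broker


@celery_app.task(bind=True, max_retries=3, default_retry_delay=30)
def crawl_and_index_url(self, url: str, firecrawl_api_key: str | None = None) -> None:
    """Crawl the URL in the background, then persist and index the result."""
    try:
        page = crawl_url(url, firecrawl_api_key)  # from the crawl sketch above
    except Exception as exc:
        # Retry transient failures (network errors, FireCrawl rate limits).
        raise self.retry(exc=exc)

    # Placeholder persistence + indexing step: e.g. write a CrawledDocument row,
    # then push the Markdown into the search index.
    save_and_index(page)  # hypothetical helper
```

The API endpoint that accepts the URL would simply enqueue `crawl_and_index_url.delay(url, api_key)` and return immediately, so the user is not blocked while the page is fetched.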

Checklist

  • I have searched existing issues/feature requests to ensure this is not a duplicate
  • I have provided a clear description of the feature
  • I have added appropriate labels (enhancement, deployment type)

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
