E-mail processing system with intelligent natural language parsing
A robust Ruby on Rails application that processes .eml email files from multiple vendors, extracting structured customer data using an intelligent, vendor-specific parsing system.
Built with scalability, maintainability, and real-world use cases in mind.
Key Features & Differentiators
- ✅ Bulk Upload: Process multiple .eml files simultaneously (ideal for real-world scenarios)
- ✅ Real-time Dashboard: Live statistics showing total emails, success/failure rates, and customer count
- ✅ Auto-refresh: Pages automatically update when emails are being processed
- ✅ Modern UI: Gradient design, animations, drag & drop support, and responsive layout
- ✅ Smart Status Display: Three-state system (Pending/Success/Failed) prevents user confusion
- ✅ Modal Data Viewing: Clean interface with popup windows for detailed data inspection
Unlike basic regex parsers, our system uses multi-strategy natural language parsing to extract product codes even when customers don't follow a fixed format:
- ✅ Structured formats: "Produto: ABC123", "Código: XYZ789"
- ✅ Natural language: "interessado no produto de código ABC123"
- ✅ Subject line extraction: Automatically parses subjects like "Pedido - Produto XYZ987"
- ✅ Intelligent fallback: Pattern recognition for standalone codes (e.g., ABC123, PROD-999)
- ✅ 100% extraction rate on real-world test data
Why this matters: Most competitors fail when customers don't follow exact formats.
Our system handles real human communication, significantly reducing manual intervention.
Architecture
- Strategy Pattern implementation for vendor-specific parsers
- Async processing with Sidekiq for high-throughput scenarios
- SOLID principles - easily extend without modifying existing code
- Comprehensive logging with automatic retention policies
- UTF-8 encoding handling to prevent common parsing errors
- CI/CD pipeline with GitHub Actions
- Test coverage (RSpec)
- Secure Sidekiq web interface with authentication
- Automatic data cleanup with configurable retention policies
Screenshot gallery: Customers Interface | Main Dashboard | System Logs | File Upload
- Technologies
- Quick Start
- Architecture Deep Dive
- Intelligent Parser System
- Usage
- Testing
- API & Integration
- Deployment
- Contributing
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Backend | Ruby | 3.4.5 | Application runtime |
| Framework | Rails | 7.2.3 | Web framework |
| Database | PostgreSQL | 15+ | Primary data store with JSONB support |
| Cache/Queue | Redis | 7+ | Job queue & caching |
| Jobs | Sidekiq | 7.3.9 | Async job processing |
| Scheduling | Sidekiq-Cron | 1.12 | Scheduled jobs (log cleanup) |
| Frontend | Bootstrap | 5.3.2 | Responsive UI framework |
| Icons | Bootstrap Icons | 1.11.1 | Modern icon set |
| Email Parsing | Mail Gem | 2.8+ | RFC822 email parsing |
| Storage | Active Storage | - | .eml file management |
| Testing | RSpec | 3.13+ | Comprehensive test suite |
| Containerization | Docker | Latest | Consistent deployment |
- Docker & Docker Compose (recommended)
- OR: Ruby 3.4+, PostgreSQL 15+, Redis 7+
# 1. Clone the repository
git clone https://github.com/bulletdev/EmailProcessorRails.git
cd EmailProcessorRails
# 2. Configure environment (optional - defaults provided)
cp .env.example .env  # assumes an example env file named .env.example; adjust if yours differs
# 3. Build and start all services
docker-compose up --build
# 4. Run database migrations
docker-compose exec -T app bundle exec rails db:migrate
# 5. Access the application
open http://localhost:5999

That's it! The application is now running with:
- Web app: http://localhost:5999
- Sidekiq UI: http://localhost:5999/sidekiq
- PostgreSQL: localhost:5499
- Redis: localhost:6399
Click to expand manual installation steps
# 1. Clone the repository
git clone https://github.com/bulletdev/EmailProcessorRails.git
cd EmailProcessorRails
# 2. Install dependencies
bundle install
# 3. Configure the database connection
# The project uses DATABASE_URL. You have two options:
# Option A: Set the environment variable (recommended)
export DATABASE_URL="postgresql://postgres:postgres@localhost:5499/email_processor_development"
# Option B: Configure the database.yml file (if not using the URL)
# cp config/database.yml.example config/database.yml
# Edit the config/database.yml file with your credentials.
# 4. Create and migrate the database
bundle exec rails db:create db:migrate
# 5. Start Redis (in a separate terminal)
redis-server
# 6. Start Sidekiq (in another separate terminal)
bundle exec sidekiq
# 7. Start the Rails server
bundle exec rails server -p 5999

Architecture Deep Dive

┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ User/API │────▶│ Rails Server │────▶│ PostgreSQL │
└─────────────┘ └──────────────┘ └─────────────┘
│
▼
┌────────────────┐
│ Active Storage │
│ (.eml files) │
└────────────────┘
│
▼
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Sidekiq │◀────│ Redis Queue │◀────│ Process Job │
└─────────────┘ └──────────────┘ └─────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ EmailProcessorService (Context) │
├─────────────────────────────────────────────────────┤
│ • Selects parser based on sender email │
│ • Handles errors gracefully │
│ • Updates logs with detailed status │
└──────────────┬──────────────────────────────────────┘
│
▼
┌──────────────────┐
│ BaseParser │ ◀── Strategy Pattern Interface
│ (Module) │
└──────────────────┘
△
│
┌───────┴────────┐
│ │
┌───▼────────┐ ┌───▼────────┐
│Fornecedor │ │ Parceiro │
│A Parser │ │ B Parser │
└────────────┘ └────────────┘
Each vendor has a dedicated parser implementing the BaseParser interface:
# Context
class EmailProcessorService
PARSERS = {
"[email protected]" => FornecedorAParser,
"[email protected]" => ParceiroBParser
}
end
# Strategy Interface
module BaseParser
def parse(mail_content)
# Template method defining parsing flow
end
end
# Concrete Strategies
class FornecedorAParser
include BaseParser
# Vendor-specific extraction logic
end

Benefits:
- ✅ Open/Closed Principle - add new vendors without modifying existing code
- ✅ Single Responsibility - each parser handles one vendor's format
- ✅ Easy testing - mock/test parsers independently
BaseParser defines the structure of the parsing algorithm; subclasses implement the specific steps:
def parse(mail_content)
mail = Mail.read_from_string(mail_content)
{
name: extract_name(mail), # ← Subclass implements
email: extract_email(mail), # ← Subclass implements
phone: extract_phone(mail), # ← Subclass implements
product_code: extract_product_code(mail), # ← Subclass implements
subject: mail.subject
}
end

Background processing with automatic retries and monitoring:
class ProcessEmailJob < ApplicationJob
queue_as :default
sidekiq_options retry: 3
def perform(email_log_id)
# Async processing with error handling
end
end

1. User uploads .eml file
↓
2. EmailLog created (status: pending) + File stored in Active Storage
↓
3. ProcessEmailJob enqueued to Sidekiq
↓
4. Job picks up email_log_id from queue
↓
5. EmailProcessorService.process(email_log)
↓
6. Select parser based on sender email
↓
7. Parser extracts structured data
↓
8. Customer record created
↓
9. EmailLog updated (status: success/failed)
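Steps 1–3 of this flow can be pictured with a minimal controller sketch. This is illustrative only, under the assumption of a conventional Rails setup; the real EmailsController and its flash message wording may differ.

```ruby
# app/controllers/emails_controller.rb — illustrative sketch of steps 1–3 above
class EmailsController < ApplicationController
  def create
    files = Array(params[:eml_files])

    files.each do |file|
      # Step 2: create the log record (pending) and store the .eml file
      email_log = EmailLog.create!(filename: file.original_filename, status: :pending)
      email_log.eml_file.attach(file)

      # Step 3: enqueue background processing via Sidekiq
      ProcessEmailJob.perform_later(email_log.id)
    end

    redirect_to email_logs_path, notice: "#{files.size} email(s) uploaded successfully!"
  end
end
```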
Traditional email parsers fail when users don't follow exact formats:
# ❌ Traditional approach - only works with exact format
/Produto:\s*([A-Z0-9\-]+)/i

This fails on:
- "interessado no produto de código ABC123"
- "Preciso de informações sobre XYZ987"
- Subject: "Pedido - Produto LMN456"
Our intelligent parser uses cascading pattern matching, trying six patterns across four extraction strategies:
def extract_product_code(mail)
# Strategy 1: Structured formats (highest priority)
extract_from_body(mail, /Produto:\s*([A-Z0-9\-]+)/i) ||
extract_from_body(mail, /Código:\s*([A-Z0-9\-]+)/i) ||
# Strategy 2: Natural language patterns
extract_from_body(mail, /produto\s+de\s+código\s+([A-Z0-9\-]+)/i) ||
extract_from_body(mail, /produto\s+([A-Z][A-Z0-9\-]{2,})/i) ||
# Strategy 3: Subject line extraction
extract_from_subject(mail, /Produto\s+([A-Z][A-Z0-9\-]{2,})/i) ||
# Strategy 4: Intelligent fallback - pattern recognition
extract_from_body(mail, /\b([A-Z]{3,}[\-]?\d{3,})\b/)
end

Test Data Performance:
| File | Customer Input | Extracted Code | Strategy Used |
|---|---|---|---|
| email1.eml | "produto de código ABC123" | ✅ ABC123 | Natural Language |
| email2.eml | "interessado no produto XYZ987" | ✅ XYZ987 | Natural Language |
| email3.eml | Subject: "Produto LMN456" | ✅ LMN456 | Subject Line |
| email6.eml | "Produto: PROD-999" | ✅ PROD-999 | Structured |
📊 Extraction Success Rate: 100%
Common issue: "incompatible encoding regexp match (UTF-8 regexp with BINARY string)"
Solution:
def extract_from_body(mail, pattern)
body = mail_body_text(mail)
# Force UTF-8 encoding to prevent errors
body = body.force_encoding('UTF-8') unless body.encoding == Encoding::UTF_8
match = body.match(pattern)
match ? match[1].strip : nil
end

This prevents encoding errors that crash most parsers when handling international characters or different email clients.
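For reference, here is a minimal sketch of what the mail_body_text helper used above might look like. It is a hypothetical implementation (the real helper may handle more cases), preferring the plain-text part of multipart messages:

```ruby
# Hypothetical sketch of the mail_body_text helper referenced above.
def mail_body_text(mail)
  # Prefer the text/plain part of multipart emails; fall back to the raw decoded body.
  part = mail.multipart? ? (mail.text_part || mail.parts.first) : mail
  part.decoded.to_s
end
```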
Step 1: Create a parser class in app/parsers/:
class NewVendorParser
include BaseParser
private
def extract_name(mail)
extract_from_body(mail, /Name:\s*(.+)/i)
end
def extract_email(mail)
extract_from_body(mail, /Email:\s*([^\s]+@[^\s]+)/i)
end
def extract_phone(mail)
extract_from_body(mail, /Phone:\s*([\d\s\-\(\)]+)/i)
end
def extract_product_code(mail)
# Implement vendor-specific patterns
end
end

Step 2: Register it in EmailProcessorService:
PARSERS = {
"[email protected]" => FornecedorAParser,
"[email protected]" => ParceiroBParser,
"[email protected]" => NewVendorParser # ← Add here
}.freeze

That's it! No changes to controllers, jobs, or tests are needed. ✨
URL: / or /dashboard
The main dashboard provides a comprehensive overview:
- Statistics Cards: Total emails, successful, failed, and customer count
- Recent Activity: Last 5 email logs with real-time status
- New Customers: Recently added customers from processed emails
- Quick Actions: Fast access to upload, logs, and customer pages
- Auto-refresh: Automatically updates when emails are being processed
URL: /emails/new
Features:
- Multi-file upload: Select multiple .eml files at once (ideal for batch processing)
- Drag & Drop: Drag files directly into the upload zone
- Progress tracking: Shows count of selected files with full list
- Async processing: All files are queued and processed in parallel via Sidekiq
How to use:
- Navigate to Upload Email
- Click "Choose Files" or drag & drop multiple .eml files
- Review the list of selected files
- Click "Upload and Process"
- Files are processed in background - watch progress in Email Logs
Sample emails available in emails/ and sample_emails/ directories for testing.
URL: /customers
- Paginated list (20 per page)
- Displays: Name, Email, Phone, Product Code, Subject, Creation Date
- Clean, modern table design with Bootstrap Icons
- Empty state with call-to-action when no customers exist
URL: /email_logs
Features:
- Real-time status: Three-state system (Pending/Success/Failed)
- Auto-refresh: Page refreshes every 3 seconds when emails are pending
- Modal data viewing: Click "View Data" to see extracted information in popup
- Error inspection: Click "View Error" to see detailed error messages
- Status filters: Quick filter buttons for All/Success/Failed emails
- Reprocess capability: One-click reprocessing for failed emails
- Responsive table: All columns properly sized, no text cutoff
Status indicators:
- 🟢 Success - Email processed successfully, customer data extracted
- 🔴 Failed - Processing error (view error details for debugging)
- 🟡 Processing - Email currently being processed (with spinner animation)
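The three-state status maps naturally onto a Rails enum. A minimal model sketch follows, assuming an integer status column and that the .eml file is attached via Active Storage; the actual model may differ:

```ruby
# app/models/email_log.rb — minimal sketch, assuming an integer `status` column
class EmailLog < ApplicationRecord
  has_one_attached :eml_file

  enum :status, { pending: 0, success: 1, failed: 2 }
end
```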
URL: /sidekiq
Features:
- Real-time job queue monitoring
- Job statistics and history
- Worker performance metrics
- Scheduled jobs (cron) management
Security:
- Production: Basic HTTP Auth required
- Set SIDEKIQ_USERNAME and SIDEKIQ_PASSWORD env vars
- Development: Open access (no auth)
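A sketch of how the protected mount might look in config/routes.rb (illustrative only; just the relevant mount is shown, and the production-only guard is an assumption about how the app wires it up):

```ruby
# config/routes.rb — illustrative sketch of the authenticated Sidekiq Web UI
require "sidekiq/web"
require "sidekiq/cron/web" # adds the Cron tab for scheduled jobs

if Rails.env.production?
  Sidekiq::Web.use Rack::Auth::Basic do |username, password|
    ActiveSupport::SecurityUtils.secure_compare(username, ENV.fetch("SIDEKIQ_USERNAME", "")) &
      ActiveSupport::SecurityUtils.secure_compare(password, ENV.fetch("SIDEKIQ_PASSWORD", ""))
  end
end

Rails.application.routes.draw do
  mount Sidekiq::Web => "/sidekiq"
end
```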
# Process all test emails
docker-compose exec -T app bundle exec rails runner lib/scripts/process_all_emails.rb
# Test product code extraction specifically
docker-compose exec -T app bundle exec rails runner lib/scripts/test_product_extraction.rb
# View customer data
docker-compose exec -T app bundle exec rails runner lib/scripts/show_customers.rb

# Rails console
docker-compose exec app bundle exec rails console
# Run migrations
docker-compose exec -T app bundle exec rails db:migrate
# Reset database (CAUTION: Deletes all data)
docker-compose exec -T app bundle exec rails db:reset

# Clean up logs older than 90 days (default)
docker-compose exec -T app bundle exec rake email_logs:cleanup
# Custom retention period (60 days)
docker-compose exec -T app bundle exec rake email_logs:cleanup[60]
# View statistics
docker-compose exec -T app bundle exec rake email_logs:stats

Automatic Cleanup:
- Runs daily at 2:00 AM (configurable in config/schedule.yml)
- Default retention: 90 days
- Includes .eml file attachments
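A sketch of how the cron schedule could be wired up via sidekiq-cron (illustrative; the initializer path and the job key name are assumptions, while config/schedule.yml and CleanupEmailLogsJob come from this README):

```ruby
# config/initializers/sidekiq.rb — illustrative sketch of loading the cron schedule
#
# Expected shape of config/schedule.yml (hypothetical key name):
#   cleanup_email_logs:
#     cron: "0 2 * * *"            # daily at 02:00
#     class: "CleanupEmailLogsJob"
schedule_file = Rails.root.join("config", "schedule.yml")

if Sidekiq.server? && File.exist?(schedule_file)
  Sidekiq::Cron::Job.load_from_hash(YAML.load_file(schedule_file))
end
```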
Eight test emails are provided in the emails/ directory:
| File | Vendor | Status | Notes |
|---|---|---|---|
| email1.eml | Fornecedor A | ✅ Success | Natural language product code |
| email2.eml | Fornecedor A | ✅ Success | Product code in sentence |
| email3.eml | Fornecedor A | ✅ Success | Product code in subject |
| email4.eml | Parceiro B | ❌ Expected Fail | Missing contact info |
| email5.eml | Parceiro B | ❌ Expected Fail | Missing name |
| email6.eml | Parceiro B | ✅ Success | Structured format |
| email7.eml | Fornecedor A | ❌ Expected Fail | No email/phone |
| email8.eml | Parceiro B | ❌ Expected Fail | Incomplete data |
docker-compose exec -T app bundle exec rspec

# Parser tests (includes product code extraction)
docker-compose exec -T app bundle exec rspec spec/parsers/
# Service tests
docker-compose exec -T app bundle exec rspec spec/services/
# Model tests
docker-compose exec -T app bundle exec rspec spec/models/
# Job tests
docker-compose exec -T app bundle exec rspec spec/jobs/
# Integration tests
docker-compose exec -T app bundle exec rspec spec/requests/

docker-compose exec -T app bundle exec rspec --format documentation

# Run Rubocop linter
docker-compose exec -T app bundle exec rubocop
# Auto-fix issues
docker-compose exec -T app bundle exec rubocop -A

Test coverage:
- Models: Customer, EmailLog
- Parsers: FornecedorAParser, ParceiroBParser (including natural language extraction)
- Services: EmailProcessorService
- Jobs: ProcessEmailJob, CleanupEmailLogsJob
- Controllers: Customers, EmailLogs, Emails
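As an illustration of the parser coverage, a spec along these lines exercises natural-language extraction. This is a sketch: the spec file name and the way the parser is invoked (instance method per the BaseParser interface shown earlier) are assumptions, while email1.eml and the expected ABC123 code come from the test data table above.

```ruby
# spec/parsers/fornecedor_a_parser_spec.rb — illustrative sketch
require "rails_helper"

RSpec.describe FornecedorAParser do
  it "extracts a product code written in natural language" do
    raw = File.read(Rails.root.join("emails", "email1.eml"))

    result = described_class.new.parse(raw)

    expect(result[:product_code]).to eq("ABC123")
  end
end
```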
curl -X POST http://localhost:5999/emails \
-F "eml_files[]=@emails/email1.eml"curl -X POST http://localhost:5999/emails \
-F "eml_files[]=@emails/email1.eml" \
-F "eml_files[]=@emails/email2.eml" \
-F "eml_files[]=@emails/email3.eml"Response: Redirects to /email_logs with success message showing count of uploaded files
Example response messages:
- Single file: "📧 1 email uploaded successfully! Processing will complete in a few seconds..."
- Multiple files: "📧 10 emails uploaded successfully! Processing will complete in a few seconds..."
curl -X POST http://localhost:5999/emails/{id}/reprocess

GET /customers
GET /customers?page=2

GET /email_logs
GET /email_logs?status=failed

# In Rails console or custom script
email_log = EmailLog.create!(
filename: "customer_inquiry.eml",
status: :pending
)
email_log.eml_file.attach(
io: File.open("path/to/email.eml"),
filename: "customer_inquiry.eml",
content_type: "message/rfc822"
)
# Process synchronously
EmailProcessorService.process(email_log)
# OR process asynchronously (recommended)
ProcessEmailJob.perform_later(email_log.id)
# Check result
email_log.reload
puts email_log.status # => "success" or "failed"
puts email_log.extracted_data # => Hash of extracted fields

docker-compose.prod.yml example:
services:
  app:
    build: .
    environment:
      RAILS_ENV: production
      RAILS_SERVE_STATIC_FILES: "true"
      RAILS_LOG_TO_STDOUT: "true"
      SIDEKIQ_USERNAME: ${SIDEKIQ_USERNAME}
      SIDEKIQ_PASSWORD: ${SIDEKIQ_PASSWORD}
      SECRET_KEY_BASE: ${SECRET_KEY_BASE}
    ports:
      - "80:5000"

Required environment variables for production:
# Rails
SECRET_KEY_BASE=<generate with: rails secret>
RAILS_ENV=production
# Database
DATABASE_URL=postgresql://user:pass@host:5432/db_name
# Redis
REDIS_URL=redis://redis:6379/0
# Sidekiq Auth
SIDEKIQ_USERNAME=admin
SIDEKIQ_PASSWORD=<strong-password>
# Optional
RAILS_MAX_THREADS=5

# Application health
curl http://localhost:5999/up
# Database connectivity
docker-compose exec app bundle exec rails db:migrate:status
# Redis connectivity
docker-compose exec app bundle exec rails runner "puts Sidekiq.redis(&:ping)"

GitHub Actions automatically:
- Runs tests on every push/PR
- Runs Rubocop linter
- Builds Docker image
- Validates docker-compose
Badge: Shows real-time build status in README
- All email processing happens in background jobs
- Non-blocking user experience
- Automatic retry on transient failures (3 attempts)
- Indexed fields: email, phone, status, created_at
- JSONB storage for flexible extracted_data
- GIN index on JSONB for fast queries (see the migration sketch below)
- Redis caches Sidekiq job data
- Active Storage caching for .eml files
- Horizontal scaling: Add more Sidekiq workers
- Vertical scaling: Increase RAILS_MAX_THREADS
- Database: PostgreSQL connection pooling
- File storage: Active Storage supports S3/GCS for production
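For example, the indexes mentioned above can be created with a migration along these lines. This is a sketch: the migration class name and the exact table/column pairs are assumptions, though extracted_data on email_logs and the email/phone fields appear elsewhere in this README.

```ruby
# db/migrate/xxxx_add_indexes_for_email_processing.rb — illustrative sketch
class AddIndexesForEmailProcessing < ActiveRecord::Migration[7.2]
  def change
    # GIN index for fast containment queries on the JSONB extracted_data column
    add_index :email_logs, :extracted_data, using: :gin

    add_index :email_logs, :status
    add_index :email_logs, :created_at
    add_index :customers, :email
    add_index :customers, :phone
  end
end
```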
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-parser
- Write tests for your changes
- Ensure all tests pass: bundle exec rspec
- Ensure code quality: bundle exec rubocop -A
- Commit with clear messages: git commit -m 'Add parser for Vendor X'
- Push to your fork: git push origin feature/amazing-parser
- Open a Pull Request with a detailed description
- Follow Ruby Style Guide
- Write descriptive commit messages
- Add RSpec tests for new features
- Update documentation
Roadmap
- Bulk upload for multiple email files
- Real-time dashboard with statistics
- Auto-refresh for pending emails
- REST API with authentication (JWT)
- Real-time notifications (Action Cable)
- Machine learning for parser auto-improvement
- Multi-tenancy support
- Advanced analytics and reporting
- Email template generation
- S3/GCS integration for production storage
- Export data to CSV/Excel
- Webhook integrations
- Built with Ruby on Rails
- Email parsing powered by Mail gem
- Background jobs by Sidekiq
- UI components from Bootstrap
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README + inline code comments
© 2025 BulletOnRails. All rights reserved.
The source code contained here is made available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The full text of the license can be found in the LICENSE file in this repository.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Built with ❤️ by people who understand that real-world data is messy.





