Skip to content

sensein/GrobidArticleExtractor

GrobidArticleExtractor

This Python tool extracts content from PDF files using GROBID and organizes it by sections. It provides a structured way to extract both metadata and content from academic papers and other structured documents.

Features

  • Direct PDF processing using GROBID API
  • Metadata extraction (title, authors, abstract, publication date)
  • Hierarchical section organization with subsections

Prerequisites

  1. Start GROBID Service:

    docker pull lfoppiano/grobid:0.8.0
    docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0

    JAVA_OPTS="-XX:+UseZGC" helps to resolve the following error in mac os.

    [thread 44 also had an error]
    
    A fatal error has been detected by the Java Runtime Environment:
    
    SIGSEGV (0xb) at pc=0x00007ffffef8ad07, pid=8, tid=47
    
    JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
    Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
    Problematic frame:
    [thread 41 also had an error]
    [thread 45 also had an error]
    [thread 46 also had an error]
  2. Installation :

    Install this package via :

    pip install GrobidArticleExtractor

    Or get the newest development version via:

    pip install git+https://github.com/sensein/GrobidArticleExtractor.git

    Note: If upgrading from a previous version, you may need to reinstall the package to ensure the CLI command is properly installed:

    pip uninstall GrobidArticleExtractor
    pip install GrobidArticleExtractor

Usage

Command Line Interface

The tool provides a user-friendly command-line interface for batch processing PDF files:

# Basic usage (processes PDFs from 'pdfs' directory)
grobidextractor

# Process PDFs from a specific directory
grobidextractor path/to/pdfs

# Specify custom output directory
grobidextractor path/to/pdfs -o path/to/output

# Use custom GROBID server and disable content preview
grobidextractor path/to/pdfs --grobid-url http://custom:8070 --no-preview

Available options:

$ grobidextractor --help
Usage: grobidextractor [OPTIONS] [INPUT_FOLDER]  

  Process PDF files from INPUT_FOLDER and extract their content using GROBID.

  The extracted content is saved as JSON files in the output directory.
  Each JSON file is named after its source PDF file.

Options:
  -o, --output-dir PATH  Directory to save extracted JSON files (default: output)
  -g, --grobid-url TEXT  GROBID service URL (default: http://localhost:8070)
  --preview / --no-preview
                        Show preview of extracted content (default: True)
  --help                Show this message and exit.

Example:
  grobidextractor path/to/pdfs -o path/to/output

Python API Usage

You can also use the tool programmatically in your Python code:

from GrobidArticleExtractor.app import GrobidArticleExtractor

# Initialize extractor (default GROBID URL: http://localhost:8070)
extractor = GrobidArticleExtractor()

# Process a PDF file
xml_content = extractor.process_pdf("path/to/your/paper.pdf")

if xml_content:
   # Extract and organize content
   result = extractor.extract_content(xml_content)

   # Access metadata
   print(result['metadata'])

   # Access sections
   for section in result['sections']:
      print(section['heading'])
      if 'content' in section:
         print(section['content'])

Custom GROBID server:

extractor = GrobidArticleExtractor(grobid_url="http://your-grobid-server:8070")

Example NoteBook

Check out the example notebook for a step-by-step guide on how to run it.

Output Structure

The extracted content is organized as follows:

{
   'metadata': {
      'title': 'Paper Title',
      'authors': ['Author 1', 'Author 2'],
      'abstract': 'Paper abstract...',
      'publication_date': '2023'
   },
   'sections': [
      {
         'heading': 'Introduction',
         'content': ['Paragraph 1...', 'Paragraph 2...'],
         'subsections': [
            {
               'heading': 'Background',
               'content': ['Subsection content...']
            }
         ]
      }
      # More sections...
   ]
}

Project Structure

The project is organized into two main files:

  • app.py - Contains the core GrobidArticleExtractor class with all the PDF processing and content extraction functionality
  • cli.py - Contains the command-line interface implementation using Click

Error Handling

The tool includes comprehensive error handling for common scenarios:

  • PDF file not found
  • GROBID service unavailable
  • XML parsing errors
  • Invalid content structure

All errors are logged with appropriate messages using Python's logging module.

Contributing

Feel free to submit issues and enhancement requests!

License

MIT License

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •