A simple Python script to batch-extract text from all PDF files in a specified folder and save the extracted text as Markdown files.
- Extracts text from every PDF in a chosen directory
- Saves extracted text as `.md` files in a `textracted/` subfolder
- Handles errors gracefully and provides clear messages
- Skips any PDF whose corresponding Markdown file already exists, to avoid reprocessing
- Python 3.7+
- PyMuPDF (fitz)
- Clone this repository or download the script.
- Install the required dependencies:
pip install pymupdf
- Run the script:
python textractor.py
- When prompted, enter the path to the folder containing your PDF files.
- The script will:
- Check the folder exists and contains PDF files
- Extract text from each PDF
- Save each PDF's text as a Markdown file in a `textracted/` subfolder next to your PDFs
- Skip PDFs whose `.md` output already exists
- For each PDF, a corresponding `.md` file will be created in `textracted/`.
- Example: `document.pdf` → `textracted/document.md`
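The steps and output naming described above can be sketched roughly as follows. This is an illustrative outline, not the actual contents of `textractor.py`: the function names `extract_pdf_text` and `process_folder` and the `extract` parameter are assumptions introduced here, and the exact messages the real script prints may differ.

```python
from pathlib import Path
from typing import Callable, List


def extract_pdf_text(pdf_path: Path) -> str:
    """Return the plain text of every page, separated by blank lines."""
    import fitz  # PyMuPDF; imported here so the rest works without it installed

    with fitz.open(str(pdf_path)) as doc:
        return "\n\n".join(page.get_text() for page in doc)


def process_folder(
    folder: Path, extract: Callable[[Path], str] = extract_pdf_text
) -> List[Path]:
    """Extract each PDF directly inside `folder` into textracted/<name>.md."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a folder: {folder}")

    pdfs = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in: {folder}")

    out_dir = folder / "textracted"
    out_dir.mkdir(exist_ok=True)

    written = []
    for pdf in pdfs:
        target = out_dir / (pdf.stem + ".md")
        if target.exists():
            continue  # already processed on an earlier run
        try:
            text = extract(pdf)
        except Exception as exc:  # keep going if one PDF is unreadable
            print(f"Failed to extract '{pdf.name}': {exc}")
            continue
        target.write_text(text, encoding="utf-8")
        print(f"Extracted text from '{pdf.name}' to '{out_dir.name}/{target.name}'")
        written.append(target)
    return written
```

Passing the extractor as a parameter is a design choice made for this sketch only: it lets the folder logic be exercised with a stand-in function when PyMuPDF is not installed.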
$ python textractor.py
Enter the path to the folder containing PDFs: /path/to/my/pdfs
Extracted text from 'file1.pdf' to 'textracted/file1.md'
Extracted text from 'file2.pdf' to 'textracted/file2.md'
- v1 (`textractor.py`): Simple version that processes a single folder, extracting text from the PDFs directly inside it and saving the Markdown files into a `textracted/` subfolder.
- v2 (`textractorv2.py`): Advanced version that watches a root folder recursively, processes PDFs at any depth, skips files with an existing `.md`, and writes each output into its first-level subfolder's `textracted/`.
Recommended: for most use cases, start with v1 and extend it as needed.
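To make the v2 output layout concrete, here is a minimal sketch of how a PDF at any depth could be mapped to the `textracted/` folder of its first-level subfolder. It is not taken from `textractorv2.py`: the names `output_path_for` and `pending_pdfs` are hypothetical, the folder-watching part is left out, and placing output for PDFs that sit directly in the root under `root/textracted/` is an assumption.

```python
from pathlib import Path
from typing import Iterator, Tuple


def output_path_for(pdf: Path, root: Path) -> Path:
    """Map root/<first>/.../name.pdf to root/<first>/textracted/name.md."""
    rel = pdf.relative_to(root)
    # Assumption: a PDF directly in the root goes to root/textracted/.
    base = root / rel.parts[0] if len(rel.parts) > 1 else root
    return base / "textracted" / (pdf.stem + ".md")


def pending_pdfs(root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf, target) pairs for PDFs that have no .md output yet."""
    for pdf in sorted(root.rglob("*.pdf")):
        target = output_path_for(pdf, root)
        if not target.exists():
            yield pdf, target
```

For example, under these assumptions `root/reports/2023/q1/summary.pdf` would map to `root/reports/textracted/summary.md`.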
- Ensure you have permission to read the PDF files and write to the output directory.
- If you encounter errors, check that the folder path is correct and that the PDFs are not corrupted.
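If it helps with troubleshooting, the checks above can be automated with a small pre-flight helper. This is only a sketch: `preflight` is a hypothetical function that is not part of either script, and the `%PDF-` header test is a rough heuristic. A file that passes it can still be damaged, so treat a clean result as "nothing obviously wrong".

```python
import os
from pathlib import Path
from typing import List


def preflight(folder: Path) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    folder = Path(folder)
    if not folder.is_dir():
        return [f"Folder does not exist: {folder}"]

    problems = []
    if not os.access(folder, os.W_OK):
        problems.append(f"No write permission for: {folder}")

    for pdf in sorted(folder.glob("*.pdf")):
        if not os.access(pdf, os.R_OK):
            problems.append(f"No read permission: {pdf.name}")
            continue
        with pdf.open("rb") as fh:
            # Heuristic only: well-formed PDFs carry this marker near the start.
            if b"%PDF-" not in fh.read(1024):
                problems.append(f"Missing PDF header (possibly corrupted): {pdf.name}")
    return problems
```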
MIT License