PDFtextractor

A simple Python script to batch-extract text from all PDF files in a specified folder and save the extracted text as Markdown files.

Features

Extracts text from every PDF in a chosen directory
Saves extracted text as .md files in a textracted/ subfolder
Handles errors gracefully and provides clear messages
Skips any PDF whose Markdown output already exists, so nothing is reprocessed

Requirements

Python 3.7+
PyMuPDF (imported as fitz)

Installation

Clone this repository or download the script.
Install the required dependencies:
```
pip install pymupdf
```

Usage

Run the script:
```
python textractor.py
```
When prompted, enter the path to the folder containing your PDF files.
The script will:
- Check that the folder exists and contains PDF files
- Extract text from each PDF
- Save each PDF's text as a Markdown file in a textracted/ subfolder next to your PDFs
- Skip PDFs whose .md output already exists
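
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual source: the function names find_pdfs, extract_text, and process_folder are hypothetical, and the sketch assumes PDFs are matched by a case-insensitive .pdf extension and that output is written as UTF-8.

```python
from pathlib import Path
from typing import Callable, List


def find_pdfs(folder: Path) -> List[Path]:
    """Return the PDF files directly inside `folder` (no recursion), sorted by name."""
    if not folder.is_dir():
        raise NotADirectoryError(f"Folder not found: {folder}")
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_text(pdf_path: Path) -> str:
    """Join the plain text of every page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the other helpers work without it

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def process_folder(
    folder: Path, extract: Callable[[Path], str] = extract_text
) -> List[Path]:
    """Write textracted/<name>.md for each PDF that has no output yet."""
    pdfs = find_pdfs(folder)
    if not pdfs:
        print(f"No PDF files found in '{folder}'")
        return []
    out_dir = folder / "textracted"
    out_dir.mkdir(exist_ok=True)
    written = []
    for pdf in pdfs:
        target = out_dir / (pdf.stem + ".md")
        if target.exists():
            continue  # already extracted on an earlier run
        try:
            text = extract(pdf)
        except Exception as exc:  # keep going if one PDF cannot be processed
            print(f"Could not extract '{pdf.name}': {exc}")
            continue
        target.write_text(text, encoding="utf-8")
        print(f"Extracted text from '{pdf.name}' to 'textracted/{target.name}'")
        written.append(target)
    return written
```

Under these assumptions, process_folder(Path("/path/to/my/pdfs")) returns the list of Markdown files it wrote. The extract parameter exists only so the loop can be tried without PyMuPDF installed.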

Output

For each PDF, a corresponding .md file will be created in textracted/.
Example: document.pdf → textracted/document.md

Example

$ python textractor.py
Enter the path to the folder containing PDFs: /path/to/my/pdfs
Extracted text from 'file1.pdf' to 'textracted/file1.md'
Extracted text from 'file2.pdf' to 'textracted/file2.md'

Versions

v1 (textractor.py): Simple version. Processes a single folder, extracting text from the PDFs directly inside it and saving the Markdown files to a textracted/ subfolder.
v2 (textractorv2.py): Advanced version. Watches a root folder recursively, processes PDFs at any depth, skips any PDF whose .md output already exists, and writes each result to the textracted/ folder of the PDF's first-level subfolder.

Recommended: for most use cases, start with v1 and extend it as needed.
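
To make the v2 output rule concrete, here is a small sketch of how a PDF's location could map to its output path. It is an illustration only and not taken from textractorv2.py: the names output_path and pending_pdfs are hypothetical, and placing the output of a PDF that sits directly in the root into root/textracted/ is an assumption.

```python
from pathlib import Path
from typing import Iterator, Tuple


def output_path(root: Path, pdf: Path) -> Path:
    """Map a PDF at any depth to textracted/ inside its first-level subfolder."""
    rel = pdf.relative_to(root)
    # rel.parts[0] is the first-level subfolder when the PDF is nested.
    # Assumption: a PDF directly in the root uses root/textracted/ instead.
    base = root / rel.parts[0] if len(rel.parts) > 1 else root
    return base / "textracted" / (pdf.stem + ".md")


def pending_pdfs(root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf, target) pairs for PDFs that have no .md output yet."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        target = output_path(root, path)
        if not target.exists():
            yield path, target
```

With this mapping, root/a/b/c.pdf would be written to root/a/textracted/c.md. Note that two PDFs with the same file name at different depths under the same first-level subfolder would map to the same output file in this sketch.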

Troubleshooting

Ensure you have permission to read the PDF files and write to the output directory.
If you encounter errors, check that the folder path is correct and that the PDFs are not corrupted.
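
The checks above can be run before extraction with a short helper like the one below. It is a hypothetical sketch (preflight is not part of the script) and catches only obvious problems: a missing folder, missing read or write permission, and files that lack the %PDF- marker near the start. A file can pass these checks and still be damaged.

```python
import os
from pathlib import Path
from typing import List


def preflight(folder: Path) -> List[str]:
    """Return human-readable problems found in `folder`; an empty list means none."""
    if not folder.is_dir():
        return [f"Folder not found: {folder}"]
    problems = []
    # textracted/ is created inside the folder, so the folder must be writable.
    if not os.access(folder, os.W_OK):
        problems.append(f"Cannot write to {folder}")
    for pdf in sorted(folder.iterdir()):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        if not os.access(pdf, os.R_OK):
            problems.append(f"Cannot read {pdf.name}")
            continue
        with open(pdf, "rb") as fh:
            # A PDF file begins with a "%PDF-" version marker.
            if b"%PDF-" not in fh.read(1024):
                problems.append(f"{pdf.name} has no PDF header and may be corrupted")
    return problems
```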

License

MIT License
