A simple Python tool for preprocessing FASTA and FASTQ files with common bioinformatics operations. This script provides reverse complement transformation, sequence trimming, and adaptor removal capabilities, along with detailed base composition statistics.
- Automatic Format Detection: Recognizes FASTA and FASTQ formats automatically
- Reverse Complement: Generate reverse complement sequences with proper quality score handling
- Flexible Trimming: Remove bases from either or both sequence ends with validation
- Adaptor Removal: Detect and remove adaptor sequences from sequence starts
- Statistical Analysis: Comprehensive base composition reporting with percentages
- Multi-line FASTA Support: Handles both single-line and multi-line FASTA formats
- Quality Score Handling: Maintains quality score integrity for FASTQ operations
- Python: Version 3.13.1 or higher (tested on Python 3.13.1)
- Standard Library: Uses only built-in Python modules (argparse)
- Environment: Compatible with Unix/Linux and Windows (tested on GNU bash 5.2.37)
Clone the repository:
git clone https://github.com/Cobos-Bioinfo/FASTX-Preprocessor.git
cd FASTX-Preprocessor
Usage:
python3 fastX_pp.py --input <file> --output <file> --operation <op> [options]
- --input: Input FASTA or FASTQ file
- --output: Output file path
- --operation: Operation to perform
  - rc: Reverse complement
  - trim: Trim bases from sequence ends
  - adaptor-removal: Remove adaptor sequences
For trim operation:
- --trim-left <int>: Number of bases to remove from the left end (default: 0)
- --trim-right <int>: Number of bases to remove from the right end (default: 0)
For adaptor-removal operation:
- --adaptor <sequence>: Adaptor sequence to remove from sequence starts
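The options above map naturally onto argparse. The following is a minimal sketch of how such a command line could be declared and checked, not the script's actual code: the helper names build_parser and validate_args, the use of choices, and the help strings are assumptions made for illustration. The error messages mirror the ones this README lists.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options (hypothetical helper, for illustration)."""
    parser = argparse.ArgumentParser(description="Preprocess FASTA/FASTQ files")
    parser.add_argument("--input", required=True, help="Input FASTA or FASTQ file")
    parser.add_argument("--output", required=True, help="Output file path")
    parser.add_argument(
        "--operation",
        required=True,
        choices=["rc", "trim", "adaptor-removal"],  # assumed: restricted via choices
        help="Operation to perform",
    )
    parser.add_argument("--trim-left", type=int, default=0)
    parser.add_argument("--trim-right", type=int, default=0)
    parser.add_argument("--adaptor", default=None)
    return parser


def validate_args(args: argparse.Namespace) -> None:
    """Check operation-specific options after parsing (hypothetical helper)."""
    if args.operation == "trim" and args.trim_left == 0 and args.trim_right == 0:
        raise ValueError(
            "For trim operation, you must specify --trim-left and/or --trim-right"
        )
    if args.operation == "adaptor-removal" and not args.adaptor:
        raise ValueError("For adaptor-removal operation, you must specify --adaptor")
```

argparse converts the dashes in option names to underscores, so --trim-left is read back as args.trim_left.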
FASTA file:
python3 fastX_pp.py --input input.fasta --output output.fasta --operation rc
FASTQ file:
python3 fastX_pp.py --input input.fastq --output output.fastq --operation rc
Trim both ends:
python3 fastX_pp.py --input input.fastq --output output.fastq --operation trim --trim-left 5 --trim-right 3
Trim right end only:
python3 fastX_pp.py --input input.fastq --output output.fastq --operation trim --trim-right 15
Remove adaptor from FASTQ:
python3 fastX_pp.py --input input.fastq --output output.fastq --operation adaptor-removal --adaptor TATAGA
FASTA format:
- Must start with the > character
- Supports both single-line and multi-line sequences
- Header lines begin with >
- Sequence lines contain nucleotide bases
>seq1
ATCGATCGATCG
>seq2
GCTAGCTAGCTA
FASTQ format:
- Must start with the @ character
- Each record consists of exactly 4 lines:
  - Header line (starts with @)
  - Sequence line
  - Separator line (+)
  - Quality scores line
@seq1
ATCGATCGATCG
+
IIIIIIIIIIII
@seq2
GCTAGCTAGCTA
+
JJJJJJJJJJJJ
The script generates a processed output file in the same format as the input (FASTA or FASTQ) and displays a comprehensive summary:
==========================================================
| FASTX PREPROCESSING SUMMARY |
==========================================================
Input file : input.fastq
Output file : output.fastq
Operation : Hard-trimmed
----------------------------------------------------------
1,234 reads processed
456,789 bases processed (25% A, 25% C, 25% G, 25% T, 0% N)
12,345 bases trimmed (24% A, 26% C, 24% G, 26% T, 0% N)
==========================================================
Processing complete ✓
==========================================================
- Transforms sequences to their reverse complement (A↔T, G↔C)
- Preserves ambiguous base codes (N→N)
- For FASTQ files: Reverses quality scores to match reversed sequences
- Case-insensitive input, produces uppercase output
- Removes specified number of bases from sequence ends
- Requires at least one of --trim-left or --trim-right to be non-zero
- Validates that total trimming is shorter than the sequence length
- For FASTQ files: Trims quality scores accordingly
- Reports statistics for both retained sequences and discarded fragments
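The trimming rules above can be sketched for a single record as follows. This is an illustrative assumption about how the logic might look, not the script's own implementation: trim_entry is a hypothetical name, and quality is passed as None for FASTA records.

```python
from typing import Optional, Tuple


def trim_entry(
    header: str,
    sequence: str,
    quality: Optional[str],
    trim_left: int = 0,
    trim_right: int = 0,
) -> Tuple[str, Optional[str], str]:
    """Return (kept sequence, kept quality, discarded bases) for one record."""
    if trim_left == 0 and trim_right == 0:
        raise ValueError(
            "For trim operation, you must specify --trim-left and/or --trim-right"
        )
    total = trim_left + trim_right
    if total >= len(sequence):
        raise ValueError(
            f"Total trimming {total} for entry {header} is equal or longer "
            f"than sequence length {len(sequence)}"
        )
    # Use an explicit end index: sequence[a:-0] would return an empty string.
    end = len(sequence) - trim_right
    kept = sequence[trim_left:end]
    discarded = sequence[:trim_left] + sequence[end:]
    # Quality scores are sliced with the same boundaries as the sequence.
    kept_quality = quality[trim_left:end] if quality is not None else None
    return kept, kept_quality, discarded
```

Returning the discarded bases alongside the kept ones makes it possible to report composition statistics for both, as the summary output does.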
- Removes adaptor sequences found at the start of sequences only
- Case-sensitive matching
- For FASTQ files: Trims corresponding quality scores
- Reports number of sequences where adaptors were found
- Sequences without adaptors remain unchanged
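A minimal sketch of the adaptor-removal behaviour described above, assuming a plain prefix comparison. The function name remove_adaptor and the returned found flag are hypothetical and only serve the illustration.

```python
from typing import Optional, Tuple


def remove_adaptor(
    sequence: str, quality: Optional[str], adaptor: str
) -> Tuple[str, Optional[str], bool]:
    """Strip the adaptor from the 5' end if present; report whether it was found."""
    # str.startswith is case-sensitive, matching the documented behaviour.
    if adaptor and sequence.startswith(adaptor):
        cut = len(adaptor)
        trimmed_quality = quality[cut:] if quality is not None else None
        return sequence[cut:], trimmed_quality, True
    # No adaptor at the start: the record is returned unchanged.
    return sequence, quality, False
```

Counting the records for which the flag is True gives the number of sequences where an adaptor was found.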
The script includes comprehensive validation:
- Format Detection: Ensures files start with > (FASTA) or @ (FASTQ)
- Operation Validation:
  - Trim operation requires non-zero trim values
  - Adaptor-removal requires adaptor sequence
- Sequence Length Validation: Prevents over-trimming (when trim length ≥ sequence length)
- Entry Structure: Validates tuple lengths during processing
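The format check in the list above could look roughly like the sketch below. It works on text that has already been read rather than on a filename, and the name detect_format_from_text and the return values "fasta" and "fastq" are assumptions; only the error message is taken from this README.

```python
def detect_format_from_text(text: str) -> str:
    """Classify content as 'fasta' or 'fastq' from its first character."""
    first = text[:1]  # empty string if the input is empty
    if first == ">":
        return "fasta"
    if first == "@":
        return "fastq"
    raise ValueError(
        "Unknown file format. File must start with '>' (FASTA) or '@' (FASTQ), "
        f"found: {first!r}"
    )
```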
Missing required argument:
error: the following arguments are required: --operation
Invalid trim parameters:
ValueError: For trim operation, you must specify --trim-left and/or --trim-right
Over-trimming detection:
ValueError: Total trimming 200 for entry seq_001 is equal or longer than sequence length 150
Missing adaptor sequence:
ValueError: For adaptor-removal operation, you must specify --adaptor
Unknown file format:
ValueError: Unknown file format. File must start with '>' (FASTA) or '@' (FASTQ), found: 'X'
- detect_format(filename): Detects FASTA or FASTQ format
- read_fasta(filename): Parses FASTA files into (header, sequence) tuples
- read_fastq(filename): Parses FASTQ files into (header, sequence, quality) tuples
- write_fastX(filename, entries): Writes entries in appropriate format
- reverse_complement(sequence): Generates reverse complement sequences
- count_bases(sequence): Counts base composition (A, C, G, T, N)
- calculate_statistics(sequences): Aggregates statistics across all sequences
- format_statistics(...): Formats statistics with percentages and thousands separators
- operation_reverse_complement(entries): Applies reverse complement transformation
- operation_trim(entries, trim_left, trim_right): Performs sequence trimming
- operation_adaptor_removal(entries, adaptor): Removes adaptor sequences
- main(): Orchestrates argument parsing, validation, operation execution, and reporting
Uses str.maketrans() for efficient nucleotide complement mapping:
- A ↔ T
- C ↔ G
- N ↔ N (preserves ambiguity)
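A minimal sketch of this mapping with str.maketrans() and str.translate(). The table contents are an assumption chosen to reproduce the documented behaviour (case-insensitive input, uppercase output); the script's own table may differ.

```python
# Lower-case bases map to upper-case complements, so output is always upper case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANTGCAN")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]
```

For example, reverse_complement("ATCGN") returns "NCGAT". Characters that are not in the table pass through str.translate() unchanged.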
Accumulates sequence lines in a list and joins them into a single string, efficiently handling both single-line and multi-line FASTA formats.
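The accumulate-and-join approach can be sketched as below. The function works on any iterable of lines, such as an open file, and its name parse_fasta_lines is hypothetical. Storing headers without the leading > and skipping blank lines are assumptions.

```python
from typing import Iterable, List, Optional, Tuple


def parse_fasta_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (header, sequence) tuples, joining wrapped sequence lines."""
    entries: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumed: blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                entries.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Flush the final record, which has no following header.
        entries.append((header, "".join(chunks)))
    return entries
```

Joining the list once per record avoids repeated string concatenation inside the loop.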
Uses modulo arithmetic (line_num % 4) to identify each line's role within the 4-line record structure, ensuring robust parsing without regex patterns.
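A sketch of the modulo-based approach, assuming well-formed input with no blank lines between records. The name parse_fastq_lines is hypothetical, and stripping the leading @ from headers is an assumption.

```python
from typing import Iterable, List, Tuple


def parse_fastq_lines(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Collect (header, sequence, quality) tuples from 4-line FASTQ records."""
    entries: List[Tuple[str, str, str]] = []
    header = sequence = ""
    for line_num, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        position = line_num % 4  # 0: header, 1: sequence, 2: '+', 3: quality
        if position == 0:
            header = line[1:]
        elif position == 1:
            sequence = line
        elif position == 3:
            # The quality line completes the record.
            entries.append((header, sequence, line))
    return entries
```

Relying on line position rather than on the leading character matters because @ is also a valid quality score character and can begin a quality line.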
For all FASTQ operations:
- Reverse complement: Quality scores are reversed ([::-1])
- Trimming: Quality scores are sliced with same boundaries as sequences
- Adaptor removal: Quality scores are trimmed by adaptor length
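To illustrate why the quality string must follow the sequence, here is a small sketch of the reverse complement case for FASTQ entries. The function name reverse_complement_fastq and the translation table are assumptions, not the script's code.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANTGCAN")


def reverse_complement_fastq(
    entries: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str]]:
    """Reverse-complement each sequence and reverse its quality string."""
    result = []
    for header, sequence, quality in entries:
        new_sequence = sequence.translate(COMPLEMENT)[::-1]
        # Reversing keeps each score attached to the base it was measured for.
        result.append((header, new_sequence, quality[::-1]))
    return result
```

With the entry ("seq1", "AACG", "ABCD"), the result is ("seq1", "CGTT", "DCBA"): the score D, originally attached to the final G, now sits on the leading C that replaced it.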
- Memory Efficient: Processes files line-by-line without loading entire datasets
- Single-Pass Statistics: Calculates all base composition in one iteration
- Optimized String Operations: Uses efficient slice notation and join operations
- Type Hints: Includes type annotations for clarity and static checking (CPython does not use them to speed up execution)
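As an illustration of the single-pass statistics point, the sketch below counts bases across all sequences in one iteration and formats them in the style of the summary report. The names count_bases_total and format_composition are hypothetical. Ignoring characters other than A, C, G, T and N and rounding percentages to whole numbers are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable

BASES = "ACGTN"


def count_bases_total(sequences: Iterable[str]) -> Dict[str, int]:
    """Count A, C, G, T and N over all sequences in a single pass."""
    counts: Counter = Counter()
    for sequence in sequences:
        counts.update(sequence.upper())
    # Assumed: any other character is left out of the report.
    return {base: counts[base] for base in BASES}


def format_composition(counts: Dict[str, int], label: str) -> str:
    """Render counts as e.g. '8 bases processed (25% A, ...)'."""
    total = sum(counts.values())
    parts = ", ".join(
        f"{(100 * counts[base] / total if total else 0):.0f}% {base}"
        for base in BASES
    )
    # The ',' format specifier inserts thousands separators.
    return f"{total:,} bases {label} ({parts})"
```

Because each percentage is rounded independently, the displayed values may not always sum to exactly 100.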
The script has been tested with:
- Multiple FASTA and FASTQ files of varying sizes
- Single-line and multi-line FASTA formats
- Various trim combinations (left only, right only, both ends)
- Edge cases (minimal trimming, heavy trimming, no adaptors found)
- Error conditions (over-trimming, missing parameters, invalid formats)
- Mixed-case input sequences
- Over-trimming halts execution entirely rather than skipping affected sequences
- Adaptor removal only detects adaptors at sequence starts (5' end)
- No support for paired-end read processing
- Quality score validation is not performed (assumes valid Phred scores)
Potential enhancements identified:
- Resilient Trimming: Skip sequences where trimming exceeds length instead of halting, with separate reporting
- 3' Adaptor Detection: Add option to remove adaptors from sequence ends
- Paired-End Support: Process R1 and R2 files together
- Quality Filtering: Add quality score-based filtering options
- Compression Support: Handle gzipped input/output files
Alejandro Cobos
Developed as part of MSc in Bioinformatics - MI Programming in Bioinformatics course, Assignment 2.
This project is open source and available for educational purposes.
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.
Note: This script was designed for educational purposes to demonstrate Python programming techniques including modular function design, type hinting, file I/O operations, argument parsing, and bioinformatics sequence manipulation.