FASTX Preprocessor

A simple Python tool for preprocessing FASTA and FASTQ files with common bioinformatics operations. This script provides reverse complement transformation, sequence trimming, and adaptor removal capabilities, along with detailed base composition statistics.

Features

Automatic Format Detection: Recognizes FASTA and FASTQ formats automatically
Reverse Complement: Generate reverse complement sequences with proper quality score handling
Flexible Trimming: Remove bases from either or both sequence ends with validation
Adaptor Removal: Detect and remove adaptor sequences from sequence starts
Statistical Analysis: Comprehensive base composition reporting with percentages
Multi-line FASTA Support: Handles both single-line and multi-line FASTA formats
Quality Score Handling: Maintains quality score integrity for FASTQ operations

Requirements

Python: Version 3.13.1 or higher (tested on Python 3.13.1)
Standard Library: Uses only built-in Python modules (argparse)
Environment: Compatible with Unix/Linux and Windows (tested on GNU bash 5.2.37)

Installation

Clone the repository:

git clone https://github.com/Cobos-Bioinfo/FASTX-Preprocessor.git
cd FASTX-Preprocessor

Usage

python3 fastX_pp.py --input <file> --output <file> --operation <op> [options]

Required Arguments

--input: Input FASTA or FASTQ file
--output: Output file path
--operation: Operation to perform
- rc: Reverse complement
- trim: Trim bases from sequence ends
- adaptor-removal: Remove adaptor sequences

Optional Arguments

For trim operation:

--trim-left <int>: Number of bases to remove from the left end (default: 0)
--trim-right <int>: Number of bases to remove from the right end (default: 0)

For adaptor-removal operation:

--adaptor <sequence>: Adaptor sequence to remove from sequence starts
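
The arguments above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual code: the function names `build_parser` and `validate_args` are hypothetical, and restricting `--operation` with `choices` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Preprocess FASTA/FASTQ files")
    parser.add_argument("--input", required=True, help="Input FASTA or FASTQ file")
    parser.add_argument("--output", required=True, help="Output file path")
    parser.add_argument(
        "--operation",
        required=True,
        choices=["rc", "trim", "adaptor-removal"],  # assumption: enforced via choices
        help="Operation to perform",
    )
    parser.add_argument("--trim-left", type=int, default=0)
    parser.add_argument("--trim-right", type=int, default=0)
    parser.add_argument("--adaptor", default=None)
    return parser


def validate_args(args: argparse.Namespace) -> None:
    """Check the operation-specific options once parsing has succeeded."""
    if args.operation == "trim" and args.trim_left == 0 and args.trim_right == 0:
        raise ValueError(
            "For trim operation, you must specify --trim-left and/or --trim-right"
        )
    if args.operation == "adaptor-removal" and not args.adaptor:
        raise ValueError("For adaptor-removal operation, you must specify --adaptor")
```

argparse converts `--trim-left` to the attribute `args.trim_left`, so the dashes in option names never appear in the Python code.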

Examples

Reverse Complement

FASTA file:

python3 fastX_pp.py --input input.fasta --output output.fasta --operation rc

FASTQ file:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation rc

Sequence Trimming

Trim both ends:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation trim --trim-left 5 --trim-right 3

Trim right end only:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation trim --trim-right 15

Adaptor Removal

Remove adaptor from FASTQ:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation adaptor-removal --adaptor TATAGA

Input File Formats

FASTA Format

Must start with > character
Supports both single-line and multi-line sequences
Header lines begin with >
Sequence lines contain nucleotide bases

>seq1
ATCGATCGATCG
>seq2
GCTAGCTAGCTA

FASTQ Format

Must start with @ character
Each record consists of exactly 4 lines:
1. Header line (starts with @)
2. Sequence line
3. Separator line (+)
4. Quality scores line

@seq1
ATCGATCGATCG
+
IIIIIIIIIIII
@seq2
GCTAGCTAGCTA
+
JJJJJJJJJJJJ

Output

The script generates a processed output file in the same format as the input (FASTA or FASTQ) and displays a comprehensive summary:

==========================================================
|         FASTX PREPROCESSING SUMMARY                    |
==========================================================
 Input file   : input.fastq
 Output file  : output.fastq
 Operation    : Hard-trimmed
----------------------------------------------------------
1,234 reads processed
456,789 bases processed (25% A, 25% C, 25% G, 25% T, 0% N)
12,345 bases trimmed (24% A, 26% C, 24% G, 26% T, 0% N)
==========================================================
 Processing complete ✓
==========================================================

Operation Details

Reverse Complement (`rc`)

Transforms sequences to their reverse complement (A↔T, G↔C)
Preserves the ambiguous base N (N→N)
For FASTQ files: Reverses quality scores to match reversed sequences
Case-insensitive input, produces uppercase output

Trimming (`trim`)

Removes specified number of bases from sequence ends
Requires at least one of --trim-left or --trim-right to be non-zero
Validates that total trimming (left + right) is strictly less than the sequence length
For FASTQ files: Trims quality scores accordingly
Reports statistics for both retained sequences and discarded fragments
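
The trimming rules above can be sketched for a single entry as follows. The function name `trim_entry` and its return shape are hypothetical; only the validation rule and the error wording come from this README.

```python
def trim_entry(name, sequence, quality=None, trim_left=0, trim_right=0):
    """Trim one entry; return (kept_sequence, kept_quality, discarded_bases)."""
    total = trim_left + trim_right
    if total >= len(sequence):
        raise ValueError(
            f"Total trimming {total} for entry {name} is equal or longer "
            f"than sequence length {len(sequence)}"
        )
    # Compute an explicit end index: sequence[a:-0] would yield an empty string.
    end = len(sequence) - trim_right
    kept = sequence[trim_left:end]
    discarded = sequence[:trim_left] + sequence[end:]
    # FASTQ quality strings are sliced with the same boundaries as the sequence.
    kept_quality = quality[trim_left:end] if quality is not None else None
    return kept, kept_quality, discarded


kept, kept_quality, discarded = trim_entry(
    "seq1", "ATCGATCGATCG", "IIIIIIIIIIII", trim_left=5, trim_right=3
)
print(kept, kept_quality, discarded)  # TCGA IIII ATCGATCG
```

Returning the discarded bases alongside the retained ones makes it straightforward to report statistics for both.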

Adaptor Removal (`adaptor-removal`)

Removes adaptor sequences found at the start of sequences only
Case-sensitive matching
For FASTQ files: Trims corresponding quality scores
Reports number of sequences where adaptors were found
Sequences without adaptors remain unchanged
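
A minimal sketch of this behaviour for one entry, assuming a plain prefix check with `str.startswith`. The helper name `remove_adaptor` and the boolean "found" flag are hypothetical.

```python
def remove_adaptor(sequence, adaptor, quality=None):
    """Strip the adaptor from the 5' end; return (sequence, quality, found)."""
    # str.startswith is case-sensitive, matching the documented behaviour.
    if adaptor and sequence.startswith(adaptor):
        cut = len(adaptor)
        trimmed_quality = quality[cut:] if quality is not None else None
        return sequence[cut:], trimmed_quality, True
    # No adaptor at the start: the entry is returned unchanged.
    return sequence, quality, False


print(remove_adaptor("TATAGAACGT", "TATAGA", "ABCDEFGHIJ"))  # ('ACGT', 'GHIJ', True)
print(remove_adaptor("tatagaACGT", "TATAGA"))  # ('tatagaACGT', None, False)
```

Summing the `found` flags over all entries gives the number of sequences in which an adaptor was detected.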

Validation and Error Handling

The script includes comprehensive validation:

Format Detection: Ensures files start with > (FASTA) or @ (FASTQ)
Operation Validation:
- Trim operation requires non-zero trim values
- Adaptor-removal requires adaptor sequence
Sequence Length Validation: Prevents over-trimming (when trim length ≥ sequence length)
Entry Structure: Validates during processing that each entry is a 2-tuple (FASTA) or a 3-tuple (FASTQ)
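
Format detection by first character could look like the sketch below. The return values `"fasta"` and `"fastq"` and the helper `format_from_first_char` are assumptions made for illustration; the error text follows the message documented in this README.

```python
def format_from_first_char(first_char: str) -> str:
    """Map the first character of a file to a format label."""
    if first_char == ">":
        return "fasta"
    if first_char == "@":
        return "fastq"
    raise ValueError(
        "Unknown file format. File must start with '>' (FASTA) or '@' (FASTQ), "
        f"found: '{first_char}'"
    )


def detect_format(filename: str) -> str:
    """Read only the first character of the file and classify it."""
    with open(filename, "r", encoding="utf-8") as handle:
        return format_from_first_char(handle.read(1))
```

Reading a single character keeps detection cheap even for very large input files.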

Common Error Messages

Missing required argument:

error: the following arguments are required: --operation

Invalid trim parameters:

ValueError: For trim operation, you must specify --trim-left and/or --trim-right

Over-trimming detection:

ValueError: Total trimming 200 for entry seq_001 is equal or longer than sequence length 150

Missing adaptor sequence:

ValueError: For adaptor-removal operation, you must specify --adaptor

Unknown file format:

ValueError: Unknown file format. File must start with '>' (FASTA) or '@' (FASTQ), found: 'X'

Script Architecture

Core Functions

detect_format(filename): Detects FASTA or FASTQ format
read_fasta(filename): Parses FASTA files into (header, sequence) tuples
read_fastq(filename): Parses FASTQ files into (header, sequence, quality) tuples
write_fastX(filename, entries): Writes entries in appropriate format
reverse_complement(sequence): Generates reverse complement sequences
count_bases(sequence): Counts base composition (A, C, G, T, N)
calculate_statistics(sequences): Aggregates statistics across all sequences
format_statistics(...): Formats statistics with percentages and thousands separators
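
As an illustration of the three statistics helpers, here is a minimal sketch. The signatures are assumptions (in particular the `label` parameter of `format_statistics`), as is the choice to ignore characters other than A, C, G, T and N.

```python
BASES = "ACGTN"


def count_bases(sequence):
    """Count A, C, G, T and N in one pass; other characters are ignored."""
    counts = dict.fromkeys(BASES, 0)
    for base in sequence.upper():
        if base in counts:
            counts[base] += 1
    return counts


def calculate_statistics(sequences):
    """Aggregate base counts across all sequences."""
    totals = dict.fromkeys(BASES, 0)
    for sequence in sequences:
        for base, n in count_bases(sequence).items():
            totals[base] += n
    return totals


def format_statistics(totals, label):
    """Format totals with a thousands separator and whole-number percentages."""
    total = sum(totals.values())
    if total == 0:
        return f"0 bases {label}"
    parts = ", ".join(f"{round(100 * totals[b] / total)}% {b}" for b in BASES)
    return f"{total:,} bases {label} ({parts})"


print(format_statistics(calculate_statistics(["ACGT", "ACGT"]), "processed"))
# 8 bases processed (25% A, 25% C, 25% G, 25% T, 0% N)
```

The `:,` format specifier produces the thousands separators seen in the summary output.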

Operation Functions

operation_reverse_complement(entries): Applies reverse complement transformation
operation_trim(entries, trim_left, trim_right): Performs sequence trimming
operation_adaptor_removal(entries, adaptor): Removes adaptor sequences

Main Execution

main(): Orchestrates argument parsing, validation, operation execution, and reporting

Implementation Details

Translation Table

Uses str.maketrans() for efficient nucleotide complement mapping:

A ↔ T
C ↔ G
N ↔ N (preserves ambiguity)
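
The mapping above can be expressed with `str.maketrans()` as in this sketch. The constant name `COMPLEMENT` and the helper `reverse_complement_entry` are hypothetical; the uppercase conversion and quality reversal follow the behaviour described for the `rc` operation.

```python
# Translation table: A<->T, C<->G, N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Uppercase the sequence, complement each base, then reverse the result."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def reverse_complement_entry(header, sequence, quality):
    """FASTQ variant: reverse the quality string so scores stay with their bases."""
    return header, reverse_complement(sequence), quality[::-1]


print(reverse_complement("ATCGATCGATCG"))  # CGATCGATCGAT
print(reverse_complement("aacgn"))  # NCGTT
```

`str.translate` leaves characters that are missing from the table unchanged, so symbols outside A, C, G, T and N would pass through uncomplemented in this sketch.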

Multi-line FASTA Handling

Accumulates sequence lines in a list and joins them into a single string, efficiently handling both single-line and multi-line FASTA formats.
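
The accumulate-and-join approach might look like the following sketch. It is written as a generator over lines rather than a filename so it can be tried without a file; the name `parse_fasta`, the skipping of blank lines, and storing headers without the leading `>` are all assumptions.

```python
from collections.abc import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) tuples from single-line or multi-line FASTA."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        # Emit the final record, which has no following header to close it.
        yield header, "".join(chunks)


records = list(parse_fasta([">seq1\n", "ATCG\n", "ATCG\n", ">seq2\n", "GCTA\n"]))
print(records)  # [('seq1', 'ATCGATCG'), ('seq2', 'GCTA')]
```

Collecting chunks in a list and joining once avoids repeated string concatenation, which copies the growing sequence on every line.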

FASTQ Parsing Strategy

Uses modulo arithmetic (line_num % 4) to identify each line's role within the 4-line record structure, ensuring robust parsing without regex patterns.
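
A minimal sketch of the modulo-based approach, assuming strict 4-line records with no blank lines. The name `parse_fastq`, the structural checks, and their error messages are hypothetical.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from strict 4-line FASTQ records."""
    header = sequence = ""
    for line_num, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        role = line_num % 4  # 0: header, 1: sequence, 2: separator, 3: quality
        if role == 0:
            if not line.startswith("@"):
                raise ValueError(f"Line {line_num + 1}: expected '@' header")
            header = line[1:]
        elif role == 1:
            sequence = line
        elif role == 2:
            if not line.startswith("+"):
                raise ValueError(f"Line {line_num + 1}: expected '+' separator")
        else:
            yield header, sequence, line


lines = ["@seq1\n", "ATCG\n", "+\n", "IIII\n", "@seq2\n", "GCTA\n", "+\n", "@JJJ\n"]
print(list(parse_fastq(lines)))
# [('seq1', 'ATCG', 'IIII'), ('seq2', 'GCTA', '@JJJ')]
```

Relying on line position rather than on a leading `@` matters because `@` is also a valid quality character, as the second record shows.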

Quality Score Management

For all FASTQ operations:

Reverse complement: Quality scores are reversed ([::-1])
Trimming: Quality scores are sliced with same boundaries as sequences
Adaptor removal: Quality scores are trimmed by adaptor length

Performance Considerations

Memory Efficient: Processes files line-by-line without loading entire datasets
Single-Pass Statistics: Calculates the full base composition in a single pass
Optimized String Operations: Uses efficient slice notation and join operations
Type Hints: Includes type annotations for clarity and static analysis (they have no effect on runtime performance in CPython)

Testing

The script has been tested with:

Multiple FASTA and FASTQ files of varying sizes
Single-line and multi-line FASTA formats
Various trim combinations (left only, right only, both ends)
Edge cases (minimal trimming, heavy trimming, no adaptors found)
Error conditions (over-trimming, missing parameters, invalid formats)
Mixed-case input sequences

Known Limitations

Over-trimming halts execution entirely rather than skipping affected sequences
Adaptor removal only detects adaptors at sequence starts (5' end)
No support for paired-end read processing
Quality score validation is not performed (assumes valid Phred scores)

Future Improvements

Potential enhancements identified:

Resilient Trimming: Skip sequences where trimming exceeds length instead of halting, with separate reporting
3' Adaptor Detection: Add option to remove adaptors from sequence ends
Paired-End Support: Process R1 and R2 files together
Quality Filtering: Add quality score-based filtering options
Compression Support: Handle gzipped input/output files

Author

Alejandro Cobos

Developed as part of MSc in Bioinformatics - MI Programming in Bioinformatics course, Assignment 2.

License

This project is open source and available for educational purposes.

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.

Note: This script was designed for educational purposes to demonstrate Python programming techniques including modular function design, type hinting, file I/O operations, argument parsing, and bioinformatics sequence manipulation.
