FASTX Preprocessor

A simple Python tool for preprocessing FASTA and FASTQ files with common bioinformatics operations. This script provides reverse complement transformation, sequence trimming, and adaptor removal capabilities, along with detailed base composition statistics.

Features

  • Automatic Format Detection: Recognizes FASTA and FASTQ formats automatically
  • Reverse Complement: Generate reverse complement sequences with proper quality score handling
  • Flexible Trimming: Remove bases from either or both sequence ends with validation
  • Adaptor Removal: Detect and remove adaptor sequences from sequence starts
  • Statistical Analysis: Comprehensive base composition reporting with percentages
  • Multi-line FASTA Support: Handles both single-line and multi-line FASTA formats
  • Quality Score Handling: Maintains quality score integrity for FASTQ operations

Requirements

  • Python: Version 3.13.1 or higher (tested on Python 3.13.1)
  • Standard Library: Uses only built-in Python modules (argparse)
  • Environment: Compatible with Unix/Linux and Windows (tested on GNU bash 5.2.37)

Installation

Clone the repository:

git clone https://github.com/Cobos-Bioinfo/FASTX-Preprocessor.git
cd FASTX-Preprocessor

Usage

python3 fastX_pp.py --input <file> --output <file> --operation <op> [options]

Required Arguments

  • --input: Input FASTA or FASTQ file
  • --output: Output file path
  • --operation: Operation to perform
    • rc: Reverse complement
    • trim: Trim bases from sequence ends
    • adaptor-removal: Remove adaptor sequences

Optional Arguments

For trim operation:

  • --trim-left <int>: Number of bases to remove from the left end (default: 0)
  • --trim-right <int>: Number of bases to remove from the right end (default: 0)

For adaptor-removal operation:

  • --adaptor <sequence>: Adaptor sequence to remove from sequence starts
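
The documented interface maps naturally onto `argparse`. The following is a minimal sketch of how such a parser could be declared, not the script's actual code: the `build_parser` helper, the use of `choices`, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line interface."""
    parser = argparse.ArgumentParser(description="Preprocess FASTA/FASTQ files")
    parser.add_argument("--input", required=True, help="Input FASTA or FASTQ file")
    parser.add_argument("--output", required=True, help="Output file path")
    parser.add_argument(
        "--operation",
        required=True,
        choices=["rc", "trim", "adaptor-removal"],  # assumption: enforced via choices
        help="Operation to perform",
    )
    # Trim options default to 0, as documented above.
    parser.add_argument("--trim-left", type=int, default=0)
    parser.add_argument("--trim-right", type=int, default=0)
    # Only meaningful for the adaptor-removal operation.
    parser.add_argument("--adaptor", default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--input", "in.fastq", "--output", "out.fastq",
     "--operation", "trim", "--trim-right", "15"]
)
print(args.operation, args.trim_left, args.trim_right)
```

Note that `argparse` converts `--trim-left` into the attribute name `trim_left`, which is why the values are read back with underscores.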

Examples

Reverse Complement

FASTA file:

python3 fastX_pp.py --input input.fasta --output output.fasta --operation rc

FASTQ file:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation rc

Sequence Trimming

Trim both ends:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation trim --trim-left 5 --trim-right 3

Trim right end only:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation trim --trim-right 15

Adaptor Removal

Remove adaptor from FASTQ:

python3 fastX_pp.py --input input.fastq --output output.fastq --operation adaptor-removal --adaptor TATAGA

Input File Formats

FASTA Format

  • Must start with > character
  • Supports both single-line and multi-line sequences
  • Header lines begin with >
  • Sequence lines contain nucleotide bases

>seq1
ATCGATCGATCG
>seq2
GCTAGCTAGCTA

FASTQ Format

  • Must start with @ character
  • Each record consists of exactly 4 lines:
    1. Header line (starts with @)
    2. Sequence line
    3. Separator line (+)
    4. Quality scores line

@seq1
ATCGATCGATCG
+
IIIIIIIIIIII
@seq2
GCTAGCTAGCTA
+
JJJJJJJJJJJJ

Output

The script generates a processed output file in the same format as the input (FASTA or FASTQ) and displays a comprehensive summary:

==========================================================
|         FASTX PREPROCESSING SUMMARY                    |
==========================================================
 Input file   : input.fastq
 Output file  : output.fastq
 Operation    : Hard-trimmed
----------------------------------------------------------
1,234 reads processed
456,789 bases processed (25% A, 25% C, 25% G, 25% T, 0% N)
12,345 bases trimmed (24% A, 26% C, 24% G, 26% T, 0% N)
==========================================================
 Processing complete ✓
==========================================================

Operation Details

Reverse Complement (rc)

  • Transforms sequences to their reverse complement (A↔T, G↔C)
  • Preserves ambiguous base codes (N→N)
  • For FASTQ files: Reverses quality scores to match reversed sequences
  • Case-insensitive input, produces uppercase output

Trimming (trim)

  • Removes specified number of bases from sequence ends
  • Requires at least one of --trim-left or --trim-right to be non-zero
  • Validates that total trimming is strictly less than the sequence length (trimming an entire sequence is rejected)
  • For FASTQ files: Trims quality scores accordingly
  • Reports statistics for both retained sequences and discarded fragments
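
The trimming rules above can be sketched for a single entry as follows. This is an illustrative sketch rather than the script's `operation_trim`: the `trim_entry` helper and its return shape are hypothetical, and the error text simply mirrors the message shown under Common Error Messages.

```python
def trim_entry(header, sequence, quality=None, trim_left=0, trim_right=0):
    """Trim one entry; return ((header, kept_seq, kept_qual), discarded_bases)."""
    total = trim_left + trim_right
    if total >= len(sequence):
        raise ValueError(
            f"Total trimming {total} for entry {header} is equal or longer "
            f"than sequence length {len(sequence)}"
        )
    # Use an explicit end index: sequence[a:-0] would yield an empty string.
    end = len(sequence) - trim_right
    kept_seq = sequence[trim_left:end]
    # The discarded fragments feed the "bases trimmed" statistics.
    discarded = sequence[:trim_left] + sequence[end:]
    # Quality scores (FASTQ only) are sliced with the same boundaries.
    kept_qual = quality[trim_left:end] if quality is not None else None
    return (header, kept_seq, kept_qual), discarded


entry, discarded = trim_entry("seq1", "AACCGGTT", "IIIIJJJJ", trim_left=2, trim_right=3)
print(entry, discarded)
```

Computing the end index explicitly avoids the common pitfall where a right trim of 0 expressed as a negative slice bound would drop the whole sequence.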

Adaptor Removal (adaptor-removal)

  • Removes adaptor sequences found at the start of sequences only
  • Case-sensitive matching
  • For FASTQ files: Trims corresponding quality scores
  • Reports number of sequences where adaptors were found
  • Sequences without adaptors remain unchanged
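
A minimal sketch of the behaviour described above, assuming a plain prefix comparison; the `remove_adaptor` helper and its three-value return are hypothetical and are not taken from the script.

```python
def remove_adaptor(sequence, adaptor, quality=None):
    """Strip `adaptor` from the start of `sequence` if present.

    Returns (sequence, quality, found). Matching is case-sensitive and
    only the 5' end is inspected.
    """
    if adaptor and sequence.startswith(adaptor):
        n = len(adaptor)
        trimmed_quality = quality[n:] if quality is not None else None
        return sequence[n:], trimmed_quality, True
    # No adaptor at the start: the entry is returned unchanged.
    return sequence, quality, False


hit = remove_adaptor("TATAGAACGT", "TATAGA", "IIIIIIJJJJ")
miss = remove_adaptor("tatagaACGT", "TATAGA", "IIIIIIJJJJ")
print(hit)
print(miss)
```

The second call shows the consequence of case-sensitive matching: a lowercase adaptor in the read is not recognised, so the entry is left as it was.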

Validation and Error Handling

The script validates its input at several points:

  • Format Detection: Ensures files start with > (FASTA) or @ (FASTQ)
  • Operation Validation:
    • Trim operation requires non-zero trim values
    • Adaptor-removal requires adaptor sequence
  • Sequence Length Validation: Prevents over-trimming (when trim length ≥ sequence length)
  • Entry Structure: Validates tuple lengths during processing
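
Format detection by first character could look roughly like the sketch below. The function name matches the one listed under Core Functions, but the body, the returned strings `"fasta"` and `"fastq"`, and the temporary demo file are assumptions for illustration.

```python
import os
import tempfile


def detect_format(filename: str) -> str:
    """Return 'fasta' or 'fastq' based on the first character of the file."""
    with open(filename, "r") as handle:
        first_char = handle.read(1)
    if first_char == ">":
        return "fasta"
    if first_char == "@":
        return "fastq"
    raise ValueError(
        "Unknown file format. File must start with '>' (FASTA) or '@' (FASTQ), "
        f"found: '{first_char}'"
    )


# Demo on a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fastq")
    with open(path, "w") as handle:
        handle.write("@seq1\nACGT\n+\nIIII\n")
    detected = detect_format(path)
print(detected)
```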

Common Error Messages

Missing required argument:

error: the following arguments are required: --operation

Invalid trim parameters:

ValueError: For trim operation, you must specify --trim-left and/or --trim-right

Over-trimming detection:

ValueError: Total trimming 200 for entry seq_001 is equal or longer than sequence length 150

Missing adaptor sequence:

ValueError: For adaptor-removal operation, you must specify --adaptor

Unknown file format:

ValueError: Unknown file format. File must start with '>' (FASTA) or '@' (FASTQ), found: 'X'

Script Architecture

Core Functions

  • detect_format(filename): Detects FASTA or FASTQ format
  • read_fasta(filename): Parses FASTA files into (header, sequence) tuples
  • read_fastq(filename): Parses FASTQ files into (header, sequence, quality) tuples
  • write_fastX(filename, entries): Writes entries in appropriate format
  • reverse_complement(sequence): Generates reverse complement sequences
  • count_bases(sequence): Counts base composition (A, C, G, T, N)
  • calculate_statistics(sequences): Aggregates statistics across all sequences
  • format_statistics(...): Formats statistics with percentages and thousands separators
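
To make the statistics helpers more concrete, here is a hedged sketch of base counting and summary formatting. `count_bases` reuses the name listed above, but its body is a guess; `summarize` is a hypothetical stand-in for the aggregation and formatting steps, and rounding percentages to whole numbers is an assumption based on the sample output.

```python
from collections import Counter

BASES = "ACGTN"


def count_bases(sequence: str) -> dict[str, int]:
    """Count A, C, G, T and N (case-insensitively); other symbols are ignored."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in BASES}


def summarize(sequences) -> str:
    """Aggregate counts over all sequences and format them as one summary line."""
    totals = dict.fromkeys(BASES, 0)
    for seq in sequences:
        for base, n in count_bases(seq).items():
            totals[base] += n
    total = sum(totals.values())
    parts = ", ".join(
        f"{round(100 * totals[b] / total) if total else 0}% {b}" for b in BASES
    )
    # The ',' format specifier inserts thousands separators, e.g. 456,789.
    return f"{total:,} bases processed ({parts})"


line = summarize(["ACGT", "acgt"])
print(line)
```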

Operation Functions

  • operation_reverse_complement(entries): Applies reverse complement transformation
  • operation_trim(entries, trim_left, trim_right): Performs sequence trimming
  • operation_adaptor_removal(entries, adaptor): Removes adaptor sequences

Main Execution

  • main(): Orchestrates argument parsing, validation, operation execution, and reporting

Implementation Details

Translation Table

Uses str.maketrans() for efficient nucleotide complement mapping:

  • A ↔ T
  • C ↔ G
  • N ↔ N (preserves ambiguity)
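
A minimal sketch of this technique, assuming the table covers only the five symbols listed and that input is uppercased before translation (consistent with the uppercase output described earlier). The constant name `COMPLEMENT` is hypothetical.

```python
# Map each base to its complement; N maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Uppercase, complement each base, then reverse the result."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATCGN"))
print(reverse_complement("aacg"))
```

Characters that are not in the table pass through `str.translate` unchanged, so any symbol outside A, C, G, T and N would be reversed but not complemented in this sketch.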

Multi-line FASTA Handling

Accumulates sequence lines in a list and joins them into a single string, efficiently handling both single-line and multi-line FASTA formats.
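
The accumulate-and-join approach can be sketched as follows. `parse_fasta` is a hypothetical helper that reads from any iterable of lines; stripping the leading `>` from headers and skipping blank lines are assumptions, not documented behaviour of `read_fasta`.

```python
import io


def parse_fasta(handle):
    """Parse FASTA lines into (header, sequence) tuples."""
    entries = []
    header = None
    chunks = []  # sequence lines of the record currently being read
    for line in handle:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Flush the final record, which has no following header to close it.
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


records = parse_fasta(io.StringIO(">seq1\nATCG\nATCG\n>seq2\nGCTA\n"))
print(records)
```

Joining the list once per record avoids repeated string concatenation, which matters for long multi-line sequences.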

FASTQ Parsing Strategy

Uses modulo arithmetic (line_num % 4) to identify each line's role within the 4-line record structure, ensuring robust parsing without regex patterns.
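
As an illustration of the modulo approach, the sketch below assigns a role to each line from its zero-based index. `parse_fastq` is a hypothetical helper, and the approach relies on every record spanning exactly 4 lines with no blank lines in between.

```python
import io


def parse_fastq(handle):
    """Parse strict 4-line FASTQ records into (header, sequence, quality) tuples."""
    entries = []
    header = sequence = ""
    for line_num, line in enumerate(handle):
        line = line.rstrip("\r\n")
        role = line_num % 4
        if role == 0:
            header = line[1:]  # assumption: drop the leading '@'
        elif role == 1:
            sequence = line
        elif role == 3:
            # The quality line completes the record.
            entries.append((header, sequence, line))
        # role == 2 is the '+' separator and carries no data here.
    return entries


text = "@seq1\nATCGATCGATCG\n+\nIIIIIIIIIIII\n@seq2\nGCTAGCTAGCTA\n+\nJJJJJJJJJJJJ\n"
reads = parse_fastq(io.StringIO(text))
print(reads)
```

Relying on position rather than on the `@` prefix matters because `@` is also a valid quality character and may appear at the start of a quality line.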

Quality Score Management

For all FASTQ operations:

  • Reverse complement: Quality scores are reversed ([::-1])
  • Trimming: Quality scores are sliced with same boundaries as sequences
  • Adaptor removal: Quality scores are trimmed by adaptor length
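
The reverse complement case is the least obvious of the three, so here is a small sketch showing why the quality string is reversed but not otherwise altered. The `rc_fastq_entry` helper and the sample entry are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rc_fastq_entry(entry):
    """Reverse-complement one (header, sequence, quality) tuple."""
    header, sequence, quality = entry
    rc_seq = sequence.upper().translate(COMPLEMENT)[::-1]
    # Reversing keeps each score attached to the base it was measured for:
    # the last base of the read becomes the first base of the output.
    return header, rc_seq, quality[::-1]


entry = ("seq1", "AACG", "!#5I")
result = rc_fastq_entry(entry)
print(result)
```

In this example the final `G` with score `I` becomes the leading `C` of the output and still carries score `I`.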

Performance Considerations

  • Memory Efficient: Processes files line-by-line without loading entire datasets
  • Single-Pass Statistics: Calculates all base composition in one iteration
  • Optimized String Operations: Uses efficient slice notation and join operations
  • Type Hints: Includes type annotations for clarity and static analysis (annotations do not speed up execution in CPython)

Testing

The script has been tested with:

  • Multiple FASTA and FASTQ files of varying sizes
  • Single-line and multi-line FASTA formats
  • Various trim combinations (left only, right only, both ends)
  • Edge cases (minimal trimming, heavy trimming, no adaptors found)
  • Error conditions (over-trimming, missing parameters, invalid formats)
  • Mixed-case input sequences

Known Limitations

  • Over-trimming halts execution entirely rather than skipping affected sequences
  • Adaptor removal only detects adaptors at sequence starts (5' end)
  • No support for paired-end read processing
  • Quality score validation is not performed (assumes valid Phred scores)

Future Improvements

Potential enhancements identified:

  1. Resilient Trimming: Skip sequences where trimming exceeds length instead of halting, with separate reporting
  2. 3' Adaptor Detection: Add option to remove adaptors from sequence ends
  3. Paired-End Support: Process R1 and R2 files together
  4. Quality Filtering: Add quality score-based filtering options
  5. Compression Support: Handle gzipped input/output files

Author

Alejandro Cobos

Developed as part of MSc in Bioinformatics - MI Programming in Bioinformatics course, Assignment 2.

License

This project is open source and available for educational purposes.

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.


Note: This script was designed for educational purposes to demonstrate Python programming techniques including modular function design, type hinting, file I/O operations, argument parsing, and bioinformatics sequence manipulation.
