A robust Bash script for processing multiple SAM files and assembly reports, generating consolidated alignment statistics including read counts, chromosome mapping, and execution metrics.

Cobos-Bioinfo/SAM-Alignment-Analyzer

SAM Alignment Analyzer

A comprehensive Bash script for processing multiple SAM (Sequence Alignment/Map) files alongside genomic assembly reports. This tool generates consolidated alignment statistics, chromosome mapping tables, and performance metrics in a single output file.

Features

  • Multi-file Processing: Analyze multiple SAM files in a single execution
  • Comprehensive Statistics: Total reads processed, aligned reads count, and per-chromosome distribution
  • Assembly Integration: Automatically maps accession numbers to chromosome names using assembly reports
  • Robust Validation: Extensive input validation including file format, structure, and content checks
  • Performance Tracking: Built-in execution time measurement
  • Interactive Output: Optional display of results upon completion

Requirements

  • Bash: Version 5.0 or higher (tested on GNU bash 5.2.21)
  • Standard Unix Tools: awk, grep, sort, join, cat
  • Input Files: SAM files with the 11 mandatory tab-separated fields, plus one assembly report in tab-delimited text format

Installation

Clone the repository:

git clone https://github.com/Cobos-Bioinfo/SAM-Alignment-Analyzer.git
cd SAM-Alignment-Analyzer

Make the script executable:

chmod +x analyze_sam.sh

Usage

Basic Syntax

./analyze_sam.sh <file1.sam> [file2.sam ...] <assembly_report.txt>

Examples

Single SAM file:

./analyze_sam.sh sample.sam assembly_report.txt

Multiple SAM files:

./analyze_sam.sh sample1.sam sample2.sam sample3.sam assembly_report.txt

Input File Requirements

SAM Files

  • Must have .sam extension
  • Must contain at least 11 tab-separated columns
  • Header lines starting with @ are automatically filtered
  • A read counts as aligned only when column 3 holds a reference name rather than *

Assembly Report

  • Must have .txt extension
  • Must contain at least one comment line starting with #
  • Must have at least 5 tab-separated columns
  • Column 1: Chromosome name
  • Column 5: Accession number
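
As an illustration of how these two columns can be turned into a lookup table, here is a minimal sketch. The function name `extract_mapping` is hypothetical and is not taken from the script; it only assumes the column layout described above.

```shell
# Hypothetical helper (not from the script): print "accession<TAB>chromosome"
# pairs from an assembly report, sorted by accession.
extract_mapping() {
    local report="$1"
    local tab
    tab=$(printf '\t')
    # Skip comment lines (#) and rows with fewer than 5 fields;
    # column 5 is the accession, column 1 the chromosome name.
    awk -F'\t' '!/^#/ && NF >= 5 { print $5 "\t" $1 }' "$report" |
        LC_ALL=C sort -t "$tab" -k1,1
}
```

Sorting by accession up front matters if the result is later fed to `join`, which expects both inputs sorted on the join field.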

Output Format

The script generates an output.txt file with the following structure:

==================================================
|          SAM FILE ANALYSIS REPORT              |
==================================================

Total reads processed: 1234567
Total aligned reads: 1200000

--------------------------------------------------
ACCESSION       CHROMOSOME      READ COUNT
--------------------------------------------------
NC_000001.11    chr1            85432
NC_000002.12    chr2            78901
...

Execution time: 5s
==================================================

Validation Features

The script includes comprehensive validation:

Argument Validation

  • Requires at least two arguments: one or more SAM files followed by one assembly report
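
A check of this kind might look like the sketch below. The function names match those listed under Core Functions, but the bodies and the error wording are assumptions made for illustration. The functions `return` rather than `exit` so they can be tried in an interactive shell.

```shell
# Sketch only: the bodies below are assumed, not copied from analyze_sam.sh.
print_usage() {
    echo "Usage: ./analyze_sam.sh <file1.sam> [file2.sam ...] <assembly_report.txt>"
}

validate_arguments() {
    # At least one SAM file plus the assembly report means two or more arguments.
    if [ "$#" -lt 2 ]; then
        echo "ERROR: expected at least 1 SAM file and 1 assembly report." >&2
        print_usage >&2
        return 1
    fi
}
```

In a script, the call would typically be `validate_arguments "$@" || exit 1`.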

File Existence and Structure

  • Verifies all files exist and are not empty
  • Checks correct file extensions (.sam for SAM files, .txt for assembly report)
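
The existence and extension checks could be sketched as follows. `check_file` is a hypothetical name, and the second error message is invented for the example.

```shell
# Hypothetical helper: verify that a file exists, is not empty,
# and carries the expected extension (given without the leading dot).
check_file() {
    local file="$1" ext="$2"
    # -s is true only for files that exist and have a size greater than zero.
    if [ ! -s "$file" ]; then
        echo "ERROR: file '$file' not found or empty!" >&2
        return 1
    fi
    case "$file" in
        *."$ext") ;;   # extension matches
        *)
            echo "ERROR: file '$file' must have a .$ext extension." >&2
            return 1
            ;;
    esac
}
```

Usage would be along the lines of `check_file sample.sam sam` and `check_file assembly_report.txt txt`.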

Format Validation

  • SAM files: Confirms at least 11 tab-separated columns in the first 20 non-header lines
  • Assembly report: Verifies presence of comment lines (starting with #)
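
One possible way to express both format checks is shown below. The helper names are hypothetical, and the actual script may implement the checks differently.

```shell
# Hypothetical helper: succeed only if none of the first 20 non-header lines
# of a SAM file has fewer than 11 tab-separated fields.
check_sam_columns() {
    local sam="$1"
    awk -F'\t' '
        /^@/ { next }                 # header lines are not inspected
        NF < 11 { bad = 1; exit }     # too few columns: stop and report failure
        ++checked >= 20 { exit }      # 20 data lines inspected: stop early
        END { exit bad }              # exit status 0 when no bad line was seen
    ' "$sam"
}

# Hypothetical helper: succeed if the report contains at least one "#" comment line.
check_report_comments() {
    grep -q '^#' "$1"
}
```

Stopping after 20 lines keeps the check cheap on large files, at the cost of not catching malformed lines further down.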

Content Validation

  • Ensures assembly report has adequate columns for chromosome-accession mapping
  • Filters out invalid or unaligned reads during processing

Script Architecture

Core Functions

  • print_usage(): Displays usage instructions
  • validate_arguments(): Validates command-line argument count
  • file_validation(): Performs comprehensive file checks
  • count_reads(): Calculates total and aligned read counts
  • generate_alignment_table(): Creates chromosome-accession mapping table
  • end_msg(): Displays completion message and optional output preview
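
To give an idea of what the table-building step involves, here is a rough sketch of a function in the spirit of generate_alignment_table(). The body, the argument order (report first, then SAM files), and the variable names are assumptions made for this example, not the script's actual code.

```shell
# Sketch only: print "accession<TAB>chromosome<TAB>read count" rows.
# Assumed call form: generate_alignment_table <report.txt> <file1.sam> [file2.sam ...]
generate_alignment_table() {
    local report="$1"
    shift
    local tab counts mapping
    tab=$(printf '\t')
    counts=$(mktemp)
    mapping=$(mktemp)

    # Count aligned reads per accession (column 3) across all SAM files.
    awk -F'\t' '
        !/^@/ && NF >= 11 && $3 != "*" { n[$3]++ }
        END { for (acc in n) print acc "\t" n[acc] }
    ' "$@" | LC_ALL=C sort -t "$tab" -k1,1 > "$counts"

    # Build the accession-to-chromosome lookup from the assembly report.
    awk -F'\t' '!/^#/ && NF >= 5 { print $5 "\t" $1 }' "$report" |
        LC_ALL=C sort -t "$tab" -k1,1 > "$mapping"

    # Join both tables on the accession; both inputs are sorted on that field.
    LC_ALL=C join -t "$tab" "$mapping" "$counts"

    rm -f "$counts" "$mapping"
}
```

Accessions that appear in the SAM files but not in the report are dropped by `join` in this sketch. Whether the real script keeps them is not stated here.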

Processing Pipeline

  1. Argument and file validation
  2. Read counting across all SAM files
  3. Accession-to-chromosome mapping generation
  4. Results consolidation in output file
  5. Execution time reporting
  6. Interactive result display option
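
Steps 5 and 6 could be sketched roughly as below. Both helper names (`format_elapsed`, `offer_preview`) and the prompt text are hypothetical. The timing line follows the "Execution time: 5s" format shown in the sample report.

```shell
# Hypothetical helper: format the elapsed time between two epoch timestamps.
format_elapsed() {
    local start="$1" end="$2"
    echo "Execution time: $((end - start))s"
}

# Hypothetical helper: ask whether to print the report and do so on "y".
offer_preview() {
    local file="$1" answer=""
    printf 'Display results now? [y/N] ' >&2   # prompt goes to stderr
    read -r answer
    case "$answer" in
        [yY]*) cat "$file" ;;
    esac
}

# Example wiring (commented out so that sourcing this block has no side effects):
# start=$(date +%s)
# ... run the analysis ...
# format_elapsed "$start" "$(date +%s)" >> output.txt
# offer_preview output.txt
```

In Bash, the built-in `SECONDS` variable is an alternative to calling `date +%s` twice.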

Error Handling

The script provides detailed error messages for common issues:

  • Missing or empty files
  • Incorrect file extensions
  • Insufficient command-line arguments
  • Invalid SAM file structure (fewer than 11 columns)
  • Malformed assembly report (missing comment lines or insufficient columns)

Example error output:

ERROR: SAM file 'sample.sam' not found or empty!
Usage: ./analyze_sam.sh <file1.sam> [file2.sam ...] <assembly_report.txt>

Performance Considerations

  • Uses temporary files for efficient sorting and joining operations
  • Processes all SAM files in a single awk pass for read counting
  • Automatic cleanup of temporary files after execution
  • Streams SAM records through awk rather than loading whole files, which keeps the memory footprint small on large inputs
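
The single-pass counting and the temporary-file cleanup can be illustrated with the sketch below. The body of `count_reads` is an assumption about how such a function might work, and `setup_workdir` is a hypothetical helper; neither is copied from the script.

```shell
# Sketch only: count total and aligned reads across all given SAM files
# in one awk invocation.
count_reads() {
    awk -F'\t' '
        /^@/ { next }                            # skip header lines
        { total++ }                              # every other line is a read record
        NF >= 11 && $3 != "*" { aligned++ }      # column 3 names a reference
        END {
            printf "Total reads processed: %d\n", total
            printf "Total aligned reads: %d\n", aligned
        }
    ' "$@"
}

# Hypothetical helper: create a scratch directory that is removed
# automatically when the shell exits, even after an error.
setup_workdir() {
    workdir=$(mktemp -d) || return 1
    trap 'rm -rf "$workdir"' EXIT
}
```

Passing every SAM file to one awk process avoids starting a new process per file, and the `trap` ensures intermediate files do not accumulate between runs.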

Testing

The script has been tested with:

  • Single and multiple SAM file inputs
  • Various file sizes (up to more than 100,000 reads)
  • Different assembly report formats
  • Edge cases (empty files, missing columns, invalid formats)

Limitations

  • File extension validation does not guarantee content validity (design choice for educational purposes)
  • Requires standard Unix environment with common text processing tools
  • Assembly report must follow the expected column structure (accession in column 5, chromosome in column 1)

Authors

  • Alejandro Cobos
  • Pablo Donaire

Developed as part of MSc in Bioinformatics - MI Programming in Bioinformatics course.

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.


Note: This script was designed for educational purposes to demonstrate Bash scripting techniques including function reusability, input validation, text processing with awk, and pipeline construction.
