A comprehensive Bash script for processing multiple SAM (Sequence Alignment/Map) files alongside genomic assembly reports. This tool generates consolidated alignment statistics, chromosome mapping tables, and performance metrics in a single output file.
- Multi-file Processing: Analyze multiple SAM files in a single execution
- Comprehensive Statistics: Total reads processed, aligned reads count, and per-chromosome distribution
- Assembly Integration: Automatically maps accession numbers to chromosome names using assembly reports
- Robust Validation: Extensive input validation including file format, structure, and content checks
- Performance Tracking: Built-in execution time measurement
- Interactive Output: Optional display of results upon completion
- Bash: Version 5.0 or higher (tested on GNU bash 5.2.21)
- Standard Unix Tools: `awk`, `grep`, `sort`, `join`, `cat`
- File Format: SAM files with standard 11-column format, assembly report in tab-delimited text format
Clone the repository:
git clone https://github.com/Cobos-Bioinfo/SAM-Alignment-Analyzer.git
cd SAM-Alignment-Analyzer

Make the script executable:
chmod +x analyze_sam.sh

Usage:
./analyze_sam.sh <file1.sam> [file2.sam ...] <assembly_report.txt>

Single SAM file:
./analyze_sam.sh sample.sam assembly_report.txt

Multiple SAM files:
./analyze_sam.sh sample1.sam sample2.sam sample3.sam assembly_report.txt

SAM files:
- Must have `.sam` extension
- Must contain at least 11 tab-separated columns
- Header lines starting with `@` are automatically filtered
- Aligned reads must have a valid reference name in column 3 (not `*`)

Assembly report:
- Must have `.txt` extension
- Must contain at least one comment line starting with `#`
- Must have at least 5 tab-separated columns
  - Column 1: Chromosome name
  - Column 5: Accession number
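
The column layout above can be turned into an accession-to-chromosome lookup with a single `awk` filter. The following is a minimal sketch, not the script's actual code: the function name `extract_mapping` and the demo report contents are made up for illustration, and only the column positions (1 and 5) and the `#` comment convention come from the requirements listed above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print "accession<TAB>chromosome" pairs from an assembly report.
extract_mapping() {
    local report="$1"
    # Skip comment lines (starting with '#') and rows with fewer than 5 columns,
    # then emit column 5 (accession) followed by column 1 (chromosome name).
    awk -F'\t' '!/^#/ && NF >= 5 { print $5 "\t" $1 }' "$report" \
        | LC_ALL=C sort -t $'\t' -k1,1
}

# Demo on a tiny made-up report (columns 2 to 4 are placeholders).
demo_report="$(mktemp)"
printf '# name\tcol2\tcol3\tcol4\taccession\n'  > "$demo_report"
printf 'chr2\tx\tx\tx\tNC_000002.12\n'         >> "$demo_report"
printf 'chr1\tx\tx\tx\tNC_000001.11\n'         >> "$demo_report"
extract_mapping "$demo_report"
rm -f "$demo_report"
```

Sorting on the accession column keeps the output ready for a later `join` against per-accession read counts.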
The script generates an output.txt file with the following structure:
==================================================
| SAM FILE ANALYSIS REPORT |
==================================================
Total reads processed: 1234567
Total aligned reads: 1200000
--------------------------------------------------
ACCESSION CHROMOSOME READ COUNT
--------------------------------------------------
NC_000001.11 chr1 85432
NC_000002.12 chr2 78901
...
Execution time: 5s
==================================================
The script includes comprehensive validation:
- Ensures minimum 2 arguments (at least 1 SAM file + 1 assembly report)
- Verifies all files exist and are not empty
- Checks correct file extensions (`.sam` for SAM files, `.txt` for assembly report)
- SAM files: Confirms at least 11 tab-separated columns in the first 20 non-header lines
- Assembly report: Verifies presence of comment lines (starting with `#`)
- Ensures assembly report has adequate columns for chromosome-accession mapping
- Filters out invalid or unaligned reads during processing
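
The checks listed above could be written roughly as follows. This is a sketch under assumptions rather than the script's own `file_validation()`: the helper names `check_file`, `check_sam_columns`, and `check_assembly_report` and the error wording are hypothetical, and the assembly report check here only requires one data row with at least 5 columns.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeed if the file exists, is non-empty, and ends in .<ext>.
check_file() {
    local file="$1" ext="$2"
    [[ -s "$file" ]] || { echo "ERROR: '$file' not found or empty!" >&2; return 1; }
    [[ "$file" == *."$ext" ]] || { echo "ERROR: '$file' must have a .$ext extension!" >&2; return 1; }
}

# Hypothetical helper: succeed if the first 20 non-header lines
# each have at least 11 tab-separated columns.
check_sam_columns() {
    awk -F'\t' '
        /^@/         { next }             # skip SAM header lines
        NF < 11      { bad = 1; exit }    # too few columns: stop and fail
        ++seen >= 20 { exit }             # only sample the first 20 records
        END          { exit bad + 0 }
    ' "$1"
}

# Hypothetical helper: succeed if the report has a comment line and
# at least one data row with 5 or more tab-separated columns.
check_assembly_report() {
    grep -q '^#' "$1" || return 1
    awk -F'\t' '!/^#/ && NF >= 5 { ok = 1; exit } END { exit !ok }' "$1"
}
```

Each helper returns a non-zero status on failure, so a caller can chain them with `&&` or stop at the first failed check.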
- `print_usage()`: Displays usage instructions
- `validate_arguments()`: Validates command-line argument count
- `file_validation()`: Performs comprehensive file checks
- `count_reads()`: Calculates total and aligned read counts
- `generate_alignment_table()`: Creates chromosome-accession mapping table
- `end_msg()`: Displays completion message and optional output preview
- Argument and file validation
- Read counting across all SAM files
- Accession-to-chromosome mapping generation
- Results consolidation in output file
- Execution time reporting
- Interactive result display option
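
One simple way to implement the execution time step is Bash's built-in `SECONDS` variable, which counts the seconds elapsed since it was last assigned. Whether the script uses `SECONDS` or another mechanism is an assumption here; the helper name `format_elapsed` is hypothetical, and only the `Execution time: 5s` output style comes from the sample report.

```bash
#!/usr/bin/env bash

# Hypothetical helper: format an elapsed time in whole seconds.
format_elapsed() {
    printf 'Execution time: %ds\n' "$1"
}

SECONDS=0    # reset Bash's built-in timer before the work starts
# ... validation, read counting, and table generation would run here ...
format_elapsed "$SECONDS"
```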
The script provides detailed error messages for common issues:
- Missing or empty files
- Incorrect file extensions
- Insufficient command-line arguments
- Invalid SAM file structure (fewer than 11 columns)
- Malformed assembly report (missing comment lines or insufficient columns)
Example error output:
ERROR: SAM file 'sample.sam' not found or empty!
Usage: ./analyze_sam.sh <file1.sam> [file2.sam ...] <assembly_report.txt>
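
The usage line implies that the last argument is the assembly report and everything before it is a SAM file. A minimal sketch of that argument handling is shown below. The function names match the script's `print_usage()` and `validate_arguments()`, but the bodies, the error text, and the variable names `sam_files` and `assembly_report` are assumptions made for illustration.

```bash
#!/usr/bin/env bash

print_usage() {
    echo "Usage: ./analyze_sam.sh <file1.sam> [file2.sam ...] <assembly_report.txt>"
}

# Sketch: fail with a message on stderr when fewer than 2 arguments are given.
validate_arguments() {
    if (( $# < 2 )); then
        echo "ERROR: expected at least one SAM file and one assembly report!" >&2
        print_usage >&2
        return 1
    fi
}

# Simulate a command line (made-up file names) and split it.
set -- sample1.sam sample2.sam assembly_report.txt
validate_arguments "$@" || exit 1
sam_files=("${@:1:$#-1}")    # every argument except the last
assembly_report="${!#}"      # the last argument
echo "SAM files: ${sam_files[*]}"
echo "Assembly report: $assembly_report"
```

Sending errors and the usage line to stderr keeps them out of any redirected output.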
- Uses temporary files for efficient sorting and joining operations
- Processes all SAM files in a single `awk` pass for read counting
- Automatic cleanup of temporary files after execution
- Optimized for large SAM files with minimal memory footprint
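
The points above can be sketched as follows: one `awk` invocation reads every SAM file for the totals, and a temporary directory holds the sorted counts for `join`, with a trap removing it afterwards. The names `count_reads_sketch` and `alignment_table_sketch` are hypothetical, the mapping file is assumed to be a pre-sorted two-column `accession<TAB>chromosome` file, and the input is assumed to have passed validation already.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print "total<TAB>aligned" for all SAM files given, in one awk pass.
count_reads_sketch() {
    awk -F'\t' '
        /^@/      { next }         # skip header lines
                  { total++ }      # every remaining line counts as a read
        $3 != "*" { aligned++ }    # column 3 is the reference name; "*" means unaligned
        END       { printf "%d\t%d\n", total, aligned }
    ' "$@"
}

# Hypothetical helper: join per-accession read counts with a sorted mapping file
# (first argument). The function body is a subshell, so the EXIT trap fires and
# removes the temporary directory as soon as the function finishes.
alignment_table_sketch() (
    map="$1"; shift
    tmpdir="$(mktemp -d)" || exit 1
    trap 'rm -rf "$tmpdir"' EXIT
    awk -F'\t' '!/^@/ && $3 != "*" { n[$3]++ }
                END { for (a in n) print a "\t" n[a] }' "$@" \
        | LC_ALL=C sort -t $'\t' -k1,1 > "$tmpdir/counts.tsv"
    # Output columns: accession, chromosome, read count.
    LC_ALL=C join -t $'\t' "$map" "$tmpdir/counts.tsv"
)
```

Because `awk` keeps only one counter per reference name in memory, the footprint grows with the number of distinct references rather than with the number of reads.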
The script has been tested with:
- Single and multiple SAM file inputs
- Various file sizes (tested with files containing more than 100,000 reads)
- Different assembly report formats
- Edge cases (empty files, missing columns, invalid formats)
- File extension validation does not guarantee content validity (design choice for educational purposes)
- Requires standard Unix environment with common text processing tools
- Assembly report must follow the expected column structure (accession in column 5, chromosome in column 1)
- Alejandro Cobos
- Pablo Donaire
Developed as part of MSc in Bioinformatics - MI Programming in Bioinformatics course.
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.
Note: This script was designed for educational purposes to demonstrate Bash scripting techniques including function reusability, input validation, text processing with awk, and pipeline construction.