A lightweight, portable command-line VCF annotator designed to filter and annotate VCF files for human genomes. It filters chromosomes and mutations, then annotates the VCF file using OpenCravat. Its main benefit is that, unlike other annotators, it does not require downloading and setting up a local database. The software also integrates with MAFTools: it generates a MAF file and the required MAF sample annotation (called Clinical_Data in MAFTools). A starter R script for MAFTools is also generated at runtime, which you can then modify for your own analysis.
- Filters VCF chromosomes based on specified criteria.
- Filters mutations based on sample genotypes.
- Annotates VCF files using OpenCravat (requires an account at https://opencravat.org/; OpenCravat is free and open source).
- Supports saving and loading OpenCravat credentials for convenience.
Python version: 3.11.4. For dependencies, see requirements.txt.
Clone the repository:
git clone https://github.com/yourusername/vcf-annotator.git
cd vcf_annotator
Install the required dependencies:
pip install -r requirements.txt
To run VCF annotator, use the following command:
python ./main_cli.py -i <input_file> -o <output_file> -g <sample_groups> [-temp <y/n>] [-Num <normal_mutation_thresholds>] [-Tum <treatment_mutation_thresholds>]
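If you want to launch the tool from another Python script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the file names are the example values used later in this README.

```python
import sys


def build_command(input_vcf, output_id, groups_file, keep_temp=False,
                  normal=(2, 0), treatment=(0, 2)):
    """Assemble the argument list for main_cli.py from the documented flags.

    Hypothetical helper for illustration; the defaults mirror the
    documented defaults of -temp, -Num and -Tum.
    """
    return [
        sys.executable, "./main_cli.py",
        "-i", input_vcf,
        "-o", output_id,
        "-g", groups_file,
        "-temp", "y" if keep_temp else "n",
        "-Num", *map(str, normal),
        "-Tum", *map(str, treatment),
    ]


cmd = build_command("data/sample.vcf", "result1", "data/sample_groups.txt")
# Show the command; pass the list to subprocess.run(cmd, check=True)
# from the repository root to actually run the tool.
print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.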
To remove the OpenCravat credentials and the results files located in the VCF_annotator/results/ directory, use the following command:
python ./cleanup.py
-i, --input_file
Required: Yes
Usage: -i <input_file> or --input_file <input_file>
Description:
This argument specifies the path to the input VCF (Variant Call Format) file that will be processed by the tool.
The VCF file contains the genomic variants data that needs to be filtered and annotated.
Example: -i data/sample.vcf or --input_file data/sample.vcf
-o, --output_file
Required: Yes
Usage: -o <output_file> or --output_file <output_file>
Description:
This argument specifies an identifier for the output files generated by the tool.
Note that this is not the full path to the output file, but rather an identifier that will be included in the output file names.
The actual files will be saved in specified directories with names incorporating this identifier.
```sh
Example: -o result1 or --output_file result1
```
-g, --sample_groups
Required: Yes
Usage: -g <sample_groups> or --sample_groups <sample_groups>
Description:
This argument specifies the path to a tab-delimited file containing the sample names found in the VCF file.
The first column of this file should contain the normal samples, and the second column should contain the treatment samples.
This file helps in categorizing the samples for the filtering process.
```sh
Example: -g data/sample_groups.txt or --sample_groups data/sample_groups.txt
```
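The sample groups file can be written by hand or generated. Below is a minimal sketch of the layout described above (normal samples in column 1, treatment samples in column 2). It assumes the file has no header row and that groups of unequal size are padded with empty cells; neither detail is confirmed by this README, and the sample names and function names are hypothetical.

```python
import csv
from itertools import zip_longest


def write_sample_groups(path, normal_samples, treatment_samples):
    """Write normal samples to column 1 and treatment samples to column 2."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumption: no header row; the shorter group is padded with "".
        for normal, treatment in zip_longest(
            normal_samples, treatment_samples, fillvalue=""
        ):
            writer.writerow([normal, treatment])


def read_sample_groups(path):
    """Return (normal_samples, treatment_samples), skipping empty cells."""
    normal, treatment = [], []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) > 0 and row[0].strip():
                normal.append(row[0].strip())
            if len(row) > 1 and row[1].strip():
                treatment.append(row[1].strip())
    return normal, treatment
```

Reading the file back with `read_sample_groups` is a quick way to check that every name matches a sample column in the VCF header before running the tool.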
-temp, --temp_keep
Required: No
Usage: -temp <y/n> or --temp_keep <y/n>
Type: str
Default: "n"
Description:
This argument determines whether intermediate files generated during the processing should be kept or deleted after the analysis is complete.
Use y to keep the temporary files and n to delete them.
```sh
Example: -temp y or --temp_keep y
```
-Num, --normal_mutations
Required: No
Usage: -Num <threshold1> <threshold2> or --normal_mutations <threshold1> <threshold2>
Type: int (expects two integers)
Default: [2, 0]
Description:
This argument specifies the thresholds for filtering normal mutations.
The tool will keep mutations found in the normal samples that are equal to or above the first threshold and equal to or below the second threshold.
This helps in fine-tuning which mutations are considered significant for the normal samples.
```sh
Example: -Num 2 0 or --normal_mutations 2 0
```
-Tum, --treatment_mutations
Required: No
Usage: -Tum <threshold1> <threshold2> or --treatment_mutations <threshold1> <threshold2>
Type: int (expects two integers)
Default: [0, 2]
Description:
This argument specifies the thresholds for filtering treatment mutations.
The tool will keep mutations found in the treatment samples that are equal to or below the first threshold and equal to or above the second threshold.
This helps in fine-tuning which mutations are considered significant for the treatment samples.
```sh
Example: -Tum 0 2 or --treatment_mutations 0 2
```
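For reference, the arguments documented above map onto a standard `argparse` declaration. The sketch below is an assumption about how such a parser could be written, not a copy of the code in main_cli.py; only the flag names, types, and defaults are taken from this README.

```python
import argparse


def build_parser():
    """Declare the documented flags; a sketch, not the tool's actual parser."""
    parser = argparse.ArgumentParser(prog="main_cli.py")
    parser.add_argument("-i", "--input_file", required=True,
                        help="path to the input VCF file")
    parser.add_argument("-o", "--output_file", required=True,
                        help="identifier included in output file names")
    parser.add_argument("-g", "--sample_groups", required=True,
                        help="tab-delimited file of normal/treatment samples")
    parser.add_argument("-temp", "--temp_keep", type=str, default="n",
                        help="y to keep intermediate files, n to delete them")
    # nargs=2 with type=int yields a list of two integers per flag.
    parser.add_argument("-Num", "--normal_mutations", type=int, nargs=2,
                        default=[2, 0])
    parser.add_argument("-Tum", "--treatment_mutations", type=int, nargs=2,
                        default=[0, 2])
    return parser


args = build_parser().parse_args(
    ["-i", "data/sample.vcf", "-o", "result1", "-g", "data/sample_groups.txt"]
)
print(args.normal_mutations, args.treatment_mutations)
```

With `nargs=2`, passing a single value to `-Num` or `-Tum` makes `argparse` exit with a usage error, which matches the "expects two integers" requirement.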
