Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR changes the quick fingerprinting tool to output results in CSV format instead of plain text, while adding comprehensive documentation throughout the codebase.
- Converts output from text format to structured CSV with columns for cram_filename, sample_id, ground_truth_id, hit_fraction, and best_match
- Adds detailed docstrings and inline comments to improve code readability and maintainability
- Removes error checking and validation logic that was previously present in the comparison workflow
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ugvc/pipelines/comparison/quick_fingerprinting.py | Added docstring to parser function and inline comments explaining each processing step |
| ugvc/comparison/quick_fingerprinter.py | Changed output format from text to CSV, added comprehensive documentation, and removed error validation logic |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w") | ||
|
|
||
| os.makedirs(out_dir, exist_ok=True) | ||
|
|
||
| # Prepare ground truth VCFs for each sample | ||
| self.ground_truths_to_check = self.prepare_ground_truth() | ||
|
|
There was a problem hiding this comment.
The file is opened without proper resource management. Consider using a context manager or ensure the file is closed in a finally block to prevent resource leaks if an exception occurs before reaching the close() call.
| self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w") | |
| os.makedirs(out_dir, exist_ok=True) | |
| # Prepare ground truth VCFs for each sample | |
| self.ground_truths_to_check = self.prepare_ground_truth() | |
| self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w") | |
| self._file_closed = False | |
| os.makedirs(out_dir, exist_ok=True) | |
| # Prepare ground truth VCFs for each sample | |
| self.ground_truths_to_check = self.prepare_ground_truth() | |
| def __enter__(self): | |
| return self | |
| def __exit__(self, exc_type, exc_val, exc_tb): | |
| self.close() | |
| def close(self): | |
| if hasattr(self, "output_file") and not self.output_file.closed: | |
| self.output_file.close() | |
| self._file_closed = True |
|
@gilhornung - can you fix the test please? |
f4bc296 to
08f89f2
Compare
| for ground_truth_id, hit_fraction in hit_fraction_dict.items(): | ||
| self.print(f"{cram_base_name},{sample_id},{ground_truth_id},{hit_fraction},{best_match}") | ||
|
|
||
| self.output_file.close() |
There was a problem hiding this comment.
Fingerprint check no longer fails mismatches
High Severity
check() now only writes CSV rows and never evaluates min_hit_fraction_target or raises errors for incorrect matches. As a result, sample/ground-truth mismatches no longer fail the run, so quick_fingerprinter.py can report successful completion even when fingerprinting is inconsistent.
There was a problem hiding this comment.
This is a functionality I want
| self.vc = VariantHitFractionCaller(self.ref, self.out_dir, self.sp, self.min_af_snps, region) | ||
| self.vpu = VcfUtils(self.sp) | ||
| # Output file for results (now CSV, with configurable name) | ||
| self.output_file = open(os.path.join(self.out_dir, csv_name), "w") |
There was a problem hiding this comment.
Output file not closed on failures
Low Severity
self.output_file is opened in __init__ and only closed at the end of check(). If any command in prepare_ground_truth() or check() raises, cleanup is skipped and the file handle remains open. This removed the previous context-managed lifecycle and can leak descriptors across repeated runs.
Additional Locations (1)
|
I bumped into this PR, which I probably missed before. |
| self.vpu = VcfUtils(self.sp) | ||
| # Output file for results (now CSV, with configurable name) | ||
| self.output_file = open(os.path.join(self.out_dir, csv_name), "w") | ||
|
|
There was a problem hiding this comment.
Output file opened before directory is created
Medium Severity
self.output_file = open(...) on line 52 is called before os.makedirs(out_dir, exist_ok=True) on line 54. If the output directory doesn't already exist, the open() call will raise a FileNotFoundError. The os.makedirs call needs to come first. The current single caller happens to pre-create the directory, masking this bug, but the class itself is broken for any other caller.
| self.sp = sp | ||
| self.add_aws_auth_command = add_aws_auth_command | ||
| self.n_jobs = n_jobs | ||
| # Variant caller for hit fraction calculation |
There was a problem hiding this comment.
Instance attribute self.vc is now unused dead code
Low Severity
self.vc (a VariantHitFractionCaller instance) is created in __init__ but never referenced anywhere in the new code. The refactored _process_cram method creates its own local vc instance instead. This is leftover dead code from the old check() method that previously used self.vc.
| crams = self.crams[sample_id] | ||
| rows_by_cram = Parallel(n_jobs=self.n_jobs, prefer="threads")( | ||
| delayed(self._process_cram)(sample_id, cram) for cram in crams | ||
| ) |
There was a problem hiding this comment.
Shared SimplePipeline used unsafely from parallel threads
High Severity
_process_cram is executed in parallel threads via joblib.Parallel, but each thread calls self.sp.print_and_run() on the same shared SimplePipeline instance. SimplePipeline maintains an internal command index for sequential execution with fc/lc range control and is not designed for concurrent thread access. The library provides its own run_parallel method for parallelism. Concurrent print_and_run calls will corrupt the command counter, potentially causing commands to be skipped or misordered.
Additional Locations (1)
| called_variants = get_vcf_df(called_vcf) | ||
| if "dp" not in called_variants.columns: | ||
| called_variants["dp"] = pd.NA | ||
| called_variants["dp"] = pd.to_numeric(called_variants["dp"], errors="coerce") |
There was a problem hiding this comment.
Unused dp column processing in calc_hit_fraction
Low Severity
The dp column handling (checking for existence, adding if missing, converting to numeric) was added to calc_hit_fraction but the dp column is never used within this method. This logic is only meaningful in get_mean_depth, where an identical block appears. It looks like copy-paste dead code.
32d2ce2 to
b4e73bc
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| n_jobs=args.n_jobs, | ||
| ).check() | ||
|
|
||
| # Raise errors if any occurred |
There was a problem hiding this comment.
Dead errors list never populated in run function
Low Severity
The errors list on line 85 is initialized but never appended to. The old check() method internally raised RuntimeError on fingerprinting mismatches, but the new check() only outputs CSV without any error reporting. The if len(errors) > 0 check on line 105 is dead code that gives a false impression that errors are being caught.


Note
Medium Risk
Changes core fingerprinting execution flow (parallelism) and output contract (CSV + new metrics), which may impact downstream consumers and introduce concurrency-related issues in the shared pipeline runner.
Overview
Quick fingerprinting output is changed from a free-form
.txtlog to a structured CSV (quick_fingerprinting_results.csv) with a header and per-(cram, ground truth) rows includinghit_fraction,best_match, and newmean_depth.QuickFingerprinter.check()is refactored to run CRAM processing in parallel viajoblib(--n_jobs) and the pipeline adds--results_file_nameto control the output filename.VariantHitFractionCallernow tolerates missingdpand computes mean depth from called VCFs, and the system test is updated to assert against the CSV format.Written by Cursor Bugbot for commit b4e73bc. This will update automatically on new commits. Configure here.