Skip to content

Comments

Output fingerprinting results as csv#249

Open
gilhornung wants to merge 2 commits intomasterfrom
gil/fingerprint_csv
Open

Output fingerprinting results as csv#249
gilhornung wants to merge 2 commits intomasterfrom
gil/fingerprint_csv

Conversation

@gilhornung
Copy link
Collaborator

@gilhornung gilhornung commented Oct 6, 2025

Note

Medium Risk
Changes core fingerprinting execution flow (parallelism) and output contract (CSV + new metrics), which may impact downstream consumers and introduce concurrency-related issues in the shared pipeline runner.

Overview
Quick fingerprinting output is changed from a free-form .txt log to a structured CSV (quick_fingerprinting_results.csv) with a header and per-(cram, ground truth) rows including hit_fraction, best_match, and new mean_depth.

QuickFingerprinter.check() is refactored to run CRAM processing in parallel via joblib (--n_jobs) and the pipeline adds --results_file_name to control the output filename. VariantHitFractionCaller now tolerates missing dp and computes mean depth from called VCFs, and the system test is updated to assert against the CSV format.

Written by Cursor Bugbot for commit b4e73bc. This will update automatically on new commits. Configure here.

@gilhornung gilhornung requested review from Copilot and doron-st October 6, 2025 10:29
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR changes the quick fingerprinting tool to output results in CSV format instead of plain text, while adding comprehensive documentation throughout the codebase.

  • Converts output from text format to structured CSV with columns for cram_filename, sample_id, ground_truth_id, hit_fraction, and best_match
  • Adds detailed docstrings and inline comments to improve code readability and maintainability
  • Removes error checking and validation logic that was previously present in the comparison workflow

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ugvc/pipelines/comparison/quick_fingerprinting.py Added docstring to parser function and inline comments explaining each processing step
ugvc/comparison/quick_fingerprinter.py Changed output format from text to CSV, added comprehensive documentation, and removed error validation logic

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines 47 to 53
self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w")

os.makedirs(out_dir, exist_ok=True)

# Prepare ground truth VCFs for each sample
self.ground_truths_to_check = self.prepare_ground_truth()

Copy link

Copilot AI Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file is opened without proper resource management. Consider using a context manager or ensure the file is closed in a finally block to prevent resource leaks if an exception occurs before reaching the close() call.

Suggested change
self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w")
os.makedirs(out_dir, exist_ok=True)
# Prepare ground truth VCFs for each sample
self.ground_truths_to_check = self.prepare_ground_truth()
self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w")
self._file_closed = False
os.makedirs(out_dir, exist_ok=True)
# Prepare ground truth VCFs for each sample
self.ground_truths_to_check = self.prepare_ground_truth()
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.close()
def close(self):
if hasattr(self, "output_file") and not self.output_file.closed:
self.output_file.close()
self._file_closed = True

Copilot uses AI. Check for mistakes.
@ilyasoifer ilyasoifer self-requested a review October 18, 2025 20:52
@ilyasoifer
Copy link
Collaborator

@gilhornung - can you fix the test please?

for ground_truth_id, hit_fraction in hit_fraction_dict.items():
self.print(f"{cram_base_name},{sample_id},{ground_truth_id},{hit_fraction},{best_match}")

self.output_file.close()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fingerprint check no longer fails mismatches

High Severity

check() now only writes CSV rows and never evaluates min_hit_fraction_target or raises errors for incorrect matches. As a result, sample/ground-truth mismatches no longer fail the run, so quick_fingerprinter.py can report successful completion even when fingerprinting is inconsistent.

Fix in Cursor Fix in Web

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a functionality I want

self.vc = VariantHitFractionCaller(self.ref, self.out_dir, self.sp, self.min_af_snps, region)
self.vpu = VcfUtils(self.sp)
# Output file for results (now CSV, with configurable name)
self.output_file = open(os.path.join(self.out_dir, csv_name), "w")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output file not closed on failures

Low Severity

self.output_file is opened in __init__ and only closed at the end of check(). If any command in prepare_ground_truth() or check() raises, cleanup is skipped and the file handle remains open. This removed the previous context-managed lifecycle and can leak descriptors across repeated runs.

Additional Locations (1)

Fix in Cursor Fix in Web

@doron-st
Copy link
Collaborator

I bumped into this PR, which I probably missed before.
Do we want to close it or try to merge it?

self.vpu = VcfUtils(self.sp)
# Output file for results (now CSV, with configurable name)
self.output_file = open(os.path.join(self.out_dir, csv_name), "w")

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output file opened before directory is created

Medium Severity

self.output_file = open(...) on line 52 is called before os.makedirs(out_dir, exist_ok=True) on line 54. If the output directory doesn't already exist, the open() call will raise a FileNotFoundError. The os.makedirs call needs to come first. The current single caller happens to pre-create the directory, masking this bug, but the class itself is broken for any other caller.

Fix in Cursor Fix in Web

self.sp = sp
self.add_aws_auth_command = add_aws_auth_command
self.n_jobs = n_jobs
# Variant caller for hit fraction calculation
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instance attribute self.vc is now unused dead code

Low Severity

self.vc (a VariantHitFractionCaller instance) is created in __init__ but never referenced anywhere in the new code. The refactored _process_cram method creates its own local vc instance instead. This is leftover dead code from the old check() method that previously used self.vc.

Fix in Cursor Fix in Web

crams = self.crams[sample_id]
rows_by_cram = Parallel(n_jobs=self.n_jobs, prefer="threads")(
delayed(self._process_cram)(sample_id, cram) for cram in crams
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shared SimplePipeline used unsafely from parallel threads

High Severity

_process_cram is executed in parallel threads via joblib.Parallel, but each thread calls self.sp.print_and_run() on the same shared SimplePipeline instance. SimplePipeline maintains an internal command index for sequential execution with fc/lc range control and is not designed for concurrent thread access. The library provides its own run_parallel method for parallelism. Concurrent print_and_run calls will corrupt the command counter, potentially causing commands to be skipped or misordered.

Additional Locations (1)

Fix in Cursor Fix in Web

called_variants = get_vcf_df(called_vcf)
if "dp" not in called_variants.columns:
called_variants["dp"] = pd.NA
called_variants["dp"] = pd.to_numeric(called_variants["dp"], errors="coerce")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused dp column processing in calc_hit_fraction

Low Severity

The dp column handling (checking for existence, adding if missing, converting to numeric) was added to calc_hit_fraction but the dp column is never used within this method. This logic is only meaningful in get_mean_depth, where an identical block appears. It looks like copy-paste dead code.

Fix in Cursor Fix in Web

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

n_jobs=args.n_jobs,
).check()

# Raise errors if any occurred
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead errors list never populated in run function

Low Severity

The errors list on line 85 is initialized but never appended to. The old check() method internally raised RuntimeError on fingerprinting mismatches, but the new check() only outputs CSV without any error reporting. The if len(errors) > 0 check on line 105 is dead code that gives a false impression that errors are being caught.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants