Output fingerprinting results as csv#249

Open

gilhornung wants to merge 2 commits intomasterfrom

gil/fingerprint_csv

Collaborator

gilhornung commented Oct 6, 2025 •

edited by cursor bot

Loading

Note

Medium Risk
Changes core fingerprinting execution flow (parallelism) and output contract (CSV + new metrics), which may impact downstream consumers and introduce concurrency-related issues in the shared pipeline runner.

Overview
Quick fingerprinting output is changed from a free-form .txt log to a structured CSV (quick_fingerprinting_results.csv) with a header and per-(cram, ground truth) rows including hit_fraction, best_match, and new mean_depth.

QuickFingerprinter.check() is refactored to run CRAM processing in parallel via joblib (--n_jobs) and the pipeline adds --results_file_name to control the output filename. VariantHitFractionCaller now tolerates missing dp and computes mean depth from called VCFs, and the system test is updated to assert against the CSV format.

^{Written by Cursor Bugbot for commit b4e73bc. This will update automatically on new commits. Configure here.}

gilhornung requested review from Copilot and doron-st

October 6, 2025 10:29

Copilot AI reviewed

View reviewed changes

Copilot AI left a comment

Pull Request Overview

This PR changes the quick fingerprinting tool to output results in CSV format instead of plain text, while adding comprehensive documentation throughout the codebase.

Converts output from text format to structured CSV with columns for cram_filename, sample_id, ground_truth_id, hit_fraction, and best_match
Adds detailed docstrings and inline comments to improve code readability and maintainability
Removes error checking and validation logic that was previously present in the comparison workflow

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
ugvc/pipelines/comparison/quick_fingerprinting.py	Added docstring to parser function and inline comments explaining each processing step
ugvc/comparison/quick_fingerprinter.py	Changed output format from text to CSV, added comprehensive documentation, and removed error validation logic

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

ugvc/comparison/quick_fingerprinter.py Outdated

Comment on lines 47 to 53

+                      self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w")
                       os.makedirs(out_dir, exist_ok=True)
+                      # Prepare ground truth VCFs for each sample
                       self.ground_truths_to_check = self.prepare_ground_truth()

Copilot AI Oct 6, 2025

The file is opened without proper resource management. Consider using a context manager or ensure the file is closed in a finally block to prevent resource leaks if an exception occurs before reaching the close() call.

Suggested change

      
                    self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w")
          
                    os.makedirs(out_dir, exist_ok=True)
          
                    # Prepare ground truth VCFs for each sample
          
                    self.ground_truths_to_check = self.prepare_ground_truth()
          
                    self.output_file = open(f"{self.out_dir}/quick_fingerprinting_results.csv", "w")
          
                    self._file_closed = False
          
                    os.makedirs(out_dir, exist_ok=True)
          
                    # Prepare ground truth VCFs for each sample
          
                    self.ground_truths_to_check = self.prepare_ground_truth()
          
                def __enter__(self):
          
                    return self
          
                def __exit__(self, exc_type, exc_val, exc_tb):
          
                    self.close()
          
                def close(self):
          
                    if hasattr(self, "output_file") and not self.output_file.closed:
          
                        self.output_file.close()
          
                        self._file_closed = True

Copilot uses AI. Check for mistakes.

ugvc/comparison/quick_fingerprinter.py Outdated Show resolved Hide resolved

ilyasoifer self-requested a review

October 18, 2025 20:52

Collaborator

ilyasoifer commented Oct 18, 2025

@gilhornung - can you fix the test please?

gilhornung force-pushed the gil/fingerprint_csv branch from f4bc296 to 08f89f2 Compare

February 15, 2026 15:35

cursor bot reviewed

View reviewed changes

ugvc/comparison/quick_fingerprinter.py

+                              for ground_truth_id, hit_fraction in hit_fraction_dict.items():
+                                  self.print(f"{cram_base_name},{sample_id},{ground_truth_id},{hit_fraction},{best_match}")
+                      self.output_file.close()

cursor bot Feb 15, 2026

Fingerprint check no longer fails mismatches

High Severity

check() now only writes CSV rows and never evaluates min_hit_fraction_target or raises errors for incorrect matches. As a result, sample/ground-truth mismatches no longer fail the run, so quick_fingerprinter.py can report successful completion even when fingerprinting is inconsistent.

Collaborator Author

gilhornung Feb 22, 2026

This is a functionality I want

ugvc/comparison/quick_fingerprinter.py

                       self.vc = VariantHitFractionCaller(self.ref, self.out_dir, self.sp, self.min_af_snps, region)
                       self.vpu = VcfUtils(self.sp)
+                      # Output file for results (now CSV, with configurable name)
+                      self.output_file = open(os.path.join(self.out_dir, csv_name), "w")

cursor bot Feb 15, 2026

Output file not closed on failures

Low Severity

self.output_file is opened in __init__ and only closed at the end of check(). If any command in prepare_ground_truth() or check() raises, cleanup is skipped and the file handle remains open. This removed the previous context-managed lifecycle and can leak descriptors across repeated runs.

Additional Locations (1)

ugvc/comparison/quick_fingerprinter.py#L139-L140

Collaborator

doron-st commented Feb 22, 2026

I bumped into this PR, which I probably missed before.
Do we want to close it or try to merge it?

cursor bot reviewed

View reviewed changes

ugvc/comparison/quick_fingerprinter.py

                       self.vpu = VcfUtils(self.sp)
+                      # Output file for results (now CSV, with configurable name)
+                      self.output_file = open(os.path.join(self.out_dir, csv_name), "w")

cursor bot Feb 22, 2026

Output file opened before directory is created

Medium Severity

self.output_file = open(...) on line 52 is called before os.makedirs(out_dir, exist_ok=True) on line 54. If the output directory doesn't already exist, the open() call will raise a FileNotFoundError. The os.makedirs call needs to come first. The current single caller happens to pre-create the directory, masking this bug, but the class itself is broken for any other caller.

ugvc/comparison/quick_fingerprinter.py

                       self.sp = sp
                       self.add_aws_auth_command = add_aws_auth_command
+                      self.n_jobs = n_jobs
+                      # Variant caller for hit fraction calculation

cursor bot Feb 22, 2026

Instance attribute `self.vc` is now unused dead code

Low Severity

self.vc (a VariantHitFractionCaller instance) is created in __init__ but never referenced anywhere in the new code. The refactored _process_cram method creates its own local vc instance instead. This is leftover dead code from the old check() method that previously used self.vc.

ugvc/comparison/quick_fingerprinter.py

+                          crams = self.crams[sample_id]
+                          rows_by_cram = Parallel(n_jobs=self.n_jobs, prefer="threads")(
+                              delayed(self._process_cram)(sample_id, cram) for cram in crams
+                          )

cursor bot Feb 22, 2026

Shared `SimplePipeline` used unsafely from parallel threads

High Severity

_process_cram is executed in parallel threads via joblib.Parallel, but each thread calls self.sp.print_and_run() on the same shared SimplePipeline instance. SimplePipeline maintains an internal command index for sequential execution with fc/lc range control and is not designed for concurrent thread access. The library provides its own run_parallel method for parallelism. Concurrent print_and_run calls will corrupt the command counter, potentially causing commands to be skipped or misordered.

Additional Locations (1)

ugvc/comparison/quick_fingerprinter.py#L110-L119

ugvc/comparison/variant_hit_fraction_caller.py

                       called_variants = get_vcf_df(called_vcf)
+                      if "dp" not in called_variants.columns:
+                          called_variants["dp"] = pd.NA
+                      called_variants["dp"] = pd.to_numeric(called_variants["dp"], errors="coerce")

cursor bot Feb 22, 2026

Unused `dp` column processing in `calc_hit_fraction`

Low Severity

The dp column handling (checking for existence, adding if missing, converting to numeric) was added to calc_hit_fraction but the dp column is never used within this method. This logic is only meaningful in get_mean_depth, where an identical block appears. It looks like copy-paste dead code.

gilhornung added 2 commits

February 22, 2026 15:16


          Output fingerprinting results as csv

3e876fe


          updated ugbio_utils

b4e73bc

gilhornung force-pushed the gil/fingerprint_csv branch from 32d2ce2 to b4e73bc Compare

February 22, 2026 15:16

cursor bot reviewed

View reviewed changes

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

ugvc/pipelines/comparison/quick_fingerprinting.py

+                      n_jobs=args.n_jobs,
                   ).check()
+                  # Raise errors if any occurred

cursor bot Feb 22, 2026

Dead errors list never populated in run function

Low Severity

The errors list on line 85 is initialized but never appended to. The old check() method internally raised RuntimeError on fingerprinting mismatches, but the new check() only outputs CSV without any error reporting. The if len(errors) > 0 check on line 105 is dead code that gives a false impression that errors are being caught.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet