DRAFT: Triage + proposed priority fixes (with three-point effort estimates) #434

@dillingerstaffing

Description

Status: DRAFT. Rough guesses. This issue needs review for accuracy.

Reviewed open issues. I'd like to propose the following three as highest-impact, lowest-effort fixes. Posting here to give the community a chance to weigh in before anyone picks them up.

Estimates below are three-point: optimistic / realistic / pessimistic. Optimistic is technically achievable if everything lines up, realistic is the likely outcome, and pessimistic is the "would actually surprise me if it blew past this" ceiling.

Already fixed on main, recommend closing

#396, tsv_parser() 23 vs 24 columns

  • Status: Fix already landed. scoring/src/scoring/constants.py:602-622 defines isCollaborativeNoteKey and includes it in noteTSVColumnsAndTypes with pd.Int8Dtype(). The reporter was on commit d2f2ea3 which predates the fix.
  • Action (close as fixed): O 5 min / R 15 min / P 45 min
    • Optimistic: reproduce on current main with a 24-column TSV to confirm parse succeeds, then close with a pointer to the commit.
    • Realistic: confirm + write a short regression test on the schema length against the constants table.
    • Pessimistic: reporter's dataset exposes a second schema drift (e.g. isCollaborativeNote dtype edge cases with NA), requiring a small dtype fix.
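The realistic path (a regression test tying schema length to the constants table) could be sketched roughly like this. The column list is an illustrative stand-in, not the repo's actual noteTSVColumnsAndTypes, and parse_note_tsv is a hypothetical name, not the repo's tsv_parser:

```python
import io

import pandas as pd

# Illustrative stand-in for the constants table in scoring/constants.py;
# only the last entry mirrors the isCollaborativeNote fix described above.
noteTSVColumnsAndTypes = [
    ("noteId", pd.Int64Dtype()),
    ("summary", str),
    ("isCollaborativeNote", pd.Int8Dtype()),  # the added 24th-column analogue
]

def parse_note_tsv(raw: str) -> pd.DataFrame:
    """Minimal sketch of a schema-checked TSV parse (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(raw), sep="\t")
    expected = [name for name, _ in noteTSVColumnsAndTypes]
    if list(df.columns) != expected:
        raise ValueError(f"expected {len(expected)} columns, got {len(df.columns)}")
    return df.astype(dict(noteTSVColumnsAndTypes))

tsv = "noteId\tsummary\tisCollaborativeNote\n1\thello\t0\n"
df = parse_note_tsv(tsv)
```

A regression test along these lines would just assert that the header of a current-format file matches the constants table, so a future column addition fails loudly instead of at parse time.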

#306, cuda:0 tensor → numpy TypeError in normalized_loss.py

  • Status: Fix already landed. scoring/src/scoring/matrix_factorization/normalized_loss.py:108 already reads targets.cpu().numpy().
  • Action (close as fixed): O 5 min / R 15 min / P 1 hr
    • Optimistic: confirm on main, close with commit reference.
    • Realistic: grep for any remaining .numpy() calls on potentially-CUDA tensors in the MF module; close with confidence.
    • Pessimistic: find at least one other unguarded .numpy() path (there are several tensor ops in matrix_factorization/) that needs the same guard.

Still actionable

#345, pandas int64 vs Int64 merge mismatch in pflip_plus_model.py:348

  • Status: Reproducible on main. _compute_scoring_cutoff casts cutoffByRatings to pd.Int64Dtype() (nullable) via .astype(pd.Int64Dtype()) at lines ~336 and ~342, then merges against scoringCutoff whose noteId is plain int64 (numpy). This repo enforces column dtypes via a custom pandas_utils wrapper (Type expectation mismatch on noteId: found=Int64 expected=int64), so the fix must align with the canonical registered dtype, not just paper over with a local cast.
  • Fix shape: cast cutoffByRatings[c.noteIdKey] back to np.int64 before the merge (canonical noteId dtype in the repo is int64, based on the type-checker's "expected=int64" message). Apply the same pattern to the "ratingMin" column if it carries through a type expectation.
  • Estimate: O 20 min / R 1 hr / P 4 hrs
    • Optimistic (20 min): one-line .astype(np.int64) on the noteId column of cutoffByRatings before .merge, verified against the reporter's stack trace; no other call sites affected.
    • Realistic (~1 hr): reproduce locally, trace the dtype through _compute_scoring_cutoff and the surrounding fit/_prepare_note_info flow to confirm the canonical dtype choice, apply the cast, run the existing test suite, add a small regression test that exercises the merge.
    • Pessimistic (4 hrs): the Int64Dtype() cast on the whole cutoffByRatings frame was intentional (e.g. nullable semantics for "nthRatingMts" when minRatings exceeds the group size), so narrowing just the noteId column cascades into downstream dtype mismatches on "ratingMin" or _SCORING_CUTOFF_MTS. Resolution requires coordinating with maintainers on the canonical dtype registered in pandas_utils and potentially updating a handful of adjacent call sites.
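The fix shape can be sketched with a toy merge. Column names other than noteId are illustrative, and narrowing Int64 back to numpy int64 is only safe if the key column carries no NAs (if it does, the pessimistic branch above applies and the cast itself will raise):

```python
import numpy as np
import pandas as pd

# Toy cutoffByRatings: a frame-wide astype(pd.Int64Dtype()) leaves every
# column, including the merge key, as the nullable extension dtype.
cutoffByRatings = pd.DataFrame({"noteId": [10, 11, 12], "ratingCount": [5, 7, 2]})
cutoffByRatings = cutoffByRatings.astype(pd.Int64Dtype())

# Toy scoringCutoff keeps the canonical plain numpy int64 key.
scoringCutoff = pd.DataFrame(
    {"noteId": np.array([10, 11, 12], dtype=np.int64), "cutoff": [1, 1, 0]}
)

# Fix shape: narrow only the merge key back to int64 before merging,
# leaving the nullable semantics of the other columns intact.
# Assumes noteId has no NA values; an NA would make this astype raise.
cutoffByRatings["noteId"] = cutoffByRatings["noteId"].astype(np.int64)
merged = cutoffByRatings.merge(scoringCutoff, on="noteId")
```

Note the other columns (here ratingCount) stay Int64, which is what keeps the narrow fix from disturbing any intentional nullable semantics elsewhere in the frame.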

Honorable mentions (considered but skipped, not repo code fixes)

Feedback welcome

cc the reporters: @2vitalik, @Jacobsonradical, @SarahGrevy
