DRAFT: Triage + proposed priority fixes (with three-point effort estimates) #434

@dillingerstaffing

Description

Status: DRAFT. Rough guesses. This issue needs review for accuracy.

Reviewed open issues. I'd like to propose the following three as highest-impact, lowest-effort fixes. Posting here to give the community a chance to weigh in before anyone picks them up.

Estimates below are three-point: optimistic / realistic / pessimistic. Optimistic is technically achievable if everything lines up, realistic is the likely outcome, and pessimistic is the "would actually surprise me if it blew past this" ceiling.

Already fixed on main, recommend closing

#396, tsv_parser() 23 vs 24 columns

  • Status: Fix already landed. scoring/src/scoring/constants.py:602-622 defines isCollaborativeNoteKey and includes it in noteTSVColumnsAndTypes with pd.Int8Dtype(). The reporter was on commit d2f2ea3 which predates the fix.
  • Action (close as fixed): O 5 min / R 15 min / P 45 min
    • Optimistic: reproduce on current main with a 24-column TSV to confirm parse succeeds, then close with a pointer to the commit.
    • Realistic: confirm + write a short regression test on the schema length against the constants table.
    • Pessimistic: reporter's dataset exposes a second schema drift (e.g. isCollaborativeNote dtype edge cases with NA), requiring a small dtype fix.
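The realistic path (a regression test tying schema length to the constants table) could be sketched roughly like this. The column list is an illustrative stand-in, not the repo's actual noteTSVColumnsAndTypes, and parse_note_tsv is a hypothetical name, not the repo's tsv_parser:

```python
import io

import pandas as pd

# Illustrative stand-in for the constants table in scoring/constants.py;
# only the last entry mirrors the isCollaborativeNote fix described above.
noteTSVColumnsAndTypes = [
    ("noteId", pd.Int64Dtype()),
    ("summary", str),
    ("isCollaborativeNote", pd.Int8Dtype()),  # the added 24th-column analogue
]

def parse_note_tsv(raw: str) -> pd.DataFrame:
    """Minimal sketch of a schema-checked TSV parse (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(raw), sep="\t")
    expected = [name for name, _ in noteTSVColumnsAndTypes]
    if list(df.columns) != expected:
        raise ValueError(f"expected {len(expected)} columns, got {len(df.columns)}")
    return df.astype(dict(noteTSVColumnsAndTypes))

tsv = "noteId\tsummary\tisCollaborativeNote\n1\thello\t0\n"
df = parse_note_tsv(tsv)
```

A regression test along these lines would just assert that the header of a current-format file matches the constants table, so a future column addition fails loudly instead of at parse time.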

#306, cuda:0 tensor → numpy TypeError in normalized_loss.py

  • Status: Fix already landed. scoring/src/scoring/matrix_factorization/normalized_loss.py:108 already reads targets.cpu().numpy().
  • Action (close as fixed): O 5 min / R 15 min / P 1 hr
    • Optimistic: confirm on main, close with commit reference.
    • Realistic: grep for any remaining .numpy() calls on potentially-CUDA tensors in the MF module; close with confidence.
    • Pessimistic: find at least one other unguarded .numpy() path (there are several tensor ops in matrix_factorization/) that needs the same guard.

Still actionable

#345, pandas int64 vs Int64 merge mismatch in pflip_plus_model.py:348

  • Status: Reproducible on main. _compute_scoring_cutoff casts cutoffByRatings to pd.Int64Dtype() (nullable) via .astype(pd.Int64Dtype()) at lines ~336 and ~342, then merges against scoringCutoff whose noteId is plain int64 (numpy). This repo enforces column dtypes via a custom pandas_utils wrapper (Type expectation mismatch on noteId: found=Int64 expected=int64), so the fix must align with the canonical registered dtype, not just paper over with a local cast.
  • Fix shape: cast cutoffByRatings[c.noteIdKey] back to np.int64 before the merge (canonical noteId dtype in the repo is int64, based on the type-checker's "expected=int64" message). Apply the same pattern to the "ratingMin" column if it carries through a type expectation.
  • Estimate: O 20 min / R 1 hr / P 4 hrs
    • Optimistic (20 min): one-line .astype(np.int64) on the noteId column of cutoffByRatings before .merge, verified against the reporter's stack trace; no other call sites affected.
    • Realistic (~1 hr): reproduce locally, trace the dtype through _compute_scoring_cutoff and the surrounding fit/_prepare_note_info flow to confirm the canonical dtype choice, apply the cast, run the existing test suite, add a small regression test that exercises the merge.
    • Pessimistic (4 hrs): the Int64Dtype() cast on the whole cutoffByRatings frame was intentional (e.g. nullable semantics for "nthRatingMts" when minRatings exceeds the group size), so narrowing just the noteId column cascades into downstream dtype mismatches on "ratingMin" or _SCORING_CUTOFF_MTS. Resolution requires coordinating with maintainers on the canonical dtype registered in pandas_utils and potentially updating a handful of adjacent call sites.
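The fix shape can be sketched with a toy merge. Column names other than noteId are illustrative, and narrowing Int64 back to numpy int64 is only safe if the key column carries no NAs (if it does, the pessimistic branch above applies and the cast itself will raise):

```python
import numpy as np
import pandas as pd

# Toy cutoffByRatings: a frame-wide astype(pd.Int64Dtype()) leaves every
# column, including the merge key, as the nullable extension dtype.
cutoffByRatings = pd.DataFrame({"noteId": [10, 11, 12], "ratingCount": [5, 7, 2]})
cutoffByRatings = cutoffByRatings.astype(pd.Int64Dtype())

# Toy scoringCutoff keeps the canonical plain numpy int64 key.
scoringCutoff = pd.DataFrame(
    {"noteId": np.array([10, 11, 12], dtype=np.int64), "cutoff": [1, 1, 0]}
)

# Fix shape: narrow only the merge key back to int64 before merging,
# leaving the nullable semantics of the other columns intact.
# Assumes noteId has no NA values; an NA would make this astype raise.
cutoffByRatings["noteId"] = cutoffByRatings["noteId"].astype(np.int64)
merged = cutoffByRatings.merge(scoringCutoff, on="noteId")
```

Note the other columns (here ratingCount) stay Int64, which is what keeps the narrow fix from disturbing any intentional nullable semantics elsewhere in the frame.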

Honorable mentions (considered but skipped, not repo code fixes)

Feedback welcome

cc the reporters: @2vitalik, @Jacobsonradical, @SarahGrevy
