fix(metadata,cli,frontend): ingest jsPsych CSVs with unescaped quotes #132

Open: Mandyx22 wants to merge 2 commits into main from fix/csv-relax-quotes
@Mandyx22

Contributor

Problem

Some jsPsych experiments export the stimulus column as unquoted HTML containing literal " (e.g. <div class = "EncodingBox">), which violates strict RFC-4180 quoting. This surfaced on a real 1258-file OSF working-memory dataset (osf.io/phxq4), where the tool read 0 files: csv-parse threw Invalid Opening Quote and dropped every file. Both the CLI and the browser uploader were affected.

There were two layers to fix:

  1. Read: the parser rejected the file outright.
  2. Write: even once read, the data file is copied verbatim, so the malformed quotes would land in the Psych-DS data/ payload, and the validator (which also strict-parses CSV) would reject it with CSV_FORMATTING_ERROR.

Changes

  • parseCSV now sets relax_quotes: true, so a quote inside an unquoted field no longer throws and drops the file; the HTML is kept intact.
  • New parseCSVForWrite helper returns the parsed rows plus a verbatimSafe flag (strict-parse probe). The CLI and frontend use it so:
    • a clean CSV still keeps its exact bytes (written verbatim — unchanged behavior), and
    • a file that only parsed thanks to quote relaxation is re-serialised to well-formed CSV, so the written file is valid.
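
The verbatim-or-re-serialise decision in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `parseForWriteSketch`, `bytesToWrite`, and the injected `strictParse`, `relaxedParse`, and `serialise` callbacks are hypothetical names standing in for the real csv-parse wiring, which is not shown in this thread.

```typescript
// Hypothetical sketch of the strict-probe / relaxed-fallback pattern.
// Parsers and the serialiser are injected so the example has no dependencies.
type Row = Record<string, string>;
type Parser = (content: string) => Row[];

interface ParsedForWrite {
  rows: Row[];
  // true when the content passed the strict parse and can be copied byte for byte
  verbatimSafe: boolean;
}

function parseForWriteSketch(
  content: string,
  strictParse: Parser,
  relaxedParse: Parser
): ParsedForWrite {
  try {
    return { rows: strictParse(content), verbatimSafe: true };
  } catch (err) {
    // Strict parse failed: retry leniently.
    // If the relaxed parse also throws, that error propagates to the caller.
    return { rows: relaxedParse(content), verbatimSafe: false };
  }
}

function bytesToWrite(
  content: string,
  parsed: ParsedForWrite,
  serialise: (rows: Row[]) => string
): string {
  // Clean input keeps its exact bytes; relaxed input is re-serialised.
  return parsed.verbatimSafe ? content : serialise(parsed.rows);
}
```

Under this shape, a caller only pays the re-serialisation cost (and accepts a byte-level change) for files that would otherwise have been rejected.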

Result (verified end to end)

| | Before | After |
| --- | --- | --- |
| Library/CLI read | 0 files | all files read |
| Written data file | verbatim (malformed) | re-serialised, quotes escaped |
| Psych-DS validation | CSV_FORMATTING_ERROR | ✅ 0 errors |
| Frontend (real DataUpload) | every file error | success, output validates (0 errors) |

Clean CSVs are unaffected (still verbatim) — only files that were previously 100% unreadable change.

Tests

  • Metadata regression: an unquoted field containing " now parses instead of dropping the file.
  • Frontend wiring test: clean → verbatim, malformed → re-serialised and strictly re-parseable.
  • Updated the DataUpload mock for the new helper.
  • Full suites green: metadata 251, CLI 160, frontend 247.

🤖 Generated with Claude Code

jsPsych can export the `stimulus` column as unquoted HTML containing
literal `"` (e.g. `<div class = "EncodingBox">`), which violates strict
RFC-4180 quoting. Previously csv-parse threw "Invalid Opening Quote" and
the entire file was dropped, making such datasets unreadable end to end
(observed on a 1258-file OSF working-memory dataset: 0 files read).

- parseCSV sets `relax_quotes: true` so the row parses instead of being
  rejected.
- New `parseCSVForWrite` reports whether the content was already strictly
  valid CSV. The CLI and frontend use it so a clean file keeps its exact
  bytes (verbatim), while a file that only parsed thanks to relaxation is
  re-serialised to well-formed CSV — otherwise the malformed bytes land
  in the Psych-DS data/ payload and the validator rejects them with
  CSV_FORMATTING_ERROR.

Net: these datasets now ingest and pass Psych-DS validation through both
the library/CLI and the browser uploader. Adds regression tests for the
parse and the re-serialise-on-write behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@changeset-bot

changeset-bot Bot commented Jun 26, 2026


🦋 Changeset detected

Latest commit: 0dc55d0

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

| Name | Type |
| --- | --- |
| @jspsych/metadata | Patch |
| @jspsych/metadata-cli | Patch |
| frontend | Patch |


jsPsych stimulus HTML often contains both a literal `"` and a `,`. relax_quotes
keeps the quote literal but the comma still splits the field, so csv-parse throws
"Invalid Record Length" and the file is still dropped. Documents the gap as a
test.failing so CI stays green and the spec flips to a hard failure once
comma-bearing stimuli ingest correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jodeleeuw

Member

The fix handles quote-only fields but not the common quote+comma case.

relax_quotes: true stops INVALID_OPENING_QUOTE, but it does not stop an inner comma from splitting an unquoted field. jsPsych stimulus HTML frequently contains both a literal " and a , (instruction text, lists, key prompts), e.g.:

rt,stimulus,trial_type,trial_index
300,<p>Press "F", "J" to respond</p>,html-keyboard-response,0

The stimulus's inner comma makes the row 5 fields against 4 headers, so csv-parse throws Invalid Record Length: columns length is 4, got 5 even with relaxation. In parseCSVForWrite the strict probe throws, the catch retries leniently, and the lenient parse throws the same record-length error — which propagates uncaught. Since the loop in processDirectory (packages/cli/src/data.ts) isn't wrapped in try/catch, that aborts the whole run, reproducing the "0 files read" symptom the PR targets. The existing regression test only covers a comma-free stimulus, so this gap isn't caught.
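
To make the field count concrete, here is a dependency-free sketch of how the example row splits once quotes inside unquoted fields are treated as literal characters. It is a deliberate simplification, not csv-parse's actual algorithm: `splitUnquotedRelaxed` is a hypothetical helper and only models rows in which no field is properly quoted.

```typescript
// Simplified model (assumption): no field starts with a quote, so every `"`
// is a literal character and every comma is a delimiter.
function splitUnquotedRelaxed(csvLine: string): string[] {
  return csvLine.split(",");
}

const headerFields = splitUnquotedRelaxed("rt,stimulus,trial_type,trial_index");
const rowFields = splitUnquotedRelaxed(
  '300,<p>Press "F", "J" to respond</p>,html-keyboard-response,0'
);

console.log(headerFields.length); // 4
console.log(rowFields.length);    // 5
console.log(rowFields[1]);        // <p>Press "F"
console.log(rowFields[2]);        //  "J" to respond</p>
```

The stimulus is torn across `rowFields[1]` and `rowFields[2]`, so the row no longer lines up with the four headers.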

I pushed a target spec documenting it (commit 0dc55d0, packages/metadata/tests/csv-input.stress.test.ts) as test.failing so CI stays green; it will flip to a hard failure once comma-bearing stimuli ingest correctly, prompting removal of the marker.

Note that "make it parse" isn't free: the only csv-parse knob that lets these through is relax_column_count, which would then silently mis-split the stimulus into the wrong columns and re-serialise valid-but-wrong CSV that passes Psych-DS validation, trading a loud drop for silent corruption. A robust fix probably needs the field quoted at the jsPsych source or a smarter pre-pass. Worth deciding deliberately before claiming the end-to-end fix.
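
For reference, quoting the field in the RFC-4180 style means wrapping it in double quotes and doubling every inner quote. A minimal sketch of what a correctly quoted version of the example row would look like, with `quoteField` and `toCsvLine` as hypothetical helpers rather than jsPsych or csv-stringify APIs:

```typescript
// Quote a single field only when it contains a delimiter, a quote, or a line break.
function quoteField(value: string): string {
  if (/[",\r\n]/.test(value)) {
    // RFC-4180 style: double each inner quote, then wrap the field in quotes.
    return '"' + value.replace(/"/g, '""') + '"';
  }
  return value;
}

function toCsvLine(fields: string[]): string {
  return fields.map(quoteField).join(",");
}

const quotedRow = toCsvLine([
  "300",
  '<p>Press "F", "J" to respond</p>',
  "html-keyboard-response",
  "0",
]);

console.log(quotedRow);
// 300,"<p>Press ""F"", ""J"" to respond</p>",html-keyboard-response,0
```

A row written this way has exactly four fields under a strict parser, which is why quoting at the source avoids both the record-length error and the mis-split.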
