
Conversation

@jmarshall
Member

CRAM parts split out from PR #839 — see the conversation there:

  • CRAMv2.1.tex and CRAMv3.tex correctly distinguish MB (megabytes, though these could use MiB) from kb/Mb (kilobases and megabases, which are correctly decimal and are the correct abbreviations for “bases” in a bioinformatics context).

@jmarshall added the cram label Aug 12, 2025
@github-actions

Changed PDFs as of 4a80f86: CRAMv2.1 (diff), CRAMv3.

@jkbonfield moved this to New items in GA4GH File Formats Aug 12, 2025
@jkbonfield moved this from New items to To do (backlog) in GA4GH File Formats Aug 12, 2025
@jkbonfield
Contributor

Thank you for the update, but on reviewing this text I see it's been hanging around since CRAMv2 days (before I started my implementation). The figures are very misleading, especially given no statement about instrument type.

E.g.:

\textbf{Mapped short reads with bases, pairing and mapping information}

We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
We estimate the compression to be 0.2 bits/base. Space estimate is $250,000 \times 100
\times 0.2 \bits \approx 0.6 \MiB$. Data could be stored in a single
container.

On a 1 million read NovaSeq file this averaged out at 0.93 bits per base including quality values, aux tags, etc. Sequence was about 0.18 bits per base, so that's perhaps where this value came from. I note it's unchanged since v2.1, and maybe earlier. Original CRAM wasn't storing quality values amongst other things, and maybe no tags, so perhaps it was more realistic then? Uncompressing this (writing to BAM at compression level 0) comes out at 17.6 bits per base (around 19x larger), and a bit less for uncompressed CRAM (11 bpb). Yet this file also has a compressed container size of just 0.167 MiB.

We're talking in the quoted text about compressed sizes, so a 1 MiB compressed container would be ~20 MiB if held in memory as an array of decoded BAM objects. Still fitting in the L2 size, but do we really want to recommend a default block size two orders of magnitude larger than BAM's (64 KiB)? It feels heavy handed.
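
To spell out that arithmetic, using the NovaSeq ratios above (which will of course vary by instrument and data set): 1 MiB compressed × (17.6 bpb uncompressed BAM / 0.93 bpb compressed CRAM) ≈ 19 MiB of decoded BAM records in memory per container.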

Indeed my own implementations default to capping at 10,000 alignments or 500 kbp, whichever comes sooner. For short read data that's around 3 MiB uncompressed, or closer to 10 MiB for long read technologies.
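
For illustration only, that heuristic amounts to something like the following (a minimal sketch with made-up names, not the actual htslib/io_lib code):

```c
/* Minimal sketch of the container-capping heuristic described above.
 * Names, limits and structure are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_CTR_RECORDS 10000     /* alignments per container */
#define MAX_CTR_BASES   500000    /* 500 kbp of sequence per container */

typedef struct {
    int64_t n_records;            /* alignments buffered so far */
    int64_t n_bases;              /* bases buffered so far */
} ctr_state;

/* True if the current container should be compressed and emitted
 * before accepting another record of read_len bases. */
static bool ctr_should_flush(const ctr_state *c, int64_t read_len)
{
    return c->n_records + 1 > MAX_CTR_RECORDS ||
           c->n_bases + read_len > MAX_CTR_BASES;
}
```

Which cap triggers first simply depends on read length.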

My conclusion is that the entire section is somewhat irrelevant, and a single recommendation is inappropriate too. I know the cancer pipelines here were using smaller CRAM container sizes than the defaults because they prioritised random access. Other places may be using larger containers and slower compression methods, as they view CRAM primarily as an archive-only format. Hence the profiles (e.g. samtools view -O cram,fast or samtools view -O cram,small). These are totally implementation defined and we don't really have any stipulations to make. Any recommendation would be soft, and perhaps aimed at the default middle ground.

Maybe:

"The choice of containing size is entirely implementation defined, as is which compression methods and compression levels to use.
We recommend exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container."

I think the rest of it is unnecessary (including being overly specific on units).
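
For what it's worth, that recommendation could be as simple to implement as a small profile table; the names and numbers below are purely illustrative (not what samtools actually uses, and not something the spec would mandate):

```c
/* Hypothetical profile table: each user-selectable profile maps to a
 * per-container cap on uncompressed data and a codec effort level.
 * All names and values here are made up for illustration. */
#include <stdint.h>

typedef struct {
    const char *name;
    int64_t     max_uncompressed;  /* bytes of decoded data per container */
    int         effort;            /* 1 = fastest codecs, 9 = smallest output */
} cram_profile;

static const cram_profile profiles[] = {
    { "fast",     1 << 20, 1 },    /* small containers, quick codecs */
    { "default",  3 << 20, 5 },    /* "a few megabytes", moderate effort */
    { "small",   16 << 20, 9 },    /* bigger containers, slower codecs */
    { "archive", 64 << 20, 9 },    /* archive-oriented: biggest, slowest */
};
```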
