
Conversation

@jmarshall
Member

CRAM parts split out from PR #839 — see the conversation there:

  • CRAMv2.1.tex and CRAMv3.tex correctly distinguish MB (megabytes, though these could use MiB) from kb/Mb (kilobases and megabases, which are correctly decimal and are the correct abbreviations for “bases” in a bioinformatics context).

@jmarshall added the cram label Aug 12, 2025
@github-actions

Changed PDFs as of 4a80f86: CRAMv2.1 (diff), CRAMv3.

@jkbonfield moved this to New items in GA4GH File Formats Aug 12, 2025
@jkbonfield moved this from New items to To do (backlog) in GA4GH File Formats Aug 12, 2025
@jkbonfield
Contributor

Thank you for the update, but on reviewing this text I see it's been hanging around since CRAMv2 days (before I started my implementation). The figures are very misleading, especially given no statement about instrument type.

E.g.:

\textbf{Mapped short reads with bases, pairing and mapping information}

We have 250,000 mapped short reads (100bp) with bases, pairing and mapping information.
We estimate the compression to be 0.2 bits/base. Space estimate is $250,000 \times 100
\times 0.2 \bits \approx 0.6 \MiB$. Data could be stored in a single
container.

On a 1 million read NovaSeq file this averaged out at 0.93 bits per base including quality values, aux tags, etc. Sequence was about 0.18 bits per base, so that's perhaps where this value came from. I note it's unchanged since v2.1, and maybe earlier. Original CRAM wasn't storing quality values amongst other things, and maybe no tags, so perhaps it was more realistic then? Uncompressing this (writing to BAM at compression level 0) comes out at 17.6 bits per base (around 19x larger), and a bit less for uncompressed CRAM (11 bpb). Yet this file also has a compressed container size of just 0.167 MiB.

We're talking in the quoted text about compressed sizes, so a 1 MiB compressed container would be ~20 MiB if held in memory as an array of decoded BAM objects. Still fitting in the L2 size, but do we really want to recommend a default block size two orders of magnitude larger than BAM's (64 KiB)? It feels heavy handed.
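
To spell out that arithmetic, using the NovaSeq ratios above (which will of course vary by instrument and data set): 1 MiB compressed × (17.6 bpb uncompressed BAM / 0.93 bpb compressed CRAM) ≈ 19 MiB of decoded BAM records in memory per container.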

Indeed my own implementations default to capping at 10,000 alignments or 500 kbp, whichever comes sooner. For short read data that's around 3 MiB uncompressed, or closer to 10 MiB for long read technologies.
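
For illustration only, that heuristic amounts to something like the following (a minimal sketch with made-up names, not the actual htslib/io_lib code):

```c
/* Minimal sketch of the container-capping heuristic described above.
 * Names, limits and structure are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_CTR_RECORDS 10000     /* alignments per container */
#define MAX_CTR_BASES   500000    /* 500 kbp of sequence per container */

typedef struct {
    int64_t n_records;            /* alignments buffered so far */
    int64_t n_bases;              /* bases buffered so far */
} ctr_state;

/* True if the current container should be compressed and emitted
 * before accepting another record of read_len bases. */
static bool ctr_should_flush(const ctr_state *c, int64_t read_len)
{
    return c->n_records + 1 > MAX_CTR_RECORDS ||
           c->n_bases + read_len > MAX_CTR_BASES;
}
```

Which cap triggers first simply depends on read length.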

My conclusion is that the entire section is somewhat irrelevant, and a single recommendation is inappropriate too. I know the cancer pipelines here were using smaller CRAM container sizes than the defaults because they prioritised random access. Other places may be using larger containers and slower compression methods, as they view CRAM primarily as an archive-only format. Hence the profiles (e.g. samtools view -O cram,fast or samtools view -O cram,small). These are totally implementation defined and we don't really have any stipulations to make. Any recommendation would be soft, and perhaps aimed at the default middle ground.

Maybe:

"The choice of containing size is entirely implementation defined, as is which compression methods and compression levels to use.
We recommend exposing a series of compression profiles or command-line options to provide user control, defaulting to faster methods and no more than a few megabytes of uncompressed data per container."

I think the rest of it is unnecessary (including being overly specific on units).
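
For what it's worth, that recommendation could be as simple to implement as a small profile table; the names and numbers below are purely illustrative (not what samtools actually uses, and not something the spec would mandate):

```c
/* Hypothetical profile table: each user-selectable profile maps to a
 * per-container cap on uncompressed data and a codec effort level.
 * All names and values here are made up for illustration. */
#include <stdint.h>

typedef struct {
    const char *name;
    int64_t     max_uncompressed;  /* bytes of decoded data per container */
    int         effort;            /* 1 = fastest codecs, 9 = smallest output */
} cram_profile;

static const cram_profile profiles[] = {
    { "fast",     1 << 20, 1 },    /* small containers, quick codecs */
    { "default",  3 << 20, 5 },    /* "a few megabytes", moderate effort */
    { "small",   16 << 20, 9 },    /* bigger containers, slower codecs */
    { "archive", 64 << 20, 9 },    /* archive-oriented: biggest, slowest */
};
```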
