Conversation


@ap0phasi ap0phasi commented Nov 3, 2025

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Changes

Added an `arrow` feature to in-memory datasets that allows loading from Arrow `RecordBatch`es, such as those produced by datafusion or duckdb-rs.

Testing

Two new feature-dependent tests have been added demonstrating the use of datafusion to query the same test CSV used in the existing `from_csv_rows` test. These tests can be run with `cargo test --features arrow`, which executes an example datafusion query within a tokio runtime.
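
For readers following along, here is a minimal sketch of the flow these tests exercise. The file path, table name, and item type are placeholder assumptions; the item type would also need to implement the batch-decoding trait this PR introduces (its name is not shown in the excerpts below). `from_arrow_batches` is the constructor added by this PR; the rest is standard datafusion usage.

```rust
use burn_dataset::InMemDataset;
use datafusion::prelude::{CsvReadOptions, SessionContext};

// Placeholder item type mirroring the fields visible in the review excerpts below;
// the real tests reuse the struct from the existing `from_csv_rows` test.
// It must additionally implement the decoding trait added by this PR.
#[derive(Clone, Debug)]
struct SampleItem {
    column_bool: bool,
    column_float: f64,
}

async fn load_dataset() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = SessionContext::new();
    // Hypothetical path to the test CSV.
    ctx.register_csv("sample", "tests/data/sample.csv", CsvReadOptions::new())
        .await?;

    // Use SQL to reorder the CSV columns before loading, as in the new unit test.
    let df = ctx.sql("SELECT column_bool, column_float FROM sample").await?;
    let batches = df.collect().await?; // Vec<RecordBatch>

    // Constructor added by this PR (behind the `arrow` feature).
    let _dataset = InMemDataset::<SampleItem>::from_arrow_batches(batches)?;
    Ok(())
}
```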

  • Functionality to create in-memory dataset from Vec<RecordBatch>.
  • Successful tests; requires formatting of dependencies.
  • Adding unit test for using SQL in datafusion to reorder CSV columns for loading into in-memory dataset.
  • Updated both Cargo.toml and crates/burn-dataset/Cargo.toml.
  • Arrow tests will only run if `arrow` feature is included.

codecov bot commented Nov 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.73%. Comparing base (5da68bf) to head (3ce3b7d).

❌ Your project check has failed because the head coverage (64.73%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3974      +/-   ##
==========================================
+ Coverage   64.71%   64.73%   +0.01%     
==========================================
  Files        1180     1180              
  Lines      140328   140384      +56     
==========================================
+ Hits        90816    90872      +56     
  Misses      49512    49512              

☔ View full report in Codecov by Sentry.

@laggui laggui self-requested a review November 3, 2025 20:57
Member

@laggui laggui left a comment


Sorry for the delay!

Just a couple of comments, otherwise LGTM.

/edit: maybe it should be its own dataset type instead of an InMemDataset constructor though? Also suggested in the other review.

```rust
pub fn from_arrow_batches(record_batches: Vec<RecordBatch>) -> Result<Self, std::io::Error> {
    let items: Vec<I> = record_batches
        .into_iter()
        .flat_map(|batch| -> Vec<I> { I::from_record_batch(batch).unwrap() })
```
Member

We should probably propagate the result/error instead of possibly panicking w/ unwrap.

The return type should also reflect this change (not a std::io::Error, unlike the other methods).
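
A minimal sketch of what this suggestion could look like; the `ArrowDatasetError` name is hypothetical and would depend on what `from_record_batch` actually returns:

```rust
pub fn from_arrow_batches(record_batches: Vec<RecordBatch>) -> Result<Self, ArrowDatasetError> {
    // Propagate the first decoding error instead of panicking.
    let items: Vec<I> = record_batches
        .into_iter()
        .map(I::from_record_batch)
        .collect::<Result<Vec<Vec<I>>, _>>()?
        .into_iter()
        .flatten()
        .collect();

    Ok(Self::new(items))
}
```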

```rust
    column_bool: bool,
    column_float: f64,
}
let ctx = SessionContext::new();
```
Member

Should probably indicate that this is imported from datafusion, given that it is not a burn type.
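
For example, a qualified import makes the origin explicit:

```rust
// SessionContext comes from datafusion, not burn.
use datafusion::prelude::SessionContext;

let ctx = SessionContext::new();
```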

Collaborator

@antimora antimora left a comment


Personally I am against extending InMemoryDataset to support Arrow record batches. The from-CSV-records method is a legacy use case. I think we should have separate CSV and JSON dataset implementations.

I think ArrowRecord should be its own dataset implementation.

If someone needs to load all records from the dataset, we should pipe it using from_dataset https://github.com/tracel-ai/burn/pull/3974/files#diff-22ba832494e8fb0ff4df0fe55c7e6f52f3f6a6c22e0c744c43f8b71bed03ce46R45
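
A rough sketch of the shape this suggestion implies. The `ArrowDataset` name and its eager `Vec<I>` storage are hypothetical; the `Dataset` trait and `InMemDataset::from_dataset` are the existing burn-dataset APIs referenced above.

```rust
use burn_dataset::{Dataset, InMemDataset};

// Hypothetical standalone dataset backed by decoded Arrow record batches.
pub struct ArrowDataset<I> {
    items: Vec<I>, // eager decoding for simplicity; a real impl might decode lazily
}

impl<I: Clone + Send + Sync> Dataset<I> for ArrowDataset<I> {
    fn get(&self, index: usize) -> Option<I> {
        self.items.get(index).cloned()
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}

// If all records are needed in memory, pipe through the existing bridge:
// let in_mem = InMemDataset::from_dataset(&arrow_dataset);
```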

Author

ap0phasi commented Nov 6, 2025

Thanks, I agree it makes sense to have separate ArrowDataset, CsvDataset, and JsonDataset implementations. I'll break out a separate ArrowDataset implementation and better propagate the results/errors. Would you like me to open a separate PR for splitting out the CSV and JSON datasets?

Member

laggui commented Nov 7, 2025

I think this PR can focus on the arrow dataset changes, and if you feel like following up on the csv and json sources, another PR can be opened after 🙂
