Conversation


@ap0phasi ap0phasi commented Nov 3, 2025

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Changes

Added an `arrow` feature to in-memory datasets that allows loading from Arrow `RecordBatch`es, such as those produced by datafusion or duckdb-rs.

Testing

Two new feature-dependent tests have been added demonstrating the use of datafusion to query the same test CSV used in the existing `from_csv_rows` test. These tests can be run with `cargo test --features arrow`, which executes an example datafusion query within a tokio runtime.
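
For readers following along, here is a minimal sketch of the flow these tests exercise. The file path, table name, and item type are placeholder assumptions; the item type would also need to implement the batch-decoding trait this PR introduces (its name is not shown in the excerpts below). `from_arrow_batches` is the constructor added by this PR; the rest is standard datafusion usage.

```rust
use burn_dataset::InMemDataset;
use datafusion::prelude::{CsvReadOptions, SessionContext};

// Placeholder item type mirroring the fields visible in the review excerpts below;
// the real tests reuse the struct from the existing `from_csv_rows` test.
// It must additionally implement the decoding trait added by this PR.
#[derive(Clone, Debug)]
struct SampleItem {
    column_bool: bool,
    column_float: f64,
}

async fn load_dataset() -> Result<(), Box<dyn std::error::Error>> {
    let ctx = SessionContext::new();
    // Hypothetical path to the test CSV.
    ctx.register_csv("sample", "tests/data/sample.csv", CsvReadOptions::new())
        .await?;

    // Use SQL to reorder the CSV columns before loading, as in the new unit test.
    let df = ctx.sql("SELECT column_bool, column_float FROM sample").await?;
    let batches = df.collect().await?; // Vec<RecordBatch>

    // Constructor added by this PR (behind the `arrow` feature).
    let _dataset = InMemDataset::<SampleItem>::from_arrow_batches(batches)?;
    Ok(())
}
```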

  • Functionality to create in-memory dataset from Vec<RecordBatch>.
  • Successful tests; requires formatting of dependencies.
  • Adding unit test for using SQL in datafusion to reorder CSV columns for loading into in-memory dataset.
  • Updated both Cargo.toml and crates/burn-dataset/Cargo.toml.
  • Arrow tests will only run if `arrow` feature is included.

codecov bot commented Nov 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.73%. Comparing base (5da68bf) to head (3ce3b7d).

❌ Your project check has failed because the head coverage (64.73%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3974      +/-   ##
==========================================
+ Coverage   64.71%   64.73%   +0.01%     
==========================================
  Files        1180     1180              
  Lines      140328   140384      +56     
==========================================
+ Hits        90816    90872      +56     
  Misses      49512    49512              

☔ View full report in Codecov by Sentry.

@laggui laggui self-requested a review November 3, 2025 20:57
Member

@laggui laggui left a comment


Sorry for the delay!

Just a couple of comments, otherwise LGTM.

/edit: maybe it should be its own dataset type instead of an InMemDataset constructor though? Also suggested in the other review.

```rust
pub fn from_arrow_batches(record_batches: Vec<RecordBatch>) -> Result<Self, std::io::Error> {
    let items: Vec<I> = record_batches
        .into_iter()
        .flat_map(|batch| -> Vec<I> { I::from_record_batch(batch).unwrap() })
```
Member

We should probably propagate the result/error instead of possibly panicking w/ unwrap.

The return type should also reflect this change (not a std::io::Error, unlike the other methods).
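
A minimal sketch of what this suggestion could look like; the `ArrowDatasetError` name is hypothetical and would depend on what `from_record_batch` actually returns:

```rust
pub fn from_arrow_batches(record_batches: Vec<RecordBatch>) -> Result<Self, ArrowDatasetError> {
    // Propagate the first decoding error instead of panicking.
    let items: Vec<I> = record_batches
        .into_iter()
        .map(I::from_record_batch)
        .collect::<Result<Vec<Vec<I>>, _>>()?
        .into_iter()
        .flatten()
        .collect();

    Ok(Self::new(items))
}
```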

```rust
    column_bool: bool,
    column_float: f64,
}
let ctx = SessionContext::new();
```
Member

Should probably indicate that this is imported from datafusion, given that it is not a burn type.
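
For example, a qualified import makes the origin explicit:

```rust
// SessionContext comes from datafusion, not burn.
use datafusion::prelude::SessionContext;

let ctx = SessionContext::new();
```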

Collaborator

@antimora antimora left a comment


Personally I am against extending InMemoryDataset to support Arrow record batches. The from-CSV-records method is a legacy use case. I think we should have separate CSV and JSON dataset implementations.

I think ArrowRecord should be its own dataset implementation.

If someone needs to load all records from the dataset, we should pipe it using from_dataset https://github.com/tracel-ai/burn/pull/3974/files#diff-22ba832494e8fb0ff4df0fe55c7e6f52f3f6a6c22e0c744c43f8b71bed03ce46R45
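
A rough sketch of the shape this suggestion implies. The `ArrowDataset` name and its eager `Vec<I>` storage are hypothetical; the `Dataset` trait and `InMemDataset::from_dataset` are the existing burn-dataset APIs referenced above.

```rust
use burn_dataset::{Dataset, InMemDataset};

// Hypothetical standalone dataset backed by decoded Arrow record batches.
pub struct ArrowDataset<I> {
    items: Vec<I>, // eager decoding for simplicity; a real impl might decode lazily
}

impl<I: Clone + Send + Sync> Dataset<I> for ArrowDataset<I> {
    fn get(&self, index: usize) -> Option<I> {
        self.items.get(index).cloned()
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}

// If all records are needed in memory, pipe through the existing bridge:
// let in_mem = InMemDataset::from_dataset(&arrow_dataset);
```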

Author

ap0phasi commented Nov 6, 2025

Thanks, I agree it makes sense to have separate ArrowDataset, CsvDataset, and JsonDataset implementations. I'll break out a separate ArrowDataset implementation and better propagate the results/errors. Would you like me to open a separate PR for splitting out the CSV and JSON datasets?

Member

laggui commented Nov 7, 2025

I think this PR can focus on the arrow dataset changes, and if you feel like following up on the csv and json sources, another PR can be opened after 🙂
