feat: In-memory dataset loading of Arrow Record Batches #3974
base: main
Conversation
Functionality to create an in-memory dataset from Vec<RecordBatch>. Tests pass; dependency formatting is still required.
Adds a unit test using SQL in datafusion to reorder CSV columns for loading into an in-memory dataset.
Updated both Cargo.toml and crates/burn-dataset/Cargo.toml.
Arrow tests only run if the `arrow` feature is enabled.
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project check has failed because the head coverage (64.73%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##             main    #3974      +/-   ##
==========================================
+ Coverage   64.71%   64.73%   +0.01%
==========================================
  Files        1180     1180
  Lines      140328   140384      +56
==========================================
+ Hits        90816    90872      +56
  Misses      49512    49512
```
Sorry for the delay!
Just a couple of comments, otherwise LGTM.
/edit: maybe it should be its own dataset type instead of an InMemDataset constructor though? Also suggested in the other review.
```rust
pub fn from_arrow_batches(record_batches: Vec<RecordBatch>) -> Result<Self, std::io::Error> {
    let items: Vec<I> = record_batches
        .into_iter()
        .flat_map(|batch| -> Vec<I> { I::from_record_batch(batch).unwrap() })
```
We should probably propagate the result/error instead of possibly panicking w/ unwrap.
The return type Result should also reflect this change (not a std::io::Error, as opposed to the other methods).
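A minimal sketch of the suggested propagation, assuming the item conversion trait exposes a fallible from_record_batch with an associated Error type (both assumed here, not taken from the PR):

```rust
// Sketch only: the `I::Error` associated type and `Self::new` constructor
// are assumptions about the surrounding trait/impl, not the final API.
pub fn from_arrow_batches(record_batches: Vec<RecordBatch>) -> Result<Self, I::Error> {
    let items: Vec<I> = record_batches
        .into_iter()
        .map(I::from_record_batch) // each conversion returns Result<Vec<I>, _>
        .collect::<Result<Vec<Vec<I>>, _>>()? // propagate the first error instead of unwrapping
        .into_iter()
        .flatten()
        .collect();
    Ok(Self::new(items))
}
```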
```rust
    column_bool: bool,
    column_float: f64,
}
let ctx = SessionContext::new();
```
Should probably indicate that this is imported from datafusion given that it is not a burn type
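For instance, the test snippet could carry the import explicitly (the prelude path is datafusion's usual one):

```rust
use datafusion::prelude::SessionContext; // datafusion type, not a burn one

let ctx = SessionContext::new();
```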
antimora left a comment
Personally I am against extending InMemDataset to support arrow record batches. The from-CSV-records method is a legacy use case. I think we should have separate CSV and JSON dataset implementations.
I think ArrowRecord should be its own dataset implementation.
If someone needs to load all records from the dataset, we should pipe it using from_dataset https://github.com/tracel-ai/burn/pull/3974/files#diff-22ba832494e8fb0ff4df0fe55c7e6f52f3f6a6c22e0c744c43f8b71bed03ce46R45
Thanks, I agree it makes sense to have separate ArrowDataset, CsvDataset, and JsonDataset implementations. I'll break out a separate ArrowDataset implementation and better propagate the results/errors. Would you like me to do a separate PR for separating out CSV and JSON datasets?
I think this PR can focus on the arrow dataset changes, and if you feel like following up for the csv and json sources another PR can be opened after 🙂
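For illustration, a rough shape the standalone dataset could take; the trait, names, and accessors below are guesses for the sketch, not the final implementation:

```rust
use arrow::record_batch::RecordBatch;

// Assumed conversion trait; the PR's actual bound may differ.
pub trait FromRecordBatch: Sized {
    type Error;
    fn from_record_batch(batch: RecordBatch) -> Result<Vec<Self>, Self::Error>;
}

// A dedicated Arrow dataset, as suggested in the review, rather than
// another InMemDataset constructor.
pub struct ArrowDataset<I> {
    items: Vec<I>,
}

impl<I: FromRecordBatch> ArrowDataset<I> {
    pub fn from_batches(batches: Vec<RecordBatch>) -> Result<Self, I::Error> {
        let items = batches
            .into_iter()
            .map(I::from_record_batch)
            .collect::<Result<Vec<_>, _>>()? // propagate conversion errors
            .into_iter()
            .flatten()
            .collect();
        Ok(Self { items })
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```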
Pull Request Template
Checklist
The cargo run-checks command has been executed.
Changes
Added arrow feature to in-memory datasets that allows loading from Arrow Record Batches, such as what is produced by datafusion or duckdb-rs.
Testing
Two new feature-dependent tests have been added demonstrating the use of datafusion to query the same test CSV used in the existing from_csv_rows test. These tests can be run with cargo test --features arrow, which executes an example datafusion query within a tokio runtime.
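For reference, a minimal sketch of this kind of query; the table name, file path, and the commented from_arrow_batches call are illustrative, while the datafusion and tokio calls are standard:

```rust
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical path; the tests reuse the existing test CSV.
    ctx.register_csv("samples", "tests/data/sample.csv", CsvReadOptions::new())
        .await?;
    // Reorder the CSV columns with SQL before loading them.
    let df = ctx
        .sql("SELECT column_float, column_bool FROM samples")
        .await?;
    let batches = df.collect().await?; // Vec<RecordBatch>

    // Hypothetical: feed the batches into the in-memory dataset.
    // let dataset = InMemDataset::<Item>::from_arrow_batches(batches)?;
    println!("collected {} record batches", batches.len());
    Ok(())
}
```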