How to materialize the Dataset returned by download_common_crawl #532
Unanswered
volkerstampa asked this question in Q&A
Replies: 2 comments 6 replies
- OK ... I found this. It also relies on
0 replies
- I think we can refactor to just have saving the dataset be the way to trigger the computation. Right now the writing to disk is bundled in the function, but we can extract it to make your life easier. I'll work on this.
6 replies
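As a rough illustration of the saving route described in the reply above: assuming `ds` is the `DocumentDataset` returned by `download_common_crawl` and that NeMo Curator's `DocumentDataset.to_json` writer is available, writing the dataset out is itself a Dask computation and so triggers the download. The output directory is a placeholder:

```python
# Saving the DocumentDataset runs the Dask graph: each partition is
# downloaded/extracted, written out, and then released, so nothing has
# to be collected on the scheduler or pinned in worker memory.
ds.to_json("/data/common_crawl_jsonl")  # placeholder output directory

# Roughly equivalent via the underlying Dask DataFrame, e.g. as Parquet:
# ds.df.to_parquet("/data/common_crawl_parquet")
```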
Calling `download_common_crawl` returns a `DocumentDataset` that is backed by lazy Dask `DataFrame`s. Without further processing, the dataset does not get downloaded at all. I am primarily interested in the side effect of this operation, i.e., the downloaded and extracted data. How can I "materialize" the lazy `DataFrame` without loading it all into the memory of either the Dask workers or the Dask scheduler? As far as I understand, calling `ds.df.compute()` (and awaiting the returned future) loads the dataset into the scheduler's memory, while `ds.df.persist()` loads it into the workers' memory. I tried to work around this by calling `len(ds)`, but even that seems to load the entire dataset into the workers' memory. What is the correct way to enforce the side effect (of downloading the data) without loading the entire dataset into memory? Any hints are appreciated.
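For illustration, a minimal sketch of the situation described above, assuming the `download_common_crawl` entry point from `nemo_curator.download`; the paths and snapshot identifiers are placeholders, and the final line is just one way to walk every partition without shipping the rows back to the scheduler:

```python
from dask.distributed import Client
from nemo_curator.download import download_common_crawl

client = Client()  # in practice, connect to an existing Dask cluster

# Lazily builds the download/extract graph; nothing is fetched yet.
# Paths and snapshot identifiers below are placeholders.
ds = download_common_crawl(
    "/data/common_crawl",  # output_path for the extracted records
    "2020-50",             # start snapshot
    "2021-04",             # end snapshot
)

# ds.df.compute() would collect every row on the scheduler/client, and
# ds.df.persist() would pin every partition in worker memory. Reducing
# each partition to a single number still forces the whole graph to run,
# but only the per-partition counts travel back to the client:
counts = ds.df.map_partitions(len).compute()
print(f"processed {int(counts.sum())} records")
```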