How to materialize the Dataset returned by download_common_crawl #532
Unanswered
volkerstampa asked this question in Q&A
Replies: 2 comments 6 replies
- OK ... I found this. It also relies on
0 replies
- I think we can refactor to just have saving the dataset be the way to trigger the computation. Right now the writing to disk is bundled in the function, but we can extract it to make your life easier. I'll work on this.
6 replies
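As a rough illustration of the saving route described in the reply above: assuming `ds` is the `DocumentDataset` returned by `download_common_crawl` and that NeMo Curator's `DocumentDataset.to_json` writer is available, writing the dataset out is itself a Dask computation and so triggers the download. The output directory is a placeholder:

```python
# Saving the DocumentDataset runs the Dask graph: each partition is
# downloaded/extracted, written out, and then released, so nothing has
# to be collected on the scheduler or pinned in worker memory.
ds.to_json("/data/common_crawl_jsonl")  # placeholder output directory

# Roughly equivalent via the underlying Dask DataFrame, e.g. as Parquet:
# ds.df.to_parquet("/data/common_crawl_parquet")
```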
Calling `download_common_crawl` returns a `DocumentDataset` that is backed by lazy Dask `DataFrame`s. Without further processing, the dataset does not get downloaded at all. I am primarily interested in the side effect of this operation, i.e., the downloaded and extracted data. How can I "materialize" the lazy `DataFrame` without loading it all into the memory of either the Dask workers or the Dask scheduler? As far as I understand, calling `ds.df.compute()` (and awaiting the returned future) loads the dataset into the scheduler's memory, while `ds.df.persist()` loads it into the workers' memory. I tried to work around this by calling `len(ds)`, but even that seems to load the entire dataset into the workers' memory. What is the correct way to enforce the side effect (of downloading the data) without loading the entire dataset into memory? Any hints are appreciated.
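For illustration, a minimal sketch of the situation described above, assuming the `download_common_crawl` entry point from `nemo_curator.download`; the paths and snapshot identifiers are placeholders, and the final line is just one way to walk every partition without shipping the rows back to the scheduler:

```python
from dask.distributed import Client
from nemo_curator.download import download_common_crawl

client = Client()  # in practice, connect to an existing Dask cluster

# Lazily builds the download/extract graph; nothing is fetched yet.
# Paths and snapshot identifiers below are placeholders.
ds = download_common_crawl(
    "/data/common_crawl",  # output_path for the extracted records
    "2020-50",             # start snapshot
    "2021-04",             # end snapshot
)

# ds.df.compute() would collect every row on the scheduler/client, and
# ds.df.persist() would pin every partition in worker memory. Reducing
# each partition to a single number still forces the whole graph to run,
# but only the per-partition counts travel back to the client:
counts = ds.df.map_partitions(len).compute()
print(f"processed {int(counts.sum())} records")
```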