

@jlee303 jlee303 commented Oct 2, 2025

Summary: Adds a test script that verifies the data pulled by the data management Python scripts matches the copies on Manifold. The script pulls the reference files from Manifold through the CLI and compares SHA hashes, or compares the data itself when the file is a canonical Parquet file, since Parquet metadata can differ between otherwise identical files.
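
For illustration only, the comparison described in the summary could look roughly like the sketch below. It is not the script from this diff: the function names `sha256_of` and `files_match`, the choice of SHA-256, and the injected `load_rows` callable are all assumptions, and the Manifold CLI download step is left out entirely because its interface is not shown here.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(
    local: Path,
    reference: Path,
    load_rows: Optional[Callable[[Path], object]] = None,
) -> bool:
    """Return True if two files are byte-identical or, for Parquet, data-identical.

    `load_rows` is a hypothetical hook that reads only the data of a Parquet
    file (for example, a thin wrapper around a Parquet reader). It is used as
    a fallback because two Parquet files can hold the same rows while their
    embedded metadata, and therefore their hashes, differ.
    """
    if sha256_of(local) == sha256_of(reference):
        return True
    if local.suffix == ".parquet" and load_rows is not None:
        return load_rows(local) == load_rows(reference)
    return False
```

Passing the reader in as a callable keeps the hashing path free of third-party dependencies; only the Parquet fallback needs a real reader.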

Differential Revision: D83766939


meta-codesync bot commented Oct 2, 2025

@jlee303 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83766939.

@jlee303 jlee303 force-pushed the export-D83766939 branch 5 times, most recently from 5e1ae15 to ecb3877 Compare October 3, 2025 21:39
Summary:

Adds scripts that allow users to pull public datasets:
- BaseDatasetBuilder - abstract base class that defines the interface for all dataset builders. Each dataset is expected to have a corresponding dataset builder Python file.
- DatasetManager - manages all available dataset builders.
- DownloadUtils - shared utility class with methods for downloading from various sources and handling data extraction.
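
As a rough sketch of how the pieces listed above might fit together (not the code in this diff), the example below pairs an abstract builder interface with a manager that acts as a registry. The class names come from the summary; the `name` property, the `build`, `register`, and `available` methods, and `ExampleBuilder` are hypothetical and exist only for illustration. DownloadUtils is omitted because its methods are not described.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List


class BaseDatasetBuilder(ABC):
    """Interface each dataset builder implements (method names are assumed)."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique identifier used to look the builder up."""

    @abstractmethod
    def build(self, output_dir: Path) -> List[Path]:
        """Produce the dataset under output_dir and return the files written."""


class DatasetManager:
    """Keeps track of the available builders and dispatches to them by name."""

    def __init__(self) -> None:
        self._builders: Dict[str, BaseDatasetBuilder] = {}

    def register(self, builder: BaseDatasetBuilder) -> None:
        if builder.name in self._builders:
            raise ValueError(f"builder already registered: {builder.name}")
        self._builders[builder.name] = builder

    def available(self) -> List[str]:
        return sorted(self._builders)

    def build(self, name: str, output_dir: Path) -> List[Path]:
        return self._builders[name].build(output_dir)


class ExampleBuilder(BaseDatasetBuilder):
    """Hypothetical builder that writes a placeholder file instead of downloading."""

    @property
    def name(self) -> str:
        return "example"

    def build(self, output_dir: Path) -> List[Path]:
        output_dir.mkdir(parents=True, exist_ok=True)
        target = output_dir / "example.txt"
        target.write_text("placeholder data\n")
        return [target]
```

A caller would register one builder per dataset and then request a build by name, for example `manager.build("example", Path("data"))`.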

Differential Revision: D82497246
Summary:

Adds a dataset builder for Binance data.
Adds a manifest containing the dataset config and other relevant info.

Reviewed By: Victor-C-Zhang

Differential Revision: D83766688
Summary:

Adds a test script that verifies the data pulled by the data management Python scripts matches the copies on Manifold. The script pulls the reference files from Manifold through the CLI and compares SHA hashes, or compares the data itself when the file is a canonical Parquet file, since Parquet metadata can differ between otherwise identical files.


```
All files match!
```

Differential Revision: D83766939