Skip to content

refactor#35

Merged
sambles merged 15 commits into
developfrom
feature/cleanup-and-simplify
May 29, 2026
Merged

refactor#35
sambles merged 15 commits into
developfrom
feature/cleanup-and-simplify

Conversation

@Ha-Ree
Copy link
Copy Markdown
Contributor

@Ha-Ree Ha-Ree commented Mar 10, 2026

Summary

This PR cleans up the OasisDataManager library with ergonomic improvements, bug fixes, and expanded test coverage. Changes span the public API, internal reader logic, storage backends, and exception naming.


Public API improvement

Storage backend module renames

  • filestore/backends/aws_s3.pyfilestore/backends/aws.py (canonical path)
  • filestore/backends/azure_abfs.pyfilestore/backends/azure.py (canonical path)
  • The old paths are retained as backward-compatible shims that emit a DeprecationWarning on import, so existing code continues to work.

Exception rename

  • OasisException renamed to OasisDataManagerException to better reflect the library it belongs to and avoid confusion with the same name used elsewhere in the Oasis platform.
  • OasisException is kept as a backward-compatible alias (same class object) so nothing breaks.
  • MissingInputsException updated to subclass OasisDataManagerException.

Aliases for easier use in config

  • config = {"engine": "oasis_data_manager.df_reader.backends.pandas.OasisPandasReaderCSV"} → config = {"engine": "OasisPandasReaderCSV"}

Bug fixes

Dask RecursionError on parquet reads

  • OasisReader._read() now sets has_read = True before calling read_parquet() / read_csv(), wrapped in a try/except that resets the flag on failure.
  • Previously the flag was set after the read, causing Dask's read_parquet() to re-enter _read() via self.df (a property that calls _read()), producing infinite recursion.

Dask copy_with_df type mismatch

  • OasisDaskReader.copy_with_df() now converts any incoming pandas DataFrame to a Dask DataFrame before passing it to the base implementation.
  • Previously, copying with a pandas DataFrame left self._df as pandas, causing AttributeError when as_pandas() called .compute() on it.

Double _read() calls

  • OasisReader.filter() and OasisReader.as_pandas() now access self._df directly instead of going through the self.df property, eliminating a redundant second _read() call.

Code quality

  • Extension detection: Replaced a nested for/else/break loop in _read() with a one-line any() expression.
  • Old-style super() calls: Updated to super() (no arguments) in AwsS3Storage, AzureABFSStorage, MissingInputsException, and OasisDataManagerException.
  • f-strings: Replaced .format() call in MissingInputsException.__init__ with an f-string.
  • Logger usage: delete_file() and delete_dir() in BaseStorage now call self.logger.info() instead of the bare module-level logging.info(), consistent with the rest of the class. Fixed a "Unknwon" typo in the log message.
  • Stale docstring: Removed outdated Django-storage references and TODO stubs from AwsS3Storage.
  • config_options serialization: AWS and Azure backends now store the original root_dir argument (self._root_dir_arg) before joining it onto the bucket/container path, and use it in config_options. This avoids a fragile Path.relative_to() reverse-computation that could fail if the paths didn't align.
  • ComplexData.run() clarity: Added a comment explaining the fetch_required logic — CSV and Parquet files are read directly by the df_reader, so fetch() is only needed for formats the reader cannot handle directly.

Test coverage

New test files

  • tests/df_reader/test_pyarrow.py: PyArrow backend tests covering parquet reads, column selection, and filter predicates.
  • tests/filestorage/test_storage_utils.py: Tests for BaseStorage.create_traceback() and AwsS3Storage._strip_signing_parameters().

New tests in existing files

  • test_read_csv.py / test_read_parquet.py: OasisReader.query(), copy_with_df(), and OasisDaskReader.read_from_dataframe().
  • test_from_dataframe.py: Passing a pandas DataFrame via dataframe= to a Dask reader.
  • test_caching.py: OasisDataManagerException backward-compat alias verification.

Test fixes

  • Dask query() test: added .compute() call for lazy scalar results (e.g. frame["D"].sum()).
  • test_complex/test_base.py: guarded dask import with pytest.importorskip so the file is skipped cleanly when Dask is not installed.
  • mypy type: ignore comments on optional-dependency fallback assignments updated to suppress the correct error codes.

Readme creation

  • Updated readme to be nonempty

@Ha-Ree Ha-Ree changed the title seeing tests refactor Mar 10, 2026
@Ha-Ree Ha-Ree linked an issue Mar 11, 2026 that may be closed by this pull request
@Ha-Ree Ha-Ree requested a review from sstruzik May 20, 2026 13:02
@Ha-Ree Ha-Ree marked this pull request as ready for review May 20, 2026 13:02
@sstruzik sstruzik requested a review from vinulw May 27, 2026 16:17
@sambles sambles mentioned this pull request May 28, 2026
Comment thread README.md Outdated
| `OasisDaskReader` | Dask | CSV, Parquet |
| `OasisPyarrowReader` | PyArrow | Parquet |

Format-specific subclasses (`OasisPandasReaderCSV`, `OasisDaskReaderParquet`, etc.) are also available.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor but I don't think the config has the OasisPyarrowReaderParquet, it's redundant but might be worth including for correctness of this line in the README.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is there no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In oasis_data_manager/df_reader/backends/pandas.py you have OasisPandasReader, OasisPandasReaderCSV and OasisPandasReaderParquet.

In oasis_data_manager/df_reader/backends/pyarrow.py only the OasisPyarrowReader is there, I think according the the README you expect there to also be a OasisPyarrowReaderParquet and this needs to be added to the aliases in the config parsing section.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh complete misread from me I was looking at the daskreader, will update

Comment thread tests/df_reader/test_pyarrow.py
Comment thread tests/complex/test_base.py Outdated
Copy link
Copy Markdown
Contributor

@vinulw vinulw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏾

@sambles sambles merged commit 630b304 into develop May 29, 2026
3 checks passed
@sambles sambles deleted the feature/cleanup-and-simplify branch May 29, 2026 07:46
@github-actions github-actions Bot mentioned this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OasisDataManager Refactor

3 participants