Issue932 #933

Open
ilayfalach wants to merge 16 commits into master from issue932

Conversation

@ilayfalach

Collaborator

Implement document export to repository (#932)

Adds the reverse of the existing repository loader: export project documents into a repository JSON file that loads back through loadAllDatasourcesInRepositoryJSONToProject.

What you can do:

- Export a single document, several, or all documents of a project
- Merge into an existing repository with automatic duplicate detection (content hash or ObjectId)
- Use override mode to strip duplicates from the whole file

How it's built (Approach C):

- hera/utils/data/repositoryExport.py: pure, DB-free logic (hashing, merge, dedup)
- dataToolkit.exportDocumentsToRepository: thin facade (query → pure funcs → write file)
- hera-project repository export: CLI subcommand

Tests: 28 passing in test_repository_export.py (25 unit + 3 Mongo integration, incl. round-trip). No regressions in test_repository.py / test_datalayer.py.

📖 Design spec & usage: docs/superpowers/specs/2026-06-14-document-export-to-repository-design.md

Note: resource handling is reference-only in this MVP (isRelativePath:"False", no file copying); copyResources=True is the documented extension point.
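
For readers skimming the thread, the content-hash duplicate detection could look roughly like the sketch below. This is an illustration only, not the code in repositoryExport.py: the names `content_hash` and `merge_items`, the flat name-to-item mapping, and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json


def content_hash(item: dict) -> str:
    """Hash an item by its canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_items(existing: dict, incoming: dict) -> tuple:
    """Merge incoming items into existing ones, skipping content duplicates.

    Returns (merged, skipped): the merged mapping and the names of incoming
    items that were dropped because identical content was already present.
    Name collisions between items with different content are out of scope
    for this sketch.
    """
    merged = dict(existing)
    seen = {content_hash(item) for item in existing.values()}
    skipped = []
    for name, item in incoming.items():
        digest = content_hash(item)
        if digest in seen:
            skipped.append(name)
            continue
        seen.add(digest)
        merged[name] = item
    return merged, skipped
```

Keeping functions like these free of database access is what makes them testable without Mongo, which matches the split between unit and integration tests reported above.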

Ilay Falach and others added 16 commits June 3, 2026 12:31
…nd injected pyhera config

Replaces the prior 3.12-based gate (commit 82ced32) with a Python 3.11
target per Lior's rollout direction. Mongo service runs mongo:latest
with MONGO_INITDB_ROOT_USERNAME/PASSWORD = hera/heracles; healthcheck
uses authenticated mongosh ping. Step "Write ~/.pyhera/config.json"
injects the matching config before pytest so hera's import-time
DB connect succeeds. Internal deps hermes and argos are vendored via
actions/checkout into _vendor/ and installed with pip install -e.
Pytest invocation: pytest hera/tests/ -v -m "not notebook" — the dead
"openfoam" filter and decorative MONGO_HOST/MONGO_PORT env vars are
removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ksql==0.10.2 has been pinned in requirements.txt but is not imported
anywhere in hera/. Its setup.py does `import pip` inside an isolated
build env, which fails on modern pip (26.x) with
`ModuleNotFoundError: No module named 'pip'`. Locally it survives only
because pre-existing virtualenvs were built with an older pip; fresh
CI runners always fail at this line.

No consumers in the codebase, no transitive justification — removing.
Surfaced by the first run of .github/workflows/ci.yml on issue884-v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
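
A claim such as "not imported anywhere in hera/" can be checked mechanically. The sketch below is one way to do that with the standard-library `ast` module. The helper name `find_imports` is made up for this example, and it only sees static `import` / `from ... import` statements, not dynamic imports through `importlib` or `__import__`.

```python
import ast
from pathlib import Path


def find_imports(root: str, module: str) -> list:
    """Return the .py files under root that statically import `module`."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # unparsable file: review it by hand
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                # Relative imports (level > 0) cannot refer to a third-party package.
                names = [node.module] if node.module and node.level == 0 else []
            else:
                continue
            if any(n == module or n.startswith(module + ".") for n in names):
                hits.append(str(path))
                break  # one hit per file is enough
    return hits
```

An empty result from `find_imports("hera", "ksql")` would be consistent with the commit message, subject to the dynamic-import caveat.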
sphinx-basic-ng has no stable 1.0.0 release on PyPI — only pre-releases
up to 1.0.0b2 exist. Under PEP 440, `>=1.0.0` excludes pre-releases,
so pip 26 fails resolution with "No matching distribution found".

Pin to the latest available beta (==1.0.0b2) to match the rest of
requirements.txt's `==` convention. This is a docs-time transitive
(via furo) — not exercised by the test suite.

Surfaced by the second run of .github/workflows/ci.yml on issue884-v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
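
The resolution failure follows from version ordering: a beta of 1.0.0 sorts before the 1.0.0 final release, so it cannot satisfy `>=1.0.0`. The sketch below illustrates that ordering with a deliberately simplified key function. It handles only dotted numeric releases with an optional `aN`/`bN`/`rcN` suffix and is not a full PEP 440 implementation; the function names are invented for the example.

```python
import re

# Pre-release phases sort before the final release of the same version.
_PHASE_RANK = {"a": 0, "b": 1, "rc": 2, "": 3}
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$")


def version_key(version: str) -> tuple:
    """Sort key for a simplified subset of PEP 440 version strings."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # Strip trailing zeros so that "1.0" and "1.0.0" compare equal.
    while len(release) > 1 and release[-1] == 0:
        release = release[:-1]
    phase = match.group(2) or ""
    number = int(match.group(3) or 0)
    return (release, _PHASE_RANK[phase], number)


def satisfies_minimum(candidate: str, minimum: str) -> bool:
    """True if candidate >= minimum under the simplified ordering."""
    return version_key(candidate) >= version_key(minimum)
```

With this ordering, `satisfies_minimum("1.0.0b2", "1.0.0")` is false. Pinning `==1.0.0b2` names the pre-release explicitly, so it can be selected.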
…3.11

Iterated `pip install --dry-run -r requirements.txt` in a fresh Python
3.12 venv with pip 26.1.2 (matches the CI runner) until resolution
succeeded. Twelve passes; one modified pin, sixteen removed pins.

Modified:
  aiosignal==1.3.2 -> aiosignal==1.4.0     (aiohttp 3.13.3 needs >=1.4.0)

Removed (unused in hera/* imports; were directly pinned but blocked
the resolver due to upstream constraints):
  basemap==1.4.1, basemap-data==1.3.2      (block matplotlib 3.9)
  gql==3.5.2, graphql-core==3.2.6          (gql wants graphql-core<3.2.5)
  hyper==0.7.0                             (unmaintained since 2016)
  pyvista==0.44.2                          (blocks vtk 9.4.1; hera uses vtkmodules directly)
  scikits.odes                             (needs Fortran compiler at build time)
  tb-rest-client==3.9.0                    (optional Thingsboard client; pins certifi==2023.7.22)

Removed (transitive deps of other packages, no direct hera/* import;
will still be installed via the transitive resolver with versions that
match their parents' constraints):
  geomet==1.1.0      (cassandra-driver 3.29.2 wants <0.3)
  h11==0.16.0        (httpcore 1.0.7 wants <0.15)
  h2==4.3.0, hpack==4.1.0, hyperframe==6.1.0  (only needed by hyper, now gone)
  httpcore==1.0.7    (httpx 0.28.1 pulls 1.0.9)
  jupyterlab==4.4.8  (notebook 7.3.2 wants <4.4; pip picks 4.3.x)
  tenacity==9.0.0    (luigi 3.6.0 wants <9)

Validated end-to-end with dry-run on a fresh venv with pip 26.1.2 +
setuptools 82.0.1: "Would install ..." with zero conflicts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two bugs surfaced by the previous CI run on issue884-v2:

1) Wrong repo for `argos`: KaplanOpenSource/argos is "Entity placement
   on map" — a web app (server.py, client/), NOT the Python `argos`
   package hera imports. The actual python wrappings live in
   KaplanOpenSource/pyargos, which contains an `argos/__init__.py`
   subpackage at its root.

2) Neither hermes nor pyargos is pip-installable: both lack setup.py /
   pyproject.toml at every level. `pip install -e ./_vendor/<repo>`
   fails with: "does not appear to be a Python project".

Fix:
  - Checkout pyargos (not argos) into _vendor/pyargos.
  - Drop the two `pip install -e ./_vendor/...` lines.
  - Put both clone roots on PYTHONPATH for the pytest step, since
    both repos expose their python package at their root level
    (hermes/__init__.py and argos/__init__.py respectively).

Verified locally: cloning hermes and pyargos and pointing PYTHONPATH
at the parent dirs gives a working `import hermes` and `import argos`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds three steps before pytest:

1. Resolve test data version
   curl https://s3.eu-west-1.amazonaws.com/hera.kaplanopensource.co.il/latest.json
   → extract `.version` (e.g. "poc-manual-20260413-v1")
   → expose as step output for the cache key

2. Cache test data
   actions/cache@v4 keyed on the resolved version. When latest.json
   bumps to a new version, the key changes and a fresh download runs;
   otherwise cache hit, ~1-2s.

3. Fetch test data (cache miss only)
   Runs the existing scripts/s3/bootstrap_unittest_data.sh, which is
   the canonical client: it reads latest.json + manifest.json, downloads
   each file from {BASE_URL}/hera_unittest_data/<path>, and verifies
   SHA256 per file. No zip-URL guessing, no parallel re-implementation.

TEST_HERA is set at job-level to /home/runner/hera_unittest_data — the
ubuntu-latest equivalent of $HOME/hera_unittest_data, which matches
both the bootstrap script's default --target-dir and conftest's default.

Realistic outcome with the current POC subset on S3
(`mode: subset`, 5 files, 7.5MB — only N31E034.hgt + YAVNEEL.parquet
under measurements/, and an empty expected/BASELINE/):
the session-level `test_hera_root` skip will lift, unlocking a partial
slice of the 113 currently-skipped tests. The rest will continue to
skip until the full dataset (~279MB) and expected outputs are uploaded
to S3 under a new version key. When that happens, no workflow change
is needed — latest.json bumps, cache key flips, suite picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
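
The per-file SHA256 check mentioned in step 3 can be sketched in Python as follows. This is not the logic of bootstrap_unittest_data.sh: the `expected` mapping (relative path to hex digest) is an assumed layout for illustration, and the real manifest.json schema may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files are not read in one piece."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(target_dir: str, expected: dict) -> list:
    """Return the relative paths that are missing or whose digest does not match.

    `expected` maps a relative path to its hex SHA-256 digest (assumed layout,
    not the actual manifest.json schema).
    """
    failures = []
    for rel_path, want in expected.items():
        path = Path(target_dir) / rel_path
        if not path.is_file() or sha256_of(path) != want.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every listed file is present and intact. Because actions/cache restores files as-is, the same check could also be run after a cache hit if corruption is a concern.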
The previous CI run unlocked 77 additional tests (152 → 229 passed)
once TEST_HERA was wired through S3, but surfaced 25 errors in tests
whose specific data files are not in the current POC subset:

  measurements/GIS/vector/population_lamas.shp        (test_demography)
  measurements/meteorology/highfreqdata/
    slicedYamim_sonic.parquet                          (test_highfreq)
    slicedYamim_TRH.parquet                            (test_highfreq)

These three fixtures already had a `pytest.skip()` branch for the case
where `getDataSourceData()` returns None (datasource not registered in
the project), but `getDataSourceData()` eagerly loads the file inside
`doc.getData()` — so a missing file raises FileNotFoundError (parquet
via pyarrow) or pyogrio.errors.DataSourceError (shapefile via pyogrio),
not None.

Wrap the loads so missing-file conditions become per-test skips:
  - test_demography.population_gdf: catch FileNotFoundError, plus
    pyogrio.DataSourceError ONLY when the message contains "No such
    file" — corruption or unsupported-format errors must still surface
    as real failures.
  - test_highfreq.sonic_df / trh_df: catch FileNotFoundError (parquet
    raises this directly).

This is intentionally narrower than `continue-on-error` or `--ignore`:
each test skips only when ITS specific data file is missing. Logic
regressions, import errors, and data corruption continue to fail loud.

When the missing files are uploaded to S3 under a new manifest version,
the cache key in .github/workflows/ci.yml bumps, the bootstrap script
fetches them, and these tests turn green automatically without further
changes to test or workflow code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
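
The "skip only when this test's data file is missing" rule can be sketched as a small helper. The names `missing_file_reason` and `load_or_skip` are hypothetical, and `DataSourceError` here is a local stand-in for `pyogrio.errors.DataSourceError` so the example runs without pyogrio installed. The actual fixtures in test_demography and test_highfreq may be structured differently.

```python
from typing import Callable, Optional


class DataSourceError(RuntimeError):
    """Stand-in for pyogrio.errors.DataSourceError (assumption for this sketch)."""


def missing_file_reason(exc: Exception) -> Optional[str]:
    """Return a skip reason if exc means 'data file absent', else None."""
    if isinstance(exc, FileNotFoundError):
        return f"data file not available: {exc}"
    # Only the 'No such file' flavour counts as missing data; corruption or
    # unsupported-format errors must still fail the test.
    if isinstance(exc, DataSourceError) and "No such file" in str(exc):
        return f"data file not available: {exc}"
    return None


def load_or_skip(loader: Callable, skip: Callable):
    """Call loader(); turn missing-file errors into skip(reason), re-raise the rest.

    `skip` is expected to raise, as pytest.skip does.
    """
    try:
        return loader()
    except (FileNotFoundError, DataSourceError) as exc:
        reason = missing_file_reason(exc)
        if reason is None:
            raise
        skip(reason)
```

In a fixture, `loader` would wrap the `getDataSourceData()` call and `skip` would be `pytest.skip`. Anything other than a missing file propagates as a real failure.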
Brainstormed design for the reverse of the repository loader: export
project Metadata documents into a repository JSON file (reference-only),
with content-hash/ObjectId duplicate detection and a dedup override mode.
Approach C: pure-function logic module + thin dataToolkit facade + CLI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Guards against itemName collisions so distinct documents are never
silently overwritten when they share an ObjectId-derived name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
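
The commit message does not say how a collision is resolved (renaming, raising, or something else). As one possible reading, the sketch below picks the first free numeric suffix. The function name `unique_item_name` and the `_N` suffix scheme are assumptions for illustration, not the behaviour of the PR.

```python
def unique_item_name(base: str, taken: set) -> str:
    """Return base if it is free, otherwise base plus the first free numeric suffix."""
    if base not in taken:
        return base
    counter = 2
    while f"{base}_{counter}" in taken:
        counter += 1
    return f"{base}_{counter}"
```

A caller would add the returned name to `taken` before naming the next item, so two distinct documents never end up under the same key.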
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Queries the source project, delegates to the pure repositoryExport
helpers, writes the JSON file and optionally registers it. Excludes the
project's internal __config__ document when exporting all documents.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verifies an exported repository file re-loads through the existing
loadRepositoryFromPath with item fields intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lior-antonov

Collaborator

@ilayfalach please remove Claude's plan files from the PR, or add them to .gitignore.

Comment thread .github/workflows/ci.yml

Collaborator

This file is part of Claude Code's setup. It doesn't need to be added to git. Please remove it.

Collaborator

This file is part of Claude Code's setup. It doesn't need to be added to git. Please remove it.

Collaborator

This file is part of Claude Code's setup. It doesn't need to be added to git. Please remove it.

Comment thread requirements.txt

Collaborator

Claude upgraded this requirements file to Python 3.12.
We still work on Python 3.11.
It might cause problems, please revert.

3 participants