Skip to content

chore(ingestion): enable basedpyright across the codebase via baseline#27755

Merged
pmbrull merged 9 commits intomainfrom
basedpyright-baseline-and-ci
Apr 27, 2026
Merged

chore(ingestion): enable basedpyright across the codebase via baseline#27755
pmbrull merged 9 commits intomainfrom
basedpyright-baseline-and-ci

Conversation

@IceS2
Copy link
Copy Markdown
Contributor

@IceS2 IceS2 commented Apr 27, 2026

Summary

Removes the ~25 path globs from [tool.basedpyright] ignore (which excluded roughly 90% of the codebase from type checking) and grandfathers the existing violations into a baseline file. New type errors in any previously-ignored file now fail CI, while existing debt is captured for incremental cleanup.

Why

After PR #27739 set up ruff and got the basic tooling story sane, basedpyright was still effectively neutered: the ignore list excluded metadata/ingestion/*, metadata/profiler/*, metadata/utils/*, metadata/data_quality/*, metadata/clients/* and ~20 other paths — i.e. nearly everything. Type-check coverage was sdk/ plus a couple of small modules.

Rather than fix all violations up-front (~18.8K errors, ~37.4K warnings), we use basedpyright's baseline feature to freeze the current state. The codebase remains green; any new type error introduced anywhere fails CI. Per-module cleanup follows incrementally by deleting baseline entries.

Strategy: pin analysis to the lowest supported Python, run once

After investigating multi-version baseline behavior, this PR follows the standard
pattern: type-check on the floor, run unit tests on the matrix.

Concretely:

  • [tool.basedpyright] pythonVersion = "3.10" — basedpyright analyzes code as if the runtime were Python 3.10 (the lowest supported version), independent of which interpreter actually invokes it. Forward-incompatible features (tomllib usage, PEP 695 generics, typing.Self, etc.) are caught at type-check time.
  • py-tests.yml "Run Static Checks" step gated on matrix.py-version == '3.10'— type-checking is deterministic across the matrix once pythonVersion is pinned, so running it three times wastes CI minutes. Unit tests still run on 3.10/3.11/3.12 to verify runtime compatibility.
  • The baseline is generated once against pythonVersion = 3.10 and stays stable.

What's in this PR

File Change
ingestion/pyproject.toml Drop the entire ignore = [...] block under [tool.basedpyright]; add pythonVersion = "3.10"
ingestion/setup.py Bump basedpyright~=1.14~=1.39.0; add nox to dev deps
ingestion/.basedpyright/baseline.json New, ~13MB JSON. Captures ~18.8K errors + ~37.3K warnings.
ingestion/noxfile.py static-checks session passes --baselinefile .basedpyright/baseline.json
ingestion/Makefile make static-checks delegates to nox -s static-checks (single source of truth between local and CI)
.github/workflows/py-tests.yml Gate "Run Static Checks" step on matrix.py-version == '3.10' (avoids redundant runs since pythonVersion pin makes results identical across matrix)
.gitignore Add .serena/ (local language-server cache)

How CI will run it

.github/workflows/py-tests.yml already runs:

cd ingestion
nox --no-venv -s static-checks

The nox session now resolves to basedpyright -p pyproject.toml --baselinefile .basedpyright/baseline.json. With cwd = ingestion/, the baseline path resolves correctly. No workflow changes needed.

Verifying locally

# With baseline (current state — should be 0 errors)
make static-checks

# Without baseline (raw violation count, useful for assessing cleanup progress)
echo '{\"files\":{}}' > /tmp/empty-baseline.json
cd ingestion
basedpyright -p pyproject.toml --baselinefile /tmp/empty-baseline.json
# → ~18.8K errors, ~37.4K warnings

Both make static-checks from repo root and from ingestion/ produce identical results (the Makefile target cds into ingestion/ before invoking nox).

Cleanup workflow (follow-up PRs)

To fix violations in a module:

  1. Make code fixes
  2. Re-generate baseline: cd ingestion && basedpyright -p pyproject.toml --baselinefile .basedpyright/baseline.json --writebaseline
  3. Diff baseline.json to confirm error count went down
  4. Run --writebaseline 2-3 more times to absorb non-determinism (rare 0–80 errors flake between runs)
  5. Commit the slimmer baseline alongside the code fix

The baseline JSON is keyed by file path → list of error ranges. Deletions in the JSON correspond to actual fixes.

What's intentionally not in scope

  • Fixing the ~18.8K grandfathered violations (multi-month effort, separate PRs)
  • Re-enabling weakened reports: reportDeprecated, reportAny, reportExplicitAny, reportImplicitOverride still = false. Tackling each would surface large new error sets — separate phases per the post-modernize roadmap
  • Pylint → ruff Stage 2 (the larger lint migration

Removes the ~25 paths from `[tool.basedpyright] ignore` (which excluded
roughly 90% of the codebase from type checking) and grandfathers the
existing violations into a baseline file. New violations in any
previously-ignored file now fail CI.

Changes:
- ingestion/pyproject.toml: drop the entire `ignore = [...]` block
- ingestion/setup.py: bump `basedpyright~=1.14` to `~=1.39.0`
- ingestion/.basedpyright/baseline.json (new, ~13MB): captures the
  starting violation set (~18.8K errors + ~37.4K warnings) so the
  migration is behavior-preserving. Regenerate with
  `cd ingestion && basedpyright -p pyproject.toml --baselinefile
  .basedpyright/baseline.json --writebaseline`. basedpyright analysis
  has minor non-determinism (similar to ruff's), so re-running
  --writebaseline a few times converges the baseline.
- ingestion/noxfile.py: pass `--baselinefile .basedpyright/baseline.json`
  to the basedpyright invocation in the `static-checks` session so CI
  honors the grandfathering. CI already runs the session via
  `cd ingestion && nox --no-venv -s static-checks` (py-tests.yml).
- ingestion/Makefile: `make static-checks` now delegates to
  `nox -s static-checks` so local invocations match CI exactly. Also
  drops the dead Python 3.9 / OM_SKIP_SDK_PY39 branch (we require
  Python >=3.10 since the previous modernization PR).
- .gitignore: add `.serena/` (local language-server cache)
Copilot AI review requested due to automatic review settings April 27, 2026 09:14
@IceS2 IceS2 requested a review from a team as a code owner April 27, 2026 09:14
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 27, 2026
Comment thread ingestion/pyproject.toml
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enables basedpyright type-checking across the full ingestion/ codebase by removing the large ignore-glob list and instead grandfathering existing violations via a committed baseline file, so newly introduced type issues anywhere will fail CI.

Changes:

  • Remove [tool.basedpyright].ignore path globs and document the new baseline-based workflow.
  • Add a basedpyright baseline file and update the nox static-checks session to use it.
  • Align local make static-checks with CI by delegating to nox, and bump basedpyright version.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ingestion/pyproject.toml Removes ignore globs and documents baseline-based grandfathering.
ingestion/.basedpyright/baseline.json Adds the committed baseline capturing current type violations.
ingestion/noxfile.py Runs basedpyright with --baselinefile in the static-checks session.
ingestion/Makefile Delegates static-checks to nox to match CI behavior.
ingestion/setup.py Updates dev dependency to basedpyright~=1.39.0.
.gitignore Ignores .serena/ local tooling cache directory.

Comment thread ingestion/pyproject.toml
Comment thread ingestion/Makefile
The static-checks Makefile target and the py-tests CI job both delegate
to `nox -s static-checks`, but nox was being installed as a separate
side step (`pip install nox` in `install_dev_env`, `uv pip install nox`
in the test-environment composite action). Listing it in dev extras
means a plain `pip install ingestion[dev]` brings it in.
Following the basedpyright + multi-Python-version research:

- ingestion/pyproject.toml: add `pythonVersion = "3.10"` to
  [tool.basedpyright] so type-checking always analyzes for the lowest
  supported Python version. Forward-incompatible code (tomllib usage,
  PEP 695 generics, etc.) is caught at type-check time regardless of
  which Python interpreter runs the checker.
- .github/workflows/py-tests.yml: gate the "Run Static Checks" step on
  `matrix.py-version == '3.10'`. With pythonVersion pinned, results are
  identical across the matrix; running once avoids redundant work and
  keeps the baseline file deterministic. Unit tests still run on the
  full 3.10/3.11/3.12 matrix to verify runtime compatibility.
- ingestion/.basedpyright/baseline.json: regenerated cleanly with the
  new pythonVersion config (~18.8K errors / ~37.3K warnings, similar
  scale to the previous baseline). Aligns with the canonical
  type-check-on-floor / test-on-matrix pattern used by Pydantic, CPython,
  and other major Python projects.
Copilot AI review requested due to automatic review settings April 27, 2026 10:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.

Comment thread ingestion/pyproject.toml
pythonVersion = "3.10"

# Existing violations are grandfathered via .basedpyright/baseline.json
# (regenerate with `basedpyright -p ingestion/pyproject.toml --writebaseline`
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The baseline regeneration instructions are incorrect: when running from the ingestion/ directory, -p ingestion/pyproject.toml resolves to a non-existent path (ingestion/ingestion/pyproject.toml), and the command also doesn’t specify the baseline file to update. Please update the comment to use the correct -p pyproject.toml (or clarify it’s meant to run from repo root) and include --baselinefile .basedpyright/baseline.json so it actually rewrites the committed baseline.

Suggested change
# (regenerate with `basedpyright -p ingestion/pyproject.toml --writebaseline`
# (regenerate with `basedpyright -p pyproject.toml --writebaseline --baselinefile .basedpyright/baseline.json`

Copilot uses AI. Check for mistakes.
IceS2 added 2 commits April 27, 2026 12:16
…seline

CI's previous run still surfaced ~9 issues (2 errors + 7 warnings) that
weren't in the baseline. Root cause: my local environment differs from
CI's in three ways that affect type inference — Python interpreter
(3.11 vs 3.10), platform (Darwin vs Linux), and pip-resolved package
versions (couchbase, avro, trino, sqlalchemy stubs all differ slightly).

This commit closes the platform gap and regenerates the baseline from a
fresh CI-equivalent environment:

- ingestion/pyproject.toml: add `pythonPlatform = "Linux"` to
  [tool.basedpyright] so type-checking uses the Linux subset of stdlib /
  third-party stubs regardless of where the analyzer runs.
- ingestion/.basedpyright/baseline.json: regenerated against a fresh
  Python 3.10 venv installed via `uv pip install ingestion[test]` (the
  same install path CI's setup-openmetadata-test-environment composite
  action uses). New scale: ~18.7K errors / ~37.5K warnings — same
  ballpark as the previous baseline, with column positions now matching
  CI's environment.

Local-developer note: when running `make static-checks` from a venv
that doesn't mirror CI exactly (e.g. macOS, Python 3.11, different
package versions), you may see drift errors. The supported workflow for
regenerating the baseline is to mirror CI:
  python3.10 -m venv /tmp/ci-mirror
  source /tmp/ci-mirror/bin/activate
  uv pip install --upgrade pip "setuptools<81"
  uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9"
  uv pip install -e "ingestion[test]"
  uv pip install "basedpyright~=1.39.0" nox
  cd ingestion && basedpyright -p pyproject.toml \
      --baselinefile .basedpyright/baseline.json --writebaseline
…mirror

The previous attempt added `pythonPlatform = "Linux"` thinking it would
make the local-generated baseline match CI. It did the opposite — Linux
platform stubs activate additional conditional code paths that weren't
analyzed before, so CI saw 101 errors instead of the prior 2 errors.

Reverting:
- Drop `pythonPlatform = "Linux"` from [tool.basedpyright]. Without it,
  basedpyright analyzes for the host platform; on CI's ubuntu-latest
  runner that's Linux automatically, but type-stub coverage stays the
  same as before (matching the d9196df baseline).
- Regenerate ingestion/.basedpyright/baseline.json against a fresh
  Python 3.10 venv installed via `uv pip install ingestion[test]`
  (mirroring CI's setup-openmetadata-test-environment composite action).
  ~18.8K errors / 37.7K warnings captured — same scale as the working
  d9196df version.

Local-developer note: any baseline regeneration done on macOS will drift
from CI's Linux env (different transitive package versions, different
stubs). The supported local mirror procedure:
  python3.10 -m venv /tmp/ci-mirror
  source /tmp/ci-mirror/bin/activate
  uv pip install --upgrade pip "setuptools<81"
  uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9"
  uv pip install -e "ingestion[test]"
  uv pip install "basedpyright~=1.39.0" nox
  cd ingestion && basedpyright -p pyproject.toml \\
      --baselinefile .basedpyright/baseline.json --writebaseline
IceS2 added 2 commits April 27, 2026 14:08
Prior CI-mirror only installed [test], skipping [all] and the four
--no-deps SA pins (sqlalchemy-redshift/databricks/ibmi, pydoris-custom).
That left ~75 connector packages out of the analysis env, so basedpyright
couldn't resolve types from databricks.sqlalchemy, GE 0.18 Batch,
sklearn BaseEstimator, airflow SQLAlchemy models, pandas/numpy stubs,
etc. CI saw 129 errors absent from the baseline.

Regenerated against a fresh py3.10 venv that mirrors
.github/actions/setup-openmetadata-test-environment exactly:
  uv pip install ./ingestion[dev]
  make generate
  uv pip install "setuptools<81"
  uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9"
  uv pip install --no-deps sqlalchemy-redshift==0.8.14 \
                            sqlalchemy-databricks==0.2.0 \
                            sqlalchemy-ibmi==0.9.3 \
                            pydoris-custom==1.1.0
  uv pip install ./ingestion[all]
  uv pip install ./ingestion[test]
  uv pip install nox

First run: 128 errors, 272 warnings — within 1 error of CI's 129/272.
Wrote baseline with 56,100 entries across 1,035 files. Verify run with
the new baseline reports 0/0/0.

macOS arm64 vs Linux x86_64 wheel resolution may leave a small residual
(~3-7 errors per the d9196df precedent). Re-run --writebaseline 2-3x
if any show up in CI.
CI's Linux fastavro stub returns Schema as `str | List[Any]`, while
the macOS arm64 wheel narrows to `str` — the only error not absorbed
by the regenerated baseline. Add a targeted pyright: ignore on the
parse_avro_schema call instead of broadening behavior.
Copilot AI review requested due to automatic review settings April 27, 2026 12:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

.github/workflows/py-tests.yml:110

  • To make the new if: matrix.py-version == '3.10' gate actually eliminate redundant nox runs, consider setting PYTHON_VERSIONS=${{ matrix.py-version }} (or similar) for this step. ingestion/noxfile.py defaults to SUPPORTED_PYTHON_VERSIONS = ["3.10", "3.11", "3.12"] unless PYTHON_VERSIONS is set, which can cause the static-checks session to run once per listed version even inside the single 3.10 job.
      - name: Run Static Checks
        # basedpyright is configured with `pythonVersion = "3.10"` (the lowest
        # supported version) so type-checking results are identical across the
        # 3.10/3.11/3.12 matrix. Run on the lowest version only to avoid
        # redundant work and keep the baseline file deterministic.
        if: matrix.py-version == '3.10'
        run: |
          source env/bin/activate
          cd ingestion
          nox --no-venv -s static-checks

writer_schema = json.dumps(reader.writer_schema)

return parse_avro_schema(schema=writer_schema, cls=Column)
return parse_avro_schema(schema=writer_schema, cls=Column) # pyright: ignore[reportArgumentType]
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid relying on pyright: ignore[reportArgumentType] here. parse_avro_schema is typed to accept schema: str, but writer_schema can still be non-string after the isinstance(..., dict) branch (e.g., if fastavro returns other schema representations). Normalize writer_schema to a str (or explicitly cast after proving it’s a str) before calling parse_avro_schema so the type check reflects the runtime contract and you don’t accidentally pass a non-string at runtime.

Copilot uses AI. Check for mistakes.
IceS2 added 2 commits April 27, 2026 14:35
CI's `--baselinemode=lock` (default) requires the baseline to match
exactly — neither up nor down. Two related issues:

1. The avro.py noqa silenced not just the surfaced error but 10
   cascading entries at line 95 (sub-errors propagating from the
   unresolved `schema` arg type). Baseline went `down by 10` → lock
   violated → exit 3 even with `0 errors` reported. Regenerate baseline
   so the 10 stale entries are dropped.

2. The macOS arm64 fastavro stub doesn't surface that error in the
   first place, so basedpyright treats the noqa as
   `reportUnnecessaryTypeIgnoreComment` locally — causing the opposite
   lock mismatch on CI (a warning entry that doesn't exist there).
   Disable the rule so platform-specific residuals can land without
   flapping between local and CI.
…ance

CI's implicit default is `lock`, which fails on any baseline change in
either direction (errors going up *or* down) via console.error → exit 3.
That cannot accommodate macOS arm64 vs Linux x86_64 stub drift: a
baseline regenerated locally always carries some entries that don't fire
on CI (and vice versa).

`auto` would tolerate the drift but silently overwrites the baseline
file — unacceptable in CI, where unreviewed changes never get committed
back.

`discard` is the right balance:
  - New errors not in the baseline still fail the run (early-return path
    in BaselineHandler.write before the lock/discard branch).
  - Stale baseline entries (errors that no longer fire on the current
    platform) print an info message and exit 0.
  - The baseline file is never modified.
Copilot AI review requested due to automatic review settings April 27, 2026 12:46
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 27, 2026

Code Review ✅ Approved 1 resolved / 1 findings

Enables basedpyright across the codebase using a baseline, resolving the unclosed parenthesis in the pyproject.toml comment. No open issues found.

✅ 1 resolved
Quality: Unclosed parenthesis in pyproject.toml comment

📄 ingestion/pyproject.toml:284
The regeneration comment on line 284 opens a parenthesis (regenerate with ... but never closes it. While it's just a comment and has no functional impact, it's a small readability nit.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 8 changed files in this pull request and generated 2 comments.

Comment thread ingestion/pyproject.toml
Comment on lines +291 to +292
# (regenerate with `basedpyright -p ingestion/pyproject.toml --writebaseline`
# from the ingestion/ directory). New violations in any file fail CI.
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidance for regenerating the basedpyright baseline looks incorrect: from within the ingestion/ directory, -p ingestion/pyproject.toml will point to a non-existent path. Update the comment to use -p pyproject.toml (or drop -p entirely when running from ingestion/) and include the --baselinefile .basedpyright/baseline.json flag so the command matches how CI runs basedpyright.

Suggested change
# (regenerate with `basedpyright -p ingestion/pyproject.toml --writebaseline`
# from the ingestion/ directory). New violations in any file fail CI.
# (regenerate with `basedpyright -p pyproject.toml --baselinefile
# .basedpyright/baseline.json --writebaseline` from the ingestion/ directory).
# New violations in any file fail CI.

Copilot uses AI. Check for mistakes.
Comment on lines 91 to +95
writer_schema = reader.writer_schema
if isinstance(writer_schema, dict):
writer_schema = json.dumps(reader.writer_schema)

return parse_avro_schema(schema=writer_schema, cls=Column)
return parse_avro_schema(schema=writer_schema, cls=Column) # pyright: ignore[reportArgumentType]
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of suppressing reportArgumentType here, consider narrowing/normalizing writer_schema to a str before calling parse_avro_schema (which is typed as schema: str). For example, if reader.writer_schema can be a dict, convert it to JSON; if it’s already a string, keep it as-is. This avoids permanently masking genuine type issues in this path.

Copilot uses AI. Check for mistakes.
@sonarqubecloud
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (10 flaky)

✅ 3963 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 1 4
🟡 Shard 2 756 0 3 8
🟡 Shard 3 730 0 2 7
🟡 Shard 4 758 0 1 18
✅ Shard 5 687 0 0 41
🟡 Shard 6 734 0 3 8
🟡 10 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/DomainFilterQueryFilter.spec.ts › Domain filter should use exact match and prefix with dot to prevent false positives (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/TeamSubscriptions.spec.ts › should open and close subscription edit modal (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Dashboard (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Admin user can edit teams from the user profile (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@pmbrull pmbrull merged commit 84ed278 into main Apr 27, 2026
63 checks passed
@pmbrull pmbrull deleted the basedpyright-baseline-and-ci branch April 27, 2026 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants