feat(file-based): use only first found file for discover #841

darynaishchenko · 2025-11-12T13:58:23Z

https://github.com/airbytehq/oncall/issues/9301

Summary by CodeRabbit

New Features
- Added a configuration option to allow schema discovery using only the first found file, enabling faster inference in file-based sources.
Tests
- Added a test verifying schema discovery when using the first found file and confirming inferred schema includes expected metadata fields.

github-actions · 2025-11-12T13:58:38Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/update-discover#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/update-discover

Helpful Resources

CDK API Reference

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

coderabbitai · 2025-11-12T14:02:08Z

📝 Walkthrough

Walkthrough

Added a boolean flag use_first_found_file_for_schema_discovery (default false) to stream configuration. When enabled, DefaultFileBasedStream._get_raw_json_schema restricts schema discovery to the first discovered file. A unit test and CSV scenario schema entry were added to exercise and declare the option.

Changes

Cohort / File(s)	Summary
Configuration `airbyte_cdk/sources/file_based/config/file_based_stream_config.py`, `unit_tests/sources/file_based/scenarios/csv_scenarios.py`	Added `use_first_found_file_for_schema_discovery: bool` (default `False`, with title/description) to `FileBasedStreamConfig` and to CSV stream scenario schema properties.
Stream Implementation `airbyte_cdk/sources/file_based/stream/default_file_based_stream.py`	`_get_raw_json_schema` now checks `config.use_first_found_file_for_schema_discovery`; when true, logs an info message and limits discovery input to the first file by slicing the iterator returned from `get_files()` (using `itertools.islice`), adjusting `first_n_files` accordingly.
Tests `unit_tests/sources/file_based/stream/test_default_file_based_stream.py`	Added `test_use_first_found_file_for_schema_discovery` which mocks multiple remote files, enables the flag, asserts `infer_schema` is invoked once on the first file, and checks the produced schema includes `_ab_source_file_last_modified`, `_ab_source_file_url`, and expected nullable string types.

Sequence Diagram(s)

sequenceDiagram
  participant Stream as DefaultFileBasedStream
  participant Config as FileBasedStreamConfig
  participant FS as FileFetcher
  participant Infer as SchemaInferencer
  rect rgb(245,250,255)
    Note over Stream,Config: Schema discovery flow (flag-aware)
  end
  Stream->>Config: read use_first_found_file_for_schema_discovery
  alt flag == true
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: files iterator
    Stream->>Stream: select first file (islice(...,1))
    Stream->>Infer: infer_schema(first_file)
    Infer-->>Stream: schema
  else flag == false
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: many files
    Stream->>Infer: infer_schema(files up to max_n_files)
    Infer-->>Stream: schema
  end
  Stream->>Stream: consolidate & return schema

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Review the new conditional and itertools.islice usage in _get_raw_json_schema.
Verify the new Field metadata (title/description/default) in FileBasedStreamConfig.
Check the test mocks/assertions ensure infer_schema runs exactly once on the first file and that CSV scenario schema includes the new property.

Would you like a short changelog entry or a snippet to add to the config docs mentioning this flag, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding a feature to use only the first found file for schema discovery in file-based sources.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch daryna/file-based/update-discover

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0443aa and 8615bb9.

📒 Files selected for processing (3)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1 hunks)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1 hunks)
unit_tests/sources/file_based/stream/test_default_file_based_stream.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

🧬 Code graph analysis (2)

unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5)

airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

config (46-47)

config (51-61)

get_matching_files (79-99)

airbyte_cdk/sources/file_based/discovery_policy/abstract_discovery_policy.py (2)

get_max_n_files_for_schema_inference (21-21)

n_concurrent_requests (18-18)

airbyte_cdk/sources/file_based/discovery_policy/default_discovery_policy.py (2)

get_max_n_files_for_schema_inference (24-33)

n_concurrent_requests (21-22)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

infer_schema (338-342)

get_json_schema (246-269)

airbyte_cdk/sources/file_based/remote_file.py (1)

RemoteFile (11-18)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)

airbyte_cdk/sources/streams/core.py (1)

logger (128-129)

airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py (1)

get_files (92-96)

airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)

get_files (165-166)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)

GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-google-drive
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-intercom
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Analyze (python)
GitHub Check: Analyze (python)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

unit_tests/sources/file_based/stream/test_default_file_based_stream.py

github-actions · 2025-11-12T14:06:09Z

PyTest Results (Fast)

2 648 tests - 1 163 2 636 ✅ - 1 164 6m 51s ⏱️ -21s
1 suites ± 0 11 💤 ± 0
1 files ± 0 1 ❌ + 1

For more details on these failures, see this check.

Results for commit 10a3b4e. ± Comparison against base commit f0443aa.

This pull request removes 1163 tests.

unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition[test_default_transform]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition[test_no_transform]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_generator[test_full_refresh]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_generator[test_incremental]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_hash[test_hash_no_slice]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_hash[test_hash_with_slice]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_raising_exception[test_exception_no_display_message0]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_raising_exception[test_exception_no_display_message1]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_get_error_display_message[test_no_display_message0]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_get_error_display_message[test_no_display_message1]
…

♻️ This comment has been updated with latest results.

github-actions · 2025-11-12T14:11:00Z

PyTest Results (Full)

3 815 tests +1 3 802 ✅ ±0 11m 9s ⏱️ +5s
1 suites ±0 12 💤 ±0
1 files ±0 1 ❌ +1

For more details on these failures, see this check.

Results for commit 10a3b4e. ± Comparison against base commit f0443aa.

♻️ This comment has been updated with latest results.

maxi297

A couple comments just to make sure we cover everything here

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

maxi297 · 2025-11-12T14:44:44Z

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

    )
+    use_first_found_file_for_schema_discovery: bool = Field(
+        title="Use first found file for schema discovery",
+        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",


It feels like we should describe how this config interacts with recent_n_files_to_read_for_schema_discovery. Should we have a config error when both are enabled?

good point, I think so, only one of it should be defined

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)
488-493: Consider enhancing the description for clarity.

The description "Helps to avoid long discovery step" could be more explicit about the exact behavior. What do you think about making it clearer that schema inference will be limited to exactly the first file found? For example:
-                                    "description": "When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+                                    "description": "When enabled, the source will use only the first found file for schema discovery, rather than sampling multiple files. This can significantly speed up the discovery step when working with many files.",
This would help users understand both what happens (single file only) and why it matters (performance optimization), wdyt?

488-493: Consider adding a CSV scenario that exercises this flag.

While the new field is correctly added to the expected spec, I noticed that none of the CSV scenarios in this file actually set use_first_found_file_for_schema_discovery: True in their configuration.

Would it be valuable to add a CSV-specific test scenario that demonstrates this flag in action? For example, a scenario similar to multi_csv_scenario but with this flag enabled, expecting the schema to only include columns from the first file. This would:

Provide clear documentation of the feature's behavior with CSV files

Ensure the flag works correctly with CSV-specific configurations

Complement the unit test in test_default_file_based_stream.py mentioned in the AI summary

Wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 377884c and 10a3b4e.

📒 Files selected for processing (1)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: Check: source-google-drive
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-intercom
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: destination-motherduck
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Manifest Server Docker Image Build
GitHub Check: SDM Docker Image Build
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Analyze (python)

use only first found file for discover

8615bb9

darynaishchenko self-assigned this Nov 12, 2025

github-actions bot added the enhancement New feature or request label Nov 12, 2025

coderabbitai bot requested changes Nov 12, 2025

View reviewed changes

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Outdated Show resolved Hide resolved

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py Show resolved Hide resolved

unit_tests/sources/file_based/stream/test_default_file_based_stream.py Show resolved Hide resolved

refactor code

377884c

coderabbitai bot approved these changes Nov 12, 2025

View reviewed changes

darynaishchenko requested a review from maxi297 November 12, 2025 14:36

update expected spec

10a3b4e

maxi297 reviewed Nov 12, 2025

View reviewed changes

coderabbitai bot reviewed Nov 12, 2025

View reviewed changes

feat(file-based): use only first found file for discover #841

Are you sure you want to change the base?

feat(file-based): use only first found file for discover #841

Uh oh!

Conversation

darynaishchenko commented Nov 12, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions bot commented Nov 12, 2025

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

Helpful Resources

PR Slash Commands

Uh oh!

coderabbitai bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

github-actions bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

maxi297 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

maxi297 Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

darynaishchenko Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

darynaishchenko commented Nov 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 12, 2025 •

edited

Loading

github-actions bot commented Nov 12, 2025 •

edited

Loading

github-actions bot commented Nov 12, 2025 •

edited

Loading