Skip to content

Conversation

@darynaishchenko
Copy link
Contributor

@darynaishchenko darynaishchenko commented Nov 12, 2025

https://github.com/airbytehq/oncall/issues/9301

Summary by CodeRabbit

  • New Features
    • Added a configuration option to allow schema discovery using only the first found file, enabling faster inference in file-based sources.
  • Tests
    • Added a test verifying schema discovery when using the first found file and confirming inferred schema includes expected metadata fields.

@darynaishchenko darynaishchenko self-assigned this Nov 12, 2025
@github-actions github-actions bot added the enhancement New feature or request label Nov 12, 2025
@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/update-discover#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/update-discover

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 12, 2025

📝 Walkthrough

Walkthrough

Added a boolean flag use_first_found_file_for_schema_discovery (default false) to stream configuration. When enabled, DefaultFileBasedStream._get_raw_json_schema restricts schema discovery to the first discovered file. A unit test and CSV scenario schema entry were added to exercise and declare the option.

Changes

Cohort / File(s) Summary
Configuration
airbyte_cdk/sources/file_based/config/file_based_stream_config.py, unit_tests/sources/file_based/scenarios/csv_scenarios.py
Added use_first_found_file_for_schema_discovery: bool (default False, with title/description) to FileBasedStreamConfig and to CSV stream scenario schema properties.
Stream Implementation
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
_get_raw_json_schema now checks config.use_first_found_file_for_schema_discovery; when true, logs an info message and limits discovery input to the first file by slicing the iterator returned from get_files() (using itertools.islice), adjusting first_n_files accordingly.
Tests
unit_tests/sources/file_based/stream/test_default_file_based_stream.py
Added test_use_first_found_file_for_schema_discovery which mocks multiple remote files, enables the flag, asserts infer_schema is invoked once on the first file, and checks the produced schema includes _ab_source_file_last_modified, _ab_source_file_url, and expected nullable string types.

Sequence Diagram(s)

sequenceDiagram
  participant Stream as DefaultFileBasedStream
  participant Config as FileBasedStreamConfig
  participant FS as FileFetcher
  participant Infer as SchemaInferencer
  rect rgb(245,250,255)
    Note over Stream,Config: Schema discovery flow (flag-aware)
  end
  Stream->>Config: read use_first_found_file_for_schema_discovery
  alt flag == true
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: files iterator
    Stream->>Stream: select first file (islice(...,1))
    Stream->>Infer: infer_schema(first_file)
    Infer-->>Stream: schema
  else flag == false
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: many files
    Stream->>Infer: infer_schema(files up to max_n_files)
    Infer-->>Stream: schema
  end
  Stream->>Stream: consolidate & return schema
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Review the new conditional and itertools.islice usage in _get_raw_json_schema.
  • Verify the new Field metadata (title/description/default) in FileBasedStreamConfig.
  • Check the test mocks/assertions ensure infer_schema runs exactly once on the first file and that CSV scenario schema includes the new property.

Would you like a short changelog entry or a snippet to add to the config docs mentioning this flag, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a feature to use only the first found file for schema discovery in file-based sources.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch daryna/file-based/update-discover

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0443aa and 8615bb9.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1 hunks)
  • unit_tests/sources/file_based/stream/test_default_file_based_stream.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
🧬 Code graph analysis (2)
unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
  • config (46-47)
  • config (51-61)
  • get_matching_files (79-99)
airbyte_cdk/sources/file_based/discovery_policy/abstract_discovery_policy.py (2)
  • get_max_n_files_for_schema_inference (21-21)
  • n_concurrent_requests (18-18)
airbyte_cdk/sources/file_based/discovery_policy/default_discovery_policy.py (2)
  • get_max_n_files_for_schema_inference (24-33)
  • n_concurrent_requests (21-22)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
  • infer_schema (338-342)
  • get_json_schema (246-269)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
airbyte_cdk/sources/streams/core.py (1)
  • logger (128-129)
airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py (1)
  • get_files (92-96)
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)
  • get_files (165-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (python)

@github-actions
Copy link

github-actions bot commented Nov 12, 2025

PyTest Results (Fast)

2 648 tests   - 1 163   2 636 ✅  - 1 164   6m 51s ⏱️ -21s
    1 suites ±    0      11 💤 ±    0 
    1 files   ±    0       1 ❌ +    1 

For more details on these failures, see this check.

Results for commit 10a3b4e. ± Comparison against base commit f0443aa.

This pull request removes 1163 tests.
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition[test_default_transform]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition[test_no_transform]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_generator[test_full_refresh]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_generator[test_incremental]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_hash[test_hash_no_slice]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_hash[test_hash_with_slice]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_raising_exception[test_exception_no_display_message0]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_file_based_stream_partition_raising_exception[test_exception_no_display_message1]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_get_error_display_message[test_no_display_message0]
unit_tests.sources.file_based.stream.concurrent.test_adapters ‑ test_get_error_display_message[test_no_display_message1]
…

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Nov 12, 2025

PyTest Results (Full)

3 815 tests  +1   3 802 ✅ ±0   11m 9s ⏱️ +5s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       1 ❌ +1 

For more details on these failures, see this check.

Results for commit 10a3b4e. ± Comparison against base commit f0443aa.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments just to make sure we cover everything here

)
use_first_found_file_for_schema_discovery: bool = Field(
title="Use first found file for schema discovery",
description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like we should describe how this config interacts with recent_n_files_to_read_for_schema_discovery. Should we have a config error when both are enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I think so, only one of it should be defined

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

488-493: Consider enhancing the description for clarity.

The description "Helps to avoid long discovery step" could be more explicit about the exact behavior. What do you think about making it clearer that schema inference will be limited to exactly the first file found? For example:

-                                    "description": "When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+                                    "description": "When enabled, the source will use only the first found file for schema discovery, rather than sampling multiple files. This can significantly speed up the discovery step when working with many files.",

This would help users understand both what happens (single file only) and why it matters (performance optimization), wdyt?


488-493: Consider adding a CSV scenario that exercises this flag.

While the new field is correctly added to the expected spec, I noticed that none of the CSV scenarios in this file actually set use_first_found_file_for_schema_discovery: True in their configuration.

Would it be valuable to add a CSV-specific test scenario that demonstrates this flag in action? For example, a scenario similar to multi_csv_scenario but with this flag enabled, expecting the schema to only include columns from the first file. This would:

  1. Provide clear documentation of the feature's behavior with CSV files
  2. Ensure the flag works correctly with CSV-specific configurations
  3. Complement the unit test in test_default_file_based_stream.py mentioned in the AI summary

Wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 377884c and 10a3b4e.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Analyze (python)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants