-
Notifications
You must be signed in to change notification settings - Fork 30
feat(file-based): use only first found file for discover #841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/update-discover#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/update-discoverHelpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
📝 WalkthroughWalkthroughAdded a boolean flag Changes
Sequence Diagram(s)sequenceDiagram
participant Stream as DefaultFileBasedStream
participant Config as FileBasedStreamConfig
participant FS as FileFetcher
participant Infer as SchemaInferencer
rect rgb(245,250,255)
Note over Stream,Config: Schema discovery flow (flag-aware)
end
Stream->>Config: read use_first_found_file_for_schema_discovery
alt flag == true
Stream->>FS: get_files() (iterator)
FS-->>Stream: files iterator
Stream->>Stream: select first file (islice(...,1))
Stream->>Infer: infer_schema(first_file)
Infer-->>Stream: schema
else flag == false
Stream->>FS: get_files() (iterator)
FS-->>Stream: many files
Stream->>Infer: infer_schema(files up to max_n_files)
Infer-->>Stream: schema
end
Stream->>Stream: consolidate & return schema
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Would you like a short changelog entry or a snippet to add to the config docs mentioning this flag, wdyt? Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py(1 hunks)airbyte_cdk/sources/file_based/stream/default_file_based_stream.py(1 hunks)unit_tests/sources/file_based/stream/test_default_file_based_stream.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.
Applied to files:
airbyte_cdk/sources/file_based/config/file_based_stream_config.py
🧬 Code graph analysis (2)
unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
config(46-47)config(51-61)get_matching_files(79-99)airbyte_cdk/sources/file_based/discovery_policy/abstract_discovery_policy.py (2)
get_max_n_files_for_schema_inference(21-21)n_concurrent_requests(18-18)airbyte_cdk/sources/file_based/discovery_policy/default_discovery_policy.py (2)
get_max_n_files_for_schema_inference(24-33)n_concurrent_requests(21-22)airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
infer_schema(338-342)get_json_schema(246-269)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile(11-18)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
airbyte_cdk/sources/streams/core.py (1)
logger(128-129)airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py (1)
get_files(92-96)airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)
get_files(165-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-shopify
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-intercom
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (python)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py
Outdated
Show resolved
Hide resolved
PyTest Results (Fast)2 648 tests - 1 163 2 636 ✅ - 1 164 6m 51s ⏱️ -21s For more details on these failures, see this check. Results for commit 10a3b4e. ± Comparison against base commit f0443aa. This pull request removes 1163 tests.♻️ This comment has been updated with latest results. |
PyTest Results (Full)3 815 tests +1 3 802 ✅ ±0 11m 9s ⏱️ +5s For more details on these failures, see this check. Results for commit 10a3b4e. ± Comparison against base commit f0443aa. ♻️ This comment has been updated with latest results. |
maxi297
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments just to make sure we cover everything here
| ) | ||
| use_first_found_file_for_schema_discovery: bool = Field( | ||
| title="Use first found file for schema discovery", | ||
| description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like we should describe how this config interacts with recent_n_files_to_read_for_schema_discovery. Should we have a config error when both are enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I think so, only one of it should be defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)
488-493: Consider enhancing the description for clarity.The description "Helps to avoid long discovery step" could be more explicit about the exact behavior. What do you think about making it clearer that schema inference will be limited to exactly the first file found? For example:
- "description": "When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step", + "description": "When enabled, the source will use only the first found file for schema discovery, rather than sampling multiple files. This can significantly speed up the discovery step when working with many files.",This would help users understand both what happens (single file only) and why it matters (performance optimization), wdyt?
488-493: Consider adding a CSV scenario that exercises this flag.While the new field is correctly added to the expected spec, I noticed that none of the CSV scenarios in this file actually set
use_first_found_file_for_schema_discovery: Truein their configuration.Would it be valuable to add a CSV-specific test scenario that demonstrates this flag in action? For example, a scenario similar to
multi_csv_scenariobut with this flag enabled, expecting the schema to only include columns from the first file. This would:
- Provide clear documentation of the feature's behavior with CSV files
- Ensure the flag works correctly with CSV-specific configurations
- Complement the unit test in
test_default_file_based_stream.pymentioned in the AI summaryWdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/file_based/scenarios/csv_scenarios.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-shopify
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: destination-motherduck
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Analyze (python)
https://github.com/airbytehq/oncall/issues/9301
Summary by CodeRabbit