Skip to content

Conversation

@no23reason
Copy link
Contributor

@no23reason no23reason commented Nov 7, 2025

Rationale for this change

After the changes in #47115, GeneratorStreams backed by anything else than RecordBatches failed. This includes Tables and RecordBatchReaders.

This was caused by a too strict assumption that the RecordBatchStream#GetSchemaPayload would always get called, which is not the case when the GeneratorStream is backed by a Table or a RecordBatchReader.

What changes are included in this PR?

Removal of the problematic assertion and initialization of the writer object when it is needed first.
Also, to accommodate for this case, drop the incoming message when initializing the writer in Next, as the message there
is of the SCHEMA type and we want RECORD_BATCH one.

Are these changes tested?

Yes, via CI. Tests for the GeneratorStreams were extended so that they test GeneratorStreams backed by Tables and RecordBatchReaders, not just RecordBatches.

Are there any user-facing changes?

No, just a fix for a regression restoring the functionality from version 21.0.0 and earlier.

@github-actions
Copy link

github-actions bot commented Nov 7, 2025

⚠️ GitHub issue #48076 has been automatically assigned in GitHub to PR creator.

@no23reason
Copy link
Contributor Author

Unfortunately, there is more at play than I expected and I am not sure how to proceed :( but I believe the tests I added should be included in upstream and should be passing so that all the possible types backing the GeneratorStream are tested.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for raising the issue and working on this! @no23reason

if (!writer_) {
return Status::UnknownError(
"Writer should be initialized before reading Next batches");
RETURN_NOT_OK(InitializeWriter());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this specific scenario we might have to drop the first message after calling writer_->WriteRecordBatch, which will be a schema message, after creation of the ipc::internal::OpenRecordBatchWriter and we want to keep the rest of the messages which should be the expected RecordBatch for those cases.
I would have to debug but from what I understand that might be the cause of the current problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion, that is a good idea. I will take a look as soon as I can, unfortunately something more urgent cropped up I have to deal with first :( I will let you know what I find out

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Nov 7, 2025
@no23reason no23reason force-pushed the gh-48076-generator-stream branch from c11ea33 to 72d4258 Compare November 10, 2025 15:33
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 10, 2025
After the changes in apache#47115, GeneratorStreams backed by anything else
than RecordBatches failed. This includes Tables and RecordBatchReaders.

This was caused by a too strict assumption that the
RecordBatchStream#GetSchemaPayload would always get called, which
is not the case when the GeneratorStream is backed by a Table
or a RecordBatchReader.

So to fix this, remove the assertion and instead initialize the writer
on first access. Also, to accommodate for this case, drop the incoming
message when initializing the writer in Next, as the message there
is of the SCHEMA type and we want RECORD_BATCH one.
@no23reason no23reason force-pushed the gh-48076-generator-stream branch from 72d4258 to 921e6b7 Compare November 10, 2025 16:10
Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current CI failures are unrelated. Thanks for the fix! It seems dropping the schema message on those cases made the trick.
This looks good to me. @lidavidm what do you think?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Nov 10, 2025
@no23reason
Copy link
Contributor Author

Thank you for guiding me through it! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants