GH-48076: [C++][Flight] fix GeneratorStream for Tables #48082
base: main
Conversation
Force-pushed from ef7f961 to c11ea33 (Compare)
Unfortunately, there is more at play than I expected and I am not sure how to proceed :( but I believe the tests I added should be included upstream and should pass, so that all the possible types backing the GeneratorStream are tested.
raulcd left a comment
Thank you very much for raising the issue and working on this! @no23reason
      if (!writer_) {
    -   return Status::UnknownError(
    -       "Writer should be initialized before reading Next batches");
    +   RETURN_NOT_OK(InitializeWriter());
I think in this specific scenario we might have to drop the first message produced after calling writer_->WriteRecordBatch, which will be a schema message emitted right after the ipc::internal::OpenRecordBatchWriter is created, and keep the rest of the messages, which should be the expected RecordBatch payloads for those cases.
I would have to debug, but from what I understand that might be the cause of the current problem.
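For illustration, a minimal sketch of how such a schema message could be told apart from a record batch message, assuming the queued items are arrow::flight::FlightPayload values (the helper name IsSchemaPayload is hypothetical, not part of Arrow):

```cpp
#include "arrow/flight/types.h"  // arrow::flight::FlightPayload
#include "arrow/ipc/message.h"   // arrow::ipc::MessageType
#include "arrow/ipc/writer.h"    // arrow::ipc::IpcPayload

// Hypothetical helper: returns true if a Flight payload carries the IPC
// SCHEMA message (the one suggested to be dropped above) rather than a
// RECORD_BATCH message.
bool IsSchemaPayload(const arrow::flight::FlightPayload& payload) {
  return payload.ipc_message.type == arrow::ipc::MessageType::SCHEMA;
}
```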
Thank you for the suggestion, that is a good idea. I will take a look as soon as I can; unfortunately, something more urgent cropped up that I have to deal with first :( I will let you know what I find out.
Force-pushed from c11ea33 to 72d4258 (Compare)
After the changes in apache#47115, GeneratorStreams backed by anything other than RecordBatches failed. This includes Tables and RecordBatchReaders. This was caused by an overly strict assumption that RecordBatchStream#GetSchemaPayload would always get called, which is not the case when the GeneratorStream is backed by a Table or a RecordBatchReader. To fix this, remove the assertion and instead initialize the writer on first access. Also, to accommodate this case, drop the incoming message when initializing the writer in Next, as the message there is of the SCHEMA type and we want the RECORD_BATCH one.
Force-pushed from 72d4258 to 921e6b7 (Compare)
raulcd left a comment
Current CI failures are unrelated. Thanks for the fix! It seems dropping the schema message in those cases did the trick.
This looks good to me. @lidavidm what do you think?
Thank you for guiding me through it! :)
Rationale for this change
After the changes in #47115, GeneratorStreams backed by anything other than RecordBatches failed. This includes Tables and RecordBatchReaders.
This was caused by an overly strict assumption that RecordBatchStream#GetSchemaPayload would always get called, which is not the case when the GeneratorStream is backed by a Table or a RecordBatchReader.
What changes are included in this PR?
Removal of the problematic assertion, plus initialization of the writer object when it is first needed.
Also, to accommodate this case, the incoming message is dropped when initializing the writer in Next, as the message there is of the SCHEMA type and we want the RECORD_BATCH one.
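As an illustration of the resulting control flow, here is a minimal sketch assuming a simplified stream type (ExampleStream, its queue_ member, and InitializeWriter are hypothetical stand-ins, not the actual arrow::flight::RecordBatchStream internals):

```cpp
#include <deque>
#include <memory>
#include <utility>

#include "arrow/flight/types.h"  // arrow::flight::FlightPayload
#include "arrow/ipc/message.h"   // arrow::ipc::MessageType
#include "arrow/ipc/writer.h"    // arrow::ipc::RecordBatchWriter
#include "arrow/status.h"

// Illustrative stand-in for the stream. Instead of erroring out when the
// writer was not set up via GetSchemaPayload, Next() creates the writer on
// first use and discards the SCHEMA payload that its creation enqueues.
class ExampleStream {
 public:
  arrow::Status Next(arrow::flight::FlightPayload* out) {
    if (!writer_) {
      // Previously this path returned Status::UnknownError(...).
      ARROW_RETURN_NOT_OK(InitializeWriter());
      // Creating the writer enqueued a SCHEMA payload; Next() should only
      // hand out RECORD_BATCH payloads, so drop it.
      if (!queue_.empty() &&
          queue_.front().ipc_message.type == arrow::ipc::MessageType::SCHEMA) {
        queue_.pop_front();
      }
    }
    if (queue_.empty()) {
      // End of stream: in this sketch, a payload without IPC metadata
      // signals completion.
      out->ipc_message.metadata = nullptr;
      return arrow::Status::OK();
    }
    *out = std::move(queue_.front());
    queue_.pop_front();
    return arrow::Status::OK();
  }

 private:
  // Hypothetical: would create writer_ and enqueue the payloads it produces
  // (schema first, then record batches); the real logic is elided here.
  arrow::Status InitializeWriter() { return arrow::Status::OK(); }

  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer_;
  std::deque<arrow::flight::FlightPayload> queue_;
};
```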
Are these changes tested?
Yes, via CI. Tests for the GeneratorStreams were extended so that they test GeneratorStreams backed by Tables and RecordBatchReaders, not just RecordBatches.
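For reference, a hedged sketch of how the three backing types that the extended tests exercise can be constructed with the public Arrow C++ API (illustration only; the actual tests live in the Arrow test suite):

```cpp
#include <memory>
#include <vector>

#include "arrow/api.h"

// Build one RecordBatch and derive the other two backing types from it:
// a Table and a RecordBatchReader.
arrow::Status MakeBackingData() {
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});

  arrow::Int64Builder builder;
  std::vector<int64_t> values = {1, 2, 3};
  ARROW_RETURN_NOT_OK(builder.AppendValues(values));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> column, builder.Finish());

  // 1. A plain RecordBatch (the case that kept working after apache#47115).
  std::shared_ptr<arrow::RecordBatch> batch =
      arrow::RecordBatch::Make(schema, /*num_rows=*/3, {column});

  // 2. A Table built from that batch (one of the regressed cases).
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table,
                        arrow::Table::FromRecordBatches({batch}));

  // 3. A RecordBatchReader over the same batch (the other regressed case).
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatchReader> reader,
                        arrow::RecordBatchReader::Make({batch}, schema));

  // In the extended tests, each of these would back a GeneratorStream.
  (void)table;
  (void)reader;
  return arrow::Status::OK();
}
```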
Are there any user-facing changes?
No, just a fix for a regression, restoring the behavior of version 21.0.0 and earlier.