make committed offset accurate when partition assigned to avoid offset reset#893
make committed offset accurate when partition assigned to avoid offset reset#893sangreal wants to merge 18 commits intoconfluentinc:masterfrom
Conversation
|
Hey @sangreal - i will dig a bit more into it - but it doesn't look right to me - whenever partition is dirty (any offset was processed) - we do need to commit - even if highest succeeded was not advanced. |
|
If you could build a test / reproducible example of this - it would help as well... |
|
@rkolesnev It is really great that you could take your personal time to review my pr 👍 Again, I am all ear for your suggestion. |
|
Hmm, I still don't fully understand where is the issue though. |
|
I have updated the way to fix after more thinking. The idea is make sure getOffsetToCommit is invoked only once to avoid dirty read and commit the wrong offset. |
|
@johnbyrnejb please help review as well, thx a lot |
|
Parallel Consumer Offset reset Issue flow.pdf |
|
Let me explain more for you guys to reviews.
@rkolesnev @johnbyrnejb please help review when you have time, we are waiting for the fix since this offset reset issue happens once every several days. Thanks a lot. |
|
@sangreal - i had spent more time looking into this - and will try to get it done this week. The Parallel Consumer has a bug somewhere in marking state dirty and advancing offset to commit by 1 - so after multiple rebalances it ends up committing not offset 10 - but offset 11 - which brings subscription out of valid range and causes auto offset reset to happen... I am in the process of mapping all possible state transitions for PartitionState to work out if there are any other race conditions / state mismatches. |
|
@rkolesnev thank you so much for your efforts! |
|
@rkolesnev FYI, this fix has been online with private build for more than a week now. This issue never happened again. |
|
@sangreal - it looks good to merge to me - I don't see any drawbacks to this change - it fixes a race condition in encoding / commit - as per above conversation. This PR is good to get merged IMHO. |
|
@rkolesnev Thanks a lot for the efforts on the review my code. |
|
@eddyv could u please take a look when you have time, thanks a lot in advance. |
Description...
make offsetHighestSucceeded accurate when partition assigned
We have encountered offset reset issue while frequent partition rebalancing.
The root cause is caused by :
(1) the
offsetHighestSucceededis assigned w/ offset inOffsetAndMetadatawhich is to-be processed(2)
incompletesis non-empty(3) one
WorkContaineris processed successfully, thendirtyistrue(this offset <offsetHighestSucceeded)(4) committer choose (
offsetHighestSucceeded + 1) to commit becauseincompletesis non-empty (the offset is removed fromincompletes)(5) rebalancing happens, new consumer try to pull record will throw
out of rangeand begin offset resetissue #894
Checklist