-
Notifications
You must be signed in to change notification settings - Fork 707
tx: add missing lock on meta page update #989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @roman-khimov. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
tx.go
Outdated
| lg := tx.db.Logger() | ||
| buf := make([]byte, tx.db.pageSize) | ||
| p := tx.db.pageInBuffer(buf, 0) | ||
| tx.db.metalock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't correct.
We don't need acquire metalock here, it's updating tx's local meta page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, i also thought the same, but u know better 👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see now, we're filling the buf basically. But then how about the writeAt below? It'll flush the buf onto the page 0 or 1 and that's exactly where (*DB) meta() reads data from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or that's the question of how pwrite() updates the page. Not sure about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc bbolt uses CoW, which means the updates will be written to a new pages, and only upon commit the metadata list will be updated to point to the new pages
@ahrtr know better these details 👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then how about the
writeAtbelow?
Only one write TXN is allowed at a time.
Or that's the question of how
pwrite()updates the page. Not sure about that.
pls refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md#storage-boltdb-feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one write TXN is allowed at a time.
That I know, that's rwlock. But my concern here is concurrent RO and RW transactions and initialization specifically. To initialize an RO transaction we need to get the meta page and copy it, like in
Line 53 in 092ee98
| db.meta().Copy(tx.meta) |
(*DB) meta(), but an RW transaction can be updating one of these pages via writeAt. The question is whether this update is atomic or not. To me it's not, because if we're to strip down all the layers of Go, unsafe, structures and other things the situation is not much different from this C snippet (crude one, sorry, the last time I wrote C was years ago, I can't even name page_size correctly now):
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
size_t const pageSize = 4096;
char *mm;
void *reader(void *arg)
{
for (;;) {
int i;
char first, next;
for (i = 0; i < pageSize; i++) {
if (i == 0) {
first = mm[i];
} else {
next = mm[i];
if (next != first) {
printf("reader mismatch, first %hhX, read %hhX at %d", first, next, i);
exit(11);
}
}
}
}
}
int main()
{
char b[pageSize];
int fd, r, i;
ssize_t written;
pthread_t pth;
fd = open("data", O_RDWR | O_CREAT);
if (fd < 0) {
printf("bad fd: %d\n", fd);
exit(10);
}
r = ftruncate(fd, pageSize);
if (r < 0) {
printf("bad ftruncate: %d\n", r);
exit(10);
}
mm = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fd, 0);
if (mm == NULL) {
printf("mmap failed\n");
exit(10);
}
r = pthread_create(&pth, NULL, reader, NULL);
if (r < 0) {
printf("bad pthread_create: %d\n", r);
exit(10);
}
for (i = 0; i < 1000; i++) {
memset(b, i%256, pageSize);
written = pwrite(fd, b, pageSize, 0);
if (written != pageSize) {
printf("can't write properly\n");
exit(10);
}
}
printf( "Done\n" );
return 0;
}With one reader thread doing its things via mmapped region and writer doing its job via pwrite() (which is what (*File) WriteAt() does under the hood). To me it gives things like reader mismatch, first 4, read 5 at 1478 or reader mismatch, first 6, read 7 at 1115 easily.
Is this a correct analogy or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a valid point, although it's highly unlikely in practice. We have two meta pages, usually Readonly TXN won't read the same meta page as the RW TXN at the same time; even it does, if it reads some dirty data or partially written data, then the checksum won't match, so it still fallback to the other meta page. It's exactly the reason why we never see any issue in the concurrent test
Lines 1128 to 1140 in 092ee98
| metaA := db.meta0 | |
| metaB := db.meta1 | |
| if db.meta1.Txid() > db.meta0.Txid() { | |
| metaA = db.meta1 | |
| metaB = db.meta0 | |
| } | |
| // Use higher meta page if valid. Otherwise, fallback to previous, if valid. | |
| if err := metaA.Validate(); err == nil { | |
| return metaA | |
| } else if err := metaB.Validate(); err == nil { | |
| return metaB | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's highly unlikely in practice
I agree with that, the amount of data is small, there is validation, yet it's still a race that can happen. Threads can be stalled at any point on a busy machine as well in which case chances can get higher. Like consider validation done successfully in meta(), but then the reader thread paused and waiting in the queue while writers change the contents before or in the middle of reader doing a Copy().
metalock is supposed to protect meta page, but it looks like the only place where we're modifying it is not protected in fact. Since page update is not atomic a concurrent reader (RO transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but still we better not allow this to happen. Signed-off-by: Roman Khimov <[email protected]>
ahrtr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thx
We need to backport the fix to release-1.4 and release-1.3.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, roman-khimov The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| tx.meta.Write(p) | ||
|
|
||
| // Write the meta page to file. | ||
| tx.db.metalock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some performance penalty under high concurrent readonly TXNs (especially short live TXNs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writeAt() should be fast, it's a single page. And the next patch will make readers work much better anyway. The thing is that I want to leverage this lock in that patch now.
Let's discuss this part of #967 (not related to the original problem in fact). But it can affect ed58abd rework, a better version of it is still in progress.
metalock is supposed to protect meta page, but it looks like the only place where we're modifying it is not protected in fact. Since page update is not atomic a concurrent reader (like transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but still we better not allow this to happen.