Skip to content

Conversation

@roman-khimov
Copy link
Contributor

Let's discuss this part of #967 (not related to the original problem in fact). But it can affect ed58abd rework, a better version of it is still in progress.

metalock is supposed to protect meta page, but it looks like the only place where we're modifying it is not protected in fact. Since page update is not atomic a concurrent reader (like transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but still we better not allow this to happen.

@k8s-ci-robot
Copy link

Hi @roman-khimov. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Elbehery
Copy link
Member

/ok-to-test

tx.go Outdated
lg := tx.db.Logger()
buf := make([]byte, tx.db.pageSize)
p := tx.db.pageInBuffer(buf, 0)
tx.db.metalock.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't correct.

We don't need acquire metalock here, it's updating tx's local meta page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, i also thought the same, but u know better 👍🏽

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see now, we're filling the buf basically. But then how about the writeAt below? It'll flush the buf onto the page 0 or 1 and that's exactly where (*DB) meta() reads data from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or that's the question of how pwrite() updates the page. Not sure about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc bbolt uses CoW, which means the updates will be written to a new pages, and only upon commit the metadata list will be updated to point to the new pages

@ahrtr know better these details 👍🏽

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then how about the writeAt below?

Only one write TXN is allowed at a time.

Or that's the question of how pwrite() updates the page. Not sure about that.

pls refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md#storage-boltdb-feature

Copy link
Contributor Author

@roman-khimov roman-khimov Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one write TXN is allowed at a time.

That I know, that's rwlock. But my concern here is concurrent RO and RW transactions and initialization specifically. To initialize an RO transaction we need to get the meta page and copy it, like in

bbolt/tx.go

Line 53 in 092ee98

db.meta().Copy(tx.meta)
. The page is to be chosen by txid in (*DB) meta(), but an RW transaction can be updating one of these pages via writeAt. The question is whether this update is atomic or not. To me it's not, because if we're to strip down all the layers of Go, unsafe, structures and other things the situation is not much different from this C snippet (crude one, sorry, the last time I wrote C was years ago, I can't even name page_size correctly now):

#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

size_t const pageSize = 4096;
char *mm;

void *reader(void *arg)
{
        for (;;) {
                int i;
                char first, next;

                for (i = 0; i < pageSize; i++) {
                        if (i == 0) {
                                first = mm[i];
                        } else {
                                next = mm[i];
                                if (next != first) {
                                        printf("reader mismatch, first %hhX, read %hhX at %d", first, next, i);
                                        exit(11);
                                }
                        }
                }
        }
}

int main()
{
        char b[pageSize];
        int fd, r, i;
        ssize_t written;
        pthread_t pth;

        fd = open("data", O_RDWR | O_CREAT);
        if (fd < 0) {
                printf("bad fd: %d\n", fd);
                exit(10);
        }
        r = ftruncate(fd, pageSize);
        if (r < 0) {
                printf("bad ftruncate: %d\n", r);
                exit(10);
        }
        mm = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fd, 0);
        if (mm == NULL) {
                printf("mmap failed\n");
                exit(10);
        }
        r = pthread_create(&pth, NULL, reader, NULL);
        if (r < 0) {
                printf("bad pthread_create: %d\n", r);
                exit(10);
        }
        for (i = 0; i < 1000; i++) {
                memset(b, i%256, pageSize);
                written = pwrite(fd, b, pageSize, 0);
                if (written != pageSize) {
                        printf("can't write properly\n");
                        exit(10);
                }
        }
        printf( "Done\n" );
        return 0;
}

With one reader thread doing its things via mmapped region and writer doing its job via pwrite() (which is what (*File) WriteAt() does under the hood). To me it gives things like reader mismatch, first 4, read 5 at 1478 or reader mismatch, first 6, read 7 at 1115 easily.

Is this a correct analogy or am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a valid point, although it's highly unlikely in practice. We have two meta pages, usually Readonly TXN won't read the same meta page as the RW TXN at the same time; even it does, if it reads some dirty data or partially written data, then the checksum won't match, so it still fallback to the other meta page. It's exactly the reason why we never see any issue in the concurrent test

bbolt/db.go

Lines 1128 to 1140 in 092ee98

metaA := db.meta0
metaB := db.meta1
if db.meta1.Txid() > db.meta0.Txid() {
metaA = db.meta1
metaB = db.meta0
}
// Use higher meta page if valid. Otherwise, fallback to previous, if valid.
if err := metaA.Validate(); err == nil {
return metaA
} else if err := metaB.Validate(); err == nil {
return metaB
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's highly unlikely in practice

I agree with that, the amount of data is small, there is validation, yet it's still a race that can happen. Threads can be stalled at any point on a busy machine as well in which case chances can get higher. Like consider validation done successfully in meta(), but then the reader thread paused and waiting in the queue while writers change the contents before or in the middle of reader doing a Copy().

metalock is supposed to protect meta page, but it looks like the only place
where we're modifying it is not protected in fact. Since page update is not
atomic a concurrent reader (RO transaction) can get an inconsistent page. It's
likely to fall back to the other one in this case, but still we better not
allow this to happen.

Signed-off-by: Roman Khimov <[email protected]>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

thx

We need to backport the fix to release-1.4 and release-1.3.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, roman-khimov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tx.meta.Write(p)

// Write the meta page to file.
tx.db.metalock.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some performance penalty under high concurrent readonly TXNs (especially short live TXNs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeAt() should be fast, it's a single page. And the next patch will make readers work much better anyway. The thing is that I want to leverage this lock in that patch now.

@ahrtr
Copy link
Member

ahrtr commented Jun 20, 2025

cc @fuweid @tjungblu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants