tx: add missing lock on meta page update #989

roman-khimov · 2025-06-13T14:36:18Z

Let's discuss this part of #967 separately (it's not actually related to the original problem there). It can affect the ed58abd rework, though; a better version of that is still in progress.

metalock is supposed to protect the meta page, but it looks like the only place where we modify it is not actually protected. Since a page update is not atomic, a concurrent reader (like a transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but we'd still better not allow this to happen.

k8s-ci-robot · 2025-06-13T14:36:28Z

Hi @roman-khimov. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Elbehery · 2025-06-13T14:39:34Z

/ok-to-test

ahrtr · 2025-06-18T15:21:45Z

tx.go

 	lg := tx.db.Logger()
 	buf := make([]byte, tx.db.pageSize)
 	p := tx.db.pageInBuffer(buf, 0)
+	tx.db.metalock.Lock()


It isn't correct.

We don't need to acquire metalock here; it's updating the tx's local meta page.

+1, i also thought the same, but u know better 👍🏽

OK, I see now, we're filling the buf basically. But then how about the writeAt below? It'll flush the buf onto the page 0 or 1 and that's exactly where (*DB) meta() reads data from.

Or that's the question of how pwrite() updates the page. Not sure about that.

iiuc bbolt uses CoW, which means the updates will be written to new pages, and only upon commit will the metadata be updated to point to the new pages

@ahrtr knows these details better 👍🏽
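To make the copy-on-write idea in the comment above concrete, here is a minimal in-memory sketch. Everything in it (the `store` type, `stage`, `commit`, `snapshot`) is hypothetical and invented for illustration; it is not bbolt's API and ignores files, freelists and B+trees entirely. It only shows that staged writes go to a fresh page and become visible when the root reference is switched.

```go
package main

import "fmt"

// store is a hypothetical, in-memory stand-in for a copy-on-write database:
// pages are never modified in place, and root names the current version.
type store struct {
	pages map[int]string // page id -> contents
	root  int            // id of the page readers should start from
	next  int            // next unused page id
}

func newStore(initial string) *store {
	return &store{pages: map[int]string{0: initial}, root: 0, next: 1}
}

// snapshot returns the root a reader would pin when it starts.
func (s *store) snapshot() int { return s.root }

// read returns the contents visible from a pinned root.
func (s *store) read(root int) string { return s.pages[root] }

// stage writes the new contents to a fresh page and returns its id.
// The current root, and therefore every reader, is unaffected.
func (s *store) stage(contents string) int {
	id := s.next
	s.next++
	s.pages[id] = contents
	return id
}

// commit publishes a staged page by switching the root to it.
func (s *store) commit(id int) { s.root = id }

func main() {
	s := newStore("v1")
	pinned := s.snapshot() // a reader that started before the write
	staged := s.stage("v2")

	fmt.Println(s.read(s.snapshot())) // prints "v1": staged but not committed
	s.commit(staged)
	fmt.Println(s.read(pinned), s.read(s.snapshot())) // prints "v1 v2"
}
```

The single `commit` step is the part this PR is about: in the real database that step is the meta page write, and the question in the thread is what a reader sees while it is in flight.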

> But then how about the writeAt below?

Only one write TXN is allowed at a time.

> Or that's the question of how pwrite() updates the page. Not sure about that.

pls refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md#storage-boltdb-feature

> Only one write TXN is allowed at a time.

That I know, that's rwlock. But my concern here is concurrent RO and RW transactions, and initialization specifically. To initialize an RO transaction we need to get the meta page and copy it, as in db.meta().Copy(tx.meta) (bbolt/tx.go, line 53 in 092ee98). The page is chosen by txid in (*DB) meta(), but an RW transaction can be updating one of these pages via writeAt. The question is whether this update is atomic or not. To me it's not, because if we strip away all the layers of Go, unsafe, structures and other things, the situation is not much different from this C snippet (a crude one, sorry, the last time I wrote C was years ago, I can't even name page_size correctly now):

    #include <fcntl.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    size_t const pageSize = 4096;
    char *mm;

    /* Reader: scans the mmapped page forever, exits on the first mixed byte. */
    void *reader(void *arg) {
        (void)arg;
        for (;;) {
            size_t i;
            char first, next;
            for (i = 0; i < pageSize; i++) {
                if (i == 0) {
                    first = mm[i];
                } else {
                    next = mm[i];
                    if (next != first) {
                        printf("reader mismatch, first %hhX, read %hhX at %zu\n", first, next, i);
                        exit(11);
                    }
                }
            }
        }
    }

    int main() {
        char b[pageSize];
        int fd, r, i;
        ssize_t written;
        pthread_t pth;

        /* O_CREAT requires a mode argument. */
        fd = open("data", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            printf("bad fd: %d\n", fd);
            exit(10);
        }
        r = ftruncate(fd, pageSize);
        if (r < 0) {
            printf("bad ftruncate: %d\n", r);
            exit(10);
        }
        /* mmap reports failure with MAP_FAILED, not NULL. */
        mm = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fd, 0);
        if (mm == MAP_FAILED) {
            printf("mmap failed\n");
            exit(10);
        }
        /* pthread_create returns 0 on success, a positive error number otherwise. */
        r = pthread_create(&pth, NULL, reader, NULL);
        if (r != 0) {
            printf("bad pthread_create: %d\n", r);
            exit(10);
        }
        /* Writer: overwrites the whole page with a single repeated byte. */
        for (i = 0; i < 1000; i++) {
            memset(b, i % 256, pageSize);
            written = pwrite(fd, b, pageSize, 0);
            if (written != (ssize_t)pageSize) {
                printf("can't write properly\n");
                exit(10);
            }
        }
        printf("Done\n");
        return 0;
    }

Here one reader thread does its work via the mmapped region while the writer does its job via pwrite() (which is what (*File) WriteAt() does under the hood). For me it easily produces things like reader mismatch, first 4, read 5 at 1478 or reader mismatch, first 6, read 7 at 1115.

Is this a correct analogy or am I missing something?

Yes, it's a valid point, although it's highly unlikely in practice. We have two meta pages, and usually a read-only TXN won't read the same meta page that the RW TXN is writing at the same time. Even if it does and reads dirty or partially written data, the checksum won't match, so it still falls back to the other meta page. That's exactly why we never see any issue in the concurrent test.

bbolt/db.go, lines 1128 to 1140 in 092ee98:

	metaA := db.meta0
	metaB := db.meta1
	if db.meta1.Txid() > db.meta0.Txid() {
		metaA = db.meta1
		metaB = db.meta0
	}

	// Use higher meta page if valid. Otherwise, fallback to previous, if valid.
	if err := metaA.Validate(); err == nil {
		return metaA
	} else if err := metaB.Validate(); err == nil {
		return metaB
	}
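The fallback described above can be tried out in isolation with a small sketch. The `meta` struct here is a hypothetical, much simplified page (a txid, a root id and a checksum), and the FNV-1a hash over the first two fields is an assumption made for the example rather than a claim about bbolt's actual on-disk layout. `pick` follows the same order as the quoted snippet: prefer the page with the higher txid, fall back to the other one if validation fails.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/fnv"
)

// meta is a hypothetical, simplified meta page. It is not bbolt's real layout.
type meta struct {
	txid     uint64
	root     uint64
	checksum uint64
}

// sum hashes the fields that precede checksum (FNV-1a, assumed for this sketch).
func (m *meta) sum() uint64 {
	var buf [16]byte
	binary.LittleEndian.PutUint64(buf[0:8], m.txid)
	binary.LittleEndian.PutUint64(buf[8:16], m.root)
	h := fnv.New64a()
	h.Write(buf[:])
	return h.Sum64()
}

// seal stores a checksum matching the current field values.
func (m *meta) seal() { m.checksum = m.sum() }

// validate reports whether the stored checksum matches the fields.
func (m *meta) validate() error {
	if m.checksum != m.sum() {
		return errors.New("checksum mismatch")
	}
	return nil
}

// pick prefers the page with the higher txid and falls back to the other
// one when the preferred page does not validate.
func pick(meta0, meta1 *meta) (*meta, error) {
	a, b := meta0, meta1
	if meta1.txid > meta0.txid {
		a, b = meta1, meta0
	}
	if a.validate() == nil {
		return a, nil
	}
	if b.validate() == nil {
		return b, nil
	}
	return nil, errors.New("both meta pages are invalid")
}

func main() {
	m0 := &meta{txid: 10, root: 3}
	m0.seal()
	m1 := &meta{txid: 11, root: 7}
	m1.seal()

	// Simulate a partially written page: a field changed, checksum is stale.
	m1.root = 8
	chosen, err := pick(m0, m1)
	fmt.Println(chosen.txid, err) // prints "10 <nil>"
}
```

Note that the check only covers the moment of validation. It says nothing about a page that changes after `pick` returns, which is the window discussed in the next comment.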

> it's highly unlikely in practice

I agree with that: the amount of data is small and there is validation, yet it's still a race that can happen. Threads can be stalled at any point on a busy machine, in which case the chances get higher. For example, validation can succeed in meta(), but then the reader thread is paused and waits in the queue while writers change the contents before, or in the middle of, the reader doing a Copy().
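A minimal sketch of the remedy, assuming both the page write and the reader's copy happen under the same mutex. The `metaPage` type and its methods are hypothetical stand-ins written for this example, not bbolt code; the point is only that a reader which copies under the lock can never see a half-updated page, no matter where it gets paused.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const pageSize = 4096

// metaPage is a hypothetical stand-in for a shared meta page guarded by a
// mutex, in the spirit of metalock.
type metaPage struct {
	mu   sync.Mutex
	data [pageSize]byte
}

// write fills the page with v while holding the lock.
func (p *metaPage) write(v byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := range p.data {
		p.data[i] = v
	}
}

// snapshot copies the page while holding the same lock.
func (p *metaPage) snapshot() [pageSize]byte {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.data
}

// uniform reports whether every byte equals the first one.
func uniform(page [pageSize]byte) bool {
	for _, b := range page {
		if b != page[0] {
			return false
		}
	}
	return true
}

// run starts one writer and several readers and returns how many torn
// snapshots the readers observed.
func run(writes, readers, readsEach int) int64 {
	var p metaPage
	var wg sync.WaitGroup
	var torn int64

	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < writes; i++ {
			p.write(byte(i % 256))
		}
	}()
	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < readsEach; i++ {
				if !uniform(p.snapshot()) {
					atomic.AddInt64(&torn, 1)
				}
			}
		}()
	}
	wg.Wait()
	return torn
}

func main() {
	fmt.Println(run(1000, 4, 1000)) // prints 0: no torn snapshots under the lock
}
```

The cost is that readers and the writer serialize on the page for the duration of one copy, which is the trade-off raised in the later review comment about short-lived read-only transactions.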

metalock is supposed to protect meta page, but it looks like the only place where we're modifying it is not protected in fact. Since page update is not atomic a concurrent reader (RO transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but still we better not allow this to happen.

Signed-off-by: Roman Khimov <[email protected]>

ahrtr

LGTM

thx

We need to backport the fix to release-1.4 and release-1.3.

k8s-ci-robot · 2025-06-19T10:58:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, roman-khimov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahrtr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahrtr · 2025-06-19T11:04:50Z

tx.go

 	tx.meta.Write(p)

 	// Write the meta page to file.
+	tx.db.metalock.Lock()


There might be some performance penalty under highly concurrent read-only TXNs (especially short-lived TXNs).

writeAt() should be fast, it's a single page. And the next patch will make readers work much better anyway. The thing is that I want to leverage this lock in that patch now.

ahrtr · 2025-06-20T09:38:44Z

cc @fuweid @tjungblu

k8s-ci-robot added needs-ok-to-test size/XS labels Jun 13, 2025

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jun 13, 2025

ahrtr reviewed Jun 18, 2025


roman-khimov force-pushed the metalock branch from 97719f7 to 249746f June 18, 2025 19:27

ahrtr approved these changes Jun 19, 2025


k8s-ci-robot added the approved label Jun 19, 2025

ahrtr added backport/v1.3 backport/v1.4 labels Jun 19, 2025

ahrtr reviewed Jun 19, 2025


ahrtr merged commit 68b0ba4 into etcd-io:main Jun 24, 2025
21 checks passed

roman-khimov mentioned this pull request Jun 24, 2025

Improve RO transaction setup #967

Closed

This was referenced Jun 25, 2025

Plan to release v1.4.2 #996

Closed

Protect meta page when it's being written #1004

Closed

[release-1.4] Protect meta page when it's being written #1005

Merged

[release-1.3] Protect meta page when it's being written #1006

Merged

Elbehery mentioned this pull request Jul 20, 2025

Investigate Github's Windows Runner Timeout problem #1034

Closed



4 participants
