Skip to content

Commit e126508

Browse files
committed
Fix a race condition in FIFO size-based compaction where concurrent threads could select the same non-L0 file (#13946)
Summary: **Context/Summary:** Fix a race condition (illustrated below) in FIFO size-based compaction where concurrent threads could select the same non-L0 file, causing assertion failures in debug builds or "Cannot delete table file from LSM tree" errors in release builds. ``` Thread 1 Thread 2 -------- -------- FIFO size-based compaction ↓ Pick L2 file ↓ Mark: file.being_compacted = true (file.being_compacted was false) ↓ WriteManifestStart (unlock mutex) ─→ FIFO size-based compaction starts ↓ ↓ Continue manifest write... Pick SAME L2 file ↓ Mark: file.being_compacted = true (file.being_compacted was true) ❌ ↓ ↓ ↓ Unlock mutex, wait for manifest ↓ ↓ Lock mutex ←─────────────────────────────────┘ ↓ Delete L2 file ✅ ↓ Complete ─────────────────────────────→ Try delete same file ❌ ↓ ERROR: "file not in LSM tree" 🐛 BUG: Both threads pick the same file! Thread 2 doesn't properly check file.being_compacted flag ``` **Test** New test that fails before the fix and passes after Pull Request resolved: #13946 Reviewed By: xingbowang Differential Revision: D82279731 Pulled By: hx235 fbshipit-source-id: b426517f2d1b23dd7d4951157822a2d322fe1435
1 parent 46c255c commit e126508

File tree

3 files changed

+68
-0
lines changed

3 files changed

+68
-0
lines changed

db/compaction/compaction_picker_fifo.cc

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -258,6 +258,9 @@ Compaction* FIFOCompactionPicker::PickSizeCompaction(
258258
// better serves a major type of FIFO use cases where smaller keys are
259259
// associated with older data.
260260
for (const auto& f : last_level_files) {
261+
if (f->being_compacted) {
262+
continue;
263+
}
261264
total_size -= f->fd.file_size;
262265
inputs[0].files.push_back(f);
263266
char tmp_fsize[16];

db/db_compaction_test.cc

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7128,6 +7128,70 @@ TEST_F(DBCompactionTest, PartialManualCompaction) {
71287128
ASSERT_OK(dbfull()->CompactRange(cro, nullptr, nullptr));
71297129
}
71307130

7131+
TEST_F(DBCompactionTest, ConcurrentFIFOPickingSameFileBug) {
7132+
Options opts = CurrentOptions();
7133+
opts.compaction_style = CompactionStyle::kCompactionStyleLevel;
7134+
opts.num_levels = 3;
7135+
opts.disable_auto_compactions = true;
7136+
opts.max_background_jobs = 3;
7137+
7138+
DestroyAndReopen(opts);
7139+
7140+
ASSERT_OK(Put("k1", "v1"));
7141+
ASSERT_OK(Flush());
7142+
7143+
// Create a non-L0 SST file for multi-level FIFO size-based compaction later
7144+
MoveFilesToLevel(2);
7145+
7146+
Options opts_new(opts);
7147+
opts_new.compaction_style = CompactionStyle::kCompactionStyleFIFO;
7148+
opts_new.max_open_files = -1;
7149+
// Set a low threshold to trigger multi-level size-based compaction
7150+
opts_new.compaction_options_fifo.max_table_files_size = 1;
7151+
7152+
Reopen(opts_new);
7153+
7154+
const CompactRangeOptions cro;
7155+
const Slice begin_key("k1");
7156+
const Slice end_key("k2");
7157+
7158+
std::unique_ptr<port::Thread> concurrent_compaction;
7159+
7160+
bool within_first_compaction = true;
7161+
SyncPoint::GetInstance()->SetCallBack(
7162+
"VersionSet::LogAndApply:WriteManifestStart", [&](void* /*arg*/) {
7163+
if (!within_first_compaction) {
7164+
return;
7165+
}
7166+
within_first_compaction = false;
7167+
7168+
// To allow the second/concurrent compaction to still see the non-L0
7169+
// SST file and coerce the bug of picking that file
7170+
SyncPoint::GetInstance()->LoadDependency({
7171+
{"DBImpl::BackgroundCompaction:BeforeCompaction",
7172+
"VersionSet::LogAndApply:WriteManifest"},
7173+
});
7174+
7175+
concurrent_compaction.reset(new port::Thread([&]() {
7176+
// Before the fix, the second CompactRange() will either fail the
7177+
// assertion of double file picking `being_compacted !=
7178+
// inputs_[i][j]->being_compacted` in debug mode or cause LSM shape
7179+
// corruption "Cannot delete table file XXX from level 2 since it is
7180+
// not in the LSM tree" in release mode
7181+
Status s = db_->CompactRange(cro, &begin_key, &end_key);
7182+
ASSERT_OK(s);
7183+
}));
7184+
});
7185+
7186+
SyncPoint::GetInstance()->EnableProcessing();
7187+
Status s = db_->CompactRange(cro, &begin_key, &end_key);
7188+
SyncPoint::GetInstance()->DisableProcessing();
7189+
7190+
ASSERT_OK(s);
7191+
7192+
concurrent_compaction->join();
7193+
}
7194+
71317195
TEST_F(DBCompactionTest, ManualCompactionFailsInReadOnlyMode) {
71327196
// Regression test for bug where manual compaction hangs forever when the DB
71337197
// is in read-only mode. Verify it now at least returns, despite failing.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Fix a race condition in FIFO size-based compaction where concurrent threads could select the same non-L0 file, causing assertion failures in debug builds or "Cannot delete table file from LSM tree" errors in release builds.

0 commit comments

Comments
 (0)