Commit 68a2d61

Authored by eric-wang-1990, sreekanth-db, Jade Wang, and claude
feat(csharp/Benchmarks): Add CloudFetch E2E performance benchmark (#3660)
## Summary

Adds a comprehensive E2E benchmark for Databricks CloudFetch to measure real-world performance against an actual cluster with configurable queries.

## Changes

- **CloudFetchRealE2EBenchmark**: Real E2E benchmark against an actual Databricks cluster
  - Configurable via a JSON file (`DATABRICKS_TEST_CONFIG_FILE` environment variable)
  - Power BI consumption simulation with batch-size-proportional delays (5 ms per 10K rows)
  - Peak memory tracking using `Process.WorkingSet64`
  - Custom peak memory column in the results table, with a fallback reference to the console output
- **CloudFetchBenchmarkRunner**: Standalone runner for CloudFetch benchmarks
  - Simplified to run only the real E2E benchmark
  - Reduced iteration counts (1 warmup + 3 actual) for faster execution
  - Hides the confusing Error/StdDev columns from the summary table
- **README.md**: Documentation for running and understanding the benchmarks

## Configuration

The benchmark requires the `DATABRICKS_TEST_CONFIG_FILE` environment variable to point to a JSON config:

```json
{
  "uri": "https://your-workspace.cloud.databricks.com/sql/1.0/warehouses/xxx",
  "token": "dapi...",
  "query": "select * from main.tpcds_sf1_delta.catalog_sales"
}
```

## Run Command

```bash
export DATABRICKS_TEST_CONFIG_FILE=/path/to/config.json
cd csharp
dotnet run -c Release --project Benchmarks/Benchmarks.csproj --framework net8.0 CloudFetchBenchmarkRunner -- --filter "*"
```

## Example Output

**Console output during benchmark execution:**

```
Loaded config from: /path/to/databricks-config.json
Hostname: adb-6436897454825492.12.azuredatabricks.net
HTTP Path: /sql/1.0/warehouses/2f03dd43e35e2aa0
Query: select * from main.tpcds_sf1_delta.catalog_sales
Benchmark will test CloudFetch with 5ms per 10K rows read delay

// Warmup
CloudFetch E2E [Delay=5ms/10K rows] - Peak memory: 272.97 MB
WorkloadWarmup 1: 1 op, 11566591709.00 ns, 11.5666 s/op

// Actual iterations
CloudFetch E2E [Delay=5ms/10K rows] - Peak memory: 249.11 MB
WorkloadResult 1: 1 op, 8752445353.00 ns, 8.7524 s/op
CloudFetch E2E [Delay=5ms/10K rows] - Peak memory: 261.95 MB
WorkloadResult 2: 1 op, 9794630771.00 ns, 9.7946 s/op
CloudFetch E2E [Delay=5ms/10K rows] - Peak memory: 258.39 MB
WorkloadResult 3: 1 op, 9017280271.00 ns, 9.0173 s/op
```

**Summary table:**

```
BenchmarkDotNet v0.15.4, macOS Sequoia 15.7.1 (24G231) [Darwin 24.6.0]
Apple M1 Max, 1 CPU, 10 logical and 10 physical cores
.NET SDK 8.0.407
  [Host] : .NET 8.0.19 (8.0.19, 8.0.1925.36514), Arm64 RyuJIT armv8.0-a

| Method            | ReadDelayMs | Mean    | Min     | Max     | Median  | Peak Memory (MB)            | Gen0       | Gen1       | Gen2       | Allocated |
|------------------ |------------ |--------:|--------:|--------:|--------:|----------------------------:|-----------:|-----------:|-----------:|----------:|
| ExecuteLargeQuery | 5           | 9.19 s  | 8.75 s  | 9.79 s  | 9.02 s  | See previous console output | 28000.0000 | 28000.0000 | 28000.0000 | 1.78 GB   |
```

**Key Metrics:**

- **E2E Time**: 8.75-9.79 seconds (includes query execution, CloudFetch downloads, LZ4 decompression, batch consumption)
- **Peak Memory**: 249-262 MB (tracked via `Process.WorkingSet64`, printed in the console)
- **Total Allocated**: 1.78 GB managed memory
- **GC Collections**: 28K Gen0/Gen1/Gen2 collections

## Test Plan

- [x] Built successfully
- [x] Verified the benchmark runs against a real Databricks cluster
- [x] Confirmed peak memory tracking works
- [x] Validated that Power BI simulation delays are proportional to batch size
- [x] Checked results table formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Sreekanth Vadigi <[email protected]>
Co-authored-by: Sreekanth Vadigi <[email protected]>
Co-authored-by: Jade Wang <[email protected]>
Co-authored-by: Claude <[email protected]>
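The batch-size-proportional delay described above (5 ms per 10K rows) can be sketched as follows. This is a minimal Python illustration of the same arithmetic the benchmark performs in C#; the function name is hypothetical and not part of the commit:

```python
def read_delay_ms(batch_rows: int, delay_ms_per_10k_rows: int = 5) -> int:
    """Delay applied after consuming one batch, proportional to its size.

    Mirrors the benchmark's formula: (rows / 10000) * ReadDelayMs,
    truncated to whole milliseconds.
    """
    return int((batch_rows / 10000.0) * delay_ms_per_10k_rows)

# A 50,000-row batch at 5 ms per 10K rows sleeps for 25 ms;
# batches under 2,000 rows truncate to 0 ms, so no sleep occurs.
print(read_delay_ms(50000))  # 25
print(read_delay_ms(1999))   # 0
```

Because the delay scales with batch size rather than being a fixed per-batch pause, total simulated consumption time is roughly proportional to the number of rows, which is what a slow downstream consumer like Power BI looks like to CloudFetch.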
1 parent b7780ec commit 68a2d61

File tree

4 files changed

+479
-0
lines changed


csharp/Benchmarks/Benchmarks.csproj

Lines changed: 7 additions & 0 deletions
```diff
@@ -6,16 +6,23 @@
     <ImplicitUsings>enable</ImplicitUsings>
     <Nullable>enable</Nullable>
     <ProcessArchitecture>$([System.Runtime.InteropServices.RuntimeInformation]::ProcessArchitecture.ToString().ToLowerInvariant())</ProcessArchitecture>
+    <StartupObject>Apache.Arrow.Adbc.Benchmarks.CloudFetchBenchmarkRunner</StartupObject>
   </PropertyGroup>

   <ItemGroup>
     <PackageReference Include="BenchmarkDotNet" />
     <PackageReference Include="DuckDB.NET.Bindings.Full" GeneratePathProperty="true" />
   </ItemGroup>

+  <ItemGroup>
+    <PackageReference Include="K4os.Compression.LZ4" />
+    <PackageReference Include="K4os.Compression.LZ4.Streams" />
+  </ItemGroup>
+
   <ItemGroup>
     <ProjectReference Include="..\src\Apache.Arrow.Adbc\Apache.Arrow.Adbc.csproj" />
     <ProjectReference Include="..\src\Client\Apache.Arrow.Adbc.Client.csproj" />
+    <ProjectReference Include="..\src\Drivers\Databricks\Apache.Arrow.Adbc.Drivers.Databricks.csproj" />
     <ProjectReference Include="..\test\Apache.Arrow.Adbc.Tests\Apache.Arrow.Adbc.Tests.csproj" />
   </ItemGroup>
```

Lines changed: 44 additions & 0 deletions
```csharp
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

using Apache.Arrow.Adbc.Benchmarks.Databricks;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;

namespace Apache.Arrow.Adbc.Benchmarks
{
    /// <summary>
    /// Standalone runner for CloudFetch benchmarks only.
    /// Usage: dotnet run -c Release --framework net8.0 CloudFetchBenchmarkRunner
    /// </summary>
    public class CloudFetchBenchmarkRunner
    {
        public static void Main(string[] args)
        {
            // Include the peak memory column and hide the statistical columns
            // that are misleading with so few iterations.
            var config = DefaultConfig.Instance
                .AddColumn(new PeakMemoryColumn())
                .HideColumns("Error", "StdDev");

            // Run only the real E2E CloudFetch benchmark
            var summary = BenchmarkSwitcher.FromTypes(new[] {
                typeof(CloudFetchRealE2EBenchmark) // Real E2E with Databricks (requires credentials)
            }).Run(args, config);
        }
    }
}
```
Lines changed: 286 additions & 0 deletions
```csharp
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using Apache.Arrow.Adbc.Drivers.Apache.Spark;
using Apache.Arrow.Adbc.Drivers.Databricks;
using Apache.Arrow.Ipc;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Running;

namespace Apache.Arrow.Adbc.Benchmarks.Databricks
{
    /// <summary>
    /// Custom column to display peak memory usage in the benchmark results table.
    /// </summary>
    public class PeakMemoryColumn : IColumn
    {
        public string Id => nameof(PeakMemoryColumn);
        public string ColumnName => "Peak Memory (MB)";
        public string Legend => "Peak working set memory during benchmark execution";
        public UnitType UnitType => UnitType.Size;
        public bool AlwaysShow => true;
        public ColumnCategory Category => ColumnCategory.Custom;
        public int PriorityInCategory => 0;
        public bool IsNumeric => true;
        public bool IsAvailable(Summary summary) => true;
        public bool IsDefault(Summary summary, BenchmarkCase benchmarkCase) => false;

        public string GetValue(Summary summary, BenchmarkCase benchmarkCase)
        {
            // Try CloudFetchRealE2EBenchmark (includes parameters in the key)
            if (benchmarkCase.Descriptor.Type == typeof(CloudFetchRealE2EBenchmark))
            {
                // Extract the ReadDelayMs parameter
                var readDelayParam = benchmarkCase.Parameters["ReadDelayMs"];
                string key = $"ExecuteLargeQuery_{readDelayParam}";
                if (CloudFetchRealE2EBenchmark.PeakMemoryResults.TryGetValue(key, out var peakMemoryMB))
                {
                    return $"{peakMemoryMB:F2}";
                }
            }

            return "See previous console output";
        }

        public string GetValue(Summary summary, BenchmarkCase benchmarkCase, SummaryStyle style)
        {
            return GetValue(summary, benchmarkCase);
        }

        public override string ToString() => ColumnName;
    }

    /// <summary>
    /// Configuration model for the Databricks test configuration JSON file.
    /// </summary>
    internal class DatabricksTestConfig
    {
        public string? uri { get; set; }
        public string? token { get; set; }
        public string? query { get; set; }
        public string? type { get; set; }
        public string? catalog { get; set; }
        public string? schema { get; set; }
    }

    /// <summary>
    /// Real E2E performance benchmark for Databricks CloudFetch against an actual cluster.
    ///
    /// Prerequisites:
    /// - Set the DATABRICKS_TEST_CONFIG_FILE environment variable
    /// - The config file should contain cluster connection details
    ///
    /// Run with: dotnet run -c Release --project Benchmarks/Benchmarks.csproj --framework net8.0 -- --filter "*CloudFetchRealE2E*" --job dry
    ///
    /// Measures:
    /// - Peak memory usage
    /// - Total allocations
    /// - GC collections
    /// - Query execution time
    /// - Row processing throughput
    ///
    /// Parameters:
    /// - ReadDelayMs: Fixed at 5 milliseconds per 10K rows to simulate Power BI consumption
    /// </summary>
    [MemoryDiagnoser]
    [GcServer(true)]
    [SimpleJob(warmupCount: 1, iterationCount: 3)]
    [MinColumn, MaxColumn, MeanColumn, MedianColumn]
    public class CloudFetchRealE2EBenchmark
    {
        // Static dictionary to store peak memory results for the custom column
        public static readonly Dictionary<string, double> PeakMemoryResults = new Dictionary<string, double>();

        private AdbcConnection? _connection;
        private Process _currentProcess = null!;
        private long _peakMemoryBytes;
        private DatabricksTestConfig _testConfig = null!;
        private string _hostname = null!;
        private string _httpPath = null!;

        [Params(5)] // Read delay in milliseconds per 10K rows (5 = simulate Power BI)
        public int ReadDelayMs { get; set; }

        [GlobalSetup]
        public void GlobalSetup()
        {
            // Check if the Databricks config is available
            string? configFile = Environment.GetEnvironmentVariable("DATABRICKS_TEST_CONFIG_FILE");
            if (string.IsNullOrEmpty(configFile))
            {
                throw new InvalidOperationException(
                    "DATABRICKS_TEST_CONFIG_FILE environment variable must be set. " +
                    "Set it to the path of your Databricks test configuration JSON file.");
            }

            // Read and parse the config file
            string configJson = File.ReadAllText(configFile);
            _testConfig = JsonSerializer.Deserialize<DatabricksTestConfig>(configJson)
                ?? throw new InvalidOperationException("Failed to parse config file");

            if (string.IsNullOrEmpty(_testConfig.uri) || string.IsNullOrEmpty(_testConfig.token))
            {
                throw new InvalidOperationException("Config file must contain 'uri' and 'token' fields");
            }

            if (string.IsNullOrEmpty(_testConfig.query))
            {
                throw new InvalidOperationException("Config file must contain 'query' field");
            }

            // Parse the URI to extract the hostname and http_path
            // Format: https://hostname/sql/1.0/warehouses/xxx
            var uri = new Uri(_testConfig.uri);
            _hostname = uri.Host;
            _httpPath = uri.PathAndQuery;

            _currentProcess = Process.GetCurrentProcess();
            Console.WriteLine($"Loaded config from: {configFile}");
            Console.WriteLine($"Hostname: {_hostname}");
            Console.WriteLine($"HTTP Path: {_httpPath}");
            Console.WriteLine($"Query: {_testConfig.query}");
            Console.WriteLine($"Benchmark will test CloudFetch with {ReadDelayMs}ms per 10K rows read delay");
        }

        [IterationSetup]
        public void IterationSetup()
        {
            // Create a connection for this iteration using config values
            var parameters = new Dictionary<string, string>
            {
                [AdbcOptions.Uri] = _testConfig.uri!,
                [SparkParameters.Token] = _testConfig.token!,
                [DatabricksParameters.UseCloudFetch] = "true",
                [DatabricksParameters.EnableDirectResults] = "true",
                [DatabricksParameters.CanDecompressLz4] = "true",
                [DatabricksParameters.MaxBytesPerFile] = "10485760", // 10MB per file
            };

            var driver = new DatabricksDriver();
            var database = driver.Open(parameters);
            _connection = database.Connect(parameters);

            // Reset peak memory tracking
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: false);
            GC.WaitForPendingFinalizers();
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: false);
            _currentProcess.Refresh();
            _peakMemoryBytes = _currentProcess.WorkingSet64;
        }

        [IterationCleanup]
        public void IterationCleanup()
        {
            _connection?.Dispose();
            _connection = null;

            // Print and store the peak memory for this iteration
            double peakMemoryMB = _peakMemoryBytes / 1024.0 / 1024.0;
            Console.WriteLine($"CloudFetch E2E [Delay={ReadDelayMs}ms/10K rows] - Peak memory: {peakMemoryMB:F2} MB");

            // Store in the static dictionary for the custom column (key includes the parameter)
            string key = $"ExecuteLargeQuery_{ReadDelayMs}";
            PeakMemoryResults[key] = peakMemoryMB;
        }

        /// <summary>
        /// Execute a large query against Databricks and consume all result batches.
        /// Simulates client behavior like Power BI reading data.
        /// Uses the query from the config file.
        /// </summary>
        [Benchmark]
        public async Task<long> ExecuteLargeQuery()
        {
            if (_connection == null)
            {
                throw new InvalidOperationException("Connection not initialized");
            }

            // Execute the query from the config file
            var statement = _connection.CreateStatement();
            statement.SqlQuery = _testConfig.query;

            var result = await statement.ExecuteQueryAsync();
            if (result.Stream == null)
            {
                throw new InvalidOperationException("Result stream is null");
            }

            // Read all batches and track peak memory
            long totalRows = 0;
            long totalBatches = 0;
            RecordBatch? batch;

            while ((batch = await result.Stream.ReadNextRecordBatchAsync()) != null)
            {
                totalRows += batch.Length;
                totalBatches++;

                // Track peak memory periodically
                if (totalBatches % 10 == 0)
                {
                    TrackPeakMemory();
                }

                // Simulate the Power BI processing delay if configured.
                // The delay is proportional to batch size: ReadDelayMs per 10K rows.
                if (ReadDelayMs > 0)
                {
                    int delayForBatch = (int)((batch.Length / 10000.0) * ReadDelayMs);
                    if (delayForBatch > 0)
                    {
                        Thread.Sleep(delayForBatch);
                    }
                }

                batch.Dispose();
            }

            // Final peak memory check
            TrackPeakMemory();

            statement.Dispose();
            return totalRows;
        }

        private void TrackPeakMemory()
        {
            _currentProcess.Refresh();
            long currentMemory = _currentProcess.WorkingSet64;
            if (currentMemory > _peakMemoryBytes)
            {
                _peakMemoryBytes = currentMemory;
            }
        }

        [GlobalCleanup]
        public void GlobalCleanup()
        {
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
            GC.WaitForPendingFinalizers();
        }
    }
}
```
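`GlobalSetup` splits the warehouse URI into a hostname and HTTP path via `Uri.Host` and `Uri.PathAndQuery`. The same split can be sketched in Python (illustrative only, not part of the commit; the function name is hypothetical):

```python
from urllib.parse import urlparse

def split_warehouse_uri(uri: str) -> tuple[str, str]:
    """Split a Databricks SQL warehouse URI into (hostname, http_path),
    mirroring the benchmark's use of Uri.Host and Uri.PathAndQuery."""
    parsed = urlparse(uri)
    # Reattach the query string, if any, to match PathAndQuery semantics
    path_and_query = parsed.path + (f"?{parsed.query}" if parsed.query else "")
    return parsed.netloc, path_and_query

host, http_path = split_warehouse_uri(
    "https://adb-6436897454825492.12.azuredatabricks.net/sql/1.0/warehouses/2f03dd43e35e2aa0")
# host      -> "adb-6436897454825492.12.azuredatabricks.net"
# http_path -> "/sql/1.0/warehouses/2f03dd43e35e2aa0"
```

These two components are the values printed as `Hostname:` and `HTTP Path:` in the example console output above.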
