
gRPC timeout configs#365

Open
gene-bordegaray wants to merge 6 commits into main from gene.bordegaray/2026/03/gRPC_timeout_configs

Conversation


gene-bordegaray (Collaborator) commented Mar 6, 2026

Added the option to set some tonic / gRPC configs related to timeouts. I see this being useful for new adopters who want to easily configure DFD to their liking.

Added Configs

Docs can be found here: tonic, gRPC

  • grpc_connect_timeout_ms
    • can I establish a channel to the worker?
  • grpc_request_timeout_ms
    • can the full do_get RPC finish in time?
  • wait_plan_timeout_ms
    • once I reached the worker, how long do I wait for task data?
  • grpc_tcp_keepalive_ms
    • how do we notice the underlying TCP connection failed?
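The four options above can be sketched as a plain config struct whose millisecond fields are converted to `Duration`s at the point of use. This is a minimal stdlib-only sketch: the field names come from this PR, but the default values and the `unwrap_or_default` fallback shown here are illustrative assumptions, not the PR's actual defaults.

```rust
use std::time::Duration;

/// Sketch of the new config surface; the defaults below are illustrative only.
struct DistributedConfig {
    grpc_connect_timeout_ms: u64, // can I establish a channel to the worker?
    grpc_request_timeout_ms: u64, // can the full do_get RPC finish in time?
    wait_plan_timeout_ms: u64,    // how long do I wait for task data?
    grpc_tcp_keepalive_ms: u64,   // how do we notice a dead TCP connection?
}

impl Default for DistributedConfig {
    fn default() -> Self {
        Self {
            grpc_connect_timeout_ms: 5_000,
            grpc_request_timeout_ms: 60_000,
            wait_plan_timeout_ms: 10_000,
            grpc_tcp_keepalive_ms: 30_000,
        }
    }
}

fn main() {
    // An explicit config overrides the defaults; `None` falls back to them,
    // mirroring the `unwrap_or` pattern used elsewhere in this PR.
    let user_cfg: Option<DistributedConfig> = None;
    let cfg = user_cfg.unwrap_or_default();

    // Millisecond fields become Durations only where they are consumed.
    let connect_timeout = Duration::from_millis(cfg.grpc_connect_timeout_ms);
    assert_eq!(connect_timeout, Duration::from_secs(5));
    println!("connect timeout: {connect_timeout:?}");
}
```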

gene-bordegaray force-pushed the gene.bordegaray/2026/03/gRPC_timeout_configs branch from 4ef7b3c to f3f5769 on March 14, 2026 at 12:10

gabotechs commented Mar 17, 2026

Thanks for submitting this! Unfortunately I don't think I'll have time to review this any time soon. However, the first thing that comes to mind is that a 700 LOC PR for just timeout configuration is suspiciously too much code; I'd expect this kind of addition to take at least an order of magnitude less code.

This tends to happen when you let LLMs run too freely with the testing approach and/or the documentation. If you think there's an opportunity to reduce the amount of code, that would be more than welcome.

use uuid::Uuid;

#[tokio::test]
async fn do_get_honors_propagated_wait_timeout() {
gene-bordegaray (Collaborator, author) commented:

Seems do_get was mostly tested indirectly before this. It seems fitting to add this here, as it's easy to isolate the failure.

gabotechs (Collaborator) replied:

It had some tests before, but we deleted them on purpose, as do_get is an internal detail that is susceptible to change, making this test break.

I can see how this test will break in the future even if the functionality remains the same, maybe this is better represented as an integration test?


gabotechs (Collaborator) left a comment


😅 ended up leaving a bunch of comments.

The TL;DR is that I think there are different reasons why each individual timeout config shouldn't be in this project. I left individual comments for the different configs; let me know what you think.

Comment on lines +51 to +53
/// Maximum time to wait while establishing a gRPC connection to a worker.
/// This is intended to bound connection setup to unreachable workers.
pub grpc_connect_timeout_ms: usize, default = grpc_connect_timeout_ms_default()

🤔 This one is not really per-request. The connection happens only once, and once a worker-to-worker connection is established, it is never broken. This means this parameter will be respected just once, on the first query after a deployment, and then ignored until the worker process restarts.

Comment on lines +95 to +106
let default_cfg = DistributedConfig::default();
let grpc_connect_timeout_ms = distributed_cfg
    .map(|cfg| cfg.grpc_connect_timeout_ms)
    .unwrap_or(default_cfg.grpc_connect_timeout_ms);
let grpc_tcp_keepalive_ms = distributed_cfg
    .map(|cfg| cfg.grpc_tcp_keepalive_ms)
    .unwrap_or(default_cfg.grpc_tcp_keepalive_ms);
let key = DefaultChannelResolverKey {
    runtime_addr: Arc::as_ptr(&task_ctx.runtime_env()) as usize,
    grpc_connect_timeout_ms,
    grpc_tcp_keepalive_ms,
};

This starts to get a bit hacky; it was already hacky before, but I don't think we should be doubling down on this pattern.


I think this is a limitation that could be better solved if the methods in ChannelResolver accepted the Arc&lt;TaskContext&gt; they are running in as an argument.
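A minimal stdlib-only sketch of that suggestion, with stub types standing in for the real ChannelResolver and TaskContext (all names and signatures here are illustrative assumptions, not the project's actual API):

```rust
use std::sync::Arc;

// Stub standing in for DataFusion's TaskContext (hypothetical field).
struct TaskContext {
    grpc_connect_timeout_ms: u64,
}

// If the resolver received the TaskContext it runs under, per-context
// settings could be read directly instead of being smuggled through a
// pointer-based cache key.
trait ChannelResolver {
    fn resolve(&self, worker_addr: &str, ctx: &Arc<TaskContext>) -> String;
}

struct DefaultChannelResolver;

impl ChannelResolver for DefaultChannelResolver {
    fn resolve(&self, worker_addr: &str, ctx: &Arc<TaskContext>) -> String {
        // A real implementation would build a gRPC channel here; this stub
        // just shows the config flowing in through the context argument.
        format!("{worker_addr} (connect timeout {}ms)", ctx.grpc_connect_timeout_ms)
    }
}

fn main() {
    let ctx = Arc::new(TaskContext { grpc_connect_timeout_ms: 5_000 });
    let endpoint = DefaultChannelResolver.resolve("worker-1:50051", &ctx);
    assert!(endpoint.contains("5000ms"));
    println!("{endpoint}");
}
```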


let headers = metadata.into_headers();
let mut cfg = SessionConfig::default();
set_distributed_option_extension_from_headers::<DistributedConfig>(&mut cfg, &headers)
    .map_err(datafusion_error_to_tonic_status)?;
gabotechs (Collaborator) commented Mar 23, 2026

It's a shame we need to pass the full DistributedConfig just for the wait_plan_timeout_ms.

If you ask me, I think I would not let the users configure the wait_plan_timeout_ms, this is an internal detail of how Distributed DataFusion works, and ideally users should not care about this.


Or if you strongly think it's necessary, I'd probably just let it be configurable at the Worker level, so that we don't need to thread the DistributedConfig across inter-worker calls just for this single field.

Comment on lines +174 to +175
let grpc_connect_timeout_ms = self.grpc_connect_timeout_ms;
let grpc_tcp_keepalive_ms = self.grpc_tcp_keepalive_ms;

These two parameters are only relevant if users decide to live with the DefaultChannelResolver, but I would expect the majority of users to have their own ChannelResolver, as they will likely need to adapt this project to their own networking infrastructure.

This means that, for any user that does not rely on DefaultChannelResolver, grpc_connect_timeout_ms and grpc_tcp_keepalive_ms will pretty much be dead parameters that do nothing to their queries.

Of the two scenarios that I imagine:

  • People who do not want to customize their networking setup: they probably also don't care about tweaking gRPC configs, and they expect this project to ship good defaults
  • People who do want to customize their networking setup: these parameters are not useful for them, as they build their gRPC clients on their own

Comment on lines +54 to +56
/// Total timeout for an outbound `do_get` request, in milliseconds.
/// This is a full-stream deadline for the whole RPC, not an idle timeout.
pub grpc_request_timeout_ms: usize, default = grpc_request_timeout_ms_default()

🤔 I imagine that users will have an API for serving their DataFusion queries, that they already have a timeout on that API, and that reaching the timeout in their API will probably propagate a cancellation recursively to all the workers.

What comes to mind is that the timeout should probably not even be imposed by Distributed DataFusion, and should just be whatever timeout users naturally have in their APIs.
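That idea — letting the caller's own API deadline bound the query, rather than a fixed per-RPC grpc_request_timeout_ms — can be sketched with a deadline carried alongside the request. This is a stdlib-only sketch; the function name and the flow are hypothetical illustrations, not anything from this PR:

```rust
use std::time::{Duration, Instant};

/// Remaining time budget for downstream work, given the caller's overall
/// deadline. Returns `None` once the deadline has passed, so a worker can
/// fail fast instead of applying its own fixed per-RPC timeout.
fn remaining_budget(deadline: Instant) -> Option<Duration> {
    deadline.checked_duration_since(Instant::now())
}

fn main() {
    // Suppose the user's API imposed a 30s deadline when the query started.
    let deadline = Instant::now() + Duration::from_secs(30);

    // Each hop forwards whatever budget is left, rather than a fixed value.
    match remaining_budget(deadline) {
        Some(budget) => println!("forward the query with ~{}s left", budget.as_secs()),
        None => println!("deadline already exceeded; cancel recursively"),
    }
    assert!(remaining_budget(deadline).is_some());
}
```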

/// Maximum time a worker waits for task data to become available before failing the request.
pub wait_plan_timeout_ms: usize, default = wait_plan_timeout_ms_default()
/// TCP keepalive period used for worker-to-worker connections.
pub grpc_tcp_keepalive_ms: usize, default = grpc_tcp_keepalive_ms_default()

Same as https://github.com/datafusion-contrib/datafusion-distributed/pull/365/changes#r2976715396: the TCP keepalive will be taken into account once, and then ignored until a worker restart.
