@GWphua GWphua commented Oct 22, 2025

Fixes #18446

Description

This PR belongs to a set of PRs that aim to optimize the start-up time of the Historical. I came across this problem while running 100+ Historical servers, each of which needs to process a large number of segments (~85k) during start-up. Whenever the Historicals are updated, each segment takes 15.3 ms to load, so start-up for a single Historical easily takes >20 min. (That means about 1.5 days to roll an update across all Historical servers!)

I looked into using lazyLoadOnStart to speed up start-up. Lazy loading processes each segment's metadata in 3.23 ms, which shortens the start-up time to ~4 min. However, this strategy causes hiccups in query latency while an upgrade is in progress. I plan to solve this by selectively choosing which segments to load during Historical start-up.
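The start-up figures above follow from simple multiplication. The sketch below redoes that arithmetic; the class and method names are made up for illustration, and the fleet estimate assumes the 100 Historicals are restarted strictly one after another.

```java
// Back-of-the-envelope start-up estimate using the figures quoted above.
// StartupEstimate is a hypothetical helper, not part of Druid.
final class StartupEstimate {
    /** Minutes needed to process `segments` segments sequentially at `msPerSegment` each. */
    static double minutes(long segments, double msPerSegment) {
        return segments * msPerSegment / 1000.0 / 60.0;
    }

    public static void main(String[] args) {
        long segments = 85_000;                 // ~85k segments per Historical
        double eager = minutes(segments, 15.3); // full load cost per segment
        double lazy = minutes(segments, 3.23);  // lazy load cost per segment
        System.out.printf("eager: %.1f min, lazy: %.1f min%n", eager, lazy);

        // Assumption: 100 Historicals restarted one at a time.
        double fleetDays = eager * 100 / 60.0 / 24.0;
        System.out.printf("rolling update, eager: %.1f days%n", fleetDays);
    }
}
```

This works out to roughly 21.7 minutes per Historical with eager loading, roughly 4.6 minutes with lazy loading, and about 1.5 days for a sequential rolling update.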


Finding a middle ground

By studying the usage pattern of my clusters, I noticed that each Historical stores 1 month's worth of data, but heavy querying is restricted to the last 7 days, with only occasional queries for data older than that. Hence, I changed the segment loading logic during Historical start-up to provide a configurable time period within which segments are loaded eagerly (the rest are loaded lazily). The right period depends on the querying habits of the cluster.
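As a rough illustration of the idea, the decision for each segment can be reduced to a comparison between the end of the segment's interval and the start of the eager window. This is a minimal sketch, not the PR's code: `EagerPeriodSketch`, `Mode`, and `classify` are hypothetical names, and `java.time` is used only to keep the example self-contained.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of "eager for the recent period, lazy for the rest".
final class EagerPeriodSketch {
    enum Mode { EAGER, LAZY }

    /**
     * EAGER if the segment's interval reaches into the window (now - period, now];
     * LAZY otherwise. The segment end is treated as exclusive.
     */
    static Mode classify(Instant segmentEnd, Instant now, Duration period) {
        Instant windowStart = now.minus(period);
        return segmentEnd.isAfter(windowStart) ? Mode.EAGER : Mode.LAZY;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-10-22T00:00:00Z");
        Duration sevenDays = Duration.ofDays(7);
        // A segment ending two days ago falls inside the window.
        System.out.println(classify(Instant.parse("2025-10-20T00:00:00Z"), now, sevenDays));
        // A segment ending almost a month ago does not.
        System.out.println(classify(Instant.parse("2025-09-25T00:00:00Z"), now, sevenDays));
    }
}
```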

Here's a benchmark of the improvements. I also included a test for #18489, which shaves a further ~10 s off the start-up time. (Total of 78.5% improvement in loading time)

|Description|Loading Time|
|--------|-----------|
|Original|1,494.631 s|
|EagerLoadingForPeriod|328.819 s|
|EagerLoadingForPeriod + ConcurrentSegmentFileLoad|318.262 s|

Release note

Segment loading during Historical service startup is now configurable with druid.segmentCache.startupLoadStrategy. This new setting allows users to choose between the existing eager loading (loadAllEagerly), a new lazy loading (loadAllLazily) option for faster startups, and a hybrid strategy (loadEagerlyBeforePeriod) that ensures low query latency for the most recent data while deferring the loading cost of older data.

Deprecated isLazyLoadOnStart.
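
For illustration only, a Historical `runtime.properties` fragment using the hybrid strategy might look like the following. The property names are the ones quoted in this conversation (`druid.segmentCache.startupLoadStrategy` and `druid.segmentCache.startupLoadPeriod`), and `P7D` is an assumed example value. The exact shape of the configuration in the merged change may differ, so check the released documentation before relying on it.

```properties
# Assumed example: eagerly load the most recent 7 days, defer everything older.
druid.segmentCache.startupLoadStrategy=loadEagerlyBeforePeriod
druid.segmentCache.startupLoadPeriod=P7D
```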


Key changed/added classes in this PR
  • docs/configuration/index.md
  • SegmentStatsMonitor
  • SegmentLoaderConfig
  • SegmentLocalCacheManager
  • startup/HistoricalStartupCacheLoadStrategy
  • startup/HistoricalStartupCacheLoadStrategyFactory
  • startup/LoadAllEagerlyStrategy
  • startup/LoadAllLazilyStrategy
  • startup/LoadEagerlyBeforePeriod

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@GWphua GWphua changed the title from "Lazy load by period" to "Historical Startup -- Configurable loading strategy" Oct 22, 2025

@FrankChen021 FrankChen021 left a comment

LGTM with minor suggestions

@abhishekrb19 abhishekrb19 left a comment

Thanks for the feature, @GWphua! I've left some comments.

|--------|-----------|
|`loadAllEagerly`|The default startup strategy. The Historical service will load all segment column metadata immediately during the initial startup process.|
|`loadAllLazily`|To significantly improve historical system startup time, segments are not loaded during the initial startup sequence. Instead, the loading cost is deferred, and will be incurred the first time a segment is referenced by a query.|
|`loadEagerlyBeforePeriod`|Provides a balance between fast startup and query performance. The Historical service will eagerly load column metadata only for segments that fall within the most recent period defined by `druid.segmentCache.startupLoadPeriod`. Segments outside this recent period will be loaded on-demand when first queried.|
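
The "deferred loading cost" described in the table can be pictured as a memoized loader: nothing is read at startup, the first access pays the cost, and later accesses reuse the result. The sketch below is a generic illustration under that assumption. `LazySegmentSketch` is a hypothetical class and does not mirror how Druid wraps segments internally.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical memoizing wrapper: the loader runs once, on first access.
final class LazySegmentSketch<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazySegmentSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first call only; later calls return the cached result. */
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        LazySegmentSketch<String> segment = new LazySegmentSketch<>(() -> {
            loads.incrementAndGet(); // stands in for the expensive metadata read
            return "column metadata";
        });
        System.out.println("loads after startup: " + loads.get());
        segment.get(); // first query pays the loading cost
        segment.get(); // second query reuses the result
        System.out.println("loads after two queries: " + loads.get());
    }
}
```

The first query against a lazily loaded segment absorbs the loading latency, which is the source of the query-latency hiccups mentioned in the PR description.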
Contributor

How feasible/extensible is it to accept a map of datasource to load period, to allow configurable periods per datasource? (similar to the loadByPeriod - load rules config where each datasource can have different load retention rules)

I think having that option would allow a lot more flexibility to operators as the query workloads can be vastly different.

@GWphua GWphua Oct 27, 2025

This is workable --

I can change startupLoadStrategy.period to startupLoadStrategy.datasourceToPeriodMapping, which receives something like a JSON

e.g.
{"DS1": "P7D", "DS2": "P2D", ".": "P7D"}

Where . refers to the default configuration (since datasources cannot start with .)
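
A minimal sketch of how such a mapping could be resolved, assuming the JSON has already been parsed into a string-to-string map. `DatasourcePeriodSketch` and `periodFor` are hypothetical names, and `java.time.Period` stands in for whatever period type the real configuration would use.

```java
import java.time.Period;
import java.util.Map;

// Hypothetical lookup for a datasource-to-period mapping with a "." default entry.
final class DatasourcePeriodSketch {
    static final String DEFAULT_KEY = ".";

    /** Returns the period configured for the datasource, falling back to the "." entry. */
    static Period periodFor(Map<String, String> mapping, String datasource) {
        String iso = mapping.getOrDefault(datasource, mapping.get(DEFAULT_KEY));
        if (iso == null) {
            throw new IllegalArgumentException(
                "No period for datasource " + datasource + " and no \".\" default");
        }
        return Period.parse(iso); // ISO-8601 period such as P7D
    }

    public static void main(String[] args) {
        Map<String, String> mapping = Map.of("DS1", "P7D", "DS2", "P2D", ".", "P7D");
        System.out.println(periodFor(mapping, "DS2"));   // explicit entry
        System.out.println(periodFor(mapping, "other")); // falls back to "."
    }
}
```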

Member

In my opinion, let's keep the change in this PR small. For datasource-level configuration, if there's a real need for this feature, we can implement it by defining a datasource-level configuration.

Contributor Author

> How feasible/extensible is it to accept a map of datasource to load period, to allow configurable periods per datasource? (similar to the loadByPeriod - load rules config where each datasource can have different load retention rules)
>
> I think having that option would allow a lot more flexibility to operators as the query workloads can be vastly different.

I feel we can leave this for another PR, since it is out of scope of this intended PR. WDYT? @abhishekrb19

Comment on lines 113 to 114
return startupLoadStrategy == null
? isLazyLoadOnStart()
Contributor

Do we validate against incompatible configurations being set accidentally? For example, when lazyLoadOnStart = true and startupLoadStrategy are both set.

@GWphua GWphua Oct 24, 2025

The current implementation ensures that a declared startupLoadStrategy overrides whatever isLazyLoadOnStart is set to; otherwise, isLazyLoadOnStart is used for backward compatibility. Personally, I don't think validation is needed given this override behavior, though it would be relatively simple to add if you feel it's needed.
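
To make the two options concrete, here is a small sketch contrasting the override behavior described above with a stricter variant that rejects conflicting settings. It models strategies as plain strings for brevity. `StartupStrategyResolutionSketch`, `resolve`, and `resolveStrict` are hypothetical and are not the methods in `SegmentLoaderConfig`.

```java
// Hypothetical comparison of "override" versus "validate" handling of the two settings.
final class StartupStrategyResolutionSketch {
    /** Override behavior: an explicit strategy wins; otherwise fall back to the legacy flag. */
    static String resolve(String startupLoadStrategy, boolean lazyLoadOnStart) {
        if (startupLoadStrategy != null) {
            return startupLoadStrategy;
        }
        return lazyLoadOnStart ? "loadAllLazily" : "loadAllEagerly";
    }

    /** Stricter alternative: fail fast if the legacy flag contradicts the explicit strategy. */
    static String resolveStrict(String startupLoadStrategy, boolean lazyLoadOnStart) {
        boolean conflict = lazyLoadOnStart
            && startupLoadStrategy != null
            && !startupLoadStrategy.equals("loadAllLazily");
        if (conflict) {
            throw new IllegalArgumentException(
                "lazyLoadOnStart=true conflicts with startupLoadStrategy=" + startupLoadStrategy);
        }
        return resolve(startupLoadStrategy, lazyLoadOnStart);
    }

    public static void main(String[] args) {
        // Explicit strategy silently overrides the legacy flag.
        System.out.println(resolve("loadEagerlyBeforePeriod", true));
        // No explicit strategy: the legacy flag decides.
        System.out.println(resolve(null, true));
    }
}
```

The override form keeps old configurations working untouched, while the strict form surfaces a likely misconfiguration at startup at the cost of breaking deployments that still carry the deprecated flag.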

public HistoricalStartupCacheLoadStrategy getStartupCacheLoadStrategy()
{
return startupLoadStrategy == null
? isLazyLoadOnStart() ? new LoadAllLazilyStrategy() : new LoadAllEagerlyStrategy()

Check notice: Code scanning / CodeQL

Deprecated method or constructor invocation (Note)

Invoking SegmentLoaderConfig.isLazyLoadOnStart should be avoided because it has been deprecated.

kfaraz commented Oct 27, 2025

I guess it is natural to allow more start up strategies since we already support - loadAllLazily and loadAllEagerly.
But the approach here feels a little complicated. We are also trying to re-invent the things which load rules already do.
It would perhaps make more sense to use "Virtual Storage Fabric" + load rule combo since they are much more powerful and will only evolve over time.

IIUC, the use case here is that we frequently query the data of the last x days and also query the data of the last x + y days but less frequently so. Rather than doing the eager vs lazy load of the segments in the y days only on startup, can we always load these segments lazily?

So, maybe the solution could be a mix of the recently added "Virtual Storage Fabric" + some load rules.

Load rules would perhaps look something like:

Rule 1: loadByPeriod last 7 days (i.e. load eagerly on startup and otherwise)
Rule 2: loadLazilyByPeriod last 30 days (i.e. these segments are present on virtual storage fabric and will be loaded when the first query is received for them. Coordinator remains agnostic of this fact. The virtual storage can also be reclaimed to load other virtual segments.).
Rule 3: dropForever any data beyond the last 30 days.
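
Druid applies the first load rule that matches a segment, so the ordering above matters. The sketch below models that first-match evaluation for the three proposed rules. It is a hypothetical toy model, not Druid's rule classes, and `loadLazilyByPeriod` is the rule type proposed in this comment rather than an existing one.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical model of first-match rule evaluation; not Druid's rule implementation.
final class LoadRuleSketch {
    enum Action { LOAD_EAGERLY, LOAD_LAZILY, DROP }

    static final class Rule {
        final Duration maxAge; // null means "matches every segment"
        final Action action;

        Rule(Duration maxAge, Action action) {
            this.maxAge = maxAge;
            this.action = action;
        }

        boolean matches(Duration segmentAge) {
            return maxAge == null || segmentAge.compareTo(maxAge) <= 0;
        }
    }

    // Rule 1: last 7 days eagerly; Rule 2: last 30 days lazily; Rule 3: drop the rest.
    static final List<Rule> RULES = List.of(
        new Rule(Duration.ofDays(7), Action.LOAD_EAGERLY),
        new Rule(Duration.ofDays(30), Action.LOAD_LAZILY),
        new Rule(null, Action.DROP));

    /** Returns the action of the first rule that matches the segment's age. */
    static Action evaluate(List<Rule> rules, Duration segmentAge) {
        for (Rule rule : rules) {
            if (rule.matches(segmentAge)) {
                return rule.action;
            }
        }
        throw new IllegalStateException("no rule matched");
    }

    public static void main(String[] args) {
        System.out.println(evaluate(RULES, Duration.ofDays(3)));
        System.out.println(evaluate(RULES, Duration.ofDays(20)));
        System.out.println(evaluate(RULES, Duration.ofDays(45)));
    }
}
```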

@FrankChen021 , @GWphua , would this satisfy your use case?
@clintropolis , @abhishekrb19 , what are your thoughts?

@FrankChen021

> I guess it is natural to allow more start up strategies since we already support - loadAllLazily and loadAllEagerly. But the approach here feels a little complicated. We are also trying to re-invent the things which load rules already do. It would perhaps make more sense to use "Virtual Storage Fabric" + load rule combo since they are much more powerful and will only evolve over time.
>
> IIUC, the use case here is that we frequently query the data of the last x days and also query the data of the last x + y days but less frequently so. Rather than doing the eager vs lazy load of the segments in the y days only on startup, can we always load these segments lazily?
>
> So, maybe the solution could be a mix of the recently added "Virtual Storage Fabric" + some load rules.
>
> Load rules would perhaps look something like:
>
> Rule 1: loadByPeriod last 7 days (i.e. load eagerly on startup and otherwise)
> Rule 2: loadLazilyByPeriod last 30 days (i.e. these segments are present on virtual storage fabric and will be loaded when the first query is received for them. Coordinator remains agnostic of this fact. The virtual storage can also be reclaimed to load other virtual segments.).
> Rule 3: dropForever any data beyond the last 30 days.
>
> @FrankChen021 , @GWphua , would this satisfy your use case? @clintropolis , @abhishekrb19 , what are your thoughts?

I think virtual storage + load rules is different from what this PR is doing.
Under virtual storage mode, a segment is not loaded from deep storage until the first query for it arrives.
Here, the segment has already been downloaded by the Historical node; it is just not loaded into the page cache during Historical startup. The implementation here does not reinvent anything and is not complicated.

If virtual storage is going to be the dominant mode in the future, introducing loadLazilyByPeriod to load rules makes sense.

Virtual storage was only just merged and is still experimental. I don't know when it will be production-ready or what its roadmap is. Making a small change to the existing segment cache loading is still worthwhile.


kfaraz commented Oct 27, 2025

Thanks for the clarification, @FrankChen021 !
I agree that the solution here is different from what vsf + load rule would do.
But I was wondering if it was close enough to satisfy your use case. That is, why not delay downloading the segment from the deep storage too?

> The implementation here does not reinvent anything and is not complicated.
> Making a small change to the existing segment cache loading is still worthwhile.

True, as mentioned, I am not opposed to the idea of new startup cache load strategies. It only seems natural.
But the additional configs mean more maintenance work for cluster admins. They would need to think about which segments they want to be loaded on the cache eagerly and which are okay to be left for later. It is a question (kind of) similar to load rules i.e. which segments to keep on historicals and which are okay to be left just on the deep store. The page cache is just the first level of caching, the second being the disk of the historical itself.

Since load rules already work well with the concept of period-based loading, I hoped it would be more useful for the future to just extend that concept to cover such use cases as well.

But if that doesn't cover your use case, I can understand.

> If virtual storage is going to be the dominant mode in the future, introducing loadLazilyByPeriod to load rules makes sense.
> Virtual storage was only just merged and is still experimental. I don't know when it will be production-ready or what its roadmap is.

I think @clintropolis would have some insights there but I imagine it should see good adoption in the near future. I was hoping you would be one of the early birds! 😉

Development

Successfully merging this pull request may close these issues.

Historical Segment Cache Loading Strategy on Start-up
