Skip to content

Refactor: Extract index creation logic into smaller methods in MetaDataCreateIndexService#131

Open
augment-app-staging[bot] wants to merge 4828 commits intomainfrom
eval-504606be
Open

Refactor: Extract index creation logic into smaller methods in MetaDataCreateIndexService#131
augment-app-staging[bot] wants to merge 4828 commits intomainfrom
eval-504606be

Conversation

@augment-app-staging
Copy link

Summary

Extract the monolithic execute() method in IndexCreationTask (within MetaDataCreateIndexService.java) into smaller, well-named methods that each handle a single responsibility. This makes the service more readable and unit testable.

Changes

The ~300-line execute() method has been broken down into the following focused methods:

Method Visibility Responsibility
parseRequestMappings() package-private Parse request mappings into a structured map
applyTemplateMappings() package-private Apply and merge template mappings with request mappings
applyTemplateAliases() package-private Process template aliases with validation and templating
buildIndexSettings() package-private Build index settings from templates, request, and defaults
resolveRoutingNumberOfShards() package-private static Calculate routing number of shards
buildTemporaryIndexMetaData() package-private static Build temporary IndexMetaData for validation
validateActiveShardCount() package-private static Validate wait_for_active_shards setting
validateAliasFilters() package-private Validate alias filters using query shard context
collectMappingMetaData() package-private static Collect mapping metadata from mapper service
buildFinalIndexMetaData() package-private static Build the final IndexMetaData with all components
updateClusterState() package-private Update cluster state with blocks, metadata, and routing

Design Decisions

  • Package-private visibility: Methods are package-private to enable unit testing from the same package while not exposing internals to the public API.
  • Static where possible: Methods that don't depend on instance state (e.g., resolveRoutingNumberOfShards, buildTemporaryIndexMetaData, validateActiveShardCount, collectMappingMetaData, buildFinalIndexMetaData) are made static to make their dependencies explicit.
  • Behavior-preserving: This is a pure refactoring - no behavioral changes. The execute() method now reads as a clear high-level flow.

Testing

Note: The Gradle build system cannot resolve dependencies because jcenter.bintray.com (used by this older codebase) has been permanently shut down. The refactoring is structurally verified through code review. The extracted methods maintain the exact same logic as the original monolithic method.


Triggered by: Slack-Bot-Evals
Original discussion: https://augment-wic8570.slack.com/archives/C0AG2MXKEQJ/p1772676453939769

rjernst and others added 30 commits November 27, 2019 16:03
This commit moves the packaging tests for elasticsearch-setup-passwords
to java from bats. The change also enables future tests to enable
security in Elasticsearch and automatically have waitForElasticsearch
work correctly, at least to the same extent it worked in bats, by
waiting on the ES port instead of health check.

relates elastic#46005
…stic#49166)

This change adds a dynamic cluster setting named `indices.id_field_data.enabled`.
When set to `false` any attempt to load the fielddata for the `_id` field will fail
with an exception. The default value in this change is set to `false` in order to prevent
fielddata usage on this field for future versions but it will be set to `true` when backporting
to 7x. When the setting is set to true (manually or by default in 7x) the loading will also issue
a deprecation warning since we want to disallow fielddata entirely when elastic#26472
is implemented.

Closes elastic#43599
This stems from a time where index requests were directly forwarded to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.

Closes elastic#20279
…lastic#48580) (elastic#49683)

This commit adds  a new histogram field mapper that consists in a pre-aggregated format of numerical data to be used in percentiles aggregations.
…tic#49682)

This test no longer relies on jdk version, so the assume should be removed
relates elastic#48209
Add a couple of pointers for the user to check the
overall cluster health and the version of ES running
on every node.

Fixes: elastic#49670

(cherry picked from commit 8ca11f5)
…stic#49471) (elastic#49691)

This commit adds a function in NodeClient that allows to track the progress
of a search request locally. Progress is tracked through a SearchProgressListener
that exposes query and fetch responses as well as partial and final reduces.
This new method can be used by modules/plugins inside a node in order to track the
progress of a local search request.

Relates elastic#49091
Add a mirror of the maven repository of the shibboleth project
and upgrade opensaml and related dependencies to the latest
version available version

Resolves: elastic#44947
The documentation was added in elastic#47584 but those docs did not reflect the up-to-date behavior of the feature.

Backport of: elastic#47784
- Improves HTTP client hostname verification failure messages
- Adds "DiagnosticTrustManager" which logs certificate information
  when trust cannot be established (hostname failure, CA path failure,
  etc)

These diagnostic messages are designed so that many common TLS
problems can be diagnosed based solely (or primarily) on the
elasticsearch logs.

These diagnostics can be disabled by setting

     xpack.security.ssl.diagnose.trust: false

Backport of: elastic#48911
…tic#49701)

So that the tests can also run in a FIPS 140 JVM, where using a
JKS keystore is not allowed.

Resolves: elastic#49261
Removing a lot of needless buffering and array creation
to reduce the significant memory usage of tests using this.
The incoming stream from the `exchange` is already buffered
so there is no point in adding a ton of additional buffers
everywhere.
…ic#49711)

* Make BlobStoreRepository Aware of ClusterState (elastic#49639)

This is a preliminary to elastic#49060.

It does not introduce any substantial behavior change to how the blob store repository
operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency
by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation
(create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple.
This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the
repository operates in elastic#49060
…lastic#49690) (elastic#49718)

This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.

Closes elastic#49531

Backport of elastic#49690
This stems from a time where index requests were directly forwarded to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.

In contrast to prior PR (elastic#49647), this PR also fixes (see b3697cc) a situation where the previous
index expression logic had an interesting side effect. For bulk requests (which had resolveIndex
= false), the reroute phase was waiting for the index to appear in case where it was not present,
and for all other replication requests (resolveIndex = true) it would right away throw an
IndexNotFoundException while resolving the name and exit. With elastic#49647, every replication
request was now waiting for the index to appear, which was problematic when the given index
had just been deleted (e.g. deleting a follower index while it's still receiving requests from the
leader, where these requests would now wait up to a minute for the index to appear). This PR
now adds b3697cc on top of that prior PR to make sure to reestablish some of the prior behavior
where the reroute phase waits for the bulk request for the index to appear. That logic was in
place to ensure that when an index was created and not all nodes had learned about it yet, that
the bulk would not fail somewhere in the reroute phase. This is now only restricted to the
situation where the current node has an older cluster state than the one that coordinated the
bulk request (which checks that the index is present). This also means that when an index is
deleted, we will no longer unnecessarily wait up to the timeout for the index o appear, and
instead fail the request.

Closes elastic#20279
Some extended testing with MS-SQL server and H2 (which agree on
results) revealed bugs in the implementation of WEEK related extraction
and diff functions.

Non-iso WEEK seems to be broken since elastic#48209 because
of the replacement of Calendar and the change in the ISO rules.

ISO_WEEK failed for some edge cases around the January 1st.

DATE_DIFF was previously based on non-iso WEEK extraction which seems
not to be the case.

Fixes: elastic#49376

(cherry picked from commit 54fe7f5)
Reindex sort never gave a guarantee about the order of documents being
indexed into the destination, though it could give a sense of locality
of source data.

It prevents us from doing resilient reindex and other optimizations and
it has therefore been deprecated.

Related to elastic#47567
This rewrites long sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedups can be 2 - 10 times.

The optimization can be disabled with setting the system property
`es.search.rewrite_sort` to `false`.

Optimization is skipped when an index has 50% or more data with
the same value.

Optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery` which can
efficiently skip non-competitive blocks and segments of documents.

2. Sorting segments according to the primary numeric sort field(elastic#44021)
This allows to skip non-competitive segments.

3. Using collector manager.
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
We use collectorManager, where for every segment a dedicated collector
will be created.

4. Using Lucene's shared TopFieldCollector manager
This collector manager is able to exchange minimum competitive
score between collectors, which allows us to efficiently skip
the whole segments that don't contain competitive scores.

5. When index is force merged to a single segment, elastic#48533 interleaving
old and new segments allows for this optimization as well,
as blocks with non-competitive docs can be skipped.

Backport for elastic#48804


Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Some queries return bulk scorers that can be significantly faster than
iterating naively over the scorer. By giving script_score a BulkScorer
that would delegate to the wrapped query, we could make it faster in some cases.

Closes elastic#40837
Reindex sort never gave a guarantee about the order of documents being
indexed into the destination, though it could give a sense of locality
of source data.

It prevents us from doing resilient reindex and other optimizations and
it has therefore been deprecated.

Related to elastic#47567
martijnvg and others added 30 commits December 17, 2019 12:27
…ecutes async (elastic#50269)

Backport elastic#50244 to 7.x branch.

If a processor executes asynchronously and the ingest simulate api simulates with
multiple documents then the order of the documents in the response may not match
the order of the documents in the request.

Alexander Reelsen discovered this issue with the enrich processor with the following reproduction:

```
PUT cities/_doc/munich
{"zip":"80331","city":"Munich"}

PUT cities/_doc/berlin
{"zip":"10965","city":"Berlin"}

PUT /_enrich/policy/zip-policy
{
  "match": {
    "indices": "cities",
    "match_field": "zip",
    "enrich_fields": [ "city" ]
  }
}

POST /_enrich/policy/zip-policy/_execute

GET _cat/indices/.enrich-*

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
  "processors" : [
    {
      "enrich" : {
        "policy_name": "zip-policy",
        "field" : "zip",
        "target_field": "city",
        "max_matches": "1"
      }
    }
  ]
  },
  "docs": [
    { "_id": "first", "_source" : { "zip" : "80331" } } ,
    { "_id": "second", "_source" : { "zip" : "50667" } }
  ]
}
```

* fixed test compile error
With node ordinals gone, there's no longer a need for such a complicated full cluster restart
procedure (as we can now uniquely associate nodes to data folders).

Follow-up to elastic#41652
Loading shard state information during shard allocation sometimes runs into a situation where a
data node does not know yet how to look up the shard on disk if custom data paths are used.
The current implementation loads the index metadata from disk to determine what the custom
data path looks like. This PR removes this dependency, simplifying the lookup.

Relates elastic#48701
For 7.x and earlier branches, `_cat/repositories` API requests require a
repository name.

This removes an erroneous request example without a repository name
added with a8e0275.
We need to read in a loop here. A single read to a huge byte array will
only read 16k max with the S3 SDK so if the blob we're trying to fully
read is larger we close early and fail the size comparison.
Also, drain streams fully when checking existence to avoid S3 SDK warnings.
elastic#49933)

Users often mistakenly map numeric IDs to numeric datatypes. However,
this is often slow for the `term` and other term-level queries.

The "Tune for search speed" docs includes advice for mapping numeric
IDs to `keyword` fields. However, this tip is not included in the
`numeric` or `keyword` field datatype doc pages.

This rewords the tip in the "Tune for search speed" docs, relocates it
to the `numeric` field docs, and reuses it using tagged regions.
…c#50249)

This commit ensures the global info plugin is applied, which supplies
the isInternal flag used to determine whether distro download looks for
bwcVersions.

relates elastic#50230
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
…lastic#50280)

This commit adds removal of unused data frame analytics state
from the _delete_expired_data API (and in extend th ML daily
maintenance task). At the moment the potential state docs
include the progress document and state for regression and
classification analyses.

Backport of elastic#50243
Our REST infrastructure will reject requests that have a body where the
body of the request is never consumed. This ensures that we reject
requests on endpoints that do not support having a body. This requires
cooperation from the REST handlers though, to actually consume the body,
otherwise the REST infrastructure will proceed with rejecting the
request. This commit addresses an issue in the has privileges API where
we would prematurely try to reject a request for not having a username,
before consuming the body. Since the body was not consumed, the REST
infrastructure would instead reject the request as a bad request.
Docker bypasses the Uncomplicated Firewall (UFW) on Linux by editing the `iptables` config directly, which leads to the exposure of port 9200, even if you blocked it via UFW.

This adds a warning along with work-arounds to the docs.

Signed-off-by: Kovah <mail@kovah.de>
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
…lastic#50320)

The `filter` rule is not allowed on the top-level of the query, so removing it
from the list of allowed rules. Where it can be nested inside other rules, those
rules already mention it.
)

This PR adds unit tests for wire and xContent serialization of `IntervalsSourceProvider.Wildcard`
and `IntervalsSourceProvider.Prefix`.

Relates elastic#50150
The freeze index API docs state that frozen indices are blocked for
write operations.

While this implies frozen indices are read-only, it does not explicitly
use the term "read-only", which is found in other docs, such as the
force merge docs.

This adds the "ready-only" term to the freeze index API docs as well
as other clarification.
…stic#50329)

Cache results from queries that use scripts if they use only
deterministic API calls.  Nondeterministic API calls are marked in the
whitelist with the `@nondeterministic` annotation.  Examples are
`Math.random()` and `new Date()`.

Refs: elastic#49466
…c#50335)

This fixes support for nested fields

We now support fully nested, fully collapsed, or a mix of both on inference docs.

ES mappings allow the `_source` to be any combination of nested objects + dot delimited fields.
So, we should do our best to find the best path down the Map for the desired field.
…tic#50337)

ScriptContextInfoSerializingTests:testEqualsAndHashcode was failing
because the mutation was generating the same name.

**Backport**

Fixes: elastic#50331
…job (elastic#50322) (elastic#50324)

In order to ensure any persisted model state is searchable by the moment
the job reports itself as `stopped`, we need to refresh the state index
before completing.

This should fix the occasional failures we see in elastic#50168 and elastic#50313 where
the model state appears missing.

Closes elastic#50168
Closes elastic#50313

Backport of elastic#50322
Switches generated WKT to upper case to
conform to the standard recommendation.

Relates elastic#49568
…taCreateIndexService

Extract the monolithic execute() method in IndexCreationTask into smaller,
well-named methods that each handle a single responsibility:

- parseRequestMappings(): Parse request mappings into a structured map
- applyTemplateMappings(): Apply and merge template mappings with request mappings
- applyTemplateAliases(): Process template aliases with validation and templating
- buildIndexSettings(): Build index settings from templates, request, and defaults
- resolveRoutingNumberOfShards(): Calculate routing number of shards
- buildTemporaryIndexMetaData(): Build temporary IndexMetaData for validation
- validateActiveShardCount(): Validate wait_for_active_shards setting
- validateAliasFilters(): Validate alias filters using query shard context
- collectMappingMetaData(): Collect mapping metadata from mapper service
- buildFinalIndexMetaData(): Build the final IndexMetaData with all components
- updateClusterState(): Update cluster state with blocks, metadata, and routing

These methods are package-private to enable unit testing while keeping the
same external behavior. Methods that only manipulate ClusterState or its
components are made static where possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.