Elasticsearch, a distributed search and analytics engine, is a powerful tool for full-text search and data analysis.
Built on Apache Lucene and written in Java, it has gained popularity for its flexibility, scalability, and ease of use.
This article provides both a broad overview of the components and background of Elasticsearch and a more in-depth look at the techniques employed for efficient text searching.
## Introduction
Elasticsearch, first introduced in 2010 by Shay Banon, whose project led to the founding of the company Elastic, is a distributed search and analytics engine
designed for handling large amounts of unstructured data. Primarily used for full-text search, it employs a combination of indexing and searching to deliver relevant results efficiently.
Elasticsearch is often used to add full-text search functionality on top of other database systems that do not provide sophisticated full-text search themselves (the data is then usually synchronized between both systems).
## What is Elasticsearch?
### Overview
### Elasticsearch Components: The ELK Stack
responsible for the search and analytics part, Logstash for data processing and
and managing stored data. Other services like Beats are often integrated for various functionalities, e.g. collection of data.
!!! example "Elastic Stack"
    Logstash is usually used to ingest data from various sources into Elasticsearch (optionally parsing it beforehand).
    Beats are agents attached, for example, to applications to collect logs or other metrics. Kibana utilizes the powerful search engine of Elasticsearch to then visualize the data.
of the Server-Side Public License. This shift was driven by Elastic's [dissatisf
as a service. In response, an open-source fork named OpenSearch emerged, supported by AWS, RedHat, SAP, and others.
!!! info "[Licensing Situation now](https://www.elastic.co/de/pricing/faq/licensing)"
    While no longer open-source, Elasticsearch is still "source-available": it can still be used and modified at will.
    It is just not allowed to offer Elasticsearch as a service (Software as a Service, SaaS) to customers, as Amazon did in the past on AWS.
## Interlude: Full Text Search
With Full-Text Search the whole content of something is searched (e.g. a whole book) and not (only) its metadata (author, title, abstract).
It is therefore all about searching unstructured data, for example tweets.
When searching for a specific query inside a document or a small set of documents, the whole content can be scanned. This is usually done when using _Ctrl + F_ in the browser or editor of your choice,
but also by CLI tools like `grep` on Unix systems.
Once the number of documents becomes larger, it gets increasingly less efficient to scan all the documents and their content.
The amount of effort and therefore time needed to search for a query is no longer sustainable.
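This brute-force approach can be sketched in a few lines (an illustrative toy in Python, not how any real tool is implemented):

```python
# Naive full-text search: scan the complete text of every document.
# The cost grows linearly with the total amount of text.
documents = {
    1: "Elasticsearch is a distributed search engine",
    2: "Lucene is a search library written in Java",
    3: "Kibana visualizes your data in dashboards",
}

def linear_scan(query: str) -> list[int]:
    """Return the IDs of all documents whose content contains the query."""
    q = query.lower()
    return [doc_id for doc_id, text in documents.items() if q in text.lower()]

print(linear_scan("search"))  # [1, 2]
```

This works for a handful of documents but degrades as the collection grows, which is exactly the problem an index solves.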
When ingesting documents Elasticsearch also builds and updates an Index, in this case an inverted index.
To make clearer why an inverted index is used and why it is so efficient for full-text search, I will explain the difference between a _Forward Index_ and an _Inverted Index_.
### Different Index Types & Elasticsearch's Inverted Index
A _Forward Index_ saves for each document the keywords it contains, mapping the ID of that document to the keywords.
Querying the Index would mean that the entry for each document would need to be searched for the search term of the query.
An example for such an Index would be the table of contents of a book: when looking for something you would be able to jump to the chapter through its entry in the list, but you would still need to search the whole chapter
for the term you are looking for.
An _Inverted Index_ on the other hand maps each keyword onto the document IDs which contain that word, so only the "keys" of the Index need to be searched.
An example would be the index at the end of a book, which lists all the pages where a keyword appears.
Generally a _Forward Index_ is fast when building the Index but slow when searching it; the _Inverted Index_ is rather slow when indexing documents but much faster when searching.
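A minimal Python sketch (toy data, not Elasticsearch's actual data structures) makes the difference concrete:

```python
from collections import defaultdict

documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}

# Forward index: document ID -> keywords contained in that document.
forward_index = {doc_id: text.split() for doc_id, text in documents.items()}

# Inverted index: keyword -> IDs of the documents containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        inverted_index[token].add(doc_id)

# Searching the forward index means scanning every document's entry ...
hits_forward = [d for d, tokens in forward_index.items() if "quick" in tokens]
# ... while the inverted index answers the same query with one lookup.
hits_inverted = inverted_index["quick"]

print(hits_forward, hits_inverted)  # [1, 3] {1, 3}
```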
!!! example "_Forward Index_ and _Inverted Index_"
The _Inverted Index_ utilized by Elasticsearch not only saves for each unique keyword in which documents it appears but also at which positions inside the document.
Before the Index is built, an analysis process is run by an _analyzer_ on the input data, enabling more accurate and flexible results when searching the Index instead of only exact matches.
Indexing is done continuously, making documents available for searching directly after ingestion.
!!! info "Elasticsearch Field Types"
    All of the mentioned processes are only applied when indexing so-called _full text_ fields of the saved JSON documents.
### Text Analysis & Processing Techniques
To enhance full-text search, Elasticsearch employs [natural language processing techniques](/lectures/preprocessing/) during the analysis phase.
Tokenization breaks strings into words, and normalization ensures consistent representation, handling variations like capitalization and synonyms.
Elasticsearch provides a couple of different built-in [_analyzers_](https://www.elastic.co/guide/en/elasticsearch/reference/8.12/analysis-overview.html)
next to the commonly used _standard analyzer_, but also the possibility to create your own _custom analyzer_.
Text Analysis in Elasticsearch usually involves two steps:
1. **Tokenization**: splitting up text into tokens and indexing each word
2. **Normalization**: capitalization, synonyms and word stems are indexed as a single term
Tokenization enables the terms in a query string to be looked up individually, but not similar tokens (e.g. upper- and lowercase variants, word stems or synonyms), which makes a normalization step necessary.
To make a query match the analyzed and indexed keywords, the same analysis steps are applied to the query string.
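A heavily simplified analyzer might look like this (the tokenization rule and synonym table are made up for illustration; real analyzers combine character filters, tokenizers, and token filters, plus stemming):

```python
import re

# Hypothetical synonym table used during normalization.
SYNONYMS = {"rapid": "quick"}

def analyze(text: str) -> list[str]:
    # 1. Tokenization: split the text into individual tokens.
    tokens = re.findall(r"\w+", text)
    # 2. Normalization: lowercase each token and collapse synonyms
    #    onto a single indexed term.
    return [SYNONYMS.get(tok.lower(), tok.lower()) for tok in tokens]

# The same analysis runs on documents at index time ...
print(analyze("The Quick Brown Fox"))  # ['the', 'quick', 'brown', 'fox']
# ... and on the query string at search time, so both sides match.
print(analyze("RAPID fox"))            # ['quick', 'fox']
```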
While this makes it possible to fetch accurate results that match a search term, this could sometimes be hundreds of documents. It is cumbersome to search these results for
the most relevant documents ourselves.
Elasticsearch applies similarity scoring on search results to solve this problem.
It still has a couple of shortcomings, for example the length of a document is not taken into account.
Elasticsearch therefore utilizes the **BM25** algorithm which is based on **TF-IDF**. While the **IDF** part of the **BM25** algorithm is similar (rare words lead to a higher score), it also
addresses the length of a document: the score is lower for longer documents (based on the number of words that do not match the query).
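The length normalization can be seen in a simplified per-term BM25 score (`k1` and `b` are the usual free parameters; Elasticsearch's Lucene implementation differs in details):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Simplified BM25 contribution of a single query term to one document."""
    # IDF part: rare terms (low doc_freq) get a higher weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # TF part: term frequency saturates, and the b term penalizes
    # documents that are longer than the average document.
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# Same term frequency, but the shorter document scores higher:
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=500, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # True
```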
### Scalability and Distribution
Elasticsearch's popularity stems from its scalability and distribution capabilities. Running on clusters, it automatically distributes data to nodes,
utilizing shards (each node gets a part of the inverted index, a shard) to enable parallel processing of search queries. This makes it well-suited for handling large datasets efficiently.
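Conceptually, routing documents to shards works roughly like this (a sketch; Elasticsearch actually applies a murmur3 hash to a routing value, by default the document ID):

```python
import zlib

NUM_SHARDS = 3

def route_to_shard(doc_id: str) -> int:
    # Deterministically map a document ID onto one of the shards.
    # (Sketch only: crc32 stands in for Elasticsearch's murmur3 hash.)
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS

# Every document lands on exactly one shard; a search request is then
# fanned out to all shards, each scanning only its slice of the index.
shard_of = {doc_id: route_to_shard(doc_id) for doc_id in ("doc-1", "doc-2", "doc-3")}
print(shard_of)
```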
### Advanced Features and Use Cases - Vector Embeddings & Semantic Search
This is mostly used for k-nearest neighbor search, which returns the _k_ nearest neighbors for a query.
The embeddings can be generated before ingesting data into Elasticsearch, or generation can be delegated to an NLP model inside Elasticsearch which has to be added by the user
beforehand.
Elasticsearch also offers its own built-in, domain-free **ELSER** model (Elastic Learned Sparse Encoder), a paid service that does not need to be trained on a customer's data beforehand.
The storage of data as vector representations in Elasticsearch enables advanced searches, making it suitable for applications like recommendation engines and multimedia content searches.
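The principle behind k-nearest neighbor search over embeddings can be sketched with brute-force cosine similarity (toy 3-dimensional vectors; Elasticsearch uses approximate algorithms such as HNSW to stay fast on large datasets):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of already indexed documents.
doc_vectors = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.0, 1.0, 0.1],
    "doc-c": [0.8, 0.2, 0.1],
}

def knn(query_vec, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
    return ranked[:k]

print(knn([1.0, 0.0, 0.0]))  # ['doc-a', 'doc-c']
```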
Hugging Face has emerged as a pivotal player in the AI and machine learning arena, specializing in natural language processing (NLP). This article delves into its core offerings, including model hosting, spaces, datasets, pricing, and the Transformers API. Hugging Face is not only a repository for cutting-edge models but also a platform for collaboration and innovation in AI.
## Model Hosting on Hugging Face
Hugging Face has made a name for itself in model hosting. It offers a vast repository of models.
```python
# Note: the pipeline setup preceding this call was not preserved; this
# assumes a text-generation pipeline, e.g. GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

generated_texts = generator("The student worked on", max_length=30, num_return_sequences=2)
print(generated_texts)
```
This outputs the following:
```python
[{'generated_text': 'The student worked on his paper, which you can read about here. You can get an ebook with that part, or an audiobook with some of'}, {'generated_text': 'The student worked on this particular task by making the same basic task in his head again and again, without the help of some external helper, even when'}]
```
Spaces are an innovative feature of Hugging Face, offering a collaborative environment.
The Hugging Face ecosystem includes a wide range of datasets, catering to different NLP tasks. The Datasets library simplifies the process of loading and processing data, ensuring efficiency and consistency in model training. According to Hugging Face, they host over 75k datasets.
The Transformers API is a testament to Hugging Face's innovation. This API simplifies the process of text transformation, making it accessible even to those with limited programming skills. It supports a variety of NLP tasks and can be integrated into various applications.
Hugging Face Inference plays a crucial role in turning trained language models into productive applications. The platform provides an intuitive and powerful infrastructure for inferencing models, which means that developers can easily access pre-trained models to generate real-time predictions for a wide range of NLP tasks. Thanks to its efficient implementation and support for hardware acceleration technologies, Hugging Face Inference enables the seamless integration of language models into applications ranging from chatbots to machine translation and sentiment analysis.
The Inference API URL is always defined like this: