Commit 1cd8c07

Run pre-commit hooks on all files
1 parent 7b975fd commit 1cd8c07

File tree

5 files changed: +65 -69 lines changed


docs/presentations/articles/elasticsearch.md

Lines changed: 14 additions & 25 deletions
@@ -6,20 +6,18 @@ _Author: [Benjamin Brünau](mailto:[email protected])_

 ## TL;DR

-Elasticsearch, a distributed search and analytics engine, is a powerful tool for full-text search and data analysis.
-Built on Apache Lucene and written in Java, it has gained popularity for its flexibility, scalability, and ease of use.
+Elasticsearch, a distributed search and analytics engine, is a powerful tool for full-text search and data analysis.
+Built on Apache Lucene and written in Java, it has gained popularity for its flexibility, scalability, and ease of use.
 This article provides both a broad overview of the components and background of Elasticsearch and a more in-depth view of the techniques employed for efficient text searching.

-
 ## Introduction

-Elasticsearch, first introduced in 2010 by Shay Banon, which led to the company Elastic, is a distributed search and analytics engine
+Elasticsearch, first introduced in 2010 by Shay Banon, which led to the company Elastic, is a distributed search and analytics engine
 designed for handling large amounts of unstructured data. Primarily used for full-text search, it employs a combination of indexing and searching to deliver relevant results efficiently.
 Elasticsearch is often used to provide a broader range of search functionality on top of other database systems that do not offer a sophisticated way to do Full-Text search (the data is then usually synchronized between both systems).

 ## What is Elasticsearch?

-
 ### Overview

 ### Elasticsearch Components: The ELK Stack
@@ -30,7 +28,7 @@ responsible for the search and analytics part, Logstash for data processing and
 and managing stored data. Other services like Beats are often integrated for various functionalities, e.g. the collection of data.

 !!! example "Elastic Stack"
-
+
 Logstash is usually used to ingest data from various sources into Elasticsearch (optionally parsing it beforehand).
 Beats are agents attached to, for example, applications in order to collect logs or other metrics. Kibana utilizes the powerful search engine of Elasticsearch to then visualize the data.
 ![ELK Stack - Data Flow](./assets/elasticsearch-elk-stack-data-flow.png)
@@ -42,19 +40,17 @@ of the Server-Side Public License. This shift was driven by Elastic's [dissatisf
 as a service. In response, an open-source fork named OpenSearch emerged, supported by AWS, RedHat, SAP, and others.

 !!! info "[Licensing Situation now](https://www.elastic.co/de/pricing/faq/licensing)"
-
-While no longer being open-source, Elasticsearch is still "source-available". Elasticsearch can still be used and modified at will.
-It is just not allowed to offer Elasticsearch as a Service (Software as a Service - SaaS) to potential customers, like Amazon did in the past on AWS.

+While no longer being open-source, Elasticsearch is still "source-available". Elasticsearch can still be used and modified at will.
+It is just not allowed to offer Elasticsearch as a Service (Software as a Service - SaaS) to potential customers, like Amazon did in the past on AWS.

 ## Interlude: Full Text Search

-With Full-Text Search the whole content of something is searched (e.g. a whole book) and not (only) its metadata (author, title, abstract).
+With Full-Text Search the whole content of something is searched (e.g. a whole book) and not (only) its metadata (author, title, abstract).
 It is therefore all about searching unstructured data, for example Tweets.
 When searching for a specific query inside a document or a small set of documents, the whole content can be scanned; this is usually done when using _CTRL + F_ in the browser or editor of your choice,
 but also by CLI tools like `grep` on Unix systems.

-
 Once the number of documents becomes larger it gets increasingly less efficient to scan all the documents and their content.
 The amount of effort and therefore time needed to search for a query is no longer sustainable.

@@ -75,30 +71,26 @@ When ingesting documents Elasticsearch also builds and updates an Index, in this

 To make it clearer why an inverted Index is used and why it is so efficient for Full-Text search, I will explain the difference between a _Forward Index_ and an _Inverted Index_.

-
 ### Different Index Types & Elasticsearch's Inverted Index

 A _Forward Index_ saves for each document the keywords it contains, mapping the ID of that Document to the keywords.
 Querying the Index would mean that the entry for each Document would need to be searched for the search term of the query.
-An example of such an Index would be the table of contents of a book: when looking for something you would be able to jump to the chapter through its entry in the list, but you would still need to search the whole chapter
+An example of such an Index would be the table of contents of a book: when looking for something you would be able to jump to the chapter through its entry in the list, but you would still need to search the whole chapter
 for the term you are looking for.

 An _Inverted Index_ on the other hand maps each keyword onto the Document IDs which contain that word; therefore it is only necessary to search the "keys" of the Index.
 An example would be the Index at the end of the book, which lists all the pages where a keyword appears.

 Generally a _Forward Index_ is fast when building the Index but slow when searching it, while the _Inverted Index_ is rather slow when indexing documents but much faster when searching.

-
 !!! example "_Forward Index_ and _Inverted Index_"

 ![Example: Forward Index and Inverted Index](./assets/elasticsearch-index-example.png)

-
 The _Inverted Index_ utilized by Elasticsearch not only saves for each unique keyword in which documents it appears but also at which position inside the document.
 Before building the Index, an analysis process is run by an _analyzer_ on the input data for more accurate and flexible results when searching the Index, instead of only exact matches.
 Indexing is done continuously, making documents available for searching directly after ingestion.

-
 !!! info "Elasticsearch Field Types"

 All of the mentioned processes are only applied for indexing so-called _full text_ fields of the saved JSON documents.
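To make the forward/inverted distinction above concrete, here is a small, illustrative Python sketch of both structures (a toy example with made-up mini-documents, not how Lucene actually stores its index):

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is built on lucene",
    2: "lucene is a java search library",
}

# Forward index: document ID -> keywords contained in that document
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: keyword -> IDs of the documents containing it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Searching the forward index means scanning every document's keyword list ...
hits_forward = [doc_id for doc_id, terms in forward_index.items() if "lucene" in terms]
# ... while the inverted index answers the same query with a single lookup.
hits_inverted = inverted_index["lucene"]
print(hits_forward, hits_inverted)  # [1, 2] {1, 2}
```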
@@ -107,20 +99,20 @@ Indexing is done continuously, making documents available for searching directly

 ### Text Analysis & Processing Techniques

-To enhance full-text search, Elasticsearch employs [natural language processing techniques](/lectures/preprocessing/) during the analysis phase.
+To enhance full-text search, Elasticsearch employs [natural language processing techniques](/lectures/preprocessing/) during the analysis phase.
 Tokenization breaks strings into words, and normalization ensures consistent representation, handling variations like capitalization and synonyms.
-Elasticsearch provides a couple of different built-in [_analyzers_](https://www.elastic.co/guide/en/elasticsearch/reference/8.12/analysis-overview.html)
+Elasticsearch provides a couple of different built-in [_analyzers_](https://www.elastic.co/guide/en/elasticsearch/reference/8.12/analysis-overview.html)
 next to the commonly used _standard analyzer_, but also the possibility to create your own _custom analyzer_.

-Text Analysis in Elasticsearch usually involves two steps:
+Text Analysis in Elasticsearch usually involves two steps:

-1. **Tokenization**: splitting up text into tokens and indexing each word
+1. **Tokenization**: splitting up text into tokens and indexing each word
 2. **Normalization**: capitalization, synonyms and word stems are indexed as a single term

 Tokenization enables the terms in a query string to be looked up individually, but not similar tokens (e.g. upper- and lowercase variants, word stems or synonyms), which makes a Normalization step necessary.
 To make a query match the analyzed and indexed keywords, the same analysis steps are applied to the query string.

-While this makes it possible to fetch accurate results that match a search term, the result could sometimes be hundreds of documents. It is cumbersome to search these results for
+While this makes it possible to fetch accurate results that match a search term, the result could sometimes be hundreds of documents. It is cumbersome to search these results for
 the most relevant documents ourselves.
 Elasticsearch applies similarity scoring on search results to solve this problem.
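As a toy illustration of these two analysis steps, the following Python sketch mimics what an analyzer conceptually does; the synonym table is made up for the example, and real Elasticsearch analyzers are configurable chains of tokenizers and token filters:

```python
import re

SYNONYMS = {"js": "javascript"}  # toy normalization table

def analyze(text: str) -> list[str]:
    # 1. Tokenization: split the string into individual tokens
    tokens = re.findall(r"\w+", text)
    # 2. Normalization: lowercase and map synonyms to a single canonical term
    return [SYNONYMS.get(token.lower(), token.lower()) for token in tokens]

# The same analysis is applied at index time and at query time, so the document
# text "Learning JS!" and the query "JavaScript" end up with matching terms.
print(analyze("Learning JS!"))  # ['learning', 'javascript']
print(analyze("JavaScript"))    # ['javascript']
```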

@@ -143,13 +135,11 @@ It still has a couple of shortcomings, for example the length of a document is n
 Elasticsearch therefore utilizes the **BM25** algorithm, which is based on **TF-IDF**. While the **IDF** part of the **BM25** algorithm is similar (rare words lead to a higher score), it also
 addresses the length of a document: the score is lower for bigger documents (based on the amount of words that do not match the query).

-
 ### Scalability and Distribution

 Elasticsearch's popularity stems from its scalability and distribution capabilities. Running on clusters, it automatically distributes data to nodes,
 utilizing shards (each node gets a part of the inverted index, a shard) to enable parallel processing of search queries. This makes it well-suited for handling large datasets efficiently.

-
 ![Elasticsearch as a distributed system](./assets/elasticsearch-distributed-system.png)

 ### Advanced Features and Use Cases - Vector Embeddings & Semantic Search
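To make the **BM25** scoring described above more tangible, here is a rough, simplified Python sketch of the formula; it is an illustration only, and Lucene's production implementation differs in details such as how document-length norms are stored:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query with a simplified BM25."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)   # term frequency inside this document
        df = doc_freqs.get(term, 0)  # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        # Rare terms get a higher IDF weight ...
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # ... and longer documents are penalized via the b * |D| / avgdl component.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score

doc = "elasticsearch is a distributed search engine".split()
print(bm25_score(["search", "engine"], doc, {"search": 3, "engine": 2}, num_docs=10, avg_doc_len=8))
```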
@@ -159,7 +149,6 @@ This is mostly used for k-nearest neighbor search, which returns the _k_ nearest
 The embeddings can be generated before ingesting data into Elasticsearch or delegated to an NLP model inside of Elasticsearch, which has to be added by the user
 beforehand.

-
 Elasticsearch also offers its own built-in, domain-free **ELSER** model (Elastic Learned Sparse Encoder), which is a paid service that does not need to be trained on a customer's data beforehand.

 The storage of data as vector representations in Elasticsearch enables advanced searches, making it suitable for applications like recommendation engines and multimedia content searches.
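For illustration, a sketch of what a kNN search request body can look like in recent Elasticsearch 8.x versions; the index and field names (`my-index`, `my_vector`) and the tiny example vector are assumptions, and the exact syntax should be checked against the documentation of the version in use:

```python
# Sketch of a kNN search request body; it would be sent to the index's _search endpoint,
# e.g. POST /my-index/_search. The query vector would normally come from the same
# embedding model that produced the indexed vectors.
knn_query = {
    "knn": {
        "field": "my_vector",                 # dense_vector field holding the embeddings
        "query_vector": [0.12, -0.53, 0.08],  # embedding of the search text
        "k": 5,                               # number of nearest neighbors to return
        "num_candidates": 50,                 # candidates considered per shard
    },
    "_source": ["title"],
}
```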
@@ -184,4 +173,4 @@ The storage of data as vector representations in Elasticsearch enables advanced
 - [BM25 Algorithm 1](https://www.elastic.co/de/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch)
 - [BM25 Algorithm 2](https://www.elastic.co/de/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)
 - [OpenSearch Project](https://opensearch.org/)
-- [Apache Lucene](https://lucene.apache.org/)
+- [Apache Lucene](https://lucene.apache.org/)

docs/presentations/articles/gensim.md

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 # Gensim

-
 _Author: [Fabian Renz](mailto:[email protected])_

 ## TL;DR

docs/presentations/articles/hugging_face.md

Lines changed: 13 additions & 10 deletions
@@ -1,10 +1,10 @@
 # Hugging Face

-_Author: [Luis Nothvogel](mailto:[email protected])_
-
-## TL;DR
-
-Hugging Face has emerged as a pivotal player in the AI and machine learning arena, specializing in natural language processing (NLP). This article delves into its core offerings, including model hosting, spaces, datasets, pricing, and the Transformers API. Hugging Face is not only a repository for cutting-edge models but also a platform for collaboration and innovation in AI.
+_Author: [Luis Nothvogel](mailto:[email protected])_
+
+## TL;DR
+
+Hugging Face has emerged as a pivotal player in the AI and machine learning arena, specializing in natural language processing (NLP). This article delves into its core offerings, including model hosting, spaces, datasets, pricing, and the Transformers API. Hugging Face is not only a repository for cutting-edge models but also a platform for collaboration and innovation in AI.

 ## Model Hosting on Hugging Face

@@ -14,13 +14,14 @@ Hugging Face has made a name for itself in model hosting. It offers a vast repos
 from transformers import pipeline, set_seed

 # Example of using a pre-trained model
-generator = pipeline('text-generation', model='gpt2')
-set_seed(42)
-generated_texts = generator("The student worked on", max_length=30, num_return_sequences=2)
+generator = pipeline('text-generation', model='gpt2')
+set_seed(42)
+generated_texts = generator("The student worked on", max_length=30, num_return_sequences=2)
 print(generated_texts)
 ```

 This outputs the following:
+
 ```python
 [{'generated_text': 'The student worked on his paper, which you can read about here. You can get an ebook with that part, or an audiobook with some of'}, {'generated_text': 'The student worked on this particular task by making the same basic task in his head again and again, without the help of some external helper, even when'}]
 ```
@@ -34,14 +35,14 @@ Spaces are an innovative feature of Hugging Face, offering a collaborative envir
 The Hugging Face ecosystem includes a wide range of datasets, catering to different NLP tasks. The Datasets library simplifies the process of loading and processing data, ensuring efficiency and consistency in model training. According to Hugging Face, they host over 75k datasets.

 [Wikipedia reference](https://huggingface.co/datasets/wikimedia/wikipedia)
+
 ```python
 from datasets import load_dataset

 # Example of loading a dataset
 ds = load_dataset("wikimedia/wikipedia", "20231101.en")
 ```

-
 ## Transformers API: Transform Text Effortlessly

 The Transformers API is a testament to Hugging Face's innovation. This API simplifies the process of text transformation, making it accessible even to those with limited programming skills. It supports a variety of NLP tasks and can be integrated into various applications.
@@ -68,12 +69,14 @@ tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

 Hugging Face Inference plays a crucial role in turning trained language models into productive applications. The platform provides an intuitive and powerful infrastructure for inferencing models, which means that developers can easily access pre-trained models to generate real-time predictions for a wide range of NLP tasks. Thanks to its efficient implementation and support for hardware acceleration technologies, Hugging Face Inference enables the seamless integration of language models into applications ranging from chatbots to machine translation and sentiment analysis.

-The Inference API URL is always defined like this:
+The Inference API URL is always defined like this:
+
 ```python
 ENDPOINT = https://api-inference.huggingface.co/models/<MODEL_ID>
 ```

 Example in Python with gpt2:
+
 ```python
 import requests
 API_URL = "https://api-inference.huggingface.co/models/gpt2"
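For context, a minimal sketch of how such an Inference API request is typically completed with the `requests` library; the `HF_API_TOKEN` value is an assumed placeholder for a personal access token:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
HF_API_TOKEN = "hf_..."  # placeholder: a personal access token from the account settings
headers = {"Authorization": f"Bearer {HF_API_TOKEN}"}

def query(payload):
    # POST the input text to the hosted model and return the JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "The student worked on"}))
```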
