Commit 1088e57

rdettai and fmassot authored
Create lambda infrastructure (#3830)
* Create test Lambda calling into QW to get its version
* Switch to provided AL image
* Revert Dockerfile changes
* Add querying and indexing
* Add querying and indexing
* Rename querier to searcher to align with existing terminology
* Fix failing CI tests
* Use S3 store source instead of bundled
* Refactor binaries to separate bin dir
* Fork the CLI local search and index methods
* Create index if not found
* Add flexible indexing and query inputs
* Add instance of Quickwit lambda with mock data generation
* Log end of indexing
* Add benchmark commands
* wip: trying to setup tracing
* Add root trace
* Try fix trace flushing
* Fix merge disabling
* Cache and better tracing
* Add payload to context span
* Use API Gateway events in search
* Log as json and extract logs from cloudwatch
* Add histogram oneshot search example
* Fix errors due to rebase
* Improve example queries
* Expose config to disable partial_request_cache_capacity
* Improve benchmark script
* Document the setup of an API Gateway
* Add partial request cache to bench
* Add packaging workflow
* Address review regarding hdfs index config
* Fix rebase errors
* Fix CI errors
* Add mypy linter
* Improve release tag name
* Try using package pip install in CI
* Update lambda package version
* Enable download using uploaded artifact
* Add API Gateway construct and refactor cdk code
* Add staging lifecycle rule
* Fix SpawnPipeline field after rebase
* Fix unused rust-toolchain.toml
* Add tutorial
* Final cleanup
* Fix after rebase
* Fix fmt in quickwit-lambda
* Handle gzip file source.
* Fix clippy.
* Update github action.
* Add telemetry.
* Apply new versioning and skip hdfs decompression
* Add comment in file source.
* Add test on skip reader.
* Take review comments into account.
* Fix rebase.

---------

Co-authored-by: fmassot <[email protected]>
1 parent 6d95fae commit 1088e57

44 files changed, +4036 -35 lines changed
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
name: Build and publish AWS Lambda packages

on:
  push:
    tags:
      - "lambda-beta-*"

jobs:
  build-lambdas:
    name: Build Quickwit Lambdas
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Ubuntu packages
        run: sudo apt-get -y install protobuf-compiler python3 python3-pip
      - name: Install rustup
        run: curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain none -y
      - name: Install python dependencies
        run: pip install ./distribution/lambda
      - name: Mypy lint
        run: mypy distribution/lambda/

      - name: Extract asset version of release
        run: echo "QW_LAMBDA_VERSION=${GITHUB_REF/refs\/tags\//}" >> $GITHUB_ENV
        if: ${{ github.event_name == 'push' }}
      - name: Retrieve and export commit date, hash, and tags
        run: |
          echo "QW_COMMIT_DATE=$(TZ=UTC0 git log -1 --format=%cd --date=format-local:%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_ENV
          echo "QW_COMMIT_HASH=$(git rev-parse HEAD)" >> $GITHUB_ENV
          echo "QW_COMMIT_TAGS=$(git tag --points-at HEAD | tr '\n' ',')" >> $GITHUB_ENV
      - name: Build Quickwit Lambdas
        run: make package
        env:
          QW_COMMIT_DATE: ${{ env.QW_COMMIT_DATE }}
          QW_COMMIT_HASH: ${{ env.QW_COMMIT_HASH }}
          QW_COMMIT_TAGS: ${{ env.QW_COMMIT_TAGS }}
          QW_LAMBDA_BUILD: 1
        working-directory: ./distribution/lambda
      - name: Extract package locations
        run: |
          echo "SEARCHER_PACKAGE_LOCATION=./distribution/lambda/$(make searcher-package-path)" >> $GITHUB_ENV
          echo "INDEXER_PACKAGE_LOCATION=./distribution/lambda/$(make indexer-package-path)" >> $GITHUB_ENV
        working-directory: ./distribution/lambda
      - name: Upload Lambda archives
        uses: quickwit-inc/upload-to-github-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          file: ${{ env.SEARCHER_PACKAGE_LOCATION }};${{ env.INDEXER_PACKAGE_LOCATION }}
          overwrite: true
          draft: true
          tag_name: aws-${{ env.QW_LAMBDA_VERSION }}
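
The workflow above derives the release tag from the pushed Git ref in two steps: a Bash substitution strips `refs/tags/`, then the upload step prefixes the result with `aws-`. A minimal Python sketch of that string handling, for illustration only (the function name `release_tag_from_ref` is hypothetical and not part of this commit):

```python
def release_tag_from_ref(github_ref: str) -> tuple[str, str]:
    """Return (QW_LAMBDA_VERSION, release tag name) for a pushed tag ref.

    Hypothetical helper that only mirrors the string handling in the workflow.
    """
    # Mirrors the Bash expansion ${GITHUB_REF/refs\/tags\//}:
    # remove the first occurrence of "refs/tags/".
    version = github_ref.replace("refs/tags/", "", 1)
    # Mirrors `tag_name: aws-${{ env.QW_LAMBDA_VERSION }}`.
    return version, f"aws-{version}"


if __name__ == "__main__":
    print(release_tag_from_ref("refs/tags/lambda-beta-01"))
```

Under these assumptions, pushing the tag `lambda-beta-01` yields a draft release named `aws-lambda-beta-01`, which matches the `aws-lambda-$(QW_LAMBDA_VERSION)` segment of the download URL in the Makefile when `QW_LAMBDA_VERSION` is `beta-01`.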

distribution/lambda/.gitignore

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
*.swp
package-lock.json
__pycache__
.pytest_cache
.venv
*.egg-info
build/
.mypy_cache

# CDK asset staging directory
.cdk.staging
cdk.out

# AWS SAM build directory
.aws-sam

# Benchmark output files
*.log

distribution/lambda/Makefile

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
.SILENT:
.ONESHELL:
SHELL := bash
.SHELLFLAGS := -eu -o pipefail -c

# Update this when cutting a new release
QW_LAMBDA_VERSION?=beta-01
PACKAGE_BASE_URL=https://github.com/quickwit-oss/quickwit/releases/download/aws-lambda-$(QW_LAMBDA_VERSION)/
SEARCHER_PACKAGE_FILE=quickwit-lambda-searcher-$(QW_LAMBDA_VERSION)-x86_64.zip
INDEXER_PACKAGE_FILE=quickwit-lambda-indexer-$(QW_LAMBDA_VERSION)-x86_64.zip
export SEARCHER_PACKAGE_PATH=cdk.out/$(SEARCHER_PACKAGE_FILE)
export INDEXER_PACKAGE_PATH=cdk.out/$(INDEXER_PACKAGE_FILE)

check-env:
ifndef CDK_ACCOUNT
	$(error CDK_ACCOUNT is undefined)
endif
ifndef CDK_REGION
	$(error CDK_REGION is undefined)
endif

# Build or download the packages from the release page
# - Download by default, the version can be set with QW_LAMBDA_VERSION
# - To build locally, set QW_LAMBDA_BUILD=1
package:
	mkdir -p cdk.out
	if [ "$${QW_LAMBDA_BUILD:-0}" = "1" ]
	then
		pushd ../../quickwit/
		cargo lambda build \
			-p quickwit-lambda \
			--release \
			--output-format zip \
			--target x86_64-unknown-linux-gnu
		popd
		cp -u ../../quickwit/target/lambda/searcher/bootstrap.zip $(SEARCHER_PACKAGE_PATH)
		cp -u ../../quickwit/target/lambda/indexer/bootstrap.zip $(INDEXER_PACKAGE_PATH)
	else
		if ! [ -f $(SEARCHER_PACKAGE_PATH) ]; then
			echo "Downloading package $(PACKAGE_BASE_URL)$(SEARCHER_PACKAGE_FILE)"
			curl -C - -Ls -o $(SEARCHER_PACKAGE_PATH) $(PACKAGE_BASE_URL)$(SEARCHER_PACKAGE_FILE)
		else
			echo "Using cached package $(SEARCHER_PACKAGE_PATH)"
		fi
		if ! [ -f $(INDEXER_PACKAGE_PATH) ]; then
			echo "Downloading package $(PACKAGE_BASE_URL)$(INDEXER_PACKAGE_FILE)"
			curl -C - -Ls -o $(INDEXER_PACKAGE_PATH) $(PACKAGE_BASE_URL)$(INDEXER_PACKAGE_FILE)
		else
			echo "Using cached package $(INDEXER_PACKAGE_PATH)"
		fi
	fi

indexer-package-path:
	echo -n $(INDEXER_PACKAGE_PATH)

searcher-package-path:
	echo -n $(SEARCHER_PACKAGE_PATH)

bootstrap: package check-env
	cdk bootstrap aws://$$CDK_ACCOUNT/$$CDK_REGION

deploy-hdfs: package check-env
	cdk deploy -a cdk/app.py HdfsStack

deploy-mock-data: package check-env
	cdk deploy -a cdk/app.py MockDataStack

destroy-hdfs:
	cdk destroy -a cdk/app.py HdfsStack

destroy-mock-data:
	cdk destroy -a cdk/app.py MockDataStack

clean:
	rm -rf cdk.out

## Invocation examples

invoke-mock-data-searcher: check-env
	python -c 'from cdk import cli; cli.invoke_mock_data_searcher()'

invoke-hdfs-indexer: check-env
	python -c 'from cdk import cli; cli.upload_hdfs_src_file()'
	python -c 'from cdk import cli; cli.invoke_hdfs_indexer()'

invoke-hdfs-searcher-term: check-env
	python -c 'from cdk import cli; cli.invoke_hdfs_searcher("""{"query": "severity_text:ERROR", "max_hits": 10}""")'

invoke-hdfs-searcher-histogram: check-env
	python -c 'from cdk import cli; cli.invoke_hdfs_searcher("""{ "query": "*", "max_hits": 0, "aggs": { "events": { "date_histogram": { "field": "timestamp", "fixed_interval": "1d" }, "aggs": { "log_level": { "terms": { "size": 10, "field": "severity_text", "order": { "_count": "desc" } } } } } } }""")'

bench-index:
	mem_sizes=( 10240 8192 6144 4096 3072 2048 )
	export QW_LAMBDA_DISABLE_MERGE=true
	for mem_size in "$${mem_sizes[@]}"
	do
		export INDEXER_MEMORY_SIZE=$${mem_size}
		$(MAKE) deploy-hdfs
		python -c 'from cdk import cli; cli.benchmark_hdfs_indexing()'
	done

bench-search-term:
	mem_sizes=( 1024 2048 4096 8192 )
	for mem_size in "$${mem_sizes[@]}"
	do
		export SEARCHER_MEMORY_SIZE=$${mem_size}
		$(MAKE) deploy-hdfs
		python -c 'from cdk import cli; cli.benchmark_hdfs_search("""{"query": "severity_text:ERROR", "max_hits": 10}""")'
	done

bench-search-histogram:
	mem_sizes=( 1024 2048 4096 8192 )
	for mem_size in "$${mem_sizes[@]}"
	do
		export SEARCHER_MEMORY_SIZE=$${mem_size}
		$(MAKE) deploy-hdfs
		python -c 'from cdk import cli; cli.benchmark_hdfs_search("""{ "query": "*", "max_hits": 0, "aggs": { "events": { "date_histogram": { "field": "timestamp", "fixed_interval": "1d" }, "aggs": { "log_level": { "terms": { "size": 10, "field": "severity_text", "order": { "_count": "desc" } } } } } } }""")'
	done

bench-search:
	for run in {1..30}
	do
		export QW_LAMBDA_DISABLE_SEARCH_CACHE=true
		$(MAKE) bench-search-term
		$(MAKE) bench-search-histogram
		export QW_LAMBDA_DISABLE_SEARCH_CACHE=false
		export QW_LAMBDA_PARTIAL_REQUEST_CACHE_CAPACITY=0
		$(MAKE) bench-search-term
		$(MAKE) bench-search-histogram
		export QW_LAMBDA_DISABLE_SEARCH_CACHE=false
		export QW_LAMBDA_PARTIAL_REQUEST_CACHE_CAPACITY=64MB
		$(MAKE) bench-search-term
		$(MAKE) bench-search-histogram
	done
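
The package variables at the top of the Makefile compose a release download URL and a local cache path from `QW_LAMBDA_VERSION`. As a rough illustration, the same composition can be sketched in Python. The helper `package_names` is hypothetical and exists only to make the naming scheme explicit; the URL pieces and file name pattern are copied from the Makefile.

```python
from pathlib import PurePosixPath

# Copied from PACKAGE_BASE_URL in the Makefile, without the version segment.
RELEASES_URL = "https://github.com/quickwit-oss/quickwit/releases/download"


def package_names(role: str, version: str = "beta-01") -> dict[str, str]:
    """Compose the file name, download URL, and local path for one package.

    `role` is "searcher" or "indexer"; this helper is hypothetical.
    """
    file_name = f"quickwit-lambda-{role}-{version}-x86_64.zip"
    return {
        "file": file_name,
        # Equivalent of $(PACKAGE_BASE_URL)$(SEARCHER_PACKAGE_FILE).
        "url": f"{RELEASES_URL}/aws-lambda-{version}/{file_name}",
        # Equivalent of cdk.out/$(SEARCHER_PACKAGE_FILE).
        "path": str(PurePosixPath("cdk.out") / file_name),
    }


if __name__ == "__main__":
    for role in ("searcher", "indexer"):
        print(package_names(role))
```

Because the `package` target only downloads when the local path does not exist yet, changing `QW_LAMBDA_VERSION` changes the file name and therefore triggers a fresh download on the next run.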

distribution/lambda/README.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@

# CDK template for running Quickwit on AWS Lambda

## Prerequisites

- Install the AWS CDK Toolkit (`cdk` command)
  - `npm install -g aws-cdk`
- Ensure `curl` and `make` are installed
- To run the invocation example `make` commands, you will also need Python 3.10
  or later and `pip` installed (see [Python venv](#python-venv) below).

## AWS Lambda service quotas

For newly created AWS accounts, a conservative quota of 10 concurrent executions
is applied to Lambda in each individual region. If that's the case, CDK won't be
able to apply the reserved concurrency of the indexing Quickwit lambda. You can
increase the quota without charge using the [Service Quotas
console](https://console.aws.amazon.com/servicequotas/home/services/lambda/quotas).

> **Note:** The request can take hours or even days to be processed.

## Python venv

This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
executable in your path with access to the `venv` package. If for any reason the
automatic creation of the virtualenv fails, you can create the virtualenv
manually.

To manually create a virtualenv on macOS and Linux:

```bash
python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```bash
source .venv/bin/activate
```

Once the virtualenv is activated, you can install the required dependencies.

```bash
pip install .
```

If you prefer using Poetry, achieve the same by running:
```bash
poetry shell
poetry install
```

## Example stacks

Provided demonstration setups:
- HDFS example data: index the [HDFS
  dataset](https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants-10000.json)
  by triggering the Quickwit lambda manually.
- Mock Data generator: start a mock data generator lambda that pushes mock JSON
  data every X minutes to S3. Those files trigger the Quickwit indexer lambda
  automatically.

## Deploy and run

The Makefile is a useful entry point that shows how the Lambda deployment can be used.

Configure your shell and AWS account:
```bash
# replace with your AWS account ID (12 digits) and preferred region
export CDK_ACCOUNT=123456789012
export CDK_REGION=us-east-1
make bootstrap
```

Deploy, index and query the HDFS dataset:
```bash
make deploy-hdfs
make invoke-hdfs-indexer
make invoke-hdfs-searcher-term
```

Deploy the mock data generator and query the indexed data:
```bash
make deploy-mock-data
# wait a few minutes...
make invoke-mock-data-searcher
```

## Set up a search API

You can configure an HTTP API endpoint around the Quickwit Searcher Lambda. The
mock data example stack shows such a configuration. The API Gateway is enabled
when the `SEARCHER_API_KEY` environment variable is set:

```bash
SEARCHER_API_KEY=my-at-least-20-char-long-key make deploy-mock-data
```

> [!WARNING]
> The API key is stored in plain text in the CDK stack. For a real-world
> deployment, the key should be fetched from something like [AWS Secrets
> Manager](https://docs.aws.amazon.com/cdk/v2/guide/get_secrets_manager_value.html).

Note that the response is always gzip-compressed, regardless of the
`Accept-Encoding` request header:

```bash
curl -d '{"query":"quantity:>5", "max_hits": 10}' -H "Content-Type: application/json" -H "x-api-key: my-at-least-20-char-long-key" -X POST https://{api_id}.execute-api.{region}.amazonaws.com/api/v1/mock-sales/search --compressed
```

## Useful CDK commands

 * `cdk ls`          lists all stacks in the app
 * `cdk synth`       emits the synthesized CloudFormation template
 * `cdk deploy`      deploys this stack to your default AWS account/region
 * `cdk diff`        compares the deployed stack with the current state
 * `cdk docs`        opens the CDK documentation
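
The `curl` call in the "Set up a search API" section of the README can also be sketched with the Python standard library. This is a rough illustration, not code from the commit: the helper names `build_search_request` and `decode_search_response` are hypothetical, and the `api_id`, `region`, and API key values are placeholders you would replace with those of your own deployment. Since the README states that the response is always gzip-compressed, the body is decompressed unconditionally.

```python
import gzip
import json
import urllib.request


def build_search_request(api_id: str, region: str, api_key: str,
                         query: str, max_hits: int = 10,
                         index_id: str = "mock-sales") -> urllib.request.Request:
    """Build the POST request that the README's curl example sends."""
    url = (f"https://{api_id}.execute-api.{region}.amazonaws.com"
           f"/api/v1/{index_id}/search")
    body = json.dumps({"query": query, "max_hits": max_hits}).encode("utf-8")
    headers = {"Content-Type": "application/json", "x-api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def decode_search_response(raw: bytes) -> dict:
    """Decompress the gzip body and parse it as JSON."""
    return json.loads(gzip.decompress(raw))


if __name__ == "__main__":
    # Placeholder values: substitute the ones from your deployed stack.
    request = build_search_request("abc123", "us-east-1",
                                   "my-at-least-20-char-long-key", "quantity:>5")
    print(request.get_method(), request.full_url)
    # To actually send it against a deployed stack:
    # with urllib.request.urlopen(request) as response:
    #     print(decode_search_response(response.read()))
```

Separating request construction from sending keeps the sketch runnable without network access or a deployed stack.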

distribution/lambda/cdk/__init__.py

Whitespace-only changes.

distribution/lambda/cdk/app.py

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
#!/usr/bin/env python3
import os
from typing import Literal

import aws_cdk as cdk

from cdk.stacks.services.quickwit_service import DEFAULT_LAMBDA_MEMORY_SIZE
from cdk.stacks.examples.hdfs_stack import HdfsStack
from cdk.stacks.examples.mock_data_stack import MockDataStack

HDFS_STACK_NAME = "HdfsStack"
MOCK_DATA_STACK_NAME = "MockDataStack"


def package_location_from_env(type: Literal["searcher"] | Literal["indexer"]) -> str:
    path_var = f"{type.upper()}_PACKAGE_PATH"
    if path_var in os.environ:
        return os.environ[path_var]
    else:
        print(
            f"Could not infer the {type} package location. Configure it using the {path_var} environment variable"
        )
        exit(1)


app = cdk.App()

HdfsStack(
    app,
    HDFS_STACK_NAME,
    env=cdk.Environment(
        account=os.getenv("CDK_ACCOUNT"), region=os.getenv("CDK_REGION")
    ),
    indexer_memory_size=int(
        os.environ.get("INDEXER_MEMORY_SIZE", DEFAULT_LAMBDA_MEMORY_SIZE)
    ),
    searcher_memory_size=int(
        os.environ.get("SEARCHER_MEMORY_SIZE", DEFAULT_LAMBDA_MEMORY_SIZE)
    ),
    indexer_package_location=package_location_from_env("indexer"),
    searcher_package_location=package_location_from_env("searcher"),
)

MockDataStack(
    app,
    MOCK_DATA_STACK_NAME,
    env=cdk.Environment(
        account=os.getenv("CDK_ACCOUNT"), region=os.getenv("CDK_REGION")
    ),
    indexer_package_location=package_location_from_env("indexer"),
    searcher_package_location=package_location_from_env("searcher"),
    search_api_key=os.getenv("SEARCHER_API_KEY", None),
)

app.synth()
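
`app.py` reads its configuration from environment variables that the Makefile exports (`SEARCHER_PACKAGE_PATH`, `INDEXER_PACKAGE_PATH`) or that benchmark targets set (`INDEXER_MEMORY_SIZE`, `SEARCHER_MEMORY_SIZE`). The sketch below shows one way to exercise that lookup logic without `aws_cdk` installed, by passing the environment in as a mapping. It is an illustration under assumptions: the helper names and `MissingPackagePath` are hypothetical, and `PLACEHOLDER_DEFAULT_MEMORY_SIZE` is a made-up stand-in, since the real value of `DEFAULT_LAMBDA_MEMORY_SIZE` is not shown in this diff.

```python
from collections.abc import Mapping

# Made-up stand-in: the real DEFAULT_LAMBDA_MEMORY_SIZE is defined elsewhere
# in the commit and its value is not visible here.
PLACEHOLDER_DEFAULT_MEMORY_SIZE = 1024


class MissingPackagePath(Exception):
    """Raised when the expected *_PACKAGE_PATH variable is not set."""


def package_location(env: Mapping[str, str], role: str) -> str:
    """Look up SEARCHER_PACKAGE_PATH or INDEXER_PACKAGE_PATH in `env`."""
    path_var = f"{role.upper()}_PACKAGE_PATH"
    try:
        return env[path_var]
    except KeyError:
        # The original script prints a message and exits; raising instead
        # keeps this sketch easy to test.
        raise MissingPackagePath(
            f"Could not infer the {role} package location. "
            f"Configure it using the {path_var} environment variable"
        ) from None


def memory_size(env: Mapping[str, str], role: str,
                default: int = PLACEHOLDER_DEFAULT_MEMORY_SIZE) -> int:
    """Read INDEXER_MEMORY_SIZE or SEARCHER_MEMORY_SIZE, falling back to `default`."""
    return int(env.get(f"{role.upper()}_MEMORY_SIZE", default))


if __name__ == "__main__":
    import os
    print(memory_size(os.environ, "indexer"))
```

Taking the environment as a parameter is a design choice of this sketch only; the script in the commit reads `os.environ` directly.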
