Commits (21)
fd257a6
fix: contribution
ImDoubD-datazip Sep 15, 2025
9f8e531
Modified the contribution.mdx and added the source.json for remaining…
nayanj98 Sep 15, 2025
9d53d2b
Checked and modified all the source connectors for the contribution s…
nayanj98 Sep 16, 2025
9b1199b
Added all the necessary information for the cli commands, also mentio…
nayanj98 Sep 16, 2025
611d640
Fixed some grammatical errors in the document
nayanj98 Sep 17, 2025
d3fa71c
fix: docker
ImDoubD-datazip Sep 17, 2025
7c23e88
fix: pull
ImDoubD-datazip Sep 17, 2025
43a28f8
Final changes done to the contribution page based on new source and d…
nayanj98 Sep 17, 2025
52a3b48
Modified the CLA content
nayanj98 Sep 18, 2025
23a631b
fix: changes
ImDoubD-datazip Sep 18, 2025
6a142be
fix: docker compose for source added
ImDoubD-datazip Sep 18, 2025
5930ce0
Modified the setting up a develeopment environment section and contri…
nayanj98 Sep 18, 2025
b4cf762
fix: Minor spelling mistakes and formatting fixes. With some other im…
shubham19may Sep 19, 2025
e205347
fix: changes
ImDoubD-datazip Sep 19, 2025
65d3599
fix: Fixed merged conflicts
shubham19may Sep 19, 2025
69c5b05
fix: modified minor thing in helm chart doc
shubham19may Sep 19, 2025
bcc0f25
fix: modified minor thing in helm chart doc
shubham19may Sep 19, 2025
8bc7a38
fix: spellings fixed
shubham19may Sep 19, 2025
72b1e0d
fix: spellings fixed
shubham19may Sep 19, 2025
ab1db10
fix: merge conflict fixes
shubham19may Sep 22, 2025
3bf6acf
fix: made more fixes
shubham19may Sep 22, 2025
10 changes: 5 additions & 5 deletions airflow/olake_sync_from_source.py
@@ -14,11 +14,11 @@
# This connection tells Airflow how to authenticate with your K8s cluster.
KUBERNETES_CONN_ID = "kubernetes_default" # <-- EDIT THIS LINE

# !!! IMPORTANT: Set this to the Kubernetes namespace where Olake pods should run !!!
# !!! IMPORTANT: Set this to the Kubernetes namespace where OLake pods should run !!!
# Ensure ConfigMaps and the PVC exist or will be created in this namespace.
TARGET_NAMESPACE = "olake" # <-- EDIT THIS LINE

# !!! IMPORTANT: Set this to the correct Olake image for your source database !!!
# !!! IMPORTANT: Set this to the correct OLake image for your source database !!!
# Find images at: https://hub.docker.com/u/olakego
# Examples: "olakego/source-mongodb:latest", "olakego/source-mysql:latest", "olakego/source-postgres:latest"
OLAKE_IMAGE = "olakego/source-db:latest" # <-- EDIT THIS LINE
@@ -54,9 +54,9 @@
# Generic tags
tags=["kubernetes", "olake", "etl", "sync"],
doc_md="""
### Olake Sync DAG
### OLake Sync DAG

This DAG runs the Olake `sync` command using pre-created ConfigMaps
This DAG runs the OLake `sync` command using pre-created ConfigMaps
for source, destination, and streams configuration. It ensures a persistent
volume claim exists before running the sync task.

@@ -249,7 +249,7 @@ def create_pvc_with_hook(**context):
),
],

# Use the container's default entrypoint (should be the Olake binary)
# Use the container's default entrypoint (should be the OLake binary)
cmds=None,
# Pass arguments for the 'sync' command
arguments=[
2 changes: 1 addition & 1 deletion airflow/olake_sync_from_source_ec2.py
@@ -237,7 +237,7 @@ def run_olake_docker_via_ssh(ti, ssh_conn_id, command):
fi
echo "INFO: State file uploaded successfully."

# Now check the Olake exit code
# Now check the OLake exit code
if [ $OLAKE_EXIT_CODE -ne 0 ]; then
echo "ERROR: ETL job failed with exit code $OLAKE_EXIT_CODE."
exit $OLAKE_EXIT_CODE
2 changes: 1 addition & 1 deletion blog/2025-01-07-olake-architecture.mdx
@@ -17,7 +17,7 @@ update: [18.02.2025]

When building [OLake](https://olake.io/), our goal was simple: *Fastest DB to Data LakeHouse (Apache Iceberg to start) data pipeline.*

Checkout GtiHub repository for OLake - [https://github.com/datazip-inc/olake](https://github.com/datazip-inc/olake)
Checkout GitHub repository for OLake - [https://github.com/datazip-inc/olake](https://github.com/datazip-inc/olake)

Over time, many of us who’ve worked with data pipelines have dealt with the toil of building one-off ETL scripts, battling performance bottlenecks, or worrying about vendor lock-in.

@@ -251,7 +251,7 @@ Essential monitoring includes:

### **Does OLake take care of Full-historical snapshot/replication before CDC? How fast is it?**

OLake has fastest optimised historical load:
OLake has fastest optimized historical load:
- OLake has Historical-load + CDC mode for this
- Tables are chunked into smaller pieces to make it parallel and recoverable from failures
- Any new table additions are also taken care of automatically.
44 changes: 22 additions & 22 deletions blog/2025-05-08-olake-airflow-on-ec2.mdx
@@ -14,7 +14,7 @@ tags: [olake]

At OLake, we're building tools to make data integration seamless. Today, we're excited to show you how to leverage your existing Apache Airflow setup to automate OLake data synchronization tasks directly on your EC2 Server!

Olake is designed to efficiently sync data from various sources to your chosen destinations. This guide provides an Airflow DAG (Directed Acyclic Graph) that orchestrates the Olake sync command by provisioning a dedicated EC2 instance, executing Olake within a Docker container and handling configuration and state persistence through Amazon S3.
OLake is designed to efficiently sync data from various sources to your chosen destinations. This guide provides an Airflow DAG (Directed Acyclic Graph) that orchestrates the OLake sync command by provisioning a dedicated EC2 instance, executing OLake within a Docker container and handling configuration and state persistence through Amazon S3.

This post assumes you already have:

@@ -46,7 +46,7 @@ Before deploying the DAG, ensure the following are in place:
* Click on the + icon to `Add a new record `
* Select the `Connection Type` to be `Amazon Web Services `
* Enter a `Connection Id` (this would be later used in `AWS_CONNECTION_ID` variable in the DAG)
* **(Important)** Either enter `AWS Access Key Id` and `AWS Secret Access Key` or user can just attach an AWS IAM Role to the Airflow instance (with sufficient permissions as below code snippet). If no Access Keys are used, default boto3 behaviour is used.
* **(Important)** Either enter `AWS Access Key Id` and `AWS Secret Access Key` or user can just attach an AWS IAM Role to the Airflow instance (with sufficient permissions as below code snippet). If no Access Keys are used, default boto3 behavior is used.
* Click **Save**.
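
The same connection can also be created from the Airflow CLI. A minimal sketch, assuming Airflow 2.x — the connection id, keys, and region below are placeholders you should replace with your own values:

```bash
# Placeholder connection id, credentials, and region — substitute your own
airflow connections add 'aws_olake' \
    --conn-type 'aws' \
    --conn-login 'YOUR_AWS_ACCESS_KEY_ID' \
    --conn-password 'YOUR_AWS_SECRET_ACCESS_KEY' \
    --conn-extra '{"region_name": "ap-south-1"}'
```

If you attach an IAM role to the Airflow instance instead, omit `--conn-login` and `--conn-password` and the default boto3 credential chain is used.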


@@ -132,7 +132,7 @@ Before deploying the DAG, ensure the following are in place:
```


* **SSH Connection (`SSH_CONNECTION_ID` in the DAG):** This connection allows Airflow to securely connect to the dynamically created EC2 instance to execute the Olake setup and run commands.
* **SSH Connection (`SSH_CONNECTION_ID` in the DAG):** This connection allows Airflow to securely connect to the dynamically created EC2 instance to execute the OLake setup and run commands.
* Still in the Airflow UI (`Admin` -> `Connections`), click the `+` icon to add another new record.
* Set the **Connection Type** to **SSH**.
* Enter a **Connection Id** (e.g., `ssh_ec2_olake`). This exact ID will be used for the `SSH_CONNECTION_ID` variable in your DAG.
@@ -172,11 +172,11 @@ Before deploying the DAG, ensure the following are in place:



#### 3. **Amazon S3 Setup for Olake Configurations and State:**
* **S3 Bucket (`S3_BUCKET_NAME` in the DAG):** Create an S3 bucket where Olake's configuration files and persistent state file will be stored.
* **S3 Prefix for Configurations (`S3_PREFIX` in the DAG):** Decide on a "folder" (S3 prefix) within your bucket where your Olake configuration files will reside (e.g., `olake/projectA/configs/`).
#### 3. **Amazon S3 Setup for OLake Configurations and State:**
* **S3 Bucket (`S3_BUCKET_NAME` in the DAG):** Create an S3 bucket where OLake's configuration files and persistent state file will be stored.
* **S3 Prefix for Configurations (`S3_PREFIX` in the DAG):** Decide on a "folder" (S3 prefix) within your bucket where your OLake configuration files will reside (e.g., `olake/projectA/configs/`).

* **Upload Olake Configuration Files:** Before running the DAG, you must upload your Olake `source.json`, `streams.json`, and `destination.json` files to the S3 bucket under the prefix you defined. The DAG's SSH script will sync these files to the EC2 instance. Please visit[ OLake Docs](https://olake.io/docs) website to learn how the[ source](https://olake.io/docs/connectors/overview) and[ destinations](https://olake.io/docs/writers/overview) can be set up.
* **Upload OLake Configuration Files:** Before running the DAG, you must upload your OLake `source.json`, `streams.json`, and `destination.json` files to the S3 bucket under the prefix you defined. The DAG's SSH script will sync these files to the EC2 instance. Please visit[ OLake Docs](https://olake.io/docs) website to learn how the[ source](https://olake.io/docs/connectors/overview) and[ destinations](https://olake.io/docs/writers/overview) can be set up.
Collaborator

The usage of source (singular) and destinations (plural) feels inconsistent.


We need to generate `streams.json` beforehand using the OLake `discover` command against your source database.
* Streams Generation Guides:
@@ -185,9 +185,9 @@
* The content of this file will be placed within the `streams.json` file.
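
As a rough sketch of those two steps — assuming a Postgres source, configuration files kept in `~/olake-config`, and a placeholder bucket/prefix; check the OLake docs for the exact `discover` invocation for your connector:

```bash
# Generate the catalog with the OLake discover command (save/confirm the output as streams.json)
docker run --rm \
  -v "$HOME/olake-config:/mnt/config" \
  olakego/source-postgres:latest \
  discover --config /mnt/config/source.json

# Upload the three configuration files to the S3 prefix the DAG syncs from
aws s3 cp "$HOME/olake-config/source.json"      s3://my-olake-bucket/olake/projectA/configs/
aws s3 cp "$HOME/olake-config/streams.json"     s3://my-olake-bucket/olake/projectA/configs/
aws s3 cp "$HOME/olake-config/destination.json" s3://my-olake-bucket/olake/projectA/configs/
```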

#### 4. **EC2 Instance IAM Role (`IAM_ROLE_NAME` in the DAG):**
The EC2 instances launched by Airflow (which will act as the worker nodes for Olake) need their own set of permissions to perform their tasks. This is achieved by assigning them an IAM Instance Profile. This instance profile must have an attached IAM policy granting permissions to:
* Access Amazon S3 to download Olake configuration files.
* Access Amazon S3 to read and write the Olake state file.
The EC2 instances launched by Airflow (which will act as the worker nodes for OLake) need their own set of permissions to perform their tasks. This is achieved by assigning them an IAM Instance Profile. This instance profile must have an attached IAM policy granting permissions to:
* Access Amazon S3 to download OLake configuration files.
* Access Amazon S3 to read and write the OLake state file.

```json
# s3_access_policy.json
@@ -290,7 +290,7 @@ OLAKE_IMAGE = "DOCKER_IMAGE_NAME"

## Recap of Values to Change:

To ensure the DAG runs correctly in your environment, you **must** update the following placeholder variables in the `olake_sync_from_source_ec2.py` (or your DAG file name) with your specific AWS and Olake details:
To ensure the DAG runs correctly in your environment, you **must** update the following placeholder variables in the `olake_sync_from_source_ec2.py` (or your DAG file name) with your specific AWS and OLake details:



@@ -303,19 +303,19 @@ To ensure the DAG runs correctly in your environment, you **must** update the fo
### **EC2 Instance Configuration:**

* `AMI_ID`: Replace with the actual AMI ID of a container-ready image (with Docker/containerd, aws-cli, jq) in your chosen `AWS_REGION_NAME`.
* `INSTANCE_TYPE`: (Optional) Select an appropriate EC2 instance type based on your Olake workload's resource needs (e.g., `t3.medium`, `m5.large`, or an ARM equivalent like `t4g.medium`). \
* `INSTANCE_TYPE`: (Optional) Select an appropriate EC2 instance type based on your OLake workload's resource needs (e.g., `t3.medium`, `m5.large`, or an ARM equivalent like `t4g.medium`). \
The AMI we have hardcoded is an EKS-supported Ubuntu image with containerd and aws-cli pre-installed, both of which are crucial for the DAG to work. Also note that the hardcoded AMI targets the ARM architecture, since Graviton-powered machines are cheaper than comparable x86 machines.
* `KEY_NAME`: Enter the name of the EC2 Key Pair you want to associate with the launched instances. This is the same key we have used while setting up the SSH Connection.
* `SUBNET_ID`: Provide the ID of the VPC subnet where the EC2 instance should be launched.
* `SECURITY_GROUP_ID`: Specify the ID of the Security Group that will be attached to the instance.
* `IAM_ROLE_NAME`: Enter the **name** (not the ARN) of the IAM Instance Profile that grants the EC2 instance necessary permissions (primarily S3 access).
* `DEFAULT_EC2_USER`: Change this if the default SSH username for your chosen `AMI_ID` is different from `ubuntu` (e.g., `ec2-user` for Amazon Linux).

### **ETL Configuration (S3 & Olake):**
### **ETL Configuration (S3 & OLake):**

* `S3_BUCKET_NAME`: The name of your S3 bucket where Olake configurations and state will be stored.
* `S3_BUCKET_PREFIX`: The "folder" path (prefix) within your S3 bucket for Olake files (e.g., `olake/projectA/configs/`). Remember the trailing slash if it's part of your intended structure.
* `OLAKE_IMAGE`: The full name of the Olake Docker image you want to use (e.g., `olakego/source-postgres:latest`, `olakego/source-mysql:latest`, `olakego/source-mongodb:latest`).
* `S3_BUCKET_NAME`: The name of your S3 bucket where OLake configurations and state will be stored.
* `S3_BUCKET_PREFIX`: The "folder" path (prefix) within your S3 bucket for OLake files (e.g., `olake/projectA/configs/`). Remember the trailing slash if it's part of your intended structure.
* `OLAKE_IMAGE`: The full name of the OLake Docker image you want to use (e.g., `olakego/source-postgres:latest`, `olakego/source-mysql:latest`, `olakego/source-mongodb:latest`).

### Deploying the DAG to Airflow

@@ -325,23 +325,23 @@
2. Place the file into the `dags` folder recognized by your Airflow instance. The location of this folder depends on your Airflow setup.
3. Airflow automatically scans this folder. Wait a minute or two, and the DAG named `olake_sync_from_source` should appear in the Airflow UI. You might need to unpause it (toggle button on the left) if it loads in a paused state.
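
For a default Airflow installation, this can be as simple as the sketch below — the dags folder path is an assumption; use whatever location your deployment maps:

```bash
# Copy the DAG into Airflow's dags folder and confirm it has been picked up
cp olake_sync_from_source_ec2.py "$AIRFLOW_HOME/dags/"
airflow dags list | grep olake_sync_from_source
```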

### Running Your Dynamic Olake Sync on EC2
### Running Your Dynamic OLake Sync on EC2

1. **Access Airflow UI:** Navigate to your Airflow web UI.
2. **Find and Unpause DAG:** Locate the DAG, likely named `olake_sync_from_source` (or whatever `dag_id` you've set). If it's paused, click the toggle to unpause it.
3. **Trigger the DAG:** Click the "Play" button (▶️) on the right side of the DAG listing to initiate a manual run. You can also configure a schedule string in the DAG file for automatic runs.
4. **Monitor the Run:** Click on the DAG run instance to view its progress in the Graph, Gantt, or Tree view. You will see the following sequence of tasks:
* `create_ec2_instance_task`: This task will begin first, using the AWS connection to launch a new EC2 instance according to your DAG's configuration (AMI, instance type, networking, IAM role). Airflow will wait for this instance to be in a 'running' state.
* `get_instance_ip_task`: Once the instance is running, this Python task will execute. It queries AWS to get the IP address or DNS name of the new EC2 instance, making it available for the next task. It also includes a pause to allow the SSH service on the new instance to become fully available.
* `run_olake_docker_task`: This is the core task where Olake runs. It will:
* `run_olake_docker_task`: This is the core task where OLake runs. It will:
* Connect to the newly created EC2 instance via SSH using the configured SSH connection.
* Execute the shell commands defined in `olake_ssh_command` within your DAG. This script prepares the EC2 instance by:
* Creating necessary directories.
* Downloading your Olake configuration files and the latest state file from S3.
* Pulling the specified Olake Docker image using `ctr image pull`.
* Running the Olake `sync` process inside a Docker container using `ctr run ... /home/olake sync ...`.
* Downloading your OLake configuration files and the latest state file from S3.
* Pulling the specified OLake Docker image using `ctr image pull`.
* Running the OLake `sync` process inside a Docker container using `ctr run ... /home/olake sync ...`.
* Uploading the updated state file back to S3 upon successful completion.
* You can click on this task instance in the Airflow UI and view its logs. These logs will contain the **real-time STDOUT and STDERR** from the SSH session on the EC2 instance, including the output from the Olake Docker container. This is where you'll see Olake's synchronization progress and any potential errors from the Olake process itself.
* You can click on this task instance in the Airflow UI and view its logs. These logs will contain the **real-time STDOUT and STDERR** from the SSH session on the EC2 instance, including the output from the OLake Docker container. This is where you'll see OLake's synchronization progress and any potential errors from the OLake process itself.
* `terminate_ec2_instance_task`: After the `run_olake_docker_task` completes (whether it succeeds or fails, due to `trigger_rule=TriggerRule.ALL_DONE`), this final task will execute. It securely terminates the EC2 instance that was launched for this DAG run, ensuring you don't incur unnecessary AWS charges.
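
For orientation, the steps performed by `olake_ssh_command` boil down to a short shell sequence on the instance. The sketch below is illustrative only — the bucket, prefix, image, and OLake `sync` flags are placeholders, not the exact script shipped with the DAG:

```bash
# Sketch: fetch configs and state, run the OLake sync via containerd, push state back
sudo mkdir -p /home/ubuntu/olake-config
aws s3 sync "s3://MY_BUCKET/olake/projectA/configs/" /home/ubuntu/olake-config/

sudo ctr image pull docker.io/olakego/source-postgres:latest
sudo ctr run --rm --net-host \
  --mount type=bind,src=/home/ubuntu/olake-config,dst=/mnt/config,options=rbind:rw \
  docker.io/olakego/source-postgres:latest olake-sync \
  /home/olake sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json \
  --state /mnt/config/state.json

# Persist the updated state file for the next run
aws s3 cp /home/ubuntu/olake-config/state.json "s3://MY_BUCKET/olake/projectA/configs/state.json"
```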

![olake-airflow-on-ec2-3](/img/blog/2025/05/olake-airflow-on-ec2-3.webp)
4 changes: 2 additions & 2 deletions blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx
@@ -39,7 +39,7 @@ Here's where things get really interesting. Unlike traditional ETL pipelines tha

## Step 1: Setting Up OLake - CDC Engine

Olake has one of its unique offerings the OLake UI, which we will be using for our setup. This is a user-friendly control center for managing data pipelines without relying heavily on CLI commands. It allows you to configure sources, destinations, and jobs visually, making the setup more accessible and less error-prone. Many organizations actively use OLake UI to reduce manual CLI work, streamline CDC pipelines, and adopt a no-code-friendly approach.
OLake has one of its unique offerings the OLake UI, which we will be using for our setup. This is a user-friendly control center for managing data pipelines without relying heavily on CLI commands. It allows you to configure sources, destinations, and jobs visually, making the setup more accessible and less error-prone. Many organizations actively use OLake UI to reduce manual CLI work, streamline CDC pipelines, and adopt a no-code-friendly approach.

For our setup, we will be working with the OLake UI. We'll start by cloning the repository from GitHub and bringing it up using Docker Compose. Once the UI is running, it will serve as our control hub for creating and monitoring all CDC pipelines.
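
A minimal sketch of that setup, assuming the `datazip-inc/olake-ui` repository and its bundled Docker Compose file:

```bash
# Clone the OLake UI repository and start it with Docker Compose
git clone https://github.com/datazip-inc/olake-ui.git
cd olake-ui
docker compose up -d
```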

@@ -79,7 +79,7 @@ Once it's running, go ahead at http://localhost:8000, olake-ui and use these cre

![olake-login](/img/blog/2025/10/olake-login.webp)

**You are greeted with Olake UI!**
**You are greeted with OLake UI!**

![olake-ui](/img/blog/2025/10/olakeui.webp)

4 changes: 2 additions & 2 deletions blog/2025-08-29-deploying-olake-on-kubernetes.mdx
@@ -162,7 +162,7 @@ global:
olake.io/workload-type: "memory-optimized"
456:
olake.io/workload-type: "general-purpose"
# Default scheduling behaviour
# Default scheduling behavior
789: {}
```

@@ -172,7 +172,7 @@ A typical enterprise scenario can be considered: a massive customer transactions

Without node mapping, both operations might be scheduled on the same node by Kubernetes, causing memory contention. Or worse, the memory-hungry sync job might be put on a small node where an out-of-memory error would cause it to fail.

With JobID-based mapping, the heavy sync is necessarily landed on a node with label `olake.io/workload-type: "memory-optimized"` where completion is achieved in 30 minutes instead of timing out. The other sync job are run happily on smaller, cheaper nodes, finishing without waste.
With JobID-based mapping, the heavy sync is necessarily landed on a node with label `olake.io/workload-type: "memory-optimized"` where completion is achieved in 30 minutes instead of timing out. The other sync jobs are run happily on smaller, cheaper nodes, finishing without waste.
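
For the mapping above to take effect, the target nodes must actually carry the referenced label. A sketch with placeholder node names:

```bash
# Label nodes so the JobID-based mapping can schedule onto them (node names are placeholders)
kubectl label node worker-large-01 olake.io/workload-type=memory-optimized
kubectl label node worker-small-01 olake.io/workload-type=general-purpose
```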

### The Progressive Advantage

2 changes: 1 addition & 1 deletion blog/authors.yml
@@ -85,7 +85,7 @@ akshay:
duke:
page: true
name: Duke
title: Olake Maintainer
title: OLake Maintainer
image_url: /img/authors/duke.webp
socials:
linkedin: dukedhal
1 change: 1 addition & 0 deletions docs/community/setting-up-a-dev-env.mdx
@@ -727,6 +727,7 @@ Alternatively, you can generate the jar file by running the ./build.sh sync comm
|`mode` | `auto`, `debug` |
| `args` | `sync` , `discover`, `check` |

Update `PATH_TO_UPDATE` with the absolute path where the OLake project is located on your system. For example:
Update `workspaceFolder` with the absolute path where the OLake project is located on your system. For example:

```json
2 changes: 1 addition & 1 deletion docs/connectors/mongodb/cdc_setup.mdx
@@ -51,7 +51,7 @@ This guide covers setting up Change Data Capture (CDC) for both self-hosted Mong

**Applicable for both MongoDB (Self-Hosted) and Atlas.**

Olake needs a user that can:
OLake needs a user that can:
- Read/write your application database (to ingest data).
- Read from the local database (where oplog is stored).
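
A minimal sketch of creating such a user with `mongosh` — the user name, password, and application database name are placeholders, and your deployment may require additional roles:

```bash
# Create a user with read/write on the application DB and read on local (for the oplog)
mongosh "mongodb://admin:ADMIN_PASSWORD@localhost:27017/admin" --eval '
  db.getSiblingDB("admin").createUser({
    user: "olake_user",
    pwd: "CHANGE_ME",
    roles: [
      { role: "readWrite", db: "appdb" },
      { role: "read", db: "local" }
    ]
  })
'
```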
