Commits (21)
fd257a6
fix: contribution
ImDoubD-datazip Sep 15, 2025
9f8e531
Modified the contribution.mdx and added the source.json for remaining…
nayanj98 Sep 15, 2025
9d53d2b
Checked and modified all the source connectors for the contribution s…
nayanj98 Sep 16, 2025
9b1199b
Added all the necessary information for the cli commands, also mentio…
nayanj98 Sep 16, 2025
611d640
Fixed some grammatical errors in the document
nayanj98 Sep 17, 2025
d3fa71c
fix: docker
ImDoubD-datazip Sep 17, 2025
7c23e88
fix: pull
ImDoubD-datazip Sep 17, 2025
43a28f8
Final changes done to the contribution page based on new source and d…
nayanj98 Sep 17, 2025
52a3b48
Modified the CLA content
nayanj98 Sep 18, 2025
23a631b
fix: changes
ImDoubD-datazip Sep 18, 2025
6a142be
fix: docker compose for source added
ImDoubD-datazip Sep 18, 2025
5930ce0
Modified the setting up a develeopment environment section and contri…
nayanj98 Sep 18, 2025
b4cf762
fix: Minor spelling mistakes and formatting fixes. With some other im…
shubham19may Sep 19, 2025
e205347
fix: changes
ImDoubD-datazip Sep 19, 2025
65d3599
fix: Fixed merged conflicts
shubham19may Sep 19, 2025
69c5b05
fix: modified minor thing in helm chart doc
shubham19may Sep 19, 2025
bcc0f25
fix: modified minor thing in helm chart doc
shubham19may Sep 19, 2025
8bc7a38
fix: spellings fixed
shubham19may Sep 19, 2025
72b1e0d
fix: spellings fixed
shubham19may Sep 19, 2025
ab1db10
fix: merge conflict fixes
shubham19may Sep 22, 2025
3bf6acf
fix: made more fixes
shubham19may Sep 22, 2025
10 changes: 5 additions & 5 deletions airflow/olake_sync_from_source.py
@@ -14,11 +14,11 @@
# This connection tells Airflow how to authenticate with your K8s cluster.
KUBERNETES_CONN_ID = "kubernetes_default" # <-- EDIT THIS LINE

# !!! IMPORTANT: Set this to the Kubernetes namespace where Olake pods should run !!!
# !!! IMPORTANT: Set this to the Kubernetes namespace where OLake pods should run !!!
# Ensure ConfigMaps and the PVC exist or will be created in this namespace.
TARGET_NAMESPACE = "olake" # <-- EDIT THIS LINE

# !!! IMPORTANT: Set this to the correct Olake image for your source database !!!
# !!! IMPORTANT: Set this to the correct OLake image for your source database !!!
# Find images at: https://hub.docker.com/u/olakego
# Examples: "olakego/source-mongodb:latest", "olakego/source-mysql:latest", "olakego/source-postgres:latest"
OLAKE_IMAGE = "olakego/source-db:latest" # <-- EDIT THIS LINE
@@ -54,9 +54,9 @@
# Generic tags
tags=["kubernetes", "olake", "etl", "sync"],
doc_md="""
### Olake Sync DAG
### OLake Sync DAG

This DAG runs the Olake `sync` command using pre-created ConfigMaps
This DAG runs the OLake `sync` command using pre-created ConfigMaps
for source, destination, and streams configuration. It ensures a persistent
volume claim exists before running the sync task.

@@ -249,7 +249,7 @@ def create_pvc_with_hook(**context):
),
],

# Use the container's default entrypoint (should be the Olake binary)
# Use the container's default entrypoint (should be the OLake binary)
cmds=None,
# Pass arguments for the 'sync' command
arguments=[
2 changes: 1 addition & 1 deletion airflow/olake_sync_from_source_ec2.py
@@ -237,7 +237,7 @@ def run_olake_docker_via_ssh(ti, ssh_conn_id, command):
fi
echo "INFO: State file uploaded successfully."

# Now check the Olake exit code
# Now check the OLake exit code
if [ $OLAKE_EXIT_CODE -ne 0 ]; then
echo "ERROR: ETL job failed with exit code $OLAKE_EXIT_CODE."
exit $OLAKE_EXIT_CODE
2 changes: 1 addition & 1 deletion blog/2025-01-07-olake-architecture.mdx
@@ -17,7 +17,7 @@ update: [18.02.2025]

When building [OLake](https://olake.io/), our goal was simple: *Fastest DB to Data LakeHouse (Apache Iceberg to start) data pipeline.*

Checkout GtiHub repository for OLake - [https://github.com/datazip-inc/olake](https://github.com/datazip-inc/olake)
Checkout GitHub repository for OLake - [https://github.com/datazip-inc/olake](https://github.com/datazip-inc/olake)

Over time, many of us who’ve worked with data pipelines have dealt with the toil of building one-off ETL scripts, battling performance bottlenecks, or worrying about vendor lock-in.

@@ -251,7 +251,7 @@ Essential monitoring includes:

### **Does OLake take care of Full-historical snapshot/replication before CDC? How fast is it?**

OLake has fastest optimised historical load:
OLake has fastest optimized historical load:
- OLake has Historical-load + CDC mode for this
- Tables are chunked into smaller pieces to make it parallel and recoverable from failures
- Any new table additions are also taken care of automatically.
44 changes: 22 additions & 22 deletions blog/2025-05-08-olake-airflow-on-ec2.mdx
@@ -14,7 +14,7 @@ tags: [olake]

At OLake, we're building tools to make data integration seamless. Today, we're excited to show you how to leverage your existing Apache Airflow setup to automate OLake data synchronization tasks directly on your EC2 Server!

Olake is designed to efficiently sync data from various sources to your chosen destinations. This guide provides an Airflow DAG (Directed Acyclic Graph) that orchestrates the Olake sync command by provisioning a dedicated EC2 instance, executing Olake within a Docker container and handling configuration and state persistence through Amazon S3.
OLake is designed to efficiently sync data from various sources to your chosen destinations. This guide provides an Airflow DAG (Directed Acyclic Graph) that orchestrates the OLake sync command by provisioning a dedicated EC2 instance, executing OLake within a Docker container and handling configuration and state persistence through Amazon S3.

This post assumes you already have:

@@ -46,7 +46,7 @@ Before deploying the DAG, ensure the following are in place:
* Click on the + icon to `Add a new record `
* Select the `Connection Type` to be `Amazon Web Services `
* Enter a `Connection Id` (this would be later used in `AWS_CONNECTION_ID` variable in the DAG)
* **(Important)** Either enter `AWS Access Key Id` and `AWS Secret Access Key` or user can just attach an AWS IAM Role to the Airflow instance (with sufficient permissions as below code snippet). If no Access Keys are used, default boto3 behaviour is used.
* **(Important)** Either enter `AWS Access Key Id` and `AWS Secret Access Key` or user can just attach an AWS IAM Role to the Airflow instance (with sufficient permissions as below code snippet). If no Access Keys are used, default boto3 behavior is used.
* Click **Save**.
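
The same connection can also be created from the Airflow CLI. A minimal sketch, assuming Airflow 2.x — the connection id, keys, and region below are placeholders you should replace with your own values:

```bash
# Placeholder connection id, credentials, and region — substitute your own
airflow connections add 'aws_olake' \
    --conn-type 'aws' \
    --conn-login 'YOUR_AWS_ACCESS_KEY_ID' \
    --conn-password 'YOUR_AWS_SECRET_ACCESS_KEY' \
    --conn-extra '{"region_name": "ap-south-1"}'
```

If you attach an IAM role to the Airflow instance instead, omit `--conn-login` and `--conn-password` and the default boto3 credential chain is used.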


@@ -132,7 +132,7 @@ Before deploying the DAG, ensure the following are in place:
```


* **SSH Connection (`SSH_CONNECTION_ID` in the DAG):** This connection allows Airflow to securely connect to the dynamically created EC2 instance to execute the Olake setup and run commands.
* **SSH Connection (`SSH_CONNECTION_ID` in the DAG):** This connection allows Airflow to securely connect to the dynamically created EC2 instance to execute the OLake setup and run commands.
* Still in the Airflow UI (`Admin` -> `Connections`), click the `+` icon to add another new record.
* Set the **Connection Type** to **SSH**.
* Enter a **Connection Id** (e.g., `ssh_ec2_olake`). This exact ID will be used for the `SSH_CONNECTION_ID` variable in your DAG.
@@ -172,11 +172,11 @@ Before deploying the DAG, ensure the following are in place:



#### 3. **Amazon S3 Setup for Olake Configurations and State:**
* **S3 Bucket (`S3_BUCKET_NAME` in the DAG):** Create an S3 bucket where Olake's configuration files and persistent state file will be stored.
* **S3 Prefix for Configurations (`S3_PREFIX` in the DAG):** Decide on a "folder" (S3 prefix) within your bucket where your Olake configuration files will reside (e.g., `olake/projectA/configs/`).
#### 3. **Amazon S3 Setup for OLake Configurations and State:**
* **S3 Bucket (`S3_BUCKET_NAME` in the DAG):** Create an S3 bucket where OLake's configuration files and persistent state file will be stored.
* **S3 Prefix for Configurations (`S3_PREFIX` in the DAG):** Decide on a "folder" (S3 prefix) within your bucket where your OLake configuration files will reside (e.g., `olake/projectA/configs/`).

* **Upload Olake Configuration Files:** Before running the DAG, you must upload your Olake `source.json`, `streams.json`, and `destination.json` files to the S3 bucket under the prefix you defined. The DAG's SSH script will sync these files to the EC2 instance. Please visit[ OLake Docs](https://olake.io/docs) website to learn how the[ source](https://olake.io/docs/connectors/overview) and[ destinations](https://olake.io/docs/writers/overview) can be set up.
* **Upload OLake Configuration Files:** Before running the DAG, you must upload your OLake `source.json`, `streams.json`, and `destination.json` files to the S3 bucket under the prefix you defined. The DAG's SSH script will sync these files to the EC2 instance. Please visit[ OLake Docs](https://olake.io/docs) website to learn how the[ source](https://olake.io/docs/connectors/overview) and[ destinations](https://olake.io/docs/writers/overview) can be set up.
Collaborator

The usage of source (singular) and destinations (plural) feels inconsistent.


We need to generate `streams.json` beforehand using the OLake `discover` command against your source database.
* Streams Generation Guides:
@@ -185,9 +185,9 @@
* The content of this file will be placed within the `streams.json` file.
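
As a rough sketch of those two steps — assuming a Postgres source, configuration files kept in `~/olake-config`, and a placeholder bucket/prefix; check the OLake docs for the exact `discover` invocation for your connector:

```bash
# Generate the catalog with the OLake discover command (save/confirm the output as streams.json)
docker run --rm \
  -v "$HOME/olake-config:/mnt/config" \
  olakego/source-postgres:latest \
  discover --config /mnt/config/source.json

# Upload the three configuration files to the S3 prefix the DAG syncs from
aws s3 cp "$HOME/olake-config/source.json"      s3://my-olake-bucket/olake/projectA/configs/
aws s3 cp "$HOME/olake-config/streams.json"     s3://my-olake-bucket/olake/projectA/configs/
aws s3 cp "$HOME/olake-config/destination.json" s3://my-olake-bucket/olake/projectA/configs/
```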

#### 4. **EC2 Instance IAM Role (`IAM_ROLE_NAME` in the DAG):**
The EC2 instances launched by Airflow (which will act as the worker nodes for Olake) need their own set of permissions to perform their tasks. This is achieved by assigning them an IAM Instance Profile. This instance profile must have an attached IAM policy granting permissions to:
* Access Amazon S3 to download Olake configuration files.
* Access Amazon S3 to read and write the Olake state file.
The EC2 instances launched by Airflow (which will act as the worker nodes for OLake) need their own set of permissions to perform their tasks. This is achieved by assigning them an IAM Instance Profile. This instance profile must have an attached IAM policy granting permissions to:
* Access Amazon S3 to download OLake configuration files.
* Access Amazon S3 to read and write the OLake state file.

```json
# s3_access_policy.json
@@ -290,7 +290,7 @@ OLAKE_IMAGE = "DOCKER_IMAGE_NAME"

## Recap of Values to Change:

To ensure the DAG runs correctly in your environment, you **must** update the following placeholder variables in the `olake_sync_from_source_ec2.py` (or your DAG file name) with your specific AWS and Olake details:
To ensure the DAG runs correctly in your environment, you **must** update the following placeholder variables in the `olake_sync_from_source_ec2.py` (or your DAG file name) with your specific AWS and OLake details:



@@ -303,19 +303,19 @@ To ensure the DAG runs correctly in your environment, you **must** update the fo
### **EC2 Instance Configuration:**

* `AMI_ID`: Replace with the actual AMI ID of a container-ready image (with Docker/containerd, aws-cli, jq) in your chosen `AWS_REGION_NAME`.
* `INSTANCE_TYPE`: (Optional) Select an appropriate EC2 instance type based on your Olake workload's resource needs (e.g., `t3.medium`, `m5.large`, or an ARM equivalent like `t4g.medium`). \
* `INSTANCE_TYPE`: (Optional) Select an appropriate EC2 instance type based on your OLake workload's resource needs (e.g., `t3.medium`, `m5.large`, or an ARM equivalent like `t4g.medium`). \
The AMI we have hardcoded is an EKS-supported Ubuntu image with containerd and aws-cli pre-installed, both of which are crucial for the DAG to work. Also note that the hardcoded AMI targets the ARM architecture, since Graviton-powered machines are cheaper than comparable x86 machines.
* `KEY_NAME`: Enter the name of the EC2 Key Pair you want to associate with the launched instances. This is the same key we have used while setting up the SSH Connection.
* `SUBNET_ID`: Provide the ID of the VPC subnet where the EC2 instance should be launched.
* `SECURITY_GROUP_ID`: Specify the ID of the Security Group that will be attached to the instance.
* `IAM_ROLE_NAME`: Enter the **name** (not the ARN) of the IAM Instance Profile that grants the EC2 instance necessary permissions (primarily S3 access).
* `DEFAULT_EC2_USER`: Change this if the default SSH username for your chosen `AMI_ID` is different from `ubuntu` (e.g., `ec2-user` for Amazon Linux).

### **ETL Configuration (S3 & Olake):**
### **ETL Configuration (S3 & OLake):**

* `S3_BUCKET_NAME`: The name of your S3 bucket where Olake configurations and state will be stored.
* `S3_BUCKET_PREFIX`: The "folder" path (prefix) within your S3 bucket for Olake files (e.g., `olake/projectA/configs/`). Remember the trailing slash if it's part of your intended structure.
* `OLAKE_IMAGE`: The full name of the Olake Docker image you want to use (e.g., `olakego/source-postgres:latest`, `olakego/source-mysql:latest`, `olakego/source-mongodb:latest`).
* `S3_BUCKET_NAME`: The name of your S3 bucket where OLake configurations and state will be stored.
* `S3_BUCKET_PREFIX`: The "folder" path (prefix) within your S3 bucket for OLake files (e.g., `olake/projectA/configs/`). Remember the trailing slash if it's part of your intended structure.
* `OLAKE_IMAGE`: The full name of the OLake Docker image you want to use (e.g., `olakego/source-postgres:latest`, `olakego/source-mysql:latest`, `olakego/source-mongodb:latest`).

### Deploying the DAG to Airflow

@@ -325,23 +325,23 @@
2. Place the file into the `dags` folder recognized by your Airflow instance. The location of this folder depends on your Airflow setup.
3. Airflow automatically scans this folder. Wait a minute or two, and the DAG named `olake_sync_from_source` should appear in the Airflow UI. You might need to unpause it (toggle button on the left) if it loads in a paused state.
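
For a default Airflow installation, this can be as simple as the sketch below — the dags folder path is an assumption; use whatever location your deployment maps:

```bash
# Copy the DAG into Airflow's dags folder and confirm it has been picked up
cp olake_sync_from_source_ec2.py "$AIRFLOW_HOME/dags/"
airflow dags list | grep olake_sync_from_source
```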

### Running Your Dynamic Olake Sync on EC2
### Running Your Dynamic OLake Sync on EC2

1. **Access Airflow UI:** Navigate to your Airflow web UI.
2. **Find and Unpause DAG:** Locate the DAG, likely named `olake_sync_from_source` (or whatever `dag_id` you've set). If it's paused, click the toggle to unpause it.
3. **Trigger the DAG:** Click the "Play" button (▶️) on the right side of the DAG listing to initiate a manual run. You can also configure a schedule string in the DAG file for automatic runs.
4. **Monitor the Run:** Click on the DAG run instance to view its progress in the Graph, Gantt, or Tree view. You will see the following sequence of tasks:
* `create_ec2_instance_task`: This task will begin first, using the AWS connection to launch a new EC2 instance according to your DAG's configuration (AMI, instance type, networking, IAM role). Airflow will wait for this instance to be in a 'running' state.
* `get_instance_ip_task`: Once the instance is running, this Python task will execute. It queries AWS to get the IP address or DNS name of the new EC2 instance, making it available for the next task. It also includes a pause to allow the SSH service on the new instance to become fully available.
* `run_olake_docker_task`: This is the core task where Olake runs. It will:
* `run_olake_docker_task`: This is the core task where OLake runs. It will:
* Connect to the newly created EC2 instance via SSH using the configured SSH connection.
* Execute the shell commands defined in `olake_ssh_command` within your DAG. This script prepares the EC2 instance by:
* Creating necessary directories.
* Downloading your Olake configuration files and the latest state file from S3.
* Pulling the specified Olake Docker image using `ctr image pull`.
* Running the Olake `sync` process inside a Docker container using `ctr run ... /home/olake sync ...`.
* Downloading your OLake configuration files and the latest state file from S3.
* Pulling the specified OLake Docker image using `ctr image pull`.
* Running the OLake `sync` process inside a Docker container using `ctr run ... /home/olake sync ...`.
* Uploading the updated state file back to S3 upon successful completion.
* You can click on this task instance in the Airflow UI and view its logs. These logs will contain the **real-time STDOUT and STDERR** from the SSH session on the EC2 instance, including the output from the Olake Docker container. This is where you'll see Olake's synchronization progress and any potential errors from the Olake process itself.
* You can click on this task instance in the Airflow UI and view its logs. These logs will contain the **real-time STDOUT and STDERR** from the SSH session on the EC2 instance, including the output from the OLake Docker container. This is where you'll see OLake's synchronization progress and any potential errors from the OLake process itself.
* `terminate_ec2_instance_task`: After the `run_olake_docker_task` completes (whether it succeeds or fails, due to `trigger_rule=TriggerRule.ALL_DONE`), this final task will execute. It securely terminates the EC2 instance that was launched for this DAG run, ensuring you don't incur unnecessary AWS charges.
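
For orientation, the steps performed by `olake_ssh_command` boil down to a short shell sequence on the instance. The sketch below is illustrative only — the bucket, prefix, image, and OLake `sync` flags are placeholders, not the exact script shipped with the DAG:

```bash
# Sketch: fetch configs and state, run the OLake sync via containerd, push state back
sudo mkdir -p /home/ubuntu/olake-config
aws s3 sync "s3://MY_BUCKET/olake/projectA/configs/" /home/ubuntu/olake-config/

sudo ctr image pull docker.io/olakego/source-postgres:latest
sudo ctr run --rm --net-host \
  --mount type=bind,src=/home/ubuntu/olake-config,dst=/mnt/config,options=rbind:rw \
  docker.io/olakego/source-postgres:latest olake-sync \
  /home/olake sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json \
  --state /mnt/config/state.json

# Persist the updated state file for the next run
aws s3 cp /home/ubuntu/olake-config/state.json "s3://MY_BUCKET/olake/projectA/configs/state.json"
```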

![olake-airflow-on-ec2-3](/img/blog/2025/05/olake-airflow-on-ec2-3.webp)
4 changes: 2 additions & 2 deletions blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx
@@ -39,7 +39,7 @@ Here's where things get really interesting. Unlike traditional ETL pipelines tha

## Step 1: Setting Up OLake - CDC Engine

Olake has one of its unique offerings the OLake UI, which we will be using for our setup. This is a user-friendly control center for managing data pipelines without relying heavily on CLI commands. It allows you to configure sources, destinations, and jobs visually, making the setup more accessible and less error-prone. Many organizations actively use OLake UI to reduce manual CLI work, streamline CDC pipelines, and adopt a no-code-friendly approach.
OLake has one of its unique offerings the OLake UI, which we will be using for our setup. This is a user-friendly control center for managing data pipelines without relying heavily on CLI commands. It allows you to configure sources, destinations, and jobs visually, making the setup more accessible and less error-prone. Many organizations actively use OLake UI to reduce manual CLI work, streamline CDC pipelines, and adopt a no-code-friendly approach.

For our setup, we will be working with the OLake UI. We'll start by cloning the repository from GitHub and bringing it up using Docker Compose. Once the UI is running, it will serve as our control hub for creating and monitoring all CDC pipelines.
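
A minimal sketch of that setup, assuming the `datazip-inc/olake-ui` repository and its bundled Docker Compose file:

```bash
# Clone the OLake UI repository and start it with Docker Compose
git clone https://github.com/datazip-inc/olake-ui.git
cd olake-ui
docker compose up -d
```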

@@ -79,7 +79,7 @@ Once it's running, go ahead at http://localhost:8000, olake-ui and use these cre

![olake-login](/img/blog/2025/10/olake-login.webp)

**You are greeted with Olake UI!**
**You are greeted with OLake UI!**

![olake-ui](/img/blog/2025/10/olakeui.webp)

4 changes: 2 additions & 2 deletions blog/2025-08-29-deploying-olake-on-kubernetes.mdx
@@ -162,7 +162,7 @@ global:
olake.io/workload-type: "memory-optimized"
456:
olake.io/workload-type: "general-purpose"
# Default scheduling behaviour
# Default scheduling behavior
789: {}
```

@@ -172,7 +172,7 @@ A typical enterprise scenario can be considered: a massive customer transactions

Without node mapping, both operations might be scheduled on the same node by Kubernetes, causing memory contention. Or worse, the memory-hungry sync job might be put on a small node where an out-of-memory error would cause it to fail.

With JobID-based mapping, the heavy sync is necessarily landed on a node with label `olake.io/workload-type: "memory-optimized"` where completion is achieved in 30 minutes instead of timing out. The other sync job are run happily on smaller, cheaper nodes, finishing without waste.
With JobID-based mapping, the heavy sync is necessarily landed on a node with label `olake.io/workload-type: "memory-optimized"` where completion is achieved in 30 minutes instead of timing out. The other sync jobs are run happily on smaller, cheaper nodes, finishing without waste.
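
For the mapping above to take effect, the target nodes must actually carry the referenced label. A sketch with placeholder node names:

```bash
# Label nodes so the JobID-based mapping can schedule onto them (node names are placeholders)
kubectl label node worker-large-01 olake.io/workload-type=memory-optimized
kubectl label node worker-small-01 olake.io/workload-type=general-purpose
```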

### The Progressive Advantage

2 changes: 1 addition & 1 deletion blog/authors.yml
@@ -85,7 +85,7 @@ akshay:
duke:
page: true
name: Duke
title: Olake Maintainer
title: OLake Maintainer
image_url: /img/authors/duke.webp
socials:
linkedin: dukedhal
1 change: 1 addition & 0 deletions docs/community/setting-up-a-dev-env.mdx
@@ -727,6 +727,7 @@ Alternatively, you can generate the jar file by running the ./build.sh sync comm
|`mode` | `auto`, `debug` |
| `args` | `sync` , `discover`, `check` |

Update `PATH_TO_UPDATE` with the absolute path where the OLake project is located on your system. For example:
Update `workspaceFolder` with the absolute path where the OLake project is located on your system. For example:

```json
2 changes: 1 addition & 1 deletion docs/connectors/mongodb/cdc_setup.mdx
@@ -51,7 +51,7 @@ This guide covers setting up Change Data Capture (CDC) for both self-hosted Mong

**Applicable for both MongoDB (Self-Hosted) and Atlas.**

Olake needs a user that can:
OLake needs a user that can:
- Read/write your application database (to ingest data).
- Read from the local database (where oplog is stored).
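
A minimal sketch of creating such a user with `mongosh` — the user name, password, and application database name are placeholders, and your deployment may require additional roles:

```bash
# Create a user with read/write on the application DB and read on local (for the oplog)
mongosh "mongodb://admin:ADMIN_PASSWORD@localhost:27017/admin" --eval '
  db.getSiblingDB("admin").createUser({
    user: "olake_user",
    pwd: "CHANGE_ME",
    roles: [
      { role: "readWrite", db: "appdb" },
      { role: "read", db: "local" }
    ]
  })
'
```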
