4 changes: 4 additions & 0 deletions blog/2025-04-23-how-to-set-up-postgresql-cdc-on-aws-rds.mdx
@@ -9,6 +9,8 @@ tags: [cdc]

# How to Set Up PostgreSQL CDC on AWS RDS - A Step-by-Step Guide

![PostgreSQL CDC Setup on AWS RDS](/img/blog/cover/how-to-set-up-postgresql-cdc-on-aws-rds-cover.webp)

### **What and Why's of CDC?**

Change Data Capture (CDC) is a method used in databases to track and record changes made to data. Recent trends in the data engineering industry point towards the increasing importance of real-time data processing and the integration of artificial intelligence (AI) in data architecture.
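
For a concrete sense of what log-based CDC looks like on PostgreSQL, here is a minimal sketch using psycopg2's logical replication support. It assumes logical replication is already enabled (on RDS, via the `rds.logical_replication` parameter) and that the `wal2json` output plugin is available; the connection details and slot name are placeholders, not part of the original guide.

```python
import json
import psycopg2
import psycopg2.extras

# Illustrative only: a minimal log-based CDC consumer. Assumes logical
# replication is enabled (on RDS: rds.logical_replication = 1) and the
# wal2json output plugin is available. Connection details are placeholders.
conn = psycopg2.connect(
    "host=my-rds-host dbname=mydb user=cdc_user password=secret",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create a replication slot that streams row-level changes as JSON.
cur.create_replication_slot("cdc_demo_slot", output_plugin="wal2json")
cur.start_replication(slot_name="cdc_demo_slot", decode=True)

def on_change(msg):
    # Each message is a batch of INSERT/UPDATE/DELETE events from the WAL.
    print(json.loads(msg.payload))
    # Acknowledge so Postgres can recycle WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(on_change)  # blocks, streaming changes as they commit
```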
@@ -258,3 +260,5 @@ OLake has fastest optimised historical load:
- Any new table additions are also taken care of automatically.

For more detailed information about OLake's PostgreSQL CDC capabilities, visit [olake.io](https://olake.io) and [olake.io/docs](https://olake.io/docs).

<BlogCTA/>
2 changes: 2 additions & 0 deletions blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx
@@ -415,3 +415,5 @@ Building an open data lakehouse has never been this straightforward. With MySQL
Well, there you have it - your very own open data lakehouse, running locally and ready for real-world workloads. The combination of these tools creates something truly powerful: a platform where your data can be both a lake and a warehouse, structured and unstructured, batch and streaming, all at the same time.

Otherwise, you'd be stuck with traditional approaches that force you to choose between flexibility and performance. But with this modern lakehouse architecture, you get the best of both worlds and that's pretty exciting if you ask me!

<BlogCTA/>
6 changes: 5 additions & 1 deletion blog/2025-09-04-creating-job-olake-docker-cli.mdx
@@ -9,6 +9,8 @@ image: /img/blog/cover/pipeline-on-olake.webp

# From Postgres to Iceberg: Creating OLake Jobs with Docker CLI and UI

![Creating OLake Replication Jobs](/img/blog/cover/pipeline-on-olake.webp)

Data replication has become one of the most essential building blocks in modern data engineering. Whether it's keeping your analytics warehouse in sync with operational databases or feeding real-time pipelines for machine learning, companies rely on tools to move data quickly and reliably.

Today, there's no shortage of options—platforms like Fivetran, Airbyte, Debezium, and even custom-built Flink or Spark pipelines are widely used to handle replication. But each of these comes with trade-offs: infrastructure complexity, cost, or lack of flexibility when you want to adapt replication to your specific needs.
@@ -377,4 +379,6 @@ You get a baseline snapshot *and* continuous changes—ideal for keeping downstr
* **UI**: unselect the stream → save → re-add with updated partitioning/filter/normalization.
* **CLI**: edit `streams.json` and re-run.

---

<BlogCTA/>
2 changes: 2 additions & 0 deletions blog/2025-09-09-mysql-to-apache-iceberg-replication.mdx
@@ -9,6 +9,8 @@ image: /img/blog/cover/setup-sql-iceberg.webp

# Replicate MySQL to Iceberg in Real-time: No-Code + Manual Guide

![MySQL to Apache Iceberg Replication](/img/blog/cover/setup-sql-iceberg.webp)

**MySQL** powers countless production applications as a reliable operational database. But when it comes to analytics at scale, running heavy queries directly on MySQL can quickly become expensive, slow, and disruptive to transactional workloads.

That's where **Apache Iceberg** comes in. By replicating MySQL data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries, all without burdening your source database.
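
As a quick, hedged illustration of the time-travel feature mentioned above, here is a Spark SQL sketch against an Iceberg table. It assumes a Spark session already configured with an Iceberg catalog named `lake`; the table name, timestamp, and snapshot id are placeholders for whatever your replication job produces.

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes a Spark session already configured with an
# Iceberg catalog named "lake"; table name, timestamp, and snapshot id are
# placeholders.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Time travel: query the table as it existed at a point in time.
spark.sql(
    "SELECT * FROM lake.db.orders TIMESTAMP AS OF '2025-09-01 00:00:00'"
).show()

# Or pin an exact snapshot id taken from the table's history.
spark.sql("SELECT * FROM lake.db.orders VERSION AS OF 4348271293847102934").show()
```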
2 changes: 2 additions & 0 deletions blog/2025-09-10-how-to-set-up-mongodb-apache-iceberg.mdx
@@ -9,6 +9,8 @@ image: /img/blog/cover/setup-mongodb.webp

# How to Set Up MongoDB with Apache Iceberg: Complete Guide to Building a Modern Data Lakehouse

![How to Set Up MongoDB Apache Iceberg](/img/blog/cover/setup-mongodb.webp)

**MongoDB** has become the go-to database for modern applications, handling everything from user profiles to IoT sensor data with its flexible document model. But when it comes to analytics at scale, MongoDB's document-oriented architecture faces significant challenges with complex queries, aggregations, and large-scale data processing.

That's where **Apache Iceberg** comes in. By replicating MongoDB data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries while maintaining MongoDB's operational performance.
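
To make the capture side of that replication concrete, here is a minimal sketch of reading MongoDB's change streams with PyMongo, the kind of ordered feed a replication tool translates into Iceberg commits. The connection string, database, and collection names are placeholders, and change streams require a replica set or Atlas deployment.

```python
from pymongo import MongoClient

# Illustrative only: connection string, database, and collection are
# placeholders; change streams require a replica set or Atlas cluster.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["shop"]["orders"]

# watch() yields an ordered stream of inserts, updates, replaces, and deletes
# that a replication tool can translate into Iceberg appends/upserts.
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))
```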
@@ -9,6 +9,7 @@ image: /img/blog/cover/hive-vs-iceberg.webp

# When to Choose Apache Iceberg Over Hive: A Comparison Guide

![Apache Hive vs Iceberg Comparison](/img/blog/cover/hive-vs-iceberg.webp)

Apache Hive and Apache Iceberg represent two different generations of the data lake ecosystem. Hive was born in the **Hadoop era** as a SQL abstraction over HDFS, excelling in batch ETL workloads and still valuable for organizations with large Hadoop/ORC footprints. Iceberg, by contrast, emerged in the **cloud-native era** as an open table format designed for multi-engine interoperability, **schema evolution**, and features like **time travel**. If you are running a legacy Hadoop stack with minimal need for engine diversity, Hive remains a practical choice. If you want a **flexible, future-proof data lakehouse** that supports diverse engines, reliable transactions, and governance at scale, Iceberg is the more strategic investment.
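
To ground the schema-evolution and time-travel points, here is an illustrative Spark SQL sketch against an Iceberg table; the catalog and table names are placeholders. On Hive-style tables, equivalent column changes typically force rewrites or risk misreading old files.

```python
from pyspark.sql import SparkSession

# Illustrative only: placeholder catalog/table names. On Iceberg these column
# changes are metadata-only and tracked by column id, so no data files are
# rewritten and older snapshots stay queryable via time travel.
spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

spark.sql("ALTER TABLE lake.db.events RENAME COLUMN user_id TO account_id")
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN quantity TYPE bigint")
spark.sql("ALTER TABLE lake.db.events DROP COLUMN legacy_flag")

# Read a snapshot written before the schema changes.
spark.sql(
    "SELECT * FROM lake.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'"
).show()
```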

2 changes: 2 additions & 0 deletions blog/2025-10-03-iceberg-metadata.mdx
@@ -8,6 +8,8 @@ image: /img/blog/cover/ICEBERG-metadata.webp

# Apache Iceberg Metadata: The Hidden Power Behind the Lakehouse

![Apache Iceberg Metadata Explained](/img/blog/cover/ICEBERG-metadata.webp)

## Iceberg Metadata at a Glance

At its core, Apache Iceberg isn't just another file format; it's a **comprehensive table format** designed to bring the reliability and performance of a traditional database to the vast scale of a data lake. The secret sauce that makes this possible is its **sophisticated, multi-layered metadata system**. This metadata is the brain of the operation, completely decoupling the logical table structure from the physical data files stored in your lake.
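
One way to see this decoupling is to query Iceberg's metadata tables directly. The sketch below assumes a Spark session with an Iceberg catalog named `lake` already configured; the table name is a placeholder.

```python
from pyspark.sql import SparkSession

# Illustrative only: placeholder table name; assumes an Iceberg catalog "lake".
spark = SparkSession.builder.appName("iceberg-metadata").getOrCreate()

# Snapshots: every committed version of the table, the entry point for time travel.
spark.sql("SELECT snapshot_id, committed_at, operation FROM lake.db.orders.snapshots").show()

# Manifests: the index files that let planners prune data without listing storage.
spark.sql("SELECT path, added_data_files_count FROM lake.db.orders.manifests").show()

# Data files: the physical Parquet files the logical table is decoupled from.
spark.sql("SELECT file_path, record_count, file_size_in_bytes FROM lake.db.orders.files").show()
```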
4 changes: 4 additions & 0 deletions blog/2025-10-09-apache-polaris-lakehouse.mdx
@@ -10,6 +10,8 @@ image: /img/blog/cover/polaris-blog.webp

# Building a Scalable Lakehouse with Iceberg, Trino, OLake & Apache Polaris

![Building a Scalable Lakehouse with Iceberg, Trino, OLake and Apache Polaris](/img/blog/cover/polaris-blog.webp)

### Why choose this lakehouse stack?

Modern data teams are moving toward the lakehouse architecture—combining the reliability of data warehouses with the scale and cost-efficiency of data lakes. But building one from scratch can feel overwhelming with so many moving parts.
@@ -716,3 +718,5 @@ Building a modern lakehouse doesn't have to be complex. With Iceberg + Polaris +

Welcome to the lakehouse era. 🚀

<BlogCTA/>

@@ -11,6 +11,7 @@ image: /img/blog/cover/parquet-vs-iceberg.webp

# Parquet vs. Iceberg: From File Format to Data Lakehouse King

![Parquet vs Iceberg: File Format vs Table Format](/img/blog/cover/parquet-vs-iceberg.webp)

Before we dissect the architecture, let's establish the fundamental distinction. **Apache Parquet** is a highly-efficient **columnar file format**, engineered to store data compactly and enable rapid analytical queries. Think of it as the optimally manufactured bricks and steel beams for constructing a massive warehouse. **Apache Iceberg**, in contrast, is an **open table format**; it is the architectural blueprint and inventory management system for that warehouse. It doesn't store the data itself—it meticulously tracks the collection of Parquet files that constitute a table.
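
A short, illustrative Spark sketch of that distinction: writing plain Parquet produces only files, while writing through Iceberg produces the same Parquet files plus the metadata that tracks them. Paths and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Illustrative only: placeholder paths and table names; assumes an Iceberg
# catalog "lake" and object-storage credentials are already configured.
spark = SparkSession.builder.appName("parquet-vs-iceberg").getOrCreate()
df = spark.range(1_000).withColumnRenamed("id", "order_id")

# Just files: readers must list the directory and guess schema/partitions.
df.write.mode("append").parquet("s3://my-bucket/raw/orders/")

# Files + blueprint: Iceberg records every file in manifests and snapshots.
df.writeTo("lake.db.orders").using("iceberg").createOrReplace()

# The metadata layer is queryable, e.g. every data file the table tracks.
spark.sql("SELECT file_path, record_count FROM lake.db.orders.files").show()
```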

@@ -445,3 +446,5 @@ Iceberg provides the missing management layer. It is the architectural specifica
Therefore, the architectural conclusion is clear. The question is not **Parquet *versus* Iceberg**. It is, and has always been, **Parquet *with* Iceberg**.

For any serious data lake initiative that demands reliability, performance, and agility, the choice is no longer *if* you should adopt a modern table format. The only question is how you will leverage a format like Iceberg to unlock the true potential of your data. To build a future-proof data platform, you need both the optimal storage container and the master blueprint, i.e. **Parquet with Iceberg**!

<BlogCTA/>
6 changes: 5 additions & 1 deletion blog/2025-11-03-olake-bauplan.mdx
@@ -10,6 +10,8 @@ image: /img/blog/2025/10/olake_bauplan_cover.png
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

![Building a Serverless Iceberg Lakehouse with OLake and Bauplan](/img/blog/2025/10/olake_bauplan_cover.png)

If you've ever tried to build a data lake, you know it rarely feels simple. Data sits across operational systems (PostgreSQL, Oracle, MongoDB) and getting it into a usable analytical format means chaining together multiple tools for ingestion, transformation, orchestration, and governance. Each layer adds cost, complexity, and maintenance overhead. You end up managing clusters, debugging pipelines, and paying for infrastructure that sits idle more often than it runs.

This is where OLake and Bauplan change the game. OLake moves your data from databases to Apache Iceberg seamlessly, skipping the headache of developing custom ETL pipelines. Bauplan, on the other hand, lets you build and run your data transformations serverlessly — in Python or SQL, with no provisioning or maintenance. Together, they form a **serverless open data lakehouse**.
@@ -278,4 +280,6 @@ You've just built a complete data lakehouse stack that bridges operational datab
- [OLake Documentation](https://olake.io/docs) - Complete guide to setting up OLake with various sources and destinations
- [Bauplan Documentation](https://docs.bauplanlabs.com) - Learn about branch workflows and data transformations
- [Lakekeeper](https://lakekeeper.io) - Open-source Iceberg REST catalog
- [Apache Iceberg](https://iceberg.apache.org) - The open table format powering this architecture

<BlogCTA/>
4 changes: 4 additions & 0 deletions blog/2025-11-04-postgres-iceberg-doris-lakehouse-olake.mdx
@@ -10,6 +10,8 @@ image: /img/blog/2025/20/olake-iceberg-doris-banner.webp

# Postgres → Iceberg → Doris: A Smooth Lakehouse Journey Powered by OLake

![Postgres to Iceberg to Doris Lakehouse Architecture](/img/blog/2025/20/olake-iceberg-doris-banner.webp)

If you've been working with data lakes, you've probably felt the friction of keeping your analytics engine separate from your storage layer. With your data neatly sitting in Iceberg, the next challenge is querying it efficiently without moving it around.

That's a pretty fair reason to bring Doris in.
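
As a rough sketch of what that looks like in practice: Doris speaks the MySQL protocol, so any MySQL client can attach the existing Iceberg catalog and query it without copying data. The host, port, catalog properties, and table names below are placeholders and vary by Doris version and catalog type.

```python
import pymysql

# Illustrative only: host, port, catalog properties, and table names are
# placeholders; exact property names depend on your Doris version and catalog.
conn = pymysql.connect(host="doris-fe", port=9030, user="root", password="")
with conn.cursor() as cur:
    # Register the existing Iceberg REST catalog with Doris (no data copied).
    cur.execute("""
        CREATE CATALOG IF NOT EXISTS iceberg_lake PROPERTIES (
            "type" = "iceberg",
            "iceberg.catalog.type" = "rest",
            "uri" = "http://rest-catalog:8181"
        )
    """)
    # Query the Iceberg table written by OLake, straight from object storage.
    cur.execute("SELECT count(*) FROM iceberg_lake.analytics.orders")
    print(cur.fetchone())
```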
@@ -358,3 +360,5 @@ then restart your Doris BE and then run your table query command and it should w

**Happy Engineering! Happy Iceberg!**

<BlogCTA/>

@@ -147,4 +147,6 @@ Happy engineering!

Further Reading:
- [Merge-on-Read vs Copy-on-Write in Apache Iceberg](/iceberg/mor-vs-cow)
- [Why move to Apache Iceberg](/iceberg/move-to-iceberg)

<BlogCTA/>
4 changes: 3 additions & 1 deletion iceberg/2025-10-28-databricks-vs-iceberg.mdx
@@ -2022,4 +2022,6 @@ The choice isn't about which platform is "better." It's about which trade-off al

For now, our benchmark gives you data to inform your decision. Test with your actual workloads, evaluate your team's expertise, and choose the platform that unblocks your business.

Because the best data platform isn't the fastest or cheapest—it's the one your team can successfully operate that delivers business value.

<BlogCTA/>