Skip to content

Conversation

@aldofunes
Copy link

This is my attempt at addressing issue #3546.

I managed to make it work on my setup, which pulls NULL JSONB columns in a Postgres db into Nullabl(JSON) columns in ClickHouse

@aldofunes aldofunes temporarily deployed to external-contributor October 24, 2025 18:32 — with GitHub Actions Inactive
@aldofunes aldofunes temporarily deployed to external-contributor October 24, 2025 18:32 — with GitHub Actions Inactive
@aldofunes aldofunes had a problem deploying to external-contributor October 24, 2025 18:32 — with GitHub Actions Failure
@aldofunes aldofunes temporarily deployed to external-contributor October 24, 2025 18:32 — with GitHub Actions Inactive
@CLAassistant
Copy link

CLAassistant commented Oct 24, 2025

CLA assistant check
All committers have signed the CLA.

@ilidemi ilidemi requested a review from jgao54 October 25, 2025 16:34
@jgao54
Copy link
Contributor

jgao54 commented Nov 4, 2025

Tested this change locally, it looks like with this change a regression is introduced: CDC is no longer replicating data correctly. It replicates empty object for JSON, or NULL for Nullable(JSON)).

(also, looks like our e2e tests for MongoDB should have caught this regression but didn't, so thanks for surfacing this issue.)

@aldofunes aldofunes temporarily deployed to external-contributor November 5, 2025 18:27 — with GitHub Actions Inactive
@aldofunes aldofunes temporarily deployed to external-contributor November 5, 2025 18:27 — with GitHub Actions Inactive
@aldofunes aldofunes temporarily deployed to external-contributor November 5, 2025 18:27 — with GitHub Actions Inactive
@aldofunes aldofunes temporarily deployed to external-contributor November 5, 2025 18:27 — with GitHub Actions Inactive
@aldofunes
Copy link
Author

@jgao54, you are totally right, how did I miss that?

I found that in the clone step, the avro files have a different schema from the CDC step.

Here's an example:

java -jar avro-tools-1.12.1.jar getschema cdc_file.avro

{
  "type" : "record",
  "name" : "_peerdb_raw_aggregators_basculin_dwh",
  "fields" : [ {
    "name" : "_peerdb_uid",
    "type" : "string"
  }, {
    "name" : "_peerdb_timestamp",
    "type" : "long"
  }, {
    "name" : "_peerdb_destination_table_name",
    "type" : "string"
  }, {
    "name" : "_peerdb_data",
    "type" : "string"
  }, {
    "name" : "_peerdb_record_type",
    "type" : [ "null", "long" ]
  }, {
    "name" : "_peerdb_match_data",
    "type" : [ "null", "string" ]
  }, {
    "name" : "_peerdb_batch_id",
    "type" : [ "null", "long" ]
  }, {
    "name" : "_peerdb_unchanged_toast_columns",
    "type" : [ "null", "string" ]
  } ]
}

java -jar avro-tools-1.12.1.jar getschema clone_file.avro

{
  "type" : "record",
  "name" : "aggregators_basculin_activity_views",
  "fields" : [ {
    "name" : "user_id_0",
    "type" : {
      "type" : "string",
      "logicalType" : "uuid"
    }
  }, {
    "name" : "operation_id_1",
    "type" : {
      "type" : "string",
      "logicalType" : "uuid"
    }
  }, {
    "name" : "data_2",
    "type" : [ "null", "string" ]
  }, {
    "name" : "created_at_3",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    }
  }, {
    "name" : "updated_at_4",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    }
  } ]
}

In other words, the CDC step is wrapping the record in an envelope, while the Clone step is dumping the record by itself.

For the life of me, I could not get the tests to run on my machine, but I was able to figure out that in the CDC step, it needs to first extract the jsonb column to a Nullable(String) and then cast that to a Nullable(JSON), using JSONExtractString() does not work because null values get converted to an empty string (instead of null) and hence, the cast to Nullable(JSON) fails.

I basically had to split the handling of both JSON and Nullable(JSON) as separate steps. I have zero experience with golang, so please forgive me if my approach is too naive.

@jgao54
Copy link
Contributor

jgao54 commented Nov 11, 2025

Thanks for the contribution! I'll run some tests this week and if things look good will get this merged.

)
}
case "JSON", "Nullable(JSON)":
case "JSON":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be combined into:

Suggested change
case "JSON":
case "JSON", "Nullable(JSON)":
stringType := strings.Replace(clickHouseType, "JSON", "String", 1)
fmt.Fprintf(&projection,
"JSONExtract(_peerdb_data, %s, '%s')::%s AS %s,",
peerdb_clickhouse.QuoteLiteral(colName),
stringType,
clickHouseType,
peerdb_clickhouse.QuoteIdentifier(dstColName),
)
if t.enablePrimaryUpdate {
fmt.Fprintf(&projectionUpdate,
"JSONExtract(_peerdb_match_data, %s, '%s')::%s AS %s,",
peerdb_clickhouse.QuoteLiteral(colName),
stringType,
clickHouseType,
peerdb_clickhouse.QuoteIdentifier(dstColName),
)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants