
Docker images missing Hadoop LocalFileSystem handler - breaks FK relationships and SQL expressions #129

@abdousayed98

Description

Summary
All official Data Caterer Docker images (tested: 0.11.10, 0.17.0, 0.19.0) fail with org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file" when using features that require filesystem access. This makes foreign key relationships completely non-functional in Docker deployments.

Impact
❌ FK relationships don't work (record tracking requires filesystem)
❌ SQL expressions fail (uuid5(), CAST(), etc.)
❌ Faker expressions fail (#{Name.fullName})
❌ Unique checks fail (requires disk writes)
❌ Essentially unusable for any real-world scenario with related tables
Environment

Docker Image: datacatering/data-caterer:0.17.0 (also 0.11.10, 0.19.0)
Host OS: Ubuntu 24.04
Database: PostgreSQL 15
Steps to Reproduce
1. Create a simple plan with FK relationships:

```yaml
name: "fk_test"
dataSources:
  - name: "my_postgres"
    connection:
      type: "postgres"
      options:
        url: "jdbc:postgresql://postgres:5432/testdb"
        user: "postgres"
        password: "postgres"
    steps:
      - name: "parent_table"
        options:
          dbtable: "public.parents"
        count:
          records: 10
        fields:
          - name: "id"
            options:
              uuid: "true"
              isUnique: "true"
      - name: "child_table"
        options:
          dbtable: "public.children"
        count:
          records: 20
        fields:
          - name: "parent_id"
            # FK - should be populated by the foreignKeys mechanism

foreignKeys:
  - source:
      dataSource: "my_postgres"
      step: "parent_table"
      fields: ["id"]
    generate:
      - dataSource: "my_postgres"
        step: "child_table"
        fields: ["parent_id"]
```

2. Set `enableRecordTracking = true` in `application.conf`.

3. Run via Docker:

```shell
docker run -v ./plan:/opt/app/plan datacatering/data-caterer:0.17.0
```
Expected Behavior
Parent table generates 10 records
Child table generates 20 records with valid parent_id values referencing actual parent records
No filesystem errors
Actual Behavior

```
org.apache.spark.SparkException: Unable to create database default as failed to create its directory /tmp/spark-warehouse.
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:121)
```
Root Cause
The Docker images are missing the Hadoop LocalFileSystem handler registration. The classloader cannot find org.apache.hadoop.fs.LocalFileSystem even though Hadoop JARs are present.
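A likely mechanism (an assumption on my part; I have not inspected the image internals): Hadoop discovers `FileSystem` implementations via `java.util.ServiceLoader`, reading `META-INF/services/org.apache.hadoop.fs.FileSystem` from every JAR on the classpath. If the image ships an assembled uber-JAR whose build let another module's copy of that service file overwrite hadoop-common's, the `file` scheme loses its registration even though the `LocalFileSystem` class itself is present. The hadoop-common copy of that file normally lists, among other entries:

```
org.apache.hadoop.fs.LocalFileSystem
org.apache.hadoop.fs.viewfs.ViewFileSystem
org.apache.hadoop.fs.HarFileSystem
org.apache.hadoop.fs.ftp.FTPFileSystem
```

One way to check (the application JAR path inside the image is deployment-specific) is to run `unzip -p <app-jar> META-INF/services/org.apache.hadoop.fs.FileSystem` in the container and see whether `LocalFileSystem` appears.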

Attempted Workarounds (All Failed)

  1. Added core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
  </property>
</configuration>
```

Result: Not loaded by Spark.

  2. Set environment variables:

```yaml
environment:
  - HADOOP_CONF_DIR=/opt/app/hadoop-conf
  - JAVA_OPTS=-Dfs.file.impl=org.apache.hadoop.fs.LocalFileSystem
```

Result: Configuration not picked up.

  3. Disabled record tracking:

```
enableRecordTracking = false
```

Result: FK values not tracked → constraint violations.

  4. Tried SQL expressions instead:

```yaml
fields:
  - name: "id"
    options:
      sql: "uuid5(CAST(1 AS STRING), 'namespace')"
```

Result: Same filesystem error (triggers Spark SQL).

Suggested Fix
Option 1: Include Hadoop filesystem JARs properly
Ensure hadoop-common and hadoop-client are on the classpath with proper service loader configuration for org.apache.hadoop.fs.FileSystem.
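If the image is built from a shaded/assembled JAR (again an assumption), a common cause of this exact error is the assembly step letting one module's `META-INF/services/org.apache.hadoop.fs.FileSystem` overwrite the others. The usual fix is to concatenate service files during the build; with Maven Shade, for example:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- Merge META-INF/services files instead of letting one JAR's copy win -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```

sbt-assembly has an equivalent (`MergeStrategy.concat` or `MergeStrategy.filterDistinctLines` for paths under `META-INF/services`), if that is the build tool in use.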

Option 2: Pre-create warehouse directory
In the Docker entrypoint:

```shell
mkdir -p /tmp/spark-warehouse
chmod 777 /tmp/spark-warehouse
```
Option 3: Document SDK build-from-source approach
If Docker images cannot be fixed, provide clear instructions for using the SDK locally with proper Hadoop setup.

Workaround for Users (Until Fixed)
Generate parent tables first, query their IDs from the database, then dynamically inject those IDs into child table plans using oneOf. This bypasses the FK mechanism entirely but requires custom orchestration code.
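A minimal sketch of that orchestration, assuming the table and option names from the plan above (`oneOf` as the static-values option; the psql flags and YAML indentation are illustrative, and the query is shown commented out with a stand-in variable so the snippet runs without a live database):

```shell
#!/bin/sh
# Step 1 (against a real database): fetch the generated parent IDs.
# ids=$(psql -At -h postgres -U postgres -d testdb \
#       -c 'SELECT id FROM public.parents')
# Stand-in output so the YAML-building step below is runnable as-is:
ids="11111111-1111-1111-1111-111111111111
22222222-2222-2222-2222-222222222222"

# Step 2: splice the IDs into the child plan's parent_id field as a
# static oneOf list, bypassing the broken FK mechanism.
{
  echo '      - name: "parent_id"'
  echo '        options:'
  echo '          oneOf:'
  printf '%s\n' "$ids" | while read -r id; do
    echo "            - \"$id\""
  done
} > parent_id_field.yaml

cat parent_id_field.yaml
```

The generated fragment then replaces the `parent_id` field definition in the child table plan before the second Data Caterer run.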

Additional Context
This issue makes Data Caterer effectively unusable in Docker for anyone with realistic database schemas containing foreign key relationships. The Docker-based deployment is advertised as the primary usage method, but it fundamentally doesn't work for multi-table scenarios.

Would appreciate any of the following:

- A fix in the Docker images
- Clear documentation that FK relationships don't work in Docker
- Instructions for building from source with proper Hadoop configuration
