Description
Summary
All official Data Caterer Docker images (tested: 0.11.10, 0.17.0, 0.19.0) fail with org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file" when using features that require filesystem access. This makes foreign key relationships completely non-functional in Docker deployments.
Impact
❌ FK relationships don't work (record tracking requires filesystem)
❌ SQL expressions fail (uuid5(), CAST(), etc.)
❌ Faker expressions fail (#{Name.fullName})
❌ Unique checks fail (requires disk writes)
❌ Essentially unusable for any real-world scenario with related tables
Environment
Docker Image: datacatering/data-caterer:0.17.0 (also 0.11.10, 0.19.0)
Host OS: Ubuntu 24.04
Database: PostgreSQL 15
Steps to Reproduce
Create a simple plan with FK relationships:
```yaml
name: "fk_test"
dataSources:
  - name: "my_postgres"
    connection:
      type: "postgres"
      options:
        url: "jdbc:postgresql://postgres:5432/testdb"
        user: "postgres"
        password: "postgres"
    steps:
      - name: "parent_table"
        options:
          dbtable: "public.parents"
        count:
          records: 10
        fields:
          - name: "id"
            options:
              uuid: "true"
              isUnique: "true"
      - name: "child_table"
        options:
          dbtable: "public.children"
        count:
          records: 20
        fields:
          - name: "parent_id"  # FK - should be populated by foreignKey mechanism
foreignKeys:
  - source:
      dataSource: "my_postgres"
      step: "parent_table"
      fields: ["id"]
    generate:
      - dataSource: "my_postgres"
        step: "child_table"
        fields: ["parent_id"]
```
Set enableRecordTracking = true in application.conf
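For reference, the flag as set in application.conf (a sketch assuming the default HOCON layout with a top-level `flags` block; adjust the nesting to match your config):

```hocon
flags {
  enableRecordTracking = true
}
```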
Run via Docker:

```shell
docker run -v ./plan:/opt/app/plan datacatering/data-caterer:0.17.0
```
Expected Behavior
Parent table generates 10 records
Child table generates 20 records with valid parent_id values referencing actual parent records
No filesystem errors
Actual Behavior
```
org.apache.spark.SparkException: Unable to create database default as failed to create its directory /tmp/spark-warehouse.
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.liftedTree1$1(InMemoryCatalog.scala:121)
```
Root Cause
The Docker images appear to lack a working registration for the Hadoop local filesystem: Hadoop's scheme lookup cannot resolve "file" to org.apache.hadoop.fs.LocalFileSystem even though the Hadoop JARs are present on the classpath.
Attempted Workarounds (All Failed)
- Added core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
  </property>
</configuration>
```

Result: Not loaded by Spark
- Set environment variables:

```yaml
environment:
  - HADOOP_CONF_DIR=/opt/app/hadoop-conf
  - JAVA_OPTS=-Dfs.file.impl=org.apache.hadoop.fs.LocalFileSystem
```
Result: Configuration not picked up
- Disabled record tracking:
enableRecordTracking = false
Result: FK values not tracked → constraint violations
- Tried SQL expressions instead:

```yaml
fields:
  - name: "id"
    options:
      sql: "uuid5(CAST(1 AS STRING), 'namespace')"
```
Result: Same filesystem error (triggers Spark SQL)
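One further avenue, untested against these images: Spark copies any `spark.hadoop.*` property from its configuration into the underlying Hadoop Configuration, so the implementation class could in principle be forced without a core-site.xml. Whether the image's entrypoint actually forwards `JAVA_OPTS` to the driver JVM is an open question, given the failed attempt above. Illustrative invocation only:

```
docker run \
  -v ./plan:/opt/app/plan \
  -e JAVA_OPTS="-Dspark.hadoop.fs.file.impl=org.apache.hadoop.fs.LocalFileSystem" \
  datacatering/data-caterer:0.17.0
```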
Suggested Fix
Option 1: Include Hadoop filesystem JARs properly
Ensure hadoop-common and hadoop-client are on the classpath with proper service loader configuration for org.apache.hadoop.fs.FileSystem.
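For context, a common cause of "No FileSystem for scheme" is an uber-jar build in which the ServiceLoader manifests from multiple hadoop-* JARs overwrite each other instead of being concatenated (Maven Shade's ServicesResourceTransformer, or an sbt-assembly concat merge strategy for `META-INF/services`, is the usual remedy). The merged JAR should still carry an entry like this (illustrative excerpt; the real file lists several implementations):

```
# META-INF/services/org.apache.hadoop.fs.FileSystem (excerpt)
org.apache.hadoop.fs.LocalFileSystem
```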
Option 2: Pre-create warehouse directory
In the Docker entrypoint:

```shell
mkdir -p /tmp/spark-warehouse
chmod 777 /tmp/spark-warehouse
```
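Until the base image is fixed, users could try a derived image along these lines (a sketch; the `USER root` switch assumes the image runs as a non-root user, which may not be the case, and this only addresses the directory-creation symptom, not the missing FileSystem registration):

```dockerfile
FROM datacatering/data-caterer:0.17.0
USER root
# Pre-create the Spark warehouse directory referenced in the error message
RUN mkdir -p /tmp/spark-warehouse && chmod 777 /tmp/spark-warehouse
```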
Option 3: Document SDK build-from-source approach
If Docker images cannot be fixed, provide clear instructions for using the SDK locally with proper Hadoop setup.
Workaround for Users (Until Fixed)
Generate parent tables first, query their IDs from the database, then dynamically inject those IDs into child table plans using oneOf. This bypasses the FK mechanism entirely but requires custom orchestration code.
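The orchestration step can be sketched as follows (a minimal illustration, not Data Caterer API code: `child_field_spec` is a hypothetical helper, and the commented-out query shows where the IDs would really come from):

```python
import json

def child_field_spec(field_name, parent_ids):
    """Render a field entry for the child-table plan that pins values to the
    parent IDs already in the database (the oneOf workaround)."""
    return {
        "name": field_name,
        "options": {"oneOf": [str(pid) for pid in parent_ids]},
    }

# In a real run, the IDs would be queried after generating the parent table,
# e.g. with psycopg2:
#   cur.execute("SELECT id FROM public.parents")
#   parent_ids = [row[0] for row in cur.fetchall()]
parent_ids = ["a1", "b2", "c3"]  # placeholder IDs for illustration

spec = child_field_spec("parent_id", parent_ids)
print(json.dumps(spec))
```

The resulting spec is then spliced into the child-table plan before invoking Data Caterer a second time.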
Additional Context
This issue makes Data Caterer effectively unusable in Docker for anyone with realistic database schemas containing foreign key relationships. The Docker-based deployment is advertised as the primary usage method, but it fundamentally doesn't work for multi-table scenarios.
Would appreciate any of the following:
- A fix in the Docker images
- Clear documentation that FK relationships don't work in Docker
- Instructions for building from source with proper Hadoop configuration