
Why no bulk Arrow→Parquet write API in Java? How to avoid row-by-row RecordConsumer + optimize? #48071

@Fenil-v

Description

Describe the usage question you have. Please include as many useful details as possible.

I have ~20 KB objects that I need to write to Parquet efficiently from Java.
In C++, C#, and Python there is a direct/bulk Arrow→Parquet write (e.g., WriteTable / write_table) that avoids row-by-row iteration, but in Java I only see row-by-row paths via RecordConsumer or internal/unstable column writers.
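
For reference, the row-by-row path I'm describing looks roughly like this: a minimal sketch using parquet-mr's example Group API, where the schema, output path, row count, and synthetic 20 KB payload are all illustrative stand-ins for my real workload:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class RowByRowWrite {
    public static void main(String[] args) throws Exception {
        // Illustrative schema; my real records are ~20 KB each.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message record { required int64 id; required binary payload; }");

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/out.parquet"))
                .withType(schema)
                .build()) {
            SimpleGroupFactory factory = new SimpleGroupFactory(schema);
            for (long i = 0; i < 1_000; i++) {
                // One write() per record: this per-row hop through
                // RecordConsumer is the overhead I'm asking about.
                writer.write(factory.newGroup()
                        .append("id", i)
                        .append("payload", "x".repeat(20 * 1024)));
            }
        }
    }
}
```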
Questions:

  1. Is there a supported bulk/columnar Arrow→Parquet write API in Java (e.g., VectorSchemaRoot → Parquet) that avoids row-by-row calls?
  2. If not, why is Java limited to row-by-row writes today? Any roadmap for feature parity with C++/Python/C#?
  3. For now, what's the recommended optimization path to write ~20 KB objects at high throughput from Java (without JNI), or is JNI/Dataset the recommended route? (A sketch of my understanding of that route follows this list.)
  4. Any best practices (batch sizing, encodings, writer settings) to mitigate the row-by-row overhead?
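
Regarding question 3, here is what I believe the JNI/Dataset route looks like: a sketch assuming the arrow-dataset module (which loads native code via JNI) and that DatasetFileWriter.write accepts an ArrowReader plus a target URI — please correct me if the API differs. I round-trip a single batch through an in-memory IPC stream only to obtain an ArrowReader:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.List;

import org.apache.arrow.dataset.file.DatasetFileWriter;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VarBinaryVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class DatasetWriteSketch {
    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator()) {
            Schema schema = new Schema(List.of(
                    Field.nullable("id", new ArrowType.Int(64, true)),
                    Field.nullable("payload", ArrowType.Binary.INSTANCE)));

            // Build one columnar batch and serialize it to an in-memory
            // IPC stream (in practice: many batches, a few thousand rows each).
            ByteArrayOutputStream ipc = new ByteArrayOutputStream();
            try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
                 ArrowStreamWriter writer = new ArrowStreamWriter(root, null, ipc)) {
                root.allocateNew();
                BigIntVector id = (BigIntVector) root.getVector("id");
                VarBinaryVector payload = (VarBinaryVector) root.getVector("payload");
                for (int i = 0; i < 3; i++) {
                    id.setSafe(i, i);
                    payload.setSafe(i, new byte[20 * 1024]); // ~20 KB object stand-in
                }
                root.setRowCount(3);
                writer.start();
                writer.writeBatch();
                writer.end();
            }

            // Bulk write: hand the whole stream to the JNI-backed Parquet
            // writer in one call instead of one RecordConsumer call per row.
            try (ArrowReader reader = new ArrowStreamReader(
                    new ByteArrayInputStream(ipc.toByteArray()), allocator)) {
                DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET,
                        "file:///tmp/parquet-out");
            }
        }
    }
}
```

If this is indeed the intended route, is it considered stable enough for production, given the native-library dependency?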

Component(s)

Java
