Batch processor sorting logic #1376

I have studied the current state of our batch processing logic. I think the original author of the code might have found a different solution, but I couldn't pick this up without starting over. I found some undesirable behavior, particularly around single payloads that exceed the maximum size, and IMO organizing batching into two discrete phases, split() and concatenate(), is not a great approach.

I have started a rewrite of the batch processor from first principles. These are the points where I want it to break with the current version:

- Focus entirely on Logs to start
- Provide an option to sort the main payload by some dimension(s); particularly useful ones (in logs) are timestamp, trace ID, and event name
- Maximum size is not optional, since we have u16 limits anyway
- Produce batches of the maximum size, with at most one smaller residual batch
- Deduplicate Resource and Scope attribute values
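
To make the sorting and sizing goals concrete, here is a minimal sketch of the idea, not the actual implementation. It assumes a hypothetical row-oriented `LogRecord` struct in place of the Arrow payload, a hypothetical `SortKey` enum for the sort dimensions, and a hypothetical `rebatch()` function. The maximum size is a `NonZeroU16`, so it is mandatory and stays within the u16 limit.

```rust
use std::num::NonZeroU16;

/// Hypothetical stand-in for one log row; the real code works on Arrow record batches.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct LogRecord {
    timestamp: u64,
    trace_id: u128,
    event_name: String,
}

/// Hypothetical sort dimensions, taken from the list above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum SortKey {
    Timestamp,
    TraceId,
    EventName,
}

/// Optionally sort all records in one pass, then cut them into batches of
/// exactly `max_size` rows. Only the last batch may be smaller.
#[allow(dead_code)]
fn rebatch(
    mut records: Vec<LogRecord>,
    sort_by: Option<SortKey>,
    max_size: NonZeroU16,
) -> Vec<Vec<LogRecord>> {
    match sort_by {
        Some(SortKey::Timestamp) => records.sort_by_key(|r| r.timestamp),
        Some(SortKey::TraceId) => records.sort_by_key(|r| r.trace_id),
        Some(SortKey::EventName) => records.sort_by(|a, b| a.event_name.cmp(&b.event_name)),
        None => {}
    }
    let max = usize::from(max_size.get());
    // `chunks` yields full-size slices followed by at most one shorter slice.
    records.chunks(max).map(|chunk| chunk.to_vec()).collect()
}
```

Because everything is cut in a single pass, an oversized input needs no special case: it simply produces several full batches and possibly one residual.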

There are aspects of the current code that I will keep:

- basic data type Vec<[Option<RecordBatch>; N]> 
- select() helper function for iterating basic data type
- unify() logic for schema unifications
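
For readers unfamiliar with the basic data type, here is a rough sketch of its shape and of what a select()-style helper could look like. The `RecordBatch` struct below is a hypothetical stand-in for the Arrow type, `N = 4` is an arbitrary assumption, and the `select()` signature is my guess for illustration, not the existing helper's actual signature.

```rust
/// Hypothetical stand-in for Arrow's RecordBatch; only the row count matters here.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    num_rows: usize,
}

/// Assumed number of payload slots per batch; the real N depends on the signal type.
#[allow(dead_code)]
const N: usize = 4;

/// The basic data type: one fixed-size array of optional payloads per batch.
#[allow(dead_code)]
type Batches = Vec<[Option<RecordBatch>; N]>;

/// Assumed shape of a select()-style helper: iterate over one payload slot
/// across all batches, skipping batches where that slot is empty.
#[allow(dead_code)]
fn select(
    batches: &[[Option<RecordBatch>; N]],
    slot: usize,
) -> impl Iterator<Item = &RecordBatch> + '_ {
    batches
        .iter()
        // `get` keeps an out-of-range slot from panicking; it just yields nothing.
        .filter_map(move |group| group.get(slot).and_then(|payload| payload.as_ref()))
}
```

With this layout, per-payload operations such as sorting the main payload or unifying schemas can walk a single slot without touching the others.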

See also
#969 
#347 
