27 commits
0d442fd
Add alp code
sfc-gh-pgaur Dec 4, 2025
06d1e19
Integrate ALP with arrow
sfc-gh-pgaur Dec 4, 2025
a98c594
Add alp benchmark
sfc-gh-pgaur Dec 4, 2025
c297f97
Add datasets for alp benchmarking
sfc-gh-pgaur Dec 4, 2025
ab928e8
Update cmake file
sfc-gh-pgaur Dec 4, 2025
6a95a59
Move hpp files to h
sfc-gh-pgaur Dec 6, 2025
865e46a
Update flow digram and layout digram to use ASCII and not unicode cha…
sfc-gh-pgaur Dec 7, 2025
cb6d0b6
Rename cpp files to cc
sfc-gh-pgaur Dec 7, 2025
496e23b
Update documentation to align with arrow's doxygen style
sfc-gh-pgaur Dec 7, 2025
8803b52
Adapt methods and variable names to arrow style
sfc-gh-pgaur Dec 7, 2025
31e94ec
Update the tests to adhere to arrow style code
sfc-gh-pgaur Dec 7, 2025
46c0ecc
Update callers
sfc-gh-pgaur Dec 7, 2025
a70b08f
Fuse FOR and decode loop
sfc-gh-pgaur Dec 7, 2025
ccbb1dd
Reduce memory allocation in the decompress call
sfc-gh-pgaur Dec 7, 2025
6a01df2
Attempt at making decoding faster with SIMD
sfc-gh-pgaur Dec 8, 2025
4ced783
Revert "Attempt at making decoding faster with SIMD"
sfc-gh-pgaur Dec 8, 2025
4fac73c
Move cpp files to cc
sfc-gh-pgaur Dec 8, 2025
1cb0852
Move data file to parquet-testing submodule
sfc-gh-pgaur Dec 8, 2025
8d307a6
Update path to the data file
sfc-gh-pgaur Dec 9, 2025
0908342
Adapt files names to arrow convention
sfc-gh-pgaur Dec 15, 2025
e56c877
File rename
sfc-gh-pgaur Dec 15, 2025
cfa00ba
Obtain compressed size and number of elements from page header
sfc-gh-pgaur Dec 15, 2025
a1d11ee
Fix namespace depth
sfc-gh-pgaur Dec 16, 2025
719468b
Better pack the compression block header
sfc-gh-pgaur Dec 16, 2025
69b4e07
Rename class
sfc-gh-pgaur Dec 16, 2025
193a808
Rearrage field for vector metadata for better packing
sfc-gh-pgaur Dec 16, 2025
11aaab3
Add spec files
sfc-gh-pgaur Dec 31, 2025
6 changes: 6 additions & 0 deletions cpp/src/arrow/CMakeLists.txt
@@ -565,6 +565,12 @@ if(ARROW_WITH_ZSTD)
  list(APPEND ARROW_UTIL_SRCS util/compression_zstd.cc)
endif()

# ALP (for Parquet encoder/decoder)
list(APPEND ARROW_UTIL_SRCS
util/alp/alp.cc
util/alp/alp_sampler.cc
util/alp/alp_wrapper.cc)

arrow_add_object_library(ARROW_UTIL ${ARROW_UTIL_SRCS})

# Disable DLL exports in vendored uriparser library
7 changes: 7 additions & 0 deletions cpp/src/arrow/util/CMakeLists.txt
@@ -103,6 +103,13 @@ add_arrow_test(bit-utility-test
rle_encoding_test.cc
test_common.cc)

add_arrow_test(alp-test
SOURCES
alp/alp_test.cc
alp/alp.cc
alp/alp_sampler.cc
alp/alp_wrapper.cc)

add_arrow_test(crc32-test
SOURCES
crc32_test.cc
601 changes: 601 additions & 0 deletions cpp/src/arrow/util/alp/ALP_Encoding_Specification.md

Large diffs are not rendered by default.

199 changes: 199 additions & 0 deletions cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md
@@ -0,0 +1,199 @@
# ALP Encoding Specification

**Types:** FLOAT, DOUBLE | **Reference:** [SIGMOD 2024](https://dl.acm.org/doi/10.1145/3626717)

---

## 1. Layout

```
[Page Header (8B)] [Vector 1] [Vector 2] ... [Vector N]
```

### Page Header (8 bytes)

| Offset | Field | Size | Value |
|--------|-------|------|-------|
| 0 | version | 1B | 1 |
| 1 | mode | 1B | 0 (ALP) |
| 2 | layout | 1B | 0 (normal) |
| 3 | reserved | 1B | 0 |
| 4 | vector_size | 4B | 1024 |
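
As an illustration of the header table, the following is a minimal parsing sketch. The struct and function names are hypothetical, and little-endian storage of `vector_size` is an assumption, since this terse spec does not state the byte order.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical in-memory view of the 8-byte page header (names are illustrative).
struct AlpPageHeader {
  uint8_t version;
  uint8_t mode;
  uint8_t layout;
  uint8_t reserved;
  uint32_t vector_size;
};

// Parses the header from `data`. Returns false if fewer than 8 bytes are
// available. ASSUMPTION: vector_size is stored little-endian.
inline bool ParseAlpPageHeader(const uint8_t* data, size_t size,
                               AlpPageHeader* out) {
  if (data == nullptr || out == nullptr || size < 8) return false;
  out->version = data[0];
  out->mode = data[1];
  out->layout = data[2];
  out->reserved = data[3];
  // Assemble the 4-byte field byte by byte so the result does not depend on
  // the host's endianness.
  out->vector_size = static_cast<uint32_t>(data[4]) |
                     (static_cast<uint32_t>(data[5]) << 8) |
                     (static_cast<uint32_t>(data[6]) << 16) |
                     (static_cast<uint32_t>(data[7]) << 24);
  return true;
}
```

A reader would typically also reject a `version` other than 1 before interpreting the vectors that follow.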

### Vector

```
[VectorInfo (24B)] [PackedValues] [ExceptionPos] [ExceptionVals]
```

### VectorInfo (24 bytes)

| Offset | Field | Size | Type |
|--------|-------|------|------|
| 0 | frame_of_reference | 8B | uint64 |
| 8 | exponent | 1B | uint8, 0..18 |
| 9 | factor | 1B | uint8, 0..e |
| 10 | bit_width | 1B | uint8, 0..64 |
| 11 | reserved | 1B | - |
| 12 | num_elements | 2B | uint16, <=1024 |
| 14 | num_exceptions | 2B | uint16 |
| 16 | bit_packed_size | 2B | uint16 |
| 18 | padding | 6B | - |
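
The field order in the table keeps every field naturally aligned, so a plain struct reproduces the layout without compiler-inserted padding on common ABIs. The sketch below checks that at compile time; the struct name is hypothetical, and whether the implementation in this PR uses such a struct is not stated here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct mirroring the VectorInfo table (field names from the spec).
struct AlpVectorInfo {
  uint64_t frame_of_reference;  // offset 0
  uint8_t exponent;             // offset 8
  uint8_t factor;               // offset 9
  uint8_t bit_width;            // offset 10
  uint8_t reserved;             // offset 11
  uint16_t num_elements;        // offset 12
  uint16_t num_exceptions;      // offset 14
  uint16_t bit_packed_size;     // offset 16
  uint8_t padding[6];           // offset 18
};

// Compile-time checks that the struct matches the documented offsets.
static_assert(sizeof(AlpVectorInfo) == 24, "VectorInfo must be 24 bytes");
static_assert(offsetof(AlpVectorInfo, exponent) == 8, "exponent at offset 8");
static_assert(offsetof(AlpVectorInfo, num_elements) == 12,
              "num_elements at offset 12");
static_assert(offsetof(AlpVectorInfo, bit_packed_size) == 16,
              "bit_packed_size at offset 16");
```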

### Data Sections

| Section | Size |
|---------|------|
| PackedValues | `bit_packed_size` |
| ExceptionPos | `num_exceptions * 2` |
| ExceptionVals | `num_exceptions * sizeof(T)` |

---

## 2. Encoding

### Formula

```
encoded[i] = round(value[i] * 10^e * 10^-f)
```

Where:
- `e` = exponent (0..10 for float, 0..18 for double)
- `f` = factor (0..e)
- `round(n) = int(n + M) - M` where M = 2^22+2^23 (float) or 2^51+2^52 (double)
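
The formula and the rounding trick can be sketched for `double` as follows. The function and table names are hypothetical; the sketch assumes IEEE 754 round-to-nearest arithmetic (no `-ffast-math`) and inputs small enough that `n + M` stays below 2^53.

```cpp
#include <cstdint>

// Powers of ten for e and f in 0..18 (the double range given in the spec).
constexpr double kPow10[19] = {1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,
                               1e7,  1e8,  1e9,  1e10, 1e11, 1e12, 1e13,
                               1e14, 1e15, 1e16, 1e17, 1e18};
constexpr double kInvPow10[19] = {1e0,   1e-1,  1e-2,  1e-3,  1e-4,
                                  1e-5,  1e-6,  1e-7,  1e-8,  1e-9,
                                  1e-10, 1e-11, 1e-12, 1e-13, 1e-14,
                                  1e-15, 1e-16, 1e-17, 1e-18};

// round(n) = int(n + M) - M with M = 2^51 + 2^52. Adding M leaves no mantissa
// bits for the fraction, so the hardware rounds n to an integer
// (ties go to the even neighbour).
inline int64_t FastRound(double n) {
  const double kMagic = 6755399441055744.0;  // 2^51 + 2^52
  return static_cast<int64_t>(n + kMagic) - static_cast<int64_t>(kMagic);
}

// encoded = round(value * 10^e * 10^-f), for 0 <= f <= e <= 18.
inline int64_t AlpEncodeValue(double value, int e, int f) {
  return FastRound(value * kPow10[e] * kInvPow10[f]);
}
```

With `e = 2, f = 0`, `AlpEncodeValue(1.23, 2, 0)` yields 123, matching Example 1 in section 4.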

### Exception Detection

```
exception if: decode(encode(v)) != v
| isnan(v) | isinf(v) | v == -0.0
| v > MAX_INT | v < MIN_INT
```
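
A minimal sketch of this check for `double` is shown below. Several details are assumptions rather than part of the spec: the function name is hypothetical, `std::llround` stands in for the magic-number rounding, and the range test is applied to the scaled value with a limit of 2^51, because the spec does not define `MAX_INT`/`MIN_INT` here.

```cpp
#include <cmath>
#include <cstdint>

constexpr double kPow10[19] = {1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,
                               1e7,  1e8,  1e9,  1e10, 1e11, 1e12, 1e13,
                               1e14, 1e15, 1e16, 1e17, 1e18};
constexpr double kInvPow10[19] = {1e0,   1e-1,  1e-2,  1e-3,  1e-4,
                                  1e-5,  1e-6,  1e-7,  1e-8,  1e-9,
                                  1e-10, 1e-11, 1e-12, 1e-13, 1e-14,
                                  1e-15, 1e-16, 1e-17, 1e-18};

// ASSUMPTION: stand-in for MAX_INT / MIN_INT; the actual limits may differ.
constexpr double kEncodingLimit = 2251799813685248.0;  // 2^51

// Returns true if `v` must be stored as an exception for the given (e, f).
inline bool IsAlpException(double v, int e, int f) {
  if (std::isnan(v) || std::isinf(v)) return true;
  // Negative zero compares equal to 0.0, so test the sign bit explicitly.
  if (v == 0.0 && std::signbit(v)) return true;
  const double scaled = v * kPow10[e] * kInvPow10[f];
  if (scaled > kEncodingLimit || scaled < -kEncodingLimit) return true;
  const int64_t encoded = std::llround(scaled);
  // Round trip: the value is encodable only if decoding returns it bit-exactly.
  const double decoded =
      static_cast<double>(encoded) * kPow10[f] * kInvPow10[e];
  return decoded != v;
}
```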

### Frame of Reference (FOR)

```
FOR = min(encoded[])
delta[i] = encoded[i] - FOR
```
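
The FOR step can be sketched as below. The function name is hypothetical; the subtraction is done in unsigned arithmetic so that a wide range of encoded values cannot trigger signed overflow.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Computes FOR = min(encoded) and fills `deltas` with encoded[i] - FOR.
// Returns the frame of reference (0 for an empty input).
inline int64_t ApplyFrameOfReference(const std::vector<int64_t>& encoded,
                                     std::vector<uint64_t>* deltas) {
  deltas->clear();
  if (encoded.empty()) return 0;
  const int64_t frame = *std::min_element(encoded.begin(), encoded.end());
  deltas->reserve(encoded.size());
  for (int64_t v : encoded) {
    // Unsigned wrap-around subtraction gives the non-negative distance to
    // the minimum.
    deltas->push_back(static_cast<uint64_t>(v) - static_cast<uint64_t>(frame));
  }
  return frame;
}
```

For the encoded values `[123, 456, 789, 12]` this returns 12 and produces deltas `[111, 444, 777, 0]`.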

### Bit Packing

```
bit_width = ceil(log2(max(delta) + 1))
bit_packed_size = ceil(num_elements * bit_width / 8)
```

If `max(delta) == 0`: `bit_width = 0`, no packed data.
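
Since `ceil(log2(x + 1))` is the number of bits needed to represent `x`, both quantities can be computed with integer arithmetic only. A small sketch with hypothetical function names:

```cpp
#include <cstdint>

// Number of bits needed to represent max_delta; 0 when max_delta == 0.
// Equivalent to ceil(log2(max_delta + 1)) without floating point.
inline int BitWidth(uint64_t max_delta) {
  int width = 0;
  while (max_delta != 0) {
    ++width;
    max_delta >>= 1;
  }
  return width;
}

// ceil(num_elements * bit_width / 8), in bytes.
inline uint64_t BitPackedSize(uint64_t num_elements, int bit_width) {
  return (num_elements * static_cast<uint64_t>(bit_width) + 7) / 8;
}
```

For a maximum delta of 777 and 4 elements this gives a width of 10 bits and 5 bytes, as in Example 1.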

---

## 3. Decoding

```
delta[i] = unpack(packed, bit_width)
encoded[i] = delta[i] + FOR
value[i] = encoded[i] * 10^f * 10^-e
value[exception_pos[j]] = exception_val[j] // patch
```
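
A sketch of the decode loop for `double` follows. It starts from already unpacked deltas, because this terse spec does not state the bit order used by the packer. It inverts the encoding formula by multiplying by `10^f` and then `10^-e`. The function name and signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr double kPow10[19] = {1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,
                               1e7,  1e8,  1e9,  1e10, 1e11, 1e12, 1e13,
                               1e14, 1e15, 1e16, 1e17, 1e18};
constexpr double kInvPow10[19] = {1e0,   1e-1,  1e-2,  1e-3,  1e-4,
                                  1e-5,  1e-6,  1e-7,  1e-8,  1e-9,
                                  1e-10, 1e-11, 1e-12, 1e-13, 1e-14,
                                  1e-15, 1e-16, 1e-17, 1e-18};

// Decodes one vector: undo FOR, rescale, then patch exceptions.
// Requires 0 <= f <= e <= 18 and exc_pos.size() == exc_vals.size().
inline std::vector<double> AlpDecodeVector(
    const std::vector<uint64_t>& deltas, int64_t frame, int e, int f,
    const std::vector<uint16_t>& exc_pos, const std::vector<double>& exc_vals) {
  std::vector<double> out(deltas.size());
  for (size_t i = 0; i < deltas.size(); ++i) {
    // FOR addition and rescaling fused in one pass over the data.
    const int64_t encoded =
        static_cast<int64_t>(deltas[i] + static_cast<uint64_t>(frame));
    out[i] = static_cast<double>(encoded) * kPow10[f] * kInvPow10[e];
  }
  // Overwrite placeholder slots with the original exception values.
  for (size_t j = 0; j < exc_pos.size(); ++j) {
    out[exc_pos[j]] = exc_vals[j];
  }
  return out;
}
```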

---

## 4. Examples

### Example 1: No Exceptions

**Input:** `[1.23, 4.56, 7.89, 0.12]` (float)

| Step | Computation | Result |
|------|-------------|--------|
| e=2, f=0 | `v * 100` | `[123, 456, 789, 12]` |
| FOR | `min = 12` | `delta = [111, 444, 777, 0]` |
| bit_width | `ceil(log2(778))` | 10 |
| packed_size | `ceil(4*10/8)` | 5B |

**Output:** 24B (info) + 5B (packed) = **29B**

### Example 2: With Exceptions

**Input:** `[1.5, NaN, 2.5, 0.333...]` (float)

| Step | Result |
|------|--------|
| e=1, f=0 | `[15, -, 25, 3]` |
| Exceptions | pos=[1,3], vals=[NaN, 0.333...] |
| Placeholders | `[15, 15, 25, 15]` |
| FOR=15 | `delta = [0, 0, 10, 0]` |
| bit_width=4 | packed_size = 2B |

**Output:** 24B + 2B + 4B + 8B = **38B**

### Example 3: 1024 Monetary Values ($0.01-$999.99, float)

| Metric | Value |
|--------|-------|
| e=2, f=0 | range: 1..99999 |
| bit_width | ceil(log2(99999)) = 17 |
| packed_size | ceil(1024*17/8) = 2176B |
| **Total** | ~2200B vs 4096B PLAIN (**46% smaller**) |

---

## 5. Constants

| Constant | Value |
|----------|-------|
| Vector size | 1024 |
| Version | 1 |
| Max combinations | 5 |
| Samples/vector | 256 |
| Float max_e | 10 |
| Double max_e | 18 |

---

## 6. Size Formulas

**Per vector:**
```
size = 24 + ceil(n * bw / 8) + exc * (2 + sizeof(T))
```

**Max compressed size:**
```
max = 8 + ceil(n/1024) * 24 + n * sizeof(T) * 2 + n * 2
```
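
Both formulas translate directly into integer arithmetic. The sketch below uses hypothetical function names and takes `sizeof(T)` as a parameter (4 for FLOAT, 8 for DOUBLE).

```cpp
#include <cstdint>

// size = 24 + ceil(n * bw / 8) + exc * (2 + sizeof(T))
inline uint64_t AlpVectorSize(uint64_t n, uint64_t bit_width,
                              uint64_t num_exceptions, uint64_t value_size) {
  return 24 + (n * bit_width + 7) / 8 + num_exceptions * (2 + value_size);
}

// max = 8 + ceil(n / 1024) * 24 + n * sizeof(T) * 2 + n * 2
// Worst case: full-width packed values and every value also an exception.
inline uint64_t AlpMaxCompressedSize(uint64_t n, uint64_t value_size) {
  return 8 + ((n + 1023) / 1024) * 24 + n * value_size * 2 + n * 2;
}
```

`AlpVectorSize(4, 10, 0, 4)` gives 29 and `AlpVectorSize(4, 4, 2, 4)` gives 38, matching Examples 1 and 2.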

---

## 7. Comparison

| Encoding | Size vs. PLAIN | Best For |
|----------|-------------|----------|
| PLAIN | 1.0x | - |
| BYTE_STREAM_SPLIT | ~0.8x | random floats |
| ALP | ~0.5x | decimal floats |

---

## Appendix: Byte Layout

```
Offset Field
------ -----
0-7 frame_of_reference
8 exponent
9 factor
10 bit_width
11 reserved
12-13 num_elements
14-15 num_exceptions
16-17 bit_packed_size
18-23 padding
24 packed_values[bit_packed_size]
24+P exception_pos[num_exceptions]
24+P+2E exception_vals[num_exceptions]
```

Where `P = bit_packed_size`, `E = num_exceptions`
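
Given a pointer to a vector's 24-byte VectorInfo, the section offsets above can be derived as in the following sketch. The type and function names are hypothetical, and little-endian storage of the 2-byte fields is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: byte offsets of the data sections relative to the
// start of the vector, plus the vector's total size.
struct AlpVectorSections {
  size_t packed_values;
  size_t exception_pos;
  size_t exception_vals;
  size_t total_size;
};

// ASSUMPTION: 2-byte fields are stored little-endian.
inline uint16_t ReadU16LE(const uint8_t* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

// `info` points at the 24-byte VectorInfo; `value_size` is sizeof(T).
inline AlpVectorSections LocateSections(const uint8_t* info,
                                        size_t value_size) {
  const size_t num_exceptions = ReadU16LE(info + 14);   // E
  const size_t bit_packed_size = ReadU16LE(info + 16);  // P
  AlpVectorSections s;
  s.packed_values = 24;
  s.exception_pos = 24 + bit_packed_size;
  s.exception_vals = s.exception_pos + 2 * num_exceptions;
  s.total_size = s.exception_vals + value_size * num_exceptions;
  return s;
}
```

The `total_size` result is also the offset of the next vector in the page.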