Skip to content

Conversation

@NEUpanning
Copy link
Contributor

@NEUpanning NEUpanning commented Nov 5, 2025

Allocating prefix sort buffer memory when spilling could lead to MEM_ALLOC_ERROR. This PR uses Timsort as a fallback when allocation of prefix sort buffer memory fails during spilling.

@netlify
Copy link

netlify bot commented Nov 5, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 46c7873
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69129c9a5b8feb0007655712

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 5, 2025
@NEUpanning NEUpanning force-pushed the prefix-sort-fallback branch from f2dbab5 to b787658 Compare November 5, 2025 11:42
@NEUpanning
Copy link
Contributor Author

@jinchengchenghh Would you like to take a look? Thanks!

@jinchengchenghh
Copy link
Collaborator

Do you try this PR, this spill memory pool does not have capacity control, I don't know if it can throw exception, or killed by linux system

@NEUpanning
Copy link
Contributor Author

Do you try this PR, this spill memory pool does not have capacity control, I don't know if it can throw exception, or killed by linux system

@jinchengchenghh I think it depends on the setting value of allocator capacity. Gluten configures it as 0.75 * total overhead memory(code link), so an exception will be thrown when allocating memory exceeds allocator capacity. Below is a crash case we have encountered

Job aborted due to stage failure: Task 208 in stage 16.0 failed 4 times, most recent failure: Lost task 208.3 in stage 16.0 (TID 3229) (zw06-data-spark-vec95.mt executor 342): org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: MEM_ALLOC_ERROR
Reason: allocateContiguous failed with 18432 pages from Memory Pool[__sys_spilling__ LEAF root[__sys_root__] parent[__sys_root__] MALLOC no-usage-track thread-safe]<unlimited max capacity unlimited capacity used 0B available 0B reservation [used 0B, reserved 0B, min 0B] counters [allocs 254492, frees 254485, reserves 0, releases 0, collisions 0])> Exceeded memory allocator limit when allocating 18432 new pages, the memory allocator capacity is 192.00MB
Retriable: True
Context: Operator: OrderBy[1] 1
Function: handleAllocationFailure
File: /tmp/build/src/dev/meituan/src/velox/velox/common/memory/MemoryPool.cpp
Line: 1319
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox6memory14MemoryPoolImpl23handleAllocationFailureERKSs
# 4  _ZN8facebook5velox6memory14MemoryPoolImpl18allocateContiguousEmRNS1_20ContiguousAllocationEm
# 5  _ZN8facebook5velox4exec10PrefixSort12sortInternalERSt6vectorIPcNS0_6memory12StlAllocatorIS4_EEE
# 6  _ZN8facebook5velox4exec11SpillerBase12ensureSortedERNS2_8SpillRunE
# 7  _ZN8facebook5velox4exec11SpillerBase10writeSpillEi
# 8  _ZNSt17_Function_handlerIFSt10unique_ptrIN8facebook5velox4exec11SpillerBase11SpillStatusESt14default_deleteIS5_EEvEZNS4_8runSpillEbEUlvE_E9_M_invokeERKSt9_Any_data

@jinchengchenghh
Copy link
Collaborator

Looks like the pool usage has been tracked, could we use maybeReserve to check if the memory is enough?

  /// Checks if it is likely that the reservation on this memory pool can be
  /// incremented by 'size'. Returns false if this seems unlikely. Otherwise
  /// attempts the reservation increment and returns true if succeeded.
  virtual bool maybeReserve(uint64_t size) = 0;
memory::MemoryPool* spillMemoryPool() {
  return memory::MemoryManager::getInstance()->spillPool();
}

@NEUpanning NEUpanning force-pushed the prefix-sort-fallback branch from b787658 to 953671d Compare November 7, 2025 12:12
@NEUpanning
Copy link
Contributor Author

@jinchengchenghh Thank you for your idea. I've updated to use maybeReserve.

Copy link
Collaborator

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your enhancement!

@jinchengchenghh
Copy link
Collaborator

Please resolve the red CI

@NEUpanning
Copy link
Contributor Author

@xiaoxmeng @tanjialiang The CI has passed. Would you like to take a look when you are available?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants