Member

@tasanuma tasanuma commented Jun 2, 2025

What changes were proposed in this pull request?

When jobs that fail at startup (for example, jobs that specify an undefined queue) are executed multiple times, the Livy Thrift Server gets stuck.

Cause of the Problem

  • When a problematic query is submitted, the getAppIdFromTag method throws an IllegalStateException("spark-submit start failed").
  • Subsequently, the kill() method tries to kill the YARN application, but its catch clause does not match the IllegalStateException above, so the exception propagates and yarnAppMonitorThread.interrupt() is never executed.
    • yarnClient.killApplication(Await.result(appIdPromise.future, timeout))
      } catch {
        // We cannot kill the YARN app without the app id.
        // There's a chance the YARN app hasn't been submitted during a livy-server failure.
        // We don't want a stuck session that can't be deleted. Emit a warning and move on.
        case _: TimeoutException | _: InterruptedException =>
          warn("Deleting a session while its YARN application is not found.")
          yarnAppMonitorThread.interrupt()
  • If the yarnAppMonitorThread is not stopped, it seems that the Livy Thrift Server eventually hangs.
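
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Livy's actual code: the object NarrowCatchSketch, the method killWithNarrowCatch, the 100 ms timeout, and the cleanedUp flag (standing in for yarnAppMonitorThread.interrupt()) are all hypothetical names introduced for illustration. It relies only on the fact that Await.result rethrows the exception a promise was failed with.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical stand-in for the kill() path; not Livy's actual class.
object NarrowCatchSketch {
  // Returns true if the cleanup step ran. The flag stands in for
  // yarnAppMonitorThread.interrupt() in the snippet quoted above.
  def killWithNarrowCatch(appIdPromise: Promise[String]): Boolean = {
    var cleanedUp = false
    try {
      // Await.result rethrows whatever exception the promise was failed with.
      Await.result(appIdPromise.future, 100.millis)
    } catch {
      // Same narrow pattern as the quoted code: only these two types match.
      case _: TimeoutException | _: InterruptedException =>
        cleanedUp = true
    }
    cleanedUp
  }
}
```

With a promise that never completes, the call times out and the cleanup runs. With a promise failed by an IllegalStateException, the exception escapes the catch clause and the cleanup is skipped, which matches the behavior described in the bullets above.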

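Proposed Fix: Change the catch clause to match NonFatal, so that any non-fatal exception (including the IllegalStateException above) is caught and yarnAppMonitorThread.interrupt() is executed.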
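
One way a NonFatal-based catch clause could look is sketched below. This is an illustration under stated assumptions, not the patch itself: NonFatalCatchSketch, killWithNonFatalCatch, the 100 ms timeout, and the cleanedUp flag (standing in for yarnAppMonitorThread.interrupt()) are hypothetical. Note that scala.util.control.NonFatal does not match InterruptedException, so this sketch keeps an explicit case for it; whether the actual change does the same is not stated in this description.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Hypothetical stand-in for the kill() path; not Livy's actual class.
object NonFatalCatchSketch {
  // Returns true if the cleanup step ran. The flag stands in for
  // yarnAppMonitorThread.interrupt().
  def killWithNonFatalCatch(appIdPromise: Promise[String]): Boolean = {
    var cleanedUp = false
    try {
      Await.result(appIdPromise.future, 100.millis)
    } catch {
      // NonFatal covers IllegalStateException and TimeoutException.
      // InterruptedException is excluded by NonFatal, so it is listed
      // explicitly here (an assumption made for this sketch).
      case _: InterruptedException | NonFatal(_) =>
        cleanedUp = true
    }
    cleanedUp
  }
}
```

In this sketch the cleanup runs both when the promise times out and when it has been failed with an IllegalStateException, while fatal errors such as OutOfMemoryError still propagate.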

How was this patch tested?

  • unit test (I added a new unit test in SparkYarnAppSpec.)

Please review https://livy.incubator.apache.org/community/ before opening a pull request.
