
Commit 5313034

NikhilTarte-Armmeta-codesync[bot] authored and committed
Sparkbench: Fix - Correct Spark version path to spark-4.0.1 (facebookresearch#257)
Summary:
While running SparkBench, setup failed with:

    Env var SPARK_HOME not set & default path ~/DCPerf/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3 not exist

Root cause: The DCPerf SparkBench installer downloads Spark 4.0.1 from the official Apache mirrors, but the package still referenced spark-4.0.0. Version 4.0.0 does not exist on the Apache download site (https://dlcdn.apache.org/spark/).

Fix: Updated SparkBench package references from spark-4.0.0 to spark-4.0.1 so that SPARK_HOME points to the correct Spark installation after download.

Pull Request resolved: facebookresearch#257

Reviewed By: excelle08

Differential Revision: D85394529

Pulled By: charles-typ

fbshipit-source-id: 24db68323dc8502bc9d04c34604aed3a33d73ae2
1 parent 4847ed2 commit 5313034
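For context on the root cause: the failure is a version mismatch between the tarball the installer downloads and the directory name the package references. Below is a minimal sketch, not part of the commit, of how one could verify that a Spark release actually exists on the mirror before wiring it into SPARK_HOME; it assumes the mirror's usual spark-&lt;version&gt;/spark-&lt;version&gt;-bin-hadoop3.tgz layout, and the helper name is hypothetical.

```python
# Hypothetical sanity check (not part of this commit): confirm the Spark
# release exists on the Apache mirror before building SPARK_HOME from it.
# Assumes the layout https://dlcdn.apache.org/spark/spark-X.Y.Z/spark-X.Y.Z-bin-hadoop3.tgz.
import os
import urllib.error
import urllib.request

SPARK_VERSION = "4.0.1"  # the version the installer actually downloads
PACKAGE = f"spark-{SPARK_VERSION}-bin-hadoop3"
TARBALL_URL = f"https://dlcdn.apache.org/spark/spark-{SPARK_VERSION}/{PACKAGE}.tgz"


def release_exists(url: str) -> bool:
    """Return True if the mirror answers a HEAD request with HTTP 200."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False


if release_exists(TARBALL_URL):
    # Only now is it safe to point SPARK_HOME at the matching directory.
    print("SPARK_HOME ->", os.path.join("benchmarks/spark_standalone", PACKAGE))
else:
    raise SystemExit(f"{PACKAGE} not found on the Apache mirror")
```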

File tree

3 files changed: +6, -6 lines changed

3 files changed

+6
-6
lines changed

packages/spark_standalone/README.md

Lines changed: 4 additions & 4 deletions
@@ -398,7 +398,7 @@ please remove the data from previous runs so that SparkBench can rebuild databas
 
 ```bash
 rm -rf /flash23/warehouse
-rm -rf <benchpressPath>/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/metastore_db
+rm -rf <benchpressPath>/benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/metastore_db
 ```
 
 4. Create the `/flash23` folder. Copy `bpc_t93586_s2_synthetic_5GB`
@@ -408,7 +408,7 @@ Note that SparkBench mini does not require the high I/O throughput
 
 5. Run `spark_standalone_remote_mini` job on a real machine.
    This will create data in `/flash23/warehouse`
-   and `<benchpressPath>/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/metastore_db`.
+   and `<benchpressPath>/benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/metastore_db`.
    Create a backup of these two folders. By default, this job uses the 5GB dataset.
    If you want to use the 1GB dataset, run the job with specifying the
    `dataset_name` parameter like this:
@@ -425,12 +425,12 @@ with the same commands and options.
 Building database takes a considerable amount of time, so it's advisable to consider
 using the same set of storage nodes and NVMe drives when running SparkBench on another
 compute node server. If you choose to reuse the Spark database in `/flash23/warehouse`,
-please also make sure to copy the folder `metastore_db` under `benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3`
+please also make sure to copy the folder `metastore_db` under `benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3`
 to the new machine's same location, for example:
 
 ```
 # Under the DCPerf folder
-rsync -a benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/metastore_db root@<target-hostname>:~/DCPerf/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/
+rsync -a benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/metastore_db root@<target-hostname>:~/DCPerf/benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/
 ```
 
 If you do not copy over the `metastore_db` folder, you will see errors like the following

packages/spark_standalone/templates/proj_root/scripts/utils.py

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ def read_environ() -> Dict[str, str]:
     env_vars["PROJ_ROOT"] = "/".join(os.path.abspath(__file__).split("/")[:-2])
     env_vars["JAVA_HOME"] = find_java_home()
     env_vars["SPARK_HOME"] = os.path.join(
-        env_vars["PROJ_ROOT"], "spark-4.0.0-bin-hadoop3"
+        env_vars["PROJ_ROOT"], "spark-4.0.1-bin-hadoop3"
     )
     # read from actual environment
     for k in env_vars:
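The hunk above also shows why the setup error mentions both the env var and the default path: `read_environ()` first fills `env_vars` with defaults (SPARK_HOME falling back to the bundled Spark directory under PROJ_ROOT) and then overrides each key from the actual environment. A simplified sketch of that pattern, where names beyond those visible in the diff are illustrative:

```python
import os
from typing import Dict


def read_environ_sketch(proj_root: str) -> Dict[str, str]:
    # Defaults mirror the diff: SPARK_HOME falls back to the bundled
    # spark-4.0.1-bin-hadoop3 directory under the project root.
    env_vars = {
        "PROJ_ROOT": proj_root,
        "SPARK_HOME": os.path.join(proj_root, "spark-4.0.1-bin-hadoop3"),
    }
    # "read from actual environment": an exported variable wins over a default.
    for k in env_vars:
        if k in os.environ:
            env_vars[k] = os.environ[k]
    return env_vars


# e.g. `export SPARK_HOME=/opt/spark` overrides the default path entirely,
# which is why setting the env var would have worked around the bug.
print(read_environ_sketch("/root/DCPerf/benchmarks/spark_standalone"))
```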

packages/spark_standalone/templates/runner.py

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ def download_dataset(args):
 
 
 def install_database(args):
-    metadata_dir = os.path.join(SPARK_DIR, "spark-4.0.0-bin-hadoop3", "metastore_db")
+    metadata_dir = os.path.join(SPARK_DIR, "spark-4.0.1-bin-hadoop3", "metastore_db")
     database_dir = os.path.join(args.warehouse_dir, f"{args.dataset_name.lower()}.db")
     if os.path.exists(metadata_dir) and os.path.exists(database_dir):
         print("Database already created; directly run test")
