
Commit 5313034

NikhilTarte-Armmeta-codesync[bot] authored and committed
Sparkbench: Fix - Correct Spark version path to spark-4.0.1 (facebookresearch#257)
Summary:
While running SparkBench, setup failed with:

    Env var SPARK_HOME not set & default path ~/DCPerf/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3 not exist

Root cause: The DCPerf SparkBench installer downloads Spark 4.0.1 from the official Apache mirrors, but the package still referenced spark-4.0.0. Version 4.0.0 does not exist on the Apache download site (https://dlcdn.apache.org/spark/).

Fix: Updated SparkBench package references from spark-4.0.0 to spark-4.0.1 so that SPARK_HOME points to the correct Spark installation after download.

Pull Request resolved: facebookresearch#257

Reviewed By: excelle08

Differential Revision: D85394529

Pulled By: charles-typ

fbshipit-source-id: 24db68323dc8502bc9d04c34604aed3a33d73ae2
1 parent 4847ed2 commit 5313034
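For context on the root cause: the failure is a version mismatch between the tarball the installer downloads and the directory name the package references. Below is a minimal sketch, not part of the commit, of how one could verify that a Spark release actually exists on the mirror before wiring it into SPARK_HOME; it assumes the mirror's usual spark-&lt;version&gt;/spark-&lt;version&gt;-bin-hadoop3.tgz layout, and the helper name is hypothetical.

```python
# Hypothetical sanity check (not part of this commit): confirm the Spark
# release exists on the Apache mirror before building SPARK_HOME from it.
# Assumes the layout https://dlcdn.apache.org/spark/spark-X.Y.Z/spark-X.Y.Z-bin-hadoop3.tgz.
import os
import urllib.error
import urllib.request

SPARK_VERSION = "4.0.1"  # the version the installer actually downloads
PACKAGE = f"spark-{SPARK_VERSION}-bin-hadoop3"
TARBALL_URL = f"https://dlcdn.apache.org/spark/spark-{SPARK_VERSION}/{PACKAGE}.tgz"


def release_exists(url: str) -> bool:
    """Return True if the mirror answers a HEAD request with HTTP 200."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False


if release_exists(TARBALL_URL):
    # Only now is it safe to point SPARK_HOME at the matching directory.
    print("SPARK_HOME ->", os.path.join("benchmarks/spark_standalone", PACKAGE))
else:
    raise SystemExit(f"{PACKAGE} not found on the Apache mirror")
```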

File tree

3 files changed: +6, -6 lines changed

3 files changed

+6
-6
lines changed

packages/spark_standalone/README.md

Lines changed: 4 additions & 4 deletions
@@ -398,7 +398,7 @@ please remove the data from previous runs so that SparkBench can rebuild databas
 
 ```bash
 rm -rf /flash23/warehouse
-rm -rf <benchpressPath>/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/metastore_db
+rm -rf <benchpressPath>/benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/metastore_db
 ```
 
 4. Create the `/flash23` folder. Copy `bpc_t93586_s2_synthetic_5GB`
@@ -408,7 +408,7 @@ Note that SparkBench mini does not require the high I/O throughput
 
 5. Run `spark_standalone_remote_mini` job on a real machine.
    This will create data in `/flash23/warehouse`
-   and `<benchpressPath>/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/metastore_db`.
+   and `<benchpressPath>/benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/metastore_db`.
    Create a backup of these two folders. By default, this job uses the 5GB dataset.
    If you want to use the 1GB dataset, run the job with specifying the
    `dataset_name` parameter like this:
@@ -425,12 +425,12 @@ with the same commands and options.
 Building database takes a considerable amount of time, so it's advisable to consider
 using the same set of storage nodes and NVMe drives when running SparkBench on another
 compute node server. If you choose to reuse the Spark database in `/flash23/warehouse`,
-please also make sure to copy the folder `metastore_db` under `benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3`
+please also make sure to copy the folder `metastore_db` under `benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3`
 to the new machine's same location, for example:
 
 ```
 # Under the DCPerf folder
-rsync -a benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/metastore_db root@<target-hostname>:~/DCPerf/benchmarks/spark_standalone/spark-4.0.0-bin-hadoop3/
+rsync -a benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/metastore_db root@<target-hostname>:~/DCPerf/benchmarks/spark_standalone/spark-4.0.1-bin-hadoop3/
 ```
 
 If you do not copy over the `metastore_db` folder, you will see errors like the following

packages/spark_standalone/templates/proj_root/scripts/utils.py

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ def read_environ() -> Dict[str, str]:
     env_vars["PROJ_ROOT"] = "/".join(os.path.abspath(__file__).split("/")[:-2])
     env_vars["JAVA_HOME"] = find_java_home()
     env_vars["SPARK_HOME"] = os.path.join(
-        env_vars["PROJ_ROOT"], "spark-4.0.0-bin-hadoop3"
+        env_vars["PROJ_ROOT"], "spark-4.0.1-bin-hadoop3"
     )
     # read from actual environment
     for k in env_vars:
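The hunk above also shows why the setup error mentions both the env var and the default path: `read_environ()` first fills `env_vars` with defaults (SPARK_HOME falling back to the bundled Spark directory under PROJ_ROOT) and then overrides each key from the actual environment. A simplified sketch of that pattern, where names beyond those visible in the diff are illustrative:

```python
import os
from typing import Dict


def read_environ_sketch(proj_root: str) -> Dict[str, str]:
    # Defaults mirror the diff: SPARK_HOME falls back to the bundled
    # spark-4.0.1-bin-hadoop3 directory under the project root.
    env_vars = {
        "PROJ_ROOT": proj_root,
        "SPARK_HOME": os.path.join(proj_root, "spark-4.0.1-bin-hadoop3"),
    }
    # "read from actual environment": an exported variable wins over a default.
    for k in env_vars:
        if k in os.environ:
            env_vars[k] = os.environ[k]
    return env_vars


# e.g. `export SPARK_HOME=/opt/spark` overrides the default path entirely,
# which is why setting the env var would have worked around the bug.
print(read_environ_sketch("/root/DCPerf/benchmarks/spark_standalone"))
```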

packages/spark_standalone/templates/runner.py

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ def download_dataset(args):
 
 
 def install_database(args):
-    metadata_dir = os.path.join(SPARK_DIR, "spark-4.0.0-bin-hadoop3", "metastore_db")
+    metadata_dir = os.path.join(SPARK_DIR, "spark-4.0.1-bin-hadoop3", "metastore_db")
     database_dir = os.path.join(args.warehouse_dir, f"{args.dataset_name.lower()}.db")
     if os.path.exists(metadata_dir) and os.path.exists(database_dir):
         print("Database already created; directly run test")
