
Conversation

@alexsnowdon (Collaborator):

Merge request template: please remove the appropriate parts of this template.

Pre-merge request checklist (to be completed by the one making the request):

  • [x] I have performed a full review of this code myself.
    • For Python code in PySpark-specific sections, all code should have been run in Jupyter notebooks.
    • For code in sections of the book containing both Python and R code, the page of the book should be constructed as described in the contributing guide and converted to a markdown file.
  • [x] I have formatted the outputs of code blocks correctly (to match other outputs in the book and in line with the style guide [coming soon]).
  • [x] I have built the book as outlined in the contributing guide and confirmed that any additional/modified content is displaying as expected.

Details of this request (such as):

  • Adding a new page to the Spark Analysis section of the book on sampling big data for EDA.

Things to note about this request (such as):

  • I have added this new section into the TOC under the Spark Analysis section
  • Images have been added to the images folder
  • The sample data (used for the second half of the page) has been added to the data folder so you can read it in from there.
  • The big data that is required for the page is in an S3 bucket outlined at the top of the page.
  • The convert.py script would not work for me, so the page was written as a markdown file (sampling-for-eda.md) and reverse_convert.py was used instead.

Requirements for review (such as):

  • run certain files, perform a build of the book, etc.
  • check the code and written content
  • check for consistency
  • check the book displays for you

@alexsnowdon linked an issue on Sep 24, 2025 that may be closed by this pull request: [New page]: sampling_for_eda

@donadviser (Contributor) left a comment:


I have reviewed the PR and left some actionable suggestions. I'm requesting some changes be made before approval. Thanks for the great work!

@@ -0,0 +1,1205 @@
# Sampling Big Data for Exploratory Data Analysis
@donadviser (Contributor) commented:

The title of each page starts with header 2, that is, "## Sampling Big Data for Exploratory Data Analysis".

@@ -0,0 +1,1205 @@
# Sampling Big Data for Exploratory Data Analysis

@donadviser (Contributor) commented:

use "_" for filenames and "-" for directory/folder names. E.g., change "sampling-for-eda.ipynb" to "sampling_for_eda.ipynb"

```{code-tab} py
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
@donadviser (Contributor) commented:

You can remove the import `from pyspark.sql.window import Window`, as `Window` is never used.

```
````

It is important to consider what data is really needed for your purpose. Filter out the unnecessary data early on to reduce the size of your dataset and therefore compute resource, time and money. For example, select which columns you need to be in your dataset. Once you know which columns you want for analysis, you won't to load in the overall dataset every time you open a session, you could just use `sparklyr::select(column_name_1, column_name_2, ..., column_name_5)` at the point of reading in your data.
@donadviser (Contributor) commented:

Replace

"Once you know which columns you want for analysis, you won't to load in the overall dataset every time you open a session, you could just use sparklyr::select(column_name_1, column_name_2, ..., column_name_5) at the point of reading in your data."

with

"Once you know which columns you want for analysis, you won't need to load the entire dataset every time you open a session; use sparklyr::select(column_name_1, column_name_2, ..., column_name_5) (or Spark's select) when reading the data."


## Pre-sampling

Pre-sampling is executed before taking a sample your big data, it works to clean your data and it gives a 'quick' idea of what the data looks like and helps inform decisions on what to include in your sample. Pre-sampling involves looking at nulls, duplicates, quick summary stats. If these steps are not taken then the results ouputted from analysis on your sample could be skewed and non-representative of the big data.
@donadviser (Contributor) commented:

Check that the grammar here is okay and clear:
"Pre-sampling is executed before taking a sample your big data, it works to clean your data and ...."


The sample size suggestion is approximately 0.1 % of the mot_clean dataset. Note, that additional iterations were tested and this sample size gave a good representation of categorical and numerical variables when compared to the big data **for this specific example**.

### Taking the sample
@donadviser (Contributor) commented:

Make this header 4, that is, ####.
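
On the sampling step itself (for context, not a requested change): a minimal PySpark sketch of taking roughly the 0.1 % sample described above, assuming `mot_clean` is the cleaned DataFrame and the seed is a placeholder:

```python
# fraction=0.001 corresponds to the ~0.1% sample size suggested above;
# a fixed seed makes the sample reproducible across sessions.
# note that .sample() returns an approximate fraction, not an exact row count
mot_sample = mot_clean.sample(withReplacement=False, fraction=0.001, seed=42)
```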


```
````
Once the sample has been taken it can be exported. Bare in mind that although you have sampled your original dataframe the partition number is retained. As the sample is much smaller than the original dataframe we can re-partition the sample data using `.coalesce()` in Pyspark or `sdf_coalesce()` in SparklyR. Please remember to close the Spark session after your have exported your data.
@donadviser (Contributor) commented:

Typo: Change "Bare" to "Bear"
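
For context, the export pattern this paragraph describes would read roughly as follows in PySpark (a sketch; the output path is a placeholder):

```python
# reduce the partition count before writing, since the sample is
# far smaller than the original DataFrame
mot_sample.coalesce(1).write.mode("overwrite").parquet("/path/to/output")

# remember to stop the Spark session once the export is done
spark.stop()
```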

spark_disconnect(sc)
```
````
## EDA on a big data sample
@donadviser (Contributor) commented:

Promote to Header 3

print(summary)

# you can also use .describe() on specified cateogircal columns
mot['colour].describe()
@donadviser (Contributor) commented:

Replace
mot['colour].describe()

with
mot_eda_sample['colour'].describe()
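
Worth noting for readers: on a pandas object/categorical column, `.describe()` reports count, unique, top and freq rather than numeric statistics (a sketch, assuming `mot_eda_sample` is the collected pandas DataFrame):

```python
# for object dtype, describe() summarises the categories:
# count  (non-null values), unique (distinct categories),
# top    (most frequent category), freq (count of top)
print(mot_eda_sample['colour'].describe())
```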

````{tabs}
```{code-tab} py

import matplotlib.pyplot as plt
@donadviser (Contributor) commented:

`import matplotlib.pyplot as plt` was not used. Delete.

