# 162 new page sampling for eda #163
base: main
## Conversation
… reverse_convert.py
**donadviser** left a comment
I have reviewed the PR and left some actionable suggestions. I'm requesting some changes before approval. Thanks for the great work!
> @@ -0,0 +1,1205 @@
> # Sampling Big Data for Exploratory Data Analysis
The title of each page starts with header2, that is "## Sampling Big Data for Exploratory Data Analysis"
> @@ -0,0 +1,1205 @@
> # Sampling Big Data for Exploratory Data Analysis
use "_" for filenames and "-" for directory/folder names. E.g., change "sampling-for-eda.ipynb" to "sampling_for_eda.ipynb"
> ```{code-tab} py
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.window import Window
> ```

You can remove the import `from pyspark.sql.window import Window`, as `Window` is never used.
> It is important to consider what data is really needed for your purpose. Filter out the unnecessary data early on to reduce the size of your dataset and therefore compute resource, time and money. For example, select which columns you need to be in your dataset. Once you know which columns you want for analysis, you won't to load in the overall dataset every time you open a session, you could just use `sparklyr::select(column_name_1, column_name_2, ..., column_name_5)` at the point of reading in your data.
Replace
"Once you know which columns you want for analysis, you won't to load in the overall dataset every time you open a session, you could just use sparklyr::select(column_name_1, column_name_2, ..., column_name_5) at the point of reading in your data."
with
"Once you know which columns you want for analysis, you won't need to load the entire dataset every time you open a session; use sparklyr::select(column_name_1, column_name_2, ..., column_name_5) (or Spark's select) when reading the data."
> ## Pre-sampling
> Pre-sampling is executed before taking a sample your big data, it works to clean your data and it gives a 'quick' idea of what the data looks like and helps inform decisions on what to include in your sample. Pre-sampling involves looking at nulls, duplicates, quick summary stats. If these steps are not taken then the results ouputted from analysis on your sample could be skewed and non-representative of the big data.
Check the grammar here, which reads awkwardly:
"Pre-sampling is executed before taking a sample your big data, it works to clean your data and ...."
For example, it could read: "Pre-sampling is executed before taking a sample of your big data; it cleans your data and ...". Also note the typo "ouputted" later in the paragraph.
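For illustration, the pre-sampling checks the paragraph lists (nulls, duplicates, quick summary stats) could be sketched like this; shown here in pandas for brevity, with invented column names and values:

```python
import pandas as pd

# Toy stand-in for the MOT data (values invented for illustration)
df = pd.DataFrame({
    "colour":  ["BLUE", "RED", None, "BLUE"],
    "mileage": [42000, 89000, 15000, 42000],
})

null_counts = df.isna().sum()         # nulls per column
n_duplicates = df.duplicated().sum()  # count of fully duplicated rows
summary = df.describe()               # quick summary statistics

print(null_counts["colour"], n_duplicates)
```

In Spark the same checks would use e.g. `F.count(F.when(F.col("colour").isNull(), 1))`, `df.dropDuplicates()`, and `df.describe()`.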
> The sample size suggestion is approximately 0.1 % of the mot_clean dataset. Note, that additional iterations were tested and this sample size gave a good representation of categorical and numerical variables when compared to the big data **for this specific example**.
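As a sketch of what a 0.1 % fraction means in practice: in PySpark this would be `df.sample(fraction=0.001)`, and in sparklyr `sdf_sample(df, fraction = 0.001)`. The pandas equivalent, with a toy frame standing in for `mot_clean`:

```python
import pandas as pd

# Toy frame standing in for mot_clean (100,000 invented rows)
mot_clean = pd.DataFrame({"vehicle_id": range(100_000)})

# A 0.1% sample, matching the fraction suggested above
mot_sample = mot_clean.sample(frac=0.001, random_state=42)
print(len(mot_sample))
```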
> ### Taking the sample
Make this a header 4, that is, `#### Taking the sample`.
> Once the sample has been taken it can be exported. Bare in mind that although you have sampled your original dataframe the partition number is retained. As the sample is much smaller than the original dataframe we can re-partition the sample data using `.coalesce()` in Pyspark or `sdf_coalesce()` in SparklyR. Please remember to close the Spark session after your have exported your data.
Typo: Change "Bare" to "Bear"
> ```
> spark_disconnect(sc)
> ```
>
> ## EDA on a big data sample
Promote this to a header 3, that is, `### EDA on a big data sample`.
> ```
> print(summary)
>
> # you can also use .describe() on specified cateogircal columns
> mot['colour].describe()
> ```
Replace `mot['colour].describe()` with `mot_eda_sample['colour'].describe()` (this also fixes the unclosed string quote).
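With the suggested replacement, the call would look like this; toy data, assuming the sample has been collected back to the driver as a pandas DataFrame named `mot_eda_sample`:

```python
import pandas as pd

# Toy stand-in for the sample collected back to the driver
mot_eda_sample = pd.DataFrame(
    {"colour": ["BLUE", "RED", "BLUE", "SILVER"]}
)

# .describe() on a categorical column reports count, unique, top and freq
stats = mot_eda_sample["colour"].describe()
print(stats["top"], stats["freq"])
```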
> ````{tabs}
> ```{code-tab} py
> import matplotlib.pyplot as plt
> ```
> ````
`import matplotlib.pyplot as plt` is not used. Delete it.