
Commit 3b5f318

added quick start example (#9)
1 parent 905557e commit 3b5f318

File tree

7 files changed

+95
-1
lines changed

README.md

Lines changed: 28 additions & 0 deletions
@@ -18,6 +18,34 @@ The latest release of `mostlyai-qa` can be installed via pip:
 pip install -U mostlyai-qa
 ```
 
+## Quick start
+
+```python
+import pandas as pd
+import webbrowser
+import json
+from mostlyai import qa
+
+# fetch original + synthetic data (in this case a 30% perturbation of the training)
+repo_url = 'https://github.com/mostly-ai/paper-fidelity-accuracy/raw/refs/heads/main/data/'
+synthetic_df = pd.read_csv(repo_url + 'online-shoppers_flip30.csv.gz')
+training_df = pd.read_csv(repo_url + 'online-shoppers_trn.csv.gz')
+holdout_df = pd.read_csv(repo_url + 'online-shoppers_val.csv.gz')
+
+# runs for ~60secs
+report_path, metrics = qa.report(
+    syn_tgt_data = synthetic_df,
+    trn_tgt_data = training_df,
+    hol_tgt_data = holdout_df,
+)
+
+# pretty print metrics
+print(json.dumps(metrics, indent=4))
+
+# open up HTML report in new browser window
+webbrowser.open(f"file://{report_path.absolute()}")
+```
+
 ## Basic usage
 
 ```python

examples/quick-start.ipynb

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e0de43fa-e337-4590-ac48-9574c2283795",
+   "metadata": {},
+   "source": [
+    "# Example: Quick start"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5c2afb34-600a-42a0-86c1-4ccdce006adf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import webbrowser\n",
+    "import json\n",
+    "from mostlyai import qa\n",
+    "\n",
+    "# fetch original + synthetic data\n",
+    "syn = pd.read_csv('quick-start/census2k-syn_mostly.csv.gz')\n",
+    "# syn = pd.read_csv('quick-start/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn\n",
+    "trn = pd.read_csv('quick-start/census2k-trn.csv.gz')\n",
+    "hol = pd.read_csv('quick-start/census2k-hol.csv.gz')\n",
+    "\n",
+    "# runs for ~30secs\n",
+    "report_path, metrics = qa.report(\n",
+    "    syn_tgt_data = syn,\n",
+    "    trn_tgt_data = trn,\n",
+    "    hol_tgt_data = hol,\n",
+    ")\n",
+    "\n",
+    "# pretty print metrics\n",
+    "print(json.dumps(metrics, indent=4))\n",
+    "\n",
+    "# open up HTML report in new browser window\n",
+    "webbrowser.open(f\"file://{report_path.absolute()}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
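
The last line of the quick start builds the report URL by prefixing `file://` to the absolute path. A possible alternative, offered here as a sketch and not as part of this commit, is `pathlib.Path.as_uri()`, which also percent-encodes characters such as spaces. The helper name `report_uri` and the temporary file below are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def report_uri(report_path):
    """Return a file:// URI for a local path (hypothetical helper, not part of mostlyai-qa)."""
    # as_uri() requires an absolute path, so resolve first
    return Path(report_path).resolve().as_uri()


# Stand-in for the path returned by qa.report(); the file name is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "my report.html"
    demo_path.write_text("<html></html>")
    uri = report_uri(demo_path)
    # webbrowser.open(uri) would open the report in a browser

print(uri)
```

Because the space in the file name is encoded as `%20`, the resulting string is a valid URI even for paths that a plain f-string would leave unescaped.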
26.2 KB
Binary file not shown.
28.3 KB
Binary file not shown.
27.1 KB
Binary file not shown.
26.3 KB
Binary file not shown.

src/mostlyai/qa/assets/html/report_template.html

Lines changed: 2 additions & 1 deletion
@@ -404,7 +404,8 @@ <h2 id="distances" class="anchor">Distances</h2>
 Synthetic data shall be as close to the original training samples, as it is close to original holdout samples, which serve us as a reference.
 This can be asserted empirically by measuring distances between synthetic samples to their closest original samples, whereas training and holdout sets are sampled to be of equal size.
 For the visualization above, the distances of synthetic samples to the training samples are displayed in green, and the distances of synthetic samples to the holdout samples (if available) displayed in gray.
-A green line that is overlaps with the gray line validates that the trained model represents the general rules, that can be found in training just as well as in holdout samples.
+A green line that is significantly left of the gray line implies that synthetic samples are closer to the training samples than to the holdout samples, indicating that the data has overfitted to the training data.
+A green line that overlays with the gray line validates that the trained model indeed represents the general rules, that can be found in training just as well as in holdout samples.
 </div>
 </div>
 </div>
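
The comparison described in the Distances text can be sketched on toy data. This is a minimal illustration, not the implementation used by `mostlyai-qa`: it assumes purely numeric rows and plain Euclidean distance, and all function and variable names are hypothetical.

```python
import math
import random


def distance_to_closest(samples, reference):
    """For each sample, return the Euclidean distance to its nearest reference row."""
    return [min(math.dist(s, r) for r in reference) for s in samples]


def share_closer_to_training(synthetic, training, holdout, seed=0):
    """Share of synthetic rows that lie strictly closer to training than to holdout."""
    rng = random.Random(seed)
    # Sample both reference sets to equal size, as the report text describes
    n = min(len(training), len(holdout))
    trn = rng.sample(training, n)
    hol = rng.sample(holdout, n)
    d_trn = distance_to_closest(synthetic, trn)
    d_hol = distance_to_closest(synthetic, hol)
    closer = sum(a < b for a, b in zip(d_trn, d_hol))
    return closer / len(synthetic)


rng = random.Random(42)
training = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
holdout = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Toy "overfitted" generator: near-copies of training rows
copied = [(x + rng.gauss(0, 0.01), y + rng.gauss(0, 0.01)) for x, y in training[:100]]
# Toy "generalizing" generator: fresh draws from the same distribution
fresh = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]

share_copied = share_closer_to_training(copied, training, holdout)
share_fresh = share_closer_to_training(fresh, training, holdout)
print(f"copied: {share_copied:.2f}, fresh: {share_fresh:.2f}")
```

In this toy setup, a share near 0.5 roughly corresponds to the green and gray lines overlapping, while a share well above 0.5 corresponds to the green line sitting left of the gray one.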
