mostly-ai
diff --git a/‎README.md‎
Lines changed: 28 additions & 0 deletions b/‎README.md‎
diff --git a/‎examples/quick-start.ipynb‎
Lines changed: 65 additions & 0 deletions b/‎examples/quick-start.ipynb‎
diff --git a/‎examples/quick-start/census2k-hol.csv.gz‎
26.2 KB b/‎examples/quick-start/census2k-hol.csv.gz‎
diff --git a/‎examples/quick-start/census2k-syn_flip30.csv.gz‎
28.3 KB b/‎examples/quick-start/census2k-syn_flip30.csv.gz‎
diff --git a/‎examples/quick-start/census2k-syn_mostly.csv.gz‎
27.1 KB b/‎examples/quick-start/census2k-syn_mostly.csv.gz‎
diff --git a/‎examples/quick-start/census2k-trn.csv.gz‎
26.3 KB b/‎examples/quick-start/census2k-trn.csv.gz‎
diff --git a/‎src/mostlyai/qa/assets/html/report_template.html‎
Lines changed: 2 additions & 1 deletion b/‎src/mostlyai/qa/assets/html/report_template.html‎
@@ -18,6 +18,34 @@ The latest release of `mostlyai-qa` can be installed via pip:
 pip install -U mostlyai-qa
 ```
 
+## Quick start
+
+```python
+import pandas as pd
+import webbrowser
+import json
+from mostlyai import qa
+
+# fetch original + synthetic data (in this case a 30% perturbation of the training data)
+repo_url = 'https://github.com/mostly-ai/paper-fidelity-accuracy/raw/refs/heads/main/data/'
+synthetic_df = pd.read_csv(repo_url + 'online-shoppers_flip30.csv.gz')
+training_df = pd.read_csv(repo_url + 'online-shoppers_trn.csv.gz')
+holdout_df = pd.read_csv(repo_url + 'online-shoppers_val.csv.gz')
+
+# runs for ~60secs
+report_path, metrics = qa.report(
+    syn_tgt_data = synthetic_df,
+    trn_tgt_data = training_df,
+    hol_tgt_data = holdout_df,
+)
+
+# pretty print metrics
+print(json.dumps(metrics, indent=4))
+
+# open up HTML report in new browser window
+webbrowser.open(f"file://{report_path.absolute()}")
+```
+
 ## Basic usage
 
 ```python
 
@@ -0,0 +1,65 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e0de43fa-e337-4590-ac48-9574c2283795",
+   "metadata": {},
+   "source": [
+    "# Example: Quick start"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5c2afb34-600a-42a0-86c1-4ccdce006adf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import webbrowser\n",
+    "import json\n",
+    "from mostlyai import qa\n",
+    "\n",
+    "# fetch original + synthetic data\n",
+    "syn = pd.read_csv('quick-start/census2k-syn_mostly.csv.gz')\n",
+    "# syn = pd.read_csv('quick-start/census2k-syn_flip30.csv.gz') # a 30% perturbation of trn\n",
+    "trn = pd.read_csv('quick-start/census2k-trn.csv.gz')\n",
+    "hol = pd.read_csv('quick-start/census2k-hol.csv.gz')\n",
+    "\n",
+    "# runs for ~30secs\n",
+    "report_path, metrics = qa.report(\n",
+    "    syn_tgt_data = syn,\n",
+    "    trn_tgt_data = trn,\n",
+    "    hol_tgt_data = hol,\n",
+    ")\n",
+    "\n",
+    "# pretty print metrics\n",
+    "print(json.dumps(metrics, indent=4))\n",
+    "\n",
+    "# open up HTML report in new browser window\n",
+    "webbrowser.open(f\"file://{report_path.absolute()}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -404,7 +404,8 @@ <h2 id="distances" class="anchor">Distances</h2>
         Synthetic data should be as close to the original training samples as it is to the original holdout samples, which serve as a reference.
         This can be checked empirically by measuring the distance from each synthetic sample to its closest original sample, with the training and holdout sets sampled to be of equal size.
         In the visualization above, the distances of synthetic samples to the training samples are displayed in green, and the distances of synthetic samples to the holdout samples (if available) are displayed in gray.
-        A green line that is overlaps with the gray line validates that the trained model represents the general rules, that can be found in training just as well as in holdout samples.
+        A green line that lies significantly to the left of the gray line implies that synthetic samples are closer to the training samples than to the holdout samples, indicating that the model has overfitted to the training data.
+        A green line that overlaps with the gray line validates that the trained model indeed represents the general rules that can be found in training samples just as well as in holdout samples.
       </div>
     </div>
   </div>
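
The distance comparison described in the report text above can be illustrated with a small sketch. This is not the implementation used by `mostlyai-qa`; the function names (`nearest_distances`, `closer_to_training_share`), the Euclidean metric, and the toy Gaussian data are all assumptions made for illustration. It measures, for each synthetic row, the distance to its closest training row and to its closest holdout row, then reports the share of synthetic rows that sit closer to training.

```python
import numpy as np


def nearest_distances(queries: np.ndarray, references: np.ndarray) -> np.ndarray:
    """Euclidean distance from each query row to its closest reference row."""
    # Pairwise differences via broadcasting: (n_queries, n_references, n_features)
    diffs = queries[:, None, :] - references[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def closer_to_training_share(syn: np.ndarray, trn: np.ndarray, hol: np.ndarray) -> float:
    """Share of synthetic rows strictly closer to training than to holdout."""
    dist_to_trn = nearest_distances(syn, trn)
    dist_to_hol = nearest_distances(syn, hol)
    return float((dist_to_trn < dist_to_hol).mean())


# Toy data (hypothetical): training and holdout are equally sized samples
# drawn from the same distribution.
rng = np.random.default_rng(0)
trn = rng.normal(size=(500, 4))
hol = rng.normal(size=(500, 4))

# A generalizing "model": fresh draws from the same distribution.
syn_general = rng.normal(size=(500, 4))
# An overfitted "model": near-copies of the training rows.
syn_overfit = trn + rng.normal(scale=0.01, size=trn.shape)

share_general = closer_to_training_share(syn_general, trn, hol)
share_overfit = closer_to_training_share(syn_overfit, trn, hol)
print(f"generalizing: {share_general:.2f}, overfitted: {share_overfit:.2f}")
```

A share near one half corresponds to the overlapping green and gray lines, while a share close to one corresponds to a green line shifted to the left of the gray one. The brute-force pairwise computation is only suitable for small samples; a nearest-neighbor index would be the usual choice at scale.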