Commit 54cea05

README
1 parent 4ac69df commit 54cea05

File tree

1 file changed: +17 −17 lines changed

README.md

Lines changed: 17 additions & 17 deletions
@@ -8,15 +8,15 @@ Assess the fidelity and novelty of synthetic samples with respect to original sa
 
 ...all with a single line of Python code 💥.
 
-## Getting Started
+## Installation
 
-### Installation
+The latest release of `mostlyai-qa` can be installed via pip:
 
 ```bash
 pip install -U mostlyai-qa
 ```
 
-### Basic Usage
+## Basic usage
 
 ```python
 from mostlyai import qa
@@ -49,7 +49,7 @@ report_path, metrics = qa.report(
 )
 ```
 
-### Syntax
+## Function signature
 
 ```python
 def report(
@@ -121,30 +121,30 @@ def report(
 
 ## Metrics
 
-We calculate three sets of metrics to compare synthetic data with the original data.
+Three sets of metrics are calculated to compare synthetic data with the original data.
 
 ### Accuracy
 
-We calculate discretized marginal distributions for all columns, to then calculate the L1 distance between the synthetic and the original training data.
-The reported accuracy is then expressed as 100% minus the total variational distance (TVD), which is half the L1 distance between the two distributions.
-We then average across these accuracies to get a single accuracy score. The higher the score, the better the synthetic data.
+The L1 distances between the discretized marginal distributions of the synthetic and the original training data are calculated for all columns.
+The reported accuracy is expressed as 100% minus the total variation distance (TVD), which is half the L1 distance between the two distributions.
+These accuracies are then averaged to produce a single accuracy score, where higher scores indicate better synthetic data.
 
-1. **Univariate Accuracy**: We measure the accuracy for the univariate distributions for all target columns.
-2. **Bivariate Accuracy**: We measure the accuracy for all pair-wise distributions for target columns, as well as for target columns with respect to the context columns.
-3. **Coherence Accuracy**: We measure the accuracy for the auto-correlation for all target columns. Only applicable for sequential data.
+1. **Univariate Accuracy**: The accuracy of the univariate distributions for all target columns is measured.
+2. **Bivariate Accuracy**: The accuracy of all pair-wise distributions for target columns, as well as for target columns with respect to the context columns, is measured.
+3. **Coherence Accuracy**: The accuracy of the auto-correlation for all target columns is measured. This is applicable only to sequential data.
 
-An overall accuracy score is then calculated as the average of these aggregate-level scores.
+An overall accuracy score is calculated as the average of these aggregate-level scores.
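
To make the accuracy computation above concrete, here is a minimal sketch of the per-column calculation, assuming simple equal-width binning. The function name, the `bins` parameter, and the binning strategy are illustrative assumptions, not the actual `mostlyai-qa` implementation:

```python
import numpy as np
import pandas as pd

def univariate_accuracy(original: pd.Series, synthetic: pd.Series, bins: int = 10) -> float:
    """Accuracy of a single column: 100% minus the total variation distance
    (TVD) between the discretized marginal distributions."""
    # Discretize numeric columns into shared bins derived from the original data;
    # categorical columns are compared on their raw categories. (Assumed scheme.)
    if pd.api.types.is_numeric_dtype(original):
        edges = np.histogram_bin_edges(original.dropna(), bins=bins)
        original = pd.cut(original, edges, include_lowest=True)
        synthetic = pd.cut(synthetic, edges, include_lowest=True)
    p = original.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)  # union of observed categories
    tvd = 0.5 * (p - q).abs().sum()    # TVD = half the L1 distance
    return 1.0 - tvd                   # higher is better
```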

 ### Similarity
 
-We embed all records into an embedding space, to calculate two metrics:
+All records are embedded into an embedding space to calculate two metrics:
 
-1. **Cosine Similarity**: We calculate the cosine similarity between the centroids of the synthetic and the original training data. This is then compared to the cosine similarity between the centroids of the original training and holdout data. The higher the score, the better the synthetic data.
-2. **Discriminator AUC**: We train a binary classifier to check whether one can distinguish between synthetic and original training data based on their embeddings. This is again compared to the same metric for the original training and holdout data. A score close to 50% indicates, that synthetic samples are indistinguishable from original samples.
+1. **Cosine Similarity**: The cosine similarity between the centroids of the synthetic and the original training data is calculated and compared to the cosine similarity between the centroids of the original training and holdout data. Higher scores indicate better synthetic data.
+2. **Discriminator AUC**: A binary classifier is trained to determine whether synthetic and original training data can be distinguished based on their embeddings. This score is compared to the same metric for the original training and holdout data. A score close to 50% indicates that synthetic samples are indistinguishable from original samples.
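
The following sketch illustrates both similarity metrics on pre-computed embedding matrices. It assumes the embeddings are already available as NumPy arrays and uses a cross-validated logistic regression as a stand-in discriminator; none of this is taken from the library's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def centroid_cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between the centroids of two embedding sets."""
    a, b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discriminator_auc(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cross-validated AUC of a classifier separating the two embedding sets;
    a value close to 0.5 means the sets are indistinguishable."""
    X = np.vstack([emb_a, emb_b])
    y = np.concatenate([np.zeros(len(emb_a)), np.ones(len(emb_b))])
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )[:, 1]
    return float(roc_auc_score(y, proba))
```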

 ### Distances
 
-We again embed all records into an embedding space, to then measure individual-level L2 distances between samples. For each synthetic sample, we calculate the distance to the nearest original sample (DCR). We once do this with respect to original training records, and once with respect to holdout records, and then compare these DCRs to each other. For privacy-safe synthetic data we expect to see that synthetic data is just as close to original training data, as it is to original holdout data.
+All records are embedded into an embedding space, and individual-level L2 distances between samples are measured. For each synthetic sample, the distance to the nearest original sample (DCR) is calculated. This is done once with respect to original training records and once with respect to holdout records. These DCRs are then compared. For privacy-safe synthetic data, it is expected that synthetic data is as close to original training data as it is to original holdout data.
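
A minimal sketch of the DCR computation, again assuming pre-computed embeddings; the function and the variable names in the commented usage are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(synthetic_emb: np.ndarray, reference_emb: np.ndarray) -> np.ndarray:
    """L2 distance from each synthetic sample to its closest reference record."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference_emb)
    distances, _ = nn.kneighbors(synthetic_emb)
    return distances[:, 0]

# Privacy check: synthetic samples should not be systematically closer to
# training records than to holdout records.
# dcr_trn = dcr(syn_emb, trn_emb)     # syn_emb, trn_emb, hol_emb are hypothetical
# dcr_hol = dcr(syn_emb, hol_emb)
# share = (dcr_trn < dcr_hol).mean()  # a value near 0.5 is expected
```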

 ## Sample HTML Report
 
@@ -155,4 +155,4 @@ We again embed all records into an embedding space, to then measure individual-l
 ![Similarity](./docs/screenshots/similarity.png)
 ![Distances](./docs/screenshots/distances.png)
 
-See the [examples](./examples/) directory for further examples.
+See [here](./examples/) for further examples.
