Three sets of metrics are calculated to compare synthetic data with the original data.
### Accuracy
The L1 distances between the discretized marginal distributions of the synthetic and the original training data are calculated for all columns.
The reported accuracy is expressed as 100% minus the total variation distance (TVD), which is half the L1 distance between the two distributions.
These accuracies are then averaged to produce a single accuracy score, where higher scores indicate better synthetic data.
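For illustration, here is a minimal sketch of this computation for a single column using pandas and NumPy. The quantile binning with 10 bins and the `univariate_accuracy` helper are assumptions made for this example, not the library's actual implementation:

```python
import numpy as np
import pandas as pd

def univariate_accuracy(original: pd.Series, synthetic: pd.Series, bins: int = 10) -> float:
    """Accuracy of a single column: 1 minus the total variation distance (TVD)."""
    if pd.api.types.is_numeric_dtype(original):
        # Discretize numeric columns into quantile bins derived from the original data.
        edges = np.unique(np.quantile(original.dropna(), np.linspace(0, 1, bins + 1)))
        original = pd.cut(original, edges, include_lowest=True)
        synthetic = pd.cut(synthetic, edges, include_lowest=True)
    # Relative frequencies per bin/category, aligned over the union of observed bins.
    p = original.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)
    tvd = 0.5 * (p - q).abs().sum()  # TVD is half the L1 distance
    return 1.0 - tvd  # reported as accuracy, e.g. 0.98 -> 98%
```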
1. **Univariate Accuracy**: The accuracy of the univariate distributions for all target columns is measured.
2. **Bivariate Accuracy**: The accuracy of all pair-wise distributions for target columns, as well as for target columns with respect to the context columns, is measured (see the sketch after this list).
3. **Coherence Accuracy**: The accuracy of the auto-correlation for all target columns is measured. This is applicable only for sequential data.
An overall accuracy score is calculated as the average of these aggregate-level scores.
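The bivariate case applies the same TVD-based score to the joint distribution of each column pair. The following sketch assumes the dataframes already contain discretized columns; `bivariate_accuracy` is a hypothetical helper for illustration, not the library's API:

```python
from itertools import combinations

import numpy as np
import pandas as pd

def bivariate_accuracy(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Average TVD-based accuracy over the joint distributions of all column pairs."""
    scores = []
    for a, b in combinations(original.columns, 2):
        # Joint relative frequencies of the (already discretized) column pair.
        p = original.groupby([a, b], observed=True).size() / len(original)
        q = synthetic.groupby([a, b], observed=True).size() / len(synthetic)
        p, q = p.align(q, fill_value=0.0)
        scores.append(1.0 - 0.5 * (p - q).abs().sum())
    return float(np.mean(scores))
```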
### Similarity
All records are embedded into an embedding space to calculate two metrics:
1. **Cosine Similarity**: The cosine similarity between the centroids of the synthetic and the original training data is calculated and compared to the cosine similarity between the centroids of the original training and holdout data. Higher scores indicate better synthetic data.
2. **Discriminator AUC**: A binary classifier is trained to determine whether synthetic and original training data can be distinguished based on their embeddings. This score is compared to the same metric for the original training and holdout data. A score close to 50% indicates that synthetic samples are indistinguishable from original samples. Both similarity metrics are sketched below.
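A minimal sketch of both metrics, assuming the records have already been embedded as NumPy arrays. The logistic-regression discriminator and the helper names are illustrative choices, not necessarily what the library uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def centroid_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between the centroids (mean vectors) of two embedded datasets."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

def discriminator_auc(synthetic_emb: np.ndarray, original_emb: np.ndarray) -> float:
    """AUC of a classifier that tries to tell synthetic from original embeddings."""
    X = np.vstack([synthetic_emb, original_emb])
    y = np.concatenate([np.ones(len(synthetic_emb)), np.zeros(len(original_emb))])
    # Out-of-fold probabilities avoid rewarding an overfitted discriminator.
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )[:, 1]
    return float(roc_auc_score(y, proba))  # ~0.5 means indistinguishable
```

Both scores are then put into perspective by running the same computation on the original training versus holdout embeddings.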
### Distances
All records are embedded into an embedding space, and individual-level L2 distances between samples are measured. For each synthetic sample, the distance to the nearest original sample (DCR) is calculated. This is done once with respect to original training records and once with respect to holdout records. These DCRs are then compared. For privacy-safe synthetic data, it is expected that synthetic data is as close to original training data as it is to original holdout data.
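A sketch of the DCR comparison, under the same assumption of precomputed embeddings; the `dcr` helper and the random demo data are illustrative only:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(synthetic_emb: np.ndarray, reference_emb: np.ndarray) -> np.ndarray:
    """L2 distance from each synthetic sample to its closest reference record (DCR)."""
    nn = NearestNeighbors(n_neighbors=1).fit(reference_emb)
    distances, _ = nn.kneighbors(synthetic_emb)
    return distances[:, 0]

# For privacy-safe synthetic data, DCRs to training and to holdout records should
# look alike: roughly half of the samples end up closer to holdout than to training.
rng = np.random.default_rng(0)
syn, train, holdout = [rng.normal(size=(1_000, 16)) for _ in range(3)]
share_closer_to_train = np.mean(dcr(syn, train) < dcr(syn, holdout))  # expect ~0.5
```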
## Sample HTML Report
See the [examples](./examples/) directory for further examples.