Commit 86bd635

Update minimizing-costs.md
1 parent c4b6b73 commit 86bd635

1 file changed: +15 -1 lines changed

src/content/docs/guides/minimizing-costs.md

Lines changed: 15 additions & 1 deletion
@@ -138,7 +138,7 @@ LIMIT
138138

139139
## Use RANK
140140

141-
An alternative to `TABLESAMPLE`, to get a consistent set of data returning for a subset of data, is to use the `rank` column as mentioned previously:
141+
An alternative to `TABLESAMPLE` that returns a consistent subset of the data is to filter on the `rank` column, as mentioned previously. For example, for the top 1,000 or even 10,000 sites:
142142

143143
```sql
144144
SELECT
@@ -163,6 +163,20 @@ Table names correspond to their full-size counterparts of the form `[table]_10k`
163163

164164
In reality, as `rank` is part of the clustering of the tables, you don't need to use the `sample_data` dataset. However, because of the inaccurate estimates mentioned above, the `sample_data` dataset is safer: it only contains 10,000 pages, so even with inaccurate estimates it will be smaller than the full `crawl` dataset.
165165

166+
## Whether to use `TABLESAMPLE`, `rank`, or `sample_data`
167+
168+
This comes down largely to personal preference. Each has its advantages and disadvantages.
169+
170+
Advantage |`TABLESAMPLE`|`rank`|`sample_data`
171+
----|---|---|---
172+
Consistency of results returned|❌|✅|✅ (if run in same month)
173+
Accurate estimates|✅|❌|✅
174+
Ease of commenting out for full run|✅|✅|❌
175+
Allows querying of any month|✅|✅|❌ (previous month only)
176+
Allows variable sample size|✅|✅|❌
177+
178+
If the estimate bug is ever fixed, `rank` will be the clear winner. Until then, use whatever works for you!
179+
166180
## Use table previews
167181

168182
BigQuery allows you to preview entire rows of a table without incurring a query cost. This is useful for getting a rough idea of the data in a table before running a more expensive query.
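
As a rough illustration of the comparison table added in this change, the three sampling options might look like the queries below. This is a sketch and not part of the commit: the `httparchive.crawl.pages` and `httparchive.sample_data.pages_10k` table names, the `date`, `client`, `is_root_page`, and `page` columns, the date value, and the sample sizes are all assumptions based on the HTTP Archive schema rather than anything shown in the diff.

```sql
-- Option 1: TABLESAMPLE (assumed table and column names).
-- Estimates are accurate, but a different set of rows may come back on each run.
SELECT
  page
FROM
  `httparchive.crawl.pages` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE
  date = '2025-01-01' AND  -- placeholder date, pick the crawl you need
  client = 'mobile' AND
  is_root_page;

-- Option 2: rank. Results are consistent between runs,
-- but the cost estimate shown before running may be inaccurate.
SELECT
  page
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2025-01-01' AND  -- placeholder date
  client = 'mobile' AND
  is_root_page AND
  rank <= 1000;            -- comment out this line for a full run

-- Option 3: sample_data. Small tables with accurate estimates,
-- limited to the previous month's 10,000-page sample.
SELECT
  page
FROM
  `httparchive.sample_data.pages_10k`
WHERE
  client = 'mobile' AND
  is_root_page;
```

With options 1 and 2, switching to a full run only means removing the `TABLESAMPLE` clause or the `rank` filter, whereas option 3 requires changing the table name back to its full-size counterpart.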
