Commit e496aec

More clarifications on rank and TABLESAMPLE (#90)
* More clarifications on rank and TABLESAMPLE
* Update minimizing-costs.md
1 parent edc3c9f commit e496aec

1 file changed

+34
-0
lines changed

src/content/docs/guides/minimizing-costs.md

Lines changed: 34 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -117,6 +117,7 @@ This query will only process 680.01 MB when run.
117117
The 0.01% of rows that are sampled are chosen randomly, so the results of the query will be different each time it's run.
118118

119119
:::danger
120+
## Don't rely on LIMIT
120121
Don't rely on the `LIMIT` clause to reduce the amount of data scanned. `LIMIT` is applied after the query is run, so the entire table will still be scanned.
121122

122123
For example, this query still processes 6.56 TB:
@@ -135,6 +136,25 @@ LIMIT
135136

136137
:::
137138

139+
## Use RANK
140+
141+
An alternative to `TABLESAMPLE` that returns a consistent subset of the data is to filter on the `rank` column mentioned previously. For example, to query only the top 1,000 (or even 10,000) sites:
142+
143+
```sql
144+
SELECT
145+
custom_metrics.other.avg_dom_depth
146+
FROM
147+
`httparchive.crawl.pages`
148+
WHERE
149+
date = '2023-05-01' AND
150+
client = 'desktop' AND
151+
rank <= 1000
152+
```
153+
154+
While this consistency is an advantage over `TABLESAMPLE`, the [previously mentioned bug](https://issuetracker.google.com/issues/176795805) annoyingly means that the estimate for a query filtered on `rank` will not be accurate, while the estimate for a `TABLESAMPLE` query will be. Using `rank` can therefore be a bit more of a leap of faith.
155+
156+
To get around that, you can use the `sample_data` dataset.
157+
138158
## Use the `sample_data` dataset
139159

140160
The `sample_data` dataset contains 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.
@@ -143,6 +163,20 @@ Table names correspond to their full-size counterparts of the form `[table]_10k`
143163

144164
In reality, as `rank` is part of the clustering of the tables, you don't need to use the `sample_data` dataset. However, due to the inaccurate estimates mentioned above, the `sample_data` dataset is safer: it only contains 10,000 pages, so even with inaccurate estimates it will be smaller than the full `crawl` dataset.
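
As a rough sketch, the earlier `rank` query could be pointed at the sample tables instead. The table name `httparchive.sample_data.pages_10k` is an assumption derived from the `[table]_10k` naming pattern, and the sample table is assumed to share the schema of the full `pages` table, so check both before relying on this:

```sql
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  -- Assumed table name, following the `[table]_10k` pattern
  `httparchive.sample_data.pages_10k`
WHERE
  -- No date filter here: the sample tables are understood to hold a single crawl only
  client = 'desktop'
```

Swapping the table name back to `httparchive.crawl.pages` (and restoring the `date` filter) would turn this into the full run.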
145165

166+
## Whether to use `TABLESAMPLE`, `rank`, or `sample_data`
167+
168+
This comes down largely to a matter of personal preference. Each has its advantages and disadvantages.
169+
170+
Advantage |`TABLESAMPLE`|`rank`|`sample_data`
171+
----|---|---|---
172+
Consistency of results returned|❌|✅|✅ (if run in same month)
173+
Accurate estimates|✅|❌|✅
174+
Ease of commenting out for full run|✅|✅|❌
175+
Allows querying of any months|✅|✅|❌ (previous month only)
176+
Allows variable sample size|✅|✅|❌
177+
178+
If the estimate bug is ever fixed, `rank` will be the clear winner. Until then, use whatever works for you!
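
To illustrate the "ease of commenting out" row, here is a minimal sketch of a query with `TABLESAMPLE` active and the `rank` filter left as a commented-out alternative. The 0.01 percent sample size and the `rank <= 1000` threshold are reused from the examples on this page; treat them as placeholders rather than recommendations:

```sql
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.crawl.pages`
  TABLESAMPLE SYSTEM (0.01 PERCENT) -- comment out this line for the full run
WHERE
  date = '2023-05-01' AND
  client = 'desktop'
  -- Alternative to TABLESAMPLE: uncomment for a consistent top-sites subset
  -- AND rank <= 1000
```

In practice you would normally keep only one of the two reducers active at a time, and remove both once the query is ready to run against the full table.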
179+
146180
## Use table previews
147181

148182
BigQuery allows you to preview entire rows of a table without incurring a query cost. This is useful for getting a rough idea of the data in a table before running a more expensive query.
