Commit c4b6b73

More clarifications on rank and TABLESAMPLE
1 parent edc3c9f commit c4b6b73

File tree

1 file changed: +20 -0 lines

src/content/docs/guides/minimizing-costs.md

Lines changed: 20 additions & 0 deletions
@@ -117,6 +117,7 @@ This query will only process 680.01 MB when run.
The 0.01% of rows that are sampled are chosen randomly, so the results of the query will be different each time it's run.

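As a rough illustration (the guide's own `TABLESAMPLE` query sits just above this hunk, so this is a sketch rather than the exact query), a sampled query against `httparchive.crawl.pages` can look like:

```sql
-- Sketch: TABLESAMPLE SYSTEM reads a random ~0.01% of the
-- table's data blocks, so each run sees different rows.
SELECT
  page
FROM
  `httparchive.crawl.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
WHERE
  date = '2023-05-01' AND
  client = 'desktop'
```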
:::danger
## Don't rely on LIMIT
Don't rely on the `LIMIT` clause to reduce the amount of data scanned. `LIMIT` is only applied after the table has been scanned, so the entire table will still be read and billed.

For example, this query still processes 6.56 TB:
@@ -135,6 +136,25 @@ LIMIT

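The query in question falls in unchanged lines that this diff does not show; a hypothetical sketch of that kind of query (a full scan of `httparchive.crawl.pages`, capped only by `LIMIT`) might look like:

```sql
-- Hypothetical sketch, not the guide's exact query: LIMIT caps
-- the rows returned, but every matching row is still scanned.
SELECT
  page
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2023-05-01' AND
  client = 'desktop'
LIMIT
  10
```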
:::

## Use RANK
An alternative to `TABLESAMPLE`, if you want a consistent subset of the data returned each time, is to use the `rank` column as mentioned previously:

```sql
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2023-05-01' AND
  client = 'desktop' AND
  -- restrict to the top 1,000 ranked sites to cut down
  -- the data scanned
  rank <= 1000
```
While this consistency is an advantage over `TABLESAMPLE`, annoyingly, due to the [previously mentioned bug](https://issuetracker.google.com/issues/176795805), using `rank` will not give an accurate cost estimate, while `TABLESAMPLE` will. So using `rank` can be a bit more of a leap of faith.

To get around that, you can use the `sample_data` dataset.

## Use the `sample_data` dataset

The `sample_data` dataset contains 10k-row subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.

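For example, a query like the `rank` one above could be tested against the sample tables first. A minimal sketch, assuming the pages subset is exposed as `httparchive.sample_data.pages_10k` (check the dataset in BigQuery for the exact table names):

```sql
-- Sketch only: the table name is an assumption; the sample
-- tables mirror the schema of the full crawl tables.
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.sample_data.pages_10k`
WHERE
  client = 'desktop'
```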