src/content/docs/guides/minimizing-costs.md (34 additions, 0 deletions)
@@ -117,6 +117,7 @@ This query will only process 680.01 MB when run.
The 0.01% of rows that are sampled are chosen randomly, so the results of the query will be different each time it's run.

:::danger
## Don't rely on LIMIT
Don't rely on the `LIMIT` clause to reduce the amount of data scanned. `LIMIT` is applied after the query is run, so the entire table will still be scanned.

For example, this query still processes 6.56 TB:
@@ -135,6 +136,25 @@ LIMIT
:::

## Use RANK

An alternative to `TABLESAMPLE` that returns a consistent subset of the data is to filter on the `rank` column, as mentioned previously. For example, to query only the top 1,000 (or even 10,000) sites:

```sql
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2023-05-01' AND
  client = 'desktop' AND
  rank <= 1000
```

While this consistency is an advantage over `TABLESAMPLE`, the [previously mentioned bug](https://issuetracker.google.com/issues/176795805) annoyingly means that a query filtered on `rank` will not get an accurate estimate of how much data it will process, while a `TABLESAMPLE` query will. So using `rank` can be a bit more of a leap of faith.

To get around that, you can use the `sample_data` dataset.

## Use the `sample_data` dataset

The `sample_data` dataset contains 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.
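
As a rough sketch, the earlier `avg_dom_depth` query could be pointed at the sample table instead of the full one. The table name `httparchive.sample_data.pages_10k` is an assumption based on the `[table]_10k` naming form, and the sketch also assumes the sample table has the same columns as `crawl.pages`:

```sql
-- Sketch only: same metric as the `rank` example, read from the 10k sample.
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.sample_data.pages_10k`  -- assumed name, following the `[table]_10k` form
WHERE
  -- No `rank` filter is needed because the table is already a 10k subset.
  -- Add a `date` filter here if the sample table turns out to require one.
  client = 'desktop'
```

To run the query against the full dataset afterwards, the table name in the `FROM` clause has to be swapped back to `httparchive.crawl.pages` and a `date` filter added.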
@@ -143,6 +163,20 @@ Table names correspond to their full-size counterparts of the form `[table]_10k`

In reality, because `rank` is part of the clustering of the tables, you don't need to use the `sample_data` dataset. However, due to the inaccurate estimates mentioned above, the `sample_data` dataset is safer: it only contains 10,000 pages, so even with inaccurate estimates it will be smaller than the full `crawl` dataset.

## Whether to use `TABLESAMPLE`, `rank`, or `sample_data`

This comes down largely to a matter of personal preference. Each has its advantages and disadvantages.

| Advantage | `TABLESAMPLE` | `rank` | `sample_data` |
| --- | --- | --- | --- |
| Consistency of results returned | ❌ | ✅ | ✅ (if run in same month) |
| Accurate estimates | ✅ | ❌ | ✅ |
| Ease of commenting out for full run | ✅ | ✅ | ❌ |
| Allows querying of any month | ✅ | ✅ | ❌ (previous month only) |
| Allows variable sample size | ✅ | ✅ | ❌ |

If the estimate bug is ever fixed, `rank` will be the clear winner. Until then, use whatever works for you!
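
To illustrate the "ease of commenting out" point, here is one possible way to lay out each kind of query so that the sampling lives on a single line. This is only a sketch: the 0.01 percent sample size and the `rank <= 1000` threshold are taken from the examples in this guide, and the line layout is just a suggestion.

```sql
-- Option 1: sample with TABLESAMPLE while developing.
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.crawl.pages`
  TABLESAMPLE SYSTEM (0.01 PERCENT)  -- comment out this line for the full run
WHERE
  date = '2023-05-01' AND
  client = 'desktop';

-- Option 2: restrict by rank while developing.
SELECT
  custom_metrics.other.avg_dom_depth
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2023-05-01' AND
  client = 'desktop'
  AND rank <= 1000  -- comment out this line for the full run
```

In both cases the full run is one commented-out line away, whereas switching from `sample_data` means editing the table name itself.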

## Use table previews

BigQuery allows you to preview entire rows of a table without incurring a query cost. This is useful for getting a rough idea of the data in a table before running a more expensive query.