Storm crawler not honouring crawl delay #1808
It would help if you could reformat this discussion thread and put your config into code formatting. Otherwise, I fear it is unreadable.
You do have a 15-minute delay configured:
but that delay is enforced per queue key, not globally per domain. In StormCrawler, politeness is guaranteed only if all URLs for a host end up in the same FetcherBolt queue. If they don’t, parallel fetchers will happily fetch the same host in parallel and you’ll see requests much sooner than 15 minutes apart. The delay is enforced inside each FetcherBolt instance, which is subject to
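To illustrate the point about queues, here is a toy simulation in plain Python. It is not StormCrawler code: `simulate`, `shuffle_route`, `key_route` and `min_gap` are hypothetical names made up for this sketch, and each "fetcher" is just a dictionary that remembers when it may next hit a host. It only assumes what is described above, namely that every fetcher instance tracks the delay on its own.

```python
from collections import defaultdict

DELAY = 15 * 60  # the 15-minute crawl delay from this thread, in seconds


def simulate(urls, num_fetchers, route):
    """Return {host: [fetch times]} when each fetcher enforces DELAY independently."""
    # next_allowed[f][host] is the earliest time fetcher f may contact host again
    next_allowed = [defaultdict(float) for _ in range(num_fetchers)]
    fetch_times = defaultdict(list)
    for i, (host, _path) in enumerate(urls):
        f = route(i, host, num_fetchers)
        t = next_allowed[f][host]          # this fetcher only knows its own history
        fetch_times[host].append(t)
        next_allowed[f][host] = t + DELAY
    return fetch_times


def shuffle_route(i, host, n):
    # Round-robin distribution that ignores the host (stand-in for a shuffle grouping)
    return i % n


def key_route(i, host, n):
    # Deterministic routing by host, so one host always lands on the same fetcher
    return sum(host.encode()) % n


def min_gap(times):
    """Smallest interval between two consecutive fetches of the same host."""
    ordered = sorted(times)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


urls = [("example.com", f"/page{i}") for i in range(6)]
gap_shuffled = min_gap(simulate(urls, 3, shuffle_route)["example.com"])
gap_keyed = min_gap(simulate(urls, 3, key_route)["example.com"])
print(gap_shuffled, gap_keyed)
```

With three fetchers and host-agnostic routing, the smallest gap between two requests to the same host collapses to zero, because each fetcher believes it is the only one talking to that host. Routing by host restores the full 15 minutes. In a real topology the equivalent check is whether the stream feeding the fetcher is grouped on the partition key, which depends on the Flux file that is not shown here.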
Hi,
I have spent days on this but couldn't get it to honour the crawl delay. I have tried multiple combinations, and I have come here as a last resort.
I have a crawl delay of 15 mins, but pages are getting fetched earlier than that. I suspect this has something to do with parallelism.
My crawler.flux
My crawler-conf.yaml
My opensearch-conf.yaml