@Stormheg
Member

See #501

Some changes that I want to test on staging, to see whether they stop the bot protection from triggering.

Maybe this is the cause of our crawler getting blocked by djangopackages.org's Cloudflare protection?
Comment on lines +12 to +17
headers = {
    "Accept": "application/json",
    "User-Agent": "Wagtail.org Packages Importer",
}


Contributor

I don't think the headers are needed.

Member Author

Hey @p-r-a-v-i-n, the goal was to seem less suspicious to Cloudflare by not using the default User-Agent header set by requests.

This did not work, so I might change this to a browser's User-Agent header to see if that works instead.

Comment on lines +19 to +23
response = requests.get(url, headers=headers, timeout=10)
if not response.ok:
    raise ValueError(f"Failed to fetch data from {url}: {response.status_code}")

grid_data = response.json()

Contributor

p-r-a-v-i-n commented Sep 6, 2025

LGTM! Just an opinion, though:
what do you think of using a try-except block here to raise the error? Sometimes requests raises an exception before you can even check response.ok, for example when the server is down.
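
A minimal sketch of what that could look like, assuming the fetch is wrapped in a helper. The name `fetch_grid_data` is hypothetical and not taken from this PR; the snippet only relies on `requests.get`, `Response.raise_for_status()` and `requests.RequestException`, and it keeps the existing `ValueError` so callers would not need to change:

```python
import requests


def fetch_grid_data(url, headers=None, timeout=10):
    """Fetch JSON from url, turning any requests failure into ValueError.

    fetch_grid_data is a hypothetical helper name, not taken from the PR.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts and HTTP error statuses alike.
        raise ValueError(f"Failed to fetch data from {url}: {exc}") from exc
    return response.json()
```

With this shape, a server that is down and a response with an error status both surface as the same `ValueError`, with the original exception still attached as the cause.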

@Stormheg
Member Author

Stormheg commented Sep 6, 2025

Hello @p-r-a-v-i-n, feel free to continue my work on this PR. This PR is not on my list of priorities to work on.

It might be hard to reproduce the issue with Cloudflare, since it only seems to occur when deployed to staging or production. My guess is that the IP addresses used by our Heroku hosting are on a list at Cloudflare and that this, together with the way the import behaves, makes Cloudflare think we are a malicious bot.

Things that could be changed:

  • The error shown to the user who clicks the 'import' button (located in the Django Packages CMS setting, I believe)
  • Request headers, to make us seem more normal and less like a bot.
  • General behaviour, our importer rapidly requests all pages from the paginated responses. There might be some rate limit in place.
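
For the last point, a rough sketch of slowing the importer down by pausing between pages. Everything here is an assumption for illustration: `fetch_all_pages` and `fetch_page` are hypothetical names, and the `"results"` and `"next"` keys are a guess at the shape of the paginated response, not something confirmed in this PR:

```python
import time


def fetch_all_pages(fetch_page, first_url, delay_seconds=1.0, sleep=time.sleep):
    """Follow paginated responses one at a time, pausing between requests.

    fetch_page is a caller-supplied callable (hypothetical) that takes a URL
    and returns a dict. The "results" and "next" keys are assumptions about
    the response shape, not confirmed by the PR.
    """
    results = []
    url = first_url
    while url:
        page = fetch_page(url)
        results.extend(page.get("results", []))
        url = page.get("next")
        if url:
            # Pause only when another request will follow.
            sleep(delay_seconds)
    return results
```

In the importer, `fetch_page` could be a small wrapper around the existing `requests.get(url, headers=headers, timeout=10)` call that returns `response.json()`. Whether a one-second delay is enough to stay under whatever limit Cloudflare applies is unknown and would need testing on staging.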

@p-r-a-v-i-n
Contributor

p-r-a-v-i-n commented Sep 6, 2025

Thanks. I think you are right about Heroku's IPs; they often get flagged by Cloudflare. I don't have much exposure to this, though.

  • General behaviour, our importer rapidly requests all pages from the paginated responses. There might be some rate limit in place.

This looks like the likely culprit.
