@foxt451 foxt451 commented Dec 18, 2025

The framework is mostly copied from cheerio-scraper (trimmed in many places), and the request handler is inspired by the sitemap scraper in WCC.

There has been discussion about how to avoid duplicating code between WCC and this repo, and some suggested extracting the sitemap scraper into a package. However, when I checked the sitemap crawler code in WCC, it turned out to be tightly coupled to WCC and quite short, so I just copied it over with modifications.

BUT, the one thing I copied without any changes is the `discoverValidSitemaps` util from WCC. I'd like to extract it somewhere, e.g. into scraper-tools, because it seems like quite a generic function.

Tested locally. For now, it just pushes a dataset item with a URL and status code for each page.
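
A rough sketch of what that output could look like. The field name `statusCode`, the `PageResult` type, and the `toDatasetItem` helper are assumptions for illustration only; the PR only says each item carries a URL and a status code. A plain array stands in for the dataset here.

```typescript
// Hypothetical record shape: the PR mentions only a URL and a status code per page;
// the exact field names are an assumption.
interface PageResult {
    url: string;
    statusCode: number;
}

// Hypothetical helper that turns a fetched page into a dataset item.
function toDatasetItem(url: string, statusCode: number): PageResult {
    return { url, statusCode };
}

// Stand-in for the dataset: in the actor this would be a push to the default dataset.
const items: PageResult[] = [];
items.push(toDatasetItem('https://example.com/', 200));
items.push(toDatasetItem('https://example.com/missing', 404));
```

Keeping the item this small presumably makes it easy to extend later without breaking consumers that only read the URL and status.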

Closes apify/apify-sdk-js#486

@foxt451 foxt451 changed the title Feat/sitemap scraper Add sitemap scraper Dec 18, 2025
foxt451 added a commit to apify/crawlee that referenced this pull request Jan 19, 2026
Related to apify/apify-sdk-js#486. I'm
[developing a generic sitemap
scraper](apify/actor-scraper#205) and it's going
to share a big utility function (the main chunk of logic) with WCC:
`discoverValidSitemaps`. I asked @barjin if I could factor it out, and
he said this util could fit into crawlee. It's mainly copied from WCC,
but to keep the dependencies unchanged, it uses got-scraping to check
for URL existence instead of impit (I think it doesn't matter for
sitemaps), and `urlExists` is inlined (until an HTTP client is added to
these utils in v4, as @barjin told me). It's also turned into an async
generator. Let me know if you see a better place for this util.
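
To illustrate the async-generator shape described above, here is a minimal sketch. It is not the actual `discoverValidSitemaps` code: the name `discoverSitemapsSketch`, the `UrlExists` type, the candidate path list, and the injected existence check are all assumptions made for this example (the real util inlines its own `urlExists` built on got-scraping).

```typescript
// Hypothetical type: an existence check injected by the caller.
type UrlExists = (url: string) => Promise<boolean>;

// Illustrative candidate paths; the list used by the real util is not stated in this PR.
const CANDIDATE_PATHS = ['/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml'];

// Yields each candidate sitemap URL that passes the existence check,
// as soon as it is confirmed, instead of collecting everything first.
async function* discoverSitemapsSketch(
    origins: string[],
    urlExists: UrlExists,
): AsyncGenerator<string> {
    for (const origin of origins) {
        for (const path of CANDIDATE_PATHS) {
            const candidate = new URL(path, origin).href;
            if (await urlExists(candidate)) yield candidate;
        }
    }
}

// Usage with a fake check, so nothing touches the network.
async function collect(): Promise<string[]> {
    const known = new Set(['https://example.com/sitemap.xml']);
    const found: string[] = [];
    const fakeExists: UrlExists = async (u) => known.has(u);
    for await (const url of discoverSitemapsSketch(['https://example.com'], fakeExists)) {
        found.push(url);
    }
    return found;
}
```

With a generator, a consumer can start working on the first sitemap while the remaining candidates are still being checked.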
@foxt451 foxt451 changed the title Add sitemap scraper feat(actor-scraper): add sitemap scraper Jan 19, 2026
@foxt451 foxt451 marked this pull request as ready for review January 19, 2026 14:44
@foxt451 foxt451 merged commit 4b83fa3 into master Jan 20, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Actor to check web page availability
