Use addLink in behaviors to crawl additional pages without scope limitation #71

@cmillet2127

I am trying to crawl subpages of a main page, selected with an XPath expression.

I can't navigate with window.location.href to crawl the additional pages, because it throws "Execution context was destroyed", so I am trying to use ctx.Lib.addLink instead.
After reading the browsertrix-crawler code, it seems the addLink callback is not set in my case. It also seems that, when addLink is set, it is restricted by the scopeType.
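To make it visible whether each link was actually handed over, the call can be wrapped so that a missing or failing function is reported instead of passing silently. This is only a minimal sketch: `queueLink` is a hypothetical helper name, and the `addLink` argument stands in for whatever `ctx.Lib.addLink` resolves to; nothing here reflects the crawler's internals. Note that `queued: true` only means the call returned without throwing, not that the crawler accepted the URL into scope.

```javascript
// Hypothetical helper (not part of browsertrix-crawler or browsertrix-behaviors).
// `addLink` is passed in explicitly, so the function can be exercised with a fake.
async function queueLink(addLink, href) {
  // Report a missing function instead of throwing a TypeError.
  if (typeof addLink !== "function") {
    return { href, queued: false, reason: "addLink is not a function" };
  }
  try {
    await addLink(href);
    // The call returned; whether the URL is in scope is decided elsewhere.
    return { href, queued: true, reason: null };
  } catch (err) {
    const reason = err && err.message ? err.message : String(err);
    return { href, queued: false, reason };
  }
}
```

Inside a behavior, the result could be passed to `getState` so the log shows the reason when a link is not queued.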

URL to crawl: https://group.bnpparibas/toutes-actualites/communique-de-presse

Behavior to crawl additional pages (the first 8 articles)

```
class BnpCommuniquesdePresseBehavior {
	static id = "BnpCommuniquesdePresse";

	static init() {
		return {
			state: { links: 0 },
			opts: {}
		};
	}

	// Run only on the press release listing page.
	static isMatch() {
		return window.location.href === "https://group.bnpparibas/toutes-actualites/communique-de-presse";
	}

	async *run(ctx) {
		const { getState, xpathNodes, addLink } = ctx.Lib;

		yield getState(ctx, "BnpCommuniquesdePresseBehavior starting...");

		// Anchors of the first 8 articles in the list.
		const aTags = Array.from(xpathNodes("//main//div//div//div//div//div//ul/li[position() <= 8]/article/a"));

		if (aTags.length) {
			yield getState(ctx, aTags.length + " hrefs found");
			// aTags is a plain array, so a regular for...of loop is enough.
			for (const aTag of aTags) {
				await addLink(aTag.href);
				// The third argument increments the "links" counter in the state.
				yield getState(ctx, "Add a link to crawl: " + aTag.href, "links");
			}
		} else {
			yield getState(ctx, "no link found");
		}
		yield getState(ctx, "BnpCommuniquesdePresseBehavior done");
	}
}
```
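The link selection in the behavior above can also be separated from the DOM so it is testable on its own. The following is a sketch under assumptions: `collectHrefs` is a hypothetical helper, and it accepts any objects exposing an `href` string (such as the anchors returned by the XPath query). It skips empty and duplicate URLs and stops at a limit, here defaulting to the 8 articles mentioned above.

```javascript
// Hypothetical helper: works on any iterable of objects with an `href` string.
function collectHrefs(nodes, limit = 8) {
  const seen = new Set();
  const hrefs = [];
  for (const node of nodes) {
    // Stop once the limit is reached (also covers limit = 0).
    if (hrefs.length >= limit) break;
    const href = node && typeof node.href === "string" ? node.href.trim() : "";
    // Skip anchors without a usable href and URLs already collected.
    if (!href || seen.has(href)) continue;
    seen.add(href);
    hrefs.push(href);
  }
  return hrefs;
}
```

In the behavior, the loop could then iterate over `collectHrefs(aTags)` rather than over the raw anchors, so the same article is never queued twice.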

The docker command line:
docker run -p 6080:6080 -p 9223:9223 -v c:\tmp\crawls\:/crawls/ -v c:\tmp\custom-behaviors\:/custom-behaviors/ -it webrecorder/browsertrix-crawler:latest crawl --url https://group.bnpparibas/toutes-actualites/communique-de-presse --generateWACZ final-to-warc --text --wait-until domcontentloaded --screenshot thumbnail,view,fullPage --scopeType page --customBehaviors /custom-behaviors/ --pageLimit 10 --screencastPort 9223 --profile "/crawls/profiles/group.bnpparibas.tar.gz" --behaviors siteSpecific

Labels: enhancement (New feature or request)
