Only one URL from sourceRegion can be extracted #107

@code4craft

In annotation mode, a sourceRegion can be specified for the TargetUrl annotation.

However, only the first URL inside that region is extracted.

private void extractLinks(Page page, Selector urlRegionSelector, List<Pattern> urlPatterns) {
    List<String> links;
    if (urlRegionSelector == null) {
        links = page.getHtml().links().all();
    } else {
        links = urlRegionSelector.selectList(page.getHtml().toString());
    }
    for (String link : links) {
        for (Pattern targetUrlPattern : urlPatterns) {
            Matcher matcher = targetUrlPattern.matcher(link);
            if (matcher.find()) {
                page.addTargetRequest(new Request(matcher.group(1)));
            }
        }
    }
}

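To fix it, I changed the assignment in the else branch from

links = urlRegionSelector.selectList(page.getHtml().toString());

to

links = page.getHtml().selectList(urlRegionSelector).links().all();

so that links holds the individual links inside the selected region.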
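One plausible reading of the behavior above is that each entry in links was a whole region of HTML rather than a single URL, so the single matcher.find() call per pattern stopped at the first match. The sketch below reproduces that effect with java.util.regex only; it does not use WebMagic. The class name, the sample HTML, the pattern, and the crude href extractor (a stand-in for links().all()) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionLinkDemo {

    // Hypothetical region HTML, standing in for what a sourceRegion selector might return.
    public static final String REGION_HTML =
            "<div class=\"list\">"
            + "<a href=\"http://example.com/post/1\">one</a>"
            + "<a href=\"http://example.com/post/2\">two</a>"
            + "<a href=\"http://example.com/post/3\">three</a>"
            + "</div>";

    // Hypothetical target pattern with one capturing group, since the issue's code reads group(1).
    public static final Pattern TARGET =
            Pattern.compile("(http://example\\.com/post/\\d+)");

    // Same loop shape as extractLinks: one find() per item, so at most one result per item.
    public static List<String> matchOncePerItem(List<String> items, Pattern pattern) {
        List<String> found = new ArrayList<>();
        for (String item : items) {
            Matcher matcher = pattern.matcher(item);
            if (matcher.find()) {
                found.add(matcher.group(1));
            }
        }
        return found;
    }

    // Crude regex stand-in for links().all(): splits a region into its href values.
    public static List<String> hrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]*)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // The whole region passed as a single "link": only the first URL is found.
        System.out.println(matchOncePerItem(Collections.singletonList(REGION_HTML), TARGET));
        // Individual links passed one by one: every URL is found.
        System.out.println(matchOncePerItem(hrefs(REGION_HTML), TARGET));
    }
}
```

With the sample data, the first call yields only http://example.com/post/1, while the second yields all three URLs. That is the difference the one-line change is meant to achieve.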
