How to detect or track the page redirection and non-standard href? #212

@hsluoyz

Some websites rely heavily on page redirection and non-standard a[href] values. This makes it hard for a crawler to tell whether two URLs actually point to the same page.

For example, when I visit http://www.ruanyifeng.com, I am redirected to http://www.ruanyifeng.com/home.html. My crawler therefore treats the paths / and /home.html as two different pages, but I want it to treat them as one page.

The following code shows the problem:

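package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Print every raw href together with the path of the page it was found on.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Printf("a[href]: %s\n", e.Attr("href"))
		fmt.Printf("path: %s\n", e.Request.URL.Path)
	})

	// Print the path as it is requested.
	c.OnRequest(func(r *colly.Request) {
		fmt.Printf("OnRequest: %s\n", r.URL.Path)
	})

	// Print the path attached to the response.
	c.OnResponse(func(r *colly.Response) {
		fmt.Printf("OnResponse: %s\n", r.Request.URL.Path)
	})
	c.Visit("http://www.ruanyifeng.com/")
}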

The log is as follows. I also print extra lines of the form [x-y] text, where x is the request ID, y is the index of the a[href] on the page, and text is the r.URL.Path.

[0-0] /
OnRequest: / <------------ redirect happens, OnRequest != OnResponse
OnResponse: /home.html  <------------ redirect happens, OnRequest != OnResponse
a[href]: blog/ <------------ blog/ is not /blog/, a[href] != OnRequest
path: /home.html
  [1-0] blog
OnRequest: /blog/ <------------ blog/ is not /blog/, a[href] != OnRequest
OnResponse: /blog/
a[href]: /feed.html
path: /blog/
(/blog/ != blog)
    [2-0] /feed.html
OnRequest: /feed.html
OnResponse: /feed.html
a[href]: http://www.ruanyifeng.com/
path: /feed.html
a[href]: http://feeds.feedburner.com/ruanyifeng
path: /feed.html
a[href]: http://www.ruanyifeng.com/blog/atom.xml
path: /feed.html
      [3-2] /blog/*
OnRequest: /blog/atom.xml
OnResponse: /blog/atom.xml
a[href]: http://feedburner.google.com
path: /feed.html
a[href]: http://feeds.feedburner.com/ruanyifeng
path: /feed.html
a[href]: http://feedburner.google.com/fb/a/mailverify?uri=ruanyifeng&loc=zh_CN
path: /feed.html
a[href]: http://ruanyf.blogspot.com
path: /feed.html
a[href]: http://www.ruanyifeng.com/contact.html
path: /feed.html
      [3-7] /contact.html
OnRequest: /contact.html
OnResponse: /contact.html
a[href]: http://www.ruanyifeng.com/
path: /contact.html
a[href]: http://www.ruanyifeng.com/contact.html
path: /contact.html
a[href]: https://www.injz.net/dzs/sitemap.xml
path: /contact.html
a[href]: https://www.injz.net/dzs/sitemap.xml
path: /feed.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/
    [2-1] /blog/*/*/*.html
OnRequest: /blog/2018/08/weixin.html
OnResponse: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/
path: /blog/2018/08/weixin.html
      [6-0] /blog
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/weixin.html
a[href]: /feed.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/ <------------ this is the absolute URL
path: /blog/2018/08/weixin.html
      [6-7] /blog/*/*
OnRequest: /blog/2018/08/ <------------ this is the relative URL.
OnResponse: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/
path: /blog/2018/08/
(/blog/2018/08/ != /blog/2018/08)
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/
a[href]: /feed.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/07/
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/
a[href]: https://www.scmp.com/tech/article/2159831/how-wechat-became-chinas-everyday-mobile-app
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html#comments
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/api-below.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-17.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/svg.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-16.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/07/
path: /blog/2018/08/
a[href]: http://weibo.com/ruanyf
path: /blog/2018/08/
a[href]: https://twitter.com/ruanyf
path: /blog/2018/08/
a[href]: https://github.com/ruanyf
path: /blog/2018/08/
a[href]: /contact.html
path: /blog/2018/08/
a[href]: http://www.zhufengpeixun.cn/main/index.html?ref=ruanyifeng
path: /blog/2018/08/weixin.html
a[href]: https://www.scmp.com/tech/article/2159831/how-wechat-became-chinas-everyday-mobile-app
path: /blog/2018/08/weixin.html
a[href]: https://zh.wikipedia.org/zh-sg/%E5%BE%AE%E4%BF%A1
path: /blog/2018/08/weixin.html
a[href]: http://creativecommons.org/licenses/by-nc-nd/3.0/deed.zh
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/weixin.html
a[href]: /road
path: /blog/2018/08/weixin.html
      [6-14] /road
OnRequest: /road
OnResponse: /road/ <------------ This time OnResponse() adds a trailing "/"
a[href]: /home.html
path: /road/
(/road/ != /road)
        [8-0] /home.html
OnRequest: /home.html
OnResponse: /home.html
a[href]: blog/
path: /home.html
a[href]: survivor/
path: /home.html
          [9-1] survivor
OnRequest: /survivor/ <------------ This time OnRequest() adds both a leading "/" and a trailing "/"
OnResponse: /survivor/
a[href]: /home.html
path: /survivor/
(/survivor/ != survivor)
a[href]: https://search.jd.com/Search?keyword=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85&enc=utf-8&wq=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85
path: /survivor/
a[href]: https://s.taobao.com/search?q=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85
path: /survivor/
a[href]: https://www.amazon.cn/dp/B07DY286SY/
path: /survivor/
a[href]: http://product.dangdang.com/25300552.html
path: /survivor/
a[href]: images/published_cover.jpg
path: /survivor/
            [10-5] images/published_cover.jpg
OnRequest: /survivor/images/published_cover.jpg
OnResponse: /survivor/images/published_cover.jpg
a[href]: http://survivor.ruanyifeng.com/collapse/working-poor.html
path: /survivor/
a[href]: ./collapse/index.html <------------ Some hrefs also start with "./", which I find frustrating.
path: /survivor/

So there are four values involved here:

href -> OnRequest -> OnResponse -> path

The crawler first sees the href link on a page, then sends a request, which triggers OnRequest. When the response arrives, OnResponse is triggered. Finally, OnHTML is triggered and path can be read.

It seems that:

  • OnResponse and path are always the same, because they print the path of the same page.
  • href and OnRequest can differ when the href is non-standard: with or without a leading or trailing "/", with "./" or "../", and some links even use an absolute URL.
  • OnRequest and OnResponse can differ when the page is redirected: OnRequest shows the original path and OnResponse shows the redirected path. In the /road example above, OnResponse also gains a trailing "/" and becomes /road/.
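The href/OnRequest mismatch in the second point looks like ordinary relative-reference resolution, so one possible approach is to resolve every href against the URL of the page it was found on before comparing anything. Below is a minimal sketch using only the standard library; `resolveHref` is a hypothetical helper name, not part of Colly. Colly's own `e.Request.AbsoluteURL(href)` appears to serve the same purpose inside an OnHTML callback.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref (hypothetical helper) resolves a raw href against the URL of
// the page it was found on and returns the absolute URL without a fragment.
func resolveHref(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	abs := base.ResolveReference(ref)
	abs.Fragment = "" // "#comments" and similar do not name a different page
	return abs.String(), nil
}

func main() {
	page := "http://www.ruanyifeng.com/survivor/"
	hrefs := []string{
		"/home.html",
		"images/published_cover.jpg",
		"./collapse/index.html",
		"http://www.ruanyifeng.com/blog/2018/08/weixin.html#comments",
	}
	for _, href := range hrefs {
		abs, err := resolveHref(page, href)
		if err != nil {
			fmt.Println("skip:", href, err)
			continue
		}
		fmt.Printf("%s -> %s\n", href, abs)
	}
}
```

With this, "blog/" found on /home.html and an absolute link to /blog/ would resolve to the same string, so both can be used as one key.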

So I'm confused about these paths. Details like the trailing "/" make it hard for my crawler to recognize identical pages. Is there any way to solve this? For example, Colly could provide a pointer to the previous page, showing that /home.html was redirected from /, and tell me that /road/ in the response path is the same page as /road in the request path. Thanks.
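One way the "previous page pointer" could be approximated on the crawler side is to record, for every fetch, the path that was requested and the path that finally answered, and to keep an alias table between the two. The sketch below uses plain net/http and a local test server that imitates the two redirects from the log (/ to /home.html and /road to /road/). The names `fetchFinalPath`, `aliases`, and `newDemoServer` are hypothetical. In Colly, the equivalent values would presumably be the path seen in OnRequest and `r.Request.URL.Path` in OnResponse, as the log suggests.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// aliases maps a requested path to the path it was redirected to.
type aliases map[string]string

// record stores an alias only when a redirect changed the path.
func (a aliases) record(requested, final string) {
	if requested != final {
		a[requested] = final
	}
}

// canonical follows recorded aliases (at most 10 hops, to survive loops).
func (a aliases) canonical(p string) string {
	for hops := 0; hops < 10; hops++ {
		next, ok := a[p]
		if !ok {
			return p
		}
		p = next
	}
	return p
}

// fetchFinalPath returns the requested path and the path of the last request
// in the redirect chain (resp.Request is the request that got the response).
func fetchFinalPath(client *http.Client, rawURL string) (string, string, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return "", "", err
	}
	requested := req.URL.Path
	resp, err := client.Do(req)
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	return requested, resp.Request.URL.Path, nil
}

// newDemoServer imitates the redirects seen in the log; it is a stand-in,
// not the real site.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/":
			http.Redirect(w, r, "/home.html", http.StatusFound)
		case "/road":
			http.Redirect(w, r, "/road/", http.StatusMovedPermanently)
		default:
			fmt.Fprintln(w, "ok")
		}
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	seen := aliases{}
	for _, p := range []string{"/", "/road", "/home.html"} {
		requested, final, err := fetchFinalPath(srv.Client(), srv.URL+p)
		if err != nil {
			fmt.Println("fetch failed:", err)
			return
		}
		seen.record(requested, final)
		fmt.Printf("requested %s, answered by %s\n", requested, final)
	}
	fmt.Println("canonical of /road:", seen.canonical("/road"))
}
```

Looking up `canonical(path)` before deciding whether a page is new would treat / and /home.html, and /road and /road/, as one page each. The table only knows about redirects the crawler has already observed, so a path that has never been requested is returned unchanged.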
