How to detect or track the page redirection and non-standard href?

Some websites rely heavily on page redirection and non-standard ``a href`` values, which makes it hard for the crawler to tell whether two URLs point to the same page.

For example, when I request http://www.ruanyifeng.com, I am redirected to http://www.ruanyifeng.com/home.html. My crawler then treats the paths ``/`` and ``/home.html`` as two different pages, but I want it to treat them as one.

I used the following code to log the paths at each stage:

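```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Print each link's raw href and the path of the page it was found on.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Printf("a[href]: %s\n", e.Attr("href"))
		fmt.Printf("path: %s\n", e.Request.URL.Path)
	})

	// Print the path that is about to be requested.
	c.OnRequest(func(r *colly.Request) {
		fmt.Printf("OnRequest: %s\n", r.URL.Path)
	})

	// Print the request path as seen once the response has arrived.
	c.OnResponse(func(r *colly.Response) {
		fmt.Printf("OnResponse: %s\n", r.Request.URL.Path)
	})

	c.Visit("http://www.ruanyifeng.com/")
}
```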

The log is as follows. I also printed extra lines of the form ``[x-y] text``, where ``x`` is the request ID, ``y`` is the index of the ``a href`` on the page, and ``text`` is the ``r.URL.Path``.

```log
[0-0] /
OnRequest: / <------------ redirect happens, OnRequest != OnResponse
OnResponse: /home.html  <------------ redirect happens, OnRequest != OnResponse
a[href]: blog/ <------------ blog/ is not /blog/, a[href] != OnRequest
path: /home.html
  [1-0] blog
OnRequest: /blog/ <------------ blog/ is not /blog/, a[href] != OnRequest
OnResponse: /blog/
a[href]: /feed.html
path: /blog/
(/blog/ != blog)
    [2-0] /feed.html
OnRequest: /feed.html
OnResponse: /feed.html
a[href]: http://www.ruanyifeng.com/
path: /feed.html
a[href]: http://feeds.feedburner.com/ruanyifeng
path: /feed.html
a[href]: http://www.ruanyifeng.com/blog/atom.xml
path: /feed.html
      [3-2] /blog/*
OnRequest: /blog/atom.xml
OnResponse: /blog/atom.xml
a[href]: http://feedburner.google.com
path: /feed.html
a[href]: http://feeds.feedburner.com/ruanyifeng
path: /feed.html
a[href]: http://feedburner.google.com/fb/a/mailverify?uri=ruanyifeng&loc=zh_CN
path: /feed.html
a[href]: http://ruanyf.blogspot.com
path: /feed.html
a[href]: http://www.ruanyifeng.com/contact.html
path: /feed.html
      [3-7] /contact.html
OnRequest: /contact.html
OnResponse: /contact.html
a[href]: http://www.ruanyifeng.com/
path: /contact.html
a[href]: http://www.ruanyifeng.com/contact.html
path: /contact.html
a[href]: https://www.injz.net/dzs/sitemap.xml
path: /contact.html
a[href]: https://www.injz.net/dzs/sitemap.xml
path: /feed.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/
    [2-1] /blog/*/*/*.html
OnRequest: /blog/2018/08/weixin.html
OnResponse: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/
path: /blog/2018/08/weixin.html
      [6-0] /blog
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/weixin.html
a[href]: /feed.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/ <------------ this is the absolute URL
path: /blog/2018/08/weixin.html
      [6-7] /blog/*/*
OnRequest: /blog/2018/08/ <------------ this is the relative URL.
OnResponse: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/
path: /blog/2018/08/
(/blog/2018/08/ != /blog/2018/08)
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/
a[href]: /feed.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/07/
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/
a[href]: https://www.scmp.com/tech/article/2159831/how-wechat-became-chinas-everyday-mobile-app
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html#comments
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/api-below.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-17.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/svg.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-16.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/07/
path: /blog/2018/08/
a[href]: http://weibo.com/ruanyf
path: /blog/2018/08/
a[href]: https://twitter.com/ruanyf
path: /blog/2018/08/
a[href]: https://github.com/ruanyf
path: /blog/2018/08/
a[href]: /contact.html
path: /blog/2018/08/
a[href]: http://www.zhufengpeixun.cn/main/index.html?ref=ruanyifeng
path: /blog/2018/08/weixin.html
a[href]: https://www.scmp.com/tech/article/2159831/how-wechat-became-chinas-everyday-mobile-app
path: /blog/2018/08/weixin.html
a[href]: https://zh.wikipedia.org/zh-sg/%E5%BE%AE%E4%BF%A1
path: /blog/2018/08/weixin.html
a[href]: http://creativecommons.org/licenses/by-nc-nd/3.0/deed.zh
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/weixin.html
a[href]: /road
path: /blog/2018/08/weixin.html
      [6-14] /road
OnRequest: /road
OnResponse: /road/ <------------ This time OnResponse() adds a trailing "/"
a[href]: /home.html
path: /road/
(/road/ != /road)
        [8-0] /home.html
OnRequest: /home.html
OnResponse: /home.html
a[href]: blog/
path: /home.html
a[href]: survivor/
path: /home.html
          [9-1] survivor
OnRequest: /survivor/ <------------ This time OnRequest() adds both a leading "/" and a trailing "/"
OnResponse: /survivor/
a[href]: /home.html
path: /survivor/
(/survivor/ != survivor)
a[href]: https://search.jd.com/Search?keyword=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85&enc=utf-8&wq=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85
path: /survivor/
a[href]: https://s.taobao.com/search?q=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85
path: /survivor/
a[href]: https://www.amazon.cn/dp/B07DY286SY/
path: /survivor/
a[href]: http://product.dangdang.com/25300552.html
path: /survivor/
a[href]: images/published_cover.jpg
path: /survivor/
            [10-5] images/published_cover.jpg
OnRequest: /survivor/images/published_cover.jpg
OnResponse: /survivor/images/published_cover.jpg
a[href]: http://survivor.ruanyifeng.com/collapse/working-poor.html
path: /survivor/
a[href]: ./collapse/index.html <------------ Some hrefs also start with "./", which I find frustrating.
path: /survivor/
```

So you can see there are four things here:

``href`` -> ``OnRequest`` -> ``OnResponse`` -> ``path``

The crawler first sees the ``href`` link on a page, then sends a request, which triggers ``OnRequest``. When the response arrives it triggers ``OnResponse``, and finally ``OnHTML`` is triggered and ``path`` can be read.

It seems that:

- ``OnResponse`` and ``path`` are always the same, since both print the path of the same page.
- ``href`` and ``OnRequest`` can differ when ``href`` is non-standard: with or without a leading or trailing "/", with "./" or "../", or even written as an absolute URL.
- ``OnRequest`` and ``OnResponse`` can differ when the page is redirected: ``OnRequest`` shows the original path and ``OnResponse`` shows the redirected path. In the ``/road`` example above, though, ``OnResponse`` also gains a trailing "/" and becomes ``/road/``.
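The difference between ``href`` and ``OnRequest`` looks like ordinary relative-reference resolution: an ``href`` such as ``blog/`` or ``./collapse/index.html`` is resolved against the URL of the page it appears on. Below is a minimal sketch using only Go's standard ``net/url`` package, not colly itself. ``resolveHref`` and ``canonicalKey`` are hypothetical helper names, and treating ``/road`` and ``/road/`` as the same page is an assumption that may not hold for every site. I believe colly offers ``e.Request.AbsoluteURL(href)`` for the resolution step, but I have not checked it against this version.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveHref (hypothetical helper) resolves an href found on the page at
// pageURL into an absolute URL, using the reference-resolution rules
// implemented by net/url.
func resolveHref(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// canonicalKey (hypothetical helper) builds a comparison key for a URL.
// Assumptions: fragments never identify a different page, host names are
// case-insensitive, and a trailing "/" does not change which page is meant.
func canonicalKey(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.RawFragment = ""
	u.Host = strings.ToLower(u.Host)
	u.RawPath = "" // let String() re-encode Path after we modify it
	if u.Path == "" {
		u.Path = "/"
	}
	if len(u.Path) > 1 {
		u.Path = strings.TrimRight(u.Path, "/")
	}
	return u.String(), nil
}

func main() {
	page := "http://www.ruanyifeng.com/survivor/"
	hrefs := []string{"images/published_cover.jpg", "./collapse/index.html", "/home.html"}
	for _, href := range hrefs {
		abs, err := resolveHref(page, href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		key, err := canonicalKey(abs)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", href, key)
	}
}
```

Comparing pages by ``canonicalKey`` instead of by the raw ``href`` or raw path would make ``/road`` and ``/road/`` count as one page, at the cost of merging the two on any site where they really are different.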

So I'm confused about these paths. Differences such as a trailing "/" make it hard for my crawler to recognize pages it has already seen. Is there any way to solve this? For example, could colly provide a pointer to the previous page, showing that ``/home.html`` was redirected from ``/``, and let me know that ``/road/`` in the response path is the same page as ``/road`` in the request path? Thanks.
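To illustrate the kind of "previous page pointer" I mean, here is a rough sketch that records the redirect chain with the standard library's ``http.Client.CheckRedirect`` hook. It does not use colly. ``fetchWithRedirects`` and ``newDemoServer`` are hypothetical names, and the local test server only imitates the ``/`` to ``/home.html`` and ``/road`` to ``/road/`` behaviour seen in the log. colly appears to expose a similar hook, ``SetRedirectHandler``, taking the same callback shape, but I am not certain it fits this use.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fetchWithRedirects (hypothetical helper) requests startURL and returns every
// URL visited, in order: the first entry is the URL that was requested and the
// last entry is the URL that finally answered.
func fetchWithRedirects(startURL string) ([]string, error) {
	chain := []string{startURL}
	client := &http.Client{
		// CheckRedirect is called before each redirect is followed.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 10 {
				return fmt.Errorf("stopped after %d redirects", len(via))
			}
			chain = append(chain, req.URL.String())
			return nil
		},
	}
	resp, err := client.Get(startURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return chain, nil
}

// newDemoServer (hypothetical) starts a local server that imitates the two
// redirects from the log: "/" -> "/home.html" and "/road" -> "/road/".
func newDemoServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/" {
			http.Redirect(w, r, "/home.html", http.StatusFound)
			return
		}
		http.NotFound(w, r)
	})
	mux.HandleFunc("/home.html", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "home")
	})
	mux.HandleFunc("/road", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/road/", http.StatusMovedPermanently)
	})
	mux.HandleFunc("/road/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "road")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	for _, path := range []string{"/", "/road"} {
		chain, err := fetchWithRedirects(srv.URL + path)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// chain[0] is the requested URL; the last element is where it ended up.
		fmt.Printf("%s ended at %s after %d redirect(s)\n",
			chain[0], chain[len(chain)-1], len(chain)-1)
	}
}
```

With a chain like this, the crawler could store every earlier URL in the chain as an alias of the final one, so that ``/`` and ``/home.html`` are recorded as the same page.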

