Some websites rely heavily on redirects and non-standard a href values, which makes it hard for a crawler to tell whether two URLs point to the same page.
For example, when I request http://www.ruanyifeng.com, I am redirected to http://www.ruanyifeng.com/home.html, so my crawler treats the paths / and /home.html as two different pages. I want it to treat them as one page.
The following code shows the problem:
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	fmt.Printf("a[href]: %s\n", e.Attr("href"))
	fmt.Printf("path: %s\n", e.Request.URL.Path)
})
c.OnRequest(func(r *colly.Request) {
	fmt.Printf("OnRequest: %s\n", r.URL.Path)
})
c.OnResponse(func(r *colly.Response) {
	fmt.Printf("OnResponse: %s\n", r.Request.URL.Path)
})
c.Visit("http://www.ruanyifeng.com/")
The log is as follows. I also printed extra lines of the form [x-y] text, where x is the request ID, y is the index of the a href on the page, and text is r.URL.Path.
[0-0] /
OnRequest: / <------------ redirect happens, OnRequest != OnResponse
OnResponse: /home.html <------------ redirect happens, OnRequest != OnResponse
a[href]: blog/ <------------ blog/ is not /blog/, a[href] != OnRequest
path: /home.html
[1-0] blog
OnRequest: /blog/ <------------ blog/ is not /blog/, a[href] != OnRequest
OnResponse: /blog/
a[href]: /feed.html
path: /blog/
(/blog/ != blog)
[2-0] /feed.html
OnRequest: /feed.html
OnResponse: /feed.html
a[href]: http://www.ruanyifeng.com/
path: /feed.html
a[href]: http://feeds.feedburner.com/ruanyifeng
path: /feed.html
a[href]: http://www.ruanyifeng.com/blog/atom.xml
path: /feed.html
[3-2] /blog/*
OnRequest: /blog/atom.xml
OnResponse: /blog/atom.xml
a[href]: http://feedburner.google.com
path: /feed.html
a[href]: http://feeds.feedburner.com/ruanyifeng
path: /feed.html
a[href]: http://feedburner.google.com/fb/a/mailverify?uri=ruanyifeng&loc=zh_CN
path: /feed.html
a[href]: http://ruanyf.blogspot.com
path: /feed.html
a[href]: http://www.ruanyifeng.com/contact.html
path: /feed.html
[3-7] /contact.html
OnRequest: /contact.html
OnResponse: /contact.html
a[href]: http://www.ruanyifeng.com/
path: /contact.html
a[href]: http://www.ruanyifeng.com/contact.html
path: /contact.html
a[href]: https://www.injz.net/dzs/sitemap.xml
path: /contact.html
a[href]: https://www.injz.net/dzs/sitemap.xml
path: /feed.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/
[2-1] /blog/*/*/*.html
OnRequest: /blog/2018/08/weixin.html
OnResponse: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/
path: /blog/2018/08/weixin.html
[6-0] /blog
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/weixin.html
a[href]: /feed.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/2018/08/ <------------ this is the absolute URL
path: /blog/2018/08/weixin.html
[6-7] /blog/*/*
OnRequest: /blog/2018/08/ <------------ this is the relative URL.
OnResponse: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/
path: /blog/2018/08/
(/blog/2018/08/ != /blog/2018/08)
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/
a[href]: /feed.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/07/
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/
a[href]: https://www.scmp.com/tech/article/2159831/how-wechat-became-chinas-everyday-mobile-app
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weixin.html#comments
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-18.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/api-below.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-17.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/svg.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/08/weekly-issue-16.html
path: /blog/2018/08/
a[href]: http://www.ruanyifeng.com/blog/2018/07/
path: /blog/2018/08/
a[href]: http://weibo.com/ruanyf
path: /blog/2018/08/
a[href]: https://twitter.com/ruanyf
path: /blog/2018/08/
a[href]: https://github.com/ruanyf
path: /blog/2018/08/
a[href]: /contact.html
path: /blog/2018/08/
a[href]: http://www.zhufengpeixun.cn/main/index.html?ref=ruanyifeng
path: /blog/2018/08/weixin.html
a[href]: https://www.scmp.com/tech/article/2159831/how-wechat-became-chinas-everyday-mobile-app
path: /blog/2018/08/weixin.html
a[href]: https://zh.wikipedia.org/zh-sg/%E5%BE%AE%E4%BF%A1
path: /blog/2018/08/weixin.html
a[href]: http://creativecommons.org/licenses/by-nc-nd/3.0/deed.zh
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/archives.html
path: /blog/2018/08/weixin.html
a[href]: http://www.ruanyifeng.com/blog/clipboard/
path: /blog/2018/08/weixin.html
a[href]: /road
path: /blog/2018/08/weixin.html
[6-14] /road
OnRequest: /road
OnResponse: /road/ <------------ This time OnResponse() adds a trailing "/"
a[href]: /home.html
path: /road/
(/road/ != /road)
[8-0] /home.html
OnRequest: /home.html
OnResponse: /home.html
a[href]: blog/
path: /home.html
a[href]: survivor/
path: /home.html
[9-1] survivor
OnRequest: /survivor/ <------------ This time OnRequest() adds both a leading "/" and a trailing "/"
OnResponse: /survivor/
a[href]: /home.html
path: /survivor/
(/survivor/ != survivor)
a[href]: https://search.jd.com/Search?keyword=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85&enc=utf-8&wq=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85
path: /survivor/
a[href]: https://s.taobao.com/search?q=%E6%9C%AA%E6%9D%A5%E4%B8%96%E7%95%8C%E7%9A%84%E5%B9%B8%E5%AD%98%E8%80%85
path: /survivor/
a[href]: https://www.amazon.cn/dp/B07DY286SY/
path: /survivor/
a[href]: http://product.dangdang.com/25300552.html
path: /survivor/
a[href]: images/published_cover.jpg
path: /survivor/
[10-5] images/published_cover.jpg
OnRequest: /survivor/images/published_cover.jpg
OnResponse: /survivor/images/published_cover.jpg
a[href]: http://survivor.ruanyifeng.com/collapse/working-poor.html
path: /survivor/
a[href]: ./collapse/index.html <------------ Some hrefs also start with "./", which is frustrating.
path: /survivor/
So there are four things here:
href -> OnRequest -> OnResponse -> path
The crawler first sees the href link on a page and sends a request, which triggers OnRequest. When the response arrives, OnResponse is triggered, and finally OnHTML is triggered and path can be read.
It seems that:
OnResponse and path are always the same, because both print the path of the same page.
href and OnRequest can differ when the href is non-standard: with or without a leading or trailing "/", with "./" or "../", or even written as an absolute URL.
OnRequest and OnResponse can differ when the page is redirected: OnRequest shows the original path and OnResponse shows the redirected path. In the /road example above, OnResponse also adds a trailing "/", giving /road/.
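The difference between href and OnRequest looks like ordinary relative URL resolution against the page the link was found on. The following is a minimal sketch of that resolution using only Go's standard net/url package; resolveHref is a hypothetical helper name, not part of colly. As far as I know, colly's e.Request.AbsoluteURL(href) performs a comparable resolution, but that is not verified here.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref (hypothetical helper) resolves a raw href against the URL of
// the page it was found on and returns the resulting absolute URL.
func resolveHref(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference handles "blog/", "./x", "../x", "/x" and absolute URLs.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://www.ruanyifeng.com/survivor/"
	hrefs := []string{
		"images/published_cover.jpg",
		"./collapse/index.html",
		"/home.html",
		"http://www.ruanyifeng.com/blog/",
	}
	for _, href := range hrefs {
		abs, err := resolveHref(page, href)
		if err != nil {
			fmt.Println("skip:", href, err)
			continue
		}
		fmt.Printf("%s -> %s\n", href, abs)
	}
}
```

Comparing pages by the resolved absolute URL rather than by the raw href would make blog/ and /blog/ on /home.html come out as the same target.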
So I'm confused by these paths. Details like the trailing "/" make it hard for my crawler to recognize identical pages. Is there any way to solve this? For example, a pointer to the previous page that shows /home.html was redirected from /, and some way to know that /road/ in the response path is the same page as /road in the request path. Thanks.
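To make the request concrete, here is a rough sketch of the bookkeeping I have in mind, written with the standard library only. canonicalPath, Aliases, Record and Key are hypothetical names and not part of colly. It assumes the site serves the same page with and without a trailing "/", which is a heuristic and not true for every server. In colly, the two paths passed to Record could perhaps come from storing r.URL.Path in r.Ctx inside OnRequest and reading it back in OnResponse; that wiring is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalPath trims trailing slashes so /road and /road/ compare equal.
// Assumption: both forms serve the same page on the crawled site.
func canonicalPath(p string) string {
	p = strings.TrimRight(p, "/")
	if p == "" {
		return "/"
	}
	return p
}

// Aliases maps a requested canonical path to the canonical path the
// server finally answered with after redirects.
type Aliases map[string]string

// Record remembers that a request for `requested` ended up at `final`.
func (a Aliases) Record(requested, final string) {
	req, fin := canonicalPath(requested), canonicalPath(final)
	if req != fin {
		a[req] = fin
	}
}

// Key returns the identity to use when deciding whether two paths are the
// same page. It follows recorded redirects, bounded to avoid loops.
func (a Aliases) Key(path string) string {
	key := canonicalPath(path)
	for i := 0; i < 10; i++ {
		next, ok := a[key]
		if !ok {
			break
		}
		key = next
	}
	return key
}

func main() {
	aliases := Aliases{}
	// Request/response pairs taken from the log above.
	aliases.Record("/", "/home.html")
	aliases.Record("/road", "/road/")

	fmt.Println(aliases.Key("/") == aliases.Key("/home.html"))      // true
	fmt.Println(aliases.Key("/road") == aliases.Key("/road/"))      // true
	fmt.Println(aliases.Key("/blog/") == aliases.Key("/feed.html")) // false
}
```

With something like this, / and /home.html would share one key, and so would /road and /road/, which is the behavior I am asking about.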