Introducing Url Identifier
Economic & pragmatic URL duplication avoidance solution for web crawlers.
In One Word
A URL-identifying component for duplicate prevention, used in crawlers.
Say you want to crawl a whole bunch of web pages, and most of them come from major websites like Blogger and YouTube. In the beginning things are fine: you grab the URLs and fetch what you want via Ruby’s open-uri. As time goes on, you begin to find duplicates among your URLs, like these:
- http://happy-fake-blogger.blogspot.com/2013/02/o-la-la.html and
- http://www.youtube.com/watch?v=9bZkp7q19f0 and
In both examples, the three URLs essentially point to the same resource: in the first example we can surmise they refer to a single blog article, while in the second they all refer to the same music video.
The problem is we only want one URL for each unique resource.
(Feel free to comment on things I am not yet aware of.) Here it goes:
- [Detecting Near-Duplicates for Web Crawling](http://www.wwwconference.org/www2007/papers/paper215.pdf)
- Use a frequently updated, exhaustive list to record patterns for common websites
The second solution is what url-identifier goes after.
url-identifier is a pragmatic solution that uses very limited resources.
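To make the pattern-list idea concrete, here is a minimal sketch of how such rules could look. It is not the actual code in this project: the `RULES` table, the `identify` and `unique_urls` helpers, and the identifier format (`youtube:...`, `blogger:...`) are all hypothetical names made up for illustration, and the duplicate URL variants in the usage lines are invented examples. The assumption is that a YouTube video is fully identified by its `v` parameter, and a Blogger post by its host and path.

```ruby
require "uri"
require "set"

# Hypothetical rule table: each rule pairs a host pattern with a lambda that
# turns a parsed URI into a canonical identifier (or nil if it cannot).
RULES = [
  {
    host: /\A(?:www\.|m\.)?youtube\.com\z/,
    # Assumption: the "v" query parameter alone identifies the video.
    identify: lambda { |uri|
      video_id = URI.decode_www_form(uri.query.to_s).to_h["v"]
      video_id && "youtube:#{video_id}"
    }
  },
  {
    host: /\A[a-z0-9-]+\.blogspot\.com\z/,
    # Assumption: host + path identify the post; query and fragment are noise.
    identify: ->(uri) { "blogger:#{uri.host.downcase}#{uri.path}" }
  }
].freeze

# Returns a canonical identifier for a known site, or the URL itself
# when no rule applies or the URL cannot be parsed.
def identify(url)
  uri = URI.parse(url)
  host = uri.host.to_s.downcase
  rule = RULES.find { |r| r[:host].match?(host) }
  (rule && rule[:identify].call(uri)) || url
rescue URI::InvalidURIError
  url
end

# Keeps only the first URL seen for each identifier.
def unique_urls(urls)
  seen = Set.new
  urls.select { |url| seen.add?(identify(url)) }
end

urls = [
  "http://www.youtube.com/watch?v=9bZkp7q19f0",
  "http://www.youtube.com/watch?v=9bZkp7q19f0&feature=related", # made-up variant
  "http://happy-fake-blogger.blogspot.com/2013/02/o-la-la.html",
  "http://happy-fake-blogger.blogspot.com/2013/02/o-la-la.html?m=1" # made-up variant
]
puts unique_urls(urls)
```

Because only a small set of identifiers has to be kept in memory, and no page content is compared, this kind of approach stays cheap. The trade-off is that the rule list must be maintained by hand for every site you care about.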
For now, simply add Ruby rules to lib/url-identifer.rb.
I expect this to change soon as the project gains momentum. I will possibly add a rule table and provide a more convenient way to add and remove rules.