Wednesday, July 23, 2014

Crawling a site with Java

I wanted to crawl a site using Java.


I didn't want to reinvent the wheel, so I looked for a good Java library which crawls sites.

Off the top of my head I knew of several libraries, so I decided to check them first.
I began with Solr & Lucene and continued with Nutch.

After doing some reading I understood that these libraries go far beyond what I need: they would do the job, but they are absolute overkill.

Why?


Well, I guess I need to begin with my actual requirements.
Before the requirements, I suggest reading the following post, just to understand the basics (I won't use its actual code, but it helps in understanding the concept and sharpening the requirements):
How to build a crawler in Java



Requirements

  1. Crawls a full site
  2. Minimal built-in logic, e.g. it won't visit the same URL more than once
  3. Multithreading
  4. A good, simple API that is easy to work with: define the number of threads and the site root, filter which pages to pull according to their URL, don't follow external links, and have an easy way to parse every page and do something with it
  5. Unbloated code, so no GUI on top of it etc. - just a minimal library
  6. Open source
  7. As always, I prefer a current library (released within the last 3 years), preferably one which is regularly updated
  8. Not a search engine library, so no "client / server" code, no need to save the whole website to a NoSQL DB, no need for scaling, multiple servers etc. - just a simple library
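
To make requirements 2-4 a bit more concrete, here is a rough sketch of the behaviour I have in mind, written with plain JDK classes only. It is not the API of Crawler4j or of any other library mentioned in this post: the names SiteCrawlSketch, crawl, urlFilter and fetchLinks are all made up for illustration, and fetchLinks is an assumed stand-in for "download the page, do something with it, and return the links found on it".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch, not the API of any real crawling library.
class SiteCrawlSketch {

    /** Crawls breadth-first from root and returns the visited URLs in visit order. */
    static Set<String> crawl(String root, int threads,
                             Predicate<String> urlFilter,
                             Function<String, List<String>> fetchLinks) {
        String rootHost = URI.create(root).getHost();
        Set<String> visited = new LinkedHashSet<>(); // only touched by the calling thread
        visited.add(root);
        List<String> frontier = Collections.singletonList(root);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty()) {
                // Requirement 3: fetch all pages of the current level in parallel.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    List<String> links = result.get();
                    if (links == null) {
                        continue;
                    }
                    for (String link : links) {
                        // Requirements 2 and 4: stay on the site, apply the URL
                        // filter, and never queue the same URL twice.
                        if (sameHost(rootHost, link) && urlFilter.test(link) && visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    /** True when link is an absolute URL on the same host as the site root. */
    static boolean sameHost(String rootHost, String link) {
        try {
            String host = URI.create(link).getHost();
            return host != null && host.equalsIgnoreCase(rootHost);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL, skip it
        }
    }

    public static void main(String[] args) {
        // A fake three-page site kept in memory, so the sketch runs without network access.
        Map<String, List<String>> site = new HashMap<>();
        site.put("http://example.com/", Arrays.asList(
                "http://example.com/a", "http://example.com/b", "http://other.org/"));
        site.put("http://example.com/a", Arrays.asList(
                "http://example.com/", "http://example.com/private/x"));
        site.put("http://example.com/b", Arrays.asList("http://example.com/a"));

        Set<String> visited = crawl("http://example.com/", 4,
                url -> !url.contains("/private/"),
                url -> site.getOrDefault(url, Collections.emptyList()));
        System.out.println(visited);
    }
}
```

The sketch skips everything a real library has to deal with (HTTP, HTML parsing, robots.txt, politeness delays, URL normalization), which is exactly why I want a library for this. It only shows the shape of the API I am after: a thread count, a site root, a URL filter and a per-page callback.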


Requirement #8 removed the libraries I knew of from the list (Solr / Lucene / Nutch / Bixo / Elasticsearch).

I googled for a list of crawling libraries and found this good list:

I checked all of those libraries, but almost all of them failed requirement #7.
Most of those libraries are ancient:
Arachnid (2002), WebLech (2004), WebEater (2003), JSpider (2003), WebSPHINX (2002)

Some libraries were closed source (requirement #6): Infant, Heritrix (used by the Internet Archive, which stores snapshots of the web over time!)
Some were bloated: Web-Harvest (and its last release was in 2010)


Which left me with just two options:
Crawler4j (2013)
Niocchi (2011)

Both of these seem like good choices.


Summary

In my search for a library which will crawl a full site, I found that crawling libraries fall into two main groups.
The first group consists of libraries intended to be part of a full-fledged search engine, which needs to store all of the crawled data and index it for fast retrieval. The better ones support several servers holding the content as a grid. In this group you will find big names like Solr / Lucene / Nutch / Bixo / Elasticsearch.

The second group focuses on the atomic task of crawling a site, and provides a small and efficient library for doing so.
After eliminating most of the libraries based on my requirements, I narrowed them down to the following two libraries:
Crawler4j
Niocchi

I picked Crawler4j.




As a side note: if I were pressed to use a closed source library and had a big project, I would definitely check Heritrix.



