Crawling a site with Java

July 23, 2014

I wanted to crawl a site using Java.

I didn't want to reinvent the wheel, so I looked for a good Java library that crawls sites.

Off the top of my head I knew of several libraries, so I decided to check them first.

I began with Solr & Lucene and continued with Nutch.

After doing some reading I understood that these libraries are beyond the scope of what I need: they would do the job, but they are absolute overkill.

Why?

Well, I guess I need to begin with my actual requirements.

Before the requirements, I suggest reading the following post, just to understand the basics (I won't use that actual code, but it does help to understand the concept and to sharpen the requirements):
How to build a crawler in Java

Requirements

1. Crawls a full site
2. A minimum amount of built-in logic, e.g. it should not visit the same URL more than once
3. Multithreading
4. A good, simple API that is easy to work with: define the number of threads and the site root, filter which pages to pull according to their URL, avoid following external links, and have an easy way to parse every page and do something with it
5. Unbloated code, so no GUI on top of it etc. - just a minimal library
6. Open source
7. As always, I prefer a current library (from the last 3 years), and one which is regularly updated
8. Not a search engine library, so no "client / server" code, no need to save the whole website to a NoSQL DB, no need for scaling, multiple servers etc. - just a simple library
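
To make the requirements more concrete, here is a rough sketch of what such a minimal API could look like, using only the JDK. It is not the API of Crawler4j, Niocchi or any other real library: the `SiteCrawler` class, its constructor parameters and the `fetchAndParse` callback are all made-up names for illustration. The callback is assumed to download one page, do something with it, and return the absolute links found on it.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch; NOT the API of Crawler4j, Niocchi or any other real library.
final class SiteCrawler {
    private final URI root;                               // site root to start from
    private final int threads;                            // number of worker threads
    private final Predicate<URI> urlFilter;               // which URLs to pull
    private final Function<URI, List<URI>> fetchAndParse; // handles one page, returns its links
    private final Set<URI> visited = ConcurrentHashMap.newKeySet();

    SiteCrawler(URI root, int threads, Predicate<URI> urlFilter,
                Function<URI, List<URI>> fetchAndParse) {
        this.root = root;
        this.threads = threads;
        this.urlFilter = urlFilter;
        this.fetchAndParse = fetchAndParse;
    }

    /** Crawls the site and returns every URL that was handed to fetchAndParse. */
    Set<URI> crawl() {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Phaser pending = new Phaser(1); // the first party is the calling thread
        try {
            submit(root, pool, pending);
            pending.arriveAndAwaitAdvance(); // wait until all page tasks are done
        } finally {
            pool.shutdown();
        }
        return Set.copyOf(visited);
    }

    private void submit(URI url, ExecutorService pool, Phaser pending) {
        // External links are skipped by comparing hosts with the site root.
        boolean sameSite = root.getHost().equalsIgnoreCase(url.getHost());
        // visited.add returns false for a URL that was already scheduled.
        if (!sameSite || !urlFilter.test(url) || !visited.add(url)) {
            return;
        }
        pending.register(); // one party per scheduled page
        pool.execute(() -> {
            try {
                for (URI link : fetchAndParse.apply(url)) {
                    submit(link, pool, pending);
                }
            } finally {
                pending.arriveAndDeregister(); // this page is finished
            }
        });
    }
}
```

The concurrent set covers the "not the same URL more than once" requirement, and the `Phaser` is just one way to know when the last page task has finished. A real library would also deal with things this sketch ignores, such as URL normalization, robots.txt and politeness delays.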

Requirement #8 (not a search engine library) removed the libraries I already knew of from the list (Solr / Lucene / Nutch / Bixo / Elasticsearch).

I googled for a list of crawling libraries and found this good list:

Good list from Java-source

I checked all of those libraries, but almost all of them failed requirement #7 (a current, regularly updated library).

Most of those libraries are ancient.

Arachnid (2002), WebLech (2004), WebEater (2003), JSpider (2003), WebSPHINX (2002)

Some libraries were closed source (requirement #6, open source): Infant, Heritrix (used by the Internet Archive, which stores snapshots of the web over time)

Some were bloated: Web-Harvest (and its last release was in 2010)

Which left me with just two options:

Crawler4j (2013)

Niocchi (2011)

Both of these seem like good choices.

Summary

In my search for a library that crawls a full site, I found that crawling libraries fall into two main groups:
The first group consists of libraries intended to be part of a full-fledged search engine, which needs to store all of the crawled data and index it for fast retrieval. The better ones support spreading the content across several servers as a grid. In this group you will find big names like Solr / Lucene / Nutch / Bixo / Elasticsearch.

The second group focuses on the atomic task of crawling a site, and provides a small and efficient library for doing so.

After eliminating most of the libraries based on my requirements, I narrowed them down to the following two libraries:
Crawler4j

Niocchi

I picked Crawler4j.

As a side note: if I were pressed to use a closed-source library and had a big project, I would definitely check Heritrix.
