
Showing posts from June, 2015

Web Crawler / Spider Features

Let's build a web spider / crawler

Features

- Multithreaded (otherwise network latency will slow it to a crawl)
- Work with proxies
- Manage robots.txt (disable it or change the user agent)
- Resumable (after a crash it should continue where it stopped)
- Politeness (wait at least X milliseconds before connecting to the same domain again)
- Binary crawling (be able to parse PDFs etc.)
- Configurable stop conditions such as MaxDepth, MaxNumOfPages, MaxSize?
- Be able to crawl https
- Evade spider traps
- Heed redirects (status code 3xx)
- Normalize URLs (so it won't crawl the same page twice)
- Configurable shouldVisit(Page page, URL url), which lets the user focus the crawler on specific pages (regexp on the suffix), a specific domain, etc.
- SIMPLE API (disable all advanced features)
- Use third-party libraries for: URL normalization, robots.txt management, PageFetcher, multithreading framework, Frontier DB, page parser (Tika for binary)
- Enable the user to grab the page and do whatever they want with it in visit(Page page)
- (Optional) pluggable Jsoup for the output?
- (Option…
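To make a few of these features concrete, here is a minimal sketch in Java of URL normalization, a shouldVisit-style filter, duplicate detection, and the politeness delay. It uses only the JDK. The class name CrawlerSketch, its method names, and the skipped-suffix list are assumptions made up for illustration; they are not the API of any existing crawler library, and shouldVisit here takes a plain String rather than the Page and URL types mentioned in the list. The sketch is single-threaded, so a multithreaded crawler would need concurrent collections or locking around the two maps.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any real crawler library.
class CrawlerSketch {
    // Assumed filter: skip common static-asset suffixes.
    private static final Pattern SKIPPED_SUFFIXES =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip)$");

    private final String allowedHost;
    private final long politenessDelayMs;
    private final Set<String> seen = new HashSet<>();            // not thread-safe
    private final Map<String, Long> lastFetchByHost = new HashMap<>();

    CrawlerSketch(String allowedHost, long politenessDelayMs) {
        this.allowedHost = allowedHost.toLowerCase(Locale.ROOT);
        this.politenessDelayMs = politenessDelayMs;
    }

    // Lowercases scheme and host, drops default ports and fragments,
    // and resolves "." and ".." path segments.
    static String normalize(String rawUrl) {
        URI uri = URI.create(rawUrl.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("absolute URL required: " + rawUrl);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = (port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"));
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    // Focuses the crawl on one host and skips unwanted suffixes.
    boolean shouldVisit(String normalizedUrl) {
        URI uri = URI.create(normalizedUrl);
        String path = uri.getPath().toLowerCase(Locale.ROOT);
        return allowedHost.equals(uri.getHost())
                && !SKIPPED_SUFFIXES.matcher(path).matches();
    }

    // Returns true only the first time a normalized URL is offered.
    boolean markSeen(String normalizedUrl) {
        return seen.add(normalizedUrl);
    }

    // Milliseconds still to wait before contacting the host again.
    long millisToWait(String host, long nowMs) {
        Long last = lastFetchByHost.get(host);
        if (last == null) {
            return 0;
        }
        return Math.max(0, last + politenessDelayMs - nowMs);
    }

    void recordFetch(String host, long nowMs) {
        lastFetchByHost.put(host, nowMs);
    }
}
```

A crawl loop would call normalize on every extracted link, drop it unless shouldVisit and markSeen both return true, then sleep for millisToWait before fetching and call recordFetch afterwards. Keeping the clock as a parameter (nowMs) rather than reading System.currentTimeMillis() inside the class makes the politeness logic easy to check without real delays.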