Showing posts from June, 2015

Web Crawler / Spider Features

Let's build a web spider / crawler.

Features:

- Multithreaded (otherwise network latency will slow it to a crawl)
- Works with proxies
- Manages robots.txt (can be disabled, or the user agent changed)
- Resumable (after a crash it should continue where it stopped)
- Politeness (wait at least X milliseconds before connecting to the same domain again)
- Binary crawling (able to parse PDFs etc.)
- Configurable stop conditions such as MaxDepth, MaxNumOfPages, MaxSize?
- Able to crawl https
- Evades spider traps
- Heeds redirects (status code 3xx)
- Normalizes URLs (so it won't crawl the same page twice)
- Configurable shouldVisit (Page page, URL url), which lets the user focus the crawler on specific pages (regexp on the suffix), a specific domain, etc.
- SIMPLE API (disable all advanced features)
- Uses third-party libraries for: URL normalization, robots management, PageFetcher, multithreading framework, Frontier DB, page parser (Tika for binary)
- Lets the user grab and do whatever he wants with the visit (Page page)  (Op
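
To make the "normalize URLs so it won't crawl the same page twice" item concrete, here is a minimal sketch using only java.net.URI. The class names UrlNormalizer and SeenUrls are hypothetical, and the rules chosen (lowercase scheme and host, drop default ports and fragments, resolve dot segments) are one reasonable assumption rather than a complete normalizer. User info is ignored, and a third-party library would likely cover more cases.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not taken from any crawler library.
class UrlNormalizer {

    // Returns a canonical form of the URL, or null if it cannot be handled.
    static String normalize(String rawUrl) {
        try {
            // URI.normalize() resolves "." and ".." path segments.
            URI uri = new URI(rawUrl.trim()).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) {
                return null; // relative or opaque URLs are out of scope here
            }
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            int port = uri.getPort();
            // Drop the port when it is the default for the scheme.
            if ((scheme.equals("http") && port == 80)
                    || (scheme.equals("https") && port == 443)) {
                port = -1;
            }
            String path = uri.getRawPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }
            StringBuilder sb = new StringBuilder();
            sb.append(scheme).append("://").append(host);
            if (port != -1) {
                sb.append(':').append(port);
            }
            sb.append(path);
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            // The fragment is dropped on purpose: it never reaches the server.
            return sb.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}

// Hypothetical "already seen" set keyed by the normalized URL.
class SeenUrls {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a given normalized URL is offered.
    boolean add(String rawUrl) {
        String key = UrlNormalizer.normalize(rawUrl);
        return key != null && seen.add(key);
    }
}
```

With this in place, "HTTP://Example.com:80/a/../b#frag" and "http://example.com/b" map to the same key, so the second one would be skipped.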
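
The politeness item (wait at least X milliseconds before connecting to the same domain) could be sketched as a small per-host gate. PolitenessGate and reserve are hypothetical names, and passing the current time in as a parameter is an assumption made so the logic is easy to test. The method is synchronized because the crawler is meant to be multithreaded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host rate limiter; not part of any existing library.
class PolitenessGate {
    private final long minDelayMillis;
    // For each host, the earliest time (ms) at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns how many milliseconds the caller should wait before contacting
    // the host, and reserves that slot so other threads queue up behind it.
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }
}
```

A worker thread would then do something like Thread.sleep(gate.reserve(host, System.currentTimeMillis())) before fetching a page.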
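
Here is one way the configurable shouldVisit hook might look when combined with a MaxDepth stop condition. Everything in it is an assumption for illustration: the Page class only models a depth, FocusedFilter is a hypothetical name, the suffix list is arbitrary, and java.net.URI is used instead of URL to keep the sketch free of checked exceptions.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical stand-in for the crawler's page type; only the depth is modeled.
class Page {
    final int depth;

    Page(int depth) {
        this.depth = depth;
    }
}

// Hypothetical filter: stay on one host, skip some suffixes, cap the depth.
class FocusedFilter {
    // Assumed list of suffixes the user does not want to crawl.
    private static final Pattern SKIPPED_SUFFIXES =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip)$", Pattern.CASE_INSENSITIVE);

    private final String allowedHost;
    private final int maxDepth;

    FocusedFilter(String allowedHost, int maxDepth) {
        this.allowedHost = allowedHost;
        this.maxDepth = maxDepth;
    }

    // referringPage is the page on which the link was found.
    boolean shouldVisit(Page referringPage, URI url) {
        if (referringPage.depth + 1 > maxDepth) {
            return false; // MaxDepth stop condition
        }
        String host = url.getHost();
        if (host == null || !host.equalsIgnoreCase(allowedHost)) {
            return false; // outside the domain the user wants to focus on
        }
        String path = url.getPath() == null ? "" : url.getPath();
        return !SKIPPED_SUFFIXES.matcher(path).matches();
    }
}
```

The same hook could just as well take a user-supplied regexp or a list of domains; hard-coding them here only keeps the sketch short.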