Web Crawler / Spider Features
Let's build a web spider / crawler
Features
- Multithreaded (else network will slow it to a crawl)
- Work with proxies
- Manage robots.txt (can be disabled, or the user agent changed)
- Resumable (in case of a crash it should continue from where it stopped)
- Politeness (wait at least X milliseconds before connecting to the same domain again)
- Binary crawling (be able to parse PDFs etc)
- Configurable stop conditions like: MaxDepth, MaxNumOfPages, MaxSize?
- Be able to crawl https
- Evade spider traps
- Heed redirects (status code 3xx)
- Normalize URLs (so it won't crawl the same page twice)
- Configurable shouldVisit(Page page, URL url), which lets the user focus the crawler on specific pages (e.g. a regexp on the suffix), a specific domain, etc. (see the API sketch after this list)
- SIMPLE API (all advanced features can be disabled)
- Use third-party libraries for: URL normalization, robots.txt management, PageFetcher, the multithreading framework, the frontier DB, and the page parser (Tika for binary)
- Enable the user to grab whatever they want via visit(Page page)
- (Optional) Pluggable Jsoup for the output?
- (Optional) Pluggable big data for the output?
- (Optional) Pluggable internal DB for frontier ?
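A minimal sketch of what the SIMPLE API could look like, built around the shouldVisit/visit hooks above. The class names (WebCrawler, CrawlConfig, Page) and the config fields are assumptions for illustration, not a finished design:

```java
// Sketch only: WebCrawler, CrawlConfig and Page are assumed names for illustration.
import java.net.URL;
import java.util.regex.Pattern;

// What the crawler hands to the user's callbacks.
class Page {
    private final URL url;
    private final String content;
    Page(URL url, String content) { this.url = url; this.content = content; }
    public URL getUrl() { return url; }
    public String getContent() { return content; }
}

// Tunables covering the features above (politeness, robots.txt, stop conditions).
class CrawlConfig {
    int maxDepth = 5;                    // stop condition: MaxDepth
    int maxPages = 10_000;               // stop condition: MaxNumOfPages
    long politenessDelayMs = 200;        // wait before hitting the same domain again
    boolean obeyRobotsTxt = true;        // robots.txt handling can be disabled
    boolean fetchBinaryContent = false;  // PDFs etc. would go through a parser such as Tika
    String userAgent = "MySpider/0.1";
}

// The two user hooks: filter URLs, then consume fetched pages.
abstract class WebCrawler {
    public boolean shouldVisit(Page parentPage, URL url) { return true; }
    public abstract void visit(Page page);
}

// Example: focus the crawl on one domain and skip binary-looking suffixes.
class ExampleCrawler extends WebCrawler {
    private static final Pattern SKIP = Pattern.compile(".*\\.(jpg|png|mp3|zip|exe)$");

    @Override
    public boolean shouldVisit(Page parentPage, URL url) {
        return url.getHost().endsWith("example.com") && !SKIP.matcher(url.getPath()).matches();
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getUrl());
    }
}
```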
The Crawler's Flow
- Initial configuration (properties file as the spider's only argument)
- Initial Seed
- The URL passes through the first filter set (filters1) to decide whether to proceed (design pattern?); the filters are:
- User's configured regexp shouldVisit()
- Depth
- MaxPages
- Non-crawlable suffixes like jpg, mp3, etc.
- The internal archive/frontier DB is checked (if the URL was already crawled, don't crawl it again); the lookup key is the hash of the normalized URL
- robots.txt is checked (unless disabled)
- PageFetcher (CrawlerCommon)
- Page passes through filters2
- Status code (4xx? redirect?)
- Size (if size was limited, should it be limited?)
- Binary content (if user configured not to fetch binary content)
- Page is parsed
- URLs are extracted and added to the frontier
- The parsed page is sent to visit(Page page) for the user's enjoyment (see the loop sketch after this list)
- Pluggable Jsoup? Pluggable big data?
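The flow above could be sketched roughly as the single-threaded loop below. The Frontier, RobotsChecker, PageFetcher, Parser and FetchResult types are placeholders for the pluggable components (WebCrawler, CrawlConfig and Page reuse the API sketch from the Features section), not real library interfaces:

```java
// Sketch of the flow above, single-threaded for readability; a real crawler
// would run many worker threads over the same frontier. All interfaces here
// are placeholders for the pluggable third-party components.
import java.net.URL;
import java.util.List;

interface Frontier {
    boolean hasNext();
    long crawledCount();
    FrontierEntry next();
    void scheduleIfNew(URL url, int depth);   // normalizes, hashes and dedupes internally
}
record FrontierEntry(URL url, int depth, Page parentPage) {}

interface RobotsChecker { boolean isAllowed(URL url, String userAgent); }
interface PageFetcher   { FetchResult fetch(URL url) throws Exception; }  // follows 3xx redirects
record FetchResult(URL url, int statusCode, boolean binary, byte[] body) {}
interface Parser {
    Page parse(FetchResult result);           // e.g. Tika for binary content
    List<URL> extractUrls(Page page);
}

class CrawlLoop {
    void crawl(Frontier frontier, RobotsChecker robots, PageFetcher fetcher,
               Parser parser, WebCrawler userCrawler, CrawlConfig config) throws Exception {

        while (frontier.hasNext() && frontier.crawledCount() < config.maxPages) {
            FrontierEntry entry = frontier.next();
            URL url = entry.url();

            // filters1: depth, the user's shouldVisit() (suffix filtering can live there too)
            if (entry.depth() > config.maxDepth) continue;
            if (!userCrawler.shouldVisit(entry.parentPage(), url)) continue;

            // robots.txt check (unless disabled)
            if (config.obeyRobotsTxt && !robots.isAllowed(url, config.userAgent)) continue;

            FetchResult result = fetcher.fetch(url);

            // filters2: status code, binary content (a size check could go here as well)
            if (result.statusCode() >= 400) continue;
            if (result.binary() && !config.fetchBinaryContent) continue;

            // parse, push out-links into the frontier, hand the page to the user
            Page page = parser.parse(result);
            for (URL outLink : parser.extractUrls(page)) {
                frontier.scheduleIfNew(outLink, entry.depth() + 1);
            }
            userCrawler.visit(page);

            Thread.sleep(config.politenessDelayMs);   // crude politeness delay
        }
    }
}
```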
Internal DB
- Should be a queue where all new URLs are stored and then popped as the crawler consumes them
- Should be an archive of crawled URLs, so the same URLs won't be crawled twice
- The URLs in those DBs should be normalized
- The URLs in those DBs should be hashed for quick lookup (see the sketch after this list)
- The internal DB technology should be a pluggable implementation so it could be changed easily
- The DB schema should be something as follows:
- Hash | normalized URL | Timestamp | ParentID | Depth | Rank | Counter
- Where Rank can be calculated for an advanced spider
- Counter is a sort of quick rank: how many times a URL was encountered
- Well-known implementations for the internal DB would be:
- BerkeleyDB
- LevelDB
- ...
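As a sketch of the record and lookup key described above: the normalization rules below are a deliberately simplified assumption (lower-case scheme/host, drop fragments and default ports), standing in for the third-party URL normalization library listed in the features:

```java
// Sketch: a frontier/archive record matching the schema above, plus a
// normalize-then-hash key. The normalization is a simplified assumption;
// a dedicated URL normalization library would replace it.
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.Instant;
import java.util.HexFormat;

record FrontierRecord(
        String hash,           // key: hash of the normalized URL
        String normalizedUrl,
        Instant timestamp,
        String parentId,       // hash of the parent URL, null for seeds
        int depth,
        double rank,           // can be computed by an advanced spider
        int counter            // how many times this URL was encountered
) {}

class UrlKeys {

    // Rough normalization: lower-case scheme and host, drop the fragment
    // and default ports, and use "/" for an empty path.
    static String normalize(String url) throws Exception {
        URI u = new URI(url).normalize();
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        if (("http".equals(scheme) && port == 80) || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
    }

    // SHA-256 of the normalized URL, hex-encoded, used as the DB key.
    static String hash(String normalizedUrl) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalizedUrl.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }
}
```

With that key, both the queue and the archive can look a URL up directly by Hash, which is what keeps the "don't crawl the same URL twice" check cheap.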
Comments