Web Crawler / Spider Features
Let's build a web spider / crawler
Features
- Multithreaded (else network will slow it to a crawl)
- Work with proxies
- Manage robots.txt (can be disabled, or the user agent changed)
- Resumable (in case of a crash it should continue from where it stopped)
- Politeness (wait at least X milliseconds before connecting to the same domain again)
- Binary crawling (be able to parse PDFs etc)
- Configurable stop conditions like: MaxDepth, MaxNumOfPages, MaxSize?
- Be able to crawl https
- Evade spider traps
- Heed redirects (status code 3xx)
- Normalize URLs (so it won't crawl the same page twice)
- Configurable shouldVisit(Page page, URL url), which lets the user focus the crawler on specific pages (e.g. a regexp on the suffix), a specific domain, etc. (see the API sketch after this list)
- SIMPLE API (all advanced features can be disabled)
- Use third-party libraries for: URL normalization, robots.txt management, PageFetcher, the multithreading framework, the frontier DB, and the page parser (Tika for binary)
- Enable the user to grab whatever they want via visit(Page page)
- (Optional) Pluggable Jsoup for the output?
- (Optional) Pluggable big data for the output?
- (Optional) Pluggable internal DB for frontier ?
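A minimal sketch of what the SIMPLE API could look like, built around the shouldVisit/visit hooks above. The class names (WebCrawler, CrawlConfig, Page) and the config fields are assumptions for illustration, not a finished design:

```java
// Sketch only: WebCrawler, CrawlConfig and Page are assumed names for illustration.
import java.net.URL;
import java.util.regex.Pattern;

// What the crawler hands to the user's callbacks.
class Page {
    private final URL url;
    private final String content;
    Page(URL url, String content) { this.url = url; this.content = content; }
    public URL getUrl() { return url; }
    public String getContent() { return content; }
}

// Tunables covering the features above (politeness, robots.txt, stop conditions).
class CrawlConfig {
    int maxDepth = 5;                    // stop condition: MaxDepth
    int maxPages = 10_000;               // stop condition: MaxNumOfPages
    long politenessDelayMs = 200;        // wait before hitting the same domain again
    boolean obeyRobotsTxt = true;        // robots.txt handling can be disabled
    boolean fetchBinaryContent = false;  // PDFs etc. would go through a parser such as Tika
    String userAgent = "MySpider/0.1";
}

// The two user hooks: filter URLs, then consume fetched pages.
abstract class WebCrawler {
    public boolean shouldVisit(Page parentPage, URL url) { return true; }
    public abstract void visit(Page page);
}

// Example: focus the crawl on one domain and skip binary-looking suffixes.
class ExampleCrawler extends WebCrawler {
    private static final Pattern SKIP = Pattern.compile(".*\\.(jpg|png|mp3|zip|exe)$");

    @Override
    public boolean shouldVisit(Page parentPage, URL url) {
        return url.getHost().endsWith("example.com") && !SKIP.matcher(url.getPath()).matches();
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getUrl());
    }
}
```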
The Crawler's Flow
- Initial configuration (properties file as the spider's only argument)
- Initial Seed
- The URL passes through the first filter set (filters1) to decide whether to proceed (design pattern?); the filters are:
- User's configured regexp shouldVisit()
- Depth
- MaxPages
- Non-crawlable suffixes like jpg, mp3, etc.
- The internal archive/frontier DB is checked (if the URL was already crawled, don't crawl it again); the lookup key is the hash of the normalized URL
- robots.txt is checked (unless disabled)
- PageFetcher (CrawlerCommon)
- Page passes through filters2
- Status code (4xx? redirect?)
- Size (if size was limited, should it be limited?)
- Binary content (if user configured not to fetch binary content)
- Page is parsed
- URLs are extracted and added to the frontier
- The parsed page is sent to visit(Page page) for the user's enjoyment (see the loop sketch after this list)
- Pluggable Jsoup? Pluggable big data?
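The flow above could be sketched roughly as the single-threaded loop below. The Frontier, RobotsChecker, PageFetcher, Parser and FetchResult types are placeholders for the pluggable components (WebCrawler, CrawlConfig and Page reuse the API sketch from the Features section), not real library interfaces:

```java
// Sketch of the flow above, single-threaded for readability; a real crawler
// would run many worker threads over the same frontier. All interfaces here
// are placeholders for the pluggable third-party components.
import java.net.URL;
import java.util.List;

interface Frontier {
    boolean hasNext();
    long crawledCount();
    FrontierEntry next();
    void scheduleIfNew(URL url, int depth);   // normalizes, hashes and dedupes internally
}
record FrontierEntry(URL url, int depth, Page parentPage) {}

interface RobotsChecker { boolean isAllowed(URL url, String userAgent); }
interface PageFetcher   { FetchResult fetch(URL url) throws Exception; }  // follows 3xx redirects
record FetchResult(URL url, int statusCode, boolean binary, byte[] body) {}
interface Parser {
    Page parse(FetchResult result);           // e.g. Tika for binary content
    List<URL> extractUrls(Page page);
}

class CrawlLoop {
    void crawl(Frontier frontier, RobotsChecker robots, PageFetcher fetcher,
               Parser parser, WebCrawler userCrawler, CrawlConfig config) throws Exception {

        while (frontier.hasNext() && frontier.crawledCount() < config.maxPages) {
            FrontierEntry entry = frontier.next();
            URL url = entry.url();

            // filters1: depth, the user's shouldVisit() (suffix filtering can live there too)
            if (entry.depth() > config.maxDepth) continue;
            if (!userCrawler.shouldVisit(entry.parentPage(), url)) continue;

            // robots.txt check (unless disabled)
            if (config.obeyRobotsTxt && !robots.isAllowed(url, config.userAgent)) continue;

            FetchResult result = fetcher.fetch(url);

            // filters2: status code, binary content (a size check could go here as well)
            if (result.statusCode() >= 400) continue;
            if (result.binary() && !config.fetchBinaryContent) continue;

            // parse, push out-links into the frontier, hand the page to the user
            Page page = parser.parse(result);
            for (URL outLink : parser.extractUrls(page)) {
                frontier.scheduleIfNew(outLink, entry.depth() + 1);
            }
            userCrawler.visit(page);

            Thread.sleep(config.politenessDelayMs);   // crude politeness delay
        }
    }
}
```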
Internal DB
- Should be a queue where all new URLs are stored and then popped as the crawler consumes them
- Should be an archive of crawled URLs, so the same URLs won't be crawled twice
- The URLs in those DBs should be normalized
- The URLs in those DBs should be hashed for quick lookup (see the sketch after this list)
- The internal DB technology should be a pluggable implementation so it could be changed easily
- The DB schema should be something as follows:
- Hash | normalized URL | Timestamp | ParentID | Depth | Rank | Counter
- Where Rank can be calculated for an advanced spider
- Counter is a sort of quick rank: how many times a URL was encountered
- Well-known implementations for the internal DB would be:
- BerkeleyDB
- LevelDB
- ...
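As a sketch of the record and lookup key described above: the normalization rules below are a deliberately simplified assumption (lower-case scheme/host, drop fragments and default ports), standing in for the third-party URL normalization library listed in the features:

```java
// Sketch: a frontier/archive record matching the schema above, plus a
// normalize-then-hash key. The normalization is a simplified assumption;
// a dedicated URL normalization library would replace it.
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.Instant;
import java.util.HexFormat;

record FrontierRecord(
        String hash,           // key: hash of the normalized URL
        String normalizedUrl,
        Instant timestamp,
        String parentId,       // hash of the parent URL, null for seeds
        int depth,
        double rank,           // can be computed by an advanced spider
        int counter            // how many times this URL was encountered
) {}

class UrlKeys {

    // Rough normalization: lower-case scheme and host, drop the fragment
    // and default ports, and use "/" for an empty path.
    static String normalize(String url) throws Exception {
        URI u = new URI(url).normalize();
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        if (("http".equals(scheme) && port == 80) || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
    }

    // SHA-256 of the normalized URL, hex-encoded, used as the DB key.
    static String hash(String normalizedUrl) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(normalizedUrl.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }
}
```

With that key, both the queue and the archive can look a URL up directly by Hash, which is what keeps the "don't crawl the same URL twice" check cheap.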
Comments