Thursday, June 11, 2015

Web Crawler / Spider Features

Let's build a web spider / crawler



Features

  • Multithreaded (else network will slow it to a crawl)
  • Work with proxies
  • Manage robots.txt (option to disable it or change the user agent)
  • Resumable (in case of a crash it should continue where it stopped)
  • Politeness (wait at least X milliseconds before connecting to the same domain again)
  • Binary crawling (be able to parse PDFs, etc.)
  • Configurable stop conditions like: MaxDepth, MaxNumOfPages, MaxSize?
  • Be able to crawl https
  • Evade spider traps
  • Heed redirects (status code 3xx)
  • Normalize URLs (so it won't crawl the same page twice)
  • Configurable shouldVisit(Page page, URL url) which will let the user focus the crawler on specific pages (a regexp on the suffix, a specific domain, etc.)
  • SIMPLE API (all advanced features disabled by default)
  • Use third-party libraries for: URL normalization, robots.txt management, PageFetcher, multithreading framework, frontier DB, page parser (Tika for binary)
  • Enable the user to grab whatever they want from the page with visit(Page page)
  • (Optional) pluggable Jsoup for the output?
  • (Optional) Pluggable big data for the output ?
  • (Optional) Pluggable internal DB for frontier ?
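The "simple API" feature could look something like the sketch below. Class names (WebCrawler, Page, MyCrawler) are placeholder assumptions, not from any specific library; only the shouldVisit(Page page, URL url) and visit(Page page) signatures come from the list above:

```java
import java.net.URL;
import java.util.regex.Pattern;

// Hypothetical sketch of the simple API: the user extends a base class and
// overrides only shouldVisit() and visit(); all advanced knobs keep defaults.
abstract class WebCrawler {
    // Decide whether the crawler should fetch this URL (focus filter).
    public boolean shouldVisit(Page page, URL url) { return true; }
    // Called with every successfully parsed page.
    public abstract void visit(Page page);
}

class Page {
    final URL url;
    final String html;
    Page(URL url, String html) { this.url = url; this.html = html; }
}

// Small helper so examples can build URLs without checked exceptions.
class Urls {
    static URL of(String s) {
        try { return new URL(s); }
        catch (Exception e) { throw new IllegalArgumentException(e); }
    }
}

// Example user crawler: stay on one domain, skip non-crawlable suffixes.
class MyCrawler extends WebCrawler {
    private static final Pattern BINARY =
        Pattern.compile(".*\\.(jpg|png|mp3|zip)$", Pattern.CASE_INSENSITIVE);

    @Override
    public boolean shouldVisit(Page page, URL url) {
        return url.getHost().endsWith("example.com")
            && !BINARY.matcher(url.getPath()).matches();
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.url);
    }
}
```

The user writes two methods and everything else (threads, politeness, frontier) stays hidden behind defaults.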


The Crawler's Flow

  • Initial configuration (properties file as the spider's only argument)
  • Initial Seed
  • URL passes through filters1 to see if we should proceed (design pattern? chain of responsibility would fit); the filters will be:
    • User's configured regexp shouldVisit()
    • Depth
    • MaxPages
    • Non-crawlable suffixes like: jpg, mp3, etc.
    • Check the internal archived frontier DB (if the URL was already crawled, don't crawl it again); this is done using the normalized URL's hash as the key
  • robots.txt is checked (if not disabled)
  • PageFetcher (CrawlerCommon)
  • Page passes through filters2
    • Status code (4xx? redirect?)
    • Size (if size was limited, should it be limited?)
    • Binary content (if user configured not to fetch binary content)
  • Page is parsed
  • URLs extracted and added to the frontier
  • Parsed page sent to visit(Page page) for the user's enjoyment
  • Pluggable Jsoup? Pluggable big data?
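The normalize-and-hash dedup step in the flow above could be sketched like this. The normalization rules shown (lowercase scheme and host, drop default port and fragment, collapse "." and ".." segments) are illustrative assumptions; per the features list, a real crawler would delegate to a URL-normalization library:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Sketch of "normalize the URL, then use its hash as the dedup key".
class UrlDedup {
    private final Set<String> seen = new HashSet<>();

    static String normalize(String url) {
        URI u = URI.create(url).normalize();   // collapses "." and ".."
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        boolean defaultPort = (port == -1)
            || (scheme.equals("http") && port == 80)
            || (scheme.equals("https") && port == 443);
        String path = (u.getPath() == null || u.getPath().isEmpty())
            ? "/" : u.getPath();
        return scheme + "://" + host
            + (defaultPort ? "" : ":" + port)
            + path
            + (u.getQuery() == null ? "" : "?" + u.getQuery()); // fragment dropped
    }

    static String hash(String normalized) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Returns true only the first time a URL (after normalization) is seen.
    boolean firstSeen(String url) {
        return seen.add(hash(normalize(url)));
    }
}
```

Hashing the normalized form means two spellings of the same page collapse to one key, which is exactly what the archive lookup in filters1 needs.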


Internal DB

  • Should be a queue where all new URLs are stored and then popped as the crawler consumes them
  • Should be an archive of crawled URLs, so the same URL won't be crawled twice
  • The URLs in those DBs should be normalized
  • The URLs in those DBs should be hashed for fast lookup
  • The internal DB technology should be a pluggable implementation so it could be changed easily
  • The DB schema should be something as follows:
    • Hash | normalized URL | Timestamp | ParentID | Depth | Rank | Counter
    • Where Rank can be calculated for an advanced spider
    • Counter is a sort of quick rank showing how many times a URL was referenced
  • Well-known implementations for the internal DB would be:
    • BerkeleyDB
    • LevelDB
    • ...
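An in-memory stand-in for this pluggable frontier DB might look as follows. Field names follow the proposed schema (hash, normalized URL, timestamp, ParentID, depth, rank, counter); the class and method names are assumptions, and a real implementation would put BerkeleyDB or LevelDB behind the same interface:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// In-memory sketch of the frontier: a queue of URLs to crawl plus an
// archive of everything ever scheduled, keyed by the normalized URL's hash.
class Frontier {
    static class Entry {
        final String hash, normalizedUrl;
        final long timestamp;
        final String parentHash;   // ParentID
        final int depth;
        double rank;               // computed later by an advanced spider
        int counter;               // how many times this URL was referenced
        Entry(String hash, String url, String parentHash, int depth) {
            this.hash = hash; this.normalizedUrl = url;
            this.parentHash = parentHash; this.depth = depth;
            this.timestamp = System.currentTimeMillis();
            this.counter = 1;
        }
    }

    private final Queue<String> queue = new ArrayDeque<>();     // hashes to crawl
    private final Map<String, Entry> archive = new HashMap<>(); // hash -> entry

    // Schedule a URL; a duplicate only bumps its counter, it is never re-queued.
    void schedule(String hash, String normalizedUrl, String parentHash, int depth) {
        Entry existing = archive.get(hash);
        if (existing != null) { existing.counter++; return; }
        archive.put(hash, new Entry(hash, normalizedUrl, parentHash, depth));
        queue.add(hash);
    }

    // Pop the next URL to crawl, or null when the frontier is empty.
    Entry next() {
        String hash = queue.poll();
        return hash == null ? null : archive.get(hash);
    }
}
```

Keeping the archive separate from the queue gives both requirements at once: never crawl the same URL twice, and still have a cheap per-URL counter for the quick-rank idea.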
