A web crawler crawls the entire web.

Web crawler vs web scraper: a crawler basically downloads entire webpages, while a web scraper focuses on specific parts of a webpage, e.g. prices.

Step 1: Requirements

crawl all the web

  • scalability - the design should scale to crawl the entire web and fetch all web documents
  • extensibility - should be built in a modular way so that new functionality can be added to it

Some considerations:

  • what to crawl? html pages, media (sound files, images, videos, etc). For extensibility, we need different modules to parse different types: htmlParser, soundParser, imageParser, videoParser
  • protocols ? HTTP, FTP, etc
  • politeness: web crawler should not exhaust hosts
  • priority: not all websites have the same priority to crawl
  • dedup: the crawler should not waste time crawling duplicate urls
  • how frequently to crawl?

Step 2: Data Estimation

Assume:

  • crawl html pages only
  • 15 billion different webpages to crawl in total
  • finish crawling them in 4 weeks
  • 100 KB per page, plus 500 bytes stored per page for page metadata

Estimation:

  • crawl TPS: 15B / (4 * 7 * 24 * 60 * 60) ≈ 6,200 pages / second
  • storage: 15B * (100 KB + 500 bytes) ≈ 1.5 petabytes
  • assume a 70% capacity model, so we need 1.5 / 0.7 ≈ 2.14 petabytes (checked in the sketch below)
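
A quick back-of-the-envelope check of the numbers above, as a Python sketch; the constants are just the assumptions listed in this section:

    # Back-of-the-envelope check of the crawl-rate and storage estimates.
    PAGES      = 15e9                  # 15 billion pages
    SECONDS    = 4 * 7 * 24 * 3600     # 4 weeks
    PAGE_BYTES = 100 * 1000            # 100 KB per page
    META_BYTES = 500                   # 500 bytes of metadata per page

    tps    = PAGES / SECONDS                              # ~6,200 pages / second
    raw_pb = PAGES * (PAGE_BYTES + META_BYTES) / 1e15     # ~1.5 PB
    provisioned_pb = raw_pb / 0.7                         # ~2.15 PB at 70% capacity

    print(f"{tps:,.0f} pages/s, {raw_pb:.2f} PB raw, {provisioned_pb:.2f} PB provisioned")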

Step 3: High level design - Most important

First, let's work through the steps of a web crawler (a minimal loop sketch follows the list):

  1. From the urls that have not been visited yet, pick one to crawl
    • How to pick? This involves priority and politeness
    • A dedicated service is responsible for picking the url -> Url Frontier
  2. Find the IP address corresponding to this url -> DNS
  3. Establish a connection to the host at that IP address to download the webpage document
    • a download service
    • Download to where? The page document still needs to be accessed afterwards, so it can be put in a cache for quick access
  4. Parse the document to find new URLs embedded in this document
    • different parsers: urlParser, videoParser, priceParser (web scraper), etc
  5. Add new URLs to the un-visited-url list -> how to dedupe?
  6. Process the downloaded document -> store it, index its contents
    • where to store?
  7. Start again from step 1.
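
A minimal sketch of this loop in Python; frontier, dns, fetcher, parser, dedup and store are hypothetical components standing in for the services described above, not real APIs:

    # Minimal crawl loop tying the seven steps together (error handling omitted).
    def crawl_loop(frontier, dns, fetcher, parser, dedup, store):
        while not frontier.empty():
            url = frontier.next_url()                       # step 1: pick by priority / politeness
            ip = dns.resolve(url)                           # step 2: DNS lookup
            document = fetcher.fetch(url, ip)               # step 3: download the page
            for new_url in parser.extract_urls(document):   # step 4: find embedded urls
                if not dedup.seen(new_url):                 # step 5: dedupe, then enqueue
                    dedup.add(new_url)
                    frontier.add(new_url)
            store.save(url, document)                       # step 6: store / index the content
                                                            # step 7: loop back to step 1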

BFS vs DFS

In most cases BFS is used. DFS is useful once a connection to an IP address has already been established: just crawl that website depth-first right away, which saves the handshake time of reconnecting later.

Components

Seed Urls

Crawling starts from a set of seed urls provided up front.

1. UrlFrontier - gets the next url to crawl

All the un-visited urls sit in a queue. How do we pick the next url to visit from it? This is where priority and politeness come in.

priority: priority router + queues + priority selector

  • A priority router puts urls into one of several queues according to their priority; each queue holds urls of a single priority, and different queues hold different priorities
    • How to decide the priority of a page? Quality of the page / change rate of the page
  • A priority selector pulls urls from these queues of different priorities; higher-priority queues are pulled more frequently (see the sketch below)
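
A sketch of the priority selector, assuming plain FIFO queues per priority level and a weighted random pull so that higher-priority queues are drained more often; the weights and the three priority levels are assumptions:

    import random
    from collections import deque

    # One FIFO queue per priority level; 1 is the highest priority.
    priority_queues = {1: deque(), 2: deque(), 3: deque()}
    pull_weights    = {1: 5, 2: 3, 3: 1}     # assumed pull frequencies

    def select_next_url():
        # Only consider non-empty queues, then bias the choice toward high priority.
        candidates = [p for p, q in priority_queues.items() if q]
        if not candidates:
            return None
        chosen = random.choices(candidates,
                                weights=[pull_weights[p] for p in candidates])[0]
        return priority_queues[chosen].popleft()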

Politeness ensures the crawler does not exhaust a host: at most one connection to a host at a time. mapping table + politeness router + queues + politeness selector

  • Also made up of multiple queues; each queue holds only urls belonging to a single ip address / host, and the host-to-queue mapping is kept in a mapping table
  • When a queue becomes empty, all urls for that host have been crawled; the next url it receives belongs to a new host, so update the mapping table
  • A heap can also be maintained: for each queue it stores the time of the last pull plus a waiting time, and the queue with the earliest such time is pulled next (see the sketch after the table below)
  host               queue id
  http://www.a.com   1
  http://www.b.com   2
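
A sketch of the politeness selector described above: a mapping table from host to queue id plus a min-heap keyed by the earliest time each host queue may be pulled again. The one-second wait is an assumed politeness delay, and for simplicity each host keeps its own queue here instead of reusing a fixed pool of queues:

    import heapq
    import time
    from collections import deque

    WAIT_SECONDS = 1.0      # assumed delay between two pulls from the same host

    host_to_queue = {}      # mapping table: host -> queue id
    host_queues   = {}      # queue id -> urls waiting for that host
    ready_heap    = []      # (earliest time this queue may be pulled again, queue id)

    def enqueue(host, url):
        qid = host_to_queue.get(host)
        if qid is None:                              # new host: assign it a queue
            qid = len(host_queues) + 1
            host_to_queue[host] = qid
            host_queues[qid] = deque()
        if not host_queues[qid]:                     # queue was empty: schedule it
            heapq.heappush(ready_heap, (time.time(), qid))
        host_queues[qid].append(url)

    def next_url():
        now = time.time()
        while ready_heap and ready_heap[0][0] <= now:
            _, qid = heapq.heappop(ready_heap)
            if host_queues[qid]:
                url = host_queues[qid].popleft()
                if host_queues[qid]:                 # more urls for this host: wait first
                    heapq.heappush(ready_heap, (now + WAIT_SECONDS, qid))
                return url
        return None                                  # no host is ready to be crawled yet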

So the urlFrontier provides:

  1. urls are crawled according to their priority
  2. at most one connection per host at a time
  3. a wait time between consecutive crawls of the same host

2. DNS resolver

After the UrlFrontier has picked the next url to crawl, it hands it to the DNS resolver; the resolver finds the ip address for the url and passes it to the html fetcher to download the html.

  • An average response time for a DNS request can be between 10ms - 200ms.
  • When one thread makes a request to DNS, the others are blocked until the first request is finished.

So a DNS cache (in memory or in a separate db) can help. The cache can be updated periodically using a simple cron job.
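
A sketch of such a DNS cache: an in-memory dict with a TTL in front of the standard resolver. The 30-minute TTL is an assumption; a real deployment might instead refresh entries with the cron job mentioned above:

    import socket
    import time

    DNS_TTL = 30 * 60              # assumed cache lifetime: 30 minutes
    _dns_cache = {}                # host -> (ip address, time resolved)

    def resolve(host):
        entry = _dns_cache.get(host)
        if entry and time.time() - entry[1] < DNS_TTL:
            return entry[0]                      # cache hit: skip the 10ms-200ms lookup
        ip = socket.gethostbyname(host)          # cache miss: real DNS lookup
        _dns_cache[host] = (ip, time.time())
        return ip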

3. html fetcher

The html fetcher fetches html pages from the resolved ip address.

  • Multiple workers can run as html fetchers to speed this up (see the sketch below)
  • Different fetchers can also fetch over different protocols
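
A sketch of running several fetcher workers in parallel with a thread pool; the worker count and timeout are assumptions, and error handling, retries and non-HTTP protocols are left out:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        # Download one page (a real fetcher would reuse connections per host).
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    def fetch_batch(urls, workers=16):           # worker count is an assumption
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fetch, urls))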

4. content caching

The content pulled down by the html fetcher is stored in a cache so that later services can read it quickly.

5. content seen

(There may be a parse step before content-seen.) Check whether this html page is a duplicate -> a checksum can be used. Many pages today are not exact copies but only partially the same, and there are techniques for detecting duplication in that case (near-duplicate detection). If the content has been seen (a duplicate), discard it; otherwise continue processing -> content storage / indexing.
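
A sketch of the exact-duplicate check using a checksum over the page content; SHA-256 is an arbitrary choice, and near-duplicate detection of partially identical pages is out of scope here:

    import hashlib

    seen_checksums = set()         # in production this would be a shared store

    def is_duplicate_content(page_bytes):
        digest = hashlib.sha256(page_bytes).hexdigest()
        if digest in seen_checksums:
            return True            # exact duplicate: discard the page
        seen_checksums.add(digest)
        return False               # new content: go on to storage / indexing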

6. urlExtractor

A service that extracts urls from an html page, formats (normalizes) the urls, and tags them.
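
A sketch of a urlExtractor built on the standard-library HTML parser: it pulls href attributes out of anchor tags and normalizes relative links against the page url:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # urljoin turns relative links into absolute urls
                        self.links.append(urljoin(self.base_url, value))

    def extract_urls(base_url, html_text):
        parser = LinkExtractor(base_url)
        parser.feed(html_text)
        return parser.links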

7. urlFilter

Because the urls coming out of 6 (the urlExtractor) are of many different kinds (.jpg and so on), filtering logic can be applied here to discard urls that we will not crawl.

  • robots.txt -> urls disallowed in this file should be filtered out (see the sketch below)
  • example, blocking a specific web crawler from a specific web page:
    User-agent: Bingbot
    Disallow: /example-subfolder/blocked-page.html
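
A sketch of the robots.txt part of the urlFilter using the standard library's parser; the user agent string is an assumption, and in practice the parsed robots.txt would be cached per host rather than fetched for every url:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed_by_robots(url, user_agent="MyCrawler"):     # user agent is an assumption
        parts = urlparse(url)
        robots = RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()                    # fetch and parse the host's robots.txt
        return robots.can_fetch(user_agent, url)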

8. url de-dup

All urls that have already been crawled should be stored in a database, which can then be used to de-dup check new urls.

Bloom Filter: a bit vector. A probabilistic data structure: it tells us that an element either is definitely not in the set or may be in the set.

  • To add an element, simply hash it a few times and set the bits in the bit vector at the indexes of those hashes to 1.
  • Why is a bloom filter acceptable here? Not because a falsely skipped url can be "picked up next time" - the filter is deterministic, so a url that tests positive once will always test positive. The real trade-off is that a small fraction of urls (the false positives) are never crawled at all, which is acceptable in exchange for the large memory savings over storing every url exactly. (See the sketch below.)
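
A minimal Bloom filter sketch; the bit-vector size and the double-hashing scheme are illustrative choices, not a tuned design. Check might_contain before adding a url: False means the url is definitely new, True means it is probably a duplicate:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 24, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _indexes(self, item):
            # Derive several bit positions from one digest (double hashing).
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

        def add(self, item):
            for i in self._indexes(item):
                self.bits[i // 8] |= 1 << (i % 8)

        def might_contain(self, item):
            # False: definitely not seen. True: may have been seen (false positives possible).
            return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))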

Urls that survive de-dup are written to the url database and also pushed into the queue that feeds the urlFrontier, where they wait to be processed.

Step 4. Storage

  • html storage: s3
  • url store : key-value
