A web crawler crawls the entire web.
Web crawler vs. web scraper: a crawler downloads entire web pages, while a web scraper focuses on specific parts of a page, e.g. prices.
Step 1: Requirements
Crawl all the web.
Scalability: the crawler should scale to crawl the entire web and fetch all web documents.
Extensibility: the design should be modular so that new functionality can be added.
Some considerations:
What to crawl? HTML pages only, or also media (sound files, images, videos, etc.)? For extensibility, different modules parse different types: htmlParser, soundParser, imageParser, videoParser.
Which protocols? HTTP, FTP, etc.
Politeness: the crawler should not exhaust hosts.
Priority: not all websites have the same priority to crawl.
Dedup: the crawler should not waste time crawling duplicate URLs.
How frequently to crawl?
Step 2: Data Estimation
Assume:
Crawl HTML pages only.
15 billion different web pages to crawl in total.
Finish crawling them in 4 weeks.
100 KB per page, plus roughly 500 bytes of metadata stored per page.
Estimation:
Crawl TPS: 15B / (4 * 7 * 24 * 60 * 60) ≈ 6,200 pages/second.
Storage: 15B * (100 KB + 500 bytes) ≈ 1.5 petabytes. Assuming a 70% capacity model, we need 1.5 / 0.7 ≈ 2.14 petabytes (see the sketch below).
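The arithmetic above can be reproduced with a quick back-of-envelope script; a sketch using the assumptions listed above (decimal units, metadata taken as roughly 500 bytes per page):

```python
# Back-of-envelope estimation using the assumed numbers above (decimal KB/PB).
TOTAL_PAGES = 15_000_000_000          # 15 billion pages
CRAWL_WINDOW_SEC = 4 * 7 * 24 * 3600  # 4 weeks
PAGE_SIZE_BYTES = 100_000             # ~100 KB of HTML per page
META_SIZE_BYTES = 500                 # ~500 bytes of metadata per page
CAPACITY_MODEL = 0.7                  # plan for 70% disk utilization

crawl_tps = TOTAL_PAGES / CRAWL_WINDOW_SEC
raw_storage_pb = TOTAL_PAGES * (PAGE_SIZE_BYTES + META_SIZE_BYTES) / 1e15
provisioned_pb = raw_storage_pb / CAPACITY_MODEL

print(f"crawl rate: ~{crawl_tps:,.0f} pages/second")   # ≈ 6,200
print(f"raw storage: ~{raw_storage_pb:.2f} PB")        # ≈ 1.5 PB
print(f"provisioned: ~{provisioned_pb:.2f} PB")        # ≈ 2.15 PB
```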
Step 3: High-Level Design (most important)
First, walk through the steps of a web crawler (a minimal sketch of this loop follows the BFS vs. DFS note below):
1. Pick an un-visited URL to crawl. How to pick? This involves priority and politeness; a dedicated service is responsible for it → the URL Frontier.
2. Find the IP address for that URL → DNS.
3. Establish a connection to the host at that IP address to download the web page document → a download service. Download to where? The page document will be accessed again later, so put it in a cache for quick access.
4. Parse the document to find new URLs embedded in it → different parsers: urlParser, videoParser, priceParser (web scraper), etc.
5. Add the new URLs to the un-visited-URL list → how to dedupe?
6. Process the downloaded document → store it, index its contents.
7. Start again from step 1.
BFS vs. DFS
In most cases BFS is used. DFS is useful when a connection to an IP address has already been established: crawl that website depth-first over the same connection to save handshake time.
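A minimal sketch of the BFS-style crawl loop described above; the callables `resolve_dns`, `download`, `extract_urls`, `is_seen`, and `store` are hypothetical placeholders for the components discussed in the rest of this section:

```python
from collections import deque

def crawl(seed_urls, resolve_dns, download, extract_urls, is_seen, store):
    """BFS-style crawl loop: the frontier is a FIFO queue of un-visited URLs."""
    frontier = deque(seed_urls)            # URL Frontier (simplified: no priority/politeness)
    while frontier:
        url = frontier.popleft()           # 1. pick the next un-visited URL
        ip = resolve_dns(url)              # 2. resolve the host to an IP address
        document = download(url, ip)       # 3. fetch the page over that connection
        store(url, document)               # 6. store / index the downloaded document
        for new_url in extract_urls(document):   # 4. parse out embedded URLs
            if not is_seen(new_url):       # 5. de-dup before enqueueing
                frontier.append(new_url)   # back to step 1
```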
Components
Seed URLs: the crawl starts from a set of provided starting URLs.
1. UrlFrontier - gets the next URL to crawl. All un-visited URLs sit in a queue; how do we pick the next URL to visit from it? This is where priority and politeness come in.
priority: priority router + queues + priority selector
A priority router places each URL into one of several queues according to its priority: each queue holds URLs of a single priority level, and different queues have different priorities. How to decide a page's priority? By the quality of the page and/or how often it changes. A priority selector then pulls URLs from these queues, pulling from high-priority queues more frequently (a minimal sketch follows).
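A minimal sketch of the priority router + queues + priority selector idea, assuming a fixed set of priority levels; the class name, levels, and pull weights are illustrative, not from the original:

```python
import random
from collections import deque

class PriorityFrontQueues:
    """One FIFO queue per priority level; higher-priority queues are pulled more often."""

    def __init__(self, pull_weights=None):
        # level -> relative pull frequency; level 2 is the highest priority here (illustrative values)
        self.weights = pull_weights or {0: 1, 1: 3, 2: 9}
        self.queues = {level: deque() for level in self.weights}

    def route(self, url, priority):
        """Priority router: put the URL into the queue for its priority level."""
        self.queues[priority].append(url)

    def select(self):
        """Priority selector: pick a non-empty queue at random, biased toward high-priority queues."""
        candidates = [level for level, q in self.queues.items() if q]
        if not candidates:
            return None
        level = random.choices(candidates, weights=[self.weights[l] for l in candidates])[0]
        return self.queues[level].popleft()
```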
Politeness: makes sure a host is not exhausted; only one connection to a host at a time. Components: mapping table + politeness router + queues + politeness selector.
This part also consists of multiple queues, where each queue holds only URLs belonging to a single IP address / host, and the queue-to-host mapping is stored in a mapping table. When a queue becomes empty, all URLs for that host have been crawled; the next URL placed in it belongs to a new host, so update the mapping table. A heap can also be maintained that stores, for each queue, the time of its last pull plus a wait time; the queue with the earliest such time is picked for the next pull (see the sketch below).
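A minimal sketch of the heap-based politeness selection, assuming one back queue per host and a fixed per-host wait time; all names here are illustrative:

```python
import heapq
import time
from collections import deque

class PolitenessBackQueues:
    """One queue per host; a min-heap of (next_allowed_time, host) decides which host is crawled next."""

    def __init__(self, wait_seconds=1.0):
        self.wait_seconds = wait_seconds
        self.host_queues = {}        # mapping table: host -> its queue of URLs
        self.heap = []               # (earliest time this host may be hit again, host)

    def add(self, host, url):
        if host not in self.host_queues:
            self.host_queues[host] = deque()
            heapq.heappush(self.heap, (time.monotonic(), host))
        self.host_queues[host].append(url)

    def next_url(self):
        """Pop the host whose wait time expires earliest; re-schedule it after pulling one URL."""
        if not self.heap:
            return None
        next_time, host = heapq.heappop(self.heap)
        now = time.monotonic()
        if next_time > now:
            time.sleep(next_time - now)              # respect the per-host wait time
        url = self.host_queues[host].popleft()
        if self.host_queues[host]:                   # host still has URLs: schedule its next pull
            heapq.heappush(self.heap, (time.monotonic() + self.wait_seconds, host))
        else:                                        # queue empty: host is done, update the mapping table
            del self.host_queues[host]
        return url
```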
In summary, the UrlFrontier guarantees that:
URLs are crawled according to their priority;
there is only one connection per host at a time;
there is a wait time after each crawl of a URL from a host.
2. DNS resolver - once the UrlFrontier has picked the next URL to crawl, it passes the URL to the DNS resolver; the resolver finds the IP address for the URL and passes it to the HTML fetcher to download the HTML.
An average response time for a DNS request is between 10 ms and 200 ms. When one thread makes a DNS request, the others are blocked until the first request finishes. A DNS cache (in memory or in a separate DB) can help; the cache can be updated periodically using a simple cron job.
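A minimal sketch of an in-memory DNS cache with a TTL, using the standard `socket` resolver; the TTL value is an assumption (a cron job could refresh entries instead, as noted above):

```python
import socket
import time

class DnsCache:
    """Cache hostname -> IP so repeated lookups to the same host skip the 10-200 ms DNS round trip."""

    def __init__(self, ttl_seconds=300):         # TTL is an arbitrary choice for this sketch
        self.ttl = ttl_seconds
        self.cache = {}                           # host -> (ip, expires_at)

    def resolve(self, host):
        entry = self.cache.get(host)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]                       # cache hit
        ip = socket.gethostbyname(host)           # cache miss: do the real DNS lookup
        self.cache[host] = (ip, now + self.ttl)
        return ip
```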
3. HTML fetcher - fetches HTML pages from the resolved IP address.
Multiple workers can act as HTML fetchers to speed things up, and different fetchers can fetch over different protocols.
4. Content caching - the content pulled down by the HTML fetchers is stored in a cache so that later services can read it conveniently (see the sketch below).
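A minimal sketch of running several fetcher workers in parallel and writing the fetched content into a shared cache; here the cache is just an in-process dict, whereas a real system would use a separate cache or document store:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

content_cache = {}                      # fetched page content, read later by the parse / content-seen steps

def fetch(url):
    """One fetcher worker: download the page and put the raw bytes into the content cache."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        content_cache[url] = resp.read()

def fetch_all(urls, num_workers=16):
    """Run several fetcher workers in parallel; each worker handles one URL at a time."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(fetch, urls))     # list() forces completion and surfaces any exceptions
```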
5. Content seen (there may be a parsing step before this) - checks whether this HTML page is a duplicate → a checksum can be used. However, many pages today are not completely identical but only partially the same; there are techniques for detecting near-duplicates in that case. If the content has been seen (duplicate), discard it; otherwise continue processing → content storage / indexing.
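A minimal sketch of the exact-duplicate check via checksums (near-duplicate detection, e.g. fingerprinting techniques such as simhash, is more involved and not shown here); the in-process set stands in for a shared store:

```python
import hashlib

seen_checksums = set()       # in a real system this would be a shared store, not an in-process set

def content_seen(document_bytes):
    """Return True if an identical document has already been processed (exact match only)."""
    checksum = hashlib.sha256(document_bytes).hexdigest()
    if checksum in seen_checksums:
        return True           # duplicate: caller should discard the document
    seen_checksums.add(checksum)
    return False              # new content: continue to storage / indexing
```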
6. URL extractor - a service that extracts URLs from the HTML page, formats the URLs, and tags them.
7. urlFilter - the URLs coming out of step 6 are of many different kinds (.jpg and the like). Filtering logic can be applied here to discard URLs that we will not crawl.
robots.txt → URLs disallowed in this file should be filtered out. Example: blocking a specific web crawler from a specific web page:
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
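Such rules can be checked with Python's standard `urllib.robotparser`; a minimal sketch, where the crawler's user-agent string and the example URLs are assumptions:

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()                                   # fetch and parse the site's robots.txt

# Filter out URLs this crawler is not allowed to fetch.
url = "https://www.example.com/example-subfolder/blocked-page.html"
if not robots.can_fetch("MyCrawlerBot", url):
    print("disallowed by robots.txt -> drop this URL")
```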
8. URL de-dup - all URLs that have already been crawled should be stored in a database, which is used to de-dup check new URLs.
Bloom filter: a bit vector; a probabilistic data structure that tells us an element either is definitely not in the set or may be in the set.
To add an element, simply hash it a few times and set the bits of the bit vector at those hash indices to 1. The reason originally given for using a Bloom filter here was that even if a false-positive URL is skipped this time, a duplicate URL encountered later can still be handled. (Author's note: I think this is wrong; if the Bloom filter once judges a URL as present, it will always judge it as present.) A more standard justification is that a false positive only causes an occasional un-crawled URL to be skipped, which is an acceptable trade-off for a crawler. After URL de-dup, the URL is written into the URL database and also placed into the queue that feeds the UrlFrontier, waiting to be processed (a minimal Bloom filter sketch follows).
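A minimal Bloom filter sketch to make the bit-vector idea concrete; the bit-vector size and number of hash functions are illustrative, and a production crawler would size them from the expected URL count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'definitely not present' or 'maybe present' (false positives possible, no false negatives)."""

    def __init__(self, num_bits=8_000_000, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _indexes(self, url):
        # Derive several hash indices by salting the URL with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, url):
        for idx in self._indexes(url):
            self.bits[idx // 8] |= 1 << (idx % 8)    # set the bit at each hash index

    def might_contain(self, url):
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(url))
```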
Step 4: Storage
HTML storage: S3.
URL store: key-value store.
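A minimal sketch of writing a downloaded page to S3 keyed by a hash of its URL; the bucket name and helper are hypothetical, and this assumes boto3 with AWS credentials configured:

```python
import hashlib
import boto3   # assumes AWS credentials are configured; the bucket name below is hypothetical

s3 = boto3.client("s3")

def store_page(url, document_bytes):
    """Store raw HTML in S3 keyed by a hash of the URL; URL metadata would go to the key-value store."""
    key = hashlib.sha256(url.encode()).hexdigest()
    s3.put_object(Bucket="my-crawler-pages", Key=key, Body=document_bytes)
    return key   # the key, crawl time, etc. can be saved alongside the URL in the key-value store
```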