A web crawler crawls the entire web.

Web crawler vs web scraper: a crawler basically downloads entire webpages, while a web scraper focuses on specific parts of a webpage, e.g. prices.

Step 1: Requirements

crawl all the web

  • scalability - the design should scale to crawl the entire web and fetch all web documents
  • extensibility - should be built in a modular way so that new functionality can be added to it

Some considerations:

  • what to crawl? html pages, media (sound files, images, videos, etc). For extensibility, we need different modules to parse different types: htmlParser, soundParser, imageParser, videoParser
  • protocols ? HTTP, FTP, etc
  • politeness: web crawler should not exhaust hosts
  • priority: not all websites have the same priority to crawl
  • dedup: the crawler should not waste time crawling duplicate urls
  • how frequently to crawl?

Step 2: Data Estimation

Assume:

  • crawl html pages only
  • 15 billion different webpages to crawl in total
  • finish crawling them in 4 weeks
  • 100 KB per page, plus 500 bytes stored per page for page metadata

Estimation:

  • crawl TPS: 15B / (4 * 7 * 24 * 60 * 60) ≈ 6,200 pages / second
  • storage: 15B * (100 KB + 500 bytes) ≈ 1.5 petabytes
  • assume a 70% capacity model, so we need 1.5 / 0.7 ≈ 2.14 petabytes (checked in the sketch below)
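
A quick back-of-the-envelope check of the numbers above, as a Python sketch; the constants are just the assumptions listed in this section:

    # Back-of-the-envelope check of the crawl-rate and storage estimates.
    PAGES      = 15e9                  # 15 billion pages
    SECONDS    = 4 * 7 * 24 * 3600     # 4 weeks
    PAGE_BYTES = 100 * 1000            # 100 KB per page
    META_BYTES = 500                   # 500 bytes of metadata per page

    tps    = PAGES / SECONDS                              # ~6,200 pages / second
    raw_pb = PAGES * (PAGE_BYTES + META_BYTES) / 1e15     # ~1.5 PB
    provisioned_pb = raw_pb / 0.7                         # ~2.15 PB at 70% capacity

    print(f"{tps:,.0f} pages/s, {raw_pb:.2f} PB raw, {provisioned_pb:.2f} PB provisioned")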

Step 3: High level design - Most important

First, let's work through the steps of a web crawler (a minimal loop sketch follows the list):

  1. From the urls that have not been visited yet, pick one to crawl
    • How to pick? This involves priority and politeness
    • A dedicated service is responsible for picking the url -> Url Frontier
  2. Find the IP address corresponding to this url -> DNS
  3. Establish a connection to the host at that IP address to download the webpage document
    • a download service
    • Download to where? The page document still needs to be accessed afterwards, so it can be put in a cache for quick access
  4. Parse the document to find new URLs embedded in this document
    • different parsers: urlParser, videoParser, priceParser (web scraper), etc
  5. Add new URLs to the un-visited-url list -> how to dedupe?
  6. Process the downloaded document -> store it, index its contents
    • where to store?
  7. Start again from step 1.
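
A minimal sketch of this loop in Python; frontier, dns, fetcher, parser, dedup and store are hypothetical components standing in for the services described above, not real APIs:

    # Minimal crawl loop tying the seven steps together (error handling omitted).
    def crawl_loop(frontier, dns, fetcher, parser, dedup, store):
        while not frontier.empty():
            url = frontier.next_url()                       # step 1: pick by priority / politeness
            ip = dns.resolve(url)                           # step 2: DNS lookup
            document = fetcher.fetch(url, ip)               # step 3: download the page
            for new_url in parser.extract_urls(document):   # step 4: find embedded urls
                if not dedup.seen(new_url):                 # step 5: dedupe, then enqueue
                    dedup.add(new_url)
                    frontier.add(new_url)
            store.save(url, document)                       # step 6: store / index the content
                                                            # step 7: loop back to step 1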

BFS vs DFS

In most cases BFS is used. DFS is useful once a connection to an IP address has already been established: just crawl that website depth-first right away, which saves the handshake time of reconnecting later.

Components

Seed Urls

Crawling starts from a set of seed urls provided up front.

1. UrlFrontier - gets the next url to crawl

All the un-visited urls sit in a queue. How do we pick the next url to visit from it? This is where priority and politeness come in.

priority: priority router + queues + priority selector

  • A priority router puts urls into one of several queues according to their priority; each queue holds urls of a single priority, and different queues hold different priorities
    • How to decide the priority of a page? Quality of the page / change rate of the page
  • A priority selector pulls urls from these queues of different priorities; higher-priority queues are pulled more frequently (see the sketch below)
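
A sketch of the priority selector, assuming plain FIFO queues per priority level and a weighted random pull so that higher-priority queues are drained more often; the weights and the three priority levels are assumptions:

    import random
    from collections import deque

    # One FIFO queue per priority level; 1 is the highest priority.
    priority_queues = {1: deque(), 2: deque(), 3: deque()}
    pull_weights    = {1: 5, 2: 3, 3: 1}     # assumed pull frequencies

    def select_next_url():
        # Only consider non-empty queues, then bias the choice toward high priority.
        candidates = [p for p, q in priority_queues.items() if q]
        if not candidates:
            return None
        chosen = random.choices(candidates,
                                weights=[pull_weights[p] for p in candidates])[0]
        return priority_queues[chosen].popleft()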

Politeness ensures the crawler does not exhaust a host: at most one connection to a host at a time. mapping table + politeness router + queues + politeness selector

  • Also made up of multiple queues; each queue holds only urls belonging to a single ip address / host, and the host-to-queue mapping is kept in a mapping table
  • When a queue becomes empty, all urls for that host have been crawled; the next url it receives belongs to a new host, so update the mapping table
  • A heap can also be maintained: for each queue it stores the time of the last pull plus a waiting time, and the queue with the earliest such time is pulled next (see the sketch after the table below)
  host               queue id
  http://www.a.com   1
  http://www.b.com   2
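
A sketch of the politeness selector described above: a mapping table from host to queue id plus a min-heap keyed by the earliest time each host queue may be pulled again. The one-second wait is an assumed politeness delay, and for simplicity each host keeps its own queue here instead of reusing a fixed pool of queues:

    import heapq
    import time
    from collections import deque

    WAIT_SECONDS = 1.0      # assumed delay between two pulls from the same host

    host_to_queue = {}      # mapping table: host -> queue id
    host_queues   = {}      # queue id -> urls waiting for that host
    ready_heap    = []      # (earliest time this queue may be pulled again, queue id)

    def enqueue(host, url):
        qid = host_to_queue.get(host)
        if qid is None:                              # new host: assign it a queue
            qid = len(host_queues) + 1
            host_to_queue[host] = qid
            host_queues[qid] = deque()
        if not host_queues[qid]:                     # queue was empty: schedule it
            heapq.heappush(ready_heap, (time.time(), qid))
        host_queues[qid].append(url)

    def next_url():
        now = time.time()
        while ready_heap and ready_heap[0][0] <= now:
            _, qid = heapq.heappop(ready_heap)
            if host_queues[qid]:
                url = host_queues[qid].popleft()
                if host_queues[qid]:                 # more urls for this host: wait first
                    heapq.heappush(ready_heap, (now + WAIT_SECONDS, qid))
                return url
        return None                                  # no host is ready to be crawled yet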

So the urlFrontier provides:

  1. urls are crawled according to their priority
  2. at most one connection per host at a time
  3. a wait time between consecutive crawls of the same host

2. DNS resolver

After the UrlFrontier has picked the next url to crawl, it hands it to the DNS resolver; the resolver finds the ip address for the url and passes it to the html fetcher to download the html.

  • An average response time for a DNS request can be between 10ms - 200ms.
  • When one thread makes a request to DNS, the others are blocked until the first request is finished.

So a DNS cache (in memory or in a separate db) can help. The cache can be updated periodically using a simple cron job.
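
A sketch of such a DNS cache: an in-memory dict with a TTL in front of the standard resolver. The 30-minute TTL is an assumption; a real deployment might instead refresh entries with the cron job mentioned above:

    import socket
    import time

    DNS_TTL = 30 * 60              # assumed cache lifetime: 30 minutes
    _dns_cache = {}                # host -> (ip address, time resolved)

    def resolve(host):
        entry = _dns_cache.get(host)
        if entry and time.time() - entry[1] < DNS_TTL:
            return entry[0]                      # cache hit: skip the 10ms-200ms lookup
        ip = socket.gethostbyname(host)          # cache miss: real DNS lookup
        _dns_cache[host] = (ip, time.time())
        return ip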

3. html fetcher

The html fetcher fetches html pages from the resolved ip address.

  • Multiple workers can run as html fetchers to speed this up (see the sketch below)
  • Different fetchers can also fetch over different protocols
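
A sketch of running several fetcher workers in parallel with a thread pool; the worker count and timeout are assumptions, and error handling, retries and non-HTTP protocols are left out:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        # Download one page (a real fetcher would reuse connections per host).
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    def fetch_batch(urls, workers=16):           # worker count is an assumption
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fetch, urls))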

4. content caching

The content pulled down by the html fetcher is stored in a cache so that later services can read it quickly.

5. content seen

(There may be a parse step before content-seen.) Check whether this html page is a duplicate -> a checksum can be used. Many pages today are not exact copies but only partially the same, and there are techniques for detecting duplication in that case (near-duplicate detection). If the content has been seen (a duplicate), discard it; otherwise continue processing -> content storage / indexing.
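
A sketch of the exact-duplicate check using a checksum over the page content; SHA-256 is an arbitrary choice, and near-duplicate detection of partially identical pages is out of scope here:

    import hashlib

    seen_checksums = set()         # in production this would be a shared store

    def is_duplicate_content(page_bytes):
        digest = hashlib.sha256(page_bytes).hexdigest()
        if digest in seen_checksums:
            return True            # exact duplicate: discard the page
        seen_checksums.add(digest)
        return False               # new content: go on to storage / indexing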

6. urlExtractor

A service that extracts urls from an html page, formats (normalizes) the urls, and tags them.
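
A sketch of a urlExtractor built on the standard-library HTML parser: it pulls href attributes out of anchor tags and normalizes relative links against the page url:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # urljoin turns relative links into absolute urls
                        self.links.append(urljoin(self.base_url, value))

    def extract_urls(base_url, html_text):
        parser = LinkExtractor(base_url)
        parser.feed(html_text)
        return parser.links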

7. urlFilter

Because the urls coming out of 6 (the urlExtractor) are of many different kinds (.jpg and so on), filtering logic can be applied here to discard urls that we will not crawl.

  • robots.txt -> urls disallowed in this file should be filtered out (see the sketch below)
  • example, blocking a specific web crawler from a specific web page:
    User-agent: Bingbot
    Disallow: /example-subfolder/blocked-page.html
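
A sketch of the robots.txt part of the urlFilter using the standard library's parser; the user agent string is an assumption, and in practice the parsed robots.txt would be cached per host rather than fetched for every url:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed_by_robots(url, user_agent="MyCrawler"):     # user agent is an assumption
        parts = urlparse(url)
        robots = RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()                    # fetch and parse the host's robots.txt
        return robots.can_fetch(user_agent, url)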

8. url de-dup

All urls that have already been crawled should be stored in a database, which can then be used to de-dup check new urls.

Bloom Filter: a bit vector. A probabilistic data structure: it tells us that an element either is definitely not in the set or may be in the set.

  • To add an element, simply hash it a few times and set the bits in the bit vector at the indexes of those hashes to 1.
  • Why is a bloom filter acceptable here? Not because a falsely skipped url can be "picked up next time" - the filter is deterministic, so a url that tests positive once will always test positive. The real trade-off is that a small fraction of urls (the false positives) are never crawled at all, which is acceptable in exchange for the large memory savings over storing every url exactly. (See the sketch below.)
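
A minimal Bloom filter sketch; the bit-vector size and the double-hashing scheme are illustrative choices, not a tuned design. Check might_contain before adding a url: False means the url is definitely new, True means it is probably a duplicate:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 24, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _indexes(self, item):
            # Derive several bit positions from one digest (double hashing).
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

        def add(self, item):
            for i in self._indexes(item):
                self.bits[i // 8] |= 1 << (i % 8)

        def might_contain(self, item):
            # False: definitely not seen. True: may have been seen (false positives possible).
            return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))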

Urls that survive de-dup are written to the url database and also pushed into the queue that feeds the urlFrontier, where they wait to be processed.

Step 4. Storage

  • html storage: s3
  • url store : key-value
