

Highlights

  • Back of the envelope estimation: The following estimations are based on many assumptions, and it is important to communicate with the interviewer to be on the same page.
    • Assume 1 billion web pages are downloaded every month.
    • QPS: 1,000,000,000 / 30 days / 24 hours / 3,600 seconds = ~400 pages per second.
    • Peak QPS = 2 * QPS = 800.
    • Assume the average web page size is 500 KB.
    • 1 billion pages x 500 KB = 500 TB storage per month. If you are unclear about digital storage units, go through the “Power of 2” section in the “Back-of-the-envelope Estimation” chapter again.
    • Assuming data are stored for five years, 500 TB * 12 months * 5 years = 30 PB. A 30 PB storage is needed to store five-year content. (View Highlight)
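A quick sketch of the arithmetic above (the inputs are simply the stated assumptions, so adjust them if your assumptions differ):

```python
# Back-of-the-envelope numbers from the stated assumptions.
pages_per_month = 1_000_000_000            # 1 billion pages downloaded per month
seconds_per_month = 30 * 24 * 3600         # ~2.6 million seconds in a month

qps = pages_per_month / seconds_per_month  # ~386 pages per second
peak_qps = 2 * qps                         # ~772 pages per second

avg_page_size_kb = 500                     # assumed average page size: 500 KB
storage_per_month_tb = pages_per_month * avg_page_size_kb / 1e9   # KB -> TB: 500 TB
storage_five_years_pb = storage_per_month_tb * 12 * 5 / 1000      # TB -> PB: 30 PB

print(f"QPS ~{qps:.0f}, peak QPS ~{peak_qps:.0f}")
print(f"{storage_per_month_tb:.0f} TB per month, {storage_five_years_pb:.0f} PB over five years")
```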
  • A web crawler uses seed URLs as a starting point for the crawl process (View Highlight)
  • A good seed URL serves as a good starting point that a crawler can utilize to traverse as many links as possible. (View Highlight)
  • The first proposed approach is based on locality as different countries may have different popular websites. Another way is to choose seed URLs based on topics; for example, we can divide URL space into shopping, sports, healthcare, etc. Seed URL selection is an open-ended question. (View Highlight)
  • Most modern web crawlers split the crawl state into two: to be downloaded and already downloaded. The component that stores URLs to be downloaded is called the URL Frontier. You can refer to this as a First-in-First-out (FIFO) queue. (View Highlight)
  • The HTML downloader downloads web pages from the internet. Those URLs are provided by the URL Frontier. (View Highlight)
  • To download a web page, a URL must be translated into an IP address. The HTML Downloader calls the DNS Resolver to get the corresponding IP address for the URL. (View Highlight)
  • After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and waste storage space. Implementing a content parser in a crawl server will slow down the crawling process. Thus, the content parser is a separate component. (View Highlight)
  • We introduce the “Content Seen?” data structure to eliminate data redundancy and shorten processing time. It helps to detect whether new content has previously been stored in the system. (View Highlight)
  • To compare two HTML documents, we can compare them character by character. However, this method is slow and time-consuming, especially when billions of web pages are involved. An efficient way to accomplish this task is to compare the hash values of the two web pages (View Highlight)
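A hedged sketch of the hash-comparison idea using Python's hashlib; the specific digest (MD5 here) and the set-based bookkeeping are just examples of how the “Content Seen?” check could be backed, not something the chapter prescribes:

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Fixed-size hash of a page's content, cheap to store and compare."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

seen_fingerprints = set()   # backing store for the "Content Seen?" check

def content_seen(html: str) -> bool:
    """Return True if an identical page has been stored before, else record it."""
    fp = page_fingerprint(html)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```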
  • Content Storage is a storage system for storing HTML content. The choice of storage system depends on factors such as data type, data size, access frequency, life span, etc. Both disk and memory are used. (View Highlight)
  • Most of the content is stored on disk because the data set is too big to fit in memory. (View Highlight)
  • Popular content is kept in memory to reduce latency. (View Highlight)
  • URL Extractor parses and extracts links from HTML pages. Figure 3 shows an example of a link extraction process. (View Highlight)
  • The URL filter excludes certain content types, file extensions, error links and URLs in “blacklisted” sites. (View Highlight)
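A minimal illustration of such a URL filter; the blocked extensions and blacklisted hosts below are made-up examples, not values from the book:

```python
from urllib.parse import urlparse

# Hypothetical filter rules -- a real deployment would load these from configuration.
BLOCKED_EXTENSIONS = {".jpg", ".png", ".gif", ".pdf", ".zip", ".exe"}
BLACKLISTED_HOSTS = {"spam.example.com"}

def url_passes_filter(url: str) -> bool:
    """Exclude unsupported schemes, blacklisted sites, and unwanted file extensions."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname in BLACKLISTED_HOSTS:
        return False
    if any(parsed.path.lower().endswith(ext) for ext in BLOCKED_EXTENSIONS):
        return False
    return True
```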
  • “URL Seen?” is a data structure that keeps track of URLs that are visited before or already in the Frontier. “URL Seen?” helps to avoid adding the same URL multiple times (View Highlight)
  • Bloom filter and hash table are common techniques to implement the “URL Seen?” component. (View Highlight)
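As a sketch of the two options: a hash-table version is just an in-memory set of URLs, while a Bloom filter trades a small false-positive rate for far less memory. The bit-array size and hash count below are arbitrary illustrations:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for "URL Seen?": may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 23, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _bit_positions(self, url: str):
        # Derive k bit positions from k salted hashes of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._bit_positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._bit_positions(url))
```

In use, a URL is pushed to the frontier only when `url not in urls_seen`, after which it is recorded with `urls_seen.add(url)`.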
  • URL Storage stores already visited URLs. (View Highlight)
  • Figure 4 (View Highlight)
  • Step 3 - Design deep dive (View Highlight)
  • You can think of the web as a directed graph where web pages serve as nodes and hyperlinks (URLs) as edges. The crawl process can be seen as traversing a directed graph from one web page to others. Two common graph traversal algorithms are DFS and BFS (View Highlight)
  • DFS is usually not a good choice because the depth of DFS can be very deep. (View Highlight)
  • BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. (View Highlight)
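A minimal BFS crawl loop built on a FIFO queue (collections.deque). The `download` and `extract_links` callables stand in for the HTML Downloader and URL Extractor components and are assumptions of this sketch:

```python
from collections import deque

def bfs_crawl(seed_urls, download, extract_links, max_pages=1000):
    """Traverse the web graph breadth-first, starting from the seed URLs."""
    frontier = deque(seed_urls)        # FIFO queue acting as a simple URL Frontier
    seen = set(seed_urls)              # naive "URL Seen?" check
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        html = download(url)           # placeholder for the HTML Downloader
        if html is None:
            continue
        downloaded += 1
        for link in extract_links(html, url):   # placeholder for the URL Extractor
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```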
  • This implementation has two problems: (View Highlight)
  • Most links from the same web page are linked back to the same host. In Figure 5, all the links in wikipedia.com are internal links, making the crawler busy processing URLs from the same host (wikipedia.com). When the crawler tries to download web pages in parallel, Wikipedia servers will be flooded with requests. This is considered “impolite”. (View Highlight)
  • Standard BFS does not take the priority of a URL into consideration. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc. (View Highlight)
  • URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded. The URL frontier is an important component to ensure politeness, URL prioritization, and freshness (View Highlight)
  • Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered “impolite” or even treated as a denial-of-service (DoS) attack. (View Highlight)
  • The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. Figure 6 shows the design that manages politeness; a minimal code sketch follows the component list below. (View Highlight)
  • Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host. (View Highlight)
  • Mapping table: It maps each host to a queue. (View Highlight)
  • FIFO queues b1, b2 to bn: Each queue contains URLs from the same host. (View Highlight)
  • Queue selector: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue. The queue selection logic is done by the Queue selector. (View Highlight)
  • Worker thread 1 to N: A worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks. (View Highlight)
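A rough sketch of the host-to-queue mapping described above, assuming a fixed pool of back queues, one worker thread per queue, and a fixed delay between downloads. Hashing many hosts into a fixed number of queues is a simplification of the one-host-per-queue design, and names such as `POLITENESS_DELAY` are illustrative:

```python
import threading
import time
from queue import Queue
from urllib.parse import urlparse

NUM_QUEUES = 8
POLITENESS_DELAY = 1.0                   # seconds to wait between downloads from one host

back_queues = [Queue() for _ in range(NUM_QUEUES)]
host_to_queue = {}                       # mapping table: host -> queue index

def route(url: str) -> None:
    """Queue router: URLs from the same host always land in the same back queue."""
    host = urlparse(url).hostname or ""
    idx = host_to_queue.setdefault(host, hash(host) % NUM_QUEUES)
    back_queues[idx].put(url)

def worker(queue: Queue, download) -> None:
    """Worker thread: downloads URLs from its own queue one by one, with a delay."""
    while True:
        url = queue.get()
        download(url)                    # placeholder for the HTML Downloader
        time.sleep(POLITENESS_DELAY)     # politeness: pause before the next request
        queue.task_done()

def start_workers(download) -> None:
    for q in back_queues:
        threading.Thread(target=worker, args=(q, download), daemon=True).start()
```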
  • We prioritize URLs based on usefulness, which can be measured by PageRank [10], website traffic, update frequency, etc. “Prioritizer” is the component that handles URL prioritization. (View Highlight)
  • Figure 7 (View Highlight)
  • Prioritizer: It takes URLs as input and computes the priorities. (View Highlight)
  • Queue f1 to fn: Each queue has an assigned priority. Queues with high priority are selected with higher probability. (View Highlight)
  • Queue selector: Randomly choose a queue with a bias towards queues with higher priority. (View Highlight)
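A hedged sketch of the biased selection: `random.choices` picks a non-empty front queue with probability proportional to its weight. The priority levels, weights, and score thresholds are arbitrary examples:

```python
import random
from collections import deque

# Front queues f1..fn, each with an assumed priority weight (higher = picked more often).
front_queues = {
    "high":   {"queue": deque(), "weight": 5},
    "medium": {"queue": deque(), "weight": 3},
    "low":    {"queue": deque(), "weight": 1},
}

def prioritize(url: str, score: float) -> None:
    """Prioritizer (placeholder scoring): route the URL to a queue based on its score."""
    level = "high" if score > 0.7 else "medium" if score > 0.3 else "low"
    front_queues[level]["queue"].append(url)

def select_next_url():
    """Queue selector: randomly choose a queue, biased toward higher priority."""
    candidates = [entry for entry in front_queues.values() if entry["queue"]]
    if not candidates:
        return None
    chosen = random.choices(candidates, weights=[entry["weight"] for entry in candidates])[0]
    return chosen["queue"].popleft()
```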
  • Figure 8 presents the URL frontier design, and it contains two modules:
    • Front queues: manage prioritization
    • Back queues: manage politeness (View Highlight)
  • Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh (View Highlight)
  • Recrawling all the URLs is time-consuming and resource intensive. A few strategies to optimize freshness are listed as follows:
    • Recrawl based on web pages’ update history.
    • Prioritize URLs and recrawl important pages first and more frequently. (View Highlight)
  • In a real-world crawl for search engines, the number of URLs in the frontier could be hundreds of millions [4]. Putting everything in memory is neither durable nor scalable. Keeping everything on disk is undesirable either, because the disk is slow and can easily become a bottleneck for the crawl. (View Highlight)
  • We adopted a hybrid approach. The majority of URLs are stored on disk, so the storage space is not a problem. To reduce the cost of reading from the disk and writing to the disk, we maintain buffers in memory for enqueue/dequeue operations. Data in the buffer is periodically written to the disk. (View Highlight)
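A rough sketch of the write path of that hybrid approach: an in-memory enqueue buffer that is flushed to an append-only file on disk once it reaches a threshold. The file format and threshold are arbitrary choices for the example:

```python
from collections import deque

class BufferedFrontierQueue:
    """URLs live mostly on disk; a small in-memory buffer absorbs enqueue traffic."""

    def __init__(self, path: str, flush_threshold: int = 10_000):
        self.path = path                          # append-only file holding the on-disk backlog
        self.flush_threshold = flush_threshold
        self.enqueue_buffer = deque()
        open(self.path, "a", encoding="utf-8").close()   # ensure the backing file exists

    def enqueue(self, url: str) -> None:
        self.enqueue_buffer.append(url)
        if len(self.enqueue_buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write buffered URLs to disk in one batch (could also be triggered on a timer)."""
        with open(self.path, "a", encoding="utf-8") as f:
            while self.enqueue_buffer:
                f.write(self.enqueue_buffer.popleft() + "\n")
```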
  • The HTML Downloader downloads web pages from the internet using the HTTP protocol. (View Highlight)
  • Robots.txt, called Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies what pages crawlers are allowed to download. Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules. (View Highlight)
  • To avoid repeated downloads of the robots.txt file, we cache the results of the file. The file is downloaded and saved to the cache periodically. (View Highlight)
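A sketch using Python's standard urllib.robotparser with a per-host in-memory cache; the user agent name and the fallback behavior when robots.txt is unreachable are assumptions of this example:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

robots_cache = {}   # host -> RobotFileParser; a real system would also refresh entries periodically

def allowed_to_crawl(url: str, user_agent: str = "MyCrawler") -> bool:
    parsed = urlparse(url)
    host = f"{parsed.scheme}://{parsed.netloc}"
    parser = robots_cache.get(host)
    if parser is None:
        parser = RobotFileParser()
        parser.set_url(host + "/robots.txt")
        try:
            parser.read()               # downloads and parses the site's robots.txt
        except OSError:
            return True                 # assumption: treat an unreachable robots.txt as allow-all
        robots_cache[host] = parser
    return parser.can_fetch(user_agent, url)
```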
  • Performance optimization (View Highlight)
  • To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads (View Highlight)
  • Figure 9 (View Highlight)
  • DNS Resolver is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many DNS interfaces. (View Highlight)
  • Maintaining our DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs. (View Highlight)
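A minimal DNS cache sketch. The design above refreshes the mapping with cron jobs; the TTL-based, on-demand refresh below is a simplification for illustration:

```python
import socket
import time

dns_cache = {}            # hostname -> (ip_address, resolved_at)
DNS_TTL_SECONDS = 300     # illustrative refresh interval

def resolve(hostname: str) -> str:
    """Return a cached IP address, re-resolving only when the entry is stale."""
    entry = dns_cache.get(hostname)
    if entry and time.time() - entry[1] < DNS_TTL_SECONDS:
        return entry[0]
    ip = socket.gethostbyname(hostname)   # blocking DNS lookup
    dns_cache[hostname] = (ip, time.time())
    return ip
```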
  • Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time (View Highlight)
  • Some web servers respond slowly or may not respond at all. To avoid long wait time, a maximal wait time is specified. (View Highlight)
  • Robustness (View Highlight)
  • To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading saved states and data. (View Highlight)
  • Detect and avoid problematic content (View Highlight)
    1. Spider traps: A spider trap is a web page that traps a crawler in an infinite loop. For instance, an infinitely deep directory structure is listed as follows: http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/... Such spider traps can be avoided by setting a maximal length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters. (View Highlight)
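One cheap guard from the paragraph above expressed as code: reject URLs beyond a maximal length, combined with a manually maintained exclusion list. The length cap is an arbitrary example:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048            # illustrative cap; tune per deployment
MANUALLY_EXCLUDED_HOSTS = set()  # hosts a human has verified to contain spider traps

def looks_like_spider_trap(url: str) -> bool:
    """Cheap spider-trap guard: overly long URLs or manually excluded hosts."""
    if len(url) > MAX_URL_LENGTH:
        return True
    return urlparse(url).hostname in MANUALLY_EXCLUDED_HOSTS
```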
  • Even though we have covered many topics, we still miss many relevant talking points: (View Highlight)
  • Server-side rendering: Numerous websites use scripts like JavaScript, AJAX, etc. to generate links on the fly. If we download and parse web pages directly, we will not be able to retrieve dynamically generated links. To solve this problem, we perform server-side rendering (also called dynamic rendering) first before parsing a page [12]. (View Highlight)
  • Filter out unwanted pages: With finite storage capacity and crawl resources, an anti-spam component is beneficial in filtering out low quality and spam pages [13] [14]. (View Highlight)