Frontera (web crawling)

Frontera
Original author(s) Alexander Sibiryakov, Javier Casas
Developer(s) Scrapinghub Ltd., GitHub community
Initial release November 1, 2014 (2014-11-01)
Stable release
v0.7.0 / February 9, 2017 (2017-02-09)
Written in Python
Operating system OS X, Linux
Type web crawling
License BSD 3-clause license
Website github.com/scrapinghub/frontera

Frontera is an open-source web crawling framework that implements the crawl frontier component and provides scalability primitives for web crawler applications.

Overview

The content and structure of the World Wide Web change rapidly, and Frontera is designed to adapt quickly to these changes. Most large-scale web crawlers operate in batch mode, with sequential phases of injection, fetching, parsing, deduplication, and scheduling, which delays updates to the crawl when the web changes. That batch design is mostly motivated by the relatively low random-access performance of hard disks compared to sequential access. Frontera instead relies on modern key-value storage systems, using efficient data structures and powerful hardware to crawl, parse, and schedule the indexing of new links concurrently. It is an open-source project designed to fit various use cases, with high flexibility and configurability.

Large-scale web crawls are not Frontera's only purpose. Its flexibility also allows crawls of moderate size on a single machine with a few cores, using the single process and distributed spiders run modes.

Features

Frontera is written mainly in Python. Data transport and formats are well abstracted, and out-of-the-box implementations include support for MessagePack, JSON, Kafka and ZeroMQ.

  • Online operation: small request batches, with parsing done right after fetching.
  • Pluggable backend architecture: low-level storage logic is separated from crawling policy.
  • Three run modes: single process, distributed spiders, distributed backend and spiders.
  • Transparent data flow, making it easy to integrate custom components.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • SQLAlchemy and HBase storage backends.
  • Revisiting logic (only with RDBMS backend).
  • Optional use of Scrapy for fetching and parsing.
  • BSD 3-clause license, allowing use in any commercial product.
  • Python 3 support.
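
The message bus abstraction listed above can be pictured with a minimal sketch. The `MessageBus` interface, the `InMemoryBus` class and the `encode`/`decode` helpers are assumptions made for this illustration and do not reproduce Frontera's actual classes. The sketch only shows how a transport and a message format can be swapped independently of the code that uses them.

```python
import json
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class MessageBus(ABC):
    """Hypothetical transport interface (not Frontera's real class layout)."""

    @abstractmethod
    def send(self, stream: str, payload: bytes) -> None: ...

    @abstractmethod
    def receive(self, stream: str): ...


class InMemoryBus(MessageBus):
    """Toy transport; a Kafka- or ZeroMQ-based class would expose the same methods."""

    def __init__(self):
        self._streams = defaultdict(deque)

    def send(self, stream, payload):
        self._streams[stream].append(payload)

    def receive(self, stream):
        queue = self._streams[stream]
        return queue.popleft() if queue else None


def encode(message: dict) -> bytes:
    """JSON codec; a MessagePack codec could replace it without touching the bus."""
    return json.dumps(message, sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))


bus = InMemoryBus()
bus.send("spider_feed", encode({"url": "http://example.com/", "score": 0.8}))
received = decode(bus.receive("spider_feed"))
print(received)
```

Because callers depend only on `send` and `receive`, replacing the transport in this sketch does not affect the code that produces or consumes messages.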

Comparison to other web crawlers

Although Frontera is not a web crawler itself, it requires a streaming crawling architecture rather than a batch crawling approach.[1]

StormCrawler is another stream-oriented crawler, built on top of Apache Storm and using some components from the Apache Nutch ecosystem. Scrapy Cluster was designed by ISTResearch with precise monitoring and management of the queue in mind. These systems provide fetching and/or queueing mechanisms, but no link database or content processing.

Architecture

Single process [2]

Fetcher

The Fetcher is responsible for fetching web pages from sites and feeding them to the frontier, which manages which pages should be crawled next. The Fetcher can be implemented using Scrapy or any other crawling framework or system, since Frontera offers generic frontier functionality. In distributed run mode, the Fetcher is replaced by a message bus producer on the Frontera Manager side and a consumer on the Fetcher side.

Frontera API / Manager

The main entry point to the Frontera API is the FrontierManager object. Frontier users, in this case the Fetcher, communicate with the frontier through it.

Middlewares

Frontier middlewares are specific hooks that sit between the Manager and the Backend. These middlewares process Request and Response objects as they pass to and from the Frontier and the Backend. They provide a convenient mechanism for extending functionality by plugging in custom code. The canonical URL solver is a special case of middleware, responsible for substituting non-canonical document URLs with canonical ones.
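
As a rough illustration of the hook idea, the sketch below chains two middlewares that a manager could run before handing a request to a backend. The `Request` stand-in, the class names and the `process_request` method are hypothetical and are not Frontera's actual API. The URL normalization is deliberately simplistic compared to a real canonical URL solver.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Request:
    """Minimal stand-in for a frontier Request object (hypothetical)."""
    url: str
    meta: dict = field(default_factory=dict)


class CanonicalUrlMiddleware:
    """Hypothetical middleware: normalizes URLs before they reach the backend."""

    def process_request(self, request: Request) -> Request:
        parts = urlsplit(request.url)
        # Lower-case the scheme and host and drop the fragment; a real
        # canonical URL solver would apply many more rules.
        canonical = urlunsplit(
            (parts.scheme.lower(), parts.netloc.lower(),
             parts.path or "/", parts.query, "")
        )
        return Request(url=canonical, meta=dict(request.meta))


class DomainMiddleware:
    """Hypothetical middleware: annotates each request with its host name."""

    def process_request(self, request: Request) -> Request:
        request.meta["domain"] = urlsplit(request.url).hostname
        return request


def run_middlewares(middlewares, request: Request) -> Request:
    """Pass a request through every middleware in order, as a manager might."""
    for middleware in middlewares:
        request = middleware.process_request(request)
    return request


chain = [CanonicalUrlMiddleware(), DomainMiddleware()]
result = run_middlewares(chain, Request("HTTP://Example.COM/page#section"))
print(result.url, result.meta)
```

The order of the chain matters in this sketch: the domain annotation sees the already normalized URL because it runs second.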

Backend

The frontier Backend is where the crawling logic and policies lie. It is responsible for receiving all the crawl information and selecting the next pages to be crawled. The Backend is meant to operate at a higher level, while the Queue, Metadata and States objects are responsible for the low-level storage communication code.

Depending on the logic implemented, the Backend may require persistent storage to manage information about Request and Response objects.
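
The split between crawling policy and low-level storage can be sketched as follows. Both classes are assumptions made for this example: `MemoryQueue` plays the role of a storage-level queue, and `ShallowFirstBackend` is an invented policy that prefers pages close to the seeds. Neither mirrors Frontera's real backend classes.

```python
import heapq
from itertools import count


class MemoryQueue:
    """Low-level storage: an in-memory priority queue (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps insertion order stable

    def push(self, url, score):
        # heapq is a min-heap, so negate the score to pop high scores first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop_batch(self, max_n):
        batch = []
        while self._heap and len(batch) < max_n:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


class ShallowFirstBackend:
    """Crawling policy: prefer pages closer to the seeds (hypothetical)."""

    def __init__(self, queue):
        self.queue = queue
        self.depth = {}  # stands in for metadata and state storage

    def add_seeds(self, urls):
        for url in urls:
            self._schedule(url, depth=0)

    def page_crawled(self, url, links):
        for link in links:
            self._schedule(link, depth=self.depth[url] + 1)

    def get_next_requests(self, max_n):
        return self.queue.pop_batch(max_n)

    def _schedule(self, url, depth):
        if url in self.depth:  # already known, do not schedule twice
            return
        self.depth[url] = depth
        self.queue.push(url, score=1.0 / (depth + 1))


backend = ShallowFirstBackend(MemoryQueue())
backend.add_seeds(["http://a.example/"])
first = backend.get_next_requests(10)
backend.page_crawled("http://a.example/", ["http://a.example/1", "http://a.example/"])
second = backend.get_next_requests(10)
print(first, second)
```

Swapping `MemoryQueue` for a database-backed queue with the same two methods would leave the policy class unchanged, which is the point of the separation.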

Data Flow

The data flow in Frontera is controlled by the Frontier Manager. All data passes through the manager-middlewares-backend scheme as follows:

  1. The frontier is initialized with a list of seed requests (seed URLs) as the entry point for the crawl.
  2. The fetcher asks for a list of requests to crawl.
  3. Each URL is fetched, and the frontier is notified of the crawl result as well as of the extracted data the page contains. If anything went wrong during the crawl, the frontier is also informed of it.

Once all URLs have been crawled, steps 2-3 are repeated until the frontier's end condition is reached. Each repetition of the loop (steps 2-3) is called a frontier iteration.
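
The loop described above can be sketched in a few lines. `SimpleFrontier` is a hypothetical FIFO frontier written for this example, not Frontera's FrontierManager. The `fetch` function looks pages up in a made-up in-memory link graph instead of using the network.

```python
from collections import deque

# A tiny in-memory "web": each URL maps to the links found on that page.
# All URLs here are made up for the example.
FAKE_WEB = {
    "http://site.example/": ["http://site.example/a", "http://site.example/b"],
    "http://site.example/a": ["http://site.example/b"],
    "http://site.example/b": [],
}


class SimpleFrontier:
    """Hypothetical FIFO frontier, not Frontera's FrontierManager."""

    def __init__(self, max_pages):
        self.queue = deque()
        self.seen = set()
        self.crawled = []
        self.errors = []
        self.max_pages = max_pages

    def add_seeds(self, urls):
        self._enqueue(urls)

    def get_next_requests(self, max_n):
        batch = []
        while self.queue and len(batch) < max_n:
            batch.append(self.queue.popleft())
        return batch

    def page_crawled(self, url, links):
        self.crawled.append(url)
        self._enqueue(links)

    def request_error(self, url, error):
        self.errors.append((url, error))

    def finished(self):
        # End condition: nothing left to crawl or the page budget is spent.
        return not self.queue or len(self.crawled) >= self.max_pages

    def _enqueue(self, urls):
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)


def fetch(url):
    """Stand-in fetcher: looks the page up instead of using the network."""
    if url not in FAKE_WEB:
        raise LookupError("page not found")
    return FAKE_WEB[url]


frontier = SimpleFrontier(max_pages=10)
frontier.add_seeds(["http://site.example/", "http://site.example/missing"])  # step 1

iterations = 0
while not frontier.finished():
    for url in frontier.get_next_requests(max_n=2):  # step 2
        try:
            frontier.page_crawled(url, fetch(url))  # step 3, success
        except LookupError as exc:
            frontier.request_error(url, str(exc))  # step 3, failure
    iterations += 1  # one frontier iteration completed

print(iterations, frontier.crawled, frontier.errors)
```

In this sketch the end condition is an empty queue or a page budget. A real crawl could use any other stopping rule.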

Distributed [3]

The same Frontera Manager pipeline is used in all Frontera processes when running in distributed mode.

The overall system forms a closed circle, and all components work as daemons in infinite cycles. A message bus is responsible for transmitting messages between the components, persistent storage and the fetchers (when combined with extraction, these processes are called spiders). The transport and storage layers are abstracted, so users can plug in their own transport. The distributed backend run mode has instances of three types:

  • Spiders or fetchers, implemented using Scrapy. Responsible for resolving DNS queries, getting content from the Internet and doing link (or other data) extraction from content.
  • Strategy workers. Run the crawling strategy code: scoring the links, deciding whether a link needs to be scheduled and when to stop crawling.
  • DB workers. Store all the metadata, including scores and content, and generate new batches for downloading by spiders.

Such a design allows online operation: the crawling strategy can be changed without stopping the crawl. The crawling strategy can also be implemented as a separate module containing the logic for checking the crawl stopping condition, URL ordering, and the scoring model.

Frontera is polite to web hosts by design and each host is downloaded by no more than one spider process. This is achieved by stream partitioning.
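
One common way to obtain this property is to derive the partition from a stable hash of the host name, so that every URL of a host lands in the same partition and is therefore consumed by the same spider. The sketch below assumes this approach. The function name `partition_for` and the choice of CRC32 are illustrative and are not taken from the source.

```python
import zlib
from urllib.parse import urlsplit


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition using a stable hash of its host name."""
    host = urlsplit(url).hostname or ""
    # crc32 is deterministic across processes, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


urls = [
    "http://example.com/a",
    "http://example.com/b?page=2",
    "https://example.org/",
]
assignments = {url: partition_for(url, num_partitions=4) for url in urls}
print(assignments)
```

Because only the host name is hashed, the path, the query string and the scheme do not influence which spider receives a URL.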

Data flow

The seed URLs defined by the user in the spiders are propagated to the strategy workers and DB workers by means of the spider log stream. Strategy workers decide which pages to crawl using the state cache, assign a score to each page, and send the results to the scoring log stream.

The DB worker stores all kinds of metadata, including content and scores. It also checks the spiders' consumer offsets, generates new batches when needed, and sends them to the spider feed stream. Spiders consume these batches, downloading each page and extracting links from it. The links are then sent to the spider log stream, where they are stored and scored. In this way the flow repeats indefinitely.
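
The cycle through the three streams can be imitated in a single process. In the sketch below the streams are plain in-memory queues, the scores are arbitrary constants, and the link graph is made up. All function and variable names are assumptions. The sketch also simplifies the flow: only the strategy worker reads the spider log here, and consumer offsets are not modelled.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the three message bus streams.
spider_log = deque()    # spiders -> strategy workers
scoring_log = deque()   # strategy workers -> DB workers
spider_feed = deque()   # DB workers -> spiders

# Made-up link graph used instead of real downloads.
LINKS = {"http://s.example/": ["http://s.example/1", "http://s.example/2"]}


def strategy_worker(state_cache):
    """Score every URL not seen before and emit it to the scoring log."""
    while spider_log:
        event, url = spider_log.popleft()
        if url not in state_cache:
            state_cache[url] = "queued"
            score = 1.0 if event == "seed" else 0.5  # arbitrary scores
            scoring_log.append((url, score))


def db_worker(storage, batch_size):
    """Persist scores, then emit the best-scored URLs as a new batch."""
    while scoring_log:
        url, score = scoring_log.popleft()
        storage[url] = score
    best = sorted(storage, key=storage.get, reverse=True)[:batch_size]
    for url in best:
        del storage[url]  # a batched URL leaves the pending set
    if best:
        spider_feed.append(best)


def spider(downloaded):
    """Consume batches, pretend to download pages, report extracted links."""
    while spider_feed:
        for url in spider_feed.popleft():
            downloaded.append(url)
            for link in LINKS.get(url, []):
                spider_log.append(("link", link))


state_cache, storage, downloaded = {}, {}, []
spider_log.append(("seed", "http://s.example/"))
for _ in range(3):  # a real deployment loops indefinitely
    strategy_worker(state_cache)
    db_worker(storage, batch_size=2)
    spider(downloaded)

print(downloaded)
```

In a distributed deployment each of the three functions would run as a separate long-lived process, with the queues replaced by message bus streams.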

Battle testing

At Scrapinghub Ltd. there is a crawler that processes 1600 requests per second at peak, built primarily with Frontera, using Kafka as the message bus and HBase as storage for link states and the link database. The crawler operates in cycles; each cycle takes 1.5 months and results in 1.7 billion downloaded pages.[4]

A crawl of the Spanish internet resulted in 46.5 million pages in 1.5 months on an AWS cluster with 2 spider machines.[5]

Used by

Frontera is used by several companies

History

The first version of Frontera operated in a single process, as part of a custom scheduler for Scrapy, using an on-disk SQLite database to store link states and the queue. It was able to crawl for days. After reaching a noticeable volume of links, it began to spend more and more time on SELECT queries, making the crawl inefficient. During this period, Frontera was developed under DARPA's Memex program and was included in its catalog of open source projects.[6]

In 2015, subsequent versions of Frontera used HBase for storing the link database and queue. The application was split into two parts: backend and fetcher. The backend was responsible for communicating with HBase by means of Kafka, while the fetcher only read a Kafka topic with URLs to crawl and produced crawl results to another topic consumed by the backend, thus creating a closed cycle. The first priority queue prototype suitable for web-scale crawling was implemented during that time. The queue produced batches with limits on the number of hosts and the number of requests per host.
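
The batching rule mentioned above, a cap on the number of hosts and on the requests per host, can be sketched as follows. This is an illustration written for this article, not the prototype's code. The function name `next_batch`, its parameters and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def next_batch(queue, max_hosts, max_per_host):
    """Pick URLs in priority order while capping hosts and URLs per host.

    `queue` is a list of (score, url) pairs; selected items are removed
    from it and the remaining items stay queued for a later batch.
    """
    per_host = {}
    batch, leftover = [], []
    for score, url in sorted(queue, key=lambda item: item[0], reverse=True):
        host = urlsplit(url).hostname
        taken = per_host.get(host, 0)
        is_new_host = host not in per_host
        if (is_new_host and len(per_host) >= max_hosts) or taken >= max_per_host:
            leftover.append((score, url))
            continue
        per_host[host] = taken + 1
        batch.append(url)
    queue[:] = leftover
    return batch


queue = [
    (0.9, "http://a.example/1"),
    (0.8, "http://a.example/2"),
    (0.7, "http://a.example/3"),
    (0.6, "http://b.example/1"),
    (0.5, "http://c.example/1"),
]
batch = next_batch(queue, max_hosts=2, max_per_host=2)
print(batch, queue)
```

Capping requests per host keeps a single large site from filling a whole batch, even when its pages have the highest scores.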

The next significant milestone in Frontera's development was the introduction of the crawling strategy and the strategy worker, along with the abstraction of the message bus. It became possible to code a custom crawling strategy without dealing with the low-level backend code that operates the queue. An easy way to specify which links should be scheduled, when, and with what priority made Frontera a true crawl frontier framework. Kafka was quite a heavy requirement for small crawlers, and the message bus abstraction made it possible to integrate almost any messaging system with Frontera.

References

  1. Sibiryakov, Alexander (22 Jun 2015). "What is better - Scrapy or Apache Nutch?". Quora.
  2. Dowinton, Richard (15 Apr 2015). "Frontera: the brain behind the crawls". Scrapinghub blog.
  3. Sibiryakov, Alexander (8 Aug 2015). "Distributed Frontera: web crawling at large scale". Scrapinghub blog.
  4. Sibiryakov, Alexander (29 Mar 2017). "Frontera: архитектура фреймворка для обхода веба и текущие проблемы". Habrahabr.
  5. Sibiryakov, Alexander (15 Oct 2015). "frontera-open-source-large-scale-web-crawling-framework". Speakerdeck.
  6. "Open Catalog, Memex (Domain-Specific Search)".
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.