Overview


Crawling with a headless browser is different from traditional approaches. Conventional spiders give you control over each request and the order in which requests are made. Browser-less spiders integrate the network/transport level into the framework: roughly speaking, they program the transport level to traverse a website in a specific way. You control how they access static assets (if you need them at all) and how they load resources beyond the plain HTML that comes back in the response.


Browsers do not give you such flexibility. They can do much more than browser-less spiders (render DOM changes, execute JavaScript, establish WebSocket connections), but usually you cannot control how they load static assets, queue those requests, or apply additional logic specific to web scraping. For example, it is hard to force a browser to exclude files based on file extension, ignore certain paths, provide HTTP Basic Auth credentials, modify the cookie jar on the fly, or append special headers. Browsers were created to show resources to end users, and their API is quite limited.


Crawlera is an HTTP proxy that supports proxy authorization and is configured via special X-Headers. This works well for browser-less spiders, which usually offer a straightforward way of using the service, but it is really tricky to configure headless browsers to use Crawlera.


Crawlera knows how to work with browser workloads, but to simplify configuration we recommend using crawlera-headless-proxy, a self-hosted complementary tool that separates the proxy interface from Crawlera configuration.


This tutorial covers crawlera-headless-proxy installation, usage, and configuration. We also provide examples of how to use several headless browsers with this tool.


crawlera-headless-proxy


crawlera-headless-proxy is a complementary proxy which is distributed as a statically linked binary. This tool was created as a self-hosted service that you deploy alongside your grid of headless browsers.


The main idea is to delegate Crawlera configuration to headless-proxy, which then exposes the simplest possible HTTP proxy interface to a headless browser, so users do not have to worry about propagating Crawlera settings to the browser itself.


crawlera-headless-proxy also provides a set of common features that are usually required for web scraping.


Please note that crawlera-headless-proxy performs man-in-the-middle (MITM) interception of TLS traffic. Unfortunately, there is no way around this: to append headers to secure requests, or to filter them against an adblock list, the proxy has to hijack them. You can use the embedded TLS keys or provide your own.
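

To illustrate what this means in practice, any plain HTTP client pointed at headless-proxy goes through Crawlera with no Crawlera-specific configuration on the client side. The sketch below assumes headless-proxy is listening on localhost:3128 and disables certificate verification because of the MITM interception:

import requests

# All Crawlera settings live in crawlera-headless-proxy;
# the client only needs the proxy address (localhost:3128 is an assumption).
proxies = {
    "http": "http://localhost:3128",
    "https": "http://localhost:3128",
}

# verify=False because headless-proxy re-signs TLS traffic (MITM)
response = requests.get("https://example.com", proxies=proxies, verify=False)
print(response.status_code)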


Installation


The source code of this tool is available on GitHub: https://github.com/scrapinghub/crawlera-headless-proxy

It is also available on Docker Hub: https://hub.docker.com/r/scrapinghub/crawlera-headless-proxy/


Pre-made binaries


crawlera-headless-proxy is distributed as a statically linked binary with no runtime dependencies. To obtain the latest version, please check the Releases page on GitHub: https://github.com/scrapinghub/crawlera-headless-proxy/releases


If you have Go installed (version >= 1.11), then you can install this tool with the following command:

$ go get github.com/scrapinghub/crawlera-headless-proxy


If you use macOS and Homebrew, please install with:

$ brew install https://raw.githubusercontent.com/scrapinghub/crawlera-headless-proxy/master/crawlera-headless-proxy.rb


Also, there is a Docker image with headless-proxy. To obtain it, please execute the following command:

$ docker pull scrapinghub/crawlera-headless-proxy


Installation from source code


To install from source, please refer to the official README at https://github.com/scrapinghub/crawlera-headless-proxy.


Configuration


crawlera-headless-proxy can be configured in a number of ways:

  1. Config file (in TOML)
  2. Command line parameters
  3. Environment variables

You can find a comprehensive example with all options here: https://github.com/scrapinghub/crawlera-headless-proxy/blob/master/config.toml

For all options, their meanings, and configuration details, please see the official README: https://github.com/scrapinghub/crawlera-headless-proxy.


Usage


crawlera-headless-proxy provides sensible defaults, so the minimal example is:

$ crawlera-headless-proxy -a MYAPIKEY

where MYAPIKEY is your Crawlera API key, which you can find on your Crawlera user page.


The same example with Docker:

$ docker run --rm -it --name crawlera-headless-proxy scrapinghub/crawlera-headless-proxy -a MYAPIKEY


Or, if you want to pass the API key via an environment variable:

$ docker run --rm -it --name crawlera-headless-proxy -e CRAWLERA_HEADLESS_APIKEY=MYAPIKEY scrapinghub/crawlera-headless-proxy


If you prefer to use a configuration file only, please run the tool with the following command line:

$ crawlera-headless-proxy -c /path/to/my/config/file.toml


If you use Docker, please mount the configuration file to /config.toml within the container. Example:

$ docker run --rm -it --name crawlera-headless-proxy -v /path/to/my/config/file.toml:/config.toml:ro scrapinghub/crawlera-headless-proxy


Headless browser options


There are several ways to drive a headless browser. The most common options are Selenium and Puppeteer; another great option is Splash. Please choose whichever you prefer.


You can find examples of how to use headless browsers with Crawlera and crawlera-headless-proxy in the examples directory: https://github.com/scrapinghub/crawlera-headless-proxy/tree/master/examples


Let’s assume that you have crawlera-headless-proxy up and running. For the sake of simplicity, let’s assume it is accessible at IP 10.11.12.13 and port 3128.


Splash


Splash is an open source project by Scrapinghub that provides an HTTP API to a WebKit-based browser. It can execute stateless browser automation scripts written in Lua.


If you want to render HTML using Splash with the render.html endpoint, just pass the proxy (proxy=http://10.11.12.13:3128) along with the other parameters. If you need to use Crawlera conditionally, you need a custom Lua script. Please find the simplest example below:


function main(splash, args)
  if args.proxy_host ~= nil and args.proxy_port ~= nil then
    splash:on_request(function(request)
      request:set_proxy{
        host = args.proxy_host,
        port = args.proxy_port,
      }
    end)
  end
 
  splash:set_result_content_type("text/html; charset=utf-8")
  assert(splash:go(args.url))
  return splash:html()
end


This will activate Crawlera for the request if you propagate the proxy_host and proxy_port parameters to the execute endpoint.
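

For illustration, here is a minimal sketch of how you might call Splash from Python with both approaches. It assumes Splash is listening on localhost:8050, crawlera-headless-proxy is at 10.11.12.13:3128 as above, the target URL is illustrative, and the Lua script above is stored in a hypothetical file crawlera.lua:

import requests

SPLASH = "http://localhost:8050"      # Splash HTTP API address (an assumption)
PROXY = "http://10.11.12.13:3128"     # crawlera-headless-proxy from this tutorial

# Unconditional use of Crawlera: just pass the proxy to render.html
html = requests.get(
    SPLASH + "/render.html",
    params={"url": "https://example.com", "proxy": PROXY},
).text

# Conditional use of Crawlera: send the Lua script above to the execute endpoint
with open("crawlera.lua") as f:       # hypothetical file holding the script above
    lua_script = f.read()

html = requests.post(
    SPLASH + "/execute",
    json={
        "lua_source": lua_script,
        "url": "https://example.com",
        "proxy_host": "10.11.12.13",
        "proxy_port": 3128,
    },
).text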


Puppeteer


Puppeteer is the official project that provides a Node.js API for headless Chrome. Configuration is simple:


const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  ignoreHTTPSErrors: true,
  args: ['--proxy-server=10.11.12.13:3128']
});


Pyppeteer


Pyppeteer is an unofficial port of Puppeteer to Python. Its API is quite similar to the JavaScript one:


browser = await pyppeteer.launch(
    ignoreHTTPSErrors=True,
    args=["--proxy-server=10.11.12.13:3128"]
)
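

For example, once the browser is launched this way, a page fetch goes through crawlera-headless-proxy transparently. A minimal sketch (assuming the code runs inside an async function; the URL is illustrative):

page = await browser.newPage()
await page.goto("https://example.com")   # illustrative URL
html = await page.content()              # rendered HTML, fetched through the proxy
await browser.close()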


Selenium


Selenium is a browser automation project built around the WebDriver API. Almost all browsers implement this API, so you are not limited to a single browser. A browser is configured via Selenium capabilities. Here is an example of how to configure Chrome with Selenium Grid:

from selenium import webdriver

CRAWLERA_HEADLESS_PROXY = "10.11.12.13:3128"
profile = webdriver.DesiredCapabilities.CHROME.copy()
 
profile["proxy"] = {
  "httpProxy": CRAWLERA_HEADLESS_PROXY,
  "ftpProxy": CRAWLERA_HEADLESS_PROXY,
  "sslProxy": CRAWLERA_HEADLESS_PROXY,
  "noProxy": None,
  "proxyType": "MANUAL",
  "class": "org.openqa.selenium.Proxy",
  "autodetect": False
}
 
profile["acceptSslCerts"] = True
driver = webdriver.Remote("http://localhost:4444/wd/hub", profile)


As you can see, the configuration is quite similar to the other headless browsers: you just need to propagate the HTTP proxy settings.
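

For example, once the remote driver is created, fetching a page through the proxy is just a regular Selenium call (the URL is illustrative):

driver.get("https://example.com")   # fetched via crawlera-headless-proxy
html = driver.page_source           # rendered HTML
driver.quit()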


Integration with Scrapy


If you want to integrate Selenium with Scrapy, please use the scrapy-selenium plugin: https://bitbucket.org/scrapinghub/scrapy_selenium/src/master/


To install it with pip, please run the following command:

$ pip install -e git+https://bitbucket.org/scrapinghub/scrapy_selenium.git#egg=scrapy_selenium


Update your settings.py with the following lines:

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
 
SELENIUM_GRID_URL = 'http://localhost:4444/wd/hub'  # Example for local grid with docker-compose
SELENIUM_NODES = 3  # Number of nodes(browsers) you are running on your grid
SELENIUM_CAPABILITIES = DesiredCapabilities.CHROME  # Example for Chrome
SELENIUM_PROXY = 'http://10.11.12.13:3128'  # Address of your crawlera-headless-proxy instance
 
# You also need to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_selenium.SeleniumDownloadHandler",
    "https": "scrapy_selenium.SeleniumDownloadHandler",
}


An example of a spider that uses scrapy_selenium:

from scrapy import Spider, Request
from scrapy_selenium import SeleniumRequest
 
class SomeSpider(Spider):
    ...
    def parse(self, response):
        ...
        yield Request(url, callback=self.some_parser)  # This will be handled just like any scrapy request
 
    def some_parser(self, response):
        ...
        yield SeleniumRequest(some_url, callback=self.other_parser, driver_callback=self.process_webdriver)  # This will be handled by Selenium Grid
 
    def process_webdriver(self, driver):
        ...
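

The driver argument passed to driver_callback gives you access to the live browser session. As a rough sketch (assuming it behaves like a regular Selenium webdriver), the body of process_webdriver could interact with the rendered page before parsing:

    def process_webdriver(self, driver):
        # Sketch: scroll down to trigger lazily loaded content
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")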