Using Crawlera with Splash is possible, but you have to keep some things in mind before integrating them.


Unlike a standard proxy, Crawlera is designed for crawling: it throttles request speed to keep users from getting banned and to avoid putting too much load on websites. This throttling makes Splash slow when it is used together with Crawlera.


When you access a web page in a browser (like Splash), you typically have to download many resources to render it (images, CSS styles, JavaScript code, etc.) and each resource is fetched by a different request against the site. Crawlera will throttle each request separately, which means that the load time of the page will increase dramatically.


To keep page loads from becoming too slow, you should avoid unnecessary requests. You can do so by:

  • Disabling images in Splash
  • Blocking requests to advertisement and tracking domains
  • Not using Crawlera for subresource requests when not necessary (for example, you probably don't need Crawlera to fetch jQuery from a static CDN)
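
The three strategies above amount to a per-request decision: drop the request, fetch it without Crawlera, or send it through Crawlera. In practice this decision is made inside the Lua script that Splash runs, and images are usually turned off through Splash's own image setting. The Python sketch below only models the decision logic; the function name, the domain lists, and the extension check are hypothetical and chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical lists for illustration; tune them for the site you crawl.
BLOCKED_DOMAINS = {"doubleclick.net", "google-analytics.com"}
DIRECT_DOMAINS = {"code.jquery.com", "cdnjs.cloudflare.com"}
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def _matches(host, domains):
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_request(url):
    """Return 'block', 'direct' or 'crawlera' for a subresource URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if _matches(host, BLOCKED_DOMAINS):
        return "block"      # ads and trackers: never fetched
    if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
        return "block"      # stand-in for disabling images
    if _matches(host, DIRECT_DOMAINS):
        return "direct"     # static CDN: fetched without Crawlera
    return "crawlera"       # everything else goes through the proxy
```

Only the requests classified as 'crawlera' would be throttled, so the fewer of them a page needs, the faster it renders.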


How to integrate them

To make Splash and Crawlera work together, you'll need to pass a Lua script similar to this example to Splash's /execute endpoint. This script will configure Splash to use Crawlera as a proxy and will also perform some optimizations, such as disabling images and avoiding some sorts of requests. It will also make sure that the Splash requests go through the same IP address, by creating a Crawlera session.
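
To make the session part more concrete, here is a rough Python model of what the script does with Crawlera's X-Crawlera-Session header: the first request asks for a new session, and every later request reuses the id that came back in the response headers. The class and method names are hypothetical; the real logic lives in the Lua script, in callbacks that Splash runs on each request and response.

```python
SESSION_HEADER = "X-Crawlera-Session"


class CrawleraSession:
    """Tracks the session id that the proxy hands back."""

    def __init__(self):
        # 'create' asks Crawlera to open a new session on the first request.
        self.session_id = "create"

    def request_headers(self):
        """Headers to attach to the next outgoing request."""
        return {SESSION_HEADER: self.session_id}

    def on_response_headers(self, headers):
        """Remember the session id if the response carries one."""
        session_id = headers.get(SESSION_HEADER)
        if session_id:
            self.session_id = session_id
```

Because every subresource request carries the same session id, the page and its resources are fetched from the same outgoing IP address.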


In order to make it work, you have to provide your Crawlera API key via the crawlera_user argument for your Splash requests (example). Or, if you prefer, you could hardcode your API key in the script.
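
As a sketch of what such a request could look like outside of Scrapy, the snippet below builds a JSON POST for the /execute endpoint using only the Python standard library. The Splash URL, the one-line stand-in script, the placeholder key, and the function name are assumptions for illustration. The request is built but not sent.

```python
import base64
import json
import urllib.request

# Assumptions for illustration: a local Splash instance and placeholder values.
SPLASH_URL = "http://localhost:8050"
LUA_SOURCE = "function main(splash) return 'ok' end"  # stand-in for crawlera.lua
CRAWLERA_APIKEY = "<your Crawlera API key>"


def build_execute_request(page_url, splash_apikey=None):
    """Build (but do not send) a POST request for Splash's /execute endpoint."""
    payload = {
        "url": page_url,
        "lua_source": LUA_SOURCE,
        # Extra arguments are exposed to the script as splash.args.<name>.
        "crawlera_user": CRAWLERA_APIKEY,
    }
    headers = {"Content-Type": "application/json"}
    if splash_apikey:
        # HTTP Basic auth with the API key as user and an empty password.
        token = base64.b64encode(f"{splash_apikey}:".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    return urllib.request.Request(
        f"{SPLASH_URL}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would send it to Splash. The Authorization header is only needed when your Splash instance requires an API key.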


Using Splash + Crawlera with Scrapy via scrapy-splash

Let's dive into an example to see how to use Crawlera and Splash in a Scrapy spider via scrapy-splash (for the full working example, check this repo).


This is the project structure:

├── scrapy.cfg
├── setup.py
└── splash_crawlera_example
    ├── __init__.py
    ├── settings.py
    ├── scripts
    │   └── crawlera.lua
    └── spiders
        ├── __init__.py
        └── quotes-js.py


A few details about the files listed above:

  • settings.py: contains the configurations for both Crawlera and Splash, including the API keys required for authorization (note that Crawlera should be disabled in the settings, since routing requests to Crawlera is handled by the Lua script mentioned below).
  • scripts/crawlera.lua: the Lua script that integrates Splash and Crawlera.
  • spiders/quotes-js.py: the spider that needs Splash and Crawlera for its requests. This spider loads the Lua script into a string and sends it along with its requests.
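
For orientation, a settings.py along these lines might look like the sketch below. The SPLASH_APIKEY and CRAWLERA_APIKEY names match the settings the spider reads; the Splash URL and the key values are placeholders, and the middleware entries are the standard scrapy-splash wiring from its README rather than a copy of the example project's file.

```python
# settings.py (sketch). Values in angle brackets are placeholders.

# Splash instance to send requests to (hypothetical local URL).
SPLASH_URL = "http://localhost:8050"
SPLASH_APIKEY = "<your Splash API key>"

# Crawlera key, read by the spider and passed to the Lua script.
CRAWLERA_APIKEY = "<your Crawlera API key>"
# Keep the Crawlera middleware off: the Lua script does the proxying.
CRAWLERA_ENABLED = False

# Standard scrapy-splash wiring, as given in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```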


In our spider, we load the Lua script into a string in the __init__ method:


# requires `import pkgutil` at the top of the spider module
self.LUA_SOURCE = pkgutil.get_data(
    'splash_crawlera_example', 'scripts/crawlera.lua'
).decode('utf-8')


Note: to load the script from a file both locally and on Scrapy Cloud, you have to include the Lua script in your package's setup.py file, as shown below:


from setuptools import setup, find_packages

setup(
    name = 'project',
    version = '1.0',
    packages = find_packages(),
    package_data = {'splash_crawlera_example': ['scripts/*.lua',]},
    entry_points = {'scrapy': ['settings = splash_crawlera_example.settings']},
)


Once we have the Lua script loaded in our spider, we pass it as an argument to the SplashRequest objects, along with Crawlera's and Splash's credentials (authorization with Splash can also be done via the http_user setting):


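from scrapy_splash import SplashRequest
from w3lib.http import basic_auth_header

# inside a spider callback, e.g. start_requests()
yield SplashRequest(
    url='http://quotes.toscrape.com/js',
    endpoint='execute',
    splash_headers={
        # Splash API key as the Basic auth user, with an empty password
        'Authorization': basic_auth_header(self.settings['SPLASH_APIKEY'], ''),
    },
    args={
        'lua_source': self.LUA_SOURCE,
        'crawlera_user': self.settings['CRAWLERA_APIKEY'],
    },
    # tell Splash to cache the Lua script, to avoid sending it for every request
    cache_args=['lua_source'],
)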


And that's it. Now, this request will go through your Splash instance, and Splash will use Crawlera as its proxy to download the pages and resources you need.


Customizing the Lua Script

You can go further and customize the Lua script to fit your exact requirements. In the example provided here, we commented out some lines that filter requests to unneeded resources or undesired domains. You can uncomment those and customize them to your own needs. Check out Splash's official docs to learn more about scripting.


A working example for Scrapy Cloud

You can find a working example of a Scrapy project using Splash and Crawlera in this repository. The example is ready to be executed on Scrapy Cloud.