The Auto Throttle addon makes spiders crawl target sites more cautiously by dynamically adjusting request concurrency and delay according to the site's response latency and user-defined parameters. For more details, see the Scrapy AutoThrottle documentation.

This addon is enabled by default in every Scrapy Cloud project. The basic settings controlling its behaviour are:

  • CONCURRENT_REQUESTS_PER_DOMAIN - sets the maximum number of concurrent requests sent to the same domain (default value is 8)
  • DOWNLOAD_DELAY - sets the minimum download delay (in seconds) between each burst of requests (default value is 0)
  • AUTOTHROTTLE_ENABLED - enables or disables the Auto Throttle addon (default value is True, i.e. enabled)
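For reference, the defaults listed above can be written out as a settings fragment. This is only a sketch, assuming the project defines these values in its Scrapy settings module; the values are the ones stated in this article.

```python
# settings.py (sketch): the defaults described in this article, written out explicitly.
AUTOTHROTTLE_ENABLED = True          # addon enabled (the Scrapy Cloud default per this article)
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # upper limit on concurrent requests per domain
DOWNLOAD_DELAY = 0                   # lower limit on the download delay, in seconds
```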

Adjusting Auto Throttle settings

The right settings depend on your needs; no values work for every website. The defaults are generally a good starting point, and most servers tolerate them. Even so, the target site may block the crawler, in which case you need to slow down the crawling rate. Conversely, you may want the bot to crawl faster; in that case, be aware that the risk of being blocked increases.

The crawling rate can be slowed down by setting the maximum concurrency, CONCURRENT_REQUESTS_PER_DOMAIN, to 1 and increasing the minimum download delay, DOWNLOAD_DELAY, as needed. The maximum effective crawling rate is limited in practice by the target server's response rate. You may try to speed crawling up by raising the maximum concurrency, although in reality this has no significant effect, as effective concurrency will hardly exceed 2 for most sites.
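A conservative configuration along these lines might look like the fragment below. It is a sketch, not a recommendation: the 5-second delay is an arbitrary illustrative value, and the fragment assumes the settings live in the project's Scrapy settings module.

```python
# settings.py (sketch): slow the crawl down for a site that blocks easily.
AUTOTHROTTLE_ENABLED = True          # keep the addon on; the values below act as limits
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one request in flight per domain
DOWNLOAD_DELAY = 5                   # illustrative minimum delay in seconds; tune per site
```

If the site still blocks the spider, raising DOWNLOAD_DELAY further is the remaining knob, since concurrency is already at its minimum.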

Because Auto Throttle dynamically adjusts delay and concurrency according to the website's response delay, the parameters only define limits; they do not force fixed values. The minimum download delay keeps the effective download delay from dropping below it during crawling, and the maximum concurrency keeps the effective concurrency from rising above it. If you need fixed values, disable Auto Throttle, and with it the adjustment of effective parameters during crawling, by setting AUTOTHROTTLE_ENABLED to False. The settings CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY can then be set to the required values.
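The relationship between the configured limits and the effective values can be illustrated with a toy model. This is not Scrapy's implementation: the function `effective_limits` and its `suggested_delay` and `suggested_concurrency` arguments are hypothetical names standing in for whatever the addon would choose on its own at a given moment. The sketch only mirrors the behaviour described in the paragraph above.

```python
def effective_limits(settings, suggested_delay, suggested_concurrency):
    """Toy model of the limits described above (hypothetical helper, not Scrapy code).

    suggested_delay / suggested_concurrency stand in for the values the addon
    would pick by itself from the observed response delay.
    """
    min_delay = settings.get("DOWNLOAD_DELAY", 0)
    max_concurrency = settings.get("CONCURRENT_REQUESTS_PER_DOMAIN", 8)

    # With the addon disabled, the configured values are used as-is.
    if not settings.get("AUTOTHROTTLE_ENABLED", True):
        return min_delay, max_concurrency

    # With the addon enabled, the settings only bound the adjusted values.
    delay = max(min_delay, suggested_delay)                    # never below the minimum
    concurrency = min(max_concurrency, suggested_concurrency)  # never above the maximum
    return delay, concurrency


throttled = {
    "AUTOTHROTTLE_ENABLED": True,
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
}
fixed = dict(throttled, AUTOTHROTTLE_ENABLED=False)

print(effective_limits(throttled, 0.5, 6))  # (2, 4): both suggestions hit the limits
print(effective_limits(throttled, 3.0, 1))  # (3.0, 1): suggestions are within the limits
print(effective_limits(fixed, 3.0, 1))      # (2, 4): suggestions ignored, fixed values
```

The first call shows the limits taking over when the addon would otherwise crawl faster than allowed; the last shows that with AUTOTHROTTLE_ENABLED set to False the configured values stop being limits and become the values in force.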

But be warned: you do so at your own risk. As stated before, increasing the crawling rate considerably increases the probability of being blocked by the target site or of having your Scrapinghub account suspended.