Under review
joaquin 7 days ago in Crawlera • updated 2 days ago

I am running several crawlers on a site, all of them through Crawlera, and I am getting a number of 429 error statuses (which means Too Many Requests), so Crawlera doesn't seem to be adjusting its throttling to accommodate these errors.


Does your throttling algorithm consider 429 status codes?


I am using Scrapy plus the Crawlera middleware, btw.
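For reference, a minimal sketch of the kind of Scrapy + scrapy-crawlera setup involved (the middleware path and priority follow the scrapy-crawlera docs; the API key and other values are placeholders, and adding 429 to RETRY_HTTP_CODES just makes sure those responses reach the retry logic):

```python
# settings.py (sketch; placeholder values)
DOWNLOADER_MIDDLEWARES = {
    # Enable the Crawlera downloader middleware (path/priority per the scrapy-crawlera docs)
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your Crawlera API key>"   # placeholder

# Make sure 429 responses are retried instead of silently dropped
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 408]
RETRY_TIMES = 5
```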


I am not from Crawlera, but I suppose that what you suggest would mean adding a kind of vulnerability to Crawlera. The only way Crawlera could throttle requests arriving from a spider faster than the limit would be to accumulate the excess until it can process it. That excess grows continuously if the spider systematically sends requests faster than the limit, which implies ever-growing buffering. Crawlera probably already does some buffering/throttling to some extent, but that buffer cannot grow forever, especially if the spider's rate is much higher than Crawlera's rate limit. Crawlera can't control the rate at which the spider sends requests to it; it can only discard the excess and send a signal back to the spider. That signal comes in the form of a 429 status response, indicating that the spider should slow down.
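To illustrate the "slow down when you see a 429" idea, here is a rough sketch of a Scrapy downloader middleware that bumps the delay of the offending download slot and re-schedules the request. It relies on the same Scrapy internals that AutoThrottle uses (engine.downloader.slots), so treat it as illustrative rather than a supported API; the step and cap values are arbitrary:

```python
import logging

logger = logging.getLogger(__name__)

class Backoff429Middleware:
    """Sketch: back off when a 429 response comes back (not an official API)."""

    BACKOFF_STEP = 5.0   # seconds added per 429; illustrative value
    MAX_DELAY = 120.0    # mirrors the 120 s soft limit quoted in the docs below

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        mw.crawler = crawler
        return mw

    def process_response(self, request, response, spider):
        if response.status != 429:
            return response
        # Increase the delay of the download slot that produced this 429,
        # using the same internals AutoThrottle adjusts.
        slot_key = request.meta.get("download_slot")
        slot = self.crawler.engine.downloader.slots.get(slot_key)
        if slot is not None:
            slot.delay = min(slot.delay + self.BACKOFF_STEP, self.MAX_DELAY)
            logger.warning("429 received; slot %r delay is now %.1fs",
                           slot_key, slot.delay)
        # Re-schedule the same request; dont_filter keeps the dupe filter
        # from dropping it, and the lower priority pushes it back a bit.
        return request.replace(dont_filter=True, priority=request.priority - 1)
```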

Hey thanks for the response, but if you go to Crawlera's site you'll read that they offer this throttling as a feature.

+1

Response from the Crawlera team:

There's a limit on how many concurrent requests a user can send, based on their plan. As you said, the excess requests are discarded with a 429 code from Crawlera to the client. The throttling happens within the "pool" or "bounds" of the plan limit: if a user has a C10 plan, Crawlera will process at most 10 concurrent requests, and these may be delayed before being sent to the target domain, but it won't accept more from the user to "queue up" until there's an available slot.
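In Scrapy terms, the practical consequence is to keep client-side concurrency at or below the plan's limit, so any excess queues in Scrapy's scheduler instead of being rejected by Crawlera with a 429. A sketch, assuming a C10 plan (adjust the numbers to your plan; disabling AutoThrottle and the download delay follows the usual scrapy-crawlera recommendation of letting Crawlera do the throttling):

```python
# settings.py fragment (sketch; numbers assume a C10 plan)
CONCURRENT_REQUESTS = 10            # never exceed the plan's concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 10
AUTOTHROTTLE_ENABLED = False        # let Crawlera handle the throttling
DOWNLOAD_DELAY = 0
```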

Gotcha. How can I know whether the 429 error is coming from Crawlera or from the site I'm scraping? Because I don't think I'm coming even close to the 200 concurrent requests my plan allows for.

Also, I read on Crawlera's site:


Crawlera’s default request limit is 5 requests per second (rps) for each website. There is a default delay of 200ms between each request and a default delay of 1 second between requests through the same slave. These delays can differ for more popular domains. If the requests per second limit is exceeded, further requests will be delayed for up to 15 minutes. Each request made after exceeding the limit will increase the request delay. If the request delay reaches the soft limit (120 seconds), then each subsequent request will contain X-Crawlera-Next-Request-In header with the calculated delay as the value.

Does this also happen when I go over the maximum number of concurrent requests my plan allows for? Do you start increasing the request delay, or is the 200 concurrent request limit fixed?
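One way to tell the two 429 sources apart, for what it's worth: Crawlera adds diagnostic headers to responses it generates itself, X-Crawlera-Error in particular, and the quote above mentions X-Crawlera-Next-Request-In. A sketch of a middleware that just logs them (header names are assumed from the Crawlera docs, so double-check against your own responses):

```python
import logging

logger = logging.getLogger(__name__)

class Log429SourceMiddleware:
    """Sketch: log whether a 429 looks proxy-generated or site-generated."""

    def process_response(self, request, response, spider):
        if response.status == 429:
            crawlera_error = response.headers.get(b"X-Crawlera-Error")
            next_in = response.headers.get(b"X-Crawlera-Next-Request-In")
            if crawlera_error:
                # Header present: the 429 was generated by Crawlera itself.
                logger.warning(
                    "429 from Crawlera (%s), next request in %s ms: %s",
                    crawlera_error.decode(),
                    next_in.decode() if next_in else "?",
                    request.url,
                )
            else:
                # No Crawlera error header: likely passed through from the site.
                logger.warning("429 passed through from the target site: %s",
                               request.url)
        return response
```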

Have you tested with different sites? If a site is particularly prone to banning, the 200 slots can fill very quickly and be freed very slowly, giving the impression that you have fewer concurrent requests available.

Yeah, I only get this problem with this site, and I am trying to find out why, but I am having a hard time getting the information I need to diagnose it. All I know is that I sometimes get 429 responses, but I have no visibility into what Crawlera is doing, or whether it's the site that is banning Crawlera proxies and that is why I am using up all my slots. Is there an account rep I can talk to who can look at my specific requests?

Regarding your last comment, that refers to the per-slave limitations, not the overall one. By default, each slave can be used 5 times per second for the same website. But Crawlera has a pool of thousands of IPs, so that is not related to the plan's max concurrency.

The quote I posted yesterday seems to say otherwise: the 5 requests per second is a global per-domain limit, while the limit per slave appears to be 1 request per second.

I think you are right. Still, concurrency is one thing and max rate is another. The best way to check is to observe that different sites respond differently: sites that ban more make requests take longer to be served, so concurrency slots are freed more slowly. If you were having issues with your concurrency, you would have the same issue with every site, not just one.

Check the Crawlera menu in the top bar of the user interface and select your Crawlera account. That will give you some of the information you want (including bans). In any case, I have notified the Crawlera team so they can provide further help. Which site are you talking about?

By 'issues with your concurrency' I meant that the service is not providing the concurrency you paid for. But the problem you have really is about concurrency: you are sending requests at a higher rate than Crawlera can serve, so you are exceeding the concurrency limit.

Hey, thanks a lot for helping me with this. This is the first site where I make use of sessions and slaves, so that may be related to the problem. I took care not to send more than 1 req/s (I actually send 1 request every 3 seconds per slave), as the Crawlera documentation indicates, but there may be something else I am not considering. The Crawlera dashboard doesn't show any bans.
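For completeness, sessions are created and reused through the X-Crawlera-Session header; a minimal sketch of that pattern (header names are from the Crawlera docs, while the spider name, URLs and the 3-second spacing are illustrative, and the spacing is only approximate because DOWNLOAD_DELAY applies per download slot, not per session):

```python
import scrapy

class SessionSpider(scrapy.Spider):
    """Sketch: create one Crawlera session and reuse it for later requests."""

    name = "session_example"          # illustrative
    custom_settings = {
        "DOWNLOAD_DELAY": 3,          # roughly 1 request every 3 s, as described above
        "CONCURRENT_REQUESTS": 1,     # one in-flight request for this single session
    }

    def start_requests(self):
        # Ask Crawlera to create a new session (i.e. pin a slave/outgoing IP).
        yield scrapy.Request(
            "https://example.com/",   # placeholder target
            headers={"X-Crawlera-Session": "create"},
            callback=self.parse,
        )

    def parse(self, response):
        # Crawlera echoes back the session id; reuse it so follow-up requests
        # go through the same slave.
        session = response.headers.get(b"X-Crawlera-Session", b"").decode()
        for url in ["https://example.com/page1", "https://example.com/page2"]:
            yield scrapy.Request(
                url,
                headers={"X-Crawlera-Session": session},
                callback=self.parse_page,
            )

    def parse_page(self, response):
        self.logger.info("Fetched %s via session %s",
                         response.url,
                         response.headers.get(b"X-Crawlera-Session"))
```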


I got an email from a Crawlera representative, so thank you for connecting us.

Also, the plan's max concurrency is overall account usage, which may be spread across crawling many different sites at the same time.