Answered

Cache and Pricing

Hello,

I'm new to web scraping, and although I've read a lot of tutorials, some questions are still open. I want to scrape about 1M housing offers daily from 7 domains (= sources), using Scrapy Cloud, Splash for screenshots, and Crawlera. Scrapy comes with an HTTP cache middleware, and my questions concern this cache mechanism and the pricing:

1.) Only a small percentage of the 1M housing offers are new or changed in a daily crawl. With the cache middleware (RFC 2616 policy) enabled, the crawler first checks the ETag or headers against the server (and then either uses the cache or fetches a fresh response). Do these ETag or header-only requests via Crawlera count as full/successful requests towards the quota (for pricing)? Or is it unnecessary to send the ETag/header-only request through Crawlera at all?
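For reference, the revalidation behaviour described above is switched on through Scrapy's HTTP cache settings. A minimal `settings.py` sketch (the cache directory name is just an example):

```python
# Enable Scrapy's HTTP cache with the RFC 2616 policy: the middleware sends
# conditional requests (If-None-Match / If-Modified-Since) and serves
# 304 "Not Modified" answers from the local cache instead of re-downloading.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_DIR = "httpcache"  # example directory, relative to the project
```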

2.) A screenshot is only necessary if a housing offer page is new or changed. As I understand it, a second request must be sent by Splash via Crawlera. Does this mean that a second request through Crawlera is required? Or do Splash and Scrapy Cloud share the same cache, so the second request is answered from it? Or does Crawlera cache a request for a short time, so that a second request is answered from that cache and doesn't count towards the quota?
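For context: a Splash screenshot is fetched from its `render.png` endpoint, which is a separate HTTP request and therefore not answered by Scrapy's cache middleware. A minimal sketch of building that request URL, assuming a Splash instance at `localhost:8050`:

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # assumed local Splash instance


def splash_png_url(page_url: str, wait: float = 0.5) -> str:
    """Build the Splash render.png URL that returns a PNG screenshot.

    Fetching this URL is its own HTTP request to Splash; it does not
    go through (or consult) Scrapy's HTTP cache middleware.
    """
    return f"{SPLASH_URL}/render.png?{urlencode({'url': page_url, 'wait': wait})}"
```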

Thanks in advance
Christian

Best Answer

1) Yes, every request routed through Crawlera is counted; however, only successful requests (HTTP 200, 301, 302) count towards the monthly quota.

2) If you're requesting the website with Scrapy + Crawlera to check whether the page has changed, and when it has changed you make a request via Splash that is also routed through Crawlera, then yes, that would be 2 requests. However, you don't have to route the Splash screenshot request through Crawlera.
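That flow could be sketched like this, with hypothetical callables standing in for the two transports (`fetch_headers` is routed through Crawlera, `take_screenshot` hits Splash directly, so only the first call counts towards the quota):

```python
def handle_offer(url, cached_etag, fetch_headers, take_screenshot):
    """Check the page via Crawlera; screenshot via Splash only on change.

    fetch_headers(url) -> (status, etag)   # hypothetical, routed via Crawlera
    take_screenshot(url) -> bytes          # hypothetical, direct to Splash
    """
    status, etag = fetch_headers(url)
    if status == 304 or etag == cached_etag:
        return None  # unchanged: one billed request, no screenshot needed
    return take_screenshot(url)  # new/changed: screenshot, not billed
```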


Answer


Thanks. So I have to calculate: (1.1M [1M housing offers plus 100k follow-up listing pages] × 31 [days in a month]) + (roughly 100k changed/new housing offers per month = 100k additional screenshot requests) = 34.2M requests per month. Phew!
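A quick back-of-the-envelope check of those numbers (all figures taken from the comment above):

```python
daily_pages = 1_000_000 + 100_000   # housing offers plus follow-up listing pages
days = 31
screenshots = 100_000               # changed/new offers expected per month
monthly_requests = daily_pages * days + screenshots
print(monthly_requests)  # 34200000, i.e. 34.2M
```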

I would suggest Crawlera Enterprise for that amount of requests in a month.
