I've been hired to scrape ~123,000 text files from a government website.
To get the URL for a file, I submit a search request to that same website with the file's unique ID, then scrape the file's URL out of the HTML the website sends back.
To accomplish my goal, I signed up for one month of Scrapy Cloud, and I signed up for Crawlera's 150,000-requests-per-month plan to avoid getting IP-banned.
My initial idea was to first crawl the target website to create a CSV containing the URLs for the files I want to download, and then to do a separate job that actually downloads the files.
I sent ~123,000 requests to get the URLs and successfully created the CSV.
I now want to download the files corresponding to the URLs I have in the CSV.
After researching how to download files with Scrapy Cloud, I realized the normal approach would have been to download the files in the initial job itself using a FilesPipeline, rather than splitting URL retrieval and file download into two separate jobs.
I'm now close to hitting my 150,000-requests-per-month Crawlera limit, and I want to know what plan I should sign up for to be able to download the files.
If a file download counts as part of the same request as the search request that finds the file's URL, I'd prefer to just modify my original job to also download the files and re-run it, because that looks like less work for me.
However, if the search query and the file download count as two separate requests, the plan I'd need becomes expensive enough that I might instead have a scraper download the URLs already in my CSV directly, rather than re-running the search queries.