Answered
abbyinohio 4 months ago in Portia • updated by Pablo Vaz (Support Engineer) 4 months ago 1

Help! I deployed a Portia project (https://portia.scrapinghub.com/#/projects/167699) to scrape Rotten Tomatoes. But when I run the job on Scrapy Cloud, it crawls hundreds of pages and scrapes only the first ten. I verified that none of my fields are required, and I can't find any error messages in the log file. Thank you!

Answer


Hi Abby,

If Portia scraped the first pages successfully but then started to fail, it could be a ban issue.
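
One way to check for this is to look at the response status counts in the job stats. The snippet below is only a rough sketch: it assumes you have copied the stats into a Python dict, it relies on Scrapy's `downloader/response_status_count/<code>` stat keys, and the set of "ban-like" status codes and the example numbers are assumptions for illustration, not values from your job.

```python
# Status codes that often indicate blocking or throttling.
# This set is an assumption; each site behaves differently.
BAN_LIKE_STATUSES = {403, 429, 503}

# Prefix Scrapy uses for per-status response counters in its stats.
PREFIX = "downloader/response_status_count/"


def summarize_statuses(stats):
    """Return (total_responses, ban_like_responses) from a stats dict."""
    total = 0
    ban_like = 0
    for key, count in stats.items():
        if not key.startswith(PREFIX):
            continue  # skip stats that are not response status counters
        status = int(key[len(PREFIX):])
        total += count
        if status in BAN_LIKE_STATUSES:
            ban_like += count
    return total, ban_like


# Hypothetical numbers for illustration only.
example_stats = {
    "downloader/response_status_count/200": 12,
    "downloader/response_status_count/403": 310,
    "item_scraped_count": 10,
}

total, ban_like = summarize_statuses(example_stats)
print(f"{ban_like} of {total} responses look like bans")
```

If most responses after the first few pages come back with codes like these, a ban is the likely cause.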

When you start a crawl, Portia crawls from a fixed IP, so the site can detect that you are sending many requests and start banning you.
We suggest using Crawlera, our intelligent proxy rotator. It can help you crawl more efficiently.

https://scrapinghub.com/crawlera/
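
If you later move the project to plain Scrapy, Crawlera is typically enabled through the `scrapy-crawlera` middleware. The fragment below is a minimal sketch of the relevant `settings.py` entries, assuming that package is installed. The API key is a placeholder you would replace with your own.

```python
# settings.py fragment (sketch; assumes the scrapy-crawlera package is installed)

# Register the Crawlera downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}

# Route requests through Crawlera.
CRAWLERA_ENABLED = True

# Placeholder value: replace with your own Crawlera API key.
CRAWLERA_APIKEY = "<your API key>"
```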


Also, if the site is complex to scrape, we recommend starting with Scrapy:

https://doc.scrapy.org/en/latest/intro/tutorial.html

Finally, you can always ask our sales team about our data on demand services. We can extract the data you need and deliver it to you in the most useful formats.


I hope these suggestions are helpful.

Kind regards,


Pablo
