Answered
Matthew Sealey 2 weeks ago in Portia • updated 2 weeks ago

I'm trying to use Portia to pull data from a set of possible pages generated from a list I have. I know a lot of the pages don't exist, but I don't know which ones.


So far, Portia gets stuck in a loop, reattempting the same pages multiple times, which eats into my request limit unnecessarily. Is there a way to limit Portia to, say, just two attempts at a single page before it gives up on it?
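For context: Portia spiders run on Scrapy under the hood, so if your setup lets you override Scrapy settings, the retry count can probably be capped there. A minimal sketch, assuming the project honours the standard Scrapy retry settings; the values shown are only an example:

# Sketch only: standard Scrapy retry settings, assuming the Portia project
# lets you override them (e.g. in the project or spider settings).
RETRY_ENABLED = True                            # keep retries, but cap them
RETRY_TIMES = 2                                 # at most 2 extra attempts per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]    # retry transient errors only; 404s are never retried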

Answer

Hi Matthew!

Have you tried adding an extra setting using regex? Perhaps you don't know exactly which pages are unnecessary, but you do know something about their URLs and can exclude them that way.


Check this article:

Portia > Regex


Best regards!

Pablo
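To illustrate the kind of filtering the article describes: in plain Scrapy terms (which Portia builds on), URL filtering comes down to allow/deny regexes, roughly like the sketch below. The patterns here are placeholders; you'd use whatever distinguishes the URLs you want to skip.

# Rough Scrapy equivalent of regex URL filtering; the patterns are placeholders.
from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=[r'/wanted-section/[^/]+/$'],     # only follow URLs that match this
    deny=[r'/404', r'/page-not-found'],     # skip anything that looks like an error page
)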

Satisfaction mark by Matthew Sealey 2 weeks ago

No, and in my situation I don't think that would help.


I'll explain a little more.


I have a list of possible URLs. I know that the page format will be something like:


http://www.mysite.co.uk/acondition/


After /acondition/ would be an item from the list I have.


In my case, if the page doesn't exist it redirects to a specific 404 Not Found page unique to the website. If it does exist, Portia correctly scrapes the data.


From what I can tell, Portia keeps retrying the pages it hasn't found. I'd like it to stop doing that... :D


Does that make sense? Does it help explain my issue a bit better?


PS thanks for the suggestion. :)
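For anyone following along: if the project can be run as (or exported to) a regular Scrapy spider, the redirect-to-404-page case can be detected in the callback, roughly like the sketch below. The spider name, callback, and NOT_FOUND_URL are placeholders, not Portia's own API.

import scrapy

NOT_FOUND_URL = 'http://www.mysite.co.uk/page-not-found/'  # placeholder for the site's 404 page

class ConditionSpider(scrapy.Spider):
    name = 'conditions'  # hypothetical spider

    def parse(self, response):
        # Scrapy's RedirectMiddleware records the redirect chain in response.meta.
        if response.meta.get('redirect_urls') and response.url == NOT_FOUND_URL:
            self.logger.info('Page does not exist, skipping: %s',
                             response.meta['redirect_urls'][0])
            return  # nothing to scrape, nothing to retry
        # ...otherwise extract the item fields as usual...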


Thanks for your nice feedback =)


Well, this is possible using a proxy rotator like Crawlera:

https://scrapinghub.com/crawlera/

This helps rotate IPs and improves performance.


If you sign up for the service, you can enable it as an add-on.


Do you think that could help in your case? The other suggestion is to explore what other properties those broken URLs have.


Best,

I've already set up Crawlera because of the sheer volume of pages I need to check. I looked again at the redirect page, and it's simply the home page, which means there are no URL properties to match on.


Any other suggestions?
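One possible way forward, sketched under the assumption that the project can be extended like a regular Scrapy project (Portia's UI doesn't expose this directly): since every missing page ends up redirected to the home page, the final response URL itself can be the filter. HOME_PAGE and the class name are placeholders.

# Hypothetical downloader middleware: drop any response that was redirected
# to the home page, since that means the original page does not exist.
from scrapy.exceptions import IgnoreRequest

HOME_PAGE = 'http://www.mysite.co.uk/'  # placeholder for the site's home page

class DropHomeRedirects:
    def process_response(self, request, response, spider):
        was_redirected = bool(request.meta.get('redirect_urls'))
        if was_redirected and response.url.rstrip('/') == HOME_PAGE.rstrip('/'):
            raise IgnoreRequest('Missing page, redirected home: %s'
                                % request.meta['redirect_urls'][0])
        return response

It would then be enabled through the standard DOWNLOADER_MIDDLEWARES setting; dropped requests are simply ignored rather than retried.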

Does anyone else have any suggestions on a way forward with this problem? :)