Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
Inva 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I'm trying to scrape the buylist information from this website's buy list: http://www.coolstuffinc.com/main_buylist_display.php


I'm having difficulty figuring out how to even begin with it. It all happens at that URL; there are multiple steps in the process to get to a particular game's buylist, with multiple sets per game.

Answer

Inva,


Please work through the material provided in our Learning Center and then adapt the spider to your needs:

https://learn.scrapinghub.com/portia/

We don't provide assistance for each particular case, but we can offer the assistance of our experts. If you're interested, let us know by providing more information through:

https://scrapinghub.com/quote

Our engineers will assess, free of charge, whether it's possible to extract data from that site and propose a solution according to your needs.


Best,


Pablo

0
Answered
mark80 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

http://www.ancebergamo.it/elencoimprese.asp?lettera=C


I need to extract the company name and email info for every company, paginated in alphabetical order.

I've set up URL generation, specifying every letter of the alphabet, but I get only 3 items after thousands of requests... strangely slow.

can you take a look?


Pablo, I've tried to add you to the project. Can you see my invitation?

Answer

Hi Mark,


As you can see, every page is like:

http://www.ancebergamo.it/elencoimprese.asp?lettera=1
http://www.ancebergamo.it/elencoimprese.asp?lettera=A
http://www.ancebergamo.it/elencoimprese.asp?lettera=B

and so on...


Please try to use pagination or URL list generation, as shown in:

1. Extract data from a List of URLs

2. Handle pagination in Portia
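
Since the letter pages all follow one pattern, the URL list can be generated with a few lines of Python and pasted into the spider's start URLs (a sketch; the `lettera=1` page for numeric names is an assumption based on the examples below):

```python
import string

# Build one index URL per letter; "1" covers the numeric page.
BASE = "http://www.ancebergamo.it/elencoimprese.asp?lettera={}"

letters = ["1", *string.ascii_uppercase]          # 1, A, B, ..., Z
urls = [BASE.format(letter) for letter in letters]

print(len(urls))   # 27
print(urls[1])     # http://www.ancebergamo.it/elencoimprese.asp?lettera=A
```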


I've noticed you have "Follow all in-domain links" set in the pagination settings, at least for link_name_ink_extraction.


I hope you find this helpful.


Best regards!


Pablo

0
Not a bug
smi 4 weeks ago in Portia • updated 3 weeks ago 2

Hi!


I'm using portia to get data from a list of search results (businesses).


In that list are landline numbers, mobile numbers and so on.


Using .result-container > .row > .col2 > .icon-mobile > a it should return all numbers that have a mobile icon in front of it (when searching for mobile numbers) and in the GUI it works as it should.

But when the crawler runs and finds an item in the list that does not contain a mobile number, it picks the ones from all others on the page.


How can I prevent that?


Thanks! 

Answer

Hey smi, 


Try adjusting the CSS selector or XPath selector used to fetch that information.


Once you make the annotation to extract items, click the gear icon next to the annotation type to configure this.
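
To illustrate why a page-wide selector can pick up neighbouring numbers, here is a minimal sketch with made-up markup and class names mimicking the ones described: scoping the query to each result container means a result without a mobile icon yields nothing, instead of borrowing another result's number.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two results, only the first has a mobile number.
page = """
<div class="results">
  <div class="result"><span class="icon-mobile"><a>0171-111111</a></span></div>
  <div class="result"><span class="icon-phone"><a>030-222222</a></span></div>
</div>
"""
root = ET.fromstring(page)

# Look up the mobile link inside each result individually: a result
# without a mobile icon produces None rather than a neighbour's number.
mobiles = []
for result in root.findall(".//div[@class='result']"):
    link = result.find(".//span[@class='icon-mobile']/a")
    mobiles.append(link.text if link is not None else None)

print(mobiles)  # ['0171-111111', None]
```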


I hope this information helps.


Best regards,


Pablo Vaz

0
Answered
mark80 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 5

http://vetrina.assimpredilance.it/results.aspx?rs=

I need to go through 113 pages, scraping the address and email for every name.

How can I set up multipage scraping?


Answer

Hey Mark, I hope you find the solution by checking these two approaches:


1. Extract data from a List of URLs


2. Handle pagination in Portia


If it's not possible with those two approaches, perhaps Portia is not the right solution for your needs, as mouch1 suggested.
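
If the site exposes a page number in the URL, the list-of-URLs approach from the first article can be generated in a few lines (a sketch; the `page` query parameter here is hypothetical, so check the real pagination parameter in your browser first):

```python
# Hypothetical "page" parameter; verify the site's real pagination parameter first.
BASE = "http://vetrina.assimpredilance.it/results.aspx?rs=&page={}"

urls = [BASE.format(n) for n in range(1, 114)]    # pages 1..113

print(len(urls))   # 113
print(urls[0])
```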


mouch1, thanks for your suggestions.


Best regards!


Pablo

0
Planned
v2065925 1 month ago in Portia • updated by Thriveni Patil (Support Engineer) 4 weeks ago 6

Starting from yesterday, in URL Generation, the GENERATION LIST contains errors.


Before the bug:

I entered a fixed part x and the list 1, 2, 3.

The GENERATION LIST consisted of correct URLs:

x1

x2

x3


But since yesterday the URLs have become incorrect and contain an extra %20:

x%201

x%202

x%203


How can I fix this?


Answer

The URL rendering %20 in the browser is a bug, and the Portia team will be working to fix it.


But when the spider runs, the URLs in the requests are rendered correctly. They seem to be blocked by the site with a 403 HTTP code; you may need to use a proxy rotator like Crawlera to avoid the bans.
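
For reference, %20 is just the percent-encoding of a space, so a stray space between the fixed part and the list values is enough to produce these URLs. A quick sketch with Python's standard library:

```python
from urllib.parse import quote, unquote

# %20 is an encoded space: "x 1" renders as "x%201".
assert quote("x 1") == "x%201"
assert unquote("x%201") == "x 1"

# Stripping whitespace from the list values avoids the extra %20.
fixed = "x"
values = [" 1", " 2", " 3"]
urls = [fixed + v.strip() for v in values]
print(urls)  # ['x1', 'x2', 'x3']
```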

0
Answered
Isaan Online 1 month ago in Portia • updated 1 month ago 2

I am trying Portia for the first time. The first URL I want to extract data from never stops loading; is there anything I can do to prevent that? In a web browser it loads normally.


Before I start buying slots for Scrapy Cloud, I would prefer to know that I can use Portia.


Supatra

Answer

Hi Isaan, for most sites Portia loads the page correctly and allows you to extract data without any worries. Unfortunately, some others are too complex for this browser to interact with.


For this kind of site I suggest starting with Scrapy as soon as possible; it involves more coding, but it's more powerful.

If you are interested in learning more, please visit our new Learning Center:

https://learn.scrapinghub.com

By the way, there are more resources for Portia there too.


Best regards,


Pablo

0
Answered
oday_merhi 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hello guys,


So I'm trying to scrape articles with certain text in them. Can I teach my spider to scrape specific articles? For example, if I was on a food site that had articles and I only wanted recipes with banana, is there a way for me to set up the spider to only scrape the articles with the keyword "banana"?


Thank you for your help!

Answer

Hey Oday, yes, I think it's possible to set up some kind of extractor using regular expressions.


If it's not possible with Portia, you can try Scrapy:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201028-learn-scrapy-video-tutorials-
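
As a sketch of the regex idea (plain Python with made-up article data, not Portia's built-in extractor; in a Scrapy spider you would apply the same check before yielding each item):

```python
import re

# Hypothetical scraped articles.
articles = [
    {"title": "Banana bread", "body": "Mash three ripe bananas..."},
    {"title": "Apple pie", "body": "Peel six apples..."},
]

# Match "banana" or "bananas" as a whole word, case-insensitively.
keyword = re.compile(r"\bbananas?\b", re.IGNORECASE)

matches = [a for a in articles if keyword.search(a["title"] + " " + a["body"])]
print([a["title"] for a in matches])  # ['Banana bread']
```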


Best regards,


Pablo

0
Answered
Rodrigo Barria 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hi,


I would like to know how to make the spider go inside each one of the real-estate properties listed on these pages:


http://www.portalinmobiliario.com/venta/departamento/santa-isabel-santiago-santiago-metropolitana?tp=2&op=1&ca=2&ts=1&dd=2&dh=6&bd=0&bh=6&or=f-des&mn=1&sf=1&sp=0&sd=47%2C00



In addition, that principal page consists of several pages; how can I make the spider go to the 2nd page and so on?


thank you very much

Answer

Hi Rodrigo,


You have many options for how Portia crawls the site; try "Follow all in-domain links".



If this doesn't work, try different alternatives. Take a few minutes to explore these articles:

Portia > List of URLs

Portia > Pagination


Best regards,


Pablo


0
Answered
Matthew Sealey 2 months ago in Portia • updated 2 months ago 5

I'm trying to use Portia to pull data from a possible list of pages based on a list. I know a lot of pages don't exist, but I don't know which ones.


So far Portia gets stuck in a loop of reattempting pages multiple times. That increases the request limit unnecessarily. Is there a way of limiting Portia to perhaps just two attempts at a single page before it discards it from attempting again?

Answer

Hi Matthew!

Have you tried some extra settings using regex? Perhaps you don't know exactly which pages are unnecessary, but you know enough about their URLs to avoid them.


Check this article:

Portia > Regex
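
On the retry question specifically: Portia spiders run on Scrapy, so if you can set Scrapy settings for the project (for example in Scrapy Cloud's spider settings), capping retries is one option. A sketch of the relevant settings, assuming the defaults are otherwise in play:

```python
# Scrapy project settings (sketch): retry each failed page once
# instead of the default twice, so a dead URL costs at most 2 requests.
RETRY_ENABLED = True
RETRY_TIMES = 1          # extra attempts after the first request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```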


Best regards!

Pablo

0
Answered
mark80 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 5

It's my first project and it seems to me a great app!

Here http://www.sia.ch/it/affiliazione/elenco-dei-membri/socii-individuali/ we have 255 pages (the little number on top of the list), and I need to extract not only these 4 visible columns but also the email and telephone behind every name in the list.

I've already extracted the 255 pages with the main 4 columns from a sample of the link, but I don't know how to go one level deeper into every name.

Can I do the whole job with a single crawler project?

Answer

Hey Mark, I think I could make it work. I've sent you an invitation to take a look at the project.

Feel free to open Portia to check the settings I made.


Best,


Pablo

0
Answered
Jorge 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

It's possible to apply a filter to scraped data (I know it is possible), but I would like to download the JSON with only the data matching the filter criteria and skip the rest. Is that possible?


thanks in advance

Answer

Hola Jorge!


I think you can play a bit with sharing data between spiders, as shown in this article:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000200420-sharing-data-between-spiders

but I'm not sure if this is efficient for your purposes.


I would prefer to filter locally, but of course that depends on the project.
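
Filtering locally can be as simple as a short script over the downloaded items file (a sketch with hypothetical field names and criteria):

```python
import json

# Items as downloaded from the job (hypothetical records).
items = [
    {"name": "A", "price": 10},
    {"name": "B", "price": 99},
]

# Keep only the records matching the filter criteria,
# then serialize just those back to JSON.
filtered = [item for item in items if item["price"] < 50]

print(json.dumps(filtered))
```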


Best,


Pablo

0
Answered
vl2017 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 2

Could you add support for UTF-8? Non-English letters are not shown in the sample page editor, and regexp conditions do not work with them.

Answer

Hey!

Your inquiry has been escalated to our Portia team.

UTF-8 is supported for non-Latin characters, but it perhaps needs to be improved where it interacts with regex.

This feature is planned for upcoming releases.


Thanks for your valuable feedback and for helping us to improve our services.


Kind regards,

Pablo

0
Answered
vl2017 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

What is the difference between annotations and fields? In "Sample page → Items", each field has configuration icons that open a tab with separate "Annotation" and "Field" groups. There are separate "required" options; what do they mean, and do they overlap each other? The "Annotation" group sets the path to the element, but it is already hidden in the "Item", so why "required"?


How do I configure the scraper to ignore any pages containing a specified attribute or word?
Answer

Hi!


The Annotation count is not the same as the Extracted Items count.

If the webpage contains a list of items and the user uses the repeated annotations icon, the annotations will propagate and reflect the number of items present in the page.


However, it may happen that the algorithm responsible for data extraction is unable to use the annotations provided by the user to properly extract data, thus extracting a number of items different from the count next to the annotations.


For example, suppose we have one annotation with a count equal to 10, hinting that we are extracting 10 items from the page, but the Extracted Items count shows that 0 items were extracted. This means that our annotations haven't worked with Portia's algorithm, so we may have to update our annotations to get the data from alternative elements.


To know more see Portia documentation:

Annotations


Excellent question!

Kind regards,


Pablo


0
Fixed
sappollo 2 months ago in Portia • updated 2 months ago 2

Hi all,


Since yesterday my Portia crawls have been failing with a certain error.


I don't know whether this is a Scrapinghub/Portia error or related to the external page being scraped (which previously worked successfully for months).

Answer

Dear Sapollo,


Sometimes backend updates or new Portia releases can affect old extractors, which is why we always suggest giving the spiders some maintenance: refresh and redeploy when necessary.


If possible, try to recreate your spider and launch again. This should work.


Kind regards,


Pablo

0
Answered
MSH 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Hi.


I created a Portia spider for a website that was built with ASP.NET and uses javascript:__doPostBack for the pagination links.


Is it possible to use this kind of link (javascript:__doPostBack) in Portia?

for example:

<a href="javascript:__doPostBack('p$lt$ctl06$pageplaceholder$p$lt$ctl00$ApplyPageJobSearch$GridView1','Page$2')">2</a>


Thanks

Answer

Hi MSH!


Why don't you try this approach for the pagination links:

How to handle pagination in Portia


This article could also be helpful:

Portia List of URLs


If these approaches don't work, perhaps you should try Scrapy; Portia sometimes can't handle complex projects involving JS.
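
For background, javascript:__doPostBack just submits the page's form with two hidden fields set from the link's arguments, which is why a plain link-follower can't paginate here. If you move to Scrapy, you would replicate that POST (for example with FormRequest) carrying the page's hidden fields such as __VIEWSTATE plus these two values. A sketch of the form body, using the arguments from the example link:

```python
from urllib.parse import urlencode

# Arguments taken from the example link above.
target = "p$lt$ctl06$pageplaceholder$p$lt$ctl00$ApplyPageJobSearch$GridView1"
argument = "Page$2"

# __doPostBack fills these two hidden inputs and submits the form.
form = {
    "__EVENTTARGET": target,
    "__EVENTARGUMENT": argument,
    # "__VIEWSTATE": "...",  # copy from the page's hidden input
}
body = urlencode(form)
print(body)
```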


Best regards!


Pablo