I'm trying to scrape the buylist information from http://www.coolstuffinc.com/main_buylist_display.php
I'm having difficulty figuring out how to even begin with it. Everything happens at that URL, and there are multiple steps in the process to reach a particular game's buylist, with multiple sets per game.
Please follow the material provided in our Learn center and adapt your spider to your needs:
We don't provide assistance for each individual case, but we can offer help from our experts. If you're interested, let us know by providing more information through:
Our engineers will provide a solution suited to your needs, free of charge, once they know whether it's possible to extract data from that site.
I need to extract the company name and e-mail info from every company, paginated in alphabetical order.
I've set up URL generation, specifying every letter of the alphabet, but I only got 3 items after thousands of requests, and it's strangely slow.
Can you take a look?
Pablo, I've tried to add you to the project. Can you see my invitation?
As you can see, every page is like:
and so on...
Please try to use pagination or URL list generation as seen on:
I've noticed you have Follow all domain links enabled in the pagination settings, at least for link_name_ink_extraction.
I hope you find this helpful.
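For the URL list generation approach, the alphabetic start URLs can also be produced programmatically. Below is a minimal sketch; the base URL and the `letter` query parameter are invented placeholders, not the real site's:

```python
import string

# Placeholder pattern: substitute the real listing URL and parameter name.
BASE = "http://example.com/companies?letter={}"

def start_urls():
    # One start URL per letter a-z, mirroring what Portia's URL
    # generation produces from a fixed part plus a letter list.
    return [BASE.format(letter) for letter in string.ascii_lowercase]

urls = start_urls()
print(len(urls), urls[0])
```

The same list can be pasted into Portia's URL list, or used as `start_urls` in a Scrapy spider.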
I'm using Portia to get data from a list of search results (businesses).
The list contains landline numbers, mobile numbers, and so on.
Using .result-container > .row > .col2 > .icon-mobile > a it should return all numbers that have a mobile icon in front of them (when searching for mobile numbers), and in the GUI it works as it should.
But when the crawler runs and finds an item in the list that does not contain a mobile number, it picks up the numbers from all the other items on the page.
How can I prevent that?
Try playing a bit with the CSS selector or XPath selector to fetch that information:
Once you make the annotation to extract items, click the wheel next to the annotation type to configure this.
I hope this information helps.
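The usual cause of numbers "leaking" between items is querying the whole page instead of each result container. A minimal stdlib sketch of the per-container scoping idea; the markup and class names below are assumptions mirroring the selector in the question, not the actual site:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one result with a mobile icon, one without.
HTML = """
<div>
  <div class="result-container">
    <span class="icon-mobile"><a>0171 111111</a></span>
  </div>
  <div class="result-container">
    <span class="icon-phone"><a>030 222222</a></span>
  </div>
</div>
"""

def mobile_numbers(root):
    # Query relative to each result container, so an item without a
    # mobile icon yields None instead of borrowing a neighbour's number.
    for item in root.iter("div"):
        if item.get("class") != "result-container":
            continue
        links = [a.text for span in item.iter("span")
                 if span.get("class") == "icon-mobile"
                 for a in span.iter("a")]
        yield links[0] if links else None

root = ET.fromstring(HTML)
print(list(mobile_numbers(root)))  # one entry per result, None when absent
```

In Portia terms, this corresponds to annotating the repeating container first and then the number inside it, rather than annotating the numbers page-wide.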
I need to go through 113 pages, scraping the address and e-mail for every name.
How can I set up multi-page scraping?
Hey Mark, I hope you find the solution by checking these two approaches:
If neither approach works, perhaps Portia is not the right solution for your needs, as mouch1 suggested.
mouch1, thanks for your suggestions.
Starting from yesterday, in URL Generation, the GENERATION LIST contains errors.
Before the bug,
I entered a fixed part x and the list 1 2 3,
and the GENERATION LIST consisted of correct URLs:
But since yesterday the URLs have become incorrect and contain an extra %20.
How can I fix this?
The URLs rendering with %20 in the browser is a bug, and the Portia team will be working to fix it.
However, when the spider runs, the URLs in the requests are rendered correctly. They seem to be blocked by the site with a 403 HTTP code. You may need to use a proxy rotator like Crawlera to evade the bans.
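If you go the Crawlera route from a Scrapy project, enabling it is a matter of a few settings via the scrapy-crawlera package; the API key below is a placeholder:

```python
# settings.py fragment for the scrapy-crawlera middleware.
# Replace the placeholder key with your own.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'
```

With this in place, requests are routed through the rotating proxy pool instead of hitting the site directly, which usually gets past per-IP 403 bans.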
I am trying Portia for the first time. The first URL I want to extract data from never stops loading; is there anything I can do to prevent that? In a web browser it loads normally.
Before I start buying slots for Cloud, I would prefer to confirm that I can use Portia.
Hi Isaan, for most sites Portia loads the page correctly and allows you to extract without any worries. Unfortunately, some others are too complex for this browser to interact with.
For this kind of site I suggest starting with Scrapy as soon as possible; it involves more coding, but it's more powerful.
If you are interested in learning more, please visit our new learning center:
By the way, there are more resources for Portia too.
So I'm trying to scrape articles containing certain text. Can I teach my spider to scrape specific articles? For example, if I was on a food site that had articles and I only wanted recipes with banana, is there a way for me to set up the spider to scrape only the articles with the keyword "banana"?
Thank you for your help!
Hey Oday, yes, I think it's possible to set up some kind of extractor using regular expressions.
If it's not possible with Portia, you can try Scrapy:
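The keyword check itself is a one-line regular expression. A minimal sketch of the kind of filter such an extractor would apply; in a Scrapy spider you would call it from the parse callback and skip articles that don't match (the article titles below are invented examples):

```python
import re

# Word-boundary match so "banana" matches but "bananarama" would not;
# IGNORECASE covers "Banana" at the start of a title.
BANANA = re.compile(r"\bbanana\b", re.IGNORECASE)

def wants_article(text):
    """Return True when the article text mentions the keyword."""
    return bool(BANANA.search(text))

articles = [
    "Banana bread for beginners",
    "Ten ways to cook rice",
    "Frozen banana smoothie",
]
kept = [a for a in articles if wants_article(a)]
print(kept)
```

In Portia, the equivalent is attaching a regular-expression extractor to the annotated field so pages without a match produce no item.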
I would like to know how to make the spider go inside each one of the real-estate properties listed on these pages:
In addition, that main page spans several pages; how can I make the spider go to the 2nd page, and so on?
Thank you very much.
I'm trying to use Portia to pull data from a possible set of pages based on a list. I know a lot of the pages don't exist, but I don't know which ones.
So far Portia gets stuck in a loop, reattempting pages multiple times, which uses up the request limit unnecessarily. Is there a way to limit Portia to, say, two attempts at a single page before it discards it and stops retrying?
It's my first project, and it seems a great app to me!
Here http://www.sia.ch/it/affiliazione/elenco-dei-membri/socii-individuali/ we have 255 pages (the little number at the top of the list), and I need to extract not only the 4 visible columns but also the e-mail and telephone inside every name on the list.
I've already extracted the 255 pages with the main 4 columns following the sample link, but I don't know how to go one level deeper into every name.
Can I do the whole job with a single crawler project?
Hey Mark, I think I could make it work. I sent you an invitation to take a look at the project.
Feel free to open Portia and check the settings I made.
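For readers with the same question: the deeper level is typically reached by following each name's link from the list page into a second callback that extracts the e-mail and telephone. A minimal sketch of the URL handling only; the relative member paths below are invented, not real links from the site:

```python
from urllib.parse import urljoin

LIST_PAGE = "http://www.sia.ch/it/affiliazione/elenco-dei-membri/socii-individuali/"

def detail_urls(base, hrefs):
    # Resolve each member link against the list page, as a Scrapy
    # list-page callback would before yielding one request per member
    # with a detail-page callback attached.
    return [urljoin(base, href) for href in hrefs]

print(detail_urls(LIST_PAGE, ["member-123/", "/it/member-456/"]))
```

In Portia, the same effect comes from letting the crawler follow the member links and defining a second sample for the detail pages.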
Is it possible to apply a filter to the scraped data (I know it is), but also to download the .JSON output with the filter criteria already applied, skipping the rest of the data? Is that possible?
Thanks in advance.
I think you can play a bit with sharing data between spiders, as shown in this article:
But I'm not sure whether that is efficient for your purposes.
I would prefer to filter locally, but of course that depends on the project.
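Filtering locally can be as simple as post-processing the exported JSON file. A small sketch; the field names and the "has an e-mail" criterion are assumptions standing in for whatever your filter actually is:

```python
import json

# Stand-in for the exported items file (e.g. the .JSON download).
raw = '[{"name": "A", "mail": "a@x.ch"}, {"name": "B", "mail": null}]'
items = json.loads(raw)

# Keep only items matching the criterion; here, those with an e-mail.
filtered = [item for item in items if item.get("mail")]
print(json.dumps(filtered))
```

The filtered list can then be written back out with `json.dump`, giving you a download-sized file without the unwanted records.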
Could you add support for UTF-8? Non-English letters are not shown in the sample page editor, and regexp conditions do not work with them.
Your inquiry has been escalated to our Portia team.
UTF-8 is supported for non-Latin characters, but it perhaps needs to be improved where it interacts with regexes.
This feature is planned for the next releases.
Thanks for your valuable feedback and for helping us to improve our services.
What is the difference between annotations and fields? In the "Sample page → Items" view, each field has configuration icons that open a tab with separate "Annotation" and "Field" groups. There are separate "required" options; what do they mean, and do they overlap? The "Annotation" group sets the path to the element, but that is already contained in the "Item", so why a "required" there too?
Annotation Count are not the same as Extracted Items count.
If the webpage contains a list of items and the user uses the repeated annotations icon, the annotations will propagate and reflect the number of items present in the page.
However, it may happen that the algorithm responsible for data extraction is unable to use the annotations provided by the user to properly extract data, thus extracting a number of items different from the count next to the annotations.
For example, in the image above, we have one annotation with a count equal to 10, hinting that we are extracting 10 items from the page. However, the Extracted Items count shows that 0 items were extracted. This means that our annotations haven't worked with Portia's algorithm, so we may have to try updating our annotations to get the data from alternative elements.
To know more see Portia documentation:
Since yesterday my Portia crawls have been failing with a certain error:
I don't know whether this is a Scrapinghub/Portia error or something related to the external page being scraped (which had worked successfully for months).
Sometimes backend updates or new Portia releases can affect old extractors, which is why we always suggest giving your spiders some maintenance: refresh and redeploy when necessary.
If possible, try recreating your spider and launching it again. This should work.