Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
mescalante1988 23 hours ago in Portia • updated by Thriveni Patil (Support Engineer) 17 hours ago 1

Hello, I am working on a project and I think Portia is great!

I have a question: I am extracting data from a webpage and want to include the category on every item I extract, but each item only provides the image, price and description.

What I want to do is add a category manually.

For example, right now I am receiving:


[ { "image": ["urlImage"], "description": ["TV LED "], "price": ["565"] }, { "image": ["urlImage1"], "description": ["TV1"], "price": ["867"] } ]


I want to manually add a category called TV and obtain the following result:


[ { "image": ["urlImage"], "description": ["TV LED "], "price": ["565"], "category": ["TV"] }, { "image": ["urlImage1"], "description": ["TV1"], "price": ["867"], "category": ["TV"] } ]

Could anyone help me with this?

I only know how to use Portia through its graphical web interface.

Thanks!

Answer

Good to know that you like Portia :)


To add a field to every item, you can use the Magic Fields addon. Please refer to http://help.scrapinghub.com/scrapy-cloud/addons/magic-fields-addon to learn more about Magic Fields.
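
If the addon is not an option, the same result can also be reached by post-processing the exported items. The following is only a minimal Python sketch and not Portia or Magic Fields functionality: the function name `add_constant_field` is hypothetical, and the item layout is assumed to match the JSON in the question (each field holds a list of values).

```python
import copy


def add_constant_field(items, name, value):
    """Return a copy of items in which every item has name set to [value]."""
    result = []
    for item in items:
        new_item = copy.deepcopy(item)  # leave the original items untouched
        new_item[name] = [value]        # Portia-style fields are lists of values
        result.append(new_item)
    return result


# Sample items shaped like the export in the question.
items = [
    {"image": ["urlImage"], "description": ["TV LED "], "price": ["565"]},
    {"image": ["urlImage1"], "description": ["TV1"], "price": ["867"]},
]
tagged = add_constant_field(items, "category", "TV")
```

Each entry in `tagged` then carries `"category": ["TV"]` alongside the scraped fields, while `items` itself is left unchanged.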


Regards,

Thriveni

0
Fixed
Uptown Found yesterday at 1:21 a.m. in Portia • updated by Pablo Vaz (Support Engineer) yesterday at 6:11 p.m. 1

When I try to access my Portia project using Chrome, I get a blank page. Opening the Chrome Inspector shows there are several CSS and JS files that cannot be loaded (404 errors):


Answer
Pablo Vaz (Support Engineer) yesterday at 6:11 p.m.

Hi Uptown Found,


We have been doing some maintenance work; it should be working now.

Please be sure to clear your browser cache to avoid related issues.


Best regards,

Pablo

0
Not a bug
IncentFit 3 days ago in Portia • updated by Pablo Vaz (Support Engineer) yesterday at 6:15 p.m. 1

I'm trying to scrape this Yoga Works website to get a list of their locations. Notice that it shows 43 results on the left and 50 on the right. How does that make any sense? Then when I run the job in Scrapinghub it times out after 24 hours. It's just trying to scrape one page!


Am I doing something wrong here or is it just that buggy? The correct number of results is 43.

Answer
Pablo Vaz (Support Engineer) yesterday at 6:15 p.m.

Hey IncentFit,


Thanks for your feedback. Yes, it can be confusing: the sample count is the number of elements annotated, while the extracted items count is the number of items the extraction algorithm was actually able to extract.


We forwarded your feedback to our Portia team. Thanks for helping us to provide a more stable Portia platform.


Kind regards,


Pablo Vaz

0
Not a bug
A Rj 3 days ago in Portia • updated by Pablo Vaz (Support Engineer) 17 hours ago 2

None of these pages load any more in the Portia spider editing tool. I've tried recreating the spider, but it doesn't help. Is there anything I'm doing wrong? I've followed the tutorials step by step and am able to get the data from other similar pages, but Portia fails on these specific sites; they just don't open in the Portia spider editing tool:


1) anthonysheatingoilri.com - scraper doesn't process html tags/no elements can be selected:
2) anthonysoil.com - scraper doesn't process html tags/no elements can be selected:

3) scraper doesn't process html tags/no elements can be selected:


4) big-oats.com - can't select the particular element on the page. Nothing gets parsed (0 results):
Answer

Hi Rj and Markus,


We have been doing some maintenance tasks lately, so with our latest release Portia should be working stably.


Regarding specific domains, Portia sometimes can't handle complex components of a site and fails to extract data. Keep in mind that this tool was designed for small and mid-size projects. If you are interested in developing more powerful extractors, you should consider using Scrapy: https://doc.scrapy.org/en/latest/intro/tutorial.html

You can then deploy your projects for free on our Scrapy Cloud.


You can also hire our experts to build a crawler tailored to your needs. If you are interested, don't hesitate to request a free quote: https://scrapinghub.com/quote


Kind regards,


Pablo Vaz

Support Team

0
Waiting for Customer
Markus 3 days ago in Portia • updated by Nestor Toledo Koplin (Support Engineer) yesterday at 11:03 a.m. 4

When I try to create a new project and then open it in Portia, I either get an error message saying that "Project 169900 not found" in Portia or a 502 Bad Gateway error message. I can see the project in the scrapinghub dashboard (https://app.scrapinghub.com/p/169900/jobs), but it's failing to open in Portia. The URL to the project in Portia is https://portia.scrapinghub.com/#/projects/169900.


Thanks for your help,

Markus

0
Answered
saurabh9 6 days ago in Portia • updated by adebar 4 days ago 4

Hi,

I am trying Portia on a few sites, but adding new annotations has been a problem. About 80% of the time I get the error "Resource 'api/projects/.........' not found". I tried different browsers, cleared the cache, and used incognito mode, but nothing helped. Any clues?

Answer

Hi Saurabh.

Thanks for reporting this issue. Some sites are hard to crawl with Portia due to their complexity. We suggest creating a new project and adding the annotations again. If the problem persists, Portia may not work well with that site.


In that case, we recommend trying Scrapy:

https://doc.scrapy.org/en/latest/intro/tutorial.html

You can run your Scrapy spiders on our platform for free! And there's a vast community willing to help on Stack Overflow and GitHub.

Finally, you can always ask our sales team about our data on demand services. We can extract the data you need and deliver it to you in the most useful formats. If you are interested, don't hesitate to contact us through:

https://scrapinghub.com/quote


Kind regards,

Pablo

0
Answered
Dave 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 6 days ago 1

Am I able to create a new column based on a regex extractor with Portia?

Answer

Hi Dave, using regex you can:


1. Configure URL patterns and use the Query Cleaner addon:

http://help.scrapinghub.com/portia/using-regular-expressions-and-query-cleaner-addon


2. You can also use regex for more complex actions like crawling paginated listings:

https://portia.readthedocs.io/en/latest/examples.html#crawling-paginated-listings


3. You can also use regular expressions to extract a portion of the variable.

For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The normal syntax, { "sku": "$field:url" }, will store the full URL in the sku field. If we want to extract only the item_no value, we can use a regex like this:

{ "sku": "$field:url,r'item_no=(\d+)'" }
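
To see what that pattern captures, here is a small Python sketch using the standard `re` module. It only illustrates the regular expression itself, not how the addon evaluates it; the helper name `extract_item_no` is hypothetical.

```python
import re

# Same pattern as in the example above: capture the digits after "item_no=".
ITEM_NO_PATTERN = re.compile(r"item_no=(\d+)")


def extract_item_no(url):
    """Return the item_no value found in url, or None if it is absent."""
    match = ITEM_NO_PATTERN.search(url)
    return match.group(1) if match else None


sku = extract_item_no("http://www.example.com/product.html?item_no=345")
print(sku)  # prints 345
```

Only the parenthesized group `(\d+)` is kept, which is why the field would end up holding the number rather than the whole URL.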

I'm not sure whether the above suggestions help, but you can find more information in the Portia docs:
https://portia.readthedocs.io/en/latest/index.html


Kind regards,

Pablo

0
Answered
vicente.tronco 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 6 days ago 1

For instance, I am extracting product information from a website, but I would like to include a column with the brand name for every item. Every entry in this column should have the value XXX (and it does not appear on the item pages).

Answer

Hi Vicente, what if you go to the page you want (the one with those products) using the Portia browser, and then set it as the start page?

I think that could be a simple solution.

Kind regards,

Pablo

0
Answered
abbyinohio 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Help! I deployed a Portia project (https://portia.scrapinghub.com/#/projects/167699) to scrape Rotten Tomatoes. But when I run the job on Scrapy Cloud, it crawls hundreds of pages and scrapes only the first ten. I verified that none of my fields are required, and I can't find any error messages in the log file. Thank you!

Answer

Hi Abby,

If Portia scrapes the first pages successfully but then starts to fail, it could be a ban issue.

When you start to crawl, Portia crawls from a fixed IP, so the site can detect your requests and start banning you.
We suggest using Crawlera, our intelligent proxy rotator. It can help you crawl more efficiently.

https://scrapinghub.com/crawlera/


Also, if the site is complex to scrape, we recommend starting with Scrapy:

https://doc.scrapy.org/en/latest/intro/tutorial.html

Finally, you can always ask our sales team about our data on demand services. We can extract the data you need and deliver it to you in the most useful formats.


I hope these suggestions are helpful.

Kind regards,


Pablo

0
Answered
vicente.tronco 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1
Answer

Hi Vicente, to delete any project please check this article:

http://help.scrapinghub.com/scrapy-cloud/deleting-projects

Kind regards,

Pablo

0
Answered
Herman 1 week ago in Portia • updated 1 week ago 4

(1)

[scrapy.core.scraper] Error downloading <GET https://www.openrice.com/en/hongkong/restaurants>:[<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]


3 ScrapyDeprecationWarning

[py.warnings] /src/slybot/slybot/slybot/closespider.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762

Answer

Hi Herman, according to our experts the reported error appears to be a connection failure.

Please try running the job again.

As for the warning, it shouldn't cause you any problems. Our developers will update all necessary libraries when required.

Kind regards,

Pablo

0
Not a bug
Firdaus AD 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

Hi all,

I have been using Portia successfully to extract data from my target website for two months,

but since last week Portia has not extracted the data I want.

Can anybody help me understand what happened?


http://m.mudah.my/view?q=&ca=1_3_a&sa=&cg=1000&catname=VEHICLES&o=1&f=a&srch=1&so=1&ad_id=53011516


I want to extract the CALL | SMS data.

Answer

Hi Firdaus,


Have you tried other similar sites? If it works for other sites, please check:

http://help.scrapinghub.com/portia/troubleshooting-portia

Kind regards,

Pablo

0
Answered
brandonmp 2 weeks ago in Portia • updated 1 week ago 2

I've looked in the forums for a similar question, but the only relevant questions are a few years old.


I have a site with a paginated list of links I want to crawl & scrape.


The pagination links are of the form:


`<a href="javascript:void(0);" onclick="changePage(1);">2</a>`


The links I want to crawl and scrape are of the pattern:


`<div class="someClass" onclick="window.location='/detailsLite?acId=8545&listingId=1037524'"> // content </div>`


Can Portia handle either of these types of navigation?


I've tried activating Javascript for the crawler on all pages, as well as configuring URL patterns to follow all links, but Portia still indicates it can't find any links to crawl.


Answer

Hi Brandon,

You can find out how to set paths for crawling in this article:

http://help.scrapinghub.com/portia/using-regular-expressions-and-query-cleaner-addon
and also please check:

http://help.scrapinghub.com/portia/how-do-you-extract-data-from-a-list-of-urls
I hope you find this helpful.
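
If Portia still cannot follow these links, one possible fallback outside Portia is to pull the targets out of the `onclick` attributes with regular expressions, for example inside a Scrapy spider. This is a rough Python sketch that assumes the markup quoted in the question and that the attribute values are not HTML-escaped; the function names and the base URL `http://www.example.com/list` are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches onclick="window.location='/some/path'" and captures the path.
LOCATION_RE = re.compile(r"window\.location\s*=\s*'([^']+)'")
# Matches onclick="changePage(1);" and captures the page argument.
PAGE_RE = re.compile(r"changePage\((\d+)\)")


def detail_urls(html, base_url):
    """Return absolute URLs for every window.location target found in html."""
    return [urljoin(base_url, path) for path in LOCATION_RE.findall(html)]


def page_arguments(html):
    """Return the integer arguments passed to changePage() in html."""
    return [int(number) for number in PAGE_RE.findall(html)]


# Sample markup copied from the question.
sample_html = (
    '<a href="javascript:void(0);" onclick="changePage(1);">2</a>'
    '<div class="someClass" '
    "onclick=\"window.location='/detailsLite?acId=8545&listingId=1037524'\">"
    " // content </div>"
)
links = detail_urls(sample_html, "http://www.example.com/list")
pages = page_arguments(sample_html)
```

The detail links can then be requested directly. What `changePage()` actually requests is defined by the site's own JavaScript, so the page arguments alone do not yield a URL without inspecting that script.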

Kind regards,

Pablo

0
Not a bug
Håkan Waara 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

I've created a Portia spider for a site that is rendered with JavaScript and React. It works great in the UI and finds the right items. But when run, it returns 0 items. Any idea how I can debug this? Does Portia support "dynamic" pages?

Answer

Hi Håkan,

Could you share the site you are trying to scrape? Have you tried other similar sites? If it works for other sites, please check:

http://help.scrapinghub.com/portia/troubleshooting-portia


Kind regards,

Pablo

0
Answered
Dante 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I am trying to parse a website with Portia. I can extract all fields except the email, even though it is visible in the browser. When creating fields with Portia, it returns "email protected" for this field. What can I do? Is there a command or regular expression I can apply to this particular field?

PS: I have no programming experience, so copy-paste solutions only, if that helps :D


I guess it has Cloudflare protection. I've seen the scripts in the Chrome console.

Answer

Yes, for some sites or fields, Portia can't retrieve data that is guarded by the site's security policies, especially anti-bot protection. For more information, please check: http://help.scrapinghub.com/portia/troubleshooting-portia

Kind regards,

Pablo