Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Fixed
Ron Johnson 3 years ago in Portia • updated by Andrés Pérez-Albela H. 3 years ago 5
The spider at this page failed to delete when I deleted the autospider. Now I cannot go to the autospider page to try to delete the spider again.
Answer
Hi Ron,

I have deleted the failed entry. Sorry for the delay; I was trying to find time to investigate how that could have happened, but I was not able to reproduce it in other cases.

Let us know if it happens again.
0
Answered
Ron Johnson 3 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 5
On this web page I want to extract the list of names under Owner Name. When I create a template to do so for this page, it successfully extracts the owner names. However, as the spider crawls on to the next page, as seen here, the template starts to fall apart.

The end goal is to generate a CSV of all the owner names to import into Excel for post-processing. I can get the spider to crawl every page, but I can't seem to get the items to extract properly. Am I just missing something?

0
Completed
Ayush Lodhi 3 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 1
How can I download the data that has been scraped? I have looked everywhere but I can't find a way to download it.
0
Answered
Robert Clements 3 years ago in Portia • updated by Samir 7 months ago 8
I'm trying to build a database of houses for sale in London from Zoopla.co.uk. I've managed to scrape the description, price, etc., but I'm trying to scrape images from an embedded 'carousel' and I'm not sure if I can.

Any help appreciated. Cheers, Rob
0
Answered
Sammy Kiogora 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 3 years ago 1
Answer
There was a problem with project creation that is now fixed.
+1
Completed
Rolando Espinoza (Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
In my case, I often have to look for a particular field count or check whether a field is missing. The current unsorted display makes it hard to:

1. Find a particular field.
2. Tell whether a field is missing.

Displaying the scraped fields sorted would make these tasks easy.
Answer

Fields are sorted alphabetically now.

+8
Completed
Rolando Espinoza (Engineer) 3 years ago in Scrapy Cloud • updated by Paul Tremberth (Engineer) 3 years ago 7
I have some spiders that need to be scheduled with a couple of arguments. This is not a hassle when scheduling the job via the API, but when doing it manually (i.e. using a test input and updating the spider code each run) it would be nice to be able to re-schedule a job without having to re-enter all the custom arguments.
Answer
It's available now in the "Completed Jobs" tab, at the bottom of the page, next to the "Remove" button.

Select a job's checkbox and click "Restart".
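
For the API route mentioned in the question, a job can be scheduled with arguments along these lines. This is only a rough sketch using the python-scrapinghub client; the API key, project ID, spider name, and argument names are all placeholders:

from scrapinghub import Connection

# connect with your Scrapinghub API key (placeholder)
conn = Connection('YOUR_APIKEY')
project = conn[12345]  # placeholder project ID

# spider arguments are passed as keyword arguments and reach the spider
# the same way as "scrapy crawl myspider -a key=value" would
job_id = project.schedule('myspider', category='books', max_pages='50')
print(job_id)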
+2
Completed
Rolando Espinoza (Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
For large projects with a lot of spiders, it would be handy to be able to type the name of a spider in the search box and go directly to its spider page.
Answer

This is supported already.

0
Completed
Oleg Tarasenko (Support Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
I want to suggest a tweak to the items filter, specifically the pages section. For example: http://dash.scrapinghub.com/p/78/job/14/2/#pages

Here you can see that the fields list box contains only one item!

So why not pre-select it for me? It would make using Dash a great pleasure!
Answer

There are more fields now, so pre-selecting doesn't make sense anymore.

+2
Completed
fernando Almeida 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
Currently you can only schedule jobs to run either every day of the week or once per week. It would be nice if more options could be added, like on a given day of the month (1-30) or every 15 days.
Answer

This is supported now.

0
Answered
Hirantha 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
Hi,

Is it possible to set up an email notification when there is new data while scraping a page? For example, when a new news item is listed on a particular news page, it should notify me by email only about the new item.
Answer

You need to implement this functionality yourself in your Scrapy spider, for example using the MailSender facility:

http://doc.scrapy.org/en/latest/topics/email.html
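
A rough sketch of that approach, assuming a hypothetical news page and selector; note the 'seen' set here is in-memory only, so real change detection between runs would need some persistent storage:

import scrapy
from scrapy.mail import MailSender

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com/news']  # placeholder URL
    seen = set()  # in-memory only; persist this between runs for real change detection

    def parse(self, response):
        # hypothetical selector; adjust to the markup of the target page
        titles = response.css('h2.news-title::text').extract()
        new_items = [t for t in titles if t not in self.seen]
        self.seen.update(new_items)
        for title in new_items:
            yield {'title': title}
        if new_items:
            # SMTP settings (MAIL_HOST, MAIL_FROM, ...) are read from the project settings
            mailer = MailSender.from_settings(self.settings)
            mailer.send(to=['you@example.com'],
                        subject='%d new news items' % len(new_items),
                        body='\n'.join(new_items))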

+1
Answered
Matt Lebrun 4 years ago in Portia • updated by Shane Evans (Director) 4 years ago 4
To note, all fields are marked as Vary, so I don't get why this is even happening.

Here's a sample item that AS says is a duplicate, which upon checking it clearly isn't:

Scraped from <200 http://www.courts.com.sg/Products/PID-IP058275(Courts)/Computers/IT-Accessories/Hard-Disks/WESTERN-DIGITAL-MY-PASSPORT-ESSENTIAL-2TB-BLK-WDBY8L0020BBKPESN>

{'_cached_page_id': '60b7e00f7dbe65861cb6505a0f29296817e215f5',
'_template': '52b1899e4d6c710f54a65589',
'_type': u'product_page',
u'brand': [u'WESTERN DIGITAL'],
u'category': [u'MY PASSPORT ESSENTIAL 2TB BLK'],
u'image': ['http://d2j8wlv4w10az1.cloudfront.net/assets/images/products/ip058275.jpg'],
u'price': [u'209'],
u'title': [u'WDBY8L0020BBK-PESN'],
'url': 'http://www.courts.com.sg/Products/PID-IP058275(Courts)/Computers/IT-Accessories/Hard-Disks/WESTERN-DIGITAL-MY-PASSPORT-ESSENTIAL-2TB-BLK-WDBY8L0020BBKPESN'}
Dropped: Duplicate product scraped at <http://www.courts.com.sg/Products/PID-IP039998(Courts)/Computers/IT-Accessories/Hard-Disks/WESTERN-DIGITAL-MY-BOOK-ESSENTIAL-3TB-35INUSB30-WDBACW0030HBKSESN>, first one was scraped at <http://www.courts.com.sg/Products/PID-IP058275(Courts)/Computers/IT-Accessories/Hard-Disks/WESTERN-DIGITAL-MY-PASSPORT-ESSENTIAL-2TB-BLK-WDBY8L0020BBKPESN>

{'_cached_page_id': '2d21d63869a48f310fec54c48bdc15be1ed942e0',
'_template': '52b1899e4d6c710f54a65589',
'_type': u'product_page',
u'brand': [u'WESTERN DIGITAL'],
u'category': [u'MY BOOK ESSENTIAL 3TB 3.5INUSB3.0'],
u'image': ['http://d2j8wlv4w10az1.cloudfront.net/assets/images/products/ip039998.jpg'],
u'price': [u'229'],
u'title': [u'WDBACW0030HBK-SESN'],
'url': 'http://www.courts.com.sg/Products/PID-IP039998(Courts)/Computers/IT-Accessories/Hard-Disks/WESTERN-DIGITAL-MY-BOOK-ESSENTIAL-3TB-35INUSB30-WDBACW0030HBKSESN'}
+3
Completed
Shane Evans (Director) 4 years ago in Portia • updated by Pablo Vaz (Support Engineer) 7 months ago 3
It should be possible to specify:
* a larger set of start URLs, with some clear maximum number supported
* a URL containing other start URLs
* a simple pattern to iterate through integer numbers, e.g. page[1..200].html (see the sketch below)
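
For comparison, in a plain Scrapy spider the integer-pattern case can already be written directly; a minimal sketch with a placeholder URL:

import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'
    # expands to page1.html .. page200.html, i.e. the page[1..200].html pattern above
    start_urls = ['http://example.com/page%d.html' % n for n in range(1, 201)]

    def parse(self, response):
        pass  # extraction logic goes here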
Answer

Hi Shane!

We are happy to announce to our community that the new release of Portia will allow you to set bulk start URLs using a list (from Dropbox, for example).

We hope to get this new feature, among others, ready very soon!

Best Regards!

0
Answered
David 4 years ago in Portia • updated by Paul Tremberth (Engineer) 2 years ago 5
When viewing scraped items/pages, I click on the Items drop-down to Get as CSV, but it opens a blank page. Please advise on the proper way to view what has been scraped. Thanks!
Answer
We need more information in order to help.
0
Answered
Matt Lebrun 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 0
The site I'm trying to scrape is quite a confusing mess. They use subdomain links and links to another domain.

I've read that I must activate the offsite middleware as well as set the allowed_urls config of the spider. How do I do this in Autocrawler?

Answer
Hi Matt,

the offsite middleware is enabled by default. Its behaviour is to block every link outside the domains of the start URLs, plus the extra domains explicitly given in the spider attribute allowed_domains.

In the case of AS, you cannot set allowed_domains explicitly. The implicit allowed domains are those contained in the start URLs and in the URLs of the templates. So an easy way to include a needed domain is to add a start URL on that domain; the effect will be exactly the same.

And in order to avoid writing lots of start URLs for lots of subdomains, you can just add a URL on the highest-level domain.
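
For reference, in a regular Scrapy spider (outside AS) the same behaviour is controlled explicitly; a minimal sketch with placeholder domains:

import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = 'multidomain'
    # the offsite middleware drops requests to any domain not listed here;
    # subdomains of a listed domain are allowed automatically
    allowed_domains = ['example.com', 'partner-site.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # follow every link; offsite ones are silently filtered by the middleware
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href))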