Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.
You can still browse older topics on this page.
Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
I have deleted the failed entry. Sorry for the delay; I was trying to find time to investigate how that could have happened, but I was not able to reproduce it in other cases.
Let us know if it happens again.
The end goal is to generate a csv of all the owner names to import into excel for post processing. I can get the spider to crawl every page but I can't seem to get the items to extract properly. Am I just missing something?
Any help appreciated, cheers Rob
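In case it helps, here is a minimal, Scrapy-free sketch of the extract-and-export step. It assumes the owner names sit in elements like <span class="owner">...</span> — that selector is a guess, since the real markup isn't shown; adapt it to the actual page.

```python
import csv
import io
from html.parser import HTMLParser

class OwnerExtractor(HTMLParser):
    """Collect the text of elements carrying class="owner".
    The class name is an assumption -- adjust it to the real markup."""
    def __init__(self):
        super().__init__()
        self._in_owner = False
        self.owners = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "owner":
            self._in_owner = True

    def handle_endtag(self, tag):
        self._in_owner = False

    def handle_data(self, data):
        if self._in_owner and data.strip():
            self.owners.append(data.strip())

def owners_to_csv(pages):
    """Parse each crawled HTML page and return all owner names as CSV text,
    ready to open in Excel."""
    extractor = OwnerExtractor()
    for html in pages:
        extractor.feed(html)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["owner"])          # header row for Excel
    for name in extractor.owners:
        writer.writerow([name])
    return buf.getvalue()
```

In a real Scrapy spider the same idea maps to a CSS selector plus the built-in CSV feed export, but the stdlib version above shows the full path from HTML to CSV in one place.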
1. Find a particular field.
2. Tell whether a field is missing.
Displaying the scraped fields sorted would make these tasks easy.
Fields are sorted alphabetically now.
Select a job's checkbox, and click "Restart"
This is supported already.
Here you can see that the fields list box contains only one item!
So why not pre-select it for me? It would make using the dash a great pleasure!
There are more fields now, so pre-selecting doesn't make sense anymore.
This is supported now.
Is it possible to set up an email notification when there is new data while scraping a page? Example - when a new news item is listed on a particular news page, it should notify by email only for the new news update.
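There is no built-in notification described here, but one way to do it yourself is to remember the item IDs you have already seen and email only the difference on each crawl. A sketch, with placeholder SMTP host and addresses:

```python
import smtplib
from email.message import EmailMessage

def find_new_items(scraped_ids, seen_ids):
    """Return the IDs present in this crawl but not seen before,
    preserving crawl order."""
    return [i for i in scraped_ids if i not in seen_ids]

def notify_new_items(new_ids, smtp_host="localhost",
                     to_addr="me@example.com"):
    """Send one email listing only the new items.
    Host and addresses are placeholders; a reachable SMTP server
    is required for this to actually send."""
    if not new_ids:
        return None                       # nothing new, stay silent
    msg = EmailMessage()
    msg["Subject"] = f"{len(new_ids)} new item(s) scraped"
    msg["From"] = "scraper@example.com"
    msg["To"] = to_addr
    msg.set_content("\n".join(new_ids))
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
    return msg
```

Persist the seen IDs between runs (a file or database) so that reruns only report genuinely new news items.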
Here's a sample item that AS says is a duplicate, which upon checking clearly isn't:
u'brand': [u'WESTERN DIGITAL'],
u'category': [u'MY PASSPORT ESSENTIAL 2TB BLK'],
Duplicate product scraped at
first one was scraped at
u'brand': [u'WESTERN DIGITAL'],
u'category': [u'MY BOOK ESSENTIAL 3TB 3.5INUSB3.0'],
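For reference, AS's actual dedup logic isn't public, so the key fields below are an assumption, but any comparison that looks at the full item would classify the two products above as distinct, since the categories differ:

```python
def is_duplicate(item_a, item_b, key_fields=("brand", "category")):
    """Two items count as duplicates only if *all* key fields match.
    The choice of key_fields is hypothetical, not AS's real rule."""
    return all(item_a.get(f) == item_b.get(f) for f in key_fields)

first = {"brand": ["WESTERN DIGITAL"],
         "category": ["MY PASSPORT ESSENTIAL 2TB BLK"]}
second = {"brand": ["WESTERN DIGITAL"],
          "category": ["MY BOOK ESSENTIAL 3TB 3.5INUSB3.0"]}
# Same brand, different category: a full-field check flags them as distinct.
```

If AS keys its duplicate filter on brand (or URL) alone, items like these would be wrongly collapsed.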
* a larger set of start urls, with some clear maximum number supported
* a url containing other start urls
* a simple pattern to move through integer numbers, e.g. page[1..200].html
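The integer-pattern idea in the last bullet is easy to sketch. Note the `[1..200]` bracket syntax is the requester's proposal, not an existing Portia feature; a small expander for it could look like:

```python
import re

def expand_range_pattern(pattern):
    """Expand one [start..end] integer range in a url pattern,
    e.g. "page[1..200].html" -> ["page1.html", ..., "page200.html"].
    Patterns without a range are returned unchanged."""
    m = re.search(r"\[(\d+)\.\.(\d+)\]", pattern)
    if not m:
        return [pattern]
    start, end = int(m.group(1)), int(m.group(2))
    return [pattern[:m.start()] + str(n) + pattern[m.end():]
            for n in range(start, end + 1)]
```

Feeding the expanded list in as start urls would cover the "move through integer numbers" case without any spider changes.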
We are happy to announce to our community that a new release of Portia will allow you to set start urls in bulk using a list (from Dropbox, for example).
We hope to get this new feature among others ready, very soon!
I've read that I must activate the offsite middleware as well as set the allowed_urls config of the spider. How do I do this in Autocrawler?
The offsite middleware is enabled by default. Its behaviour is to block every link outside the domains in the start urls, plus any extra domains explicitly given in the spider attribute allowed_domains.
In the case of AS, you cannot set allowed_domains explicitly. The implicit allowed domains are those contained in the start urls and in the urls of the templates. So an easy way to include a needed domain is to add a start url containing it; the effect is exactly the same.
And to avoid writing lots of start urls covering lots of subdomains, you can just add a url with the highest-level domain.
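The behaviour described above can be sketched in a few lines. This is a stdlib approximation of the offsite filter, not AS's actual code: the allowed set is derived from the start urls, and a host passes if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse

def allowed_domains_from(start_urls):
    """Collect the domains implied by the start urls
    (AS would also include the template urls)."""
    return {urlparse(u).hostname for u in start_urls}

def is_offsite(url, allowed):
    """Mimic the offsite filter: a request is blocked unless its host
    is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed)
```

So adding one start url with the bare top-level domain (say http://example.com/) implicitly whitelists shop.example.com, blog.example.com, and every other subdomain.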