Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Remember to check the Help Center!
Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.
I've read that I must activate the offsite middleware as well as set the allowed_urls config of the spider. How do I do this in Autocrawler?
The offsite middleware is enabled by default. Its behaviour is to block every link outside the domains of the start URLs, plus any extra domains explicitly given in the spider attribute allowed_domains.
In the case of AS, you cannot set allowed_domains explicitly. The implicit allowed domains are those contained in the start URLs and in the URLs of the templates. So an easy way to include a needed allowed domain is to add a start URL on that domain; the effect is exactly the same.
And to avoid writing lots of start URLs covering lots of subdomains, you can just add a URL on the higher-level domain.
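The domain check the offsite middleware performs can be illustrated with a small stdlib sketch (the helper below is a hypothetical illustration, not Scrapy's actual implementation): a URL passes if its host equals an allowed domain or is a subdomain of one, which is why allowing the higher-level domain covers every subdomain.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Hypothetical sketch of an offsite check: a URL is allowed if its
    host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    )

# Allowing "example.com" covers every subdomain of it:
print(is_offsite("http://shop.example.com/item", ["example.com"]))  # False
print(is_offsite("http://other.org/", ["example.com"]))             # True
```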
For these cases it would be useful to have a set of locations available for spider starts, but something more lightweight than Crawlera.
Sorry, but there are no plans to support any regional scraping other than using Crawlera.
Sometimes it's useful to have data stored in XML format. Could you please add support for this feature to the dash?
This is supported now:
For the following job:
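In the meantime, items downloaded from a job as JSON can also be converted to XML locally; a minimal stdlib sketch (the item fields here are made up for illustration):

```python
import xml.etree.ElementTree as ET

def items_to_xml(items):
    """Wrap a list of scraped item dicts in a flat <items><item>... document."""
    root = ET.Element("items")
    for item in items:
        node = ET.SubElement(root, "item")
        for field, value in item.items():
            ET.SubElement(node, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Example with hypothetical fields:
print(items_to_xml([{"name": "widget", "price": "9.99"}]))
# <items><item><name>widget</name><price>9.99</price></item></items>
```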
Scrapy returns a 404 error page as an image. It seems Craigslist is blocking Scrapy. The bot is a simple crawler fetching all the pages, but it is not working.
Yes, our bots are being blocked by Craigslist. You should use the Crawlera addon and set it up to use your preferred proxy service. Check the link in the addon description.
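Once the addon is enabled, a Scrapy project typically turns Crawlera on with a couple of settings; a sketch assuming the scrapy-crawlera middleware package (the API key is a placeholder):

```python
# settings.py fragment (sketch, assuming the scrapy-crawlera package)
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your Crawlera API key>"  # placeholder, from your account page
```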
Sometimes websites require maintaining a session (e.g. when requiring login) and this usually does not work if the proxy touches cookies and rotates IPs.
It would be nice to have a convenient way for users to configure sessions so that requests keep going through a single proxy until that proxy gets banned.
When I try to create my first project, I get an error message that says "you cannot create more than 1 project in beta mode."
I haven't created a project yet, so I don't understand why it's telling me that I am creating too many projects. Please help.
I would like to use a sitemap.xml as a starting point and go from there. I was wondering if that was possible. Thanks!
We are currently working on this feature! You will be able to use various formats as start urls in Portia.
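Outside Portia, plain Scrapy already ships a SitemapSpider for this; the core of it is extracting the <loc> entries from the sitemap, which can be illustrated with the stdlib (the sitemap content below is hypothetical):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract the <loc> entries from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

# Hypothetical sitemap:
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1</loc></url>
  <url><loc>http://example.com/page2</loc></url>
</urlset>"""
print(sitemap_urls(sitemap))  # ['http://example.com/page1', 'http://example.com/page2']
```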
I think it would be a good idea to store project ids in local storage with the total number of hits, and then based on that data show only the top N projects on the dash homepage. This will make the dash homepage load faster.
Is it possible to start a crawl for one single product item?
I'm crawling the products of around 300 shops (> 50,000 products). There is no need for me to update the whole shop all the time. I just want to recrawl the items already found to check the price or whether they are still available. This would be much faster. I could do so by sending just the URL of the item.
Thanks so much!
Actually, there is a way (or two). I'm not sure if it fits entirely what you need, but if not, we can make some improvements.
You can schedule a spider with a given start URL using the schedule.json API, passing the parameter start_urls with the value of the URL you want to scrape.
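As an illustration of the request body such a call would carry (project id and spider name are placeholders; check the API docs for the exact endpoint and authentication for your account):

```python
from urllib.parse import urlencode

# Sketch of a schedule.json call body. Spider arguments such as
# start_urls are passed as extra POST parameters.
ENDPOINT = "https://app.scrapinghub.com/api/schedule.json"  # from the API docs
body = urlencode({
    "project": "12345",                       # placeholder project id
    "spider": "myspider",                     # placeholder spider name
    "start_urls": "http://example.com/item",  # only this URL is crawled
})
print(body)
# project=12345&spider=myspider&start_urls=http%3A%2F%2Fexample.com%2Fitem
```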
The start URLs can also be set before the scheduling action (in case you want to use our periodic scheduler, where there is no possibility to set them at scheduling time). Check the AS spider properties API:
Another step you need beforehand is to edit the spider properties from the UI so that the spider does not follow links.
A new user interface for annotating web pages where users would:
- Annotate / edit links to follow as well as what to extract
- Not require items to be defined first & have control of settings from within the annotation tool
- Allow easy browsing of target website, while fixing annotations
- Provide better feedback in real time of the performance of the crawl
Hosted Portia launched more than a year ago and we recently launched the next version (Portia 2.0) with a completely new UI, rewritten from scratch.
Allow billing using a credit card via a payment service provider. This is more convenient for many smaller users than our current approach (invoice + paypal).
Automated billing has been in place since February 2016. More information here: http://support.scrapinghub.com/topics/1712-enabling-credit-card-payments/