Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.
You can still browse older topics on this page.
Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
For such cases it would be useful to have a set of locations available as spider start points, but something more lightweight than Crawlera.
Sorry, but there are no plans to support any regional scraping other than using Crawlera.
Sometimes it's useful to have data stored in XML format. Could you please add support for this to the dash?
This is supported now:
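For reference, a hedged sketch of one way to get XML output from a Scrapy project: the feed export settings (`FEED_FORMAT` / `FEED_URI` in the Scrapy versions current at the time of this post). The output file name below is just an example.

```python
# Minimal feed-export settings for a Scrapy project's settings.py.
# FEED_FORMAT selects the exporter ("xml" uses XmlItemExporter);
# FEED_URI is where the exported file is written.
FEED_FORMAT = "xml"
FEED_URI = "items.xml"
```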
For the following job:
Scrapy returns a 404 error page as an image. It seems Craigslist is blocking Scrapy. The bot is a simple crawler that fetches all the pages, but it is not working.
Yes, our bots are being blocked by Craigslist. You should use the Crawlera addon and set it up to use your preferred proxy service. Check the link in the addon description.
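For context, a hedged sketch of the usual Scrapy settings for enabling Crawlera via the `scrapy-crawlera` middleware; the API key is a placeholder you would replace with your own.

```python
# Typical Crawlera settings in a Scrapy project's settings.py,
# assuming the scrapy-crawlera package is installed.
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-crawlera-api-key>"  # placeholder

DOWNLOADER_MIDDLEWARES = {
    # 610 is the middleware order commonly documented for Crawlera.
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
```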
Sometimes websites require maintaining a session (e.g. when requiring login) and this usually does not work if the proxy touches cookies and rotates IPs.
It would be nice to have a convenient way for users to keep a session going through a single proxy until that proxy gets banned.
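Crawlera does offer sticky sessions through the `X-Crawlera-Session` request header: sending the value `create` opens a session pinned to one outgoing IP, and the session id returned in the response header is reused on later requests. A minimal sketch of building those headers (the helper function is hypothetical):

```python
# Hedged sketch of Crawlera sticky-session headers. Sending
# "create" asks Crawlera to open a new session; afterwards you
# reuse the id returned in the X-Crawlera-Session response header
# so every request goes out through the same IP (keeping cookies
# and logins valid).
def session_headers(session_id=None):
    """Headers for a request inside a Crawlera session."""
    return {"X-Crawlera-Session": session_id or "create"}
```

Usage: attach `session_headers()` to the first (e.g. login) request, then pass the returned session id to `session_headers(...)` for the rest of the crawl.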
When I try to create my first project, I get an error message that says "you cannot create more than 1 project in beta mode."
I haven't created a project yet, so I don't understand why it's telling me that I'm creating too many projects. Please help.
I would like to use a sitemap.xml as a starting point and go from there. I was wondering if that was possible. Thanks!
We are currently working on this feature! You will be able to use various formats as start URLs in Portia.
I think it would be a good idea to store project IDs in local storage along with their total number of hits, and then, based on that data, show only the top N projects on the dash homepage. This will make the dash homepage load faster.
Is it possible to start a crawl for one single product item?
I'm crawling the products of around 300 shops (> 50,000 products). There is no need for me to update a whole shop every time. I just want to recrawl the items already found, to check the price or whether they are still available. This would be much faster, and I could do it by sending just the URL of each item.
Thanks so much!
Actually there is a way (or two). I'm not sure it fits exactly what you need, but if not, we can make some improvements.
You can schedule a spider with a given start URL using the schedule.json API, passing the parameter start_urls with the value of the URL you want to scrape.
The start URLs can also be set before scheduling (useful if you want to use our periodic scheduler, where there is no opportunity to set them at scheduling time). Check the AS spider properties API:
One other step you need beforehand is to edit the spider properties in the UI so that it does not follow links.
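A hedged sketch of the first option: building the POST parameters for a schedule.json call that recrawls a single item. The project id, spider name and URL are placeholders, the endpoint URL is an assumption, and the actual request additionally needs your API key as the HTTP basic-auth user.

```python
# Hedged sketch: parameters for one schedule.json job that scrapes a
# single product URL. The endpoint below is assumed; check the API docs.
SCHEDULE_ENDPOINT = "https://app.scrapinghub.com/api/schedule.json"

def build_schedule_params(project_id, spider_name, start_url):
    """POST parameters for scheduling one job with a custom start URL."""
    return {
        "project": project_id,
        "spider": spider_name,
        "start_urls": start_url,  # passed through to the spider as an argument
    }

params = build_schedule_params("12345", "products", "http://example.com/item/1")
# POST `params` to SCHEDULE_ENDPOINT (e.g. with urllib.request or curl),
# authenticating with your API key.
```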
A new user interface for annotating web pages where users would:
- Annotate / edit links to follow as well as what to extract
- Not require items to be defined first & have control of settings from within the annotation tool
- Allow easy browsing of target website, while fixing annotations
- Provide better feedback in real time of the performance of the crawl
Hosted Portia launched more than a year ago, and we recently launched the next version (Portia 2.0) with a completely new UI rewritten from scratch.
Allow billing using a credit card via a payment service provider. This is more convenient for many smaller users than our current approach (invoice + paypal).
Automated billing has been in place since February 2016. More information here: http://support.scrapinghub.com/topics/1712-enabling-credit-card-payments/