Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

+2
Under review
Paul Tremberth (Engineer) 4 years ago in Scrapy Cloud • updated 1 year ago 4
Especially when running periodic jobs, it would be useful to find items based on specific field values: not just within a single job, as the current filter does, but across multiple jobs, perhaps restricted to a specific spider.
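In the meantime, a client-side workaround is possible with the python-scrapinghub library: iterate over the finished jobs of a spider and filter items by field value yourself. A minimal sketch, assuming the scrapinghub package is installed and using placeholder values for the API key, project id, spider name, and field:

import scrapinghub

# Placeholders: replace with your own API key, project id and spider name.
client = scrapinghub.ScrapinghubClient('<APIKEY>')
project = client.get_project(12345)

# Walk every finished job of one spider and keep items whose
# 'category' field (a placeholder) matches the value we want.
matching = []
for summary in project.jobs.iter(spider='products', state='finished'):
    job = project.jobs.get(summary['key'])
    for item in job.items.iter():
        if item.get('category') == 'books':
            matching.append(item)

print(len(matching), 'matching items across all jobs')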
+1
Declined
Oleg Tarasenko (Support Engineer) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 2
Sometimes it's useful to run a spider from a given location, for example to scrape a localized website (we have had cases where a website rendered parts of its pages only when visited from a specific location).
For these cases it would be useful to have a set of locations available for starting spiders, but something more lightweight than Crawlera.
Answer

Sorry, but there are no plans to support any regional scraping other than using Crawlera.

0
Answered
Matt Lebrun 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3
Or something similar.
Answer
Currently we do not plan to support screenshots as custom fields, mainly because this operation would be resource-heavy and we have not received many requests for it. I suggest raising this as an idea in this support forum, so we can consider it if it turns out to be popular.
0
Answered
Matt Lebrun 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3
Possibly getting shipping data and the like.
+2
Completed
Oleg Tarasenko (Support Engineer) 4 years ago in Scrapy Cloud • updated by Trent Mackness 7 months ago 2

Sometimes it's useful to have data stored in XML format. Could you please add support for this feature to the dash?

Answer

This is supported now.

Answer

Yes, you need to use x-splash-format=json and you will get the same output as render.json, which can include both the HTML and the PNG.


See the Splash README for more info.
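For reference, the same output can be requested from a Splash instance directly through its render.json endpoint. A minimal sketch, assuming a Splash server reachable at localhost:8050 and using example.com as a stand-in URL:

import base64
import requests

# Ask Splash for both the rendered HTML and a PNG screenshot in one call.
resp = requests.get('http://localhost:8050/render.json', params={
    'url': 'http://example.com',
    'html': 1,
    'png': 1,
})
data = resp.json()

html = data['html']                  # page HTML after rendering
png = base64.b64decode(data['png'])  # screenshot, base64-encoded in the JSON
with open('page.png', 'wb') as f:
    f.write(png)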

0
Answered
drsumm 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 1

For the following job:

http://dash.scrapinghub.com/p/690/spider/craiglistemail/


Scrapy returns a 404 error page as an image. It seems Craigslist is blocking Scrapy. The bot is a simple crawler fetching all the pages, but it is not working.

Answer

Hi drsumm,


yes, our bots are being blocked by Craigslist. You should use the Crawlera addon and set it up to use your preferred proxy service. Check the link in the addon description.
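For a plain Scrapy project, enabling Crawlera usually comes down to a few settings. A minimal sketch, assuming the scrapy-crawlera middleware is installed and using a placeholder API key (on Scrapy Cloud the addon sets the equivalent options for you):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Route requests through Crawlera instead of fetching directly.
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'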

+5
Completed
Shane Evans (Director) 4 years ago in Crawlera • updated by Bobby39 10 months ago 6

Sometimes websites require maintaining a session (e.g. when requiring login) and this usually does not work if the proxy touches cookies and rotates IPs.


It would be nice to have a convenient way for users to configure sessions that keep going through a single proxy until it gets banned.

Answer
A beta version of the feature has already been implemented.
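Assuming the beta behaves like the documented Crawlera sessions feature, the mechanism is header-based. A minimal sketch with the requests library, using a placeholder API key and the standard Crawlera proxy endpoint:

import requests

proxies = {'http': 'http://<APIKEY>:@proxy.crawlera.com:8010'}

# Ask Crawlera to create a session; the response header carries the
# session id, and requests that reuse it are pinned to one outgoing IP.
resp = requests.get('http://example.com/login',
                    proxies=proxies,
                    headers={'X-Crawlera-Session': 'create'})
session_id = resp.headers.get('X-Crawlera-Session')

# Reuse the same session so follow-up requests go out through the same IP.
resp = requests.get('http://example.com/account',
                    proxies=proxies,
                    headers={'X-Crawlera-Session': session_id})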
0
Fixed
Keenan Shaw 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 2

When I try to create my first project, I get an error message that says "you cannot create more than 1 project in beta mode."


I haven't created a project yet, so I don't understand why it's telling me that I am creating too many projects. Please help.



Answer

Hi Keenan. According to our db, you already have this project created:

http://dash.scrapinghub.com/p/1227/jobs/

If you have any problem accessing it, please enter our chat room by clicking here, so we can better guide you and understand the problem:

http://www.hipchat.com/gJog3cSUL

+9
Completed
Andrew Koller 4 years ago in Portia • updated by Pablo Vaz (Support Engineer) 7 months ago 8

I would like to use a sitemap.xml as a starting point and go from there. I was wondering if that was possible. Thanks!

Answer

We are currently working on this feature! You will be able to use various formats as start URLs in Portia.
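Until then, the plain-Scrapy equivalent is the built-in SitemapSpider. A minimal sketch, using example.com/sitemap.xml as a placeholder sitemap:

from scrapy.spiders import SitemapSpider

class StoreSpider(SitemapSpider):
    name = 'store'
    # URLs are read from the sitemap and each page is passed to parse().
    sitemap_urls = ['http://example.com/sitemap.xml']

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }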

0
Completed
Umair Ashraf 4 years ago in Scrapy Cloud • updated 3 years ago 2
I think the dash homepage shows all available projects. It takes a while to load them all, which is okay the first time a user visits the dash.

I think it would be a good idea to store project ids in local storage along with their total number of hits, and then, based on that data, show only the top N projects on the dash homepage. This would make the dash homepage load faster.
0
Answered
Benkert Johannes 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 11

Hi there,


is it possible to start a crawl for one single product item?


My situation:

I'm crawling the products of around 300 shops (> 50,000 products). There is no need for me to update the whole shop all the time. I just want to recrawl the items already found, to check the price or whether they are still available. This would be much faster. I could do so by sending just the URL of the item.


Thanks so much!


Answer

Hi Benkert,


actually there is a way (or two). I'm not sure it fits entirely with what you need, but if not, we can make some improvements.


You can schedule a spider with a given start url using the schedule.json API

http://doc.scrapinghub.com/api.html#schedule-json


and passing the parameter start_urls with the value of the url you want to scrape.
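A minimal sketch of that call with the requests library, using placeholder values for the API key, project id, spider name, and start URL (the endpoint and parameters are the ones described in the linked documentation; the exact host may differ for your account):

import requests

resp = requests.post(
    'https://dash.scrapinghub.com/api/schedule.json',
    auth=('<APIKEY>', ''),      # API key as the username, empty password
    data={
        'project': '12345',     # project id (placeholder)
        'spider': 'products',   # spider name (placeholder)
        'start_urls': 'http://example.com/product/42',
    },
)
print(resp.json())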


The start URLs can also be set before scheduling (in case you want to use our periodic scheduler, where there is no possibility to set them at scheduling time). Check the AS spider properties API:


http://doc.scrapinghub.com/api.html#as-spider-properties-json


Another prerequisite is to edit the spider properties from the UI so that the spider does not follow links.

+3
Completed
Shane Evans (Director) 4 years ago in Portia • updated by Pablo Hoffman (Director) 1 year ago 5

A new user interface for annotating web pages where users would:

  • Annotate / edit links to follow as well as what to extract
  • Not require items to be defined first & have control of settings from within the annotation tool
  • Allow easy browsing of target website, while fixing annotations
  • Provide better feedback in real time of the performance of the crawl
We would like to release this tool as open source, and it should work with the slybot project.

Answer

Hosted Portia launched more than a year ago and we recently launched the next version (Portia 2.0) with a completely new UI, rewritten from scratch.

+4
Completed
Shane Evans (Director) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 2

Allow billing using a credit card via a payment service provider. This is more convenient for many smaller users than our current approach (invoice + paypal).

Answer

Automated billing has been in place since February 2016. More information here: http://support.scrapinghub.com/topics/1712-enabling-credit-card-payments/

+26
Completed
Shane Evans (Director) 4 years ago in Portia • updated by Pablo Hoffman (Director) 2 years ago 4
We could have an addon to execute JavaScript (say, using Splash). This would be suitable for smaller websites that require JavaScript execution in order for Portia to work.
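Outside Portia, the same approach is already available to code-based spiders through the scrapy-splash package. A minimal sketch, assuming scrapy-splash is installed and a Splash endpoint is configured in the project settings (SPLASH_URL plus its middlewares):

from scrapy import Spider
from scrapy_splash import SplashRequest

class JsSpider(Spider):
    name = 'js_pages'

    def start_requests(self):
        # Render the page in Splash before parsing; 'wait' gives
        # client-side JavaScript time to run.
        yield SplashRequest('http://example.com', self.parse,
                            args={'wait': 0.5})

    def parse(self, response):
        # The response now contains the DOM after JavaScript execution.
        yield {'title': response.css('title::text').get()}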