Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Matt Lebrun 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 0
The site I'm trying to scrape is quite a confusing mess. They use subdomain links and links to another domain.

I've read that I must activate the offsite middleware as well as set the allowed_urls config of the spider. How do I do this in Autocrawler?

Answer
Hi Matt,

the offsite middleware is enabled by default. Its behaviour is to block every link outside the domains in the start URLs and the extra domains explicitly given in the spider attribute allowed_domains.

In the case of AS, you cannot set allowed_domains explicitly. The implicit allowed domains are those contained in the start URLs and in the URLs of the templates. So an easy way to include a needed domain is to add a start URL on that domain; the effect will be exactly the same.

And to avoid writing lots of start URLs for lots of subdomains, you can just add a URL with the parent domain higher in the hierarchy.
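For reference, in a plain Scrapy spider (outside AS, where you can set the attribute directly) the equivalent configuration would look roughly like the sketch below; the spider name and domains are placeholders.

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"  # placeholder name
        # OffsiteMiddleware (enabled by default) drops requests to any domain
        # not listed here; listing the parent domain also allows its subdomains.
        allowed_domains = ["example.com", "partner-site.net"]
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # Follow every link; requests to non-allowed domains are filtered out.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)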
+2
Under review
Paul Tremberth (Engineer) 3 years ago in Scrapy Cloud • updated 1 year ago 4
Especially when running periodic jobs, it would be interesting to find items based on specific field values, not just for one job as the current filter does, but returning items from multiple jobs, perhaps restricted to a specific spider.
+1
Declined
Oleg Tarasenko (Support Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 10 months ago 2
Sometimes it's useful to run a spider from a given location, for example in order to scrape a localized website (we have had cases where websites from a given location did not render some parts in other locations).
For these cases it would be useful to have a set of locations available to start spiders from, but something more lightweight than Crawlera.
Answer
Pablo Hoffman (Director) 10 months ago

Sorry, but there are no plans to support any regional scraping other than using Crawlera.

0
Answered
Matt Lebrun 3 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3
Or something similar.
Answer
Currently we do not plan to support screenshots as custom fields, mainly because this operation would be resource-heavy and we have not had many requests to implement it. However, I suggest you raise this as an idea on this support forum so we can consider it if it proves popular.
0
Answered
Matt Lebrun 3 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3
Possibly getting shipping data and the like.
+2
Completed
Oleg Tarasenko (Support Engineer) 3 years ago in Scrapy Cloud • updated by Trent Mackness 5 months ago 2

Sometimes it's useful to have data stored in XML format. Could you please add support for this feature to the dash?

Answer
Pablo Hoffman (Director) 10 months ago

This is supported now:
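For example, a minimal sketch of exporting a job's items as XML through the Items API, assuming it accepts a format=xml parameter (the API key and the project/spider/job ids below are placeholders):

    import requests

    API_KEY = "<your Scrapinghub API key>"  # placeholder

    # Assumes the Items API accepts a `format` parameter with xml as one of
    # the supported output formats; the path segments are placeholder ids.
    resp = requests.get(
        "https://storage.scrapinghub.com/items/690/1/1",
        auth=(API_KEY, ""),
        params={"format": "xml"},
    )
    print(resp.text)  # items serialized as XML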

Answer

Yes, you need to use x-splash-format=json and you'll get the same output as render.json, which can include both the HTML and the PNG.


See Splash README for more info.
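For illustration, a minimal sketch of calling render.json directly, assuming a Splash instance is reachable at the placeholder address below:

    import requests

    SPLASH_URL = "http://localhost:8050/render.json"  # placeholder Splash address

    resp = requests.get(SPLASH_URL, params={
        "url": "http://example.com",
        "html": 1,  # include the rendered HTML in the JSON response
        "png": 1,   # include a base64-encoded PNG screenshot as well
    })
    data = resp.json()
    html = data["html"]    # rendered page HTML
    png_b64 = data["png"]  # screenshot, base64-encoded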

0
Answered
drsumm 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 1

For the following job:

http://dash.scrapinghub.com/p/690/spider/craiglistemail/


Scrapy returns a 404 error page as an image. It seems Craigslist is blocking Scrapy. The bot is a simple crawler fetching all the pages, but it is not working.

Answer

Hi drsumm,


yes, our bots are being blocked by Craigslist. You should use the Crawlera addon and set it up to use your preferred proxy service. Check the link in the addon description.
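If you run the spider yourself rather than through the dash addon, a roughly equivalent setup with the scrapy-crawlera middleware would look like this sketch (the API key is a placeholder):

    # settings.py -- sketch of enabling Crawlera via the scrapy-crawlera middleware
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_crawlera.CrawleraMiddleware": 610,
    }
    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = "<your Crawlera API key>"  # placeholder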

+5
Completed
Shane Evans (Director) 3 years ago in Crawlera • updated by Bobby39 8 months ago 6

Sometimes websites require maintaining a session (e.g. when requiring login) and this usually does not work if the proxy touches cookies and rotates IPs.


It would be nice to have a convenient way to configure users to keep sessions going through a single proxy until that proxy gets banned.

Answer
Pablo Hoffman (Director) 11 months ago
A beta version of the feature has already been implemented.
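For reference, a minimal sketch of using sessions, assuming the interface exposed via the X-Crawlera-Session header: request a session id once, then send it on subsequent requests so they keep going through the same proxy (the API key is a placeholder).

    import requests

    API_KEY = "<your Crawlera API key>"  # placeholder
    PROXIES = {"http": "http://%s:@proxy.crawlera.com:8010" % API_KEY}

    # Assumption: sending "create" asks Crawlera for a new session tied to
    # a single outgoing IP; the response echoes back the session id.
    first = requests.get("http://example.com/login", proxies=PROXIES,
                         headers={"X-Crawlera-Session": "create"})
    session_id = first.headers.get("X-Crawlera-Session")

    # Reuse the returned session id so follow-up requests keep the same IP.
    second = requests.get("http://example.com/account", proxies=PROXIES,
                          headers={"X-Crawlera-Session": session_id})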
0
Fixed
Keenan Shaw 3 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 3 years ago 2

When I try to create my first project, I get an error message that says "you cannot create more than 1 project in beta mode."


I haven't created a project yet, so I don't understand why it's telling me that I am creating too many projects. Please help.



Answer

Hi Keenan. According to our database, you already have this project created:

http://dash.scrapinghub.com/p/1227/jobs/

If you have any problem accessing it, please enter our chat room by clicking here, so we can better guide you and understand the problem:

http://www.hipchat.com/gJog3cSUL

+9
Completed
Andrew Koller 3 years ago in Portia • updated by Pablo Vaz (Support Engineer) 5 months ago 8

I would like to use a sitemap.xml as a starting point and go from there. I was wondering if that was possible. Thanks!

Answer

We are currently working on this feature! You will be able to use various formats as start URLs in Portia.
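In the meantime, for spiders written directly in Scrapy (rather than Portia), the built-in SitemapSpider already covers this; a minimal sketch with placeholder names and URLs:

    from scrapy.spiders import SitemapSpider

    class ShopSitemapSpider(SitemapSpider):
        name = "shop_sitemap"  # placeholder name
        sitemap_urls = ["https://example.com/sitemap.xml"]  # placeholder sitemap

        def parse(self, response):
            # Called for every URL discovered in the sitemap.
            yield {"url": response.url, "title": response.css("title::text").get()}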

0
Completed
Umair Ashraf 3 years ago in Scrapy Cloud • updated 3 years ago 2
I think the dash homepage shows all available projects. It takes a while to load all of them, which is okay the first time a user visits the dash.

I think it is a good idea to store project ids in local storage along with the total number of hits, and then, based on that data, only show the top N projects on the dash homepage. This will make the dash homepage load faster.
0
Answered
Benkert Johannes 4 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 11

Hi there,


is it possible to start a crawl for one single product item?


My situation:

I'm crawling the products of around 300 shops (> 50,000 products). There is no need for me to update the whole shop all the time. I just want to recrawl the items already found to check the price or whether they are still available. This would be much faster. I could do so by sending just the URL of the item.


Thanks so much!


Answer

Hi Benkert,


actually there is a way (or two); I'm not sure if it fits entirely what you need, but if not, we can make some improvements.


You can schedule a spider with a given start URL using the schedule.json API

http://doc.scrapinghub.com/api.html#schedule-json


and passing the parameter start_urls with the value of the URL you want to scrape.


The start URLs can also be set before the scheduling action (in case you want to use our periodic scheduler, where there is no possibility to set them at scheduling time). Check the AS spider properties API:


http://doc.scrapinghub.com/api.html#as-spider-properties-json


Another step you need to take beforehand is to edit the spider properties from the UI so that the spider does not follow links.
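Putting the first option together, a minimal sketch of scheduling one run with a single start URL through schedule.json (the API key, project id and spider name are placeholders):

    import requests

    API_KEY = "<your Scrapinghub API key>"  # placeholder

    resp = requests.post(
        "https://dash.scrapinghub.com/api/schedule.json",
        auth=(API_KEY, ""),
        data={
            "project": 12345,                              # placeholder project id
            "spider": "myshop",                            # placeholder spider name
            "start_urls": "http://example.com/product/1",  # the single item to recrawl
        },
    )
    print(resp.json())  # e.g. the id of the scheduled job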

+3
Completed
Shane Evans (Director) 4 years ago in Portia • updated by Pablo Hoffman (Director) 10 months ago 5

A new user interface for annotating web pages where users would:

  • Annotate / edit links to follow as well as what to extract
  • Not require items to be defined first & have control of settings from within the annotation tool
  • Allow easy browsing of target website, while fixing annotations
  • Provide better feedback in real time of the performance of the crawl
We would like to release this tool as open source, and it should work with the slybot project.

Answer
Pablo Hoffman (Director) 10 months ago

Hosted Portia launched more than a year ago and we recently launched the next version (Portia 2.0) with a completely new UI rewritten from scratch.

+4
Completed
Shane Evans (Director) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 10 months ago 2

Allow billing using a credit card via a payment service provider. This is more convenient for many smaller users than our current approach (invoice + PayPal).

Answer
Pablo Hoffman (Director) 10 months ago

Automated billing has been in place since February 2016. More information here: http://support.scrapinghub.com/topics/1712-enabling-credit-card-payments/