
Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
Wolfgang Keupp 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 0
I want to test differential scraping and Crawlera, so I designed a group of spiders and want to run them.

Do I need to install Scrapy, Python and all the related tooling for that?
Answer
Hi Wolfgang,

Are you talking about running AS (Autoscraping) spiders on your local machine? AS spiders are based on slybot. Check the instructions here: http://slybot.readthedocs.org/en/latest/

The Crawlera middleware is included in this library:

https://github.com/scrapinghub/scrapylib/
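
For reference, a minimal sketch of what enabling that middleware in a Scrapy project's settings.py might look like. The setting names follow the CrawleraMiddleware shipped in scrapylib at the time, so check them against the version you actually install; the credentials are placeholders:

DOWNLOADER_MIDDLEWARES = {
    'scrapylib.crawlera.CrawleraMiddleware': 600,
}

CRAWLERA_ENABLED = True
CRAWLERA_USER = 'your-crawlera-user'      # placeholder credentials
CRAWLERA_PASS = 'your-crawlera-password'  # placeholder credentials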

0
Fixed
Shane Evans (Director) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
There should be a limit for beta (non-paying customers) of one running job, but it should allow many pending jobs. Unfortunately, only one job can be scheduled via the Dash API. When more are scheduled, the following occurs:
project.schedule('ScrapeVerge')
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 196, in schedule
    result = self._post('schedule', 'json', params)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 178, in _post
    return self._request_proxy._post(method, format, params, headers, raw, files)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 105, in _post
    return self._request(url, params, headers, format, raw, files)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 125, in _request
    return self._decode_response(response, format, raw)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 137, in _decode_response
    raise APIError(data['message'])
scrapinghub.APIError: You cannot have more than 1 active jobs in beta mode.
Answer

This was fixed a long time ago.
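
For anyone scheduling jobs programmatically, here is a minimal sketch using the legacy scrapinghub.py client shown in the traceback above; the API key, project id and spider name are placeholders:

from scrapinghub import Connection, APIError

conn = Connection('YOUR_API_KEY')  # placeholder API key
project = conn[1234]               # placeholder project id

try:
    job_id = project.schedule('ScrapeVerge')
    print('scheduled job %s' % job_id)
except APIError as exc:
    # In beta mode a second schedule call used to fail with the error above;
    # the limit now applies to running jobs rather than pending ones.
    print('could not schedule: %s' % exc)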

0
Fixed
Sourour Alkahtib 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 3
Hi,

I was testing your scraping service and it looks great.
Unfortunately, for my use case, where I'm trying to scrape Arabic websites, the encoding returned is incorrect, and therefore I'm unable to read the scraped content.

Here is the site URL I scraped:
http://www.ar8ar.com/news/news.php?action=listnews...

and this is a screenshot of the resulting scraped output:



Thanks
Answer

Thanks for reporting. Even though the link no longer works, we believe this is fixed now. If not, please comment.

0
Answered
Oleg Tarasenko (Support Engineer) 3 years ago in Portia • updated by Andrés Pérez-Albela H. 3 years ago 1
We need directions on how to use Crawlera with autoscraping spiders. Can we have the steps for enabling Crawlera for autoscraping?
0
Answered
Oleg Tarasenko (Support Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 5

I want to suggest removing the "Confirm navigation" dialog triggered on Compare With pages. It's a bit irritating.

Answer
This is a safety measure that browsers apply to avoid closing a page with unsent form data. But here it does not seem to make sense. There is probably some data in a hidden form; in that case, the solution would be for Dash to avoid doing that and use another means to accomplish whatever it is trying to do.
0
Fixed
Bjk-Tribun 3 years ago in Portia • updated by Andrés Pérez-Albela H. 3 years ago 6
Some items are scraped in pages, but when I want to export them as CSV I get a blank page, although it generates JSON files, which are of no use to me.
This is the URL that is generated, but there are no CSV files:

https://storage.scrapinghub.com/items/1936/4/16?ap...
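
As a possible workaround, the storage Items API can also be asked for CSV output directly; a minimal sketch with the requests library, where the parameter names (format, fields) and the basic-auth usage are assumptions to verify against the Items API documentation, and the ids, field list and API key are placeholders:

import requests

url = 'https://storage.scrapinghub.com/items/1936/4/16'
params = {
    'format': 'csv',             # ask the Items API for CSV instead of JSON
    'fields': 'title,price,url'  # CSV export needs an explicit field list
}
resp = requests.get(url, params=params, auth=('YOUR_API_KEY', ''))
resp.raise_for_status()
with open('items.csv', 'wb') as f:
    f.write(resp.content)
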
+1
Answered
Talha 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 0
The type of page that I'm trying to scrape is like this:
http://www.domain.com/new-homes/generation/61778

I enter the start URL as: http://www.autotrader.co.uk/
and the follow patterns:
/new-homes/generation/[0-9]+/
/new-homes/generation/^[0-9]+$/
but the spider doesn't return any pages?
Answer
Hi Talha,

In addition to the suggestion below about regular expressions, you also need to ensure that the filters do not exclude the pages that lead to the products. Check this
section of the documentation: http://doc.scrapinghub.com/autoscraping.html#considerations-when-using-url-filters
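
To illustrate the regular-expression issue in the second follow pattern: ^ and $ anchor the start and end of the whole string, so they cannot appear in the middle of a pattern. A minimal sketch, assuming the follow patterns are applied as Python regular expressions searched against the full URL:

import re

url = 'http://www.domain.com/new-homes/generation/61778'

# Matches: the digits just need to appear after the path prefix.
print(bool(re.search(r'/new-homes/generation/[0-9]+', url)))     # True

# Never matches: ^ and $ anchor the whole string, so placing them in the
# middle of the pattern makes it impossible to satisfy.
print(bool(re.search(r'/new-homes/generation/^[0-9]+$/', url)))  # False
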
+4
Completed
Paul Tremberth (Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1
It would be great to have spider-level configurable alerts for periodic jobs.

For example, ask for an email to be sent if some condition is met (or not met), possibly compared to previous jobs:
- job crawled 0 items
- job crawled 50% fewer items than the previous N job(s)
- job is taking 200% longer than usual
- some stats counter(s) are too high
- etc.

Something à la Pingdom, maybe.
Email is one option; Sentry or Graphite events would be another.
Answer

This is supported through the powerful (although still very under-documented) Monitoring addon.
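
Independently of the addon, the kind of thresholds listed above are easy to express as a small client-side check; a minimal, library-agnostic sketch (the function name and thresholds are illustrative, and how you fetch the item counts is up to you):

def should_alert(current_items, previous_counts, drop_ratio=0.5):
    """Alert if the job crawled nothing, or crawled less than
    drop_ratio times the average of the previous jobs."""
    if current_items == 0:
        return True
    if not previous_counts:
        return False
    average = sum(previous_counts) / float(len(previous_counts))
    return current_items < average * drop_ratio

print(should_alert(40, [100, 110, 95]))  # True: well below half the recent average
print(should_alert(90, [100, 110, 95]))  # False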

0
Answered
Deryk Wenaus 3 years ago in Portia • updated by Pablo Hoffman (Director) 1 year ago 10

I want to get all the external URLs on a site.

Answer
Hi Deryk,

I have uploaded the package to your project. In order to use it, you would need to enable the middleware with:

LINKSEXTRACTOR_ENABLED = 1
LINKSEXTRACTOR_EXCLUDE_PATTERNS = <some patterns to exclude>
0
Completed
Deryk Wenaus 3 years ago in Portia • updated by Pablo Hoffman (Director) 1 year ago 1
With sites that are not coded semantically, it's often difficult to select the piece you want. Would it be possible to have sibling CSS selectors such as element + element2?
Answer

Portia supports CSS selectors now.
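
A quick way to check what an adjacent-sibling selector will match is to try it with parsel, the CSS/XPath selector library used by Scrapy; the markup below is made up for illustration:

from parsel import Selector

html = '<div><h2>Price</h2><p>100 USD</p><p>unrelated</p></div>'
sel = Selector(text=html)

# "h2 + p" selects only the <p> that immediately follows an <h2> sibling.
print(sel.css('h2 + p::text').extract_first())  # '100 USD'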

+1
Under review
Paul Tremberth (Engineer) 3 years ago in Scrapy Cloud • updated by anonymous 3 years ago 1
In the periodic jobs view, it would be handy to see the current UTC time,
and also maybe the jobs that are about to start, if any.
+2
Under review
Rolando Espinoza (Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 6 months ago 2

It would be nice to have an additional field in the custom settings (project-wide, per-spider, etc) to add a comment about why that specific setting was added.


For example: "Auto throttle disabled because the spider does broad crawls and scrapes few pages within a single domain."

0
Answered
Paul 3 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 3 years ago 0
In the spider's log, when I click on the link below, the page that opens says 'Unauthorised'. Why is this?

HubStorage: writing items to http://storage.scrapinghub.com/items/1891/1/1
Answer
Hi Paul,
It is just debug information and is not intended to be opened directly. It is rendered as a link in the log (because it is a URL), but it is simply the API URL the job uses to store items; calling it requires additional parameters (for instance, your API key).
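
If you do want to fetch the stored items yourself, the same URL can be requested with your API key; a minimal sketch with the requests library, assuming the storage API accepts the key as the HTTP basic-auth username (the key below is a placeholder):

import requests

# Same endpoint as in the log line, called with an API key.
resp = requests.get('https://storage.scrapinghub.com/items/1891/1/1',
                    auth=('YOUR_API_KEY', ''))
resp.raise_for_status()
print(resp.text)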


0
Completed
Ron Johnson 3 years ago in Portia • updated by Tomas Rinke (Support Engineer) 8 months ago 1
The ability to click a button in the autospider screen and generate a spider with the same settings would be useful when you need a spider that crawls the same or similar pages but requires different template settings or additional excluded/included URL patterns.
Answer

Hi,

  • On Portia V1.0 this is already implemented from the project menu.

  • On Portia V2.0 it is planned to be implemented in two phases: first via the API in the next Portia release, and later via the User Interface.


+2
Completed
Ron Johnson 3 years ago in Portia • updated by Pablo Hoffman (Director) 1 year ago 1
Once a spider has been named, there is no option to go in and rename it. This would be useful during development, as spiders' functions and roles change throughout the development process.
Answer

Portia supports renaming spiders.


(The original question was about Autoscraping, Portia's predecessor product.)