Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Aysc 2 weeks ago in Splash 0

I run a Splash instance at Scrapinghub.


I am receiving a 400 error code when I try to call the select method:

local element = splash:select('#checkbox2')


I have checked the Splash documentation, but could not find the source of this error. Can you help me?


WARNING: Bad request to Splash: {u'info': {u'line_number': 86, u'message': u'Lua error: [string "..."]:86: attempt to call method \'select\' (a nil value)', u'type': u'LUA_ERROR', u'source': u'[string "..."]', u'error': u"attempt to call method 'select' (a nil value)"}, u'type': u'ScriptError', u'description': u'Error happened while executing Lua script', u'error': 400}

0
Waiting for Customer
Dege 2 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 4 hours ago 7

Using the examples you provided in the documentation:

https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-casperjs-phantomjs-and-spookyjs


wget https://doc.scrapinghub.com/_downloads/crawlera-ca.crt

wget https://github.com/ariya/phantomjs/raw/master/examples/rasterize.js

phantomjs --ssl-protocol=any --proxy="proxy.crawlera.com:8010" --proxy-auth="<API KEY>:''" --ssl-client-certificate-file=crawlera-ca.crt rasterize.js https://twitter.com twitter.jpg


I get the error message: Unable to load the address!


Activating PhantomJS debugging (--debug=yes), you can isolate the error:

SSL Error: "The issuer certificate of a locally looked up certificate could not be found"


An error that can be bypassed with the parameter --ignore-ssl-errors=true,

which unfortunately causes another issue:

Resource request error: QNetworkReply::NetworkError(ProxyAuthenticationRequiredError) ( "Proxy requires authentication" )


The final command to replicate the issue is:

phantomjs --debug=yes --ignore-ssl-errors=true --ssl-protocol=any --proxy="proxy.crawlera.com:8010" --proxy-auth="<API KEY>:''" --ssl-client-certificate-file=crawlera-ca.crt rasterize.js https://twitter.com twitter.jpg


My PhantomJS version is 2.1.1 (the latest one)
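
To rule out the API key or the certificate themselves, the same proxy can also be exercised outside PhantomJS; a rough sketch with python-requests (assuming the requests library and the same crawlera-ca.crt file):

import requests

# Same key and certificate as in the PhantomJS command above; if this prints 200,
# the credentials and CA file are fine and the problem is PhantomJS-specific.
proxies = {
    "http": "http://<API KEY>:@proxy.crawlera.com:8010",
    "https": "http://<API KEY>:@proxy.crawlera.com:8010",
}
response = requests.get("https://twitter.com", proxies=proxies, verify="crawlera-ca.crt")
print(response.status_code)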


0
Answered
simon.nizov 2 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 weeks ago 5

Hi!

I'm trying to deploy my spider to Scrapy Cloud using shub, but I keep running into the following error:


$ shub deploy

Packing version 2df64a0-master

Deploying to Scrapy Cloud project "164526"

Deploy log last 30 lines:

---> Using cache

---> 55d64858a2f3

Step 11 : RUN mkdir /app/python && chown nobody:nogroup /app/python

---> Using cache

---> 2ae4ff90489a

Step 12 : RUN sudo -u nobody -E PYTHONUSERBASE=$PYTHONUSERBASE pip install --user --no-cache-dir -r /app/requirements.txt

---> Using cache

---> 51f233d54a01

Step 13 : COPY *.egg /app/

---> e2aa1fc31f89

Removing intermediate container 5f0a6cb53597

Step 14 : RUN if [ -d "/app/addons_eggs" ]; then rm -f /app/*.dash-addon.egg; fi

---> Running in 3a2b2bbc1a73

---> af8905101e32

Removing intermediate container 3a2b2bbc1a73

Step 15 : ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

---> Running in ccffea3009a4

---> b4882513b76e

Removing intermediate container ccffea3009a4

Successfully built b4882513b76e

>>> Checking python dependencies

scrapinghub 1.9.0 has requirement six>=1.10.0, but you have six 1.7.3.

monkeylearn 0.3.5 has requirement requests>=2.8.1, but you have requests 2.3.0.

monkeylearn 0.3.5 has requirement six>=1.10.0, but you have six 1.7.3.

hubstorage 0.23.6 has requirement six>=1.10.0, but you have six 1.7.3.

Warning: Pip checks failed, please fix the conflicts.

Process terminated with exit code 1, signal None, status=0x0100

{"message": "Dependencies check exit code: 193", "details": "Pip checks failed, please fix the conflicts", "error": "requirements_error"}

{"message": "Requirements error", "status": "error"}

Deploy log location: /var/folders/w0/5w7rddxn28l2ywk5m6jwp7380000gn/T/shub_deploy_xi_w3xx8.log

Error: Deploy failed: b'{"message": "Requirements error", "status": "error"}'



It looks like a simple problem of an outdated package (six). However, the installed package actually IS up to date:


$ pip show six

Name: six

Version: 1.10.0

Summary: Python 2 and 3 compatibility utilities

Home-page: http://pypi.python.org/pypi/six/

Author: Benjamin Peterson

Author-email: benjamin@python.org

License: MIT

Location: /Users/mac/.pyenv/versions/3.6.0/lib/python3.6/site-packages

Requires:


My requirements.txt file only contains the following dependency:

newspaper==0.0.9.8

Adding six==1.10.0 to it throws:

newspaper 0.0.9.8 has requirement six==1.7.3, but you have six 1.10.0.

monkeylearn 0.3.5 has requirement requests>=2.8.1, but you have requests 2.3.0.



I'm running python 3.6 through pyenv on a Mac.

Any ideas?


Thanks,

Simon!

Answer

If you are using Python 3, you'll need to use the "newspaper3k" branch; "newspaper" on PyPI is the Python 2 branch.
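
For example, a requirements.txt along these lines should clear the six/requests conflicts on a Python 3 stack (a sketch; pin whatever newspaper3k version you actually test with):

# requirements.txt (sketch for a Python 3 stack)
newspaper3k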

0
Answered
Aysc 2 weeks ago in Splash • updated 2 weeks ago 2

I am trying to figure out how to configure Crawlera with Splash. I will not use Splash for every request, but I want to use Crawlera for every request. I have created a test spider that scrapes http://httpbin.org/ip, and I am getting a response with a new IP on every spider run for both Splash and regular requests, so it seems to be working okay. But I have not used the method explained in the Crawlera docs, which uses a Lua script with the Splash /execute endpoint.

https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash


My question: Am I doing something wrong? Should I use the Splash /execute endpoint for the proxy settings? Should I change anything in my settings file?


Here is my settings file:


SPLASH_URL = 'my scrapinghub splash instance url'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610
}
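# Note: as written, this second DOWNLOADER_MIDDLEWARES assignment replaces the first
# one above, so the scrapy_splash middlewares are not registered; to enable Splash and
# Crawlera together, merge the entries into a single dict.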

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'my crawlera api key'


And here is my test spider:


import scrapy
from scrapy_splash import SplashRequest

from base64 import b64encode
encoded_key = b64encode('my scrapinghub splash api key:')


class IpSpider(scrapy.Spider):
    name = 'ip'

    def start_requests(self):
        url = 'http://httpbin.org/ip'
        yield SplashRequest(url, self.parse_splash,
                            args={'wait': 0.5},
                            splash_headers={'Authorization': 'Basic ' + encoded_key})
        yield scrapy.Request(url=url, callback=self.parse)

    def parse_splash(self, response):
        print 'SPLASH'
        print response.text

    def parse(self, response):
        print 'REQUEST crawlera'
        print response.text


Answer

Hey Aysc,

Using Splash with Crawlera can be difficult; the best approach is shown in this article:
http://help.scrapinghub.com/splash/using-crawlera-with-splash
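
Roughly, the approach in that article routes every request Splash makes through Crawlera via a small Lua script sent to the /execute endpoint. A simplified sketch (the full script with header and session handling is in the article; the spider name here is just illustrative):

import scrapy
from scrapy_splash import SplashRequest

# Minimal Lua script: register Crawlera as the proxy for every request Splash makes.
LUA_SOURCE = """
function main(splash)
    local host = 'proxy.crawlera.com'
    local port = 8010
    splash:on_request(function(request)
        request:set_proxy{host, port, username=splash.args.crawlera_user, password=''}
    end)
    splash:go(splash.args.url)
    return splash:html()
end
"""

class CrawleraViaSplashSpider(scrapy.Spider):
    name = 'crawlera_via_splash'

    def start_requests(self):
        # Use the /execute endpoint with the script instead of relying on
        # CrawleraMiddleware for Splash requests.
        yield SplashRequest(
            'http://httpbin.org/ip',
            self.parse,
            endpoint='execute',
            args={'lua_source': LUA_SOURCE, 'crawlera_user': 'my crawlera api key'},
        )

    def parse(self, response):
        self.logger.info(response.text)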

If you are still experiencing issues, feel free to reach out to us and request a free quote:

https://scrapinghub.com/quote

Our developers can help you retrieve your data with a huge saving of time and resources.

Kind regards!

Pablo

0
Answered
trustyao 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

There's an error in my jobs.

------------------

Unhandled Error

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
        self.crawler_process.start()
      File "/usr/local/lib/python3.5/site-packages/scrapy/crawler.py", line 280, in start
        reactor.run(installSignalHandlers=False)  # blocking call
      File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1194, in run
        self.mainLoop()
      File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1203, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 121, in _next_request
        if not self._next_request_from_scheduler(spider):
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 148, in _next_request_from_scheduler
        request = slot.scheduler.next_request()
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 71, in next_request
        request = self._dqpop()
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 101, in _dqpop
        d = self.dqs.pop()
      File "/usr/local/lib/python3.5/site-packages/queuelib/pqueue.py", line 43, in pop
        m = q.pop()
      File "/usr/local/lib/python3.5/site-packages/scrapy/squeues.py", line 19, in pop
        s = super(SerializableQueue, self).pop()
      File "/usr/local/lib/python3.5/site-packages/queuelib/queue.py", line 160, in pop
        self.f.seek(-self.SIZE_SIZE, os.SEEK_END)
       builtins.OSError: [Errno 122] Disk quota exceeded

How can I solve it? What causes this error? Do I need to add a unit to this job?




Answer

Hi Trustyao, if your job exceeds 1 GB of disk capacity, you should add more Scrapy Cloud units.

Remember that the first unit purchased upgrades your existing one; to have two Scrapy Cloud units, you have to purchase two.

Kind regards,

Pablo

0
Answered
trustyao 3 weeks ago in Scrapy Cloud • updated 1 week ago 3

I have read this topic. https://support.scrapinghub.com/topics/708-api-for-periodic-jobs/

But when I run this on the command line, it returns some errors. How should I use it? Thanks

------------


E:\QA>curl -X POST -u ffffffffffffffffffffffffffffffff: "http://dash.scrapinghub.com/api/periodic_jobs?project=111111" -d '{"hour": "0", "minutes_shift": "0", "month": "*", "spiders": [{"priority": "2", "args": {}, "name": "myspider"}], "day": "*"}'

{"status": "badrequest", "message": "No JSON object could be decoded"}curl: (6)
Could not resolve host: 0,
curl: (6) Could not resolve host: minutes_shift
curl: (6) Could not resolve host: 0,
curl: (6) Could not resolve host: month
curl: (6) Could not resolve host: *,
curl: (6) Could not resolve host: spiders
curl: (3) [globbing] bad range specification in column 2
curl: (6) Could not resolve host: 2,
curl: (6) Could not resolve host: args
curl: (3) [globbing] empty string within braces in column 2
curl: (6) Could not resolve host: name
curl: (3) [globbing] unmatched close brace/bracket in column 12
curl: (6) Could not resolve host: day

curl: (3) [globbing] unmatched close brace/bracket in column 2
Answer

Hi Trustyao,

To set up periodic jobs [officially supported], please check this article:
http://help.scrapinghub.com/scrapy-cloud/periodic-jobs
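
For reference, the curl errors above come from the Windows shell: cmd.exe does not treat single quotes as quoting characters, so the JSON body gets split into separate arguments (hence "Could not resolve host: minutes_shift" and friends). If you do want to call the API directly, the same request is easier to get right from Python; a sketch reusing the key, project ID and payload from your command:

import requests

payload = {
    "hour": "0",
    "minutes_shift": "0",
    "month": "*",
    "day": "*",
    "spiders": [{"priority": "2", "args": {}, "name": "myspider"}],
}

# HTTP basic auth: the API key is the username, the password is empty.
response = requests.post(
    "http://dash.scrapinghub.com/api/periodic_jobs?project=111111",
    auth=("ffffffffffffffffffffffffffffffff", ""),
    json=payload,
)
print(response.status_code, response.text)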

Kind regards,

Pablo

0
Answered
Dante 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I am trying to parse a website with Portia. I can extract all fields, but I cannot extract the email even though it is viewable in the browser. When creating fields with Portia, it returns "email protected" for this field. What can I do? Is there any command/regular expression I can put on the particular field?

PS: I have no idea about programming. Copy-paste only, if that helps :D


It has Cloudflare protection, I guess. I've seen the scripts in the Chrome console.

Answer

Yes, for some sites or fields, Portia can't retrieve data past their security policies, especially anti-bot protection. Please check here for more information: http://help.scrapinghub.com/portia/troubleshooting-portia
Kind regards,

Pablo


0
Answered
cangokalp 3 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 2 days ago 1

Hi,


For some URLs my log shows ['partial'] at the end,


and when I googled it I found this on Stack Overflow:


You're seeing ['partial'] in your logs because the server at vallenproveedora.com.mx doesn't set the Content-Length header in its response; run curl -I to see for yourself. For more detail on the cause of the partial flag


Is this something on Crawlera's end, and how can I fix it?


Best,


Can

Answer

Hello,


The 419s were actually 429 error codes (now correctly displayed as 429), which means that the concurrent connection limit of your Crawlera plan was exceeded. For other possible error codes, please see: https://doc.scrapinghub.com/crawlera.html#errors
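
As for the ['partial'] flag itself: it generally just means the server did not send a Content-Length header (or closed the connection early), so Scrapy cannot confirm it received the full body; it comes from the target site rather than from Crawlera. The check the Stack Overflow answer suggests with curl -I can also be done from Python (a sketch with the requests library, using the site named in the quote):

import requests

# A HEAD request is enough to inspect the headers the server sends back.
response = requests.head("http://vallenproveedora.com.mx", allow_redirects=True)
print(response.headers.get("Content-Length"))  # None here usually means ['partial'] in Scrapy logs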

0
Answered
michelnegrao 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

Hi. I'm following this guide: http://help.scrapinghub.com/portia/using-portia-20-the-complete-beginners-guide, and got stuck at the "new sample" step on the start page. It's not showing. The options on the left are PROJECT, SPIDER and DATA FORMAT.





Answer

Hi Michel,

The New Sample button is shown after you create a new spider. It should appear at the top of the Portia browser.

Kind regards,

Pablo

0
Answered
Konstantin 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 17 hours ago 1

We get an error when deploying our spiders to Scrapy Cloud.


Error log:


Login succeeded

Pulling a base image
Pulling repository docker.io/scrapinghub/scrapinghub-stack-scrapy
{"message": "Could not reach any registry endpoint", "details": {"message": "Could not reach any registry endpoint"}, "error": "build_error"}

It stopped working today, but yesterday everything was fine. We use the 'hworker' stack, but with the 'scrapy-1.1' stack we get the same error.


Please help.

Answer

Hey Konstantin,


I saw successful deploys on your projects. Please use the same config for future deploys.


Kind regards,


Pablo Vaz

Support Team

0
Answered
EdoPut 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 17 hours ago 3

It seems that the Scrapinghub platform uses some packages I would also like to use (botocore), and we depend on different versions.


This is the output of my build log from the deploy section: even though I am not using the awscli package, it is still listed as used, and it conflicts with my project dependencies.


This is something I can try to track down myself with some pip magic, but it would be really nice to know before deploying that I have a conflict with the platform requirements.


I created a stub project to expose this behaviour so that you can reproduce it.


Obviously the best solution would be to isolate platform dependencies from project dependencies.

Answer

Hey EdoPut!


Our engineer Rolando Espinoza suggests checking our latest requirements file here:
https://github.com/scrapinghub/scrapinghub-stack-scrapy/blob/branch-1.1-py3/requirements.txt


Hope this is useful!


Best,

Pablo Vaz

0
Answered
ijoin 4 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

Hi,


I'm looking for help.

My spider could crawl and scrape items before, but now it doesn't work any more.

Any idea how this can happen?

The URL is https://goo.gl/vnPJNL


I remade the spider, but it can only scrape the description; the other fields like name, price, and brand are not scraped.

*It worked before.*

0
Answered
info 4 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

hi there,


When I use Portia on e.g. http://www.boxofficemojo.com/movies/?id=theaccountant.htm to extract anything from the page, there's nothing in the extracted data and "Extracting data..." just keeps spinning.


I don't have issues on other sites - it seems to be site-specific. I also tried different selectors without any luck...


Hope you can shed some light on my problem :)

cheers

Answer

Hey Info!

Some sites are really hard to crawl using Portia due to site complexity and scripts (some related to security policies).

Check this: https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/

Try to enable JS for those pages that need to be crawled and use CSS extractors if needed.

If you are interested in setting up a more powerful crawler using Scrapy, our engineers can help you. Don't hesitate to ask for a free quote through: https://scrapinghub.com/quote


Kind regards,

Pablo

0
Answered
LuckyDucky 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

I've added more servers but I'm not seeing an increase in per minute throughput. Does Crawlera throttle?

Answer

Hi LuckyDucky,

Please check your configuration settings; you can also configure throttling there. Take a few minutes to explore these articles:

http://help.scrapinghub.com/crawlera/crawlera-best-practices

and
http://help.scrapinghub.com/scrapy-cloud/addons/auto-throttle-addon

For more information and further settings, it is always useful to check:

https://doc.scrapinghub.com/crawlera.html#crawlera-api
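
As a starting point, the Crawlera documentation recommends letting Crawlera manage request pacing and raising Scrapy's concurrency up to your plan's limit; a sketch of the usual settings (adjust the numbers to your plan):

# settings.py - settings commonly recommended when routing Scrapy through Crawlera
AUTOTHROTTLE_ENABLED = False          # let Crawlera handle request pacing
CONCURRENT_REQUESTS = 32              # raise towards your plan's concurrency limit
CONCURRENT_REQUESTS_PER_DOMAIN = 32
DOWNLOAD_TIMEOUT = 600                # Crawlera responses can be slow; avoid premature timeouts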

Kind regards and happy weekend,

Pablo

0
Answered
Francois 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Hi,


I wrote a CasperJS script; it worked with other proxies (an IP list and/or Scrapoxy).

The script navigates the website because I need JavaScript interaction.


I installed Crawlera and it works perfectly, but Casper doesn't succeed in clicking my first link:


` CasperError: Cannot dispatch mousedown event on nonexistent selector: #hlb-view-cart-announce`


Can you explain why, and how can I fix that?

Answer

Hi Francois,


To learn more about how to set up Crawlera with CasperJS, please visit:
https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-casperjs-phantomjs-and-spookyjs


If interested, please consider requesting a free quote for professional assistance: https://scrapinghub.com/quote. Our developers can help you set up and deploy your projects, solving all script issues for you.

Kind regards,

Pablo