Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
tofunao1 yesterday at 10:04 p.m. in Scrapy Cloud 0

I want to download a website which uses AJAX and JS, so I use Selenium and PhantomJS in Scrapy. It runs successfully on my local PC, but when I upload it to Scrapinghub, it stops with some errors.

How can I solve this error, or how else can I download the JS website? Thanks.
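
For reference, a common way to wire Selenium with PhantomJS into a Scrapy project is a small downloader middleware; the sketch below only illustrates that general pattern, and the module path, class name, and middleware priority are placeholders rather than anything taken from this project:

    # settings.py (illustrative):
    # DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class SeleniumMiddleware(object):
        """Render JS-heavy pages with PhantomJS before handing them to the spider."""

        def __init__(self):
            # PhantomJS must be installed and on PATH in the environment running the job
            self.driver = webdriver.PhantomJS()

        def process_request(self, request, spider):
            self.driver.get(request.url)
            return HtmlResponse(self.driver.current_url,
                                body=self.driver.page_source,
                                encoding='utf-8', request=request)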



0
Answered
simon.nizov yesterday at 8:50 a.m. in Scrapy Cloud • updated yesterday at 11:12 a.m. 2

Hi,

Is it possible to limit a job's runtime? My spider's runtime can change drastically depending on its arguments, and at some point I'd rather the job just stop and continue to the next one.


Thanks!

Simon.
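
For reference, Scrapy itself ships with the CloseSpider extension, which can stop a spider after a fixed number of seconds through the CLOSESPIDER_TIMEOUT setting. A minimal sketch (the spider name and the one-hour value are illustrative):

    # settings.py: close the spider gracefully after one hour of runtime
    CLOSESPIDER_TIMEOUT = 3600

    # or per spider, via custom_settings:
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'  # illustrative name
        custom_settings = {'CLOSESPIDER_TIMEOUT': 3600}

On Scrapy Cloud the same setting can presumably also be added per spider through the dashboard's spider settings, as mentioned elsewhere on this forum.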

0
Fixed
Roney Hossain 2 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) yesterday at 5:46 a.m. 4

I was testing the new Portia beta. When I run a spider it always fails, and the error message is: "[root] Script initialization failed : IOError: [Errno 2] No such file or directory: 'project-slybot.zip'"

Answer

The issue has been fixed; you can now run jobs from the dashboard.

0
Answered
Pedro Sousa 2 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 days ago 1

I have a Rails project feeding on a scheduled Scrapy Cloud spider. I need the job to run every single day. Is there a way for me to know if the data I'm requesting from the API is recent? That is, is there a way to fetch the last-updated date of the data or, alternatively, to receive a warning from Scrapy Cloud if something goes wrong with the job?

Answer

Hello Pedro,


You can use this API call to fetch the latest job data: http://help.scrapinghub.com/scrapy-cloud/fetching-latest-spider-data. For more information on the API please see: https://doc.scrapinghub.com/scrapy-cloud.html#scrapycloud
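
As a rough illustration of that help article, the python-scrapinghub client can look up the most recent finished job and read its data; this is only a sketch assuming the python-scrapinghub package is installed, and the API key, project ID, and spider name below are placeholders:

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient('YOUR_API_KEY')   # placeholder API key
    project = client.get_project(12345)          # placeholder project ID

    # Summary of the most recent finished job for a given spider
    latest = next(project.jobs.iter(spider='myspider', state='finished', count=1))
    job = client.get_job(latest['key'])

    print(job.metadata.get('finished_time'))     # when the job finished (epoch milliseconds)
    for item in job.items.iter():                # the scraped data itself
        print(item)

From Rails, the same lookup can be done over plain HTTP against the endpoints described in the documentation linked above.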

0
Answered
Rubhan 5 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 days ago 3

I am trying to deploy a Scrapy app on Scrapy Cloud, but after deploying, the spider is not available to run. After the deploy I get this status:
{"project": 169397, "version": "f79741f-master", "spiders": 0, "status": "ok"}
Running scrapy list locally does show the spider.

Answer

Thanks for the feedback!

Basic deployment does not need a requirements.txt unless you need to deploy libraries that are not available in Scrapy Cloud. You can find that information in the Deploying Dependencies section of the shub documentation: https://shub.readthedocs.io/en/latest/deploying.html#deploying-dependencies
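
If you do need extra libraries, the Deploying Dependencies page linked above essentially boils down to pointing scrapinghub.yml at your requirements file; a minimal sketch (the project ID is a placeholder, and the exact keys may differ slightly between shub versions):

    # scrapinghub.yml
    projects:
      default: 12345
    requirements:
      file: requirements.txt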

0
Answered
seschulz 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Hello,

I am new to the scraping universe and would like to ask which scraping method/product would suit my needs best: I would like to find out from forums and social platforms whether certain brands are being discussed there at all, and if so, with which adjectives they are mentioned. Am I on the right track trying to solve this with scraping? Thanks a lot for your help.


Seb

Answer

Hi Seschulz,

Your project seems very interesting, and I think there's no single simple answer to your inquiry.

Currently we have two options to offer: use Portia, or deploy your own Scrapy projects.


Portia is our open-source visual scraper. It is made for scraping simple sites, learning the beautiful "art of crawling", and handling small and medium-sized projects with almost no programming knowledge.

You can start by reading: http://help.scrapinghub.com/portia/using-portia-20-the-complete-beginners-guide

Scrapy is a powerful crawler developed and maintained by our founders. It requires more programming knowledge, but the results can be outstanding. The best part is that you can deploy your projects for free on our platform and automate them in a very nice and easy way. I suggest you start with the tutorial:

https://doc.scrapy.org/en/latest/intro/tutorial.html#scrapy-tutorial
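
To give a feel for what that tutorial leads to, a bare-bones spider collecting posts that mention a brand could look roughly like this; the domain, selectors, and brand name below are purely illustrative:

    import scrapy

    class BrandMentionSpider(scrapy.Spider):
        name = 'brand_mentions'                       # illustrative
        start_urls = ['https://forum.example.com/']   # illustrative forum URL

        def parse(self, response):
            # Collect post text that mentions the brand
            for post in response.css('div.post::text').extract():
                if 'ExampleBrand' in post:
                    yield {'url': response.url, 'text': post.strip()}
            # Follow links to keep crawling the forum
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)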

If you need further assistance, you can always hire our experts or ask for our datasets service:

https://scrapinghub.com/quote

They can save you a huge amount of time and resources.


Kind regards,

Pablo

0
Not a bug
robi9011235 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Hi there. I always get this error when trying to run my spider.

[twisted] Unhandled error in Deferred

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/src/slybot/slybot/slybot/spidermanager.py", line 54, in __init__
    **kwargs)
  File "/src/slybot/slybot/slybot/spider.py", line 58, in __init__
    settings, spec, item_schemas, all_extractors)
  File "/src/slybot/slybot/slybot/spider.py", line 226, in _configure_plugins
    self.logger)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 89, in setup_bot
    self.extractors.append(SlybotIBLExtractor(list(group)))
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/extractors.py", line 61, in __init__
    for p, v in zip(parsed_templates, template_versions)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/extractors.py", line 70, in build_extraction_tree
    basic_extractors = ContainerExtractor.apply(template, basic_extractors)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/container_extractors.py", line 65, in apply
    extraction_tree = cls._build_extraction_tree(containers)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/container_extractors.py", line 144, in _build_extraction_tree
    parent = containers[parent_id]
exceptions.KeyError: u'5e88-4205-901a#parent'
Answer

Hi Robi, please review the fields you are trying to scrape.

I was able to replicate a similar spider for the site you are trying to scrape using Portia, and after 2 minutes more than 300 items had been scraped successfully.

Kind regards,

Pablo

0
Answered
trustyao 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 2

I am in China now, and China uses the GMT+8 time zone.

Every time I log in to Scrapinghub, it uses GMT, which puzzles me.

So I want to change the timezone from GMT to GMT+8. How can I do that?

Thanks

Answer

Hi Trustyao,

Our platform displays all times in UTC, and at the moment this can't be changed. But it's a good suggestion.

Kind regards,

Pablo
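
As a client-side workaround, the UTC timestamps shown by the platform can be converted to GMT+8 locally; a minimal Python sketch (the example timestamp is made up):

    from datetime import datetime, timedelta, timezone

    utc_time = datetime(2017, 3, 1, 2, 30, tzinfo=timezone.utc)  # example UTC timestamp
    cst = timezone(timedelta(hours=8))                           # GMT+8 (China Standard Time)
    print(utc_time.astimezone(cst))                              # 2017-03-01 10:30:00+08:00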

0
Answered
simon.nizov 2 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 weeks ago 5

Hi!

I'm trying to deploy my spider to Scrapy Cloud using shub, but I keep running into the following error:


$ shub deploy

Packing version 2df64a0-master

Deploying to Scrapy Cloud project "164526"

Deploy log last 30 lines:

---> Using cache

---> 55d64858a2f3

Step 11 : RUN mkdir /app/python && chown nobody:nogroup /app/python

---> Using cache

---> 2ae4ff90489a

Step 12 : RUN sudo -u nobody -E PYTHONUSERBASE=$PYTHONUSERBASE pip install --user --no-cache-dir -r /app/requirements.txt

---> Using cache

---> 51f233d54a01

Step 13 : COPY *.egg /app/

---> e2aa1fc31f89

Removing intermediate container 5f0a6cb53597

Step 14 : RUN if [ -d "/app/addons_eggs" ]; then rm -f /app/*.dash-addon.egg; fi

---> Running in 3a2b2bbc1a73

---> af8905101e32

Removing intermediate container 3a2b2bbc1a73

Step 15 : ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

---> Running in ccffea3009a4

---> b4882513b76e

Removing intermediate container ccffea3009a4

Successfully built b4882513b76e

>>> Checking python dependencies

scrapinghub 1.9.0 has requirement six>=1.10.0, but you have six 1.7.3.

monkeylearn 0.3.5 has requirement requests>=2.8.1, but you have requests 2.3.0.

monkeylearn 0.3.5 has requirement six>=1.10.0, but you have six 1.7.3.

hubstorage 0.23.6 has requirement six>=1.10.0, but you have six 1.7.3.

Warning: Pip checks failed, please fix the conflicts.

Process terminated with exit code 1, signal None, status=0x0100

{"message": "Dependencies check exit code: 193", "details": "Pip checks failed, please fix the conflicts", "error": "requirements_error"}

{"message": "Requirements error", "status": "error"}

Deploy log location: /var/folders/w0/5w7rddxn28l2ywk5m6jwp7380000gn/T/shub_deploy_xi_w3xx8.log

Error: Deploy failed: b'{"message": "Requirements error", "status": "error"}'



It looks like a simple problem of an outdated package (six). However, the installed package actually IS up to date:


$ pip show six

Name: six

Version: 1.10.0

Summary: Python 2 and 3 compatibility utilities

Home-page: http://pypi.python.org/pypi/six/

Author: Benjamin Peterson

Author-email: benjamin@python.org

License: MIT

Location: /Users/mac/.pyenv/versions/3.6.0/lib/python3.6/site-packages

Requires:


My requirements.txt file only contains the following dependency:

newspaper==0.0.9.8

Adding six==1.10.0 to it throws:

newspaper 0.0.9.8 has requirement six==1.7.3, but you have six 1.10.0.

monkeylearn 0.3.5 has requirement requests>=2.8.1, but you have requests 2.3.0.



I'm running Python 3.6 through pyenv on a Mac.

Any ideas?


Thanks,

Simon!

Answer

If you are using Python 3, you'll need to use the "newspaper3k" branch; "newspaper" on PyPI is the Python 2 branch.
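
In practice that usually means swapping the entry in requirements.txt, roughly like this (left unpinned here on purpose, since the right version depends on your project):

    # requirements.txt
    # newspaper==0.0.9.8   <- Python 2 only
    newspaper3k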

0
Answered
trustyao 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

There's an error in my jobs.

------------------

Unhandled Error

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
        self.crawler_process.start()
      File "/usr/local/lib/python3.5/site-packages/scrapy/crawler.py", line 280, in start
        reactor.run(installSignalHandlers=False)  # blocking call
      File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1194, in run
        self.mainLoop()
      File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1203, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 121, in _next_request
        if not self._next_request_from_scheduler(spider):
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 148, in _next_request_from_scheduler
        request = slot.scheduler.next_request()
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 71, in next_request
        request = self._dqpop()
      File "/usr/local/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 101, in _dqpop
        d = self.dqs.pop()
      File "/usr/local/lib/python3.5/site-packages/queuelib/pqueue.py", line 43, in pop
        m = q.pop()
      File "/usr/local/lib/python3.5/site-packages/scrapy/squeues.py", line 19, in pop
        s = super(SerializableQueue, self).pop()
      File "/usr/local/lib/python3.5/site-packages/queuelib/queue.py", line 160, in pop
        self.f.seek(-self.SIZE_SIZE, os.SEEK_END)
       builtins.OSError: [Errno 122] Disk quota exceeded

How can I solve it? What causes this error? Do I need to add a unit to this job?




Answer

Hi Trustyao, if your job exceeds 1 GB of capacity, you should add more Scrapy Cloud units.

Remember that the first unit purchased upgrades your existing one; to have two units, you have to purchase two.

Kind regards,

Pablo

0
Answered
trustyao 3 weeks ago in Scrapy Cloud • updated 1 week ago 3

I have read this topic. https://support.scrapinghub.com/topics/708-api-for-periodic-jobs/

But when I run this on the command line, it returns some errors. How should I use it? Thanks

------------


E:\QA>curl -X POST -u ffffffffffffffffffffffffffffffff: "http://dash.scrapinghub.com/api/periodic_jobs?project=111111" -d '{"hour": "0", "minutes_shift": "0", "month": "*", "spiders": [{"priority": "2", "args": {}, "name": "myspider"}], "day": "*"}'

{"status": "badrequest", "message": "No JSON object could be decoded"}curl: (6)
Could not resolve host: 0,
curl: (6) Could not resolve host: minutes_shift
curl: (6) Could not resolve host: 0,
curl: (6) Could not resolve host: month
curl: (6) Could not resolve host: *,
curl: (6) Could not resolve host: spiders
curl: (3) [globbing] bad range specification in column 2
curl: (6) Could not resolve host: 2,
curl: (6) Could not resolve host: args
curl: (3) [globbing] empty string within braces in column 2
curl: (6) Could not resolve host: name
curl: (3) [globbing] unmatched close brace/bracket in column 12
curl: (6) Could not resolve host: day

curl: (3) [globbing] unmatched close brace/bracket in column 2
Answer

Hi Trustyao,

To set up periodic jobs (officially supported), please check this article:
http://help.scrapinghub.com/scrapy-cloud/periodic-jobs

Kind regards,

Pablo
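
For what it's worth, the curl errors above ("Could not resolve host: minutes_shift" and so on) are what curl prints when the JSON body gets split on spaces, which typically happens on Windows because cmd.exe does not treat single quotes as quoting characters. If you do call the endpoint from the question, doing the POST from Python sidesteps the quoting issue entirely; a minimal sketch reusing the URL and payload from the question (the API key and project ID are placeholders):

    import requests

    payload = {
        "hour": "0", "minutes_shift": "0", "month": "*", "day": "*",
        "spiders": [{"priority": "2", "args": {}, "name": "myspider"}],
    }

    resp = requests.post(
        "http://dash.scrapinghub.com/api/periodic_jobs?project=111111",
        auth=("YOUR_API_KEY", ""),  # placeholder API key, empty password
        json=payload,               # requests serializes the body and sets the Content-Type
    )
    print(resp.status_code, resp.text)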

0
Answered
Konstantin 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 17 hours ago 1

We get an error when deploying our spiders to Scrapy Cloud.


Error log:


Login succeeded

Pulling a base image
Pulling repository docker.io/scrapinghub/scrapinghub-stack-scrapy
{"message": "Could not reach any registry endpoint", "details": {"message": "Could not reach any registry endpoint"}, "error": "build_error"}

It stopped working today, but yesterday everything was fine. We use the 'hworker' stack, but with the 'scrapy-1.1' stack we get the same error.


Please help.

Answer

Hey Konstantin,


I saw successful deploys on your projects. Please use the same config for future deploys.


Kind regards,


Pablo Vaz

Support Team

0
Answered
EdoPut 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 17 hours ago 3

It seems that the Scrapinghub platform uses some packages I would also like to use (botocore), and we depend on different versions.


This is the output of my build log from the deploy section; even though I am not using the awscli package, it is still listed as used, and it conflicts with my project's dependencies.


This is something I can try to track down myself with some pip magic, but it would be really nice to know before deploying that I have a conflict with the platform requirements.


I created a stub project to expose this behaviour so that you can reproduce it.


Obviously the best solution would be to isolate platform dependencies from project dependencies.

Answer

Hey EdoPut!


Our engineer Rolando Espinoza suggests checking our latest requirements file here:
https://github.com/scrapinghub/scrapinghub-stack-scrapy/blob/branch-1.1-py3/requirements.txt


Hope this is useful!


Best,

Pablo Vaz

0
Answered
trustyao 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 2

When I first tested my code locally, I got this error.

After searching on Google, I found I needed to upgrade the Twisted module; the spider then ran without any errors.

Now that I have deployed the project to Scrapinghub, it errors out as shown below. How can I solve it? I have added 'twisted' to the requirements.

Thanks

-------------------


[scrapy.core.downloader.handlers] Loading "scrapy.core.downloader.handlers.ftp.FTPDownloadHandler" for scheme "ftp"


Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 44, in _get_handler
    dhcls = load_object(path)
  File "/usr/local/lib/python3.5/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 673, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/usr/local/lib/python3.5/site-packages/scrapy/core/downloader/handlers/ftp.py", line 36, in <module>
    from twisted.protocols.ftp import FTPClient, CommandFailed
ImportError: No module named 'twisted.protocols.ftp'


Answer
trustyao 3 weeks ago

I have solved this problem. Stacks were not the answer.

I added 'Twisted>=17.1.0' to requirements.txt.

Thanks

0
Answered
olivie2r 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 6 days ago 1

Hello,

If I try to deploy my spiders with shub with the following two lines in setup.py, then no spider is deployed. If I just comment out these two lines and deploy again, it works fine... Is that a bug?

Thanks

Olivier


LOG_FILE = 'spider.log'
LOG_STDOUT = True
Answer

Hi Olivier, have you tried deploying first and adding the settings afterwards? Go to Spider settings -> Raw settings and try to add your settings there.

Kind regards,

Pablo