Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
Sergey Sinkovskiy 4 years ago in Crawlera • updated 2 years ago 0

How should I use Crawlera middleware?

Answer
Sergey Sinkovskiy 2 years ago


To use the Crawlera middleware, configure it as follows:

DOWNLOADER_MIDDLEWARES = {
    # ... your other downloader middlewares ...
    'scrapy_crawlera.CrawleraMiddleware': 600,
}
To specify credentials when using the Crawlera middleware, use the following settings:

CRAWLERA_ENABLED = True
CRAWLERA_USER = 'username'
CRAWLERA_PASS = 'secret'
Or in spider attributes:
class MySpider(scrapy.Spider):
    crawlera_enabled = True
    crawlera_user = 'username'
    crawlera_pass = 'secret'
0
Answered
Sergey Sinkovskiy 4 years ago in Crawlera • updated 4 years ago 0

Scrapy doesn't support specifying credentials in the URL; what should I do?

Answer
Sergey Sinkovskiy 4 years ago

To use the fetch API of the service, make sure the Crawlera middleware is disabled, and use the http_user/http_pass spider attributes to specify credentials.

Unfortunately, Scrapy doesn't support credentials in the URL.


http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware
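For reference, here is a minimal sketch (not an official example) of a spider that supplies the credentials through the http_user/http_pass attributes read by HttpAuthMiddleware; the spider name and fetch URL below are placeholders:

import scrapy

class FetchApiSpider(scrapy.Spider):
    # Credentials are picked up by the built-in HttpAuthMiddleware,
    # so they never have to appear in the URL itself.
    name = 'fetch_api'  # placeholder name
    http_user = 'username'
    http_pass = 'secret'
    # Placeholder fetch API endpoint; replace with the service's real URL.
    start_urls = ['http://example.com/fetch?url=http://example.org/']

    def parse(self, response):
        self.logger.debug("Fetched %s with HTTP auth", response.url)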



0
Answered
Chris Forno 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 5

Once my custom spider (deployed to Scrapinghub via scrapyd) hits about 60,000 items, it gets killed with a memusage_exceeded error. I suspect this is happening during item deduplication, but I don't know how to test that.

0
Answered
Metawing 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3

No items are being extracted from the pages by autoscraping.

Answer
This topic has been abandoned for 4 months already
0
Fixed
Sabyasachi Goswami 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3

Running an autoscraping spider with properly defined items, but still no items are received.

0
Fixed
drsumm 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 6

I need to log in to the Play Store:

https://accounts.google.com/ServiceLogin?service=googleplay&passive=1209600&continue=https://play.google.com/store&followup=https://play.google.com/store


I configured a spider with the above login URL, but it gives the following error:

2013-09-13 11:45:44 INFO Scrapy 0.19.0-28-g0f00b16 started
2013-09-13 11:45:45 INFO using set_wakeup_fd
2013-09-13 11:45:45 INFO Scrapy 0.19.0-28-g0f00b16 started (bot: scrapybot)
2013-09-13 11:45:45 INFO Syncing .scrapy directory from s3://hubcloud.scrapinghub.com/690/dot-scrapy/playstorelog/
2013-09-13 11:45:46 INFO HubStorage: writing items to http://storage.scrapinghub.com/items/690/23/3
2013-09-13 11:45:46 INFO HubStorage: writing pages to http://storage.scrapinghub.com/collections/690/cs/Pages
2013-09-13 11:45:47 INFO Spider opened
2013-09-13 11:45:47 INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-09-13 11:45:47 INFO TelnetConsole starting on 6023
2013-09-13 11:45:47 ERROR Spider error processing <GET https://accounts.google.com/ServiceLogin?service=googleplay&passive=1209600&continue=https://play.google.com/store&followup=https://play.google.com/store>
	Traceback (most recent call last):
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
	    self.runUntilCurrent()
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
	    self._startRunCallbacks(result)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/usr/lib/pymodules/python2.7/slybot/spider.py", line 111, in parse_login_page
	    args, url, method = fill_login_form(response.url, response.body, username, password)
	  File "/usr/lib/python2.7/dist-packages/loginform.py", line 53, in fill_login_form
	    form.fields[userfield] = username
	  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 885, in __setitem__
	    self.inputs[item].value = value
	  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 946, in __getitem__
	    "No input element with the name %r" % name)
	exceptions.KeyError: 'No input element with the name None'
	
2013-09-13 11:45:47 INFO Closing spider (finished)
2013-09-13 11:45:48 INFO Dumping Scrapy stats:
	{'downloader/request_bytes': 362,
	 'downloader/request_count': 1,
	 'downloader/request_method_count/GET': 1,
	 'downloader/response_bytes': 72947,
	 'downloader/response_count': 1,
	 'downloader/response_status_count/200': 1,
	 'finish_reason': 'finished',
	 'finish_time': datetime.datetime(2013, 9, 13, 6, 15, 47, 632051),
	 'memusage/max': 49434624,
	 'memusage/startup': 49434624,
	 'response_received_count': 1,
	 'scheduler/dequeued': 1,
	 'scheduler/dequeued/disk': 1,
	 'scheduler/enqueued': 1,
	 'scheduler/enqueued/disk': 1,
	 'spider_exceptions/KeyError': 1,
	 'start_time': datetime.datetime(2013, 9, 13, 6, 15, 47, 83072)}
2013-09-13 11:45:48 INFO Spider closed (finished)
2013-09-13 11:45:48 INFO Syncing .scrapy directory to s3://hubcloud.scrapinghub.com/690/dot-scrapy/playstorelog/
2013-09-13 11:45:49 INFO Main loop terminated.





So how do I handle Google authentication?
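For reference, here is a minimal local check, assuming the loginform library shown in the traceback (plus requests) is installed, that reproduces the same KeyError when no username input can be detected on the login page:

import requests
from loginform import fill_login_form

LOGIN_URL = ('https://accounts.google.com/ServiceLogin?service=googleplay'
             '&passive=1209600&continue=https://play.google.com/store')

body = requests.get(LOGIN_URL).text
try:
    # The same call slybot makes in parse_login_page (see the traceback above)
    args, action_url, method = fill_login_form(LOGIN_URL, body, 'user@example.com', 'secret')
    print(method, action_url, args)
except Exception as exc:
    # The job log above shows a KeyError: loginform found no input element
    # it could recognise as the username field on this page.
    print('Could not fill the login form:', exc)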


0
Answered
Sergey Sinkovskiy 4 years ago in Crawlera • updated 4 years ago 0


Answer
Sergey Sinkovskiy 4 years ago

No. Crawlera doesn't implement any caching.


0
Answered
Edwin Shao 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 0
Hello,

I can't seem to see the DEBUG log level in your web log, which would help me debug some problems getting my spiders working in your production environment.

For example, the following log message (that I see on my development machine) would help me:

2013-08-23 10:34:08+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'miner.spiders', 'FEED_URI': 'stdout:', 'SPIDER_MODULES': ['miner.spiders'], 'BOT_NAME': 'kites-miner-bot/0.1.0', 'ITEM_PIPELINES': ['miner.pipelines.AddressPipeline', 'miner.pipelines.GeoPipeline', 'miner.pipelines.MergesPipeline', 'miner.pipelines.HoursPipeline', 'miner.pipelines.CategoriesPipeline', 'miner.pipelines.WidgetPipeline', 'miner.pipelines.BasePipeline', 'miner.pipelines.ItemToBSONPipeline', 'scrapy_mongodb.MongoDBPipeline', 'miner.pipelines.CouchDBPipeline', 'miner.pipelines.BSONToItemPipeline'], 'USER_AGENT': 'kites-miner-bot/0.1.0 (+http://kites.hk)', 'FEED_FORMAT': 'json'}


I've already tried setting LOG_LEVEL to 'DEBUG' in settings.py. Is there anything else I should do?


Answer

You need to set LOG_LEVEL = DEBUG in Settings -> Scrapy settings.


The default log level on Scrapy Cloud is INFO.

0
Answered
Edwin Shao 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 2

I am using scrapy_mongodb to store scraped items into my Mongo database. Everything works fine on my development machine, but when I deploy to scrapinghub, I am having a hard time configuring the MONGODB_DATABASE setting that it depends on.


Regardless of whether I put in a project or spider override, it keeps using the MONGODB_DATABASE that is in settings.py. Why is this?

Answer

Hi, Edwin,


The current behaviour is: spider settings have the highest priority, then project settings, then the settings.py file, so overrides should work that way. If it does not work that way, I cannot say why without knowing what your code does. Are you sure your code is reading the MONGODB_DATABASE setting and not a hard-coded value?
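As a reference point, here is a minimal sketch (not the scrapy_mongodb source) of a pipeline that reads MONGODB_DATABASE from the crawler settings instead of a hard-coded value, so spider- and project-level overrides on Scrapy Cloud take effect:

class DatabaseNamePipeline(object):

    def __init__(self, database):
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings already reflects the priority described above:
        # spider settings > project settings > settings.py
        return cls(database=crawler.settings.get('MONGODB_DATABASE', 'default_db'))

    def process_item(self, item, spider):
        # ... write `item` to self.database here ...
        return item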

0
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 2 years ago 0


Answer

Even when no changes to the code are made, jobs can run slower depending on how busy the cloud server they are assigned to is.


This variability can be reduced by purchasing dedicated servers. Check the Pricing page and contact sales@scrapinghub.com to request them.

0
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 4 years ago 0

So that, given a range, we always obtain the same set of data.

Answer

Job items are always in the order in which they were extracted and stored. This is also true in the API, even when you filter or request ranges.

+1
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 4 years ago 0


Answer

The Scrapy process gets the stop signal within about a hundred milliseconds. It then does a graceful stop, which means it finishes pending HTTP requests, flushes items and logs, etc., which takes some time. Most of the time people killing jobs probably don't care about that, and we could provide a quick-kill mechanism.

0
Answered
Edwin Shao 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 0
I don't see an option to do so in the web UI.

When I try to deploy, I get the following error:

deshao-mbp (1)~/miner> bin/scrapy deploy
Packing version 1376539087
Deploying to project "889" in http://dash.scrapinghub.com/api/scrapyd/addversion.json
Deploy failed (400):
{"status": "badrequest", "message": "Duplicated spider name: there is an autospider named 'burgerking'"}


Thus, I'd like to delete the autospider named 'burgerking'.

Answer

You have to go to the Autoscraping properties of the spider (the red 'Autoscraping' button), where there is a 'Delete' button.

0
Not a bug
Dominici 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 2

Hello, 

I have tried to export my "completed job".

But when I click on "CSV", a new window opens and nothing happens.

Is that a bug?

0
Answered
Max Kraszewski 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 3

I can't figure out what is wrong, but when I attempt to download scraped items in CSV format, it redirects me to a blank page with the following message:

Need to indicate fields for the CSV file
in the request parameter fields
I think I'm doing something wrong, but could you help me? Thanks in advance.

Answer

That is because that job does not have any items (you can see the items counter at 0). If you do the same on other jobs, you will get a CSV.
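If the job does have items and you want to request the CSV yourself, here is a hedged sketch: the items endpoint pattern is taken from the job log earlier on this page, while the format/fields parameters and the field names are assumptions based on the error message above, so adjust them to your own job and items:

import requests

API_KEY = 'YOUR_API_KEY'  # hypothetical placeholder
JOB_ID = '690/23/3'       # project/spider/job, as seen in the log above

response = requests.get(
    'https://storage.scrapinghub.com/items/%s' % JOB_ID,
    params={'format': 'csv', 'fields': 'name,price'},  # assumed parameters and example field names
    auth=(API_KEY, ''),
)
print(response.text)  # empty when the job scraped no items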