Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Narcissus Emi 3 years ago in Crawlera • updated by Martin Olveyra (Engineer) 3 years ago 1

Hi,


I'm a newcomer to Crawlera. I'm creating a spider to crawl a site that requires a session key and tokens to validate a form, so an IP change causes the server not to recognize the request.


Is there any option or way to achieve this?

Answer

Sorry, this is not supported yet for public access.

0
Thanks
Castedo Ellerman 3 years ago in Scrapy Cloud • updated by Andrés Pérez-Albela H. 2 years ago 1

I just "scrapy deploy"'d the navscraper git project to Scrap Cloud, ran some spiders and it really worked!


0
Answered
Castedo Ellerman 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 3 years ago 3

Under "Settings" > "Scrapy Deply" then "Git" I entered

https://github.com/scrapinghub/navscraper

then "Save"

then "Deploy from Git"

I get a dark screen with "Processing..."

and then a red popup of "Deploy Failed".


When I follow the instructions under "Deploy configuration" and run "scrapy deploy" from my local project clone, it works fine.
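
For reference, the working local deploy only relies on a deploy target in scrapy.cfg. A rough sketch of what that file looks like (all values are placeholders; the real ones come from the project's "Deploy configuration" panel in Dash):

# scrapy.cfg -- sketch only; copy the actual values from the
# "Deploy configuration" panel of the project
[settings]
default = <project module>.settings

[deploy]
url = <scrapyd endpoint shown in "Deploy configuration">
username = <your Scrapinghub API key>
project = <numeric project id>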

0
Answered
drsumm 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 3

How is it possible to download images in autoscraping? Also, if we have our own datastore, how can it be connected to autoscraping?

Answer

You can use the Images addon.


Please check this doc on how to use it:


http://doc.scrapinghub.com/addons.html#images
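
For comparison, in a plain Scrapy project the same job is done by Scrapy's standard images pipeline. A minimal sketch, assuming a recent Scrapy (older versions use scrapy.contrib.pipeline.images.ImagesPipeline) and a storage location you control:

# settings.py -- sketch of the standard images pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 's3://my-bucket/images/'   # placeholder; a local path also works

# Items must list the image URLs in an 'image_urls' field;
# information about the downloaded files is written to an 'images' field.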

+1
Completed
drsumm 3 years ago in Portia • updated by Pablo Hoffman (Director) 9 months ago 3

Often I find that items are not yet defined by me, and only when I see the template can I decide on the item fields to be extracted. So there should be a feature to create new item fields while in template mode. It's not efficient to go back and define items each time.

Answer

Portia already supports this.

0
Answered
drsumm 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 2

I have annotated sticky field annotations and I need their data to be extracted too, but it isn't being extracted.

Answer

Hi drsumm,


Sticky annotations are intended as a kind of annotation whose extracted data is thrown away.

If you want the extracted data in your item, you must use a normal item field (and mark the annotation as required). It does not make sense to have a sticky annotation that extracts data, because that would be the same as a normal field.


0
Answered
drsumm 3 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 1

I need to display only those pages where no items were extracted, in order to understand those pages. Is that possible now?

0
Answered
drsumm 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 3

I am scraping the Play Store and it is scraping at a rate of 6 items/min, which is unrealistically slow. Here is the job id:

http://dash.scrapinghub.com/p/690/job/23/8/#details



settings:


CONCURRENT_ITEMS = 100

CONCURRENT_REQUESTS_PER_IP = 10

DELTAFETCH_ENABLED = 1

DOTSCRAPY_ENABLED = 1


Answer

Hi drsumm, do you remember autothrottle? Check this earlier topic:

http://support.scrapinghub.com/topic/168025-slow-scraping/

About CONCURRENT_ITEMS, I don't think that is what you need. Check this:

http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items

This setting gives the maximum number of items processed concurrently in the item pipelines; it will not accelerate the crawling speed.
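
For reference, the settings that actually govern crawl speed are the request-concurrency and throttling ones. A rough sketch of the relevant standard Scrapy settings (values are illustrative only, not a recommendation):

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 16          # overall request concurrency (Scrapy default)
CONCURRENT_REQUESTS_PER_IP = 10   # already set in this job
DOWNLOAD_DELAY = 0                # fixed delay between requests, if any
AUTOTHROTTLE_ENABLED = True       # autothrottle adapts delays to the target site and can slow a crawl down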

Also, the number of duplicates indicates that the same items are being scraped over and over, which drastically reduces the items/pages ratio. You should add URL/parameter filters in order to avoid scraping unneeded pages. Check issues 6 and 8 in this section of the help:

http://doc.scrapinghub.com/autoscraping.html#good-practices-for-best-results-with-less-effort

If you look at the dropped log lines, for example:

Dropped: Duplicate product scraped at <https://play.google.com/store/apps/details?id=com.app.vodio&reviewId=Z3A6QU9xcFRPR0FvM1p0aGRCSmtpN2ZMTExEWjR2ZUhQZzhoRUE1X2pRb0Q4UXhvWUFBLTZkb0pXYk1zN3Z0SXpkLWszVDZiLXZCNU5ya0t2ZE1CdHRpamc>, first one was scraped at <https://play.google.com/store/apps/details?id=com.app.vodio>

You will see that the same page is being visited with two different URLs. You must use the QueryCleaner addon to remove the reviewId parameter from the URLs.
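
A rough sketch of the QueryCleaner configuration for this case, assuming the QUERYCLEANER_REMOVE setting (a regex matched against query-parameter names) described on the addons page linked above:

# Project/spider settings -- sketch; QUERYCLEANER_REMOVE takes a regex
# matched against query-parameter names of followed URLs
QUERYCLEANER_REMOVE = 'reviewId'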

0
Answered
Sergey Sinkovskiy 4 years ago in Crawlera • updated 2 years ago 0

How should I use Crawlera middleware?

Answer
Sergey Sinkovskiy 2 years ago


To use the Crawlera middleware, configure it as follows:

DOWNLOADER_MIDDLEWARES = {
   ....
   'scrapy_crawlera.CrawleraMiddleware': 600
}
To specify credentials when using the Crawlera middleware, use:

CRAWLERA_ENABLED = True
CRAWLERA_USER = 'username'
CRAWLERA_PASS = 'secret'
or as spider attributes:
class MySpider(scrapy.Spider):
    crawlera_enabled = True
    crawlera_user = 'username'
    crawlera_pass = 'secret'
0
Answered
Sergey Sinkovskiy 4 years ago in Crawlera • updated 4 years ago 0

Scrapy doesn't support specifying credentials in the URL. What should I do?

Answer
Sergey Sinkovskiy 4 years ago

To use the fetch API of the service, you need to make sure the Crawlera middleware is disabled and use the http_user/http_pass spider attributes to specify credentials.

Unfortunately, Scrapy doesn't support credentials in the URL.


http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware
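
For illustration, a minimal sketch of a spider using those attributes (spider name, URL and credentials are placeholders; with a recent Scrapy, HttpAuthMiddleware picks up http_user/http_pass automatically):

import scrapy

class MySpider(scrapy.Spider):
    name = 'fetch_api_spider'                 # placeholder name
    http_user = 'username'                    # read by HttpAuthMiddleware
    http_pass = 'secret'
    start_urls = ['http://example.com/fetch']  # placeholder fetch endpoint

    def parse(self, response):
        self.logger.info('got %d bytes', len(response.body))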



0
Answered
Chris Forno 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 3 years ago 5

Once my custom spider (deployed to Scrapinghub via scrapyd) hits about 60,000 items, it gets killed with a memusage_exceeded error. I suspect this is happening during item deduplication, but I don't know how to test that.

0
Answered
Metawing 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3

No items are being extracted from the pages by autoscraping.

Answer
This topic has been abandoned for 4 months already.
0
Fixed
Sabyasachi Goswami 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 3

Running an autoscraping spider with properly defined items, but still no items are received.

0
Fixed
drsumm 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 6

I need to log in to the Play Store:

https://accounts.google.com/ServiceLogin?service=googleplay&passive=1209600&continue=https://play.google.com/store&followup=https://play.google.com/store


I configured a spider with the above login URL, but it gives the following error:

2013-09-13 11:45:44 INFO Scrapy 0.19.0-28-g0f00b16 started
2013-09-13 11:45:45 INFO using set_wakeup_fd
2013-09-13 11:45:45 INFO Scrapy 0.19.0-28-g0f00b16 started (bot: scrapybot)
2013-09-13 11:45:45 INFO Syncing .scrapy directory from s3://hubcloud.scrapinghub.com/690/dot-scrapy/playstorelog/
2013-09-13 11:45:46 INFO HubStorage: writing items to http://storage.scrapinghub.com/items/690/23/3
2013-09-13 11:45:46 INFO HubStorage: writing pages to http://storage.scrapinghub.com/collections/690/cs/Pages
2013-09-13 11:45:47 INFO Spider opened
2013-09-13 11:45:47 INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-09-13 11:45:47 INFO TelnetConsole starting on 6023
2013-09-13 11:45:47 ERROR Spider error processing <GET https://accounts.google.com/ServiceLogin?service=googleplay&passive=1209600&continue=https://play.google.com/store&followup=https://play.google.com/store>
	Traceback (most recent call last):
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
	    self.runUntilCurrent()
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
	    self._startRunCallbacks(result)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/usr/lib/pymodules/python2.7/slybot/spider.py", line 111, in parse_login_page
	    args, url, method = fill_login_form(response.url, response.body, username, password)
	  File "/usr/lib/python2.7/dist-packages/loginform.py", line 53, in fill_login_form
	    form.fields[userfield] = username
	  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 885, in __setitem__
	    self.inputs[item].value = value
	  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 946, in __getitem__
	    "No input element with the name %r" % name)
	exceptions.KeyError: 'No input element with the name None'
	
2013-09-13 11:45:47 INFO Closing spider (finished)
2013-09-13 11:45:48 INFO Dumping Scrapy stats:
	{'downloader/request_bytes': 362,
	 'downloader/request_count': 1,
	 'downloader/request_method_count/GET': 1,
	 'downloader/response_bytes': 72947,
	 'downloader/response_count': 1,
	 'downloader/response_status_count/200': 1,
	 'finish_reason': 'finished',
	 'finish_time': datetime.datetime(2013, 9, 13, 6, 15, 47, 632051),
	 'memusage/max': 49434624,
	 'memusage/startup': 49434624,
	 'response_received_count': 1,
	 'scheduler/dequeued': 1,
	 'scheduler/dequeued/disk': 1,
	 'scheduler/enqueued': 1,
	 'scheduler/enqueued/disk': 1,
	 'spider_exceptions/KeyError': 1,
	 'start_time': datetime.datetime(2013, 9, 13, 6, 15, 47, 83072)}
2013-09-13 11:45:48 INFO Spider closed (finished)
2013-09-13 11:45:48 INFO Syncing .scrapy directory to s3://hubcloud.scrapinghub.com/690/dot-scrapy/playstorelog/
2013-09-13 11:45:49 INFO Main loop terminated.





So how do I handle Google authentication?
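
For what it's worth, the traceback shows slybot calling loginform's fill_login_form on the login page; the KeyError means loginform could not identify a username input in the page HTML, so it tried to set a field named None. A minimal sketch to reproduce that detection step locally, assuming the loginform and requests packages are installed (credentials are placeholders):

# Sketch: reproduce the login-form detection step outside Scrapinghub.
import requests
from loginform import fill_login_form

url = ('https://accounts.google.com/ServiceLogin?service=googleplay'
       '&passive=1209600&continue=https://play.google.com/store'
       '&followup=https://play.google.com/store')
body = requests.get(url).text

# Raises KeyError (as in the job log) if no username input
# can be found in the static HTML of the page.
args, action, method = fill_login_form(url, body, 'user@example.com', 'password')
print(action, method, args)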