Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
jbothma yesterday at 8:36 a.m. in Scrapy Cloud 0

My spider gets killed after 2 hours of syncing dotscrapy.


DotScrapy Persistence and the HTTP Cache worked fine for a few days: I set a 4-day lifetime, it populated the cache and did a few good scrapes, then the cache expired and it had a couple of slow scrapes repopulating the cache. Since 2017-02-17 21:00:08 UTC my jobs get SIGTERM after only 2 hours.


This is where my log ends each time, after around 1 hour 55 minutes.


6: 2017-02-17 21:00:16 INFO [scrapy_dotpersistence] Syncing .scrapy directory from s3://scrapinghub-app-dash-addons/org-66666/79193/dot-scrapy/mfma/

7: 2017-02-17 22:51:11 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force

I'm trying to figure out how to clear the dotscrapy storage and start afresh but I'd also like to know whether I'm doing something wrong so I don't get into this situation again.


Why would my job just get SIGTERM? Is there something killing slow AWS activity?


I don't think my dotscrapy is too large, because I enabled gzip for the HTTP cache, which dealt with the out-of-space errors I initially got with the cache.

0
Answered
ghostmou 4 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 4 days ago 1

I am having issues downloading a paginated CSV. I have seen in the documentation that the parameters to control pagination are:

  • start, to indicate the first item of the page, in the format <project>/<spider>/<job>/<item_id>.
  • count, to indicate the page's size.

When I try it in JSON format, it works like a charm. But with CSV, I'm having issues:

  • Parameter count works properly, returning the desired page size.
  • Parameter start seems to fail in CSV. I have tried both with the recommended format (<project>/<spider>/<job>/<item_id>) and with a numeric format (for example, to start the page on item 2500, passing the number 2499).

It seems to ignore the start parameter...

Example URLs used:

Any suggestions? :(

Thank you!

Answer

Hello,


You can use the Items API for this:

~ curl -u APIKEY: "https://storage.scrapinghub.com/items/<project_id>/<spider_id>/<job_id>?format=csv&include_headers=1&fields=field1,field2,field3&start=<project_id>/<spider_id>/<job_id>/<item_id>&count=x"

Or


~ "https://storage.scrapinghub.com/items/<project_id>/<spider_id>/<job_id>?apikey=<apikey>&format=csv&include_headesr=1&fields=field1,field2,field3&start=/<project_id>/<spider_id>/<job_id>/<item_id>&count=x"
0
Answered
triggerdev 6 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 5 days ago 1

Hi,


I am running a spider job every 3 minutes and I would like to know how I can get the last job from shub (Windows version)?


Thanks.

Answer

Hello,


You can use the JobQ API (https://doc.scrapinghub.com/api/jobq.html#jobq-project-id-list) with the parameters state=finished and count=1 (if you only want the last one).

Or you can use the Jobs API (https://doc.scrapinghub.com/api/jobs.html#jobs-list-json-jl).
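
For example, a minimal sketch of the JobQ call from Python (using the requests library; the project ID and API key are placeholders):

import requests

API_KEY = "<apikey>"         # placeholder
PROJECT_ID = "<project_id>"  # placeholder

# Ask JobQ for only the most recent finished job
resp = requests.get(
    "https://storage.scrapinghub.com/jobq/" + PROJECT_ID + "/list",
    params={"state": "finished", "count": 1},
    auth=(API_KEY, ""),
)
resp.raise_for_status()
print(resp.text)             # one JSON line describing the latest finished job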

0
Answered
mrcai 6 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 6 days ago 1

Hi,


I'm receiving the following error.


[scrapy.extensions.feedexport] Unknown feed format: 'jsonlines'


This works in my development environment; I'm not sure if I've missed enabling a plugin?


Many thanks,

Answer

Hello,

If you are adding FEED_FORMAT via the UI settings, try entering the value without the quotes (jsonlines rather than 'jsonlines').

0
Answered
I. Hathout 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

I cannot run my Scrapy spider. I deployed it like in the video on YouTube, but it just ran for 16 seconds with 0 items and a no_reason outcome!

Answer

Hey Hathout, if you are deploying a Scrapy project, I suggest following this tutorial:

https://doc.scrapy.org/en/latest/intro/tutorial.html

It's quite complete, and works.


If using Portia, our visual scraper, take a moment to explore this tutorial:

http://help.scrapinghub.com/portia/using-portia-20-the-complete-beginners-guide


Good luck with your projects, and be patient! Don't hesitate to ask further questions here.

Kind regards,

Pablo

0
Answered
edward.feng 1 week ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 week ago 1

I am trying to save a PDF file as Base64-encoded binary in the “items” storage of Scrapinghub. It is able to download PDF documents smaller than 1MB, but failed on 2 larger PDFs. I couldn't find documentation in this regard… Has anyone else encountered this issue?


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/extension.py", line 46, in item_scraped
    self._write_item(item)
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/resourcetype.py", line 208, in write
    return self.writer.write(item)
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/batchuploader.py", line 229, in write
    .format(self.maxitemsize, truncated_data))

ValueTooLarge: Value exceeds max encoded size of 1048576 bytes: '{"_type": "CrossroadsItem", "isFound": "true", "requestID": "565780", "html": "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9TdWJ0eXBlL1hNTC9MZW5ndGggMzIyMC9UeXBlL01ldGFkYXRhPj5zdHJlYW0KPD94cGFja2V0IGJlZ2luPSLvu78iIGlkPSJXNU0wTXBDZWhpSHpyZVN6TlRjemtjOWQiPz4KPHg6eG1wbWV0YSB4bWxuczp4PSJhZG9iZTpuczptZXRhLyIgeDp4bXB0az0iWE1QIENvcmUgNS41LjAiPgogICA8cmRmOlJERiB4bWxuczpyZGY9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkvMDIvMjItcmRmLXN5bnRheC1ucyMiPgogICAgICA8cmRmOkRlc2NyaXB0aW9uIHJkZjphYm91dD0iIiB4bWxuczpwZGY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vcGRmLzEuMy8iIHhtbG5zOnBkZmFpZD0iaHR0cDovL3d3dy5haWltLm9yZy9wZGZhL25zL2lkLyIgeG1sbnM6ZGM9Imh0dHA6Ly9wdXJsLm9yZy9kYy9lbGVtZW50cy8xLjEvIiB4bWxuczp4bXBNTT0iaHR0cDovL25zLmFkb2JlLmNvbS94YXAvMS4wL21tLyIgeG1sbnM6eG1wPSJodHRwOi8vbnMuYWRvYmUuY29tL3hhcC8xLjAvIj4KICAgICAgICAgPHBkZjpQcm9kdWNlcj5QRlUgUERGIExpYnJhcnkgMS4yLjA7IG1vZGlmaWVkIHVzaW5nIGlUZXh0U2hhcnAgNS4wLjAgKGMpIDFUM1hUIEJWQkE8L3BkZjpQcm9kdWNlcj4KICAgICAgICAgPHBkZmFpZDpwYXJ0PjE8L3BkZmFpZDpwYXJ0PgogICAgICAgICA8cGRmYWlkOmNvbmZvcm1hbmNlPkI8L3BkZmFpZDpjb25mb3JtYW5jZ...'

Answer

Hello Edward,


There's a limit of 1MB per item/log/collection item and it cannot be increased. Possible solutions are to split the item so it doesn't exceed the limit and merge the parts during post-processing, to produce smaller items, or to use a file storage like S3.
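
As an illustration of the S3 route, here is a minimal sketch using Scrapy's built-in FilesPipeline so the PDFs go to S3 and the items only carry small metadata (the bucket name and credentials are placeholders):

# settings.py sketch: let Scrapy's FilesPipeline download the PDFs and store
# them in S3, so the items only carry small metadata instead of the PDF body.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "s3://my-bucket/pdfs/"    # placeholder bucket
AWS_ACCESS_KEY_ID = "<aws key>"         # placeholder credentials
AWS_SECRET_ACCESS_KEY = "<aws secret>"

# In the spider, yield the PDF URL instead of its Base64 body:
#     yield {"requestID": "565780", "file_urls": [pdf_url]}
# FilesPipeline downloads each URL in file_urls and adds a "files" field
# with the S3 path of the stored file.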

0
Answered
nobody 1 week ago in Scrapy Cloud • updated 6 days ago 4

I got an error "ValueTooLarge: Value exceeds max encoded size of 1048576 bytes:" when running my spider. I suppose there is a limit of 1 MiB (= 1,048,576 bytes) on the HTML file.

Is it possible to relax this size limitation?


Detailed Error:

Traceback (most recent call last):

  File "/usr/local/lib/python2.7/site-packages/scrapy/core/spidermw.py", line 42, in process_spider_input
    result = method(response=response, spider=spider)
  File "/usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py", line 59, in process_spider_input
    self.save_response(response, spider)
  File "/usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py", line 90, in save_response
    self._writer.write(payload)
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/batchuploader.py", line 229, in write
    .format(self.maxitemsize, truncated_data))
ValueTooLarge: Value exceeds max encoded size of 1048576 bytes: '{"body": "<!DOC...(snip)
Answer

Hello,


Yes, there's a limit of 1MB per item/log/collection item and it is not possible to increase it. You could try to split the item so it doesn't exceed the limit and then join the parts during post-processing, produce smaller items, or use a file storage like S3.
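
If you go the split-and-merge route, a rough sketch of the idea (the chunk size and field names are arbitrary choices, not documented values):

# Split an oversized body across several items that each stay under the 1MB cap.
CHUNK_SIZE = 900 * 1024   # arbitrary headroom below the 1,048,576-byte limit

def split_body(url, body):
    chunks = [body[i:i + CHUNK_SIZE] for i in range(0, len(body), CHUNK_SIZE)]
    for index, chunk in enumerate(chunks):
        yield {
            "url": url,
            "part": index,
            "parts": len(chunks),
            "body": chunk,
        }

# During post-processing, group by "url", sort by "part" and concatenate the
# "body" fields to rebuild the original document.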

0
Answered
mcchin 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 5 days ago 6

Below is the test script I am using to test "DeltaFetch"


# -*- coding: utf-8 -*-
import scrapy
import json
import re
import md5
from scrapy.selector import Selector

#Python 3
#from urllib.parse import unquote

#Python 2
from urlparse import unquote
from scrapy.utils.request import request_fingerprint

import time

class TestSOSpider(scrapy.Spider):
    name = "testso"
    base_url = 'http://pastebin.com'
    #allowed_domains = ["pastebin.com"]
    start_urls = ['http://pastebin.com/']

    #def start_requests(self):
        #meta={"deltafetch_key": "222"},
    #    yield scrapy.Request('http://pastebin.com', callback=self.parse)

    def parse(self, response):
        self.logger.info('Main parse request - %s', response.url)
        #meta={"deltafetch_key": "111"},
        r = scrapy.Request('http://pastebin.com/kb33dEnd', callback=self.parseListing)
        self.logger.info("requesting %r (fingerprint: %r)" % (r, request_fingerprint(r)))
        yield r

    def parseListing(self, response):
        self.logger.info('Get listing - %s', response.url)
        self.logger.info("parse_page(%r); request %r (fingerprint: %r)" % (
            response, response.request, request_fingerprint(response.request)))

        txt = response.css(".paste_box_line1 > h1::text").extract_first()
        bin = response.css("#paste_code::text").extract_first()

        bin = u' '.join((bin, ''))
        bin = bin.encode('ascii')
        bin = unquote(bin)

        yield {
            'txt' : txt,
            'bin' : bin
        }

I have enabled both "DeltaFetch" and "DotPersistent" as seen below


When executing the spider, I can see that it is enabled from log as seen in image below.

(not sure why it has a character 'u' in front but others don't? not sure if this is the problem?)



Based on this article https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/, I am expecting to see 'deltafetch/stored' in my Dumping Scrapy stats:

2016-07-19 10:17:53 [scrapy] INFO: Dumping Scrapy stats:
{
    'deltafetch/stored': 1000,
    ...
    'downloader/request_count': 1051,
    ...
    'item_scraped_count': 1000,
}

But I didn't as seen below from my log:



My spider just keeps returning items and results even though the URL was visited and scraped before.

I am wondering what I have done wrong, or whether there is a bug?

Answer

With the help of online support I got it to work.


Basically, my confusion was that I thought enabling the DeltaFetch addon on Scrapinghub meant I did not need to install the scrapy-deltafetch plugin (pip install scrapy-deltafetch) myself.

In actual fact both are needed: I had to enable the DeltaFetch addon in the scrapinghub.com project settings, and also add scrapy-deltafetch to the project dependencies as described in http://help.scrapinghub.com/scrapy-cloud/dependencies-in-scrapy-cloud-project


0
Answered
SeanBannister 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I'm using Portia on Scrapy Cloud. Is it possible to have the scraped data piped to an external database? I've noticed it appears to be possible if I were hosting it myself, but I'd prefer to pay and use your service.

Answer

Hi Sean, you have two options that could match your inquiry. First, you can use our paid Scrapy Cloud Units at $9/month; each paid container unit allows you to store data for 120 days (free units have 7 days) and has 1GB of RAM.

The other option is using the AWS S3 pipeline; please take a moment to read this article: http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode

I hope you find this information useful.

Kind regards,

Pablo

0
Waiting for Customer
Reti 2 weeks ago in Scrapy Cloud • updated 2 weeks ago 2

Hello,


I'm having difficulty scraping a pricing website. Through the job I see the currency changing between Euro, GBP, and even the Russian Ruble! This setting can be changed by a cookie but it defaults to the location of the user browsing the website.


Using Scrapy Cloud (and Portia), is it possible to select this currency preference before the job runs? I tried providing custom parameters:


request_with_cookies = Request(url="http://www.websites.com",cookies={'currency':'USD'})


This doesn't seem to work, as I can see the default currency still overriding this.


Is there a better way to do this, or is my syntax above incorrect?
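
In case it helps, this is roughly what I'm attempting in the spider (a sketch only; the site URL is the placeholder from above, and the cookie is set on the very first request in start_requests):

import scrapy

class PricingSpider(scrapy.Spider):
    name = "pricing"
    start_urls = ["http://www.websites.com"]   # placeholder URL from the example above

    def start_requests(self):
        for url in self.start_urls:
            # Send the currency cookie with the very first request so the
            # site stores the preference instead of geo-detecting it
            yield scrapy.Request(url, cookies={"currency": "USD"},
                                 callback=self.parse)

    def parse(self, response):
        self.logger.info("Parsed %s", response.url)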

0
Answered
Vincent Molinié 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 3

curl -u API_KEY: https://storage.scrapinghub.com/activity/projects/

returns "need at least one project that a user has permission to view". Why is this happening? It's the first time I've used this API.

Answer

Try with:


curl -u <APIkey>: https://storage.scrapinghub.com/activity/<project ID>/?count=2


And make sure to correctly replace your project ID and your API key, which is available at:
https://app.scrapinghub.com/account/apikey


It works fine.
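
For reference, the equivalent call from Python (a sketch with the requests library; replace the placeholders the same way):

import requests

API_KEY = "<APIkey>"         # from https://app.scrapinghub.com/account/apikey
PROJECT_ID = "<project ID>"  # placeholder

resp = requests.get(
    "https://storage.scrapinghub.com/activity/" + PROJECT_ID + "/",
    params={"count": 2},
    auth=(API_KEY, ""),
)
resp.raise_for_status()
print(resp.text)             # the latest activity entries for the project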

Best,

Pablo

0
Answered
WileECoyoteGenius 3 weeks ago in Scrapy Cloud • updated by korka 8 hours ago 2


Attempting to deploy a spider. I have images configured (works great locally sending to S3), and the add-on installed and configured in Scrapinghub.


No scrapinghub.yml file is created upon shub deploy, and no errors occur.


Tried creating one anyway, along with requirements.txt, and redeploying. That didn't work.




File "/usr/local/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run

	    self.crawler_process.crawl(spname, **opts.spargs)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 163, in crawl
	    return self._crawl(crawler, *args, **kwargs)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 167, in _crawl
	    d = crawler.crawl(*args, **kwargs)
	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
	    return _inlineCallbacks(None, gen, Deferred())
	--- <exception caught here> ---
	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
	    result = g.send(result)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
	    six.reraise(*exc_info)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 72, in crawl
	    self.engine = self._create_engine()
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 97, in _create_engine
	    return ExecutionEngine(self, lambda _: self.stop())
	  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 69, in __init__
	    self.scraper = Scraper(crawler)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/core/scraper.py", line 71, in __init__
	    self.itemproc = itemproc_cls.from_crawler(crawler)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
	    return cls.from_settings(crawler.settings, crawler)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
	    mwcls = load_object(clspath)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
	    mod = import_module(module)
	  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
	    __import__(name)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/pipelines/images.py", line 15, in <module>
	    from PIL import Image
	exceptions.ImportError: No module named PIL
	
Answer

Hi!

I've found a similar issue posted here: https://support.scrapinghub.com/topics/744-image-downloads/

Let me know if it helps.

Best regards,

Pablo

0
Answered
etohimofish 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Getting this message in log when trying to run spider: "hubstorage.HubstorageDeprecationWarning: python-hubstorage is deprecated, please use python-scrapinghub >= 1.9.0 instead"


Not sure how to fix this, help would be appreciated.

Answer

Hi!

This message isn't an error; it's an informational log message about a library that will soon be deprecated. If you do use it, simply use the python-scrapinghub library instead of python-hubstorage. If you don't use either in your spiders, you can safely ignore that line.
You can work with python-scrapinghub: https://github.com/scrapinghub/python-scrapinghub
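For example, a minimal sketch with the newer client (assuming a python-scrapinghub version that provides ScrapinghubClient; the API key and project ID are placeholders):

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("<apikey>")   # placeholder API key
project = client.get_project(123456)     # placeholder project ID

# Iterate over the most recent finished jobs of the project
for job in project.jobs.iter(state="finished", count=5):
    print(job["key"])
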
Kind regards!

Pablo

0
Answered
EdoPut 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Is there any limit on writing files to disk? After uploading all my items to an ephemeral collection I would like to move them to a csv file and upload them to an external service (say S3).

Answer

Hey EdoPut,

The storage limit depends on the account and the number of containers, but you can check this article about exporting items to S3:

http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode
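
If you handle the export yourself, a rough sketch of the idea, shown here for job items via the Items API (using the requests and boto3 libraries; the job key, bucket and credentials are placeholders, and boto3 would need to be added to your project requirements):

import requests
import boto3

API_KEY = "<apikey>"                        # placeholder
JOB = "<project_id>/<spider_id>/<job_id>"   # placeholder job key

# Download the job's items as CSV from the Items API
resp = requests.get(
    "https://storage.scrapinghub.com/items/" + JOB,
    params={"format": "csv", "include_headers": 1},
    auth=(API_KEY, ""),
)
resp.raise_for_status()

# Push the CSV to an S3 bucket (placeholder bucket and key)
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="exports/items.csv",
              Body=resp.text.encode("utf-8"))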

Kind regards,

Pablo

0
Answered
ghostmou 1 month ago in Scrapy Cloud • updated 1 month ago 2

Hi :)


I have been working with Scrapinghub with the goal of automating how we build some reports for our customers. I am extremely happy with the results!! :))


Currently I am trying to store the number of inlinks and outlinks of each URL on a collection of websites (inlinks, as the number of links pointing to each item; outlinks, as the number of links detected on each item).


The outlinks are very easy to store, as you only have to store the number of links captured by the link extractor, but I don't have any idea about how to store the inlinks of each item.


On my own machine, I have created a dict using each URL as the key and, as the value, the number of links pointing to that URL, which I increment as I go. Then, using a pipeline, I add the information to the items. It works perfectly and is very easy to accomplish :).


However... how can I do this on Scrapinghub? Is there a way to add information through a pipeline or the close_spider method and still be able to request the items through the Items API? Or should I consider using the pipeline to send the results to another server (S3, FTP, or similar)?


Thank you for your help!!

Answer

Hey Ghostmou!

First, thanks for your nice and constructive feedback; great to know you are pleased with the results using the Scrapinghub platform.


Regarding your inquiry, I can think of two options. First, you could try the Magic Fields addon, which allows you to automatically add custom fields to each scraped item:

http://help.scrapinghub.com/scrapy-cloud/addons/magic-fields-addon


Secondly, as you mention, you could use a pipeline to send the results to S3; consider exporting your items as described in this article: http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode
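
As a rough sketch of the pipeline idea (the item field names and output file are placeholders; the final write could equally be an upload to S3 as in the article above):

import json
from collections import defaultdict

class InlinkCountPipeline(object):
    """Count how many scraped pages link to each URL and dump the totals
    when the spider closes, so they can be pushed to an external store."""

    def __init__(self):
        self.inlinks = defaultdict(int)

    def process_item(self, item, spider):
        # "outlinks" is a placeholder field holding the URLs found on the page
        for url in item.get("outlinks", []):
            self.inlinks[url] += 1
        return item

    def close_spider(self, spider):
        # Write the aggregate once at the end; replace this with an upload
        # to S3, FTP or another external service as needed.
        with open("inlinks.json", "w") as fp:
            json.dump(dict(self.inlinks), fp)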


Hope you find this information useful,

Kind regards!

Pablo