Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

trustyao 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 2

When I first tested my code locally, I got this error.

After searching on Google, I found I needed to upgrade the Twisted module. After that, the spider ran without any errors.

Now that I've deployed the project to Scrapinghub, it fails with the same error. How can I solve it? I have added 'twisted' to the requirements.



[scrapy.core.downloader.handlers] Loading "scrapy.core.downloader.handlers.ftp.FTPDownloadHandler" for scheme "ftp"


Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 44, in _get_handler
    dhcls = load_object(path)
  File "/usr/local/lib/python3.5/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 673, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/usr/local/lib/python3.5/site-packages/scrapy/core/downloader/handlers/ftp.py", line 36, in <module>
    from twisted.protocols.ftp import FTPClient, CommandFailed
ImportError: No module named 'twisted.protocols.ftp'

trustyao 3 weeks ago

I have solved this problem. Changing stacks was not the answer.

I added 'Twisted>=17.1.0' to requirements.txt
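For anyone hitting the same error: the fix amounts to pinning a compatible Twisted in the requirements file that shub installs on deploy. A minimal sketch, assuming a standard shub setup with a requirements.txt containing the line Twisted>=17.1.0 (the project ID below is hypothetical, and older shub versions used a top-level requirements_file key instead):

```yaml
# scrapinghub.yml - points shub at the extra requirements to install on deploy
projects:
  default: 12345  # hypothetical project ID
requirements:
  file: requirements.txt
```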


SeanBannister 4 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

When running my spider, which was created with Portia, I receive the following error: http://pastebin.com/0M80xGLv I'm unsure how to debug this.


Hey Sean!

Have you tried removing selectors, or trying different ones, to find which one is causing the errors?

That would help determine whether the bug is in a selector trying to parse data with an improper format, or something similar.

Kind regards,


SeanBannister 4 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

I've deleted a few spiders from Portia but they still show up on the ScrapingHub dashboard, which still gives the option to run them. Is this a bug? How do I delete them completely?


Hi Sean!

I've forwarded your inquiry to our Portia team. I could reproduce the bug and our team is aware of it now.

Thank you very much for your contribution; it helps us provide a better product for our customers.

Kind regards,


olivie2r 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 6 days ago 1


If I try to deploy my spiders with shub while the following 2 lines are in setup.py, then no spider is deployed. If I just comment out these 2 lines and deploy again, it works fine... Is that a bug?



LOG_FILE = 'spider.log'

Hi Olivier, have you tried deploying first and adding the settings afterwards? Go to Spider settings -> Raw settings and try adding your settings there:
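Alternatively, if the LOG_FILE setting is what interferes with the deploy, a workaround sketch is to only set it when running outside Scrapy Cloud. (Treating the presence of the SHUB_JOBKEY environment variable as a Scrapy Cloud marker is an assumption about the job runner; verify it in your own jobs.)

```python
import os

# settings.py sketch: only write a local log file when the spider is not
# running on Scrapy Cloud, so the setting cannot affect deploys or
# platform-side logging.
if "SHUB_JOBKEY" not in os.environ:
    LOG_FILE = 'spider.log'
```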

Kind regards,


darndt 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 4 weeks ago 3

When my spider runs on Scrapy Cloud the items get automatically stored as part of the job and I use the API to access them which works beautifully.

However, when I run my spider locally I use a custom write-to-file pipeline to store the items on my disk.

Is there a way to turn this pipeline on/off depending on whether the spider runs locally or on Scrapy Cloud? Perhaps there is an environment variable that I can read?
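In case it helps, here's the kind of toggle I'm imagining, assuming Scrapy Cloud sets some job environment variable such as SHUB_JOBKEY (unverified; the pipeline path is a hypothetical name from my project):

```python
import os

def running_on_scrapy_cloud():
    # SHUB_JOBKEY is assumed to be set by the Scrapy Cloud job runner;
    # this needs verifying in a real job's environment.
    return "SHUB_JOBKEY" in os.environ

# settings.py sketch: enable the write-to-file pipeline only locally.
ITEM_PIPELINES = {}
if not running_on_scrapy_cloud():
    ITEM_PIPELINES["myproject.pipelines.WriteToFilePipeline"] = 300
```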

Thanks a lot for your help!



Thanks for your nice feedback Darndt,

When using Scrapy Cloud you can also use the Magic Fields addon to set a field and value that your script can recognize, and then set up instructions based on that. Please take a few moments to read:


Have you already checked my article about pipelines:

Even if not strictly related to your question, perhaps you can set up some instructions using this feature.

Please don't hesitate to share with us if you find a useful solution; your inquiry seems very interesting!

Kind regards,

Xerxes 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 2

Hello, I've gotten 502, 500 and other errors over the last 2 days...


[root] Script initialization failed
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 145, in _run_usercode
    _run(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
    _run_scrapy(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/src/slybot/slybot/slybot/spidermanager.py", line 91, in from_settings
    return cls(datadir, zipfile, spider_cls, settings=settings)
  File "/src/slybot/slybot/slybot/spidermanager.py", line 84, in __init__
  File "/src/slybot/slybot/slybot/spidermanager.py", line 29, in __init__
    self._specs = open_project_from_dir(datadir)
  File "/src/slybot/slybot/slybot/utils.py", line 54, in open_project_from_dir
    spec.setdefault("templates", []).extend(templates)
  File "/src/slybot/slybot/slybot/utils.py", line 87, in load_external_templates
    yield _build_sample(sample, legacy=version < '0.13.0')
  File "/src/slybot/slybot/slybot/utils.py", line 101, in _build_sample
    Annotations().save_extraction_data(data, sample, legacy=legacy)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py", line 58, in save_extraction_data
    annotation_data, html, bool(options.get('legacy')))
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py", line 370, in apply_annotations
    selector_annotations, numbered_html)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py", line 312, in apply_selector_annotations
    repeated_parent = add_repeated_field(annotation, elems, page)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py", line 344, in add_repeated_field
    parent = _get_parent(elems, page)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/migration.py", line 343, in _get_parent
    parent = annotations[0]
TypeError: 'NoneType' object has no attribute '__getitem__'


That's not the only problem. There is definitely something wrong in the parser. It works in the browser (Portia), but then it doesn't parse the needed fields. This happens on MOST sites (like 15 out of 20). It was better some time ago, with the previous version of Portia.

ALSO, why did you remove the possibility of parsing meta fields? Most information is usually there...

In the last 3 weeks I tried to scrape 20 sites; only on 5 was it OK. It took me around 100 hours of work. The system is so bugged. It's like a black box: the errors don't give me any useful information and I can't control anything.

For example, it works on the client side but then nothing is scraped. Or some data is scraped, but not all of it. Or it's scraped from some pages but not others (even though the HTML is the same).

You can't test a filter on an arbitrary page now, like you could before. That was very useful.

A day ago I got this problem, probably when I used the selector *[itemprop="reviewCount"] (trying to parse meta tags):

Traceback (most recent call last):

  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/extension.py", line 78, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py", line 98, in process_spider_output
    for r in result:
  File "/src/slybot/slybot/slybot/spider.py", line 228, in _handle
    for item_or_request in itertools.chain(*generators):
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 122, in handle_html
    items, link_regions = self.extract_items(htmlpage, response)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 137, in extract_items
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 145, in _do_extract_items_from
    extracted, template = extractor.extract(htmlpage, pref_template_id)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/extractors.py", line 103, in extract
    extracted = extraction_tree.extract(extraction_page)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/extractors.py", line 22, in extract
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/container_extractors.py", line 285, in extract
    region, page, ignored_regions, surrounding, **kwargs)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/container_extractors.py", line 300, in _extract_items_from_region
    ignored_regions, **kwargs
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/region_extractors.py", line 66, in extract
    end_index, **kwargs)
  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/region_extractors.py", line 103, in _doextract
    end_region, self.best_match, **kwargs)
  File "/src/scrapely/scrapely/extraction/similarity.py", line 179, in similar_region
    suffix, prefix_index + 1, range_end)
  File "/src/scrapely/scrapely/extraction/similarity.py", line 85, in longest_unique_subsequence
    matches = naive_match_length(to_search, subsequence, range_start, range_end)
  File "scrapely/extraction/_similarity.pyx", line 155, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3845)
    cpdef naive_match_length(sequence, pattern, int start=0, int end=-1):
  File "scrapely/extraction/_similarity.pyx", line 158, in scrapely.extraction._similarity.naive_match_length (scrapely/extraction/_similarity.c:3648)
    return np_naive_match_length(sequence, pattern, start, end)
  File "scrapely/extraction/_similarity.pyx", line 104, in scrapely.extraction._similarity.np_naive_match_length (scrapely/extraction/_similarity.c:2952)
    while sequence[k] == pattern[j]:
IndexError: Out of bounds on buffer access (axis 0)

Hi Xerxes,

A new version of Portia was released today fixing all major bugs.

About bans: we can't guarantee Portia will handle parsing and extraction for all sites. Some sites simply try to avoid being crawled and make it very hard for you to extract data, and some others have severe policies to stop crawling. That's why sometimes you need more powerful tools, and our experts can assist you.

If your project becomes too complex and ambitious to be handled with Portia, don't hesitate to consider our Professional services and request a free quote: https://scrapinghub.com/quote

Best regards,


Laurent Ades 1 month ago in Portia • updated 4 weeks ago 9


New to all this... very exciting!

To give it a try I am scraping the open-mesh website to get all the products (less than 30 in total).

I have created a spider and defined a sample page whose pattern is followed by all product pages, as expected...

It works pretty well, except for the price, which is sometimes scraped and sometimes not... and I can't put my finger on a specific reason coming from a difference between one page and another. I have tried defining fields with CSS or XPath, but it does not change anything...

I have read other posts that kinda sound like my issue - but not exactly - whereby extraction does not always come up as expected...

Is this a bug to be corrected in the coming version (as I have read), or am I doing something stupid?



Hi Laurent, thanks for sharing your results. Unfortunately we can't guarantee a successful interaction between Portia extractors and the site you are crawling; it depends not just on the settings you made, but also on the site structure.

If your project is urgent and you need the data with excellent accuracy, don't hesitate to ask for our professional services and request a free quote: https://scrapinghub.com/quote
Our developers can help you set up and deploy powerful crawlers using the most advanced technology and easily achieve the best results.

Kind regards,


jbothma 1 month ago in Scrapy Cloud 0

My spider gets killed after 2 hours of syncing dotscrapy.

DotScrapy Persistence and the HTTP Cache worked fine for a few days: I set a 4-day lifetime, it populated the cache and did a few good scrapes, then the cache expired and it had a couple of slow scrapes repopulating the cache. But since 2017-02-17 21:00:08 UTC my jobs get SIGTERM after only 2 hours.

This is where my log ends each time, after around 1 hour 55 minutes.

2017-02-17 21:00:16
[scrapy_dotpersistence] Syncing .scrapy directory from s3://scrapinghub-app-dash-addons/org-66666/79193/dot-scrapy/mfma/

2017-02-17 22:51:11
[scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force

I'm trying to figure out how to clear the dotscrapy storage and start afresh, but I'd also like to know whether I'm doing something wrong so I don't get into this situation again.

Why would my job just get SIGTERM? Is there something killing slow AWS activity?

I don't think my dotscrapy is too large, because I enabled gzip for the HTTP cache, which dealt with the out-of-space errors I initially got with the cache.

ghostmou 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 1

I am having issues downloading a paginated CSV. I have seen in the documentation that the parameters to control pagination are:

  • start, to indicate the start of the page, in the format <project>/<spider>/<job>/<item_id>.
  • count, to indicate the page size.

When I try it in JSON format, it works like a charm. But with CSV, I'm having issues:

  • Parameter count works properly, returning the desired page size.
  • Parameter start seems to fail in CSV. I have tried both the recommended format (<project>/<spider>/<job>/<item_id>) and a numeric format (for example, to start the page at item 2500, pass the number 2499).

It seems to ignore the start parameter...

Example URLs used:

Any suggestions? :(

Thank you!



You can use the Items API for this:

curl -u APIKEY: "https://storage.scrapinghub.com/items/<project_id>/<spider_id>/<job_id>?format=csv&include_headers=1&fields=field1,field2,field3&start=<project_id>/<spider_id>/<job_id>/<item_id>&count=x"


Or with the API key as a query parameter:

"https://storage.scrapinghub.com/items/<project_id>/<spider_id>/<job_id>?apikey=<apikey>&format=csv&include_headers=1&fields=field1,field2,field3&start=<project_id>/<spider_id>/<job_id>/<item_id>&count=x"

Adam 1 month ago in Portia • updated by Laurent Ades 1 month ago 2

Hi guys,

My spider doesn't capture all of the fields that I've specified, even though it seems to work in the "Extracted Items" preview. I've tried different things and still no luck.

Some facts:

  • Data is available on the page load (it's not loaded with AJAX).
  • It's happening for all of the scraped pages.
  • I have 100% match on 3 out of 7 fields and none on the remaining 4.
  • I have tried setting up new sample page from scratch, using new schema but I still have the same issue.

There's nothing unusual in the log:

0: 2017-02-16 08:02:39 INFO Log opened.
1: 2017-02-16 08:02:39 INFO [scrapy.log] Scrapy 1.2.2 started
2: 2017-02-16 08:02:40 INFO [stderr] /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:7: HubstorageDeprecationWarning: python-hubstorage is deprecated, please use python-scrapinghub >= 1.9.0 instead (https://pypi.python.org/pypi/scrapinghub).
3: 2017-02-16 08:02:40 INFO [stderr] from hubstorage import ValueTooLarge
4: 2017-02-16 08:02:40 INFO [stderr] /usr/local/lib/python2.7/site-packages/scrapy/crawler.py:129: ScrapyDeprecationWarning: SPIDER_MANAGER_CLASS option is deprecated. Please use SPIDER_LOADER_CLASS.
5: 2017-02-16 08:02:40 INFO [stderr] self.spider_loader = _get_spider_loader(settings)
6: 2017-02-16 08:02:40 INFO [root] Slybot 0.13.0b30 Spider
7: 2017-02-16 08:02:40 INFO [stderr] /src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py:334: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
8: 2017-02-16 08:02:40 INFO [stderr] elems = [elem._root for elem in page.css(selector)]
9: 2017-02-16 08:02:40 INFO [scrapy.utils.log] Scrapy 1.2.2 started (bot: scrapybot)
10: 2017-02-16 08:02:40 INFO [scrapy.utils.log] Overridden settings: {'LOG_LEVEL': 'INFO', 'AUTOTHROTTLE_ENABLED': True, 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'MEMUSAGE_LIMIT_MB': 950, 'TELNETCONSOLE_HOST': '', 'LOG_ENABLED': False, 'MEMUSAGE_ENABLED': True}
11: 2017-02-16 08:02:40 WARNING [py.warnings] /src/slybot/slybot/slybot/closespider.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
12: 2017-02-16 08:02:40 INFO [scrapy.log] HubStorage: writing items to https://storage.scrapinghub.com/items/156095/5/17
13: 2017-02-16 08:02:40 INFO [scrapy.middleware] Enabled extensions:
14: 2017-02-16 08:02:40 INFO [scrapy.middleware] Enabled downloader middlewares:
15: 2017-02-16 08:02:40 WARNING [py.warnings] /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:50: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
16: 2017-02-16 08:02:40 INFO [scrapy.log] HubStorage: writing pages to https://storage.scrapinghub.com/collections/156095/cs/Pages
17: 2017-02-16 08:02:41 INFO [scrapy.middleware] Enabled spider middlewares:
18: 2017-02-16 08:02:41 INFO [scrapy.middleware] Enabled item pipelines:
19: 2017-02-16 08:02:41 INFO [scrapy.core.engine] Spider opened
20: 2017-02-16 08:02:41 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
21: 2017-02-16 08:02:41 INFO TelnetConsole starting on 6023
22: 2017-02-16 08:02:51 WARNING [py.warnings] /src/slybot/slybot/slybot/plugins/scrapely_annotations/processors.py:226: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
23: 2017-02-16 08:02:51 WARNING [py.warnings] /src/slybot/slybot/slybot/plugins/scrapely_annotations/processors.py:213: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
24: 2017-02-16 08:03:42 INFO [scrapy.extensions.logstats] Crawled 149 pages (at 149 pages/min), scraped 91 items (at 91 items/min)
25: 2017-02-16 08:04:33 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
26: 2017-02-16 08:04:33 INFO [scrapy.core.engine] Closing spider (shutdown)
27: 2017-02-16 08:04:41 INFO [scrapy.extensions.logstats] Crawled 188 pages (at 39 pages/min), scraped 126 items (at 35 items/min)
28: 2017-02-16 08:05:11 INFO [scrapy.statscollectors] Dumping Scrapy stats:
29: 2017-02-16 08:05:12 INFO [scrapy.core.engine] Spider closed (shutdown)
30: 2017-02-16 08:05:12 INFO (TCP Port 6023 Closed)
31: 2017-02-16 08:05:12 INFO Main loop terminated.

Any ideas why this happened?


I've tried setting up a brand new spider from scratch; the same problem occurred.

On the original spider, I've added some random fields that are always on the page (such as the login link or telephone number), and those don't seem to get picked up either.

I've tried renaming one of the 3 fields that work, to see if my changes are actually being deployed. This worked: I could see the renamed field in the scraped data. Still missing the other 4 fields, though.




Hi Adam, our Portia team is about to release a new version of Portia with fixes for most of the bugs reported by our users, like this one.

We will post an update in our community when this new release is out.

Kind regards,


triggerdev 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 1


I am running a spider job every 3 minutes, and I would like to know: how can I get the last job from shub (Windows version)?




You can use the JobQ API (https://doc.scrapinghub.com/api/jobq.html#jobq-project-id-list) with the parameters state=finished and count=1 (if you only want the last one).

Or you can use the Jobs API (https://doc.scrapinghub.com/api/jobs.html#jobs-list-json-jl).
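For example, a sketch of the JobQ request for the "last finished job" case (substitute your numeric project ID, and authenticate with `curl -u APIKEY:` or an Authorization header):

```python
def jobq_list_url(project_id, state="finished", count=1):
    # Builds the JobQ list endpoint URL documented above; the query
    # parameters state and count filter and limit the returned jobs.
    return ("https://storage.scrapinghub.com/jobq/{}/list"
            "?state={}&count={}".format(project_id, state, count))

# Last finished job of (hypothetical) project 53:
url = jobq_list_url(53)
```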

mrcai 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 1


I'm receiving the following error.

[scrapy.extensions.feedexport] Unknown feed format: 'jsonlines'

This works in my development environment; I'm not sure if I've missed enabling a plugin?

Many thanks,



If you are adding FEED_FORMAT via the UI settings, try removing the quotes (' ').
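To illustrate the difference (a sketch; the feed file name is arbitrary):

```python
# settings.py - here the quotes are Python string syntax, so this is fine:
FEED_FORMAT = 'jsonlines'
FEED_URI = 'items.jl'

# In the Scrapy Cloud UI (Spider settings -> Raw settings), the value is
# taken literally, so enter it without quotes:
#   FEED_FORMAT = jsonlines
```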

Tristan Bailey 1 month ago in Portia • updated 1 month ago 2

I see in Portia 2.0 there is the option for FeedUrl as a starting-page list type - text, one link per line.

Is it possible to pass this FeedUrl in the API when starting a new spider, like "start_urls"?
(It looks like maybe not?)

Second part: there is another post that mentions you can do it with an RSS or XML sitemap.

I cannot find the docs for this. It looks like it might work, but is there a spec for these formats, as they can vary?

Third part: is there any limit to the number of URLs in these bulk seeding methods?




Hi Tristan,

For the first question: the feed refers to a URL, so if you can update the data provided at that URL and schedule the spider in Scrapy Cloud, you could solve this. Perhaps there's a more efficient solution that our community members would like to share.

I think the second question is related to the first one. But feel free to elaborate a bit more on what you want to achieve so we can find a possible solution.

For the last question: according to our Portia developers there's no limit on the number of URLs, but keep in mind that pushing Portia beyond its limits has, as you may have experienced, uncomfortable consequences due to memory usage and the capacity of our free storage.

Feel free to explore using Portia and share with us what you find. Your contributions are very helpful.

Kind regards,


19dc60 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

I am getting the following error when attempting to open my spider in Portia. Please advise why:

"Failed to load resource: the server responded with a status of 403 (Forbidden)"


Hi 19dc60!

Possibly a network issue; it's working fine now. Feel free to ask if you need further assistance.

Kind regards,


maniac103 1 month ago in Datasets • updated by Pablo Hoffman (Director) 1 month ago 2

I have a couple of spiders whose results I want to automatically publish to a public dataset in the dataset catalog, overwriting the data from the previous spider run. I seem to be unable to do that, because datasets appear to be tied to jobs/runs, not to the spider in general. Am I missing something there? The end goal is to be able to fetch the data from an app, so I need a static URL for the last run's data. Unfortunately the method described in [1] doesn't work for me, as it requires me to put my API key (which allows read/write access to the project) into the URL, which is not an option in this (open source) app.

Thanks for your help.

[1] http://help.scrapinghub.com/scrapy-cloud/fetching-latest-spider-data


Hi Maniac,

This is a feature we have discussed, and even though we plan to incorporate it at some point, we can't provide an ETA yet.

I will forward the bug report to the product team.