Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Answered
maniac103 2 months ago in Datasets • updated by Pablo Hoffman (Director) 2 months ago 2

I have a couple of spiders whose results I want to publish automatically to a public dataset in the dataset catalog, overwriting the data from the previous run. I seem to be unable to do that, because datasets appear to be tied to individual jobs/runs rather than to the spider in general. Am I missing something there?

My end goal is to fetch the data from an app, so I need a static URL for the last run's data. Unfortunately the method described in [1] doesn't work for me, as it requires me to put my API key (which allows read/write access to the project) into the URL, and that is not an option in this (open source) app.


Thanks for your help.


[1] http://help.scrapinghub.com/scrapy-cloud/fetching-latest-spider-data
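For context, the approach in [1] boils down to requesting a storage endpoint whose URL embeds the API key. A minimal sketch of that pattern, where the base URL, path shape, and parameter names are assumptions based on that article rather than a definitive API reference:

```python
import json
import urllib.request

# Assumed base URL for the Scrapy Cloud storage API, per the linked article.
STORAGE_API = "https://storage.scrapinghub.com"

def latest_items_url(project_id, spider_id, apikey, fmt="json"):
    """Build a URL for a spider's latest items.

    The path shape and parameter names are assumptions taken from [1].
    Note that the API key ends up directly in the URL.
    """
    return f"{STORAGE_API}/items/{project_id}/{spider_id}?apikey={apikey}&format={fmt}"

def fetch_latest_items(project_id, spider_id, apikey):
    # Network call; requires a valid project ID, spider ID, and API key.
    with urllib.request.urlopen(latest_items_url(project_id, spider_id, apikey)) as resp:
        return json.load(resp)
```

Because the key grants read/write access to the whole project, shipping such a URL inside an open-source app would expose it, which is exactly the limitation described above.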

Answer

Hi Maniac,


This is a feature we have discussed, and even though we plan to incorporate it at some point, we can't provide an ETA yet.


I will forward this feature request to the product team.

Fixed
nyov 10 months ago in Datasets • updated 6 months ago 7

Downloads are broken when fetching compiled datasets.

The JSON objects are concatenated with no separator at all, instead of being comma-separated, and no surrounding list is built:

{"set":1}{"set":2}{"set":3}

should be:

[{"set":1},{"set":2},{"set":3}]

Result:

$ json_xs -t none <items-360.json
garbage after JSON object, at character offset 238 (before "{"_key":"46278/1/12/...") at /usr/bin/json_xs line 177, <STDIN> line 1.
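As a client-side workaround, a stream of back-to-back JSON objects like the one above can still be parsed with `json.JSONDecoder.raw_decode`, which reads one object at a time and reports where it stopped. A minimal sketch using the example data from this report:

```python
import json

def parse_concatenated_json(text):
    """Split a stream of back-to-back JSON objects into a Python list."""
    decoder = json.JSONDecoder()
    objects = []
    idx = 0
    text = text.strip()
    while idx < len(text):
        # raw_decode returns the next object and the index where it ended.
        obj, end = decoder.raw_decode(text, idx)
        objects.append(obj)
        # Skip any whitespace between objects before the next decode.
        while end < len(text) and text[end].isspace():
            end += 1
        idx = end
    return objects

broken = '{"set":1}{"set":2}{"set":3}'
print(json.dumps(parse_concatenated_json(broken)))
```

This reconstructs the expected list form, `[{"set":1},{"set":2},{"set":3}]`, from the broken download.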
Answer

This issue has been fixed, thanks for reporting!