Collections API#

Collections are key-value stores for an arbitrarily large number of records. They are especially useful for storing information produced or used by multiple scraping jobs.

Note

The frontier API is best suited to store queues of URLs to be processed by scraping jobs.

Quickstart#

A collection is identified by a project id, a type, and a name. A record can be any JSON object and is identified by its _key field.

The following examples use project id 78 and the regular store type s for a collection named my_collection.

Note

Avoid using multiple collections with the same name and different types like /s/my_collection and /cs/my_collection. During operations on an entire collection, like renaming or deleting, Hubstorage will treat homonyms as a single entity and rename or delete both.

Create/Update a record:#

$ curl -u $APIKEY: -X POST -d '{"_key": "foo", "value": "bar"}' \
    https://storage.scrapinghub.com/collections/78/s/my_collection

Access a record:#

$ curl -u $APIKEY: -X GET \
    https://storage.scrapinghub.com/collections/78/s/my_collection/foo

Delete a record:#

$ curl -u $APIKEY: -X DELETE \
    https://storage.scrapinghub.com/collections/78/s/my_collection/foo

List records:#

$ curl -u $APIKEY: -X GET \
    https://storage.scrapinghub.com/collections/78/s/my_collection

Create/Update multiple records:#

The JSON Lines format is used by default (one JSON object per line, separated by newlines):

$ curl -u $APIKEY: -X POST -d $'{"_key": "foo", "value": "bar"}\n{"_key": "goo", "value": "baz"}' \
    https://storage.scrapinghub.com/collections/78/s/my_collection
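
The batch body above can also be produced programmatically. The following is a minimal sketch using only the Python standard library; the helper names to_jsonlines and build_batch_post are hypothetical, and the request is built but not sent. Authentication (the API key as the basic-auth username, as in the curl calls) is left out for brevity.

```python
import json
import urllib.request


def to_jsonlines(records):
    """Serialize an iterable of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def build_batch_post(url, records):
    """Build (but do not send) a POST request carrying the records as JSON Lines."""
    body = to_jsonlines(records).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


records = [
    {"_key": "foo", "value": "bar"},
    {"_key": "goo", "value": "baz"},
]
body = to_jsonlines(records)
request = build_batch_post(
    "https://storage.scrapinghub.com/collections/78/s/my_collection", records
)
```

Here body holds the same two lines that the curl example passes with -d.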

Details#

The following collection types are available:

Type  Full name              Hubstorage method           Description
----  ---------------------  --------------------------  ----------------------------------------------------------------
s     store                  new_store                   Basic set store
cs    cached store           new_cached_store            Items expire after a month
vs    versioned store        new_versioned_store         Up to 3 copies of each item will be retained
vcs   versioned cache store  new_versioned_cached_store  Multiple copies are retained, and each one expires after a month

Records are JSON objects, with the following constraints:

  • Their serialized size can’t be larger than 1 MB;

  • Infinite floating-point values (such as JavaScript’s Infinity) are not supported;

  • Floating-point numbers can’t be larger than 2^64 - 1.
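
These constraints can be checked on the client before writing. The sketch below is one possible pre-flight check, not part of the API: record_problems is a hypothetical helper, "1 MB" is assumed to mean 1024 * 1024 bytes of UTF-8 encoded JSON, and the server's own serialization may differ slightly in size.

```python
import json
import math

MAX_RECORD_BYTES = 1024 * 1024  # assumption: "1 MB" read as 1 MiB
MAX_FLOAT = 2 ** 64 - 1


def iter_floats(value):
    """Yield every float found in a nested structure of dicts and lists."""
    if isinstance(value, float):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_floats(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_floats(item)


def record_problems(record):
    """Return a list of constraint violations; an empty list means none found."""
    problems = []
    floats = list(iter_floats(record))
    if any(math.isinf(x) for x in floats):
        problems.append("infinite value")
    if any(math.isfinite(x) and x > MAX_FLOAT for x in floats):
        problems.append("float larger than 2^64 - 1")
    if len(json.dumps(record).encode("utf-8")) > MAX_RECORD_BYTES:
        problems.append("serialized size over 1 MB")
    return problems
```

A record such as {"_key": "foo", "value": "bar"} yields an empty list, while a record containing float("inf") is reported before it reaches the API.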

API#

collections/:project_id/list#

List all collections.

$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/list
{"type":"s","name":"my_collection"}
{"type":"s","name":"my_collection_2"}
{"type":"cs","name":"my_other_collection"}
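
Because whole-collection operations treat same-name collections of different types as one entity, it can help to scan this listing for homonyms before renaming or deleting. A minimal sketch, assuming the response body is the JSON Lines text shown above; find_homonyms is a hypothetical helper, and the second listing adds a made-up entry to show a positive match.

```python
import json
from collections import defaultdict


def find_homonyms(listing):
    """Map each collection name that appears under several types to those types."""
    types_by_name = defaultdict(set)
    for line in listing.splitlines():
        if line.strip():
            entry = json.loads(line)
            types_by_name[entry["name"]].add(entry["type"])
    return {
        name: sorted(types)
        for name, types in types_by_name.items()
        if len(types) > 1
    }


listing = (
    '{"type":"s","name":"my_collection"}\n'
    '{"type":"s","name":"my_collection_2"}\n'
    '{"type":"cs","name":"my_other_collection"}\n'
)
clean = find_homonyms(listing)
# Made-up extra entry: a cached store that reuses an existing name.
clash = find_homonyms(listing + '{"type":"cs","name":"my_collection"}\n')
```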

collections/:project_id/:type/:collection#

Read, write or remove items in a collection.

Parameter    Description                                                      Required
-----------  ---------------------------------------------------------------  --------
key          Read items with a specified key. Multiple values are supported.  No
prefix       Read items with a specified key prefix.                          No
prefixcount  Maximum number of values to return per prefix.                   No
startts      UNIX timestamp at which to begin results, in milliseconds.       No
endts        UNIX timestamp at which to end results, in milliseconds.         No

Method  Description                                  Supported parameters
------  -------------------------------------------  ----------------------------------------
GET     Read items from the specified collection.    key, prefix, prefixcount, startts, endts
POST    Write items to the specified collection.
DELETE  Delete items from the specified collection.  key, prefix, prefixcount, startts, endts

Note

Pagination and meta parameters are supported; see Pagination and Meta parameters.

GET examples:

$ curl -u $APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?key=foo1&key=foo2"
{"value":"bar1"}
{"value":"bar2"}
$ curl -u $APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?prefix=f"
{"value":"bar"}
$ curl -u $APIKEY: "https://storage.scrapinghub.com/collections/78/s/my_collection?startts=1402699941000&endts=1403039369570"
{"value":"bar"}

Prefix filters, unlike other filters, use indexes and should be used when possible. You can use the prefixcount parameter to limit the number of values returned for each prefix.
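
Filter URLs like the ones in the GET examples can be assembled with the standard library, which takes care of repeating the key parameter and escaping values. A minimal sketch; collection_url is a hypothetical helper, and only the parameters listed in the table above are assumed to exist.

```python
from urllib.parse import urlencode

BASE = "https://storage.scrapinghub.com/collections"


def collection_url(project_id, type_, name, **filters):
    """Build a collection URL; list values become repeated query parameters."""
    url = f"{BASE}/{project_id}/{type_}/{name}"
    query = urlencode(
        {k: v for k, v in filters.items() if v is not None}, doseq=True
    )
    return f"{url}?{query}" if query else url


by_keys = collection_url(78, "s", "my_collection", key=["foo1", "foo2"])
by_prefix = collection_url(78, "s", "my_collection", prefix="f", prefixcount=10)
```

by_keys matches the URL of the first GET example; by_prefix limits the prefix query to ten values per prefix.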

A common pattern is to download changes within a certain time period. You can use the startts and endts parameters to select records within a certain time window.

The current timestamp can be retrieved like so:

$ curl https://storage.scrapinghub.com/system/ts
1403039369570
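
Given that server timestamp, a time window is plain millisecond arithmetic. The sketch below derives startts and endts for a look-back period; time_window is a hypothetical helper, and the one-hour period is only an example.

```python
from datetime import timedelta


def time_window(now_ms, lookback):
    """Return (startts, endts) in milliseconds covering `lookback` up to now_ms."""
    start_ms = now_ms - int(lookback.total_seconds() * 1000)
    return start_ms, now_ms


# now_ms is the value returned by the system/ts call above.
startts, endts = time_window(1403039369570, timedelta(hours=1))
```

Taking the current time from system/ts rather than from the local clock keeps the window aligned with the server that stamped the records.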

Note

Timestamp filters may perform poorly when selecting a small number of records from a large collection.

collections/:project_id/:type/:collection/count#

Count the number of items in a collection.

$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/count
{"count":972,"scanned":972}

If the collection is large, the result may contain a nextstart field that is used for pagination; see Pagination.
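
One way to handle a paginated count is to keep requesting until nextstart disappears and add up the partial counts. The sketch below rests on two assumptions that should be checked against the Pagination section: that each response counts only the items it scanned, and that nextstart is passed back as the start of the next request. fetch_count is a hypothetical callable standing in for the HTTP call, and the two-page data is made up.

```python
def count_all(fetch_count):
    """Sum per-page counts, following nextstart until it is absent.

    fetch_count(start) is a hypothetical callable that requests the count
    endpoint (passing start when it is not None) and returns the decoded JSON.
    """
    total = 0
    start = None
    while True:
        page = fetch_count(start)
        total += page["count"]
        start = page.get("nextstart")
        if start is None:
            return total


# Offline stand-in for the HTTP call: two made-up pages keyed by start value.
pages = {
    None: {"count": 1000, "scanned": 1000, "nextstart": "k1000"},
    "k1000": {"count": 972, "scanned": 972},
}
total = count_all(lambda start: pages[start])
```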

collections/:project_id/:type/:collection/:item#

Read, write, or delete an individual item.

Method  Description
------  ----------------------------------
GET     Read the item with the given key
POST    Write the item with the given key
DELETE  Delete the item with the given key

$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/foo
{"value":"bar"}

collections/:project_id/:type/:collection/:item/value#

Read an individual item value.

$ curl -u $APIKEY: https://storage.scrapinghub.com/collections/78/s/my_collection/foo/value
bar

collections/:project_id/:type/:collection/deleted#

POST with a list of item keys to delete them.

Note

This endpoint is designed to delete a large number of non-consecutive items. To delete consecutive items use DELETE-based endpoints, which are faster.

$ curl -u $APIKEY: -X POST -d '"foo"' -d '"bar"' \
    https://storage.scrapinghub.com/collections/78/s/my_collection/deleted

collections/:project_id/delete?name=:collection#

Delete an entire collection immediately.

$ curl -u $APIKEY: -X POST "https://storage.scrapinghub.com/collections/78/delete?name=my_collection"

collections/:project_id/rename?name=:collection&new_name=:new_name#

Rename a collection and move all its items immediately.

$ curl -u $APIKEY: -X POST "https://storage.scrapinghub.com/collections/78/rename?name=my_collection&new_name=my_collection_renamed"