NOT TO BE CONFUSED WITH THE DOTSCRAPY PERSISTENCE ADDON


The purpose of this extension is to keep the content of the .scrapy directory in a persistent store, which is loaded when the spider starts and saved when the spider finishes. It allows spiders to share data between different runs, keeping state or any other kind of data that needs to be persisted.


The .scrapy directory is well known in Scrapy, and a few extensions use it to keep state between runs. The canonical way to work with the .scrapy directory is through the scrapy.utils.project.data_path function, as illustrated in the following example:

from scrapy.utils.project import data_path

# data_path() joins a relative path with the project's .scrapy directory;
# 'mydata' is an arbitrary subdirectory name, and createdir=True creates
# the directory if it does not exist yet.
mydata_path = data_path('mydata', createdir=True)

# ... use mydata_path to store or read data that will be persisted across runs ...
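For instance, a spider could persist the set of URLs it has already processed. The snippet below is a minimal sketch of that pattern; the 'mydata' subdirectory and 'seen_urls.txt' filename are illustrative choices, not part of Scrapy or this extension:

import os

from scrapy.utils.project import data_path

# Resolve (and create, if needed) a subdirectory inside .scrapy.
state_dir = data_path('mydata', createdir=True)
state_file = os.path.join(state_dir, 'seen_urls.txt')

# Load the state persisted by previous runs, if any.
seen_urls = set()
if os.path.exists(state_file):
    with open(state_file) as f:
        seen_urls = {line.strip() for line in f}

# ... crawl, adding newly processed URLs to seen_urls ...

# Save the state so the next run can pick it up.
with open(state_file, 'w') as f:
    f.write('\n'.join(sorted(seen_urls)))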


Enabling DotScrapy Persistence


Enable the extension by adding the following settings to your settings.py:

EXTENSIONS = {
    # ... your other extensions ...
    'scrapy_dotpersistence.DotScrapyPersistence': 0,
}

and

DOTSCRAPY_ENABLED = True


Configuring DotScrapy Persistence


Configure the extension through the following settings:

ADDONS_AWS_ACCESS_KEY_ID = 'ABC'
ADDONS_AWS_SECRET_ACCESS_KEY = 'DEF'
ADDONS_AWS_USERNAME = 'username'  # optional; used as the top-level folder in the bucket path
ADDONS_S3_BUCKET = 'my_bucket'
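Hardcoding credentials in settings.py is risky. A common pattern, not specific to this extension, is to read them from environment variables instead; a minimal sketch, assuming the standard AWS variable names are set in your environment:

import os

ADDONS_AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
ADDONS_AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
ADDONS_AWS_USERNAME = 'username'
ADDONS_S3_BUCKET = 'my_bucket'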


With these settings in place, DotScrapy Persistence syncs your .scrapy folder to the S3 bucket using the following path layout:

s3://my_bucket/username/org-<orgid>/<projectid>/dot-scrapy/<spidername>/
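To check what the extension has uploaded, you can list the objects under that prefix with boto3. This is only an illustration: boto3 is not required by the extension, and the organization ID, project ID, and spider name below are hypothetical placeholders:

import boto3

s3 = boto3.client('s3')  # picks up AWS credentials from the environment

# Hypothetical identifiers: organization 123, project 456, spider 'myspider'.
prefix = 'username/org-123/456/dot-scrapy/myspider/'
response = s3.list_objects_v2(Bucket='my_bucket', Prefix=prefix)
for obj in response.get('Contents', []):
    print(obj['Key'])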