This article presents several approaches to using private dependencies in your Scrapy Cloud project.
Let's assume your private dependency is hosted in a git repository.
A straightforward way would be to provide the credentials embedded into the git repo url:
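For example, a requirements.txt entry could look like this (the username, password, and repository path are placeholders):

```
git+https://<username>:<password>@github.com/yourcompany/private-repo.git#egg=private-repo
```

Note that the credentials end up stored in plain text in your project, so this is the least secure option.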
Another option, if you use Github, would be to issue a Github personal access token and provide it instead like:
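A requirements.txt entry using a personal access token could look like this (the token and repository path are placeholders):

```
git+https://<token>@github.com/yourcompany/private-repo.git#egg=private-repo
```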
The token also provides access to all your Git repositories and should be treated as a password.
There's another option related to requirements.txt, although it requires some development: you can run your own private PyPI server (for example, devpi). It can be used like:

```
--extra-index-url <Repo-URL>
my-pkg==0.0.1
```
However, if you want to keep the packages private, you have to enable authorization on the server and, in a similar way, provide credentials so your private dependencies can be installed in Scrapy Cloud.
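For example, the credentials can be embedded directly in the index URL in requirements.txt (the hostname, username, and password below are placeholders):

```
--extra-index-url https://<username>:<password>@pypi.mycompany.com/simple/
my-pkg==0.0.1
```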
Using a custom Docker image
Using a custom Docker image allows customizing a lot of things, including private dependencies.
The first approach uses SSH keys. Assuming you have:
- a requirements.txt file containing an entry for the private repository,
- a pair of SSH keys (id_rsa / id_rsa.pub):
  * you've added id_rsa.pub as a deployment key for the private repository,
  * you've copied id_rsa to your project directory,
- a configured project using a custom Docker image (check this blog post for more details).
Add the following lines before the pip install requirements statement in your Dockerfile:

```
RUN mkdir -p /root/.ssh
COPY ./id_rsa /root/.ssh/id_rsa
RUN ssh-keyscan -t rsa github.com > /root/.ssh/known_hosts
```
This case assumes that the private repository is on github.com. If it's on another domain, you must adjust the last line according to the repository's domain.
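For example, for a self-hosted GitLab instance the last line might become (the hostname is a placeholder):

```
RUN ssh-keyscan -t rsa gitlab.mycompany.com > /root/.ssh/known_hosts
```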
Then you should continue with the guide and deploy your project.
Using a vendor folder is an alternative that doesn't require generating and storing SSH keys in the image (nor in the repository). The idea is simple, just:
- clone the private dependencies locally into a known subdirectory like vendor/,
- copy that directory into the image in your Dockerfile,
- finally, reference the vendored folders in requirements.txt.
Project structure looks like this:

```
.
├── Dockerfile
├── requirements.txt
├── scrapinghub.yml
├── scrapy.cfg
├── setup.py
├── scrapyprojectname
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middleware.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── supercoolspider.py
└── vendor
    └── myprivatelibrary
        ├── setup.py
        ├── i_am_not_public
        │   ├── __init__.py
        │   ├── remote.py
        │   └── utils.py
...
```
Then requirements.txt looks like:

```
Scrapy==1.1.0
tldextract==1.0.0
-e vendor/myprivatelibrary
# or simply vendor/myprivatelibrary
```
Be sure that the vendor path is copied into the image in the Dockerfile and not ignored by .dockerignore.
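A minimal Dockerfile sketch for this setup might look like the following (the base image and paths are illustrative and depend on your stack):

```
FROM scrapinghub/scrapinghub-stack-scrapy:latest
COPY ./requirements.txt /app/requirements.txt
# copy the vendored dependencies before installing requirements,
# so that the -e vendor/... entries resolve
COPY ./vendor /app/vendor
RUN pip install --no-cache-dir -r /app/requirements.txt
```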
There's another variation of the vendoring approach: combining it with Git submodules. The advantage is that it keeps local development consistent with the image build environment.
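With submodules, the vendored dependency is tracked by reference instead of being copied in. A sketch of the setup (the repository URL and path are placeholders):

```
# register the private repo as a submodule under vendor/
git submodule add git@github.com:yourcompany/myprivatelibrary.git vendor/myprivatelibrary
# fetch the submodule contents after cloning the main project
git submodule update --init
```

After this, the vendor/myprivatelibrary folder can be referenced from requirements.txt exactly as in the plain vendoring approach.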