You will need the Scrapinghub command line client to deploy projects into Scrapy Cloud, so install it if you have not done so yet. If you already have it installed, make sure you have the latest version:
$ pip install shub --upgrade
The next step is to deploy your Scrapy project to Scrapy Cloud. You will need your API key and the numeric ID of your Scrapy Cloud project. You can find both of these on your project’s Code & Deploys page. First, run:
$ shub login
to save your API key to a local file (~/.scrapinghub.yml). You can remove it from there at any time with shub logout. Next, run:
$ shub deploy
to be guided through a wizard that will set up the project configuration file (scrapinghub.yml) for you. Once you complete the wizard, your project is uploaded to Scrapy Cloud. You can re-deploy at any time, without going through the wizard again, by running shub deploy once more.
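For reference, the configuration file the wizard generates is quite small. A minimal sketch (the project ID below is illustrative, not yours):

```yaml
# scrapinghub.yml as written by the shub deploy wizard (project ID illustrative)
project: 99830
```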
Now you can schedule your spider to run on Scrapy Cloud:
$ shub schedule quotes-toscrape
Spider quotes-toscrape scheduled, job ID: 99830/1/1
Watch the log on the command line:
    shub log -f 1/1
or print items as they are being scraped:
    shub items -f 1/1
or watch it running in Scrapinghub's web interface:
    https://app.scrapinghub.com/p/99830/job/1/1
And watch it run (replace 1/1 with the job ID shub gave you in the previous command; you can leave out the project ID):
$ shub log -f 1/1
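The job ID that shub prints has three parts, which is why the project portion can be dropped inside a project. A quick sketch of how the short form relates to the full one (using the illustrative ID from above):

```python
# The full job ID "99830/1/1" printed by shub is project_id/spider_id/job_id.
# Within a project, shub accepts the short spider/job form ("1/1").
full_id = "99830/1/1"  # illustrative ID from the example above
project_id, short_id = full_id.split("/", 1)
print(project_id)  # 99830
print(short_id)    # 1/1
```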
Alternatively, you can go to your project page and schedule the spider there:
Then select your spider:
You will be redirected to the project dashboard, where you can visually check that your spider is running correctly: the job it created, the items and requests it generates, and so on. Once the run finishes, the job is automatically moved to the completed jobs list.
To clarify some terms, click on the job link (in this case 2/3) and you will be taken to the job's page. Look at the address bar in your browser: the URL contains the project, spider, and job IDs.
The information gathered from this address is:
- project_id: 166395
- spider_id: 2
- job_id: 3
If you run this spider again, the only part that changes is the job_id, which would be 4 in this case. Keeping this structure in mind helps avoid confusion between projects, spiders, and jobs.
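The ID layout above can be sketched with a small helper that pulls the three IDs out of a job URL. The helper name is mine, and it assumes the path always follows the /p/&lt;project&gt;/job/&lt;spider&gt;/&lt;job&gt; shape seen earlier:

```python
from urllib.parse import urlparse

def parse_job_url(url):
    """Hypothetical helper: extract (project_id, spider_id, job_id) from a
    Scrapy Cloud job URL, assuming a /p/<project>/job/<spider>/<job> path."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # e.g. ['p', '166395', 'job', '2', '3']
    return int(parts[1]), int(parts[3]), int(parts[4])

print(parse_job_url("https://app.scrapinghub.com/p/166395/job/2/3"))
# → (166395, 2, 3)
```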
Enjoy creating, deploying and scraping with us!