Here at Scrapinghub we are big fans of Heroku. When people ask what Scrapy Cloud is about, we sometimes tell them that "it's like Heroku, but for web crawlers". Having Heroku as a role model pushes us to get better every single day.
We are sure that most of you are also Heroku fans, and we have heard that many people deploy Scrapy spiders on Heroku. So here is a comparison of Heroku and Scrapy Cloud to help you decide which one better fits your needs for deploying and running Scrapy spiders.
Let's start by looking at the deployment process on both.
Assuming you already have a Scrapy Cloud account and a Scrapy Cloud project, all you need to deploy a project to Scrapy Cloud is shub, the Scrapinghub command line client. You go to your local project's folder and run:
$ shub deploy
Provide the Scrapy Cloud project ID when asked and follow the steps. That's it. After doing this you can go to the project dashboard at Scrapy Cloud and manage your crawler there.
Scrapyd via Heroku
Here we have some options:
Deploy only the Scrapy project, running the spiders via cmdline (heroku run)
Deploy the Scrapy project and build a web UI to control spider execution
Deploy the Scrapy project and Scrapyd, a service to run Scrapy spiders
We decided to follow the last approach. It's the one that comes closest to Scrapy Cloud, because Scrapyd provides an HTTP API to manage spider execution and also a very simple web UI that you can use to view logs, job information and the extracted data. This way we have an effective interface to our spiders and we don't have to reinvent the wheel.
The Scrapy community is awesome. Thanks to its tireless work, we can combine Scrapyd (an open source project) with Heroku via scrapy-heroku. By default, Scrapyd doesn't work on Heroku because it depends on sqlite3, which can't be used there. The scrapy-heroku package overcomes this issue by adding PostgreSQL support to Scrapyd.
Now, let's go through the steps required to deploy a Scrapy crawler + Scrapyd on Heroku.
Assuming that you have already created a Heroku account and a Heroku app, the next step is to set up a Postgres database, which is as simple as enabling an add-on in the Resources tab.
Heroku provides three different ways to deploy your crawlers:
Heroku Git: a git repository with Heroku as the upstream repo
GitHub: automatically deploys the project for every new changeset pushed to GitHub
Dropbox: grabs the code from a Dropbox folder and allows you to deploy it via the web UI
We opted for the first one and the steps are really straightforward: set up a local git repo, install the Heroku Toolbelt and use it to add Heroku as the upstream repository. Then, every changeset you push to Heroku will trigger a new build.
Preparing the Project
We will use the crawler from this repository as the example for this walkthrough. There are some things that you have to change in your project in order to make it work with Scrapyd on Heroku.
1. Add dependencies to your project's requirements.txt. Heroku will automatically deploy those dependencies when you trigger a build:
scrapy
scrapyd
scrapy-heroku
2. Configure Scrapyd in your project's scrapy.cfg:
[scrapyd]
application = scrapy_heroku.app.application

[deploy]
url = http://<YOUR_HEROKU_APP_NAME>.herokuapp.com:80/
project = <YOUR_SCRAPY_PROJECT_NAME>
username = <A_USER_NAME>
password = <A_PASSWORD>
3. Create a file called Procfile in your project's root containing the command that we want to execute when our app is started:
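The Procfile contents are not reproduced in this text. As a minimal sketch, assuming Scrapyd is started through the `scrapyd` command that the Scrapyd package installs and that it runs as Heroku's `web` process type, the file could be a single line:

```
web: scrapyd
```

Treat this as an assumption to check against the scrapy-heroku documentation rather than as the exact file used for this walkthrough.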
Deploying the Project
To deploy the project, first make sure that the Heroku Toolbelt is installed on your machine. Then, you will use it to add a new remote to your current project's Git repo:
$ heroku login
$ heroku git:remote -a your-app-name
After that, commit and push your changes to Heroku to trigger the build process:
$ git add .
$ git commit -m "settings to run with scrapyd on heroku"
$ git push heroku master
Now, if you go to http://your-app-name.herokuapp.com, you should see the Scrapyd welcome page.
Running a Spider
In order to run a spider with Scrapyd, you have to make an API call:
$ curl http://your-app-name.herokuapp.com/schedule.json -d project=your-scrapy-project -d spider=somespider
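The same call can be made from a script. Here is a minimal Python sketch using only the standard library. The function names `build_schedule_request` and `schedule_spider` are our own, chosen for illustration, and the sketch assumes the Scrapyd instance is reachable without HTTP authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider):
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    # Scrapyd expects form-encoded fields, just like `curl -d` sends them.
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    url = base_url.rstrip("/") + "/schedule.json"
    return Request(url, data=data, method="POST")


def schedule_spider(base_url, project, spider):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(base_url, project, spider)
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example usage (placeholder names, performs a real HTTP call):
# reply = schedule_spider("http://your-app-name.herokuapp.com",
#                         "your-scrapy-project", "somespider")
# print(reply)
```

Scrapyd answers with a small JSON document that includes the ID of the newly scheduled job, which is the value you need later if you want to cancel it.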
Checking the Running Jobs
Scrapyd provides a very simple web UI where you can see things like Jobs, Items and Logs. The UI is more of a report than a control panel, because you can't control your spiders or configurations using it. All you can do is see what's happening by manually refreshing its pages.
The Scrapyd API is very limited. It doesn't even allow you to retrieve the scraped items. Perhaps the most useful endpoints are schedule.json and cancel.json, because you'll use them to control your spiders' execution.
You can view the downloaded items by clicking on the job's "Items" label.
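To give an idea of what controlling jobs through the API looks like, here is a small Python sketch for cancel.json, plus listjobs.json, another Scrapyd endpoint that reports pending, running and finished jobs. The helper names are hypothetical, `abc123` stands in for a real job ID, and the sketch assumes no HTTP authentication is required.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_cancel_request(base_url, project, job_id):
    """Build a POST request asking Scrapyd to cancel one job."""
    data = urlencode({"project": project, "job": job_id}).encode("ascii")
    return Request(base_url.rstrip("/") + "/cancel.json", data=data, method="POST")


def build_listjobs_request(base_url, project):
    """Build a GET request listing the jobs of one project."""
    query = urlencode({"project": project})
    return Request(base_url.rstrip("/") + "/listjobs.json?" + query, method="GET")


def send(request):
    """Send a prepared request and decode the JSON reply."""
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example usage (placeholder names, performs real HTTP calls):
# base = "http://your-app-name.herokuapp.com"
# print(send(build_listjobs_request(base, "your-scrapy-project")))
# print(send(build_cancel_request(base, "your-scrapy-project", "abc123")))
```

Polling listjobs.json from a script is one way to avoid refreshing the web UI by hand.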
Scrapy Cloud vs Heroku
Here is a side-by-side comparison of the features:
Scrapy Cloud: API call, UI, command line tool or client library
Heroku: Scrapyd HTTP API call