Scrapy Cloud spiders#

A Scrapy Cloud spider is a Scrapy spider that belongs to a Scrapy project deployed to a Scrapy Cloud project. You can start jobs to execute a spider's code.

Our web scraping tutorial covers creating, deploying, and running spiders. For more information, see the Scrapy documentation.

It is also possible to create spiders without code.

Spider templates and virtual spiders#

Scrapy Cloud supports defining spider templates, which you can use from the Scrapy Cloud UI to create virtual spiders that run the code of the corresponding spider template with predefined parameters.

Tip

Zyte’s AI-powered spiders are good examples of spider templates that you can customize to create new templates.

Spider templates#

To create a spider template:

  1. Add scrapy-spider-metadata as a dependency to your Scrapy Cloud project.

  2. On the spiders that you wish to use as templates, define a metadata dictionary that includes a title and description of your choice and sets template to True:

    from scrapy import Spider
    
    class MySpider(Spider):
        ...
        metadata = {
            "title": "My Template",
            "description": "Description of my template.",
            "template": True,
        }
    

When you redeploy your code, you can start creating virtual spiders from your spider templates.

Note

Spider templates are also regular spiders, and can be executed directly as well.

Spider parameters#

The point of spider templates is to let you create virtual spiders from them, each of which works differently based on predefined parameters.

To expose parameters to the Scrapy Cloud UI so that they can be defined when creating a virtual spider, add a parameter specification to your template spiders using scrapy-spider-metadata:

from pydantic import BaseModel
from scrapy import Spider
from scrapy_spider_metadata import Args

class MyParams(BaseModel):
    foo: str

class MySpider(Args[MyParams], Spider):
    ...

The Scrapy Cloud UI supports the following parameter types:

  • bool

  • int, float (with gt, lt, ge, and le numeric constraint support)

    When defining a parameter that specifies a maximum number of requests, set {"widget": "request-limit"} in Field.json_schema_extra to get special handling in the Scrapy Cloud UI:

    from pydantic import BaseModel, Field
    
    class MyParams(BaseModel):
        max_requests: int = Field(
            json_schema_extra={
                "widget": "request-limit",
            },
        )
    
  • str (with string constraint support)

    Define placeholder in Field.json_schema_extra to set a placeholder value to show in the Scrapy Cloud UI for the parameter:

    from pydantic import BaseModel, Field
    
    class MyParams(BaseModel):
        url: str = Field(
            json_schema_extra={
                "placeholder": "https://books.toscrape.com",
            },
        )
    
  • str + Enum

    Define enumMeta in Field.json_schema_extra to give your enumeration choices an optional title and description:

    from enum import Enum
    
    from pydantic import BaseModel, Field
    
    class Foo(str, Enum):
        bar = "bar"
        baz = "baz"
    
    class MyParams(BaseModel):
        foo: Foo = Field(
            json_schema_extra={
                "enumMeta": {
                    Foo.bar: {
                        "title": "Bar",
                        "description": "Bar description.",
                    },
                    Foo.baz: {
                        "title": "Baz",
                        "description": "Baz description.",
                    },
                },
            },
        )
    

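As a sketch of how the numeric and string constraints above come together, the parameter model below combines an int field with ge/le bounds and a str field with a length constraint; the field names and values are illustrative:

```python
from pydantic import BaseModel, Field, ValidationError


class MyParams(BaseModel):
    # int parameter with numeric constraints: 1 <= max_pages <= 100.
    max_pages: int = Field(default=10, ge=1, le=100)
    # str parameter with a string constraint: at most 50 characters.
    category: str = Field(default="", max_length=50)


# Valid parameters pass validation.
params = MyParams(max_pages=5, category="books")

# Out-of-range values are rejected by pydantic before the spider runs.
try:
    MyParams(max_pages=0)
except ValidationError:
    print("max_pages=0 rejected")
```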
Virtual spiders#

To create a virtual spider from a spider template, go to your Scrapy Cloud project page and, on the left-hand sidebar, under Spiders, select Create spider.

On the Create Spider page, you can select a template, define the parameters of your new virtual spider, and save your spider.

You can then use your virtual spider from Scrapy Cloud as if it were a regular spider.

Virtual spiders exist only in Scrapy Cloud, not in your code. However, changes to the code of their spider template will affect them.