Migrating from Automatic Extraction to Zyte API#
Learn how to migrate from Automatic Extraction to Zyte API, which also provides automatic extraction.
Key differences#
The following table summarizes the feature differences between both products:
Feature | Automatic Extraction | Zyte API
---|---|---
Type support | Product, product list, article, article list, comment, forum post, review, real estate, vehicle, job posting | Product, product list, product navigation, article, article list, article navigation, job posting
Data schemas | Automatic Extraction schemas | Zyte API schemas (see Schema changes below)
Extraction from raw HTML | No | Yes
Crawling | UI only | Yes
Browser HTML | Premium only | Yes
Screenshots | No | Yes
Actions | No | Yes
Network capture | No | Yes
JavaScript toggle | No | Yes
Geolocation | No | Yes
Cookies | No | Yes
Sessions | No | Yes
Response headers | No | Yes
Browserless input | No | Yes
Batch queries | Yes | No
Custom input | Yes (customHtml) | No
Pricing | | More granular and flexible. For example, getting both automatic extraction and browser HTML on the same request no longer requires an Enterprise account.
Updating your subscription#
Zyte API requires a separate subscription; follow the getting started guide to get one.
Also remember to cancel your existing Automatic Extraction subscription once you have completed your migration to Zyte API automatic extraction.
Updating requests#
If you are using an HTTP client, check examples of making Zyte API calls from different languages, and update your API requests as follows:
- Update your endpoint from `https://autoextract.scrapinghub.com/v1/extract` to `https://api.zyte.com/v1/extract`.
- Update your API key to your Zyte API key.
- Update your request body from an array of queries (`[{"…": "…"}]`) to a single query object (`{"…": "…"}`). Zyte API does not support query batching; if you were sending multiple queries per request, you must split them into separate requests with one query each.
- Replace `"pageType": "TYPE"` with `"TYPE": true`. For example, replace `"pageType": "product"` with `"product": true`.
- Replace `meta` with `echoData`, which can be any JSON structure, not only a string. See Metadata, and the request body sketch below.
- Replace `fullHtml` with `browserHtml`. See Browser HTML.
- Remove `articleBodyRaw`; Zyte API can only return `articleBodyHtml`.
- Remove `customHtml`; Zyte API does not support providing a custom HTML document as input.
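For instance, a request body that used meta and fullHtml might map as follows. This is a minimal sketch; the URL and the metadata values are placeholders, not part of the migration guide:
Automatic Extraction:
{"url": "https://example.com/product/1", "pageType": "product", "meta": "my-request-id", "fullHtml": true}
Zyte API:
{"url": "https://example.com/product/1", "product": true, "echoData": {"requestId": "my-request-id"}, "browserHtml": true}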
Example (curl)
Automatic Extraction:
curl \
--user YOUR_API_KEY: \
--header 'Content-Type: application/json' \
--data '[{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
--compressed \
https://autoextract.scrapinghub.com/v1/extract
Zyte API:
curl \
--user YOUR_API_KEY: \
--header 'Content-Type: application/json' \
--data '{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}' \
--compressed \
https://api.zyte.com/v1/extract
To replace the command-line interface of zyte-autoextract:
Install python-zyte-api:
pip install zyte-api
- Replace the `python -m autoextract` command with `zyte-api`.
- Update your API key to your Zyte API key. If you were setting your API key with the `ZYTE_AUTOEXTRACT_KEY` environment variable, use `ZYTE_API_KEY` instead now.
- If you were using a list of URLs as input, switch to a JSON Lines file as input (see the conversion sketch after the example below).
  Tip
  You do not need to pass `--intype jl` on the command line; `zyte-api` automatically detects your input format.
- Instead of passing `--page-type TYPE` on the command line, use `"TYPE": true` in each query of your input JSON Lines file. For example, replace `--page-type product` on the command line with `"product": true` on every query.
- If you were using a JSON Lines file as input:
  - Replace `"pageType": "TYPE"` with `"TYPE": true`. For example, replace `"pageType": "product"` with `"product": true`.
  - Replace `meta` with `echoData`, which can be any JSON structure, not only a string. See Metadata.
  - Replace `fullHtml` with `browserHtml`. See Browser HTML.
  - Remove `articleBodyRaw`; Zyte API can only return `articleBodyHtml`.
  - Remove `customHtml`; Zyte API does not support providing a custom HTML document as input.
- If you are using `--api-endpoint ENDPOINT`, find out what your Zyte API endpoint is and use `--api-url ENDPOINT` instead, or remove the command-line parameter altogether to use the default endpoint.
- Remove `--batch-size NUMBER`; Zyte API does not support query batching.
- Remove `--max-query-error-retries`. `zyte-api` performs some retries automatically, but it does not allow customizing its retry policy from the command line, other than disabling error retries altogether with `--dont-retry-errors`.
- Remove `--disable-cert-validation`. If you get SSL errors, install our CA certificate.
Example
Automatic Extraction:
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}
python -m autoextract --intype jl input.jsonl
Zyte API:
{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "product": true}
zyte-api input.jsonl
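If your old input was a plain list of URLs (one per line), a small script along these lines could generate the JSON Lines input that zyte-api expects. This is a sketch; the file names and the product target type are assumptions:
import json

# Convert a plain URL list (urls.txt, one URL per line) into a JSON Lines
# file (input.jsonl) with one Zyte API query per line.
with open("urls.txt") as urls_file, open("input.jsonl", "w") as queries_file:
    for line in urls_file:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        queries_file.write(json.dumps({"url": url, "product": True}) + "\n")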
To replace the Python asyncio interface of zyte-autoextract:
Install python-zyte-api:
pip install zyte-api
- In your import statements, change `autoextract` to `zyte_api`.
- Instead of calling the `request_raw` and `request_parallel_as_completed` functions, create an instance of `AsyncClient` and call its same-name methods.
- Update your query to be a single `dict`, instead of a list of `dict` or `Request` objects. Zyte API does not support query batching; if you were sending multiple queries per request, you must split them into separate requests with one query each.
- Update your query `dict` or `Request` object to be a `dict` with the following field changes:
  - Remove `pageType`, use its previous value as a field name instead, and set it to `True`. For example, replace `Request(pageType="product")` with `{"product": True}`.
  - Replace `meta` with `echoData`, which can be any JSON structure, not only a string. See Metadata.
  - Remove `articleBodyRaw`; Zyte API can only return `articleBodyHtml`.
  - Replace `fullHtml` with `browserHtml`. See Browser HTML.
  - Remove `extra`. `extra.customHtml` has no replacement, as Zyte API does not support providing a custom HTML document as input.
- Pass the `api_key` parameter to `AsyncClient`, with your Zyte API key as value (see the client configuration sketch after the example below). If you were setting your API key with the `ZYTE_AUTOEXTRACT_KEY` environment variable, use `ZYTE_API_KEY` instead now.
- If you are using the `endpoint` parameter, find out what your Zyte API URL and endpoint are, and use `api_url` in `AsyncClient` (default: `https://api.zyte.com/v1/`) and `endpoint` in client methods (default: `extract`) instead, or omit the parameters altogether to use their default values.
- If you are creating an aiohttp session with `create_session`, drop the `disable_cert_validation` parameter. If you get SSL errors, install our CA certificate.
- Remove the `agg_stats` parameter, or pass it to `AsyncClient` instead.
- Remove the `max_query_error_retries` parameter. To customize the retry policy, use the `retrying` parameter of `AsyncClient` instead.
- If you are using `request_raw`:
  - Remove the `handle_retries` parameter. To customize the retry policy, use the `retrying` parameter of `AsyncClient` instead.
  - Remove the `headers` parameter; python-zyte-api does not support customizing the HTTP headers sent to Zyte API.
    Note
    Not to be confused with Zyte API parameters to set request headers: customHttpRequestHeaders (HTTP) and requestHeaders (browser).
  - Pass the `retrying` parameter to `AsyncClient` instead.
- If you are using `request_parallel_as_completed`:
  - Pass the `n_conn` parameter to `AsyncClient` instead.
  - Remove the `batch_size` parameter; Zyte API does not support query batching.
Example
Automatic Extraction:
import asyncio
from autoextract.aio.client import request_raw
async def main():
api_response = await request_raw(
[
{
"url": (
"https://books.toscrape.com/catalogue"
"/a-light-in-the-attic_1000/index.html"
),
"pageType": "product",
},
],
)
print(api_response)
asyncio.run(main())
Zyte API:
import asyncio
from zyte_api.aio.client import AsyncClient
async def main():
client = AsyncClient()
api_response = await client.request_raw(
{
"url": (
"https://books.toscrape.com/catalogue"
"/a-light-in-the-attic_1000/index.html"
),
"product": True,
},
)
print(api_response)
asyncio.run(main())
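If you were tuning concurrency or retries through request_parallel_as_completed or request_raw, those knobs now live on the AsyncClient constructor. A minimal configuration sketch, with placeholder values:
from zyte_api.aio.client import AsyncClient

# api_key, api_url and n_conn are AsyncClient constructor parameters; the
# values below are placeholders. A custom retry policy, if you need one, is
# passed through the "retrying" parameter instead of max_query_error_retries
# or handle_retries.
client = AsyncClient(
    api_key="YOUR_API_KEY",
    api_url="https://api.zyte.com/v1/",
    n_conn=15,
)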
To replace the scrapy-autoextract middleware:
Install and configure scrapy-zyte-api following this page of the web scraping tutorial, including your Zyte API key.
- Instead of setting a page type with the `AUTOEXTRACT_PAGE_TYPE` setting, the `page_type` spider attribute, or the `autoextract.pageType` request metadata key, set `"zyte_api_automap": {"TYPE": true}` on the request metadata, where `TYPE` is the target type, e.g. `product`. For example, replace `Request(meta={"autoextract": {"pageType": "product"}})` with `Request(meta={"zyte_api_automap": {"product": True}})`.
- If you are using the `AUTOEXTRACT_URL` setting, find out what your Zyte API endpoint is and use `ZYTE_API_URL` instead (see the settings sketch after this list), or let the default endpoint be used.
- scrapy-zyte-api does not provide a counterpart to the `AUTOEXTRACT_SLOT_POLICY` setting; a per-domain policy is always used. Moreover, Zyte API and non-Zyte-API requests are always treated as targeting different domains.
- If you are using the `autoextract.extra` request metadata key, map its values to values in the `zyte_api_automap` request metadata key as follows:
  - Replace `meta` with `echoData`, which can be any JSON structure, not only a string. See Metadata.
  - Replace `fullHtml` with `browserHtml`. See Browser HTML.
  - Remove `articleBodyRaw`; Zyte API can only return `articleBodyHtml`.
  - Remove `customHtml`; Zyte API does not support providing a custom HTML document as input.
- Remove the `autoextract.headers` parameter; scrapy-zyte-api does not support customizing the HTTP headers sent to Zyte API.
  Note
  Not to be confused with Zyte API parameters to set request headers: customHttpRequestHeaders (HTTP) and requestHeaders (browser).
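For example, a minimal settings.py fragment could look like the following. The key value is a placeholder, and ZYTE_API_URL is only needed if you were pointing AUTOEXTRACT_URL at a non-default endpoint:
# settings.py
ZYTE_API_KEY = "YOUR_API_KEY"
# Optional: only set this if you need a non-default endpoint; omit it to use
# the default Zyte API endpoint.
ZYTE_API_URL = "https://api.zyte.com/v1/"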
Example
Automatic Extraction:
from scrapy import Request, Spider
class BooksToScrapeComSpider(Spider):
name = "books_toscrape_com"
def start_requests(self):
yield Request(
(
"https://books.toscrape.com/catalogue"
"/a-light-in-the-attic_1000/index.html"
),
meta={
"autoextract": {
"enabled": True,
"pageType": "product",
},
},
)
def parse(self, response):
print(response.meta["autoextract"])
Zyte API:
from scrapy import Request, Spider
class BooksToScrapeComSpider(Spider):
name = "books_toscrape_com"
def start_requests(self):
yield Request(
(
"https://books.toscrape.com/catalogue"
"/a-light-in-the-attic_1000/index.html"
),
meta={
"zyte_api_automap": {
"product": True,
},
},
)
def parse(self, response):
print(response.raw_api_response)
To replace the scrapy-autoextract page object providers:
Install and configure scrapy-zyte-api following this page of the web scraping tutorial, including your Zyte API key.
Upgrade your versions of web-poet and scrapy-poet; chances are you are using old versions that still work with scrapy-autoextract.
- Remove `scrapy_autoextract.AutoExtractProvider` from your `SCRAPY_POET_PROVIDERS` setting.
- Replace `autoextract_poet.pages.AutoExtract<type>Page` with `zyte_common_items.<type>`, e.g. `autoextract_poet.pages.AutoExtractProductPage` with `zyte_common_items.Product`.
  Note
  `zyte_common_items.Product` is not a page object class but an item, i.e. the result of calling `to_item()` on a page object.
- Replace `autoextract_poet.pages.AutoExtractWebPage` with `web_poet.AnyResponse`, which wraps `web_poet.HttpResponse` or `web_poet.BrowserResponse`. Which input is actually used depends on your custom page object dependencies, if any, or your extraction source, if defined (see Dependency annotations in scrapy-poet integration).
- If you are using the `AUTOEXTRACT_URL` setting, find out what your Zyte API endpoint is and use `ZYTE_API_URL` instead, or let the default endpoint be used.
- scrapy-zyte-api does not provide a counterpart to the `AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES` setting; see Retries to achieve something similar.
- scrapy-zyte-api does not provide a counterpart to the `AUTOEXTRACT_CONCURRENT_REQUESTS_PER_DOMAIN` setting; use the `CONCURRENT_REQUESTS_PER_DOMAIN` setting instead.
- scrapy-zyte-api does not provide a counterpart to the `AUTOEXTRACT_CACHE_FILENAME` and `AUTOEXTRACT_CACHE_GZIP` settings.
Example
Automatic Extraction:
from autoextract_poet.pages import AutoExtractProductPage
from scrapy import Spider
from scrapy_poet import DummyResponse
class BooksToScrapeComSpider(Spider):
name = "books_toscrape_com"
start_urls = [
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
]
def parse(self, response: DummyResponse, product_page: AutoExtractProductPage):
print(product_page.to_item())
Zyte API:
from scrapy import Spider
from scrapy_poet import DummyResponse
from zyte_common_items import Product
class BooksToScrapeComSpider(Spider):
name = "books_toscrape_com"
start_urls = [
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
]
def parse(self, response: DummyResponse, product: Product):
print(product)
Updating response expectations#
If you are using an HTTP client, update your API response expectations as follows:
- You get a JSON object (`{"…": "…"}`), not an array (`[{"…": "…"}]`).
- You also get a key matching your page type, e.g. `product`, but its content follows a different schema in Zyte API.
- There are no `query`, `webPage`, or `algorithmVersion` keys in the response. You can use metadata to replace `query`.
- A `url` key exists, but it is not the request URL that you get in `query.userQuery.url`; it is the response URL, which could be different from the request URL, e.g. due to redirections.
- Error response handling is similar, and rate limiting is more generous. See Zyte API error handling.
If you are using the command-line interface of zyte-autoextract, update your API response expectations as follows:
- You also get a key matching your page type, e.g. `product`, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.
- There are no `query`, `webPage`, or `algorithmVersion` keys in the response. You can use metadata to replace `query`.
- A `url` key exists, but it is not the request URL that you get in `query.userQuery.url`; it is the response URL, which could be different from the request URL, e.g. due to redirections.
- Rate limiting is more generous. See Zyte API error handling.
If you are using the Python asyncio interface of zyte-autoextract, update your API response expectations as follows:
- You get a `dict` (`{"…": "…"}`), not a list of `dict` (`[{"…": "…"}]`).
- You also get a key matching your page type, e.g. `product`, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.
- There are no `query`, `webPage`, or `algorithmVersion` keys in the response. You can use metadata to replace `query`.
- A `url` key exists, but it is not the request URL that you get in `query.userQuery.url`; it is the response URL, which could be different from the request URL, e.g. due to redirections.
- Rate limiting is more generous. See Zyte API error handling.
If you are using the scrapy-autoextract middleware, update your API response expectations as follows:
- You also get a key matching your page type, e.g. `product`, but its content follows a different schema in Zyte API. See the entry of your page type in the response section of the Zyte API reference documentation for schema details.
- There are no `original_url` or `timing` meta keys in the response. You can read the original URL from `response.request.url`. There is no built-in alternative for the timing data; if you need it, you have to implement it on your own, for example with a custom Scrapy downloader middleware (see the sketch after this list).
- Rate limiting is more generous. See Zyte API error handling.
- scrapy-zyte-api is smarter about retries, at the cost of handling retries off Scrapy. See Retries.
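As an illustration of the last point about timing, a downloader middleware along these lines could record rough per-request timing. The class name, the meta keys, and the priority below are assumptions, not something scrapy-zyte-api provides:
import time


class RequestTimingMiddleware:
    """Store the approximate download time of each request in
    request.meta["download_time"], loosely replacing the old "timing"
    meta key of scrapy-autoextract."""

    def process_request(self, request, spider):
        request.meta["_download_start"] = time.monotonic()
        return None  # let the request continue through the middleware chain

    def process_response(self, request, response, spider):
        start = request.meta.pop("_download_start", None)
        if start is not None:
            request.meta["download_time"] = time.monotonic() - start
        return response

Enable it through the DOWNLOADER_MIDDLEWARES setting, e.g. {"myproject.middlewares.RequestTimingMiddleware": 543}, adjusting the module path and priority to your project.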
If you are using scrapy-autoextract page object providers, update your API response expectations as follows:
- `zyte_common_items.Product` is not a page object but an item, i.e. the result of calling `to_item()` on a page object. Its API is also slightly different from that of `autoextract_poet.items.Product`, which is what `autoextract_poet.AutoExtractProductPage.to_item()` returns.
- Rate limiting is more generous. See Zyte API error handling.
Example
Automatic Extraction:
[
{
"query": {
"id": "1686644367537-712b96d0aa96c12a",
"domain": "toscrape.com",
"userAgent": "curl/8.1.0",
"userQuery": {
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"pageType": "product"
}
},
"webPage": {
"inLanguages": [
{
"code": "en"
}
]
},
"product": {
"name": "A Light in the Attic",
"description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
"mainImage": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg",
"images": [
"https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
],
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"additionalProperty": [
{
"name": "upc",
"value": "a897fe39b1053632"
},
{
"name": "product type",
"value": "Books"
},
{
"name": "price (excl. tax)",
"value": "£51.77"
},
{
"name": "price (incl. tax)",
"value": "£51.77"
},
{
"name": "tax",
"value": "£0.00"
},
{
"name": "availability",
"value": "In stock (22 available)"
},
{
"name": "number of reviews",
"value": "0"
}
],
"offers": [
{
"price": "51.77",
"currency": "£",
"availability": "InStock"
}
],
"sku": "1000",
"breadcrumbs": [
{
"name": "Home",
"link": "https://books.toscrape.com/index.html"
},
{
"name": "Books",
"link": "https://books.toscrape.com/catalogue/category/books_1/index.html"
},
{
"name": "Poetry",
"link": "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
},
{
"name": "A Light in the Attic"
}
],
"probability": 0.9982717,
"aggregateRating": {
"reviewCount": 0
},
"descriptionHtml": "<article>\n\n<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>\n\n</article>",
"color": "Books"
},
"algorithmVersion": "21.12.7"
}
]
Zyte API:
{
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"statusCode": 200,
"product": {
"name": "A Light in the Attic",
"price": "51.77",
"currency": "GBP",
"currencyRaw": "£",
"availability": "InStock",
"sku": "a897fe39b1053632",
"brand": {
"name": "Books to Scrape"
},
"breadcrumbs": [
{
"name": "Home",
"url": "https://books.toscrape.com/index.html"
},
{
"name": "Books",
"url": "https://books.toscrape.com/catalogue/category/books_1/index.html"
},
{
"name": "Poetry",
"url": "https://books.toscrape.com/catalogue/category/books/poetry_23/index.html"
},
{
"name": "A Light in the Attic"
}
],
"mainImage": {
"url": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
},
"images": [
{
"url": "https://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg"
}
],
"description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
"descriptionHtml": "<article>\n\n<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>\n\n</article>",
"aggregateRating": {
"reviewCount": 0
},
"additionalProperties": [
{
"name": "upc",
"value": "a897fe39b1053632"
},
{
"name": "product type",
"value": "Books"
},
{
"name": "price (excl. tax)",
"value": "£51.77"
},
{
"name": "price (incl. tax)",
"value": "£51.77"
},
{
"name": "tax",
"value": "£0.00"
},
{
"name": "availability",
"value": "In stock (22 available)"
},
{
"name": "number of reviews",
"value": "0"
}
],
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"metadata": {
"probability": 0.9947898387908936,
"dateDownloaded": "2023-06-13T08:19:46Z"
}
}
}
Schema changes#
Data schemas in Zyte API are different from those used in Automatic Extraction.
Zyte API data schemas are based on Zyte Data schemas, implementing a subset of their fields. For a detailed reference of the Zyte API data schemas, see the response section of the corresponding data type in the Zyte API reference.
The following sections describe how the schema of each data type has changed.
Product
- New fields: currency, features, metadata.dateDownloaded.
- Fields price, regularPrice, and availability, previously nested under `offers`, have been unnested. `currency` has been unnested and renamed to currencyRaw. `offers` has been removed.

  {
    "price": "9999.99",
    "regularPrice": "11999.99",
    "currency": "USD",
    "currencyRaw": "$",
    "availability": "InStock"
  }

- brand has become an object, with the brand name on the nested name field instead.

  {
    "brand": {
      "name": "Ka-pow"
    }
  }

- In breadcrumbs, the `link` nested field is now called url instead.

  {
    "breadcrumbs": [
      {
        "url": "http://example.com/level1",
        "name": "Level 1"
      },
      {
        "url": "http://example.com/level1/level2",
        "name": "Level 2"
      }
    ]
  }

- mainImage and the images list items are no longer strings, but objects with a url field instead.

  {
    "mainImage": {
      "url": "https://img.example.com/products/22.jpeg"
    },
    "images": [
      {
        "url": "https://img.example.com/products/22.jpeg"
      }
    ]
  }

- `additionalProperty` is now additionalProperties.
- probability is now nested under the new metadata field.

  {
    "metadata": {
      "probability": 0.9999
    }
  }

- `hasVariants` is now variants, and its nested items are also affected by all the root schema changes listed above.
Product list
- `paginationNext` has been moved to the new productNavigation data type as nextPage, with its nested `text` field renamed to name. `paginationPrevious` has been removed.
- New fields: products[].currency, metadata, categoryName.
- For items in products:
  - Fields price and regularPrice, previously nested under `offers`, have been unnested. `currency` has been unnested and renamed to currencyRaw. `offers[].availability` and `offers` have been removed.

    {
      "price": "9999.99",
      "regularPrice": "11999.99",
      "currency": "USD",
      "currencyRaw": "$"
    }

  - mainImage is no longer a string, but an object with a url field instead.

    {
      "mainImage": {
        "url": "https://img.example.com/products/22.jpeg"
      }
    }

  - probability is now nested under the new metadata field.

    {
      "metadata": {
        "probability": 0.9999
      }
    }

  - The following fields have been removed: `sku`, `brand`, `images`, `description`, `descriptionHtml`, `aggregateRating`.
- In breadcrumbs, the `link` nested field is now called url instead.

  {
    "breadcrumbs": [
      {
        "url": "http://example.com/level1",
        "name": "Level 1"
      },
      {
        "url": "http://example.com/level1/level2",
        "name": "Level 2"
      }
    ]
  }
Article
- New fields: metadata.dateDownloaded.
- `author` and `authorsList` have been replaced by authors, a list of objects with name and nameRaw fields. Specifically, `authors.name` replaced `authorsList` and `authors.nameRaw` replaced `author`.

  {
    "authors": [
      {
        "name": "Alice",
        "nameRaw": "Alice and Bob"
      },
      {
        "name": "Bob",
        "nameRaw": "Alice and Bob"
      }
    ]
  }

- In breadcrumbs, the `link` nested field is now called url instead.

  {
    "breadcrumbs": [
      {
        "url": "http://example.com/level1",
        "name": "Level 1"
      },
      {
        "url": "http://example.com/level1/level2",
        "name": "Level 2"
      }
    ]
  }

- mainImage and the images list items are no longer strings, but objects with a url field instead.

  {
    "mainImage": {
      "url": "https://img.example.com/products/22.jpeg"
    },
    "images": [
      {
        "url": "https://img.example.com/products/22.jpeg"
      }
    ]
  }

- `audioUrls` and `videoUrls` have been replaced by audios and videos respectively, which are arrays of objects with url fields, rather than arrays of strings.

  {
    "audios": [
      {
        "url": "https://audio.example.com/products/22.mp3"
      }
    ],
    "videos": [
      {
        "url": "https://video.example.com/products/22.mp4"
      }
    ]
  }

- probability is now nested under the new metadata field.

  {
    "metadata": {
      "probability": 0.9999
    }
  }

- The `articleBodyRaw` field has been removed.
Article list
- `paginationNext` has been moved to the new articleNavigation data type as nextPage, with its nested `text` field renamed to name. `paginationPrevious` has been removed.
- New fields: metadata.
- For items in articles:
  - `author` and `authorsList` have been replaced by authors, a list of objects with name and nameRaw fields. Specifically, `authors.name` replaced `authorsList` and `authors.nameRaw` replaced `author`.

    {
      "authors": [
        {
          "name": "Alice",
          "nameRaw": "Alice and Bob"
        },
        {
          "name": "Bob",
          "nameRaw": "Alice and Bob"
        }
      ]
    }

  - mainImage and the images list items are no longer strings, but objects with a url field instead.

    {
      "mainImage": {
        "url": "https://img.example.com/products/22.jpeg"
      },
      "images": [
        {
          "url": "https://img.example.com/products/22.jpeg"
        }
      ]
    }

  - probability is now nested under the new metadata field.

    {
      "metadata": {
        "probability": 0.9999
      }
    }
Job posting
- New fields: baseSalary.currency, datePublishedRaw, metadata.dateDownloaded.
- `title` is now jobTitle.
- `datePosted` is now datePublished.
- `hiringOrganization.raw` is now hiringOrganization.name.
- Under baseSalary: `value` is now valueMax, and it is a number string instead of a float number; `currency` is now currencyRaw.

  {
    "baseSalary": {
      "raw": "$53,251 a year",
      "valueMax": "53251.00",
      "currency": "USD",
      "currencyRaw": "$"
    }
  }

- probability is now nested under the new metadata field.

  {
    "metadata": {
      "probability": 0.9999
    }
  }
Keeping the old schema#
Migrating to the new schema is recommended, to enjoy richer, better-typed data.
However, if you use Python, you can speed up your initial migration by automatically downgrading new items to the old schema, so that you can postpone updating your code and processes that still rely on the old schema.
First install or upgrade zyte-common-items:
pip install --upgrade zyte-common-items
Then use it as follows:
If you have migrated from the Python asyncio interface of zyte-autoextract to that of python-zyte-api, you can convert your extracted data with `zyte_common_items.ae.downgrade`.
For example, for a product:
from zyte_common_items import Product, ae
...
zyte_api_product = Product.from_dict(response["product"])
ae_product = ae.downgrade(zyte_api_product)
The resulting object is compatible with itemadapter, e.g. you can turn it into a dictionary as follows:
from itemadapter import ItemAdapter
ae_product_dict = ItemAdapter(ae_product).asdict()
If you have migrated from the scrapy-autoextract middleware to scrapy-zyte-api (without scrapy-poet), you have 2 options to convert your extracted data.
If you keep the extracted data unchanged, e.g. your callback looks something like this:
from scrapy import Spider
...
class MySpider(Spider):
...
def parse(self, response):
yield response.raw_api_response["product"]
You can change your callback to something like this:
from scrapy import Spider
from zyte_common_items import Product
...
class MySpider(Spider):
...
def parse(self, response):
yield Product.from_dict(response.raw_api_response["product"])
And enable the AEPipeline
item pipeline to convert your data:
ITEM_PIPELINES = {
"zyte_common_items.pipelines.AEPipeline": 500,
}
Tip
If you have item pipelines that rely on the old schema, you
might need to use a value lower than theirs, instead of 500, to run
AEPipeline
before them.
Also, mind that AEPipeline returns an attrs object instead of a `dict`. Use itemadapter to interact with items.
However, if you do make changes to the extracted data, check if you still need those changes. If you do, you have 2 options:
Update your custom extraction code to use the new schema, and let the item pipeline change the item schema later. For example:
from scrapy import Spider
from zyte_common_items import Product

...

class MySpider(Spider):
    ...

    def parse(self, response):
        product = Product.from_dict(response.raw_api_response["product"])
        product.price = response.css(".hidden-price").get()
        yield product
Change the item schema at the beginning of your callback, before your custom code. For example:
from scrapy import Spider
from zyte_common_items import Product, ae

...

class MySpider(Spider):
    ...

    def parse(self, response):
        zyte_api_product = Product.from_dict(response.raw_api_response["product"])
        ae_product = ae.downgrade(zyte_api_product)
        ae_product.offers = [
            ae.AEOffer(
                price=response.css(".hidden-price").get(),
            )
        ]
        yield ae_product
The object that `ae.downgrade` returns is compatible with itemadapter, e.g. you can turn it into a dictionary as follows:

from itemadapter import ItemAdapter

ae_product_dict = ItemAdapter(ae_product).asdict()
If you have migrated from scrapy-autoextract page object providers to scrapy-zyte-api page object providers, you have 2 options to convert your extracted data.
If you keep the extracted data unchanged, e.g. you have no custom page object class and your callback looks something like this:
from scrapy import Spider
from scrapy_poet import DummyResponse
from zyte_common_items import Product
...
class MySpider(Spider):
...
def parse(self, response: DummyResponse, product: Product):
yield product
Enable the AEPipeline
item pipeline to convert your data:
ITEM_PIPELINES = {
"zyte_common_items.pipelines.AEPipeline": 500,
}
Tip
If you have item pipelines that rely on the old schema, you
might need to use a value lower than theirs, instead of 500, to run
AEPipeline
before them.
Also, mind that AEPipeline returns an attrs object instead of a `dict`. Use itemadapter to interact with items.
However, if you do make changes to the extracted data, check if you still need those changes. If you do, you have 2 options:
Upgrade your custom extraction code, and let the item pipeline change the item schema later.
For every website or set of websites that require custom extraction code, write a page object class with a proper rule that subclasses an `Auto`-prefixed page object class (e.g. `zyte_common_items.AutoProductPage`), and use fields to implement your custom extraction code. For example:

import attrs
from web_poet import AnyResponse, field, handle_urls
from zyte_common_items import AutoProductPage


@handle_urls("books.toscrape.com")
@attrs.define
class BooksToScrapeComProductPage(AutoProductPage):
    response: AnyResponse

    @field
    def price(self):
        return self.response.css(".hidden-price").get()
Tip
The order of decorators matters. Also, see Overriding parsing for another example.
Make sure you migrate into those page object classes all your custom extraction code, both from custom old-style page object classes where extraction logic used to be in their `to_item` method instead of fields, and from your callback, where there should be no extraction logic anymore.
Also make sure you point the SCRAPY_POET_DISCOVER setting to a module containing your new page object classes, directly or indirectly. We recommend having a `pages` module under your project, e.g. `myproject.pages`, and keeping all page objects there, so that you can use:

SCRAPY_POET_DISCOVER = ["myproject.pages"]
Move all your custom extraction code to your callback, and change the item schema at the beginning of your callback, before your custom code. For example:
from scrapy import Spider
from scrapy_poet import DummyResponse
from web_poet import AnyResponse
from zyte_common_items import Product, ae

...

class MySpider(Spider):
    ...

    def parse(self, response: DummyResponse, zyte_api_product: Product, _response: AnyResponse):
        ae_product = ae.downgrade(zyte_api_product)
        ae_product.offers = [
            ae.AEOffer(
                price=_response.css(".hidden-price").get(),
            )
        ]
        yield ae_product
The object that `ae.downgrade` returns is compatible with itemadapter, e.g. you can turn it into a dictionary as follows:

from itemadapter import ItemAdapter

ae_product_dict = ItemAdapter(ae_product).asdict()
Handling custom extraction code#
On average, Zyte API automatic extraction performs better than Zyte Automatic Extraction. If your code includes custom extraction code to fix extraction issues from Zyte Automatic Extraction, see if you can remove some of that code now.
If the output of Zyte API is still imperfect, please open a support ticket indicating which field is not being properly extracted on which website.