Article Extraction
Article extraction supports pages that contain a single article,
such as a news article, blog post, or another kind of article.
Many fields are extracted, such as
headline, article body, author and publication date.
This supports use-cases such as news and media monitoring,
analytics, brand monitoring, mentions, sentiment analysis
and many others.
Related page types are
Article List Extraction, which supports pages with multiple articles, and
Comment Extraction, which extracts comments from single-article pages.
Request example
If you request an article extraction and the extraction succeeds,
the article field will be available in the query result:
from autoextract.sync import request_raw

query = [{
    'url': 'http://example.com/article?id=24',
    'pageType': 'article'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['article'])
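Because the query is a plain list of dictionaries, it can be assembled programmatically before it is passed to request_raw. The sketch below assumes a hypothetical helper, build_article_queries, which is not part of the autoextract client. It also sets the "articleBodyRaw": false query parameter described under Available fields, so that the raw HTML body is left out of the response.

```python
def build_article_queries(urls, include_raw_body=False):
    """Build one article-extraction query per URL.

    Hypothetical helper for illustration; not part of the autoextract client.
    """
    queries = []
    for url in urls:
        query = {'url': url, 'pageType': 'article'}
        if not include_raw_body:
            # Ask the API to omit the (often large) articleBodyRaw field.
            query['articleBodyRaw'] = False
        queries.append(query)
    return queries


queries = build_article_queries([
    'http://example.com/article?id=24',
    'http://example.com/article?id=25',
])
print(queries[0])
# The list can then be sent the same way as a hand-written query:
# results = request_raw(queries, api_key='[api key]')
```

The network call is left commented out so that the sketch runs without an API key.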
Available fields
The following fields are available for article:
headline: string
    Article headline or title.
datePublished: string
    Publication date. ISO-formatted with a ‘T’ separator; may contain a timezone.
    If the actual publication date is not found, the dateModified value is used.
datePublishedRaw: string
    Same date as datePublished, but before parsing/normalization, i.e. as
    it appears on the website.
dateModified: string
    The date when the article was most recently modified.
    ISO-formatted with a ‘T’ separator; may contain a timezone.
dateModifiedRaw: string
    Same date as dateModified, but before parsing/normalization, i.e. as
    it appears on the website.
author: string
    Author (or authors) of the article.
authorsList: list of strings
    All authors of the article, split into separate strings. For example,
    if author is "Alice and Bob", authorsList would be ["Alice", "Bob"];
    if author is "Alice Johnes" (a single author), authorsList
    would be ["Alice Johnes"].
inLanguage: string
    Language of the article, as an ISO 639-1 language code. Example: "en".
    Sometimes the article language is not the same as the overall language
    of the web page; to get the detected web page languages,
    see General Web Page Information.
breadcrumbs: list of dictionaries with optional name and link string fields
    A list of breadcrumbs (a specific navigation element) with optional
    name and URL. Example:
    [
        {"name": "Foo", "link": "http://example.com/foo"},
        {"name": "Bar", "link": "http://example.com/foo/bar"},
        {"name": "Baz"}
    ]
mainImage: string
    A URL or data URL value of the main image of the article.
    All URLs are absolute.
images: list of strings
    A list of URL or data URL values of all images of the article
    (may include the main image). All URLs are absolute.
description: string
    A short summary of the article. It can be either human-provided
    (if available) or auto-generated.
articleBody: string
    Text of the article, including sub-headings, with newline separators.
articleBodyHtml: string
    Simplified and standardized HTML of the article, including sub-headings,
    image captions and embedded content (videos, tweets, etc.).
    See the Format of articleBodyHtml field section for
    a detailed description.
articleBodyRaw: string
    HTML of the article body as seen in the source page.
    This field is sometimes large and is often not needed,
    as articleBodyHtml is preferable. The articleBodyRaw field can be
    turned off when making an API request: it will not be returned if you pass
    "articleBodyRaw": false as a query parameter
    (see Requests).
videoUrls: list of strings
    A list of URLs of all videos inside the article body.
audioUrls: list of strings
    A list of URLs of all audio files inside the article body.
probability: float
    Probability that this is a single article page.
    This number is close to 1.0 when the requested page looks like
    an individual news article page, blog post, etc. Otherwise it
    is low, closer to 0.0; for example, expect it to be low on pages with
    lists of news articles, on e-commerce pages, etc.
canonicalUrl: string
    Canonical URL of the article, if available.
url: string
    URL of the page from which this article was extracted.
All fields are optional, except for url and probability.
Fields without a valid value (null or empty array) are excluded from
the extraction results.
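Since only url and probability are guaranteed, code that consumes an article should read every other field defensively. The following is a minimal sketch under a few assumptions: summarize_article is a hypothetical helper, the 0.5 probability threshold is an arbitrary illustrative value, and datePublished is parsed with datetime.fromisoformat, which accepts some timezone notations (such as a trailing Z) only on newer Python versions.

```python
from datetime import datetime


def summarize_article(article, min_probability=0.5):
    """Return a small summary dict, or None for unlikely article pages.

    The 0.5 default threshold is an arbitrary value chosen for illustration.
    """
    # url and probability are the only fields that are always present.
    if article['probability'] < min_probability:
        return None

    published = None
    raw_date = article.get('datePublished')
    if raw_date is not None:
        try:
            published = datetime.fromisoformat(raw_date)
        except ValueError:
            # Leave the date unset if fromisoformat rejects the string.
            published = None

    # Fall back to the combined author string when authorsList is absent.
    authors = article.get('authorsList')
    if authors is None:
        authors = [article['author']] if 'author' in article else []

    return {
        'url': article['url'],
        'headline': article.get('headline'),
        'published': published,
        'authors': authors,
    }


sample = {
    'url': 'https://example.com/article?id=24',
    'probability': 0.95,
    'headline': 'Article headline',
    'datePublished': '2019-06-19T00:00:00',
    'author': 'Article author',
}
print(summarize_article(sample))
```

Using dict.get for optional fields avoids a KeyError when a field has been excluded from the result.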
Response example
Below is an example response with all article fields present:
[
    {
        "article": {
            "headline": "Article headline",
            "datePublished": "2019-06-19T00:00:00",
            "datePublishedRaw": "June 19, 2019",
            "dateModified": "2019-06-21T00:00:00",
            "dateModifiedRaw": "June 21, 2019",
            "author": "Article author",
            "authorsList": [
                "Article author"
            ],
            "inLanguage": "en",
            "breadcrumbs": [
                {
                    "name": "Level 1",
                    "link": "http://example.com"
                }
            ],
            "mainImage": "http://example.com/image.png",
            "images": [
                "http://example.com/image.png"
            ],
            "description": "Article summary",
            "articleBody": "Article body ...",
            "articleBodyHtml": "<article><p>Article body ... </p> ... </article>",
            "articleBodyRaw": "<div id=\"an-article\">Article body ...",
            "videoUrls": [
                "https://example.com/video.mp4"
            ],
            "audioUrls": [
                "https://example.com/audio.mp3"
            ],
            "probability": 0.95,
            "canonicalUrl": "https://example.com/article/article-about-something",
            "url": "https://example.com/article?id=24"
        },
        "webPage": {
            "inLanguages": [
                {"code": "en"},
                {"code": "es"}
            ]
        },
        "query": {
            "id": "1564747029122-9e02a1868d70b7a3",
            "domain": "example.com",
            "userQuery": {
                "pageType": "article",
                "url": "http://example.com/article?id=24"
            }
        },
        "algorithmVersion": "20.8.1"
    }
]
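The response is a list with one entry per query, and the article key is present only when extraction succeeded. As a rough sketch, the hypothetical split_results helper below separates the two cases. It assumes that every entry echoes the requested URL under query.userQuery.url, as in the example response; the shape of an entry without an article is not specified on this page, so the second sample entry is invented for illustration.

```python
def split_results(results):
    """Separate results that carry an article from those that do not."""
    extracted, missing = [], []
    for result in results:
        # userQuery echoes the URL that was originally requested.
        requested_url = result['query']['userQuery']['url']
        if 'article' in result:
            extracted.append((requested_url, result['article']))
        else:
            missing.append(requested_url)
    return extracted, missing


# Trimmed-down sample data; the second entry is a made-up result without
# an article key, used only to exercise the "missing" branch.
results = [
    {
        'article': {'url': 'https://example.com/article?id=24', 'probability': 0.95},
        'query': {'userQuery': {'pageType': 'article',
                                'url': 'http://example.com/article?id=24'}},
    },
    {
        'query': {'userQuery': {'pageType': 'article',
                                'url': 'http://example.com/missing'}},
    },
]

extracted, missing = split_results(results)
for requested_url, article in extracted:
    print(requested_url, article['probability'])
print('no article for:', missing)
```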
Format of articleBodyHtml field
The articleBodyHtml field in article extractions contains a normalized and simplified HTML version of the article body.
Because the markup is predictable, you can apply your own CSS styles to it so that the final look and feel is integrated with the rest of your app.
The normalized HTML also allows for automated HTML processing which is consistent across websites. For example:
- To get all images with their captions, you can run the //figure XPath and then ./img and ./figcaption on each result
- h tags (headings) are normalized, making the article hierarchy easy to determine
- Tables and lists can be extracted cleanly
- Links are absolute
- Only semantic HTML tags are returned; no generic divs or spans are included
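The image-and-caption lookup from the first item can be sketched as follows. With an XPath-capable library the //figure, ./img and ./figcaption expressions can be used directly; this example instead uses Python's standard-library html.parser to reach the same result without extra dependencies. FigureCollector and the sample HTML are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collect (image URL, caption) pairs from figure elements."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None      # dict for the figure being parsed, if any
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == 'figure':
            self._current = {'src': None, 'caption': ''}
        elif self._current is not None and tag == 'img':
            self._current['src'] = dict(attrs).get('src')
        elif self._current is not None and tag == 'figcaption':
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == 'figcaption':
            self._in_caption = False
        elif tag == 'figure' and self._current is not None:
            # Only keep figures that actually wrap an image.
            if self._current['src'] is not None:
                caption = self._current['caption'].strip() or None
                self.figures.append((self._current['src'], caption))
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current['caption'] += data


html = (
    '<article>'
    '<figure><img src="https://example.com/a.png" alt="A">'
    '<figcaption>First image</figcaption></figure>'
    '<figure><img src="https://example.com/b.png"></figure>'
    '<figure><iframe src="https://example.com/video"></iframe></figure>'
    '</article>'
)
collector = FigureCollector()
collector.feed(html)
collector.close()
print(collector.figures)
```

Figures that wrap other media, such as the iframe in the sample, are skipped because they contain no img element.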
The supported tags and attributes are normalized as follows:
| Content Type | Normalization | Supported Elements/Attributes |
| --- | --- | --- |
| Sectioning | All content is enclosed in a root article tag. Headings are normalized so that they always start with h2. | article (root only), h2, h3, h4, h5, h6, aside |
| Text | Paragraphs are enclosed in a p tag. Tables, lists, definition lists and block quotes are supported. | p, table, tbody, thead, tfoot, th, tr, td, ul, ol, li, dl, dt, dd, blockquote |
| Inline text | The b tag is translated to strong. The i tag is translated to em. | a, br, strong, em, s, sup, sub, del, ins, u, cite |
| Pre-formatted text | None | pre, code |
| Multimedia elements | Multimedia elements are generally enclosed within figure. Captions for these elements are included within the figcaption tag when available. If multimedia elements appear in the text as inline elements within paragraphs, they are kept as is (without enclosing them in a figure element). | figure, figcaption, img, video, audio, iframe, embed, object, source |
| Supported attributes | Tag attributes not in the supported list to the right are filtered out of the output. | data-*, alt, cite, colspan, datetime, dir, href, label, rowspan, src, srcset, sizes, start, title, type, value, vspace |
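Because headings always start at h2, an outline of the article can be derived from the heading level alone. The sketch below is one possible approach using the standard-library html.parser; OutlineCollector and the sample HTML are hypothetical, and depth 0 is taken to mean an h2 heading.

```python
from html.parser import HTMLParser

HEADING_TAGS = {'h2', 'h3', 'h4', 'h5', 'h6'}


class OutlineCollector(HTMLParser):
    """Collect (depth, text) pairs for each heading; depth 0 is an h2."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._depth = None   # depth of the heading being read, if any
        self._text = ''

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._depth = int(tag[1]) - 2
            self._text = ''

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._depth is not None:
            self.outline.append((self._depth, self._text.strip()))
            self._depth = None

    def handle_data(self, data):
        # Inline tags such as em inside a heading still contribute text.
        if self._depth is not None:
            self._text += data


html = (
    '<article><h2>Intro</h2><p>Text</p>'
    '<h3>Details <em>here</em></h3><h2>End</h2></article>'
)
collector = OutlineCollector()
collector.feed(html)
collector.close()
for depth, text in collector.outline:
    print('  ' * depth + text)
```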
Social media content
Content from social media platforms (Twitter, etc.) will be rendered properly if the correct JavaScript files from the platform are included.
The currently supported platforms and the script to include for each are as follows:
| Platform | Script file |
| --- | --- |
| Twitter | <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> |
| Instagram | <script async src="//www.instagram.com/embed.js"></script> |
| Facebook | <div id="fb-root"></div> <script async defer src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script> |
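One way to apply these scripts is to wrap the articleBodyHtml value in a small HTML page before displaying it. The sketch below assumes a hypothetical render_page helper and places the scripts at the end of the body; that placement is an assumption of this example, not a requirement stated by the platforms.

```python
# Script snippets for the supported platforms, keyed by platform name.
PLATFORM_SCRIPTS = {
    'twitter': '<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>',
    'instagram': '<script async src="//www.instagram.com/embed.js"></script>',
    'facebook': (
        '<div id="fb-root"></div>\n'
        '<script async defer src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>'
    ),
}


def render_page(article_body_html, platforms=('twitter', 'instagram', 'facebook')):
    """Wrap articleBodyHtml in a page that loads the chosen platform scripts."""
    scripts = '\n'.join(PLATFORM_SCRIPTS[name] for name in platforms)
    return (
        '<!DOCTYPE html>\n'
        '<html>\n<head><meta charset="utf-8"></head>\n'
        '<body>\n' + article_body_html + '\n' + scripts + '\n</body>\n</html>\n'
    )


page = render_page('<article><p>Article body ...</p></article>')
print(page)
```

Passing a shorter platforms tuple, for example ('twitter',), loads only the scripts an application actually needs.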
Example articleBodyHtml response
<article>
<p>The range of use cases for web data extraction is rapidly increasing and with it the necessary investment. Plus the number of websites continues to grow rapidly and is expected to exceed 2 billion by 2020.</p>
<p>Presented by <a href="https://www.zyte.com/">Zyte</a> (formerly Scrapinghub), the first Web Data Extraction Summit will be held in Dublin, Ireland on 17th September 2019. This is the first-ever event dedicated to web data and extraction and will be graced by over 100 CEOs, Founders, Data Scientists and Engineers.</p>
<figure><iframe src="https://play.vidyard.com/7hJbbWtiNgipRiYHhTCDf6?v=4.2.13&viral_sharing=0&embed_button=0&hide_playlist=1&color=FFFFFF&playlist_color=FFFFFF&play_button_color=2A2A2A&gdpr_enabled=1&type=inline&new_player_ui=1&vydata%5Butk%5D=d057931dfb8520abe024ef4b2f68d0ad&vydata%5Bportal_id%5D=4367560&vydata%5Bcontent_type%5D=blog-post&vydata%5Bcanonical_url%5D=https%3A%2F%2Fblog.scrapinghub.com%2Fthe-first-web-data-extraction-summit&vydata%5Bpage_id%5D=12510333185&vydata%5Bcontent_page_id%5D=12510333185&vydata%5Blegacy_page_id%5D=12510333185&vydata%5Bcontent_folder_id%5D=null&vydata%5Bcontent_group_id%5D=5623735666&vydata%5Bab_test_id%5D=null&vydata%5Blanguage_code%5D=null&disable_popouts=1" title="Video"></iframe></figure>
<p>With a promising line-up of talks and discussions accompanied by interesting conversations and networking sessions with fellow data enthusiasts, followed by food and drinks at the magnificent Guinness Storehouse, there are no reasons to miss this event. What’s more, we are also giving out free swag! You will get your own Extract Summit T-shirts on the day!</p>
<figure><img src="https://blog.scrapinghub.com/hubfs/Extract-Summit-Emails-images-tee-aug2019-v1.gif" alt="Extract-Summit-Emails-images-tee-aug2019-v1"></figure>
</article>