Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
Kalo 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Is there a way to load just a small element from a page's DOM instead of loading the whole page? I'd like to scrape just a few divs out of the whole html page. The idea is to increase the speed and minimize transfer bandwidth.

Answer

Hi Kalo, this can be done with a headless browser like Splash.

For example, you can turn off images or execute custom JavaScript.

To learn more, please check the documentation at: https://splash.readthedocs.io/en/stable/
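
For illustration, here is a rough sketch using Scrapy with scrapy-splash; the start URL and the div.product-card selector are placeholders, not from the original question. Note that Splash still downloads the page itself; the savings come from disabling images and returning only the selected fragments to the spider:

# Rough sketch (not from the original answer): ask Splash to load the page
# with images disabled and send back only the HTML of selected divs.
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    splash.images_enabled = false   -- skip image downloads
    assert(splash:go(args.url))
    splash:wait(0.5)
    -- return only the outerHTML of the divs we care about
    return splash:evaljs([[
        Array.from(document.querySelectorAll('div.product-card'))
            .map(function(el) { return el.outerHTML; })
            .join('')
    ]])
end
"""

class PartialDomSpider(scrapy.Spider):
    name = "partial_dom"
    start_urls = ["https://example.com/catalog"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                callback=self.parse,
                endpoint="execute",
                args={"lua_source": LUA_SCRIPT},
            )

    def parse(self, response):
        # response.text now holds only the selected divs, not the full page
        for fragment in scrapy.Selector(text=response.text).css("div.product-card"):
            yield {"html": fragment.get()}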

Kind regards,

Pablo Vaz

0
Not a bug
Dnnn2011 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I tried to use Crawlera with the default PHP script (the script provided by Crawlera) and adjusted it (API key, path to certificate file), but it doesn't work at all. It gives an error: connect() timed out!


The script:


<?php


$ch = curl_init();

$url = 'http://www.leokerklaan.nl';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = 'hidden_key:';

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 1 );
curl_setopt($ch, CURLOPT_CAINFO, 'hidden_path/crawlera-ca.crt');

$scraped_page = curl_exec($ch);
echo curl_error($ch);
curl_close($ch);
var_dump( $scraped_page );

?>

Please HELP!


Answer

Hi!

The code works fine for me.

I tried with the URL:

http://httpbin.org/ip

and with the one you provided, and it worked fine for both.


Please be sure that:

  • The path to the ca-cert is correct (you can try installing it in your Desktop or home directory)
  • The proxy auth is of the form $proxy_auth = '1231examplekfsj6789:'; note that the ":" goes at the end.

I'm running this script on OSX with:  > php my_script.php
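
If it still times out, one quick way to verify the API key and certificate outside PHP is a short Python sketch with the requests library (not part of the original answer; the API key and certificate path are placeholders):

# Rough sketch: verify Crawlera access from Python with requests.
import requests

proxy_auth = 'YOUR_API_KEY:'  # note the trailing colon (empty password)
proxies = {
    'http': 'http://{}@proxy.crawlera.com:8010/'.format(proxy_auth),
    'https': 'http://{}@proxy.crawlera.com:8010/'.format(proxy_auth),
}

# The Crawlera CA certificate is needed so HTTPS responses validate.
response = requests.get(
    'https://httpbin.org/ip',
    proxies=proxies,
    verify='/path/to/crawlera-ca.crt',
)
print(response.status_code)
print(response.text)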


Best,


Pablo

0
Completed
hello 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am using the following code to try and bring back all fields within a job using the Items API:


$sch_id = "172/73/3"; // job ID

$ch = curl_init();

curl_setopt_array($ch, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_URL => "https://storage.scrapinghub.com/items/". $sch_id ."?format=json&fields=_type,design,price,material",
CURLOPT_CUSTOMREQUEST =>"GET",
CURLOPT_HTTPHEADER => array(
'Accept: application/x-jsonlines',
),
CURLOPT_USERPWD => "e658eb1xxxxxxxxxxxx4b42de6fd" . ":" . "",
));

$result = curl_exec($ch);
print_r(json_decode($result));

curl_close ($ch);

There are 4 fields I am trying to get as JSON, but the request only brings back "_type" and "price". I have tried various things with different headers and the request URL, but no luck.


Any advice would be appreciated.


Cheers,

Adam

Answer

Hi,


We suggest using the script provided in our docs:


<?php
$ch = curl_init();
$url = 'https://twitter.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '<API KEY>:';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');
$scraped_page = curl_exec($ch);
curl_close($ch);
echo $scraped_page;
?> 



If that is not possible, make sure to add the reference to the certificate needed for fetching HTTPS. In the code we provide, this is done in:


curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');
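
For the original question about missing fields, here is a minimal sketch of querying the Items API with the fields parameter, shown in Python with requests (the job ID and field names come from the question; the API key is a placeholder). One possible reason for "missing" fields is simply that those fields were never stored on the affected items:

# Rough sketch (not part of the original answer): request selected fields
# from the Items API.
import json
import requests

job_id = '172/73/3'
api_key = 'YOUR_API_KEY'  # placeholder

response = requests.get(
    'https://storage.scrapinghub.com/items/{}'.format(job_id),
    params={'format': 'json', 'fields': '_type,design,price,material'},
    auth=(api_key, ''),  # API key as the username, empty password
)
response.raise_for_status()

for item in response.json():
    # A field that was never stored on an item will not appear here.
    print(json.dumps(item, indent=2))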
0
Answered
vl2017 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 2

Could you add support for UTF-8? Non-English letters are not shown in the sample page editor, and regexp conditions do not work with them.

Answer

Hey!

Your inquiry has been escalated to our Portia team.

UTF-8 is supported for non-Latin characters, but it may need improvement in how it interacts with regex.

This improvement is planned for upcoming releases.


Thanks for your valuable feedback and for helping us to improve our services.


Kind regards,

Pablo

0
Answered
vl2017 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

What is the difference between annotations and fields? In "Sample page → Items", each field has configuration icons that open a tab with separate "Annotation" and "Field" groups. There are separate "required" options: what do they mean, and do they overlap each other? The "Annotation" group sets the path to the element, but that is already hidden inside the "Item", so why "required"?


How do I configure the scraper to ignore any pages containing a specified attribute or word?
Answer

Hi!


The Annotation count is not the same as the Extracted Items count.

If the webpage contains a list of items and the user uses the repeated annotations icon, the annotations will propagate and reflect the number of items present in the page.


However, it may happen that the algorithm responsible for data extraction is unable to use the annotations provided by the user to properly extract data, thus extracting a number of items different from the count next to the annotations.


For example, suppose we have one annotation with a count of 10, hinting that we are extracting 10 items from the page, but the Extracted Items count shows that 0 items were extracted. This means our annotations haven't worked with Portia's extraction algorithm, so we may have to update them to get the data from alternative elements.


To learn more, see the Portia documentation:

Annotations


Excellent question!

Kind regards,


Pablo


0
Answered
Alex L 2 months ago in Scrapy Cloud • updated 2 months ago 2

Currently, scripts can only be deployed by using the shub deploy command. When we push scripts to git, the app doesn't seem to pull them from our repo.


Will pulling scripts via a git hook be supported in the future, or do you intend to stick with shub deploy for now?

Answer

Hi Alex, we currently support deploying from a GitHub repository.

Please take a moment to check this tutorial:

Deploying a Project from a Github Repository


Kind regards,


Pablo

+1
Answered
jkluv000 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am using the default sample script provided on the site https://doc.scrapinghub.com/crawlera.html#php


When I use the default, my dashboard doesn't show that I'm even reaching Crawlera. There are no errors and nothing is displayed. Any idea how to troubleshoot?


DOMAIN HOST: Godaddy

Cert is in the same directory as PHP script


<?php


$ch = curl_init();


$url = 'https://www.google.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '239ec2d8dd334cfeb7b7361b00830f40:';


curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, 'crawlera-ca.crt');


$scraped_page = curl_exec($ch);
curl_close($ch);
echo $scraped_page;


?>
Answer

Hi!

Make sure to add the full path before crawlera-ca.crt.

For example:


'/Users/my_user/Desktop/my_Folder/crawlera-ca.crt'


The script works fine.


Best,

Pablo

0
Fixed
sappollo 2 months ago in Portia • updated 2 months ago 2

Hi all,


Since yesterday my Portia crawls have been failing with a certain error:


I don't know whether this is a Scrapinghub/Portia error or something related to the external page being scraped (which had worked successfully for months).

Answer

Dear Sapollo,


Sometimes backend updates or new Portia releases can affect old extractors, which is why we always suggest giving spiders some maintenance: refresh and redeploy them when necessary.


If possible, try recreating your spider and launching it again. This should work.


Kind regards,


Pablo

0
Answered
shamily23v 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I would like to write/update data in MongoDB with the items crawled from Scrapinghub.

Answer

Hi Shamily,


If you have a MongoDB server that you would like your spiders to write to, you may want to open access to that server only from Scrapy Cloud IPs.

Unfortunately, this is not possible. We cannot provide a reliable range of IP addresses for our Scrapy Cloud crawling servers because they are not static; they change frequently. So even if we gave you the list we have now, it would soon change and your spiders' connection to Mongo would break.

Here are a couple of alternatives to consider:

  • Write a script that pulls the data from Scrapinghub (using the API) and writes it to your Mongo server, as in the sketch below. This script can run on your Mongo server or any other server; you only need to whitelist that server's IP.
  • Use authentication in Mongo

To know more about our API: https://doc.scrapinghub.com/api/items.html#items-api
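
As a rough sketch of the first alternative, here is a small Python script that pulls a job's items through the Items API and writes them to MongoDB; the job ID, API key and Mongo connection string are placeholders, and it assumes the requests and pymongo packages are installed on a machine you control:

# Rough sketch: pull a job's items via the Items API and store them in MongoDB.
import requests
from pymongo import MongoClient

API_KEY = 'YOUR_API_KEY'                 # placeholder
JOB_ID = 'PROJECT_ID/SPIDER_ID/JOB_ID'   # e.g. '12345/1/7' (placeholder)

resp = requests.get(
    'https://storage.scrapinghub.com/items/{}'.format(JOB_ID),
    params={'format': 'json'},
    auth=(API_KEY, ''),  # API key as the username, empty password
)
resp.raise_for_status()
items = resp.json()

client = MongoClient('mongodb://user:password@localhost:27017/')  # placeholder URI
collection = client['scraping']['items']
if items:
    collection.insert_many(items)
print('Stored {} items'.format(len(items)))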

Kind regards,


Pablo

0
Answered
MSH 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Hi.


I created a Portia spider for a website that is built with ASP.NET and uses javascript:__doPostBack for its pagination links.


Is it possible to use this kind of link (javascript:__doPostBack) in Portia?

for example:

<a href="javascript:__doPostBack('p$lt$ctl06$pageplaceholder$p$lt$ctl00$ApplyPageJobSearch$GridView1','Page$2')">2</a>


Thanks

Answer

Hi MSH!


Why don't you try this approach for pagination links:

How to handle pagination in Portia


This article could also be helpful:

Portia List of URLs


If these approaches don't work, perhaps you should try Scrapy; Portia sometimes can't handle complex projects involving JS.
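
If you do end up moving to Scrapy, __doPostBack pagination can often be reproduced by re-posting the ASP.NET form fields yourself. A rough sketch (the start URL is a placeholder; the __EVENTTARGET value is taken from the link in your question):

# Rough sketch: paginate an ASP.NET site from Scrapy by re-posting the form
# that javascript:__doPostBack(...) would submit.
import scrapy

class JobsSpider(scrapy.Spider):
    name = "aspnet_pagination"
    start_urls = ["https://example.com/jobs"]  # placeholder URL

    def parse(self, response):
        # ... extract the items you annotated in Portia here ...

        # FormRequest.from_response() carries over hidden fields such as
        # __VIEWSTATE and __EVENTVALIDATION; we only set the postback target.
        # If the page has several forms, pass formid=... or formname=... too.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "__EVENTTARGET": "p$lt$ctl06$pageplaceholder$p$lt$ctl00"
                                 "$ApplyPageJobSearch$GridView1",
                "__EVENTARGUMENT": "Page$2",
            },
            callback=self.parse,
        )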


Best regards!


Pablo

0
Answered
csmik.cs 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 2

Hi guys,


I am trying to crawl a website using Crawlera that requires the presence of "Connection: keep-alive" in the header.

Is there any way to make Crawlera compatible with keep-alive connections? I tried using sessions but it didn't seem to help.


Thanks!

Answer
csmik.cs 2 months ago

My bad, it actually seems to be working, but I sometimes get "Cache-Control: max-age=259200" header entries rather than "Connection: keep-alive". Probably normal behavior.


Cheers.
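
For later readers who do need sticky sessions: Crawlera sessions are driven by the X-Crawlera-Session header. A minimal sketch in Python with requests (the API key and target URL are placeholders):

# Rough sketch: create and reuse a Crawlera session.
import requests

proxies = {
    'http': 'http://YOUR_API_KEY:@proxy.crawlera.com:8010/',
    'https': 'http://YOUR_API_KEY:@proxy.crawlera.com:8010/',
}

# Ask Crawlera to create a session; the response echoes the session id.
first = requests.get(
    'http://httpbin.org/ip',
    proxies=proxies,
    headers={'X-Crawlera-Session': 'create'},
)
session_id = first.headers.get('X-Crawlera-Session')

# Reuse the same outgoing node by sending that id on later requests.
second = requests.get(
    'http://httpbin.org/ip',
    proxies=proxies,
    headers={'X-Crawlera-Session': session_id},
)
print(first.text, second.text)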

0
Not a bug
mattegli 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am trying to deploy a (local) Portia project to Scrapinghub. After adding "slybot" to requirements.txt I can deploy successfully, but when running the spider the following error occurs:


Traceback (most recent call last):

  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spidermanager.py", line 51, in __init__
    **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spider.py", line 44, in __init__
    ((t['scrapes'], t) for t in spec['templates']
KeyError: 'templates'
Answer

Hi!

It seems you have successfully deployed your spiders in the Switzerland project.

Let us know if you need further assistance.

Best,

Pablo

0
Answered
GinVlad 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Hello, I am running a crawler job, but it cannot receive response data after more than 40 requests.

I run my code on localhost and it works OK.

Answer

Hey Gin, checking your stats in the dashboard, it seems your spider is working fine.


Let us know if you need further help.


Best,


Pablo

0
Not a bug
BobC 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 2

The following URL renders fine in all browsers EXCEPT the Scrapinghub browser:

https://tenforward.social/@Redshirt27

I'd like to find out why, but no clues are given. Help?

Answer

Dear Bob,


Unfortunately, this site is still hard to open with the Portia browser.

Perhaps you should consider using other tools such as Scrapy to try fetching the data. Some sites are simply too complex to scrape and require more advanced tools.


If you don't know how to use it, I suggest this great tutorial:

Learn Scrapy


Best regards,

Pablo

0
Answered
kzrster 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 2

Hi!
I needed to scrape a site with a lot of JS code, so I use scrapy+selenium, and it also needs to run on Scrapy Cloud.
I wrote a spider which uses scrapy+selenium+phantomjs and ran it on my local machine. All is OK.
Then I deployed the project to Scrapy Cloud using shub-image. Deployment is OK, but the result of
webdriver.page_source is different: it's OK locally, but not OK in the cloud (the HTML contains a 403 message, although the request itself returns HTTP 200).
Then I decided to use a Crawlera account. I added it with:

service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]


For Windows (local):

self.driver = webdriver.PhantomJS(executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe', service_args=service_args)


For Docker:

self.driver = webdriver.PhantomJS(executable_path=r'/usr/bin/phantomjs', service_args=service_args, desired_capabilities=dcap)

Again, locally all is OK; in the cloud it is not.
I've checked the Crawlera info and it's OK: requests are sent from both (local and cloud).

I don't get what's wrong.
I think it might be differences between the PhantomJS versions (Windows vs. Linux).

Any ideas?

Answer

Hi kzrster,


If the issue is related to SSL fetching (HTTPS), it may be due to our current version of Erlang, which returns errors for some languages and browsers.


Our team is working on an update of the Erlang version, which should be deployed within a few weeks.


Let us know if you find more information about the error you get.


Best,


Pablo