Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
devin 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Does anyone have any experience using NightmareJS with Crawlera? I am having trouble specifying the proxy server using just the Electron switches.

Answer

Hey Devin,


Our team is actively working on providing better integration between Crawlera and different languages and browsers. At the moment, the closest approach we can offer is in our KB for some spooky cousins of NightmareJS:

Crawlera with CasperJS, PhantomJS...


Please check our forum for similar inquiries about NightmareJS, and also consider asking on our Stack Overflow - Crawlera channel. Many of our best developers contribute there actively.


Best regards!


Pablo

0
Answered
signedup88 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Hey guys, does anybody have experience connecting to Crawlera using WebDriver / Selenium?


I am running a project on a WIN PC

Answer

Hey Signedup88,


Since it is not so trivial to set up proxy authentication in Selenium, a popular option is to employ Polipo as a proxy. Update the Polipo configuration file /etc/polipo/config to include your Crawlera credentials (if the file is not present, copy and rename config.sample found in the Polipo source folder):

parentProxy = "proxy.crawlera.com:8010"
parentAuthCredentials = "<API key>:"

For password safety reasons, this content is displayed as (hidden) in the Polipo web interface. The next step is to specify the Polipo proxy details in your Selenium automation script, e.g. for Python and Firefox:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.proxy import Proxy, ProxyType

# Point Selenium at the local Polipo instance, which forwards to Crawlera
polipo_proxy = "localhost:8123"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': polipo_proxy,
    'ftpProxy' : polipo_proxy,
    'sslProxy' : polipo_proxy,
    'noProxy'  : ''
})
driver = webdriver.Firefox(proxy=proxy)
driver.get("http://scrapinghub.com")
assert "Scrapinghub" in driver.title

# Click an element to confirm the page loaded through the proxy
elem = driver.find_element_by_class_name("portia")
actions = ActionChains(driver)
actions.click(on_element=elem)
actions.perform()
print("Clicked on Portia!")
driver.close()

Best regards,


Pablo

0
Answered
Chris Fankhauser 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I've recently noticed in the Crawlera documentation that the Fetch API is now deprecated and going away "soon". Considering that we use it almost exclusively, I'm a little concerned about timing re: moving away from it.


Is it possible to get a more definitive timeline of the retirement of the Fetch API?  At some organizations "soon" can mean "next week" and at others it can mean "2019".  Thanks!

Answer

Hi Chris, I've escalated your question, but there is no timeline for the retirement yet. Feel free to ask again anytime.


As you know, even when planned, these kinds of things sometimes get delayed due to other projects our developers are working on.


Best regards,


Pablo

0
Answered
1044204605 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 month ago 1

I bought the Crawlera service just now and successfully ran my Scrapy project using the Crawlera proxy, but the usage of my Crawlera service is always 0% and there's nothing in the histogram on the website. Why?

Answer

Hi,


The stats plots are slightly delayed. You will start to see the requests made after 20 to 30 minutes.


If the stats still show 0% usage, let us know through our support platform so we can check our internal stats and see whether you are indeed making requests through Crawlera or bypassing it.
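If you want to double-check on your side, a quick sketch along these lines (Python requests, with a placeholder API key) shows whether traffic is really leaving through Crawlera: httpbin echoes the IP it sees, which should not be your own IP when the proxy is in use.

import requests

# Placeholder; replace with your Crawlera API key
API_KEY = "<API key>"
PROXY = "http://{}:@proxy.crawlera.com:8010/".format(API_KEY)
proxies = {"http": PROXY}

# httpbin returns the IP it sees; a Crawlera slave IP means the proxy is in use
response = requests.get("http://httpbin.org/ip", proxies=proxies)
print(response.status_code, response.text)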


Best regards,


Pablo Vaz

Support Engineer

0
Answered
Sergej 1 month ago in Crawlera • updated 4 days ago 5

Hello,


I am experiencing issues using Crawlera on HTTPS sites with Scrapy and PhantomJS. My config is:


        service_args = [
            '--proxy=proxy.crawlera.com:8010',
            '--proxy-type=http',
            '--proxy-auth=XXXXXXXXXXXXXXXXX:',
            '--webdriver-logfile=phantom.log',
            '--webdriver-loglevel=DEBUG',
            '--ssl-protocol=any',
            '--ssl-client-certificate-file=crawlera-ca.crt',
            '--ignore-ssl-errors=true',
            ]


Though I always get this error and the result is empty:

"errorCode":99,"errorString":"Cannot provide a certificate with no key, "


I have been stuck on this problem for hours. Any help is much appreciated.


Thank you!

Sergej

Answer

Hey Sergej, even though there's no exact ETA, we expect it within a few weeks.

There has been increasing demand for Casper, Nightmare and PhantomJS support in recent months, and our team has taken this on as a priority.

We commonly use our blog to post new releases and info about cool features:
Scrapinghub Blog


Best regards,

Pablo

0
Not a bug
joaquin 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 14

I am running some crawlers on a site, all of them using Crawlera, and I am getting several 429 error statuses (which means too many requests), so Crawlera doesn't seem to be adjusting its throttling to accommodate these errors.


Does your throttling algorithm consider 429 status codes?


I am using Scrapy plus the Crawlera middleware, by the way.

Answer

Also, your plan's max concurrency applies to overall account usage, which may be spread across crawling many different sites at the same time.
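As a rough sketch (assuming the scrapy-crawlera middleware and a placeholder API key), the settings below let Crawlera manage throttling and add 429 to the retry codes so throttled requests are retried rather than dropped:

# settings.py -- minimal sketch, assumes the scrapy-crawlera package is installed
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API key>'  # placeholder

# Let Crawlera manage throttling instead of Scrapy
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0

# Retry requests that come back as 429 (too many requests)
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]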

0
Not a bug
Dnnn2011 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I tried to use Crawlera with the default PHP script (the one provided by Crawlera) and adjusted it (API key, path to the certificate file), but it doesn't work at all. It gives an error: connect() timed out!


The script:


<?php


$ch = curl_init();

$url = 'http://www.leokerklaan.nl';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = 'hidden_key:';

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 1 );
curl_setopt($ch, CURLOPT_CAINFO, 'hidden_path/crawlera-ca.crt');

$scraped_page = curl_exec($ch);
echo curl_error($ch);
curl_close($ch);
var_dump( $scraped_page );

?>

Please HELP!


Answer

Hi!

The code works fine for me.

I tried it with the URL http://httpbin.org/ip as well as the one you provided, and it worked fine for both.


Please be sure that:

  • The path to the CA cert is correct (you can try placing it in your Desktop or home directory)
  • The proxy auth is $proxy_auth = '1231examplekfsj6789:'; note the ":" at the end.

I'm using OS X and running the script with: php my_script.php


Best,


Pablo

+1
Answered
jkluv000 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am using the default sample script provided on the site https://doc.scrapinghub.com/crawlera.html#php


When I use the default, my dashboard isn't showing that I'm even making it to Crawlera. There are no errors and nothing is displayed. Any idea how to troubleshoot?


DOMAIN HOST: Godaddy

Cert is in the same directory as PHP script


<?php


$ch = curl_init();


$url = 'https://www.google.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '239ec2d8dd334cfeb7b7361b00830f40:';


curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, 'crawlera-ca.crt');


$scraped_page = curl_exec($ch);
curl_close($ch);
echo $scraped_page;


?>
Answer

Hi!

Make sure to add the full path before crawlera-ca.crt.

For example:


'/Users/my_user/Desktop/my_Folder/crawlera-ca.crt'


The script works fine.


Best,

Pablo

0
Answered
csmik.cs 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 2

Hi guys,


I am trying to crawl a website using Crawlera that requires the presence of "Connection: keep-alive" in the header.

Is there any way to make Crawlera compatible with keep-alive connections? I tried using sessions but it didn't seem to help.


Thanks!

Answer
csmik.cs 2 months ago

My bad, it actually seems to be working, but I'm sometimes getting "Cache-Control: max-age=259200" header entries rather than "Connection: keep-alive". Probably normal behavior.


Cheers.

0
Answered
GinVlad 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Hello, I am running a crawler job, but it cannot receive response data after more than 40 requests.

When I run my code on localhost, it works OK.

Answer

Hey Gin, checking your stats in the dashboard, it seems your spider is working fine.


Let us know if you need further help.


Best,


Pablo

0
Answered
g4s.evry 2 months ago in Crawlera • updated by Thriveni Patil (Support Engineer) 2 months ago 1

Hi,


I was able to access the URL below before. Today I am unable to access it.


http://help.scrapinghub.com/crawlera/


It says 404 Not found.

Answer

Hello,


We have moved to a new interface; you can find the Crawlera KB articles at https://helpdesk.scrapinghub.com/solution/folders/22000131039 .


Regards,

Thriveni Patil

0
Answered
Regan 3 months ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 2 months ago 1

I get the following error when trying to access an SSL website through the proxy in C#: The remote server returned an error: (407) Proxy Authentication Required.


I have installed the certificate and tried the two following code methods below:


1.

var key = _scrapingApiKey;
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

var encodedApiKey = Base64Encode(key);
request.Headers.Add("Proxy-Authorization", "Basic " + encodedApiKey);

request.Proxy = myProxy;
request.PreAuthenticate = true;

WebResponse response = request.GetResponse();


2.

var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

myProxy.Credentials = new NetworkCredential(_scrapingApiKey, "");

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Proxy = myProxy;

request.PreAuthenticate = true;

WebResponse response = request.GetResponse();


What is the correct way to make the proxy work when accessing SSL websites?

Answer

Hello,


The first code snippet should work, but make sure to include the ":" after the API key before Base64-encoding it.
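For illustration only, here is the same header construction sketched in Python with a placeholder key, showing where the trailing colon goes before Base64-encoding:

import base64

# Placeholder key; Crawlera uses the API key as the username and an empty
# password, so the string to encode is "<API key>:" with a trailing colon.
api_key = "<API key>"
encoded = base64.b64encode("{}:".format(api_key).encode("ascii")).decode("ascii")
proxy_authorization = "Basic " + encoded
print(proxy_authorization)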

0
Answered
Mani Zia 3 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 months ago 1

Hi All,

Please help me out. I am trying to crawl the site at the given link: https://en-ae.wadi.com/home_entertainment-televisions/?ref=navigation
but Scrapy (Python) is unable to crawl it. I also used these imports:


import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


but it is still returning only null on one line, nothing else. Looking forward to your kind response.

Thanks

Regards zia.

Answer

Hi Mani!


We can't provide Scrapy assistance here, but let me suggest other channels where you can ask:

1. StackOverflow - Scrapy

There's a vast community of Scrapy developers and users contributing actively there.

2. Github - Scrapy

The same applies there.


If your project requires urgent attention, please share your needs with us through:

https://scrapinghub.com/quote

to receive sales assistance if you are considering hiring our professional services.


Best of luck with your project!


Pablo

0
Answered
Nayak 3 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Hi,


I want to make a web request to the Google site using the Crawlera proxy along with the Session API. After integration, I observed that a few times we are unable to get a session ID.

The code below shows the web request for Google:

C# Code
private static void GetResponse()
{
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
ServicePointManager.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;
//Proxy API Call
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
myProxy.Credentials = new NetworkCredential("< C10 plan API key >", "");
// Session API Call

HttpWebRequest sessionRequest = (HttpWebRequest)WebRequest.Create("http://proxy.crawlera.com:8010/sessions");
sessionRequest.Credentials = new NetworkCredential("< C10Plan API key >", "");
sessionRequest.Method = "POST";
HttpWebResponse sessionResponse = (HttpWebResponse)sessionRequest.GetResponse();
StreamReader srSession = new StreamReader(sessionResponse.GetResponseStream());
string sessionId = srSession.ReadToEnd();

// Google Request
string searchResults = "http://google.com/search?q=Ganesh Nayak K&num=100";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(searchResults);
request.Proxy = myProxy;
request.PreAuthenticate = true;
request.Headers.Add("x-crawlera-use-https", "1");
request.Headers.Add("X-Crawlera-Session", sessionId);
request.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;

HttpWebResponse response = (HttpWebResponse)request.GetResponse(); // Getting the response from the server takes a lot of time

Stream resStream = response.GetResponseStream();
StreamReader sr = new StreamReader(resStream);
sr.ReadToEnd();
}

If we implement the same code without a session, we get the response more quickly and are able to process requests faster.

Without a session I have processed a larger number of requests, but a few times I got the exception below:

--> Unable to connect to the remote server -> A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 64.58.114.15:8010

Regards,
Ganesh Nayak K
Answer

Hi Ganesh,


You should use sessions only for particular projects that involve a long interaction with the site, or that need more than one page fetched through the same IP. Once you no longer need the same-IP interaction, you should disable sessions or create a new one, in order to take advantage of the proxy rotation.


If you use the same IP to make requests, the risk of being banned is higher, and it's almost like making requests without Crawlera, from a single IP, as in Scrapy Cloud.
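For reference, here is a rough Python sketch of that workflow (placeholder API key): create a session, reuse it via the X-Crawlera-Session header while the same-IP interaction is needed, then delete it so later requests return to the rotating pool.

import requests
from requests.auth import HTTPBasicAuth

API_KEY = "<API key>"  # placeholder
PROXY = "http://{}:@proxy.crawlera.com:8010/".format(API_KEY)
proxies = {"http": PROXY}

# Create a session only when several requests must share one outgoing IP
session_id = requests.post(
    "http://proxy.crawlera.com:8010/sessions",
    auth=HTTPBasicAuth(API_KEY, ""),
).text

# Requests carrying X-Crawlera-Session reuse the same outgoing IP
headers = {"X-Crawlera-Session": session_id}
r = requests.get("http://httpbin.org/ip", proxies=proxies, headers=headers)
print(r.status_code, r.text)

# Drop the session as soon as the same-IP interaction is finished,
# so subsequent requests go back to the rotating proxy pool
requests.delete(
    "http://proxy.crawlera.com:8010/sessions/{}".format(session_id),
    auth=HTTPBasicAuth(API_KEY, ""),
)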


Best regards,


Pablo

0
Answered
FHERTZ 3 months ago in Crawlera • updated by Rashmi Vasudevan (Support Engineer) 3 months ago 1

Hello,


We need to crawl an NL website that is restricted to Netherlands IPs.

When we test, we get the message: "No available proxies"


Does that mean Crawlera doesn't have Netherlands IP addresses?


Thanks


Answer

Hello,


We do not have Netherlands proxies to add at the moment.

Apologies for the inconvenience caused.


Rashmi