Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
jkluv000 4 days ago in Crawlera 0

I am using the default sample script provided at https://doc.scrapinghub.com/crawlera.html#php


When I use the default script, my dashboard doesn't show that I'm even reaching Crawlera. There are no errors and nothing is displayed. Any idea how to troubleshoot?


DOMAIN HOST: GoDaddy

The cert is in the same directory as the PHP script.


<?php

$ch = curl_init();

$url = 'https://www.google.com/';
// Crawlera proxy endpoint; the API key is the user and the password is
// empty, hence the trailing colon.
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '239ec2d8dd334cfeb7b7361b00830f40:';

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, 'crawlera-ca.crt');

$scraped_page = curl_exec($ch);

// Surface curl errors instead of failing silently - with no error check,
// a TLS or proxy failure produces exactly the blank output described above.
if ($scraped_page === false) {
    die('curl error: ' . curl_error($ch));
}

curl_close($ch);
echo $scraped_page;

?>
uday.kumar.kakani 5 days ago in Crawlera • updated by Nguyễn Hoàng 7 hours ago 1

Hi,


We have a Crawlera account on the C10 plan, and we have just renewed the billing period for this account. Today I am seeing a performance issue: each request takes more than two minutes to process, and I am also getting timeout exceptions. Before April 20th this was working fine; I was able to process 100 requests within 45 minutes, but today I can process only 35 requests in the same time. Most Crawlera requests are failing. Please let me know if any changes are required after renewing the billing period.


Regards,

Ganesh Nayak K
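
[Editor's note] While waiting on support, a quick way to quantify the slowdown is to time each request through the proxy. A minimal Python sketch, assuming the requests library, a placeholder API key, and http://example.com as a stand-in target:

import time

import requests

API_KEY = "<API KEY>"  # placeholder - use your C10 key
PROXIES = {"http": "http://{}:@proxy.crawlera.com:8010".format(API_KEY)}

for i in range(10):
    start = time.time()
    try:
        r = requests.get("http://example.com", proxies=PROXIES, timeout=150)
        print("request {}: {} in {:.1f}s".format(i, r.status_code, time.time() - start))
    except requests.RequestException as exc:
        print("request {}: failed after {:.1f}s ({})".format(i, time.time() - start, exc))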

Answered
csmik.cs 6 days ago in Crawlera • updated by Pablo Vaz (Support Engineer) 5 days ago 2

Hi guys,


I am trying to use Crawlera to crawl a website that requires the presence of "Connection: keep-alive" in the headers.

Is there any way to make Crawlera compatible with keep-alive connections? I tried using sessions but it didn't seem to help.


Thanks!

Answer
csmik.cs 5 days ago

My bad, it actually seems to be working, but sometimes I'm getting "Cache-Control: max-age=259200" header entries rather than "Connection: keep-alive". Probably normal behavior.


Cheers.
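
[Editor's note] For anyone landing here with the same question: on the client side, a requests.Session in Python keeps the TCP connection to the proxy alive across calls, and the Connection header can be set explicitly. A rough sketch (placeholder API key; the target URL is an assumption):

import requests

API_KEY = "<API KEY>"  # placeholder
PROXIES = {"http": "http://{}:@proxy.crawlera.com:8010".format(API_KEY)}

# A Session reuses the underlying connection (HTTP keep-alive) between
# requests; the Connection header can also be sent explicitly.
session = requests.Session()
session.headers["Connection"] = "keep-alive"

r = session.get("http://example.com", proxies=PROXIES)
print(r.status_code, r.headers.get("Connection"))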

GinVlad 1 week ago in Crawlera 0

Hello, I am running a crawler job, but it cannot receive response data after more than 40 requests.

I run my code on localhost, and it works OK.

Under review
Aleksandr Kurbatov 1 week ago in Crawlera • updated 1 week ago 2

Now any request to the site returns a 503 error: "Website crawl ban" or "Timeout from upstream server".

My plan is C50.


Thanks.
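
[Editor's note] While this is under review, a common client-side workaround is to back off and retry on 503s, inspecting the X-Crawlera-Error header to tell bans apart from upstream timeouts. A rough Python sketch (requests library, placeholder key and URL):

import time

import requests

API_KEY = "<API KEY>"  # placeholder
PROXIES = {"http": "http://{}:@proxy.crawlera.com:8010".format(API_KEY)}

def fetch_with_retry(url, retries=3, backoff=30):
    """Retry on 503, waiting `backoff` seconds between attempts."""
    for attempt in range(retries):
        r = requests.get(url, proxies=PROXIES, timeout=60)
        if r.status_code != 503:
            return r
        # Crawlera reports the reason for the 503 in this header:
        print("503 ({}), retrying in {}s".format(r.headers.get("X-Crawlera-Error"), backoff))
        time.sleep(backoff)
    return r

print(fetch_with_retry("http://example.com").status_code)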

Answered
g4s.evry 2 weeks ago in Crawlera • updated by Thriveni Patil (Support Engineer) 2 weeks ago 1

Hi,


I was able to access the URL below; today I am unable to access it.


http://help.scrapinghub.com/crawlera/


It says 404 Not found.

Answer

Hello,


We have moved to a new interface; you can find the Crawlera KB articles at https://helpdesk.scrapinghub.com/solution/folders/22000131039.


Regards,

Thriveni Patil

Answered
Regan 3 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 5 days ago 1

I get the following error when trying to access an SSL website through the proxy in C#: The remote server returned an error: (407) Proxy Authentication Required.


I have installed the certificate and tried the following two code methods:


1.

var key = _scrapingApiKey;
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

var encodedApiKey = Base64Encode(key);
request.Headers.Add("Proxy-Authorization", "Basic " + encodedApiKey);

request.Proxy = myProxy;
request.PreAuthenticate = true;

WebResponse response = request.GetResponse();


2.

var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

myProxy.Credentials = new NetworkCredential(_scrapingApiKey, "");

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Proxy = myProxy;

request.PreAuthenticate = true;

WebResponse response = request.GetResponse();


What is the correct way to make the proxy work when accessing SSL websites?

Answer

Hello,


The first code sample should work, but make sure to include the ":" after the API key.
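
[Editor's note] To make the answer concrete: the header value is just the Base64 of the key plus a trailing ":" (Basic auth with an empty password). A small Python sketch with a placeholder key:

import base64

api_key = "<API KEY>"  # placeholder
# Basic auth encodes "user:password"; the API key is the user and the
# password is empty, so the trailing ':' must be kept before encoding.
token = base64.b64encode((api_key + ":").encode("ascii")).decode("ascii")
print("Proxy-Authorization: Basic " + token)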

Answered
Mani Zia 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Hi All,

Please help me out. I am trying to crawl the site at the given link: https://en-ae.wadi.com/home_entertainment-televisions/?ref=navigation
but Scrapy (Python) is unable to crawl it. I also used these imports:


import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


but it is still returning only null on one line, nothing else. Looking forward to your kind response.

Thanks

Regards zia.
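
[Editor's note] For reference, a minimal sketch of how those imports are usually combined: Selenium fetches the JavaScript-rendered page, and TextResponse wraps the result so Scrapy selectors can run on it. The Firefox driver and the XPath are assumptions, not a tested solution for this site:

from scrapy.http import TextResponse
from selenium import webdriver

driver = webdriver.Firefox()  # any configured WebDriver works here
driver.get("https://en-ae.wadi.com/home_entertainment-televisions/?ref=navigation")

# Wrap the rendered HTML so Scrapy selectors can be used on it.
response = TextResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding="utf-8",
)
print(response.xpath("//title/text()").extract_first())
driver.quit()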

Answer

Hi Mani!


We can't provide Scrapy assistance here, but let me suggest other channels where you can ask:

1. StackOverflow - Scrapy

There's a vast community of Scrapy developers and users contributing actively there.

2. Github - Scrapy

The same applies there.


If your project requires urgent attention, please share your needs with us through:

https://scrapinghub.com/quote

so you can receive our sales assistance if you are considering hiring our professional services.


Best of luck with your project!


Pablo

Nayak 4 weeks ago in Crawlera 0

Hi,


I want to make a web request to the Google site using the Crawlera proxy along with the session API. After integrating, I observed that a few times we are unable to get a session id.

The code below shows the web request for Google.

C# Code
private static void GetResponse()
{
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
ServicePointManager.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;
//Proxy API Call
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
myProxy.Credentials = new NetworkCredential("< C10 plan API key >", "");
// Session API Call

HttpWebRequest sessionRequest = (HttpWebRequest)WebRequest.Create("http://proxy.crawlera.com:8010/sessions");
sessionRequest.Credentials = new NetworkCredential("< C10Plan API key >", "");
sessionRequest.Method = "POST";
HttpWebResponse sessionResponse = (HttpWebResponse)sessionRequest.GetResponse();
StreamReader srSession = new StreamReader(sessionResponse.GetResponseStream());
string sessionId = srSession.ReadToEnd();

// Google Request
string searchResults = "http://google.com/search?q=Ganesh+Nayak+K&num=100"; // spaces in the query must be URL-encoded
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(searchResults);
request.Proxy = myProxy;
request.PreAuthenticate = true;
request.Headers.Add("x-crawlera-use-https", "1");
request.Headers.Add("X-Crawlera-Session", sessionId);
request.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;

HttpWebResponse response = (HttpWebResponse)request.GetResponse(); // Getting the response from the server takes a long time

Stream resStream = response.GetResponseStream();
StreamReader sr = new StreamReader(resStream);
sr.ReadToEnd();
}

If we implement the same code without a session, we get responses more quickly and can process requests much faster.

Without a session I have processed a larger number of requests, but a few times I got the exception below:

--> Unable to connect to the remote server ---> A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 64.58.114.15:8010

Regards,
Ganesh Nayak K
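
[Editor's note] For comparison, here is the same session flow sketched in Python with the requests library (placeholder key): POST to /sessions to create the session, then send the returned id in X-Crawlera-Session so subsequent requests exit through the same outgoing IP.

import requests

API_KEY = "<API KEY>"  # placeholder
PROXIES = {"http": "http://{}:@proxy.crawlera.com:8010".format(API_KEY)}

# Create a session: POST to the proxy's /sessions endpoint; the response
# body is the session id.
session_id = requests.post(
    "http://proxy.crawlera.com:8010/sessions",
    auth=(API_KEY, ""),
).text

# Reuse the id on each request so they all use the same outgoing IP.
r = requests.get(
    "http://google.com/search?q=Ganesh+Nayak+K&num=100",
    proxies=PROXIES,
    headers={"X-Crawlera-Session": session_id},
)
print(r.status_code)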
Under review
Chris Fankhauser 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Currently, I have several wide-open API keys which I use on our production servers.


It would be very beneficial to have a restricted, development-only API key (limited to x requests per day, limited by IP address(es), etc.) which could be shared more widely than the production keys.


Is this possible to set up? If not, is it functionality which seems reasonable to implement?

Answer

Hi Chris, I'm not sure I understand correctly, but the idea behind the different API keys for Crawlera is that you can set up different Crawlera accounts, for example with different regions, as shown in:

https://helpdesk.scrapinghub.com/solution/articles/22000188398-regional-ips-in-crawlera

You can also share your needs so our engineers can evaluate whether it's possible to implement them in an Enterprise account project. If interested, please share as many details as possible through https://scrapinghub.com/quote.


Thanks, Chris, for always providing good questions and ideas that help us deliver a better service.


Best regards,


Pablo

Answered
FHERTZ 1 month ago in Crawlera • updated by Rashmi Vasudevan (Support Engineer) 1 month ago 1

Hello,


We need to crawl an NL website that is restricted to Netherlands IPs.

When we test, we get the message: "No available proxies"


Does that mean Crawlera doesn't have Netherlands IP addresses?


Thanks


Answer

Hello,


We do not have Netherlands proxies to add at the moment.

Apologies for the inconvenience caused.


Rashmi

Waiting for Customer
DPr 1 month ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 4 weeks ago 1

Hello,


I am using Crawlera and have tested it by making requests to my own website. In my HTTP log I see Crawlera's IPs instead of mine. That's OK; however, Google Analytics says that these visits are from my city and not from the different countries of Crawlera's IPs.


Is this normal? Does anyone know what the problem could be?


Thank you,


David

Answered
ganesh.nayak 1 month ago in Crawlera • updated by Nayak 4 weeks ago 2

Hi,


I have integrated my application with Crawlera using C#. When I make an HTTPS request I get the exception (407) Proxy Authentication Required. I have installed the certificates and am also sending requests through them.


If I try sending HTTP using the 'x-crawlera-use-https' request header it works fine, but per the documentation this header is deprecated. Please let me know how to make HTTPS requests without it.


I tried the same code as mentioned in the documentation, but it is still throwing exceptions:



using System;
using System.IO;
using System.Net;

namespace ProxyRequest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
            myProxy.Credentials = new NetworkCredential("<API KEY>", "");

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://twitter.com");
            request.Proxy = myProxy;
            request.PreAuthenticate = true;

            WebResponse response = request.GetResponse();
            Console.WriteLine("Response Status: "
                + ((HttpWebResponse)response).StatusDescription);
            Console.WriteLine("\nResponse Headers:\n"
                + ((HttpWebResponse)response).Headers);

            Stream dataStream = response.GetResponseStream();
            var reader = new StreamReader(dataStream);
            string responseFromServer = reader.ReadToEnd();
            Console.WriteLine("Response Body:\n" + responseFromServer);
            reader.Close();

            response.Close();
        }
    }
}


Regards,

Ganesh Nayak K

Answer

Hi Ganesh,


NetworkCredential will not work in this case. Just set your API key as a string.

In your request header, do something like:

var encodedApiKey = Base64Encode(apiKey);
request.Headers.Add("Proxy-Authorization", "Basic " + encodedApiKey);
request.Proxy = proxy;
request.PreAuthenticate = true;

Also keep the ":" appended to your API key for this to work.

Waiting for Customer
jcdeesign 1 month ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 3

Hi

When I use curl as in the example, everything works.

If I try to use PhantomJS in Selenium, I receive a 407.

service_args = [
'--proxy=proxy.crawlera.com:8010',
'--proxy-auth=XXXXXXXXXXXXXXX:',
'--proxy-type=http',
'--load-images=false',
'--ssl-protocol=any',
'--webdriver-logfile=phantom.log',
'--ssl-client-certificate-file='+CRAWLERA_SERT,
'--ignore-ssl-errors=true',
'--webdriver-loglevel=DEBUG'
]


driver = webdriver.PhantomJS(executable_path=settings.PHANTOMJS, desired_capabilities=dcap, service_args=service_args)


I receive:

'<html><head></head><body></body></html>'


In the log:


{"name":"X-Crawlera-Error","value":"bad_proxy_auth"}


The key works with curl.
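
[Editor's note] One way to narrow down a bad_proxy_auth from PhantomJS is to verify the same key outside it; if a plain request through the proxy succeeds, the problem is in how PhantomJS passes the credentials rather than the key itself. A minimal Python sketch with a placeholder key:

import requests

API_KEY = "XXXXXXXXXXXXXXX"  # placeholder - the same key used in service_args
r = requests.get(
    "http://httpbin.org/ip",
    proxies={"http": "http://{}:@proxy.crawlera.com:8010".format(API_KEY)},
)
# On an auth failure Crawlera sets this header to "bad_proxy_auth":
print(r.status_code, r.headers.get("X-Crawlera-Error"))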

jarek 1 month ago in Crawlera 0

Hello, I'm using the phantomjs from amir20: https://github.com/amir20/phantomjs-node

and from what I read, I could enable a session if I pass true in my: currentPage.on('onResourceRequested', true, onResourceRequested);


and now I'm not getting the error: "networkRequest.setHeader is not a function" anymore. In my function:



function onResourceRequested(requestData, networkRequest) {
    requestCounter += 1;
    // yellow color for the requests:
    if (printLogs) console.log('\x1b[33m%s\x1b[0m: ', '> ' + requestData.id + ' - ' + requestData.url);
    // Ask Crawlera to create a session on the first request only.
    if (!this.crawleraSessionId) {
        networkRequest.setHeader('X-Crawlera-Session', 'create');
    }
}



BUT onResourceReceived is now not working: it's not returning the HTML data, though it worked before I passed true to onResourceRequested.


Any advice?