Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
ganesh.nayak 14 hours ago in Crawlera 0

Hi,


I have integrated my application with Crawlera using C#. When I make an HTTPS request I get a (407) Proxy Authentication Required exception. I have installed the certificates and I am sending the request through them.


If I send the request over HTTP with the 'x-crawlera-use-https' request header it works fine, but according to the documentation that header is deprecated. Please let me know how to make HTTPS requests without the header.


I tried the same code as shown in the documentation, but it still throws the exception:



using System.IO;
using System;
using System.Net;

namespace ProxyRequest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
            myProxy.Credentials = new NetworkCredential("<API KEY>", "");

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://twitter.com");
            request.Proxy = myProxy;
            request.PreAuthenticate = true;

            WebResponse response = request.GetResponse();
            Console.WriteLine("Response Status: " 
                + ((HttpWebResponse)response).StatusDescription);
            Console.WriteLine("\nResponse Headers:\n" 
                + ((HttpWebResponse)response).Headers);

            Stream dataStream = response.GetResponseStream();
            var reader = new StreamReader(dataStream);
            string responseFromServer = reader.ReadToEnd();
            Console.WriteLine("Response Body:\n" + responseFromServer);
            reader.Close();

            response.Close();
        }
    }
}


Regards,

Ganesh Nayak K

jcdeesign 17 hours ago in Crawlera 0

Hi,

When I use curl as in the example, everything works.

If I try to use PhantomJS with Selenium, I receive a 407.

service_args = [
    '--proxy=proxy.crawlera.com:8010',
    '--proxy-auth=XXXXXXXXXXXXXXX:',
    '--proxy-type=http',
    '--load-images=false',
    '--ssl-protocol=any',
    '--webdriver-logfile=phantom.log',
    '--ssl-client-certificate-file=' + CRAWLERA_SERT,
    '--ignore-ssl-errors=true',
    '--webdriver-loglevel=DEBUG'
]


driver = webdriver.PhantomJS(executable_path=settings.PHANTOMJS, desired_capabilities=dcap, service_args=service_args)


I receive:

'<html><head></head><body></body></html>'


and in the log:


{"name":"X-Crawlera-Error","value":"bad_proxy_auth"}


The same API key works with curl.

jarek yesterday at 5:35 a.m. in Crawlera 0

Hello, I'm using phantomjs-node from amir20: https://github.com/amir20/phantomjs-node

From what I read, I could enable a session if I pass true in my call: currentPage.on('onResourceRequested', true, onResourceRequested);


Now I'm not getting the error "networkRequest.setHeader is not a function" anymore. In my function:



function onResourceRequested(requestData, networkRequest) {
    requestCounter += 1;
    // yellow color for the requests:
    if (printLogs) console.log('\x1b[33m%s\x1b[0m: ', '> ' + requestData.id + ' - ' + requestData.url);
    if (!this.crawleraSessionId) {
        networkRequest.setHeader('X-Crawlera-Session', 'create');
    }
}



BUT onResourceReceived is now not working: it's not returning the HTML data, whereas before I passed true to onResourceRequested it worked.


Any advice?



Not a bug
Braulio Ríos Ferreira 3 days ago in Crawlera • updated by Pablo Vaz (Support Engineer) yesterday at 6:56 p.m. 1

The spider test code is the following (I've removed irrelevant code, but this spider is tested and reproduces the same error):


# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy.spiders import Spider

class AlohaTestSpider(Spider):
    name = "aloha_test"

    def __init__(self, *args, **kwargs):
        super(AlohaTestSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        site = 'https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList/'
        yield Request(url=site,
                      method='POST',
                      callback=self.parse,
                      headers={"Content-Type": "application/json"})

    def parse(self, response):
        print(response.body)

When I run this spider:

$ scrapy crawl aloha_test


I keep getting the following error:

2017-03-20 12:33:11 [scrapy] DEBUG: Retrying <POST https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList/> (failed 1 times): 400 Bad Request


In the original spider I have a retry decorator, and this error repeats for 10 retries.


I only get this error with this specific request. In the real spider, which makes more HTTPS requests beforehand, it only fails when this request is reached (the previous HTTPS requests return 200 OK).


Please note that this is a POST request that doesn't have any data. I don't know if this is relevant to you, but this is the only particularity that this request has in my spider.


If I deactivate "CrawleraMiddleware" and activate "CustomHttpProxyMiddleware" in DOWNLOADER_MIDDLEWARES (settings.py), I can make the request without error.


If I make this request using curl, I can't reproduce the error even when using Crawlera; both of the following requests work fine:


$ curl --cacert ~/crawlera-ca.crt -H 'Content-Type: application/json' -H 'Content-Length: 0' -X POST -vx proxy.crawlera.com:8010 -U MY_API_KEY https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList


$ curl -H 'Content-Type: application/json' -H 'Content-Length: 0' -X POST https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList


I've tried everything I can think of (Crawlera sessions, disabling Crawlera cookies, different HTTP headers), but I can't figure out a way to get this request to work with Crawlera.


I guess it has to do with the Crawlera middleware in Scrapy, but I don't know what Crawlera might be doing with the HTTP headers that causes this request to fail.

Any suggestions about what could be causing this error?

Answer
Pablo Vaz (Support Engineer) yesterday at 6:56 p.m.

Hi Braulio,


As you correctly tested with curl, it seems your Crawlera account is working fine.

Also, all projects using the scrapy-crawlera integration are working fine on our platform.


Regarding the integration with Scrapy, I suggest you review the information provided here:

http://help.scrapinghub.com/crawlera/using-crawlera-with-scrapy


To learn more, please see the official documentation:

http://scrapy-crawlera.readthedocs.io/en/latest/
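For reference, a typical scrapy-crawlera configuration in settings.py looks roughly like the sketch below (based on the scrapy-crawlera documentation; the API key placeholder is yours to fill in):

# settings.py -- minimal scrapy-crawlera setup (sketch)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API KEY>'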


If your project needs urgent attention, you can also consider hiring our experts. We can set up scrapy-crawlera projects that fit your needs, saving you a lot of time and resources. If interested, let me invite you to fill out our free quote request: https://scrapinghub.com/quote


Best regards,


Pablo Vaz

Support team

Answered
noa.drach-crawlera 3 days ago in Crawlera • updated by Pablo Vaz (Support Engineer) yesterday at 7:09 p.m. 1

I have the C10 subscription, and when I try to make 10 parallel calls I get a "parallel connection limit reached" error.


I dispatch the calls to Crawlera in a simple loop:


for (var index = 0; index <10; index++) {..}


When I change the loop to run 9 calls it works OK, so it's not clear to me how the limit is being reached.


I contacted you on the support chat and got this response:

"The bets way to ensure you make 10 concurrent requests and not go beyond that value is to set the concurrent_requests parameter to 10 in your crawlera settings."


This is my only Crawlera-related code:

var new_req = request.defaults({
    'proxy': 'http://<API key>:@proxy.crawlera.com:8010'
});

So it's not clear to me what "Crawlera settings" means.

Answer
Pablo Vaz (Support Engineer) yesterday at 7:09 p.m.

Hey Noa, I saw you requested support through Freshdesk and Thriveni is assisting you.

We are here for any further inquiries. Regarding the question you posted here, the best way to use Crawlera with Node.js is shown in: https://doc.scrapinghub.com/crawlera.html#node-js

Crawlera settings are available for Scrapy projects, if you are interested in trying them:

http://scrapy-crawlera.readthedocs.io/en/latest/
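To illustrate what "Crawlera settings" means in a Scrapy project, concurrency is capped in settings.py. A minimal sketch, assuming the scrapy-crawlera middleware is already enabled as described in the documentation above:

# settings.py -- cap Scrapy at 10 concurrent requests to match a C10 plan (sketch)
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 10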


Kind regards,

Pablo

Answered
noa.drach-crawlera 3 days ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 3 days ago 1

I want to be able to stop my code if I reach my monthly quota. Is there an API call or error code that indicates this state?

Answer

Hello,


There's currently no API for this. However, if you reach the monthly quota the account is suspended, you receive a 403 User Account Suspended error, and the response also includes the header X-Crawlera-Error: user_suspended.
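As a rough illustration of how that response could be detected from client code, here is a minimal Python sketch (using the requests library; the target URL and API key placeholder are illustrative):

# Sketch: stop crawling once Crawlera reports the account as suspended.
import requests

proxies = {
    'http': 'http://<API KEY>:@proxy.crawlera.com:8010',
    'https': 'http://<API KEY>:@proxy.crawlera.com:8010',
}

resp = requests.get('http://httpbin.org/ip', proxies=proxies)
if resp.status_code == 403 and resp.headers.get('X-Crawlera-Error') == 'user_suspended':
    raise SystemExit('Monthly quota reached: account suspended, stopping the crawl.')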

Answered
Nayak 1 week ago in Crawlera • updated by Pablo Vaz (Support Engineer) 6 days ago 1

Hi,


We want to make web requests to one popular domain continuously, so we created a background service that requests the domain using .NET HttpWebRequest.
We ran into the problem of IP bans, and then I came across Crawlera.
I saw the post "How to use Crawlera in C# .NET":
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
myProxy.Credentials = new NetworkCredential("<API KEY>", "");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("some domain url");
request.Proxy = myProxy;
request.PreAuthenticate = true;

If I make web requests to a website repeatedly and an IP ban occurs, will Crawlera automatically handle the request with a different IP, or do we need to send the same request again?


If it handles this automatically, do we need to send any header information apart from what is shown in the code above?

Regards,

Nayak


Answer

Hi Nayak, yes, Crawlera will retry automatically. After 5 attempts (by default), if the site is still banning IPs, Crawlera will give you a ban status and will try with another request.

Even though Crawlera should protect you against bans, it sometimes runs out of capacity and will return a 503 response. Because of this, we recommend that you retry 503 responses up to 5 times. Consider using the X-Crawlera-Next-Request-In header to retry more efficiently.
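As a rough sketch of that retry advice (shown in Python with the requests library to keep it short; the same idea applies to HttpWebRequest in C#, and the target URL and API key are placeholders):

# Sketch: retry 503 responses up to 5 times, waiting as suggested by Crawlera.
import time
import requests

proxies = {'http': 'http://<API KEY>:@proxy.crawlera.com:8010'}

for attempt in range(5):
    resp = requests.get('http://example.com/', proxies=proxies)
    if resp.status_code != 503:
        break
    # X-Crawlera-Next-Request-In carries a suggested delay before retrying
    # (treated here as milliseconds, which is my reading of the Crawlera docs).
    delay_ms = int(resp.headers.get('X-Crawlera-Next-Request-In', 1000))
    time.sleep(delay_ms / 1000.0)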

Kind regards,

Pablo

Started
Chris Fankhauser 2 weeks ago in Crawlera • updated 12 hours ago 3

I'm noticing some unusual sudden behavior from Crawlera lately...


First off, I log all outgoing requests/responses locally and build a dashboard using that log information so that I can see if something's gone horribly wrong on my end. I've noticed a couple of things which are disconcerting:


1) It looks like, as of March 3rd, `X-Crawlera-Version`, `X-Crawlera-Debug-Request-Time`, and `X-Crawlera-Debug-UA` are no longer being returned? This isn't a deal-breaker, but it makes me suspicious that something substantial has changed in regards to:


2) The number of requests seems to be dramatically underreported in the Crawlera dashboard as of, surprise, March 3rd. The number of "Failed" requests jumps up on the same day.


Here's my internal dashboard's graph of all requests/responses using Crawlera:


...and here's the graph in the Crawlera dashboard:



So... something seems to be clearly off here. Is anyone able to shine some light on the discrepancies I'm seeing? Thanks.

Answer
Pablo Vaz (Support Engineer) yesterday at 7:11 p.m.

Hi Chris,


Several fixes have been released regarding the issue you reported.

How is the performance now?


Kind regards,

Pablo Vaz

Support team

Waiting for Customer
Dege 2 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 4 hours ago 7

Using the examples you provided in the documentation:

https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-casperjs-phantomjs-and-spookyjs


wget https://doc.scrapinghub.com/_downloads/crawlera-ca.crt

wget https://github.com/ariya/phantomjs/raw/master/examples/rasterize.js

phantomjs --ssl-protocol=any --proxy="proxy.crawlera.com:8010" --proxy-auth="<API KEY>:''" --ssl-client-certificate-file=crawlera-ca.crt rasterize.js https://twitter.com twitter.jpg


I get the error message: Unable to load the address!


Activating PhantomJS debugging (--debug=yes), you can isolate the error:

SSL Error: "The issuer certificate of a locally looked up certificate could not be found"


This error can be bypassed with the parameter --ignore-ssl-errors=true, which unfortunately causes another issue:

Resource request error: QNetworkReply::NetworkError(ProxyAuthenticationRequiredError) ( "Proxy requires authentication" )


The final command to replicate the issue is:

phantomjs --debug=yes --ignore-ssl-errors=true --ssl-protocol=any --proxy="proxy.crawlera.com:8010" --proxy-auth="<API KEY>:''" --ssl-client-certificate-file=crawlera-ca.crt rasterize.js https://twitter.com twitter.jpg


My PhantomJS version is 2.1.1 (the latest one)


Answered
cangokalp 3 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 2 days ago 1

Hi,


For some URLs my log shows ['partial'] at the end,


and when I googled it I found this on Stack Overflow:


You're seeing ['partial'] in your logs because the server at vallenproveedora.com.mx doesn't set the Content-Length header in its response; run curl -I to see for yourself. For more detail on the cause of the partial flag
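For reference, the Python equivalent of that curl -I check is a HEAD request and a look at the response headers; a minimal sketch with the requests library (the URL is the one named in the quote above):

# Sketch: check whether the server sets Content-Length, the cause of the ['partial'] flag.
import requests

resp = requests.head('http://vallenproveedora.com.mx/')
print('Content-Length' in resp.headers, resp.headers.get('Content-Length'))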


Is this something on Crawlera's end? How can I fix it?


Best,


Can

Answer

Hello,


The 419s were actually 429 error codes (now correctly displayed as 429), which means the concurrent connection limit of your Crawlera plan was exceeded. For other possible error codes, please see: https://doc.scrapinghub.com/crawlera.html#errors

Answered
LuckyDucky 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

I've added more servers but I'm not seeing an increase in per minute throughput. Does Crawlera throttle?

Answer

Hi LuckyDucky,

Please check your configuration settings; you can configure throttling there as well. Take a few minutes to explore these articles:

http://help.scrapinghub.com/crawlera/crawlera-best-practices

and
http://help.scrapinghub.com/scrapy-cloud/addons/auto-throttle-addon

For more information and further settings, it is always useful to check:

https://doc.scrapinghub.com/crawlera.html#crawlera-api
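For a Scrapy project, the throttling described in the add-on article maps to Scrapy's AutoThrottle settings; a minimal sketch for settings.py (the values are illustrative, not recommendations):

# settings.py -- AutoThrottle sketch; tune the values for your plan and target sites
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0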

Kind regards and happy weekend,

Pablo

Answered
Francois 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Hi,


I wrote a CasperJS script, and it worked with other proxies (an IP list and/or Scrapoxy).

The script navigates the website because I need JavaScript interaction.


I installed Crawlera and it works perfectly, but Casper doesn't succeed in clicking my first link:


` CasperError: Cannot dispatch mousedown event on nonexistent selector: #hlb-view-cart-announce`


Can you explain why, and how can I fix that?

Answer

Hi Francois,


To know more about how to set up Crawlera using CasperJS, please visit:
https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-casperjs-phantomjs-and-spookyjs


If interested, please consider requesting a free quote for professional assistance: https://scrapinghub.com/quote. Our developers can help you set up and deploy your projects, solving script issues for you.

Kind regards,

Pablo

Answered
1669573348 1 month ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 1

I want to change the region of an existing Crawlera account. How do I do that?

Answer

Hello,


This article will guide you through creating an account for a particular region: http://help.scrapinghub.com/crawlera/regional-ips-in-crawlera

Completed
silverstart1987913 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 1

curl -vx proxy.crawlera.com:8010 -U <API_KEY>: http://newyork.craigslist.org/reply/nyc/sls/5986555995





* Trying 64.58.114.47...
* Connected to proxy.crawlera.com (64.58.114.47) port 8010 (#0)
* Proxy auth using Basic with user '<API_KEY>'
> Host: newyork.craigslist.org
> Proxy-Authorization: Basic ODA5N2VkZGQzYzIyNGUyMGE2NzYyZmI3NTRhYjhkMmE6
> User-Agent: curl/7.47.0
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 503 Service Unavailable
< Connection: close
< Content-Length: 17
< Content-Type: text/plain
< Date: Mon, 06 Feb 2017 07:54:01 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< Retry-After: 1
< X-Crawlera-Error: slavebanned
< X-Crawlera-Slave: 191.96.248.108:3128
< X-Crawlera-Version: 1.18.0-42b9dc
<
* Closing connection 0


Answer

Hi Silverstart,


Crawlera does not solve captchas. However, it will detect redirects to captcha pages (for most sites) and will retry until it hits a clean page. Unsuccessful attempts won't count towards the monthly quota.
http://help.scrapinghub.com/crawlera/crawlera-and-captchas

Also, crawling Craigslist is not easy, and many clients approach it with our professional services. Our developers can help you create and deploy powerful crawlers using the most advanced techniques and tools, saving you a lot of time and resources.
If interested, don't hesitate to contact us at help@scrapinghub.com or request a free quote at: https://scrapinghub.com/quote.


Kind regards!

Pablo

Answered
rdelbel 2 months ago in Crawlera • updated by dlb 6 hours ago 2

I can't seem to get Selenium working with Polipo.

Crawlera itself works fine when I test it with curl.

I have never used polipo and might be doing something obviously wrong with it.

When I fetch my IP from httpbin.org/ip via Selenium, I get my normal IP, not the proxy IP.

These requests are also not logged in the Crawlera dashboard.

See details and debugging steps below.

I am on an Ubuntu 16.04 LTS headless server on AWS EC2.

My security group is wide open for incoming and outgoing connections


I installed polipo with $ sudo apt-get install polipo

I have added the two lines setting parentProxy and parentAuthCredentials to the /etc/polipo/config file.

I restarted polipo with $ sudo service polipo restart

I checked /var/log.polipo/polipo.log which states: Established listening socket on port 8123.

I am using a headless server, but I fetched the HTML response with $ curl http://localhost:8123/polipo/config, pasted it into my local environment, opened it in a browser, and it seems parentProxy is correct and parentAuthCredentials is hidden, as it is supposed to be.
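For reference, the two parent-proxy lines in /etc/polipo/config typically look like the following sketch (based on the Crawlera documentation for Polipo; substitute your real API key):

parentProxy = "proxy.crawlera.com:8010"
parentAuthCredentials = "<API KEY>:"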


I tested the following Python script, but it prints out my normal IP.


from pyvirtualdisplay import Display
from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.proxy import *


display = Display(visible=0, size=(1920, 1080))
display.start()


polipo_proxy = "localhost:8123"


proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': polipo_proxy,
    'ftpProxy': polipo_proxy,
    'sslProxy': polipo_proxy,
    'noProxy': ''
})


driver = webdriver.Firefox(proxy=proxy)
driver.get('http://httpbin.org/ip')
print(driver.page_source)


Answer

Hi! To learn more about how to use Crawlera with Selenium and Polipo, please see our documentation here:
https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-selenium-and-polipo

Best regards!