Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
Gustav Palsson 15 hours ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 hour ago 2

It seems like all the proxies from outside Sweden are blocked, which makes my scrapes hard to do.

Can you help?

Answer

Hi Gustav, perhaps it was something related to the target domain. I recently checked and many proxies "all around the world" (as you correctly pointed out =)) are making successful requests.

Don't hesitate to reach out to us through Intercom, or here again, if you have further questions.

Kind regards,

Pablo

0
Answered
jason 2 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 week ago 3

I have a client whom I referred to Crawlera, and I had him provide the API Key to me to run the routine. When I input it and test, it works, but within a day or so I get the error: "Failed to match the login check". If I have him log in and send me the API Key again, it has changed, and the new one works. How can I prevent it from changing?

Answer

Hey Jason, our Crawlera engineers inform us that no changes can occur to the API key without your client's authorization.

Don't hesitate to ask if you need further assistance.

Best regards.

0
Answered
alfons 3 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 2

Hi,


I've signed up for the dedicated enterprise pool and can't figure out how to configure my Scrapy script to use my dedicated proxy URL instead of the default: proxy.crawlera.com:8010.

Would be great if someone could help me out with this.

Thank you.
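A minimal sketch of the kind of settings.py override that should point the scrapy-crawlera middleware at a dedicated endpoint, assuming that middleware is in use; the hostname below is a hypothetical placeholder for the one assigned to your Enterprise account:

# settings.py -- sketch only; replace the placeholders with your own values.
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'

# CRAWLERA_URL overrides the default proxy.crawlera.com:8010 endpoint.
# '<your-pool>.crawlera.com' stands in for your dedicated proxy host.
CRAWLERA_URL = 'http://<your-pool>.crawlera.com:8010'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}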

0
Answered
erainfotech.mail 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 2

https://www.airbnb.com/api/v2/calendar_months?listing_id=13943652&month=5&year=2017&count=1&_format=host_calendar_detailed&key=d306zoyjsyarp7ifhu67rjxn52tv0t20

HTTP/1.1 503 Service Unavailable
Connection: close
Content-Length: 17
Content-Type: text/plain
Date: Sat, 17 Dec 2016 08:03:25 GMT
Proxy-Connection: close
Retry-After: 1
X-Crawlera-Error: slavebanned
X-Crawlera-Slave: 202.75.58.168:60099
X-Crawlera-Version: 1.14.0-c1abfa

Website crawl ban
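The X-Crawlera-Error: slavebanned header means the outgoing node was banned by the target site for that request. A common pattern is to honour Retry-After and retry, so the request can go out through a different node. A rough sketch with the Python requests library, assuming a standard Crawlera proxy setup (the API key and target URL are placeholders):

import time
import requests

PROXIES = {
    "http": "http://<API_KEY>:@proxy.crawlera.com:8010/",
    "https": "http://<API_KEY>:@proxy.crawlera.com:8010/",
}

def fetch_with_retries(url, max_retries=5):
    # verify=False because the Crawlera CA cert is not installed here.
    for attempt in range(max_retries):
        resp = requests.get(url, proxies=PROXIES, verify=False)
        if resp.headers.get("X-Crawlera-Error") == "slavebanned":
            # Wait the suggested amount and retry through another node.
            time.sleep(int(resp.headers.get("Retry-After", 1)))
            continue
        return resp
    return resp

response = fetch_with_retries("https://<target-url>")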

0
Answered
glebon 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 2
0
Answered
Chris Fankhauser 2 months ago in Crawlera • updated 4 weeks ago 6

I'm having some trouble using Crawlera with a CasperJS script. I'm specifying the `--proxy` and `--proxy-auth` parameters properly along with the cert and `ignore-ssl-errors=true` but continue to receive a "407 Proxy Authentication Required" response.


I'm using:

Casper v1.1.3

Phantom v2.1.1


My command-line (edited for sneakiness):

$ casperjs --proxy="proxy.crawlera.com:8010" --proxy-auth="<api-key>:''" --ignore-ssl-errors=true --ssl-client-certificate-file="../crawlera-ca.crt" --output-path="." --url="https://<target-url>" policies.js

Which results in this response (which I format as JSON):

{"result":"error","status":407,"statusText":"Proxy Authentication Required","url":"https://<target-url>","details":"Proxy requires authentication"}

I've tried [and succeeded] using the Fetch API, but unfortunately this is a step backward for me since the target URL is an Angular-based site which requires a significant amount of interaction/JS manipulation before I can dump the page contents.


I've also attempted to specify the proxy settings within the script, like this:

var casper = require('casper').create();
phantom.setProxy('proxy.crawlera.com','8010','manual','<api-key>','');

...but no dice. I still get the 407 response.


I don't think there's an issue with the proxy, as the curl tests work fine, but an issue with integrating the proxy settings with Casper/Phantom. If anyone has a potential solution or known working workaround, I'd love to hear it... :)

Answer

I would recommend trying to first get it to work directly with phantomjs (without the casperjs wrapping functions) to rule out it having anything to do with casperjs.


Related to this, you're using `--proxy-auth="<api-key>:''"` while the expected form is `--proxy-auth="username:password"`; with Crawlera the API key acts as the username with an empty password, so the literal `''` may be getting sent as the password.

0
Answered
upcretailer 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 week ago 2

Requests to Walmart product pages take extremely long (many on the order of minutes). Walmart product pages use HTTPS (if you use HTTP you get redirected and upgraded to HTTPS), and I have tried calling both HTTPS and HTTP directly (the redirect times out when upgrading the request). HTTP requests that get redirected only respond with a 301 and a body stating "Redirecting..." without actually following the redirect. Calling the HTTPS address directly takes 20 seconds to sometimes minutes to respond, and it is even slow (although not to the same extent as Walmart) when using HTTPS for other sites. Sometimes it will time out after 2 minutes and not respond at all. After reading through some of the posts here I tried adding "X-Crawlera-Use-HTTPS": 1, but that doesn't seem to make much difference. However, using some other proxies and the Python requests library without Scrapy gets me back to reasonable response times. Am I doing something wrong?



Settings file:


import os
import sys
import time
import re

import django

# ...django stuff..

BOT_NAME = 'walmart_crawler'

LOG_LEVEL = 'DEBUG'

LOG_FILE = os.path.join(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
    'logs',  # log directory
    "%s--%s.log" % (BOT_NAME, time.strftime('%Y-%m-%dT%H-%M-%S'))
)

LOG_FORMAT = '%(asctime)s [WalmartCrawler] %(levelname)s: %(message)s'

LAST_JOB_NUM = max(
    [
        int(re.match(r'\w+?_crawl_(?P<jobnum>\d+)', file).group('jobnum'))
        for file in os.listdir(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'crawls'))
        if re.match(r'\w+?_crawl_(?P<jobnum>\d+)', file)
    ] or (0,)
)

RESUME_LAST_JOB = False

NEXT_JOB_NUM = (LAST_JOB_NUM + 1) if LAST_JOB_NUM else 1

JOBDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'crawls', '%s_crawl_%d' % (
    BOT_NAME,
    (LAST_JOB_NUM if RESUME_LAST_JOB else NEXT_JOB_NUM)
))

SPIDER_MODULES = ['walmart_crawler.spiders']
NEWSPIDER_MODULE = 'walmart_crawler.spiders'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    # "X-Crawlera-Cookies": "disable",
    "X-Crawlera-Debug": "request-time,ua",
    "X-Crawlera-Use-HTTPS": 1,
}

REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 40

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = os.environ['CRAWLERA_KEY']

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 80,
    'scrapy_crawlera.CrawleraMiddleware': 610,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 400,
}

RETRY_TIMES = 1
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'walmart_crawler.pipelines.WalmartCrawlerPipeline': 300,
}

Answer

Hi Upcretailer,

First, sorry for the late answer. In the past 3 weeks we have experienced heavy bans from several sites. New machine-learning improvements have strengthened security on those sites and made them harder to scrape.
Our Crawlera team has been investigating and has performed some actions to deal with this issue.
Please let us know if you continue to experience the problems you reported.

Best regards!

0
Answered
Albert Dieleman 3 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 2 months ago 1

I'm using Google Apps Script, which is based on JavaScript. I use the UrlFetchApp.fetch function, where you pass in the URL and the headers. Any examples of syntax to set up the headers for calling Crawlera?

Answer

Hi, I don't see that Google Apps Script supports proxy functionality (https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app), so in that case you should use the Fetch API:


curl -u APIKEY: http://proxy.crawlera.com:8010/fetch?url=http://httpbin.org/ip
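For reference, the same Fetch API call can be reproduced from other HTTP clients as well; here is a small sketch with the Python requests library (the API key is a placeholder):

import requests

# Crawlera's Fetch API: the proxy fetches the URL on your behalf over
# plain HTTP, so no client-side proxy settings are required.
resp = requests.get(
    "http://proxy.crawlera.com:8010/fetch",
    params={"url": "http://httpbin.org/ip"},
    auth=("<API_KEY>", ""),  # API key as username, empty password
)
print(resp.status_code, resp.text)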


Regards,


Tomas

0
Answered
Vishal Tandon 3 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 1 week ago 1

Hi folks, I was just trying the service before making a purchase, so I made a GET request to http://httpbin.org/ip just to make sure that the IP I get in the response is not mine, but I am getting the following error.


requests.exceptions.ConnectionError: HTTPConnectionPool(host='proxy.crawlera.com', port=8010): Max retries exceeded with url: http://httpbin.org/ip (Caused by : [Errno 54] Connection reset by peer)


Please suggest how I can test the service successfully before making a purchase.


Thanks.


Answer

Hi, could you share the code? Did you follow the documentation?


https://doc.scrapinghub.com/crawlera.html#python


thanks
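The documented Python pattern looks roughly like the sketch below (the API key is a placeholder); it can be used to confirm that requests go out through a Crawlera IP rather than your own:

import requests

url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
api_key = "<API_KEY>"  # placeholder: your Crawlera API key

# The API key is the proxy username; the password is left empty.
proxies = {
    "http": "http://%s:@%s:%s/" % (api_key, proxy_host, proxy_port),
    "https": "http://%s:@%s:%s/" % (api_key, proxy_host, proxy_port),
}

response = requests.get(url, proxies=proxies)
print(response.text)  # should print a Crawlera IP, not yours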

0
Answered
Aminah Nuraini 3 months ago in Crawlera • updated by Pablo Hoffman (Director) 3 months ago 5

Even though Crawlera has a lot of IPs, if all of the IPs have already been banned, the IPs will need to be renewed. Does Crawlera do that?

Answer

Crawlera works differently: it protects the IPs from getting banned instead of just rotating them like regular proxy providers do.


In the standard plans (which use a shared pool of IPs) the quality of the IPs is constantly being monitored and improved via machine-learning-powered tasks. In the Enterprise plans, you can request manual assistance to improve the quality of your crawls, and this assistance sometimes involves rotating IPs.

0
Answered
lila.au 3 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 3 months ago 1

I need to know all of the IPs of proxy.crawlera.com, as our firewall can only set access rules by IP; it does not support domain-based rules. Thanks.

Answer

Hi, regarding the IPs for the proxy.crawlera.com sub-domain:

They are not static, so they are likely to change, not that often though.


An approach to white-listing the IPs in your firewall could be to flush the DNS cache on any machine and get the new IPs by resolving the sub-domain when needed.
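A minimal sketch of that lookup in Python, to be re-run whenever the firewall rules need refreshing:

import socket

# Resolve the current IPs behind proxy.crawlera.com. They are not
# static, so this lookup (and the firewall whitelist) needs to be
# refreshed periodically.
hostname = "proxy.crawlera.com"
_, _, addresses = socket.gethostbyname_ex(hostname)
print(addresses)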



0
Answered
Zom 4 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 4 months ago 1

Please can someone provide sample code written in Perl for Crawlera? I have a system which only allows Perl.

Answer

Hi, challenge accepted! It was as easy as googling around and changing the variables to the Crawlera proxy settings:


https://stackoverflow.com/questions/10155971/http-request-not-going-through-proxy/10156377#10156377


and here is the code http://pastebin.com/9RRyvnY3


which resulted in:


perl test.pl
{
  "origin": "xx.xx.xx.xx" # -> proxy ip :)
}

Thanks for posting

0
Answered
Tall Steve 4 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 4

I have a page that requires .NET's __VIEWSTATE variable and a few others posted to it. It is therefore a 2 stage process.

1. Grab the first page with a GET and extract the __VIEWSTATE variable contents, then

2. POST these to the next page to view the contents.

Part 1 works and I get the viewstate var contents OK, but then part 2 times out after 30 seconds. I am using the same cURL handle and not re-initialising curl after part 1. This works fine when not using the Crawlera proxy; it only fails when I use the proxy.


Code - curl and proxy already initialised, just doing a second call with POST VARS:

curl_setopt($curl, CURLOPT_POST, 1); // 0 for a GET request, 1 for POST
$postData = array(
    '__VIEWSTATE' => $viewstate,
    '__VIEWSTATEGENERATOR' => $viewstategenerator,
    '__EVENTVALIDATION' => $eventvalidation,
    'ctl00$cphMainContent$tabDocuments' => 'Documents'
);
curl_setopt($curl, CURLOPT_POSTFIELDS, $postData);
curl_setopt($curl, CURLOPT_REFERER, "https://websiteURL.com");
$contents = curl_exec($curl); // Grab the page


Do POST vars work with Crawlera?

Thanks,

Steve

Answer

I see, it could be as you said. Have you tried using Crawlera sessions (https://doc.scrapinghub.com/crawlera.html#sessions-and-request-limits), without changing the UA or resetting cookies between the two requests?
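POST requests do go through Crawlera; the usual pitfall in a two-stage __VIEWSTATE flow is that the GET and the POST leave through different outgoing IPs. Below is a rough sketch of pinning both requests to one session via the X-Crawlera-Session header, shown with the Python requests library since the same idea applies to cURL (the API key and URLs are placeholders):

import requests

proxies = {
    "http": "http://<API_KEY>:@proxy.crawlera.com:8010/",
    "https": "http://<API_KEY>:@proxy.crawlera.com:8010/",
}

# Stage 1: GET the form page and ask Crawlera to create a session.
r1 = requests.get("https://<target-url>", proxies=proxies,
                  headers={"X-Crawlera-Session": "create"}, verify=False)
session_id = r1.headers["X-Crawlera-Session"]
# ...extract __VIEWSTATE and friends from r1.text here...

# Stage 2: POST through the same outgoing IP by reusing the session id.
r2 = requests.post("https://<target-url>", proxies=proxies,
                   headers={"X-Crawlera-Session": session_id},
                   data={"__VIEWSTATE": "..."}, verify=False)
print(r2.status_code)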


0
Answered
kirimaks 4 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 2

Hello. I changed my plan from C10 to C50 and Crawlera started to work very strangely. Before, I went through two C10 plans and crawled 50,000+ pages. When I changed the plan it started to work slowly, and it looks like there are 2 or 3 times more banned requests and responses with captchas. Right now maybe 1 of 5 requests works. I didn't change my scraper's code (I use CasperJS). Also, I'm not sure which user agent will be used if I don't set any user agent in the Crawlera header?

Answer

Hi kirimaks!

Have you experienced any more issues regarding the C50 plan performance?

Regards!

0
Answered
Eyeless 4 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 4 months ago 1

Hello,


As I understood, Crawlera selects a new proxy for each request, but I would like to make, for instance, 10 requests from one IP and then 10 from another one, not 20 requests from 20 different IPs. I've found a 3-year-old topic saying that I could pass the 'X-Crawlera-Slave' response header into my next request's headers to use the same proxy, but Martin also wrote that this feature was not available. Maybe the situation has changed now and I can use the same proxy for several requests somehow?

Answer

Hi, reading your post it seems sessions fit your requirement: https://doc.scrapinghub.com/crawlera.html#sessions
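The gist is something like the sketch below (Python requests, placeholder credentials): create a session, send a batch of requests with its id so they all leave from the same outgoing IP, then start a fresh session for the next batch.

import requests

proxies = {"http": "http://<API_KEY>:@proxy.crawlera.com:8010/"}

def crawl_batch(urls):
    # The first request creates the session; its id pins the later
    # requests in the batch to the same outgoing IP.
    headers = {"X-Crawlera-Session": "create"}
    for url in urls:
        resp = requests.get(url, proxies=proxies, headers=headers)
        headers["X-Crawlera-Session"] = resp.headers.get(
            "X-Crawlera-Session", headers["X-Crawlera-Session"])
        print(url, resp.status_code)

crawl_batch(["http://httpbin.org/ip"] * 10)  # 10 requests, one IP
crawl_batch(["http://httpbin.org/ip"] * 10)  # next 10, a different IP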


Let us know if it helps,


Tomas