Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
1669573348 2 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 2 weeks ago 1

I want to change the region of my existing Crawlera account. How can I do that?

Answer

Hello,


This article will guide you through creating an account for a particular region: http://help.scrapinghub.com/crawlera/regional-ips-in-crawlera
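For reference, a minimal sketch of how a region-specific account is typically used, assuming it goes through the standard proxy endpoint like any other account (Python requests library; <REGIONAL_API_KEY> is a placeholder):

import requests

# Sketch only: <REGIONAL_API_KEY> stands for the key of the region-specific
# Crawlera account created by following the article above.
proxies = {"http": "http://<REGIONAL_API_KEY>:@proxy.crawlera.com:8010/"}
response = requests.get("http://httpbin.org/ip", proxies=proxies)
print(response.text)  # should show an outgoing IP from the account's region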

0
Completed
silverstart1987913 2 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

curl -vx proxy.crawlera.com:8010 -U <API_KEY>: http://newyork.craigslist.org/reply/nyc/sls/5986555995





* Trying 64.58.114.47...
* Connected to proxy.crawlera.com (64.58.114.47) port 8010 (#0)
* Proxy auth using Basic with user '<API_KEY>'
> Host: newyork.craigslist.org
> Proxy-Authorization: Basic ODA5N2VkZGQzYzIyNGUyMGE2NzYyZmI3NTRhYjhkMmE6
> User-Agent: curl/7.47.0
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 503 Service Unavailable
< Connection: close
< Content-Length: 17
< Content-Type: text/plain
< Date: Mon, 06 Feb 2017 07:54:01 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< Retry-After: 1
< X-Crawlera-Error: slavebanned
< X-Crawlera-Slave: 191.96.248.108:3128
< X-Crawlera-Version: 1.18.0-42b9dc
<
* Closing connection 0


Answer

Hi Silverstart,


Crawlera does not solve captchas. However, it will detect redirects to captcha pages (for most sites) and will retry until it hits a clean page. Unsuccessful attempts won't count towards the monthly quota.
http://help.scrapinghub.com/crawlera/crawlera-and-captchas
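If it helps with debugging, here is a minimal sketch (Python requests library; <API_KEY> is a placeholder) that sends the same request through Crawlera and inspects the X-Crawlera-Error header seen in your curl output:

import requests

# Sketch only: <API_KEY> is a placeholder for your Crawlera API key.
proxies = {"http": "http://<API_KEY>:@proxy.crawlera.com:8010/"}
response = requests.get(
    "http://newyork.craigslist.org/reply/nyc/sls/5986555995",
    proxies=proxies,
)
print(response.status_code)
# A 503 with X-Crawlera-Error (e.g. "slavebanned") means the outgoing node was
# banned by the target site; Crawlera rotates to another node on retry.
print(response.headers.get("X-Crawlera-Error"))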

Also, crawling Craigslist is not easy, and many clients turn to our professional services for it. Our developers can help you create and deploy powerful crawlers using the most advanced techniques and tools, saving you a lot of time and resources.
If interested, don't hesitate to contact us through help@scrapinghub.com or request a free quote at: https://scrapinghub.com/quote.


Kind regards!

Pablo

0
Answered
rdelbel 3 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I can't seem to get Selenium working with Polipo.

Crawlera itself works fine when I test it with curl.

I have never used Polipo and might be doing something obviously wrong with it.

When I get my IP from httpbin.org/ip through Selenium, I get my normal IP, not the proxy IP.

These requests are also not logged in the Crawlera dashboard.

See details and debugging steps below.

I am on an Ubuntu 16.04 LTS headless server on AWS EC2.

My security group is wide open for incoming and outgoing connections.


I installed polipo with $ sudo apt-get install polipo

I have added the two lines setting parentProxy and parentAuthCredentials to the /etc/polipo/config file

I restarted polipo with $ sudo service polipo restart

I checked /var/log/polipo/polipo.log, which states: "Established listening socket on port 8123."

I am using a headless server, but I got the HTML response from $ curl http://localhost:8123/polipo/config, pasted the HTML into my local environment, and opened it with a browser. It seems the parentProxy is correct and the parentAuthCredentials is hidden, as it is supposed to be.


I tested out the following Python script, but it prints out my normal IP.


from pyvirtualdisplay import Display
from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.proxy import *


display = Display(visible=0, size=(1920, 1080))
display.start()


polipo_proxy = "localhost:8123"


proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': polipo_proxy,
    'ftpProxy': polipo_proxy,
    'sslProxy': polipo_proxy,
    'noProxy': ''
})


driver = webdriver.Firefox(proxy=proxy)
driver.get('http://httpbin.org/ip')
print driver.page_source


Answer

Hi! To learn more about how to use Crawlera with Selenium and Polipo, please see our documentation here:
https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-selenium-and-polipo
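Before involving Selenium, it may also help to confirm that Polipo itself is forwarding through Crawlera. A minimal sketch with the Python requests library, assuming Polipo is listening on localhost:8123 as in your log:

import requests

# Sketch only: send a request through the local Polipo instance and check that
# the IP returned by httpbin is a Crawlera IP rather than your own.
proxies = {"http": "http://localhost:8123"}
response = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=60)
print(response.json())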

Best regards!

0
Answered
Vince 3 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Hello there,

I have a problem with my Crawlera dashboard. The recent requests list doesn't work; the loader is always displayed.


The cause is a 500 error on the API query: https://app.scrapinghub.com/api/v2/kibana/requests_filtered?username=


Best regards.

Answer

Hey Vince, we experienced some temporary network issues that impacted the Crawlera stats. This should work fine now.
Kind regards!

Pablo

0
Answered
张松 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

My account prompts me with 'Usage limit reached'.

So, can the IPs I already used still be reused?

Answer

Hi 张松, I'm not sure I understand correctly. From the stats, we can see that you reached your limit at 2017-01-22 06:20:27 and that no more requests were made through Crawlera after that.

You can upgrade your plan to continue using your account, or wait until your next billing period.

Regards!

0
Answered
nahidul nibir 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

I was trying to use Crawlera with Selenium via Polipo on a Fedora machine. I've done exactly what is written in the doc, but when I run my script Polipo keeps saying "couldn't parse etag". Is there anything I'm missing?

Answer

Hi Nahidul,

I think this user experienced similar issues, let me know if this helps:
https://support.scrapinghub.com/topics/678-configuring-selenium-with-crawlera/
Regards,

Pablo

0
Answered
Gustav Palsson 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 month ago 4

It seems like all the proxies from outside Sweden are blocked, which makes my scrapes hard to do.

Can you help?

Answer

Hi Gustav, perhaps it was something related to the target domain. I recently checked, and many proxies "all around the world" (as you correctly pointed out =)) are making successful requests.

Don't hesitate to reach us through Intercom or here again, if you have further questions.

Kind regards,

Pablo

0
Answered
jason 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 month ago 3

I have a client whom I referred to Crawlera, and I had him provide the API Key to me to run the routine. When I input it and test, it will work, but within a day or so I get the error: "Failed to match the login check" and if I have him log in and send me the API Key again, it has changed, and the new one works. How can I prevent it from changing?

Answer

Hey Jason, our Crawlera engineers inform us that no changes can occur to the API key without your client's authorization.

Don't hesitate to ask if you need further assistance.

Best regards.

0
Answered
alfons 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 2

Hi,


I've signed up for the dedicated enterprise pool and can't figure out how to configure my Scrapy script to use my dedicated proxy URL instead of the default: proxy.crawlera.com:8010.

Would be great if someone could help me out with this.

Thank you.
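For illustration, a minimal sketch of how this might look with the scrapy-crawlera middleware, assuming the dedicated pool is reachable at its own hostname; <DEDICATED_HOST> and <API_KEY> are placeholders:

# settings.py (sketch only)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API_KEY>'
# scrapy-crawlera reads the proxy endpoint from CRAWLERA_URL, which defaults
# to http://proxy.crawlera.com:8010; point it at the dedicated pool instead.
CRAWLERA_URL = 'http://<DEDICATED_HOST>:8010'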

0
Answered
erainfotech.mail 2 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 2

https://www.airbnb.com/api/v2/calendar_months?listing_id=13943652&month=5&year=2017&count=1&_format=host_calendar_detailed&key=d306zoyjsyarp7ifhu67rjxn52tv0t20
HTTP/1.1 503 Service Unavailable
Connection: close
Content-Length: 17
Content-Type: text/plain
Date: Sat, 17 Dec 2016 08:03:25 GMT
Proxy-Connection: close
Retry-After: 1
X-Crawlera-Error: slavebanned
X-Crawlera-Slave: 202.75.58.168:60099
X-Crawlera-Version: 1.14.0-c1abfa



Website crawl ban

0
Answered
glebon 3 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 months ago 2
0
Answered
Chris Fankhauser 3 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 2 weeks ago 8

I'm having some trouble using Crawlera with a CasperJS script. I'm specifying the `--proxy` and `--proxy-auth` parameters properly along with the cert and `ignore-ssl-errors=true` but continue to receive a "407 Proxy Authentication Required" response.


I'm using:

Casper v1.1.3

Phantom v2.1.1


My command-line (edited for sneakiness):

$ casperjs --proxy="proxy.crawlera.com:8010" --proxy-auth="<api-key>:''" --ignore-ssl-errors=true --ssl-client-certificate-file="../crawlera-ca.crt" --output-path="." --url="https://<target-url>" policies.js

Which results in this response (which I format in JSON):

{"result":"error","status":407,"statusText":"Proxy Authentication Required","url":"https://<target-url>","details":"Proxy requires authentication"}

I've tried [and succeeded] using the Fetch API, but unfortunately this is a step backward for me since the target URL is an Angular-based site which requires a significant amount of interaction/JS manipulation before I can dump the page contents.


I've also attempted to specify the proxy settings within the script, like this:

var casper = require('casper').create();
phantom.setProxy('proxy.crawlera.com','8010','manual','<api-key>','');

...but no dice. I still get the 407 response.


I don't think there's an issue with the proxy, as the curl tests work fine, but an issue with integrating the proxy settings with Casper/Phantom. If anyone has a potential solution or known working workaround, I'd love to hear it... :)

Answer

I would recommend trying to first get it to work directly with phantomjs (without the casperjs wrapping functions) to rule out it having anything to do with casperjs.


Related to this, you're using `--proxy-auth="<api-key>:''"` while the expected signature is `--proxy-auth="username:password"`; I'm not sure whether the API key serves as the username with no password in this context.

0
Answered
upcretailer 3 months ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 month ago 2

Requests to Walmart product pages take extremely long (many on the order of minutes). Walmart product pages use HTTPS (if you use HTTP you get redirected and upgraded to HTTPS). I have tried calling both HTTPS and HTTP directly (the redirect times out when upgrading the request). HTTP requests that get redirected only respond with a 301 and a body stating "Redirecting..." without actually following the redirect. Calling the HTTPS address directly takes 20 seconds to sometimes minutes to respond, and it is even slow (although not to the same extent as Walmart) when using HTTPS for other sites. Sometimes it will time out after 2 minutes and not respond at all. After reading through some of the posts here I tried adding "X-Crawlera-Use-HTTPS": 1, but that doesn't seem to make much difference. However, using some other proxies and the Python requests library without Scrapy gets me back to reasonable response times. Am I doing something wrong?



Settings file:


import os

import sys
import time
import re

import django


# ...django stuff..

BOT_NAME = 'walmart_crawler'

LOG_LEVEL = 'DEBUG'

LOG_FILE = os.path.join(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
    'logs',  # log directory
    "%s--%s.log" % (BOT_NAME, time.strftime('%Y-%m-%dT%H-%M-%S'))
)

LOG_FORMAT = '%(asctime)s [WalmartCrawler] %(levelname)s: %(message)s'

LAST_JOB_NUM = max(
    [
        int(re.match(r'\w+?_crawl_(?P<jobnum>\d+)', file).group('jobnum'))
        for file in os.listdir(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'crawls'))
        if re.match(r'\w+?_crawl_(?P<jobnum>\d+)', file)
    ] or (0,)
)

RESUME_LAST_JOB = False

NEXT_JOB_NUM = (LAST_JOB_NUM + 1) if LAST_JOB_NUM else 1

JOBDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'crawls', '%s_crawl_%d' % (
    BOT_NAME,
    (LAST_JOB_NUM if RESUME_LAST_JOB else NEXT_JOB_NUM)
))

SPIDER_MODULES = ['walmart_crawler.spiders']
NEWSPIDER_MODULE = 'walmart_crawler.spiders'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    # "X-Crawlera-Cookies": "disable",
    "X-Crawlera-Debug": "request-time,ua",
    "X-Crawlera-Use-HTTPS": 1,
}

REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 40

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = os.environ['CRAWLERA_KEY']

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 80,
    'scrapy_crawlera.CrawleraMiddleware': 610,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 400
}

RETRY_TIMES = 1
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'walmart_crawler.pipelines.WalmartCrawlerPipeline': 300,
}

Answer

Hi Upcretailer,

First, sorry for the late answer. In the past 3 weeks we have experienced heavy bans from several sites. New machine learning improvements have strengthened security on those sites and made them harder to scrape.
Our Crawlera team has been investigating and has performed some actions to deal with this issue.
Please let us know if you continue experiencing the same problems.

Best regards!

0
Answered
Albert Dieleman 4 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 3 months ago 1

I'm using Google Apps Script, which is based on JavaScript. I use the UrlFetchApp.fetch function, where you pass in the URL and the headers. Any examples of the syntax to set up the headers for calling Crawlera?

Answer

Hi, I don't see that Google Apps Script supports proxy functionality (https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app), so in that case you should use the Fetch API:


curl -u APIKEY: http://proxy.crawlera.com:8010/fetch?url=http://httpbin.org/ip
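The same Fetch API call, sketched with the Python requests library (<API_KEY> is a placeholder):

import requests

# Sketch only: the API key is the HTTP basic auth username; the password is empty.
response = requests.get(
    "http://proxy.crawlera.com:8010/fetch",
    params={"url": "http://httpbin.org/ip"},
    auth=("<API_KEY>", ""),
)
print(response.status_code, response.text)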


Regards,


Tomas

0
Answered
Vishal Tandon 4 months ago in Crawlera • updated by Tomas Rinke (Support Engineer) 1 month ago 1

Hi folks, I was just trying the service before making a purchase, so I made a GET request to http://httpbin.org/ip just to make sure that the IP I get in the response is not mine, but I am getting the following error.


requests.exceptions.ConnectionError: HTTPConnectionPool(host='proxy.crawlera.com', port=8010): Max retries exceeded with url: http://httpbin.org/ip (Caused by : [Errno 54] Connection reset by peer)


Please suggest how I can test the service successfully before making a purchase.


Thanks.


Answer

Hi, could you share the code? Did you follow the documentation?


https://doc.scrapinghub.com/crawlera.html#python
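In case it helps, a minimal sketch following the pattern in that documentation (Python requests library; <API_KEY> is a placeholder):

import requests

# Sketch only: route a request through Crawlera and print the IP httpbin sees.
proxy_auth = "<API_KEY>:"  # API key as username, empty password
proxies = {"http": "http://{}@proxy.crawlera.com:8010/".format(proxy_auth)}
response = requests.get("http://httpbin.org/ip", proxies=proxies)
print(response.text)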


thanks