Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
seble 2 days ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 17 hours ago 3

Whenever I deploy my spider written in Python 3, scrapinghub thinks that it is written in Python 2 (I can see that the python2.7 interpreter is used by looking at the spider's logs). Due to differences in how Python 2 and Python 3 handle strings, this results in a UnicodeDecodeError. How can I tell scrapinghub to use Python 3 to execute my spider?

Answer

Solved it by specifying a stack in the scrapinghub.yml config file:

projects:
  default:
    id: XYZ
    stack: scrapy:1.1-py3

0
Ahmed 2 days ago in Scrapy Cloud 0

I used Portia to set up a Scrapy Cloud project and tested it on a couple of links from the website, and it works great. Now, my question is: I want to retrieve data from pages on this website on demand, one page at a time, similar to how Pinterest pulls in the title and image of a page its users add. The users on my website will do the same: they enter the URL of the page they want info from, my API sends the link to an API on Scrapinghub through a GET request, Scrapinghub extracts the info from that page and sends it back.

Is this something that can be done? If yes, can you please direct me on how this can be done?
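
If it helps, here is a rough sketch of that flow using the python-scrapinghub client (the same Connection/schedule API that appears elsewhere on this forum). The spider name, project id and the start_urls argument name are placeholders, and the job/item accessors should be verified against the client docs for your installed version:

from scrapinghub import Connection
import time

conn = Connection('API_KEY')          # your Scrapinghub API key
project = conn[12345]                 # your project id

def scrape_page(url):
    # Schedule a one-off job for the requested page; Portia spiders accept
    # start_urls as a job argument (argument name assumed here).
    job_id = project.schedule('my_portia_spider', start_urls=url)
    # Poll until the job finishes, then return its items. For a web API you
    # would normally do this asynchronously instead of blocking the request.
    while True:
        job = project.job(job_id)
        if job.info.get('state') == 'finished':
            return list(job.items())
        time.sleep(5)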

0
Waiting for Customer
Chris Fankhauser 1 week ago in Crawlera • updated by Pablo Vaz (Support Engineer) 8 hours ago 1

I'm having some trouble using Crawlera with a CasperJS script. I'm specifying the `--proxy` and `--proxy-auth` parameters properly along with the cert and `ignore-ssl-errors=true` but continue to receive a "407 Proxy Authentication Required" response.


I'm using:

Casper v1.1.3

Phantom v2.1.1


My command-line (edited for sneakiness):

$ casperjs --proxy="proxy.crawlera.com:8010" --proxy-auth="<api-key>:''" --ignore-ssl-errors=true --ssl-client-certificate-file="../crawlera-ca.crt" --output-path="." --url="https://<target-url>" policies.js

Which results in this response (formatted as JSON):

{"result":"error","status":407,"statusText":"Proxy Authentication Required","url":"https://<target-url>","details":"Proxy requires authentication"}

I've tried [and succeeded] using the Fetch API, but unfortunately this is a step backward for me since the target URL is an Angular-based site which requires a significant amount of interaction and JS manipulation before I can dump the page contents.


I've also attempted to specify the proxy settings within the script, like this:

var casper = require('casper').create();
phantom.setProxy('proxy.crawlera.com','8010','manual','<api-key>','');

...but no dice. I still get the 407 response.


I don't think there's an issue with the proxy itself, as the curl tests work fine, but rather with integrating the proxy settings with Casper/Phantom. If anyone has a potential solution or known working workaround, I'd love to hear it... :)
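
For anyone debugging the same thing, a quick way to confirm the key and certificate are fine outside Casper/Phantom is to make the same request through Crawlera with Python's requests library (placeholders and cert path as in the post above); if this returns 200, the 407 is an auth-forwarding problem in Casper/Phantom rather than a credentials problem:

import requests

# Crawlera takes the API key as the proxy username with an empty password.
proxies = {
    'http': 'http://<api-key>:@proxy.crawlera.com:8010',
    'https': 'http://<api-key>:@proxy.crawlera.com:8010',
}

resp = requests.get('https://<target-url>',
                    proxies=proxies,
                    verify='../crawlera-ca.crt')   # same CA cert as above
print(resp.status_code)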

0
Mathieu Dhondt 1 week ago in Scrapy Cloud 0

I would like to get the items of each run and do some further manipulation of the data on my own server (including storing it).


My first idea was to schedule a call from my server to Scrapinghub and check for finished jobs regularly, but of course it would be more efficient if Scrapinghub notified me instead.


So is there a way for Scrapinghub to call an endpoint on my server whenever a crawling job has finished?
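
In case a webhook isn't available, here is a rough polling sketch with the python-scrapinghub client and requests; the spider name, project id, endpoint URL and the job metadata field are placeholders, and the client method names are worth double-checking against the client docs:

import requests
from scrapinghub import Connection

conn = Connection('API_KEY')
project = conn[12345]

processed = set()   # persist this between runs (file or DB) in real code

def poll_once():
    # List finished jobs and push the items of any job we haven't seen yet
    # to our own server for further manipulation and storage.
    for job in project.jobs(spider='my_spider', state='finished'):
        job_id = job.info.get('id')      # job key, e.g. '12345/1/7'
        if job_id in processed:
            continue
        items = list(job.items())
        requests.post('https://my-server.example.com/scrapinghub-items',
                      json={'job': job_id, 'items': items})
        processed.add(job_id)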

+1
Answered
temp2 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 0

I see that frames are not allowed, but I don't see a "Toggle page styling" button.

Is there a web scraper that doesn't use any of the targeted site's code to scrape? One that visually sees what's on the page and visually parses the text?





Answer

Hi! You can enable the "Toggle page styling" button after setting up your new spider and clicking the "Edit sample" button.

There, you will find a menu with the "Toggle" button. It looks like this:



On the other hand, iframes are not supported in Portia at the moment.

I hope this has been useful. Regards.

0
Michael Rüegg 3 weeks ago in Scrapy Cloud 0

Hi,


I have a spider that makes use of FormRequest, item loaders and Request.

Here's an example of a FormRequest:


yield FormRequest(url, formdata=formdata, callback=callback)

Here's one for an item loader:


il = ItemLoader(item=MyResult())
il.add_value('date', response.meta['date'])
yield il.load_item()

And here's one for a request:

page_request = Request(url, callback=self.parse_run_page)
yield page_request

DeltaFetch is enabled and creates a .db file, but on every spider run Scrapy makes all page requests again, so no delta processing is achieved.


Any ideas why Scrapy deltafetch does not work? Thanks.
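
For reference, a minimal scrapy-deltafetch setup looks like the sketch below (per the plugin's README); also note that DeltaFetch only skips requests whose earlier responses actually produced items, so pages that only yield further Requests are always revisited. The exact settings are worth re-checking against the plugin version installed on Scrapy Cloud:

# settings.py -- minimal scrapy-deltafetch configuration (sketch)
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
# Optional: start from scratch for one run by passing deltafetch_reset
# as a spider argument, e.g. scrapy crawl my_spider -a deltafetch_reset=1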

0
Answered
ustcgaoy01 3 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 3 weeks ago 1

I have two units and currently a spider is running on one of them. I have another spider within the same Scrapy project and I want to run it on the other, free unit. How do I do that?

Answer

Hello,


If you click Run from the dashboard and select the other spider(s), they should run using the free unit. If there are no free units, the job will go to the pending queue until a unit is available.
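
The same can be done programmatically; a tiny sketch with the python-scrapinghub client (spider name and project id are placeholders):

from scrapinghub import Connection

conn = Connection('API_KEY')
project = conn[12345]
# Schedules the second spider; it will use the free unit, or wait in the
# pending queue if none is available.
project.schedule('my_other_spider')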

0
Ashley Sandyford-Sykes 4 weeks ago in Splash 0

I have been attempting to crawl a number of gaming website pages, such as http://casino.bet365.com/home/en, scraping the links and associated details of the games.

These are rendered dynamically, so I've set up scrapy-splash with a Splash unit on Scrapinghub.

However, the dynamic display of the games and their data (XPath: //div[@class="PodName"]) depends on the JavaScript detecting a Flash version.


[Two screenshots were attached here, comparing the page rendered with Flash detected vs. without.]


Can the existence of Flash be spoofed or faked with a Lua script, allowing the JavaScript to render the complete HTML?

I am thinking: is there an equivalent approach to the PhantomJS Flash faker at https://github.com/mjesun/phantomjs-flash?
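
Purely as a sketch of the idea: Splash's splash:autoload() can inject a JavaScript stub before the page's own scripts run, so one could try pretending a Flash plugin exists. Whether this actually fools the site depends entirely on how its detection code probes for Flash (navigator.plugins, mimeTypes, swfobject, etc.), so the spider, Lua script and JS stub below are only a starting point, not a verified solution:

import scrapy
from scrapy_splash import SplashRequest

LUA_SOURCE = """
function main(splash)
  -- Inject a fake Flash plugin before any page script runs.
  splash:autoload([[
    (function () {
      var fakePlugin = {name: 'Shockwave Flash',
                        description: 'Shockwave Flash 11.2 r202',
                        filename: 'libpepflashplayer.so'};
      try {
        Object.defineProperty(navigator, 'plugins',
          {get: function () { return [fakePlugin]; }});
        Object.defineProperty(navigator, 'mimeTypes',
          {get: function () { return [{type: 'application/x-shockwave-flash',
                                       enabledPlugin: fakePlugin}]; }});
      } catch (e) {}
    })();
  ]])
  splash:go(splash.args.url)
  splash:wait(3)
  return splash:html()
end
"""

class CasinoSpider(scrapy.Spider):
    name = 'casino_games'
    start_urls = ['http://casino.bet365.com/home/en']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'lua_source': LUA_SOURCE})

    def parse(self, response):
        # The response body is the HTML returned by the Lua script.
        for pod in response.xpath('//div[@class="PodName"]'):
            yield {'name': pod.xpath('string(.)').extract_first()}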

0
upcretailer 4 weeks ago in Crawlera • updated 4 weeks ago 1

Requests to Walmart product pages take extremely long (many on the order of minutes). Walmart product pages use HTTPS (if you use HTTP you get redirected and upgraded to HTTPS); I have tried calling both HTTPS and HTTP directly (the redirect times out when upgrading the request). HTTP requests that get redirected only respond with a 301 and a body stating "Redirecting..." without actually following the redirect. Calling the HTTPS address directly takes 20 seconds to sometimes minutes to respond, and it is slow (although not to the same extent as Walmart) when using HTTPS for other sites too. Sometimes it times out after 2 minutes and doesn't respond at all. After reading through some of the posts here I tried adding "X-Crawlera-Use-HTTPS": 1, but that doesn't seem to make much difference. However, using some other proxies and the Python requests library without Scrapy gets me back to reasonable response times. Am I doing something wrong?



Settings file:


import os

import sys
import time
import re

import django


# ...django stuff..

BOT_NAME = 'walmart_crawler'

LOG_LEVEL = 'DEBUG'

LOG_FILE = os.path.join(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
    'logs',  # log directory
    "%s--%s.log" % (BOT_NAME, time.strftime('%Y-%m-%dT%H-%M-%S'))
)

LOG_FORMAT = '%(asctime)s [WalmartCrawler] %(levelname)s: %(message)s'

LAST_JOB_NUM = max(
    [
        int(re.match(r'\w+?_crawl_(?P<jobnum>\d+)', file).group('jobnum'))
        for file in os.listdir(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'crawls'))
        if re.match(r'\w+?_crawl_(?P<jobnum>\d+)', file)
    ] or (0,)
)

RESUME_LAST_JOB = False

NEXT_JOB_NUM = (LAST_JOB_NUM + 1) if LAST_JOB_NUM else 1

JOBDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'crawls', '%s_crawl_%d' % (
    BOT_NAME,
    (LAST_JOB_NUM if RESUME_LAST_JOB else NEXT_JOB_NUM)
))

SPIDER_MODULES = ['walmart_crawler.spiders']
NEWSPIDER_MODULE = 'walmart_crawler.spiders'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    # "X-Crawlera-Cookies": "disable",
    "X-Crawlera-Debug": "request-time,ua",
    "X-Crawlera-Use-HTTPS": 1,
}

REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 40

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = os.environ['CRAWLERA_KEY']

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 80,
    'scrapy_crawlera.CrawleraMiddleware': 610,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 400
}

RETRY_TIMES = 1
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'walmart_crawler.pipelines.WalmartCrawlerPipeline': 300,
}
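
Not an answer, but one way to narrow this down is to time the same product URL through Crawlera and directly, outside Scrapy, so proxy latency and Scrapy/middleware settings can be separated. A small sketch (the product URL is a placeholder; the API key is read from the same environment variable as in the settings above):

import os
import time
import requests

CRAWLERA_KEY = os.environ['CRAWLERA_KEY']          # same env var as in settings
URL = 'https://www.walmart.com/ip/<product-id>'    # placeholder product page
proxies = {
    'http': 'http://%s:@proxy.crawlera.com:8010' % CRAWLERA_KEY,
    'https': 'http://%s:@proxy.crawlera.com:8010' % CRAWLERA_KEY,
}

for label, kwargs in (('direct', {}), ('via Crawlera', {'proxies': proxies, 'verify': False})):
    start = time.time()
    resp = requests.get(URL, timeout=180, **kwargs)
    print('%s: HTTP %s in %.1fs' % (label, resp.status_code, time.time() - start))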

0
tedois 4 weeks ago in Scrapy Cloud 0

So, I have 5+ different spiders that run happily on a daily basis on Scrapinghub.


For now I'm doing manual exports, but I'm working on making something more professional. The task I'm trying to achieve is described below, along with the steps I'm taking and where I'm getting stuck.


Main task: send an email containing the new items in an Excel-compatible CSV, along with a description of how many new items were found and which old ones were removed.


Steps

1) for each spider, grab last job ids with status "finished"

2) extract all items from last run and the run before, separately

3) build an object or JSON (whichever works better) with these data, and compare them

4) from that JSON, build an Excel-compatible CSV (enclose data in quotes and use ; as the separator)

5) merge all that into an email and send it away


After a whole day I'm still stuck on step #2. It may have something to do with my Python knowledge or with the strategy I'm following - what do you say? :)


You know, this should be a fairly common task and I thought it couldn't be that hard, but hey - here I am asking for support! When I finish this I promise I'll make a public GitHub repo for the project so it can help everyone else too.


Cheers!
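
For steps 1-4, here is a rough sketch with the python-scrapinghub client; the spider name, project id and the 'url' field used for comparison are placeholders, the ordering returned by project.jobs() should be verified, and the CSV handling may need tweaking for Python 2 vs. 3:

import csv
from scrapinghub import Connection

conn = Connection('API_KEY')
project = conn[12345]

# Steps 1-2: the two most recent finished jobs of one spider and their items.
finished = list(project.jobs(spider='my_spider', state='finished'))[:2]
latest_items = list(finished[0].items())
previous_items = list(finished[1].items())

# Step 3: compare the runs on a unique field (here 'url').
latest_urls = set(item['url'] for item in latest_items)
previous_urls = set(item['url'] for item in previous_items)
new_items = [item for item in latest_items if item['url'] not in previous_urls]
removed_urls = previous_urls - latest_urls

# Step 4: Excel-friendly CSV -- ';' as separator, everything quoted.
with open('new_items.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=sorted(latest_items[0].keys()),
                            delimiter=';', quoting=csv.QUOTE_ALL,
                            extrasaction='ignore')
    writer.writeheader()
    writer.writerows(new_items)

print('%d new items, %d removed' % (len(new_items), len(removed_urls)))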

0
Answered
bjoern.juergensen 1 month ago in Portia • updated by Nestor Toledo Koplin (Support Engineer) 3 weeks ago 1

Hi, I'm sorry, but I don't have a clue what to type into the "run spider" dialogue.

I mean, I have created one, but typing its name into the field has no effect.

Can you assist, please?

thanks

Answer

Hello,


You'll need to publish the spider from Portia's UI, and then you'll be able to run it in Scrapy Cloud. Note: the publish icon looks like a green cloud.

0
Milan Topuzov 1 month ago in Scrapy Cloud • updated 1 month ago 1

I'm using the scrapinghub Python client.


I'm running my spider like this:


from scrapinghub import Connection
conn = Connection('API_KEY') # real api key in here
project = conn[12345] # real project id in here retrieved from conn.project_ids()
project.schedule('spider_name', start_urls='http://youtube.com') # real spider name


I've also tried running it like this:
project.schedule('spider_name', start_urls=['http://youtube.com'])
project.schedule('spider_name', start_urls='[http://youtube.com]')


I get the same error on every try:
ValueError: Missing scheme in request url: h


I'll need to schedule the spider with 1000 URLs per job. How can I make this happen?


Thanks a lot for the help
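
The error suggests the spider is iterating over the argument string character by character ('h' being the first letter of 'http'), because job arguments always arrive as plain strings. A workaround sketch, with illustrative names: pass the URLs as one delimited string and split them inside the spider.

# Scheduling side (reusing conn/project from above): one comma-separated string.
project.schedule('spider_name',
                 start_urls='http://youtube.com,http://example.com')

# Spider side: split the string back into a list before crawling.
import scrapy

class MySpider(scrapy.Spider):
    name = 'spider_name'

    def __init__(self, start_urls=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if start_urls:
            self.start_urls = start_urls.split(',')
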
+1
Answered
Mahmood 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I have created a new spider in Portia. Is it possible to set the requests per second for this spider when it's running in the Jobs section?

Answer

Hi Mahmood,


You are absolutely right.



You can also check the requests/min rate from the request data shown on the running job, to confirm that the rate does not exceed the configured values - in your case 0.25 requests per second, i.e. 15 requests per minute.


Thanks for participating in the forum!
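
For reference, the throttling itself is just a Scrapy setting that can be added to the project or spider settings in Scrapy Cloud; a small sketch matching the rate above:

# One request every 4 seconds ≈ 0.25 requests/s ≈ 15 requests/minute.
DOWNLOAD_DELAY = 4
# Keep concurrency at 1 per domain so the delay is the effective rate limit.
CONCURRENT_REQUESTS_PER_DOMAIN = 1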



0
Danielz 1 month ago in Scrapy Cloud • updated by tedois 4 weeks ago 1

The website requires you to check a box before you can view its pages. How do I get the scraper to check the box so it can access them?

You need to enter a name and an age to access the real site with its content.
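
If the gate is an ordinary HTML form, one approach is to submit it with Scrapy's FormRequest.from_response before crawling, so the session cookies it sets are reused for the content pages; the field names, URLs and selectors below are placeholders for the real site:

import scrapy

class GatedSiteSpider(scrapy.Spider):
    name = 'gated_site'
    start_urls = ['http://example.com/']          # the page with the entry form

    def parse(self, response):
        # Fill in the form (name, age, checkbox) the way a browser would;
        # Scrapy keeps the resulting cookies for the following requests.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'name': 'John', 'age': '30', 'accept': 'on'},
            callback=self.after_gate,
        )

    def after_gate(self, response):
        # Now the real content pages should be reachable.
        for href in response.css('a.content-link::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}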