Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Not a bug
MikeLambert 1 month ago in Splash • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

It seems Distil is blocking my use of Splash now. I'm not sure if the site I'm accessing just started using Distil, or if Distil just started detecting Splash.


Is there any information on how Distil detects Splash? I've seen examples of Selenium users needing to delete certain properties from document.window, but I am unclear as to exactly how Splash automates things and what it might leave in the browser that makes it detectable.
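One way to probe what the Splash-driven browser actually exposes to fingerprinting scripts, as a sketch (which properties Distil inspects is not public, so navigator.webdriver/plugins/languages below are just the usual suspects, and the URLs are placeholders):

import requests

SPLASH_URL = 'http://localhost:8050/execute'  # or your hosted instance URL

LUA_PROBE = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:evaljs([[JSON.stringify({
        userAgent: navigator.userAgent,
        webdriver: navigator.webdriver,
        pluginCount: navigator.plugins.length,
        languages: navigator.languages
    })]])
end
"""

resp = requests.post(SPLASH_URL,
                     json={'lua_source': LUA_PROBE, 'url': 'http://example.com'})
print(resp.text)  # compare against the same probe run in a normal desktop browser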


I did find https://www.quora.com/How-do-you-crawl-Crunchbase-and-bypass-the-bot-protection , where Pablo Hoffman (Scrapinghub co-founder) suggests contacting Scrapinghub for help with crawling. I'm not sure what a full consulting gig to do this would cost (any estimates?)


I'm already using scrapinghub/splash and pay for an instance, but if it's impossible to get through Distil, I'll just have to turn off my Splash JS instance and remove the feature, so any pointers (public, or private to mlambert@gmail.com) would be appreciated!

Answer

Hi Mike, Distil Networks is a powerful anti-bot system. I can see you have contacted our team through https://scrapinghub.com/quote

Our team will evaluate what could be the most suitable option in your case.


Best regards!


Pablo

0
Answered
tsvet 2 months ago in Splash • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Is it possible to choose the Splash version for a Splash instance? Mine is v2.1, but I need to use a function that only appears to be available in v2.3.

Answer

Hi Tsvet,


Yes, it is possible from our internal setup. Feel free to contact us through the Support Help Desk so we can help you further.

Kind regards,

Pablo Vaz

+1
Answered
Aysc 4 months ago in Splash • updated by Pablo Vaz (Support Engineer) 3 weeks ago 2

I run a Splash instance at Scrapinghub.


I am receiving a 400 error code when I try to call the select method:

local element = splash:select('#checkbox2')


I have checked the Splash documentation but could not find the source of this error. Can you help me?


WARNING: Bad request to Splash: {u'info': {u'line_number': 86, u'message': u'Lua error: [string "..."]:86: attempt to call method \'select\' (a nil value)', u'type': u'LUA_ERROR', u'source': u'[string "..."]', u'error': u"attempt to call method 'select' (a nil value)"}, u'type': u'ScriptError', u'description': u'Error happened while executing Lua script', u'error': 400}

Answer

I think this is because of the version of the Splash instance: splash:select was only added in Splash 2.3, so on an older instance it is nil, which is exactly the "attempt to call method 'select' (a nil value)" in the error above.
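If upgrading isn't immediately possible, a version-tolerant sketch (the URL and '#checkbox2' selector are placeholders from the question, and element:mouse_click also needs 2.3):

import scrapy
from scrapy_splash import SplashRequest

LUA = """
function main(splash)
    assert(splash:go(splash.args.url))
    if splash.select ~= nil then
        -- Splash >= 2.3: DOM element API is available
        local element = splash:select('#checkbox2')
        element:mouse_click()
    else
        -- pre-2.3 fallback: drive the DOM with plain JavaScript instead
        splash:runjs('document.querySelector("#checkbox2").click()')
    end
    splash:wait(0.5)
    return splash:html()
end
"""

class CheckboxSpider(scrapy.Spider):
    name = 'checkbox'

    def start_requests(self):
        yield SplashRequest('http://example.com', self.parse,
                            endpoint='execute', args={'lua_source': LUA})

    def parse(self, response):
        self.logger.info(response.text[:200])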

0
Answered
Aysc 4 months ago in Splash • updated 4 months ago 2

I am trying to figure out how to configure Crawlera with Splash. I will not use Splash for every request, but I want to use Crawlera for every request. I have created a test spider that scrapes http://httpbin.org/ip, and I am getting a response with a new IP on every spider run for both Splash and regular requests, so it seems to be working okay. But I have not used the method explained in the Crawlera docs, which uses a Lua script with the Splash /execute endpoint:

https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash


My question: am I doing something wrong? Should I use the Splash /execute endpoint for proxy settings? Should I change anything in my settings file?


Here is my settings file:


SPLASH_URL = 'my scrapinghub splash instance url'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'my crawlera api key'


And here is my test spider:


import scrapy
from scrapy_splash import SplashRequest

from base64 import b64encode
encoded_key = b64encode('my scrapinghub splash api key:')


class IpSpider(scrapy.Spider):
    name = 'ip'

    def start_requests(self):
        url = 'http://httpbin.org/ip'
        yield SplashRequest(url, self.parse_splash,
                            args={'wait': 0.5},
                            splash_headers={'Authorization': 'Basic ' + encoded_key})
        yield scrapy.Request(url=url, callback=self.parse)

    def parse_splash(self, response):
        print 'SPLASH'
        print response.text

    def parse(self, response):
        print 'REQUEST crawlera'
        print response.text


Answer

Hey Aysc,

Using Splash with Crawlera can be tricky. One thing to check first: your settings define DOWNLOADER_MIDDLEWARES twice, and the second assignment replaces the first, which disables the scrapy_splash middleware entries; merge both dicts into one. Beyond that, the best way to combine the two is shown in this article:
http://help.scrapinghub.com/splash/using-crawlera-with-splash
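For reference, a minimal sketch of the pattern from that article (the spider name, target URL and API-key placeholder are illustrative): rather than routing Splash traffic through CrawleraMiddleware, the Lua script makes Splash itself send its outgoing requests through Crawlera, using splash:on_request and request:set_proxy.

import scrapy
from scrapy_splash import SplashRequest

LUA_CRAWLERA = """
function main(splash)
    splash:on_request(function(request)
        request:set_header('X-Crawlera-Session', 'create')
        request:set_proxy{'proxy.crawlera.com', 8010,
                          username = splash.args.crawlera_user, password = ''}
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""

class IpViaCrawleraSpider(scrapy.Spider):
    name = 'ip_via_crawlera'

    def start_requests(self):
        yield SplashRequest('http://httpbin.org/ip', self.parse,
                            endpoint='execute',
                            args={'lua_source': LUA_CRAWLERA,
                                  'crawlera_user': 'my crawlera api key'})

    def parse(self, response):
        self.logger.info(response.text)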

If you are still experiencing issues, feel free to reach out to us and request a free quote:

https://scrapinghub.com/quote

Our developers can help you retrieve your data, saving you a great deal of time and resources.

Kind regards!

Pablo

0
Answered
ashley 5 months ago in Splash • updated by Pablo Vaz (Support Engineer) 5 months ago 8

Firstly, apologies: I'm new and relatively inexperienced, so I hope I provide the right info here in the right format!


I have a Spider that needs JavaScript rendering, so I am using scrapy-splash. Locally, with Splash in a Docker container, it all works well...


Settings.py as below

....

BOT_NAME = 'cerebro'

LOG_LEVEL = 'DEBUG'
SPIDER_MODULES = ['cerebro.spiders']

NEWSPIDER_MODULE = 'cerebro.spiders'
#SPLASH_URL = 'https://238dq2tt-splash.scrapinghub.com'

SPLASH_URL = 'http://localhost:8050'

SPLASH_LOG_400 = True

# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = 'cerebro (+http://www.yourdomain.com)'
# Obey robots.txt rules

ROBOTSTXT_OBEY = True

..... Plus the usual middleware stuff.


Then in my spider.py

....

class SITE_Bet365_gamesSpider(scrapy.Spider):
    name = "SITE_Bet365_games"

    def start_requests(self):
        urls = [
            'https://games.bet365.com/home/en',
            'https://www.unibet.com/casino',
        ]
        for url in urls:
            yield SplashRequest(url, self.parse,
                                method='GET',
                                endpoint='render.html',
                                args={'wait': 2.5, 'http_method': 'GET'},
                                )

..... (Yes, I have gone for overkill on the GET stuff; I have tried a million different options/thoughts, and I know method='GET' isn't strictly needed for SplashRequest, but I have tried with and without it, etc.)


So the real crux...

When I run locally

>> scrapy crawl SITE_Bet365_games


it all works and in the job output I see the following DEBUG output:

2017-01-31 22:09:08 [scrapy] DEBUG: Crawled (200) (referer: None)

2017-01-31 22:09:16 [scrapy] DEBUG: Crawled (200) <https://www.unibet.com/casino via http://localhost:8050/render.html> (referer: None)

2017-01-31 22:09:16 [scrapy] DEBUG: Scraped from <200 https://www.unibet.com/casino>

So I then shub deploy, add the SPLASH_URL and SPLASH_USER settings (from my Scrapinghub Splash instance) into the dashboard, and then run the job.


The Splash instance appears to be working, as the output shows requests being made to the correct URLs, but they now appear to be failing with a 401 error.

Request 0 (2017-01-31 21:36:53 UTC)
Duration: 56 ms
Fingerprint: 422260242598f30c35b519d28ab02747d7395178
HTTP Method: POST
Response Size: 195 bytes
HTTP Status: 401
Last Seen: 2017-01-31 21:36:53 UTC
URL:

Request 1 (2017-01-31 21:36:53 UTC)
Duration: 65 ms
Fingerprint: e7f55a1a45b868c93612f136783261d73d05b112
HTTP Method: POST
Response Size: 195 bytes
HTTP Status: 401
Last Seen: 2017-01-31 21:36:53 UTC
URL:


And I believe the issue is that these URLs are being POSTed, not GET as it shows. I'm keen to try to do this the recommended/easiest way, i.e. use SplashRequest and Scrapinghub etc., and I'm just assuming there is some configuration difference between running locally and on Scrapinghub, but honestly I have no idea.


It has occurred to me that the sites themselves may be deciding they don't like the request headers coming from Scrapinghub, and perhaps I need to invest in Crawlera to mask some of this, but having not got Splash working yet I am reluctant to add another layer of complexity.


-- So, also worth adding:

I tried actually working with POST, but there is no httpuser etc. for me to set, and dummy values don't work.

I consulted robots.txt and they have nothing that could be blocking me.

I added and removed middleware such as scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware; clearly I don't want to POST, so I have left it removed.

I'm sure you'll ask me about the rest of settings.py, so on edit I have also included the other stuff below:

...

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    #'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 811,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

....

Answer
ashley 5 months ago

OK, issue solved...

By making the same requests through my Chrome browser to the Splash instance and authorising at the prompt, I could inspect all the header details, and I finally noticed a difference in the last character of the encoded_key supplied in splash_headers.


Using the value from my browser, and lo and behold, authentication then works from Scrapinghub...


So what was the problem?


When encoding the authorisation, I was encoding "<APIKEY>" and not "<APIKEY>:"
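In code, the fix looks like this (a sketch in Python 2, matching the spiders above; MY_SPLASH_APIKEY is a placeholder): HTTP Basic auth encodes "username:password", and for Splash the API key is the username with an empty password, so the trailing colon must be part of what you encode.

from base64 import b64encode

apikey = 'MY_SPLASH_APIKEY'                      # placeholder
encoded_key = b64encode(apikey + ':')            # right: encodes "<APIKEY>:"
# encoded_key = b64encode(apikey)                # wrong: the colon is missing
splash_headers = {'Authorization': 'Basic ' + encoded_key}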


Alas now onto next issue!

0
Answered
cpierre 6 months ago in Splash • updated by Pablo Vaz (Support Engineer) 6 months ago 3

Hello,

I am using Splash with multiple proxies, and it works well using the 'proxy' entry in the splash args meta key (see http://splash.readthedocs.io/en/latest/api.html#render-html), but it does not work for Crawlera!

Every time I try (against either an http or https website), it returns an empty body: '<html><head></head><body></body></html>'.

Should this work out of the box, or am I missing something?


Best,

Answer

Hey Cpierre,

I'm also checking this: http://help.scrapinghub.com/splash/using-crawlera-with-splash

It seems a little different there:


-- Put your Crawlera username and password here. This is different from your
-- Scrapinghub account. Find your Crawlera username and password in
-- https://app.scrapinghub.com/
local user = ""
local password = ""

local host = 'proxy.crawlera.com'
local port = 8010
local session_header = "X-Crawlera-Session"
local session_id = "create"

0
Answered
Ashley Sandyford-Sykes 8 months ago in Splash • updated by Pablo Vaz (Support Engineer) 6 months ago 2

I have been attempting to crawl a number of gaming website pages, such as http://casino.bet365.com/home/en, scraping the links and associated details of the games.

These are rendered dynamically, so I've set up scrapy-splash with a Splash unit on Scrapinghub.

However, the dynamic display of the games and their data (xpath: //div[@class="PodName"]) is dependent on the JavaScript detecting a Flash version.




Can the existence of Flash be spoofed or faked with a Lua script, allowing the JavaScript to render the complete HTML?

I am thinking: is there an approach equivalent to the PhantomJS Flash faker at https://github.com/mjesun/phantomjs-flash ?
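For what it's worth, one hedged sketch of the idea (not an official answer): many Flash checks read navigator.plugins or navigator.mimeTypes, so JavaScript pre-loaded with splash:autoload() can try to fake those entries before the page's own scripts run. Whether bet365's detection falls for this, and whether Splash's older WebKit lets you redefine navigator properties at all, is untested.

LUA_FAKE_FLASH = """
function main(splash)
    -- autoload runs this JS on every page before the site's own scripts
    splash:autoload([[
        var fakeFlash = {name: 'Shockwave Flash',
                         description: 'Shockwave Flash 11.2 r202',
                         filename: 'fake-flash-plugin'};
        Object.defineProperty(navigator, 'plugins', {
            get: function () { return [fakeFlash]; }
        });
        Object.defineProperty(navigator, 'mimeTypes', {
            get: function () {
                return [{type: 'application/x-shockwave-flash',
                         enabledPlugin: fakeFlash}];
            }
        });
    ]])
    assert(splash:go(splash.args.url))
    splash:wait(5)
    return splash:html()
end
"""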

0
Answered
Arunkumar N 1 year ago in Splash • updated by Pablo Hoffman (Director) 1 year ago 2


function main(splash)
    local url = splash.args.url  -- was "local url = url", which leaves url nil
    local div = '<span data-field="name" class="selectbox-option-name">P</span><span data-field="stock_message" class="selectbox-option-stock"></span>'
    local id = "CO515APF79MBS-10300"
    assert(splash:go(url))
    assert(splash:wait(3))
    splash:runjs('document.getElementById("add-to-cart").getElementsByClassName("selectbox-current")[0].innerHTML= div')
    assert(splash:wait(5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end

From the above code, I have declared the div variable and used it like this:

splash:runjs('document.getElementById("addcart")[0].innerHTML= div')

but here div is not replaced.


Answer

Hi,


You should change the following:


splash:runjs('document.getElementById("add-to-cart").getElementsByClassName("selectbox-current")[0].innerHTML= div')


to:


splash:runjs('document.getElementById("add-to-cart").getElementsByClassName("selectbox-current")[0].innerHTML = "' .. div .. '"')


See here: http://lua-users.org/wiki/StringInterpolation
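One hedged caveat on the fix above: this particular div value itself contains double quotes (class="..."), so splicing it into a double-quoted JavaScript string literal will still break. A sketch that sidesteps the quoting entirely by passing the value as a real argument through splash:jsfunc:

LUA = """
function main(splash)
    local div = '<span data-field="name" class="selectbox-option-name">P</span>'
    assert(splash:go(splash.args.url))
    assert(splash:wait(3))
    -- jsfunc converts the Lua string into a JS argument, so no manual escaping
    local set_html = splash:jsfunc([[
        function (html) {
            document.getElementById("add-to-cart")
                    .getElementsByClassName("selectbox-current")[0].innerHTML = html;
        }
    ]])
    set_html(div)
    assert(splash:wait(5))
    return splash:html()
end
"""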

0
Answered
Alejandro 1 year ago in Splash • updated by Pablo Vaz (Support Engineer) 6 months ago 1

Hi all. I'm planning to use Scrapy with Splash, within a Crawlera plan. My program, as specified here: http://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash , needs to use a Lua script, but I cannot figure out where to create or place it. Any thoughts will be highly appreciated.

Answer

Hi Alejandro,

Once you have created the Splash instance, you will see a window in which you can add your Lua scripts.

Please see our Help Desk to know more:
http://help.scrapinghub.com/splash
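If you'd rather keep the script in your project, a minimal sketch of the other common placement (spider name and URL are placeholders): store the script as a string in your spider and ship it to the /execute endpoint via scrapy-splash's lua_source argument.

import scrapy
from scrapy_splash import SplashRequest

LUA = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""

class LuaPlacementSpider(scrapy.Spider):
    name = 'lua_placement'

    def start_requests(self):
        yield SplashRequest('http://example.com', self.parse,
                            endpoint='execute', args={'lua_source': LUA})

    def parse(self, response):
        self.logger.info(response.url)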

Best regards.

0
Answered
Angela 1 year ago in Splash • updated 1 year ago 2

Hi,


Is it possible to get a working example of how to render a page using Splash and then execute XPath queries on the result in the Scrapy shell? This isn't covered in the docs, and the only example I have been able to find online (https://www.quora.com/What-are-the-ways-to-crawl-a-website-that-uses-JavaScript-with-the-help-of-Python) no longer works.


Many thanks
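For what it's worth, one approach that works, as a sketch (assuming a Splash instance on localhost:8050): fetch the page through Splash's render.html endpoint from the Scrapy shell, then query the rendered DOM as usual.

scrapy shell 'http://localhost:8050/render.html?url=http://example.com&wait=2'
>>> response.xpath('//h1/text()').extract()
[u'Example Domain']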

0
Answered
Lucas Miranda 2 years ago in Splash • updated by Pablo Hoffman (Director) 1 year ago 3
Hello.

I am trying to do a simple task with Splash:
  • Enter on http://stackoverflow.com/
  • click on link Questions (id "nav-questions")
  • then return the page.
See below.
That doesn't work. Can someone help me?
Thanks

function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    splash:wait(10)
    splash:runjs('document.getElementById("nav-questions").click()')
    splash:wait(10)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end

Answer
Hey,

Unfortunately this is a known issue; one workaround is to use
splash:runjs("window.location = document.getElementById('nav-questions').href")
instead of
splash:runjs('document.getElementById("nav-questions").click()')
See https://github.com/scrapinghub/splash/issues/200#issuecomment-112552839 for a more detailed explanation.
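Dropped into the original script, the workaround looks like this (a sketch; the second wait gives the new page time to load after the JS navigation):

LUA = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    splash:wait(10)
    -- navigate via window.location instead of click(), per the issue linked above
    splash:runjs("window.location = document.getElementById('nav-questions').href")
    splash:wait(10)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""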
0
Answered
Pablo Hoffman (Director) 2 years ago in Splash • updated 2 years ago 1
Answer
Please send an email to sales@scrapinghub.com and request your Splash instance; we don't yet support automatic provisioning through our dashboard.
0
Answered
kim jin 2 years ago in Splash • updated by Pablo Hoffman (Director) 1 year ago 3
I have installed Portia, and I am trying to crawl a JS-based site using Splash with the Portia UI. Is there any option to integrate Portia with the Splash middleware? Thanks.
Answer

JavaScript sites are now supported by Portia (using Splash to process pages).

0
Answered
Noah Cinquini 2 years ago in Splash • updated by Pablo Hoffman (Director) 1 year ago 7
According to my understanding of this blog entry, to use Splash in a spider I can add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}


SPLASH_URL = 'http://localhost:8050/'


DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'

And set the `process_request` argument of `Rule` on `CrawlSpider` to a method like this:

def process_request(self, request):
    request.meta.update({
        'splash': {
            'endpoint': 'render.html',
            'args': {'wait': 0.5}
        }
    })
    return request

But it's not working. Is it possible?
Answer

Yes, you need to use the x-splash-format=json and you'll get the same output as render.json, which can include both the HTML & PNG.


See Splash README for more info.
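A hedged sketch of what that can look like with the question's own hook pointed at render.json (in Splash's HTTP API, html=1 and png=1 make render.json include the rendered HTML and a base64-encoded PNG in the JSON body):

def process_request(self, request):
    # same Rule hook as in the question, but targeting render.json so the
    # response body is JSON with "html" and "png" (base64) fields
    request.meta.update({
        'splash': {
            'endpoint': 'render.json',
            'args': {'wait': 0.5, 'html': 1, 'png': 1},
        }
    })
    return request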