Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia, and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Is there a way to load just a small element from a page's DOM instead of loading the whole page? I'd like to scrape just a few divs out of the whole html page. The idea is to increase the speed and minimize transfer bandwidth.
Hi Kalo, this can be done with a headless browser like Splash.
To learn more, please check the documentation at: https://splash.readthedocs.io/en/stable/
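As a rough sketch of the idea: Splash's /execute endpoint accepts a Lua script, so the script can render the page and return only the matched element's HTML instead of the full page. The snippet below only builds the request (it assumes a Splash instance at localhost:8050; the URL and CSS selector are placeholders):

```python
import json

# Lua script for Splash's /execute endpoint: render the page, then
# return only the outerHTML of the first element matching a CSS
# selector, instead of the whole rendered page.
LUA_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    local js = "document.querySelector(" .. string.format('%q', args.css) .. ").outerHTML"
    return {html = splash:evaljs(js)}
end
"""

def build_splash_request(splash_host, url, css):
    # Returns the endpoint and JSON body for a POST to Splash.
    endpoint = splash_host.rstrip("/") + "/execute"
    body = json.dumps({"lua_source": LUA_SCRIPT, "url": url, "css": css})
    return endpoint, body

endpoint, body = build_splash_request("http://localhost:8050",
                                      "http://example.com", "div.price")
# POST `body` to `endpoint` with Content-Type: application/json.
```

Note this still renders the whole page inside Splash; the saving is in what gets transferred back to the spider.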
I tried to use Crawlera with the default PHP script (the one provided by Crawlera) and adjusted it (API key, path to certificate file), but it doesn't work at all. It fails with the error: connect() timed out!
The code works fine for me.
I tried with the URL:
and with the one you provided, and it worked for both.
Please be sure that:
- The path to the ca-cert is correct (you can try placing it in your Desktop or home directory)
- The Proxy auth is: $proxy_auth = '1231examplekfsj6789:'; and the ":" is at the end.
I'm testing this script on OS X by running: php my_script.php
I am using the following code to try and bring back all fields within a job using the Items API:
$sch_id = "172/73/3"; // job ID
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => "https://storage.scrapinghub.com/items/" . $sch_id . "?format=json&fields=_type,design,price,material",
    CURLOPT_USERPWD => "e658eb1xxxxxxxxxxxx4b42de6fd" . ":" . "",
));
$result = curl_exec($ch);
There are 4 fields I am trying to get as json but the request only brings back "_type" and "price". I have tried various things with different headers and the request URL but no luck.
Any advice would be appreciated.
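To make the request concrete, here is an equivalent sketch in Python (the job ID and field names are taken from the snippet above; the API key is a placeholder). One likely explanation for the missing fields is simply that `fields` only selects among fields that actually exist on each stored item, so items lacking `design` or `material` return without them:

```python
from urllib.parse import urlencode

STORAGE = "https://storage.scrapinghub.com/items"

def items_url(job_id, fields):
    # `fields` narrows the JSON output to the named item fields.
    # Fields absent from a stored item are silently omitted, not errors.
    query = urlencode({"format": "json", "fields": ",".join(fields)})
    return "%s/%s?%s" % (STORAGE, job_id, query)

url = items_url("172/73/3", ["_type", "design", "price", "material"])
# Fetch with HTTP basic auth: username = API key, password empty.
```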
We suggest using the script provided in our docs:
<?php
$ch = curl_init();
$url = 'https://twitter.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '<API KEY>:';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');
$scraped_page = curl_exec($ch);
curl_close($ch);
echo $scraped_page;
?>
If that is not possible, make sure to add the reference to the certificate for fetching HTTPS. In the code we provide, this is specified in:
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');
Could you add support for UTF-8? Non-English letters are not shown in the sample page editor, and regexp conditions are not working with them.
Your inquiry has been escalated to our Portia team.
UTF-8 is supported for non-Latin characters, but this support may need to be improved where it interacts with regexes.
This improvement is planned for upcoming releases.
Thanks for your valuable feedback and for helping us to improve our services.
What is the difference between annotations and fields? In "Sample page → Items", each field has configuration icons that open a tab with the separate groups "Annotation" and "Field". Each group has its own "required" option: what do they mean, and do they overlap? The "Annotation" group sets the path to the element, but that path is already hidden in the "Item", so why is there a "required" option there?
The Annotation count is not the same as the Extracted Items count.
If the webpage contains a list of items and the user uses the repeated annotations icon, the annotations will propagate and reflect the number of items present in the page.
However, it may happen that the algorithm responsible for data extraction is unable to use the annotations provided by the user to properly extract data, thus extracting a number of items different from the count next to the annotations.
For example, in the screenshot above we have one annotation with a count of 10, hinting that we are extracting 10 items from the page. However, the Extracted Items count shows that 0 items were extracted. This means our annotations haven't worked with Portia's extraction algorithm, so we may have to adjust them to target alternative elements.
To know more see Portia documentation:
Currently, scripts can only be deployed by using the shub deploy command. When we push scripts to git, the app doesn't seem to pull them from our repo.
Will pulling scripts via a git hook be supported in the future, or do you intend to stay with shub deploy for now?
I am using the default sample script provided on the site https://doc.scrapinghub.com/crawlera.html#php
When I use the default, my dashboard doesn't show that I'm even reaching Crawlera. There are no errors and nothing is displayed. Any idea how to troubleshoot?
DOMAIN HOST: Godaddy
Cert is in the same directory as PHP script
Make sure to add the full path before crawlera-ca.crt.
With the correct path, the script works fine.
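For comparison, the same setup can be sketched in Python. The helper below builds an absolute certificate path from a given base directory, so the cert is found regardless of the working directory the script is launched from (the API key and directory are placeholders):

```python
import os

def crawlera_config(api_key, base_dir, cert_name="crawlera-ca.crt"):
    # Absolute path to the certificate: "cert is in the same directory
    # as the script" only works if that directory is also the CWD,
    # so resolve it explicitly.
    cert_path = os.path.join(base_dir, cert_name)
    proxy = "http://%s:@proxy.crawlera.com:8010/" % api_key
    proxies = {"http": proxy, "https": proxy}
    return proxies, cert_path

# proxies, cert = crawlera_config("<API KEY>", os.path.dirname(os.path.abspath(__file__)))
# requests.get("https://example.com", proxies=proxies, verify=cert)
```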
Since yesterday my Portia crawls have been failing with a certain error:
I don't know whether this is a Scrapinghub/Portia error or something related to the external page being scraped (which had worked successfully for months).
Sometimes backend updates or new Portia releases can affect old extractors, which is why we always suggest giving your spiders some maintenance: refresh and redeploy them when necessary.
If possible, try recreating your spider and launching it again. This should work.
I would like to write/update data in MongoDB with the items crawled on Scrapinghub.
I have a MongoDB server that I would like my spiders to write to, and I would like to open access to that server only from Scrapy Cloud IPs.
Unfortunately, this is not possible. We cannot provide you a reliable range of IP addresses for our Scrapy Cloud crawling servers, because they're not static, they change frequently. So, even if we were to provide you the list we have now, it will soon change and your spider's connection with Mongo will break.
Here are a couple of alternatives to consider:
- Write a script that pulls the data from Scrapinghub (using the API) and writes it to your Mongo server. This script can run on your Mongo server or any other server (and you only need to whitelist that IP)
- Use authentication in Mongo
To know more about our API: https://doc.scrapinghub.com/api/items.html#items-api
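A minimal sketch of the first alternative, the sync script. The transformation below is stdlib-only and turns fetched items into (filter, update) pairs for Mongo upserts; the `_key` field name is an assumption (use whatever uniquely identifies your items), and each pair would be applied with pymongo via `collection.update_one(flt, upd, upsert=True)`:

```python
def items_to_upserts(items, key_field="_key"):
    # Turn scraped items into (filter, update) pairs for Mongo upserts,
    # keyed on a unique field so re-running the sync updates existing
    # documents instead of duplicating them.
    ops = []
    for item in items:
        flt = {key_field: item[key_field]}
        upd = {"$set": item}
        ops.append((flt, upd))
    return ops

# Items as returned by the Items API (format=json):
items = [{"_key": "a1", "price": 10}, {"_key": "b2", "price": 20}]
ops = items_to_upserts(items)
```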
I am trying to crawl a website using Crawlera that requires the presence of "Connection: keep-alive" in the header.
Is there any way to make Crawlera compatible with keep-alive connections? I tried using sessions but it didn't seem to help.
My bad, it actually seems to be working, but sometimes I'm getting "Cache-Control: max-age=259200" header entries rather than "Connection: keep-alive". Probably normal behavior.
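For anyone hitting the same issue: sessions are driven through Crawlera's request headers, per the Crawlera docs. A small sketch of the headers involved (the session id value is a placeholder returned by Crawlera itself):

```python
def session_headers(session_id=None):
    # First request: ask Crawlera to create a session. Crawlera replies
    # with an X-Crawlera-Session response header containing the id.
    # Follow-up requests: send that id back to stay on the same
    # outgoing node (which keeps the upstream connection reusable).
    if session_id is None:
        return {"X-Crawlera-Session": "create"}
    return {"X-Crawlera-Session": str(session_id)}
```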
I am trying to deploy a (local) Portia project to Scrapy Cloud. After adding "slybot" to requirements.txt I can deploy successfully, but when running the spider the following error occurs:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spidermanager.py", line 51, in __init__
    **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spider.py", line 44, in __init__
    ((t['scrapes'], t) for t in spec['templates']
KeyError: 'templates'
It seems you have successfully deployed spiders in the Switzerland project.
Let us know if you need further assistance.
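The KeyError above suggests a spider spec with no 'templates' entry, i.e. a spider without annotated sample pages. A quick local check before deploying can catch this; the sketch below assumes the usual Portia project layout with per-spider JSON specs under a spiders/ directory:

```python
import json
import os

def spiders_missing_templates(project_dir):
    # Report Portia spiders whose JSON spec lacks a non-empty
    # 'templates' list -- the condition behind KeyError: 'templates'.
    missing = []
    spiders_dir = os.path.join(project_dir, "spiders")
    for name in sorted(os.listdir(spiders_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(spiders_dir, name)) as f:
            spec = json.load(f)
        if not spec.get("templates"):
            missing.append(name)
    return missing
```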
Hello, I am running a crawler job, but it cannot receive response data after more than 40 requests.
I run my code on localhost and it works OK.
Hey Gin, checking your stats in the dashboard, it seems your spider is working fine.
Let us know if you need further help.
The following URL renders fine in all browsers EXCEPT the Scrapinghub browser:
I'd like to find out why, but no clues are given. Help?
Unfortunately, this site is still hard to open with the Portia browser.
Perhaps you should consider using other tools, such as Scrapy, to try fetching the data. Some sites are simply too complex to scrape and require more advanced tools.
If you don't know how to use it, I suggest this great tutorial:
I needed to scrape a site with a lot of JS code, so I used scrapy+selenium. It also has to run on Scrapy Cloud.
I wrote a spider which uses scrapy+selenium+phantomjs and ran it on my local machine. All was OK.
Then I deployed the project to Scrapy Cloud using shub-image. Deployment was OK, but the result of
webdriver.page_source is different: it's OK locally, but not OK in the cloud (HTML containing a 403 message, although the HTTP request returns 200).
Then I decided to use my Crawlera account. I added it with:
service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]
# locally (Windows):
self.driver = webdriver.PhantomJS(executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe', service_args=service_args)
# on Scrapy Cloud (Linux):
self.driver = webdriver.PhantomJS(executable_path=r'/usr/bin/phantomjs', service_args=service_args, desired_capabilities=dcap)
Again, locally all is OK; in the cloud it is not.
I've checked my Crawlera info and it's OK. Requests are sent from both (local and cloud).
I don't get what's wrong.
I think it might be due to differences between the PhantomJS versions (Windows vs. Linux).
If the issue is related to SSL fetching (HTTPS), this may be due to our current version of Erlang, which returns errors for some languages and browsers.
Our team is working on an update of the Erlang version, which should be deployed within weeks.
Let us know if you find more information about the error you get.
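One more thing worth checking, offered as a guess rather than a confirmed fix: PhantomJS takes its proxy arguments without embedded quotation marks, and Crawlera's endpoint is an HTTP proxy, so `--proxy-type=http` is typically what is wanted even for HTTPS pages. A sketch of building the arguments (the API key is a placeholder):

```python
def phantomjs_proxy_args(api_key):
    # No quotation marks inside the values: PhantomJS would treat them
    # as part of the host name / credentials, which can silently break
    # the proxy connection on some platforms.
    return [
        "--proxy=proxy.crawlera.com:8010",
        "--proxy-type=http",
        "--proxy-auth=%s:" % api_key,   # note the trailing colon
        "--ssl-protocol=any",           # often needed with older PhantomJS builds
    ]
```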