Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Remember to check the Help Center!
Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.
I am using the following code to try to bring back all fields within a job using the items API:
$sch_id = "172/73/3"; // job ID
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => "https://storage.scrapinghub.com/items/" . $sch_id . "?format=json&fields=_type,design,price,material",
    CURLOPT_USERPWD => "e658eb1xxxxxxxxxxxx4b42de6fd" . ":" . "",
));
$result = curl_exec($ch);
There are 4 fields I am trying to get as JSON, but the request only brings back "_type" and "price". I have tried various things with different headers and the request URL, but no luck.
Any advice would be appreciated.
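For comparison, here is a minimal stdlib-Python reproduction of the same request. The job ID and field list come from the post above; the API key is a placeholder. One possible (unverified) explanation for the missing fields: the items API simply omits any field an item does not contain, so absent "design"/"material" keys may mean the spider never populated them.

```python
import base64
import json
import urllib.request

STORAGE = "https://storage.scrapinghub.com/items/"

def items_url(job_id, fields):
    """Build the items API URL with an explicit fields filter."""
    return STORAGE + job_id + "?format=json&fields=" + ",".join(fields)

def fetch_items(job_id, fields, api_key):
    """GET a job's items; the API key is the Basic-auth user, password empty."""
    req = urllib.request.Request(items_url(job_id, fields))
    token = base64.b64encode((api_key + ":").encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Placeholder API key; compare each item's keys against the requested fields.
    items = fetch_items("172/73/3", ["_type", "design", "price", "material"],
                        "YOUR_API_KEY")
    for item in items:
        print(sorted(item.keys()))
```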
What is the difference between annotations and fields? In "Sample page → Items", each field has a configuration icon that opens a tab with two separate groups, "Annotation" and "Field". Each group has its own "required" option — what do they mean, and do they overlap? The "Annotation" group sets the path to the element, but that element is already hidden in the "Item", so why would it be "required"?
Currently scripts can only be deployed by using the shub deploy command. When we push scripts to git, the app doesn't seem to pull the scripts from our repo.
Will pulling scripts via a git hook be supported in the future, or do you intend to stick with shub deploy for now?
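For reference, the shub-based workflow is configured with a scrapinghub.yml file next to the project, then deployed with `shub deploy`. This is a sketch; the project ID is a placeholder, and the exact keys should be checked against the shub documentation:

```yaml
# scrapinghub.yml -- read by `shub deploy` from the project root
project: 12345          # your Scrapy Cloud project ID (placeholder)
requirements:
  file: requirements.txt   # extra Python dependencies to install on deploy
```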
I am using the default sample script provided on the site https://doc.scrapinghub.com/crawlera.html#php
When I use the default, my dashboard doesn't show that I'm even making it to Crawlera. There are no errors and nothing is displayed. Any idea how to troubleshoot?
DOMAIN HOST: GoDaddy
The cert is in the same directory as the PHP script.
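One way to check whether requests reach Crawlera at all, independent of the PHP sample, is to issue a single plain-HTTP request through the proxy and inspect the result. A minimal sketch, assuming the standard proxy endpoint from the Crawlera docs; the API key and test URL are placeholders:

```python
import urllib.request

def crawlera_proxy(api_key, host="proxy.crawlera.com", port=8010):
    """Proxy URL string: the API key is the proxy user, password is empty."""
    return "http://%s:@%s:%d" % (api_key, host, port)

if __name__ == "__main__":
    proxy = crawlera_proxy("YOUR_API_KEY")
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    # Plain-HTTP test target; HTTPS through the proxy additionally needs
    # the Crawlera CA certificate to be trusted.
    resp = opener.open("http://httpbin.org/ip")
    print(resp.status, dict(resp.headers))
```

If this request shows up on the Crawlera dashboard but the PHP sample's requests do not, the problem is in the PHP/cURL setup rather than the account.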
Since yesterday my Portia crawls have been failing with a certain error.
I don't know whether this is a Scrapinghub/Portia error or related to the external page being scraped (which had worked successfully for months before).
I would like to write/update data in MongoDB with the items crawled on Scrapinghub.
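A minimal sketch of one way to do this: stream a finished job's items from the storage API (jsonlines format) and upsert them into MongoDB with pymongo. The job ID, API key, database/collection names, and the "url" key field are all placeholder assumptions — adjust the key field to whatever uniquely identifies your items.

```python
import base64
import json
import urllib.request

def iter_job_items(job_id, api_key):
    """Stream a job's items from the Scrapy Cloud storage API as dicts."""
    req = urllib.request.Request(
        "https://storage.scrapinghub.com/items/%s?format=jl" % job_id)
    token = base64.b64encode((api_key + ":").encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        for line in resp:          # one JSON object per line
            if line.strip():
                yield json.loads(line)

def upsert_spec(item, key_field="url"):
    """Filter/replacement pair for replace_one(..., upsert=True)."""
    return {key_field: item[key_field]}, item

if __name__ == "__main__":
    from pymongo import MongoClient  # third-party: pip install pymongo
    coll = MongoClient()["scraping"]["items"]
    for item in iter_job_items("172/73/3", "YOUR_API_KEY"):
        flt, doc = upsert_spec(item)
        coll.replace_one(flt, doc, upsert=True)  # insert new, update existing
```

Alternatively, a Scrapy item pipeline doing the same replace_one call would write items as they are scraped instead of after the job finishes.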
We have a Crawlera account on the C10 plan and have just renewed the billing period. Since then I am facing a performance issue: each request takes more than 2 minutes to process, and I am also getting timeout exceptions. Before April 20th this worked fine — I could process 100 requests within 45 minutes; today I can process only 35 in the same time, and most Crawlera requests are failing. Please let me know whether any changes are required after the renewed billing period.
Ganesh Nayak K
I am trying to crawl a website using Crawlera that requires the presence of "Connection: keep-alive" in the header.
Is there any way to make Crawlera compatible with keep-alive connections? I tried using sessions but it didn't seem to help.
My bad, it actually seems to be working, but sometimes I'm getting "Cache-Control: max-age=259200" header entries rather than "Connection: keep-alive". Probably normal behavior.
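For reference, a small sketch of the session mechanism mentioned above, based on the documented X-Crawlera-Session request header: sending "create" asks Crawlera to open a session, and reusing the returned session ID pins subsequent requests to the same outgoing node — the closest analogue to a kept-alive connection. Whether the upstream connection is literally kept alive is up to Crawlera; this only shows the header handling.

```python
def session_headers(session_id=None):
    """Headers asking Crawlera to create (or reuse) a session.

    Pass no argument on the first request; thereafter pass the session ID
    returned in the response's X-Crawlera-Session header.
    """
    return {"X-Crawlera-Session": session_id or "create"}
```

Usage: send `session_headers()` with the first request, read the session ID out of the response's X-Crawlera-Session header, and send `session_headers(that_id)` with every later request.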
I am trying to deploy a (local) Portia project to Scrapinghub. After adding "slybot" to requirements.txt I can deploy successfully, but when running the spider the following error occurs:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spidermanager.py", line 51, in __init__
    **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spider.py", line 44, in __init__
    ((t['scrapes'], t) for t in spec['templates']
KeyError: 'templates'