Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Gabriel Munits 10 hours ago in Scrapy Cloud 0

Hey everyone,

I am launching a new service which bypasses reCAPTCHA, with multi-language support.

0
hello 17 hours ago in Scrapy Cloud 0

I am using the following code to try to bring back all fields within a job using the Items API:


$sch_id = "172/73/3"; // job ID

$ch = curl_init();

curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => "https://storage.scrapinghub.com/items/" . $sch_id . "?format=json&fields=_type,design,price,material",
    CURLOPT_CUSTOMREQUEST => "GET",
    // "=>" is required here: with a comma, these become two stray array
    // elements instead of an option, so the header was never being sent.
    CURLOPT_HTTPHEADER => array(
        // 'application/x-jsonlines' conflicts with format=json in the URL;
        // use a matching Accept header (or drop it entirely).
        'Accept: application/json',
    ),
    CURLOPT_USERPWD => "e658eb1xxxxxxxxxxxx4b42de6fd" . ":" . "",
));

$result = curl_exec($ch);
print_r(json_decode($result));

curl_close ($ch);

There are 4 fields I am trying to get as JSON, but the request only brings back "_type" and "price". I have tried various things with different headers and the request URL, but no luck.
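For comparison, a minimal Python equivalent of the same call (a sketch, assuming the requests library; the job ID and API key are the placeholders from the PHP above). Note that the Items API simply omits a requested field from an item when that item has no value stored for it, so a missing "design" or "material" usually means those fields were never populated for those items:

import requests

job_id = '172/73/3'  # placeholder job ID
resp = requests.get(
    'https://storage.scrapinghub.com/items/' + job_id,
    params={'format': 'json', 'fields': '_type,design,price,material'},
    auth=('APIKEY', ''),  # placeholder API key, empty password
)
# Each item dict contains only the requested fields it actually has.
for item in resp.json():
    print(item)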


Any advice would be appreciated.


Cheers,

Adam

0
vl2017 3 days ago in Portia 0

Could you add support for UTF-8? Non-English letters are not shown in the sample page editor, and regexp conditions do not work with them.
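For reference, a minimal Python 3 illustration of the regexp half of this (outside Portia), showing why an ASCII-only character class misses non-English letters while \w matches them:

import re

text = 'Привет Portia'
print(re.findall(r'[a-z]+', text))  # ['ortia']: the ASCII class skips Cyrillic
print(re.findall(r'\w+', text))     # ['Привет', 'Portia']: \w is Unicode-aware in Python 3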

0
vl2017 3 days ago in Portia • updated 3 days ago 0

What is the difference between annotations and fields? In "Sample page → Items", each field has configuration icons that open a tab with separate "Annotation" and "Field" groups. Each group has its own "required" option; what do they mean, and do they overlap? The "Annotation" group sets the path to the element, but that is already captured in the "Item", so why is there a "required" option here too?


How do I configure the scraper to ignore any pages containing a specified attribute or word?
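Portia's own link filters work on URL patterns; filtering on page content usually means dropping down to Scrapy code. A minimal sketch of the idea, with a hypothetical spider name, URL and blacklisted word:

import scrapy

class FilterSpider(scrapy.Spider):
    name = 'filter_example'  # hypothetical
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        # Skip any page whose body contains the blacklisted word.
        if 'discontinued' in response.text:
            return
        yield {'url': response.url}  # ...extract items as usual otherwise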
0
Alex L 4 days ago in Scrapy Cloud 0

Currently, scripts can only be deployed using the shub deploy command. When we push scripts to git, the app doesn't seem to pull the scripts from our repo.


Will pulling scripts via a git hook be supported in the future, or do you intend to stick with shub deploy for now?
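Until then, a common workaround (just a sketch, assuming shub is installed and authenticated via shub login, and that the repo contains a scrapinghub.yml with your project id) is to have a CI service or a server-side git hook run the deploy after each push:

pip install shub
shub deploy    # reads the target project id from scrapinghub.yml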

0
jkluv000 4 days ago in Crawlera 0

I am using the default sample script provided on the site https://doc.scrapinghub.com/crawlera.html#php


When I use the default, my dashboard doesn't show that I'm even reaching Crawlera. There are no errors and nothing is displayed. Any idea how to troubleshoot?


DOMAIN HOST: GoDaddy

The cert is in the same directory as the PHP script.


<?php


$ch = curl_init();


$url = 'https://www.google.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '239ec2d8dd334cfeb7b7361b00830f40:';


curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, 'crawlera-ca.crt');


$scraped_page = curl_exec($ch);
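
// Suggested troubleshooting addition (not part of the original sample):
// if the proxy CONNECT or the TLS verification fails, curl_exec() returns
// false with no output at all, which would explain an empty page and
// nothing showing up in the Crawlera dashboard.
if ($scraped_page === false) {
    echo 'cURL error (' . curl_errno($ch) . '): ' . curl_error($ch);
}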
curl_close($ch);
echo $scraped_page;


?>
0
sappollo 4 days ago in Portia • updated 4 days ago 0

Hi all,


Since yesterday my Portia crawls have been failing with a certain error.


I don't know whether this is a Scrapinghub/Portia error or something related to the external page being scraped (which had worked successfully for months).

0
shamily23v 5 days ago in Scrapy Cloud • updated 5 days ago 0

I would like to write/update data in MongoDB with the items crawled on Scrapinghub.
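A common approach is a Scrapy item pipeline that writes each item to MongoDB as it is scraped. A minimal sketch, assuming pymongo is listed in the project's requirements.txt; MONGO_URI, MONGO_DATABASE and the collection name are placeholders you would define yourself:

import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'  # placeholder collection

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are hypothetical names in settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() always adds; switch to replace_one(..., upsert=True)
        # keyed on a unique field if you need update semantics.
        self.db[self.collection_name].insert_one(dict(item))
        return item

Enable it with ITEM_PIPELINES = {'myproject.pipelines.MongoPipeline': 300} in settings.py ('myproject' being a placeholder for your project package).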

+1
uday.kumar.kakani 5 days ago in Crawlera • updated by Nguyễn Hoàng 7 hours ago 1

Hi,


We have a Crawlera account on the C10 plan, and we have renewed the billing period for this account. Today I am facing a performance issue: each request takes more than 2 minutes to process, and I am also getting timeout exceptions. Before April 20th this was working fine without any issues; I was able to process 100 requests within 45 minutes, but today I can process only 35 requests in the same time. Please let me know whether any changes are required after a renewed billing period. Most Crawlera requests are failing.


Regards,

Ganesh Nayak K

0
MSH 5 days ago in Portia 0

Hi.


I created a Portia spider for a website which was built with ASP.NET and uses javascript:__doPostBack for the pagination links.


Is it possible to follow this kind of link (javascript:__doPostBack) in Portia?

for example:

<a href="javascript:__doPostBack('p$lt$ctl06$pageplaceholder$p$lt$ctl00$ApplyPageJobSearch$GridView1','Page$2')">2</a>


Thanks

0
Answered
csmik.cs 6 days ago in Crawlera • updated by Pablo Vaz (Support Engineer) 5 days ago 2

Hi guys,


I am trying to crawl a website with Crawlera that requires "Connection: keep-alive" in the request headers.

Is there any way to make Crawlera compatible with keep-alive connections? I tried using sessions, but it didn't seem to help.
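For anyone hitting the same thing, a minimal sketch of Crawlera sessions from Python with the requests library; the API key and URLs are placeholders, and HTTPS targets would additionally need verify='crawlera-ca.crt' as in the docs:

import requests

proxies = {
    'http': 'http://APIKEY:@proxy.crawlera.com:8010/',
    'https': 'http://APIKEY:@proxy.crawlera.com:8010/',
}

# Ask Crawlera to create a session, so consecutive requests go out
# through the same node (the closest analogue of keep-alive).
resp = requests.get('http://example.com/', proxies=proxies,
                    headers={'X-Crawlera-Session': 'create'})
session_id = resp.headers.get('X-Crawlera-Session')

# Reuse the session on follow-up requests.
resp = requests.get('http://example.com/page2', proxies=proxies,
                    headers={'X-Crawlera-Session': session_id})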


Thanks!

Answer
csmik.cs 5 days ago

My bad, it actually seems to be working, but I'm sometimes getting "Cache-Control: max-age=259200" header entries rather than "Connection: keep-alive". Probably normal behavior.


Cheers.

0
mattegli 1 week ago in Scrapy Cloud 0

I am trying to deploy a (local) Portia project to Scrapinghub. After adding "slybot" to requirements.txt I can deploy successfully, but when running the spider the following error occurs:


Traceback (most recent call last):

  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spidermanager.py", line 51, in __init__
    **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spider.py", line 44, in __init__
    ((t['scrapes'], t) for t in spec['templates']
KeyError: 'templates'
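For context, slybot builds each spider from a JSON spec, and the KeyError means the spec it received has no "templates" key; that typically points to a spider deployed without any annotated sample pages. A rough sketch of the shape the traceback implies (placeholder values only, inferred from spec['templates'] and t['scrapes'] above):

{
    "start_urls": ["http://example.com/"],
    "templates": [
        {"scrapes": "default", "url": "http://example.com/sample"}
    ]
}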
0
GinVlad 1 week ago in Crawlera 0

Hello, I am running a crawler job, but it cannot receive response data after request >40.

When I run my code on localhost, it works OK.

0
BobC 1 week ago in Scrapy Cloud 0

The following URL renders fine in all browsers EXCEPT the Scrapinghub browser:

https://tenforward.social/@Redshirt27

I'd like to find out why, but no clues are given. Help?

0
Under review
Aleksandr Kurbatov 1 week ago in Crawlera • updated 1 week ago 2

Now any request to the site returns a 503 error: "Website crawl ban" or "Timeout from upstream server".

My plan is C50.


Thanks.