Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
ganesh.nayak 14 hours ago in Crawlera 0

Hi,


I have integrated my application with Crawlera using C#. When I make an HTTPS request I get a (407) Proxy Authentication Required exception. I have installed the certificates and am sending the request through them.


If I send the request over HTTP with the 'x-crawlera-use-https' request header it works fine, but according to the documentation this header is deprecated. Please let me know how to make HTTPS requests without it.


I tried the same code as given in the documentation and it still throws the exception:



using System.IO;
using System;
using System.Net;

namespace ProxyRequest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
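            // The Crawlera API key is used as the proxy username; the password is left empty.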
            var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
            myProxy.Credentials = new NetworkCredential("<API KEY>", "");

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://twitter.com");
            request.Proxy = myProxy;
            request.PreAuthenticate = true;

            WebResponse response = request.GetResponse();
            Console.WriteLine("Response Status: " 
                + ((HttpWebResponse)response).StatusDescription);
            Console.WriteLine("\nResponse Headers:\n" 
                + ((HttpWebResponse)response).Headers);

            Stream dataStream = response.GetResponseStream();
            var reader = new StreamReader(dataStream);
            string responseFromServer = reader.ReadToEnd();
            Console.WriteLine("Response Body:\n" + responseFromServer);
            reader.Close();

            response.Close();
        }
    }
}


Regards,

Ganesh Nayak K

0
jcdeesign 16 hours ago in Crawlera 0

Hi

When I use curl as in the example, everything works.

If I try to use PhantomJS with Selenium I receive a 407:

service_args = [
'--proxy=proxy.crawlera.com:8010',
'--proxy-auth=XXXXXXXXXXXXXXX:',
'--proxy-type=http',
'--load-images=false',
'--ssl-protocol=any',
'--webdriver-logfile=phantom.log',
'--ssl-client-certificate-file='+CRAWLERA_SERT,
'--ignore-ssl-errors=true',
'--webdriver-loglevel=DEBUG'
]


driver = webdriver.PhantomJS(executable_path=settings.PHANTOMJS, desired_capabilities=dcap,service_args=service_args)


I receive:

'<html><head></head><body></body></html>'


and in the log:


{"name":"X-Crawlera-Error","value":"bad_proxy_auth"}


The same key works with curl.

0
Answered
mescalante1988 23 hours ago in Portia • updated by Thriveni Patil (Support Engineer) 17 hours ago 1

Hello, I am doing a project and I think Portia is great!

I have a question: I am extracting data from a webpage and I want to include the category on every item I extract, but from each item I only have the image, price and description.

What I want to do is manually add a category.

For example now I am receiving:


[ { "image": ["urlImage" ], "description": [ "TV LED " ], "price": [ "565" ] },[ { "image": [urlImage1], "description": [ "TV1" ], "price": [ "867" ] },


I want to manually add a category called TV and obtain the following result:


[ { "image": ["urlImage" ], "description": [ "TV LED " ], "price": [ "565" ], "category": ["TV"] },[ { "image": [urlImage1], "description": [ "TV1" ], "price": [ "867" ], "category": ["TV"] },

Could anyone help me with this?

I only know how to work with Portia through its graphical web interface.

Thanks!

Answer

Good to know that you are enjoying Portia :)


To add a field to every item you can make use of the Magic Fields addon. Please refer to http://help.scrapinghub.com/scrapy-cloud/addons/magic-fields-addon to learn more about Magic Fields.
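For illustration only - a rough sketch of what the addon's configuration could look like; the MAGIC_FIELDS setting name is taken from the addon's documentation, but the "category" field and the "TV" value are just this example's assumptions, so check the linked page for the exact syntax:

# Scrapy Cloud addon / settings sketch: append a constant "category" field
# to every item the spider produces.
MAGIC_FIELDS = {
    "category": "TV",
}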


Regards,

Thriveni

0
tofunao1 yesterday at 10:04 p.m. in Scrapy Cloud 0

I want to crawl a website that uses AJAX and JS, so I use Selenium and PhantomJS in Scrapy. It runs successfully on my local PC, but when I upload it to Scrapinghub it stops with some errors.

How can I solve this error, or how else can I crawl a JS-heavy website? Thanks.



0
Answered
simon.nizov yesterday at 8:50 a.m. in Scrapy Cloud • updated yesterday at 11:12 a.m. 2

Hi,

Is it possible to limit a job's runtime? My spider's runtime can change drastically depending on its arguments, and at some point I'd rather have the job just stop and continue to the next one.
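For reference, stock Scrapy ships a closespider extension whose CLOSESPIDER_TIMEOUT setting closes the spider after a fixed number of seconds; whether that is the recommended approach on Scrapy Cloud is exactly the open question here, so treat this as a sketch of one option rather than an official answer:

# settings.py - sketch using Scrapy's built-in closespider extension:
# finish the job gracefully once it has been running for one hour.
CLOSESPIDER_TIMEOUT = 3600  # seconds; 0 (the default) means no limit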


Thanks!

Simon.

0
jarek yesterday at 5:35 a.m. in Crawlera 0

Hello, I'm using phantomjs-node from amir20: https://github.com/amir20/phantomjs-node

and from what I read I could enable a session if I pass true in: currentPage.on('onResourceRequested', true, onResourceRequested);


and now I'm not getting the error: "networkRequest.setHeader is not a function" anymore. In my function:



function onResourceRequested(requestData, networkRequest) {
    requestCounter += 1;
    // yellow color for the requests:
    if (printLogs) console.log('\x1b[33m%s\x1b[0m: ', '> ' + requestData.id + ' - ' + requestData.url);
    // ask Crawlera to create a session on the first request only
    if (!this.crawleraSessionId) {
        networkRequest.setHeader('X-Crawlera-Session', 'create');
    }
}



BUT onResourceReceived is now not working: it no longer returns the HTML data, whereas before I passed true to onResourceRequested it worked.


Any advice?



0
Fixed
Uptown Found yesterday at 1:21 a.m. in Portia • updated by Pablo Vaz (Support Engineer) yesterday at 6:11 p.m. 1

When I try to access my Portia project using Chrome, I get a blank page. Opening the Chrome Inspector shows there are several CSS and JS files that cannot be loaded (404 errors):


Answer
Pablo Vaz (Support Engineer) yesterday at 6:11 p.m.

Hi Uptown found,


We have been doing some maintenance work; it should be working now.

Please be sure to clear your cache to avoid related issues.


Best regards,

Pablo

0
Fixed
Roney Hossain 2 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) yesterday at 5:46 a.m. 4

I was testing the new Portia beta. When I run a spider it always fails, and the error message is "[root] Script initialization failed : IOError: [Errno 2] No such file or directory: 'project-slybot.zip'"

Answer

The issue has been fixed; you can now run jobs from the dashboard.

0
Answered
Pedro Sousa 2 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 days ago 1

I have a Rails project feeding on a scheduled Scrapy Cloud spider. I need the job to run every single day. Is there a way for me to know whether the data I'm requesting through the API is recent? That is, is there a way to fetch the last-updated date of the data or, as an alternative, to receive a warning from Scrapy Cloud if something goes wrong with the job?

Answer

Hello Pedro,


You can use this API call to fetch the latest job data: http://help.scrapinghub.com/scrapy-cloud/fetching-latest-spider-data. For more information on the API please see: https://doc.scrapinghub.com/scrapy-cloud.html#scrapycloud
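As a rough illustration of that help article (the endpoint, parameters, and response fields below are assumptions based on the public Scrapinghub API rather than something verified here, so confirm them against the docs linked above), fetching the most recent finished job's metadata is a single authenticated call; a Python sketch:

import requests

APIKEY = "<your API key>"               # placeholder
PROJECT, SPIDER = "12345", "myspider"   # hypothetical project id and spider name

# Ask the jobs API for the most recent finished job of this spider.
resp = requests.get(
    "https://app.scrapinghub.com/api/jobs/list.json",
    params={"project": PROJECT, "spider": SPIDER,
            "state": "finished", "count": 1},
    auth=(APIKEY, ""),                  # API key as username, empty password
)
latest = resp.json()["jobs"][0]
print(latest)                           # job metadata, including its timestamps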

0
Not a bug
IncentFit IncentFit 3 days ago in Portia • updated by Pablo Vaz (Support Engineer) yesterday at 6:15 p.m. 1

I'm trying to scrape this Yoga Works website to get a list of their locations. Notice that it shows 43 results on the left and 50 on the right. How does that make any sense? Then when I run the job in ScrapingHub it times out after 24 hours. It's just trying to scrape one page!


Am I doing something wrong here, or is it just that buggy? The correct number of results is 43...

Answer
Pablo Vaz (Support Engineer) yesterday at 6:15 p.m.

Hey IncentFit,


Thanks for your feedback. Yes, it can be confusing: the sample count is the number of elements annotated, while the extracted items count is the number of items the extraction algorithm was actually able to extract.


We have forwarded your feedback to our Portia team. Thanks for helping us provide a more stable Portia platform.


Kind regards,


Pablo Vaz

0
Not a bug
A Rj 3 days ago in Portia • updated by Pablo Vaz (Support Engineer) 17 hours ago 2

None of these pages load any more in the Portia spider editing tool. I've tried to recreate the spider but it doesn't help. Is there anything I'm doing wrong? I've followed the tutorials step by step and am able to get the data from other similar pages, but Portia fails on these specific sites - they just don't open in the Portia spider editing tool:


1) anthonysheatingoilri.com - the scraper doesn't process HTML tags / no elements can be selected.

2) anthonysoil.com - the scraper doesn't process HTML tags / no elements can be selected.

3) the scraper doesn't process HTML tags / no elements can be selected.

4) big-oats.com - can't select the particular element on the page; nothing gets parsed (0 results).
Answer

Hi Rj and Markus,


We have been doing some maintenance tasks recently, so with our latest release Portia should be working stably.


Regarding specific domains: sometimes Portia can't handle complex components of a site and fails to extract data. Keep in mind that this tool was designed for easy and mid-size projects. If you are interested in developing more powerful extractors, you should consider using Scrapy: https://doc.scrapy.org/en/latest/intro/tutorial.html

You can then deploy your projects for free on our Scrapy Cloud.


You can also hire our experts to set up a crawler tailored to your needs. If interested, don't hesitate to use our free quote form: https://scrapinghub.com/quote


Kind regards,


Pablo Vaz

Support Team

0
Not a bug
Braulio Ríos Ferreira 3 days ago in Crawlera • updated by Pablo Vaz (Support Engineer) yesterday at 6:56 p.m. 1

The spider test code is the following (I've removed irrelevant code, but this spider is tested and reproduces the same error):


# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy.spiders import Spider

class AlohaTestSpider(Spider):
    name = "aloha_test"

    def __init__(self, *args, **kwargs):
        super(AlohaTestSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        site = 'https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList/'
        yield Request(url=site,
                      method='POST',
                      callback=self.parse,
                      headers={"Content-Type": "application/json"})

    def parse(self, response):
        print(response.body)

When I run this spider:

$ scrapy crawl aloha_test


I keep getting the following error:

2017-03-20 12:33:11 [scrapy] DEBUG: Retrying <POST https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList/> (failed 1 times): 400 Bad Request


In the original spider I have a retry decorator, and this error repeats for 10 retries.


I only get this error with this specific request. In the real spider, which makes other HTTPS requests before this one, it only fails when this request is reached (the previous HTTPS requests return 200 OK).


Please note that this is a POST request that doesn't have any body. I don't know if this is relevant, but it is the only peculiarity of this request in my spider.


If I deactivate "CrawleraMiddleware" and activate "CustomHttpProxyMiddleware" in DOWNLOADER_MIDDLEWARES (settings.py), I can make the request without error.


If I make this request using curl, I can't reproduce the error even when going through Crawlera; both of the following requests work fine:


$ curl --cacert ~/crawlera-ca.crt -H 'Content-Type: application/json' -H 'Content-Length: 0' -X POST -vx proxy.crawlera.com:8010 -U MY_API_KEY https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList


$ curl -H 'Content-Type: application/json' -H 'Content-Length: 0' -X POST https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList


I've tried everything I can think of (Crawlera sessions, Crawlera cookies disabled, different types of HTTP headers), but I can't figure out a way to get this request to work with Crawlera.


I guess it has to do with the Crawlera middleware in Scrapy, but I don't know what sort of magic Crawlera might be doing with the HTTP headers that causes this request to fail.

Any suggestions about what could be causing this error?

Answer
Pablo Vaz (Support Engineer) yesterday at 6:56 p.m.

Hi Braulio,


As you correctly tested with curl, it seems your Crawlera account is working fine.

Also, all projects using the scrapy-crawlera integration are working fine on our platform.


Regarding the integration with Scrapy, I suggest you review the information provided here:

http://help.scrapinghub.com/crawlera/using-crawlera-with-scrapy


To learn even more, please see the official documentation:

http://scrapy-crawlera.readthedocs.io/en/latest/
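For reference, the basic setup those docs describe comes down to a few lines in settings.py; a minimal sketch (the middleware order 610 is the value the scrapy-crawlera documentation suggests):

# settings.py - minimal scrapy-crawlera configuration (sketch)
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your Crawlera API key>"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}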


If your project needs urgent attention, you can also consider hiring our experts. We can set up Scrapy-Crawlera projects that fit your needs, saving you a lot of time and resources. If interested, let me invite you to fill in our free quote request: https://scrapinghub.com/quote


Best regards,


Pablo Vaz

Support team

0
Waiting for Customer
Markus 3 days ago in Portia • updated by Nestor Toledo Koplin (Support Engineer) yesterday at 11:03 a.m. 4

When I try to create a new project and then open it in Portia, I either get an error message saying that "Project 169900 not found" in Portia or a 502 Bad Gateway error message. I can see the project in the scrapinghub dashboard (https://app.scrapinghub.com/p/169900/jobs), but it's failing to open in Portia. The URL to the project in Portia is https://portia.scrapinghub.com/#/projects/169900.


Thanks for your help,

Markus

0
Answered
noa.drach-crawlera 3 days ago in Crawlera • updated by Pablo Vaz (Support Engineer) yesterday at 7:09 p.m. 1

I have the C10 subscription, and when I try to make 10 parallel calls I get a "parallel connection limit reached" error.


I dispatch the calls to Crawlera in a simple loop:


for (var index = 0; index <10; index++) {..}


When I change the loop to run 9 calls it works fine - so it's not clear to me how the limit is being reached.


I contacted you on the support chat and got this response:

"The bets way to ensure you make 10 concurrent requests and not go beyond that value is to set the concurrent_requests parameter to 10 in your crawlera settings."


This is my only Crawlera-related code:

var new_req = request.defaults({
    'proxy': 'http://<API key>:@proxy.crawlera.com:8010'
});

So it's not clear to me what "Crawlera settings" means.

Answer
Pablo Vaz (Support Engineer) yesterday at 7:09 p.m.

Hey Noa, I saw you requested support through Freshdesk and Thriveni is assisting you.

We are here for any further inquiries. Regarding the question you posted here, the recommended way to use Crawlera with Node.js is described in: https://doc.scrapinghub.com/crawlera.html#node-js

Crawlera settings are available for Scrapy projects, if you are interested in trying them:

http://scrapy-crawlera.readthedocs.io/en/latest/
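If you do move this to a Scrapy project, the "concurrent requests" knob the chat response mentions maps onto a stock Scrapy setting; a rough sketch (the value 10 simply mirrors a C10 plan's limit):

# settings.py (sketch) - keep the whole crawl at or below 10 requests in
# flight so the C10 plan's parallel-connection limit is never exceeded.
CONCURRENT_REQUESTS = 10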


Kind regards,

Pablo

0
Answered
noa.drach-crawlera 3 days ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 3 days ago 1

I want to be able to stop my code if I have reached my monthly quota - is there an API call or error code that indicates this state?

Answer

Hello,


There's currently no API for this. However, if you reach the monthly quota the account will be suspended: you will receive a 403 User Account Suspended error, and there will also be a response header X-Crawlera-Error: user_suspended.
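For anyone who wants to detect this condition in code, a minimal Python sketch (the proxy URL format - API key as username, empty password - follows the Crawlera docs; the target URL is just an example):

import requests

CRAWLERA_PROXY = "http://<API key>:@proxy.crawlera.com:8010"

resp = requests.get(
    "http://httpbin.org/ip",
    proxies={"http": CRAWLERA_PROXY, "https": CRAWLERA_PROXY},
)

# A suspended (over-quota) account answers with 403 and this diagnostic header.
if resp.status_code == 403 and resp.headers.get("X-Crawlera-Error") == "user_suspended":
    print("Monthly quota reached - stopping.")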