Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Ollie 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 3

This Q&A shows how to store items in S3: https://support.scrapinghub.com/topics/627-store-items-in-s3/


Is it possible to gzip them when storing?

Answer

Hi Ollie,


Scrapy has no built-in support for compressed item exports (yet).

But I suggest you check https://github.com/scrapy/scrapy/issues/2174. There are three code snippets there that you can try in combination with an S3 feed URI.

It would mean configuring your own feed exporter or feed storage.
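
For a concrete starting point, here is a minimal sketch along those lines (illustrative only, loosely following the wrapping approach discussed in that issue; the class, module and setting values are made up and this is not an official Scrapy feature): a feed storage that gzips everything the exporter writes before S3FeedStorage uploads it.

from gzip import GzipFile

from scrapy.extensions.feedexport import S3FeedStorage


class GzipS3FeedStorage(S3FeedStorage):
    # Sketch: compress the buffered feed file before it is uploaded to S3.

    def open(self, spider):
        self._file = super(GzipS3FeedStorage, self).open(spider)  # temp file that store() uploads
        self._gzip = GzipFile(fileobj=self._file, mode="wb")      # the exporter writes into this wrapper
        return self._gzip

    def store(self, file):
        self._gzip.close()  # flush the gzip trailer into the temp file (leaves the temp file open)
        return super(GzipS3FeedStorage, self).store(self._file)

# settings.py - register the storage for s3:// feed URIs (module path is hypothetical)
FEED_STORAGES = {"s3": "myproject.feedstorage.GzipS3FeedStorage"}
FEED_URI = "s3://your-bucket/%(name)s/%(time)s.jl.gz"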


Hope this helps,

Paul.

0
Waiting for Customer
Ollie 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 4 weeks ago 2

This error occurred during a scraper run. It reads:


Rejected message because it was too big: ITM { ... my item's data }

Is there a size limit for items? How can I work with items over the size limit?

0
Answered
Rodrigo 1 month ago in Scrapy Cloud • updated 1 month ago 2

Hi Guys,


I can run my Scrapy spider locally without any issues; however, when I try to run the job from Scrapinghub I get the following error:

exceptions.ImportError: No module named pymodm


I import using:

import pymodm


Any help is much appreciated.


Cheers



Answer

You'll need to add it as a dependency via the requirements.txt file when deploying the spider to Scrapy Cloud. For instructions on how to add the requirements.txt file, please see: http://help.scrapinghub.com/scrapy-cloud/dependencies-in-scrapy-cloud-projects
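
As a rough illustration (the project ID below is a placeholder), the two files involved look roughly like this:

# requirements.txt - list every extra package the spider imports
pymodm

# scrapinghub.yml - point the deploy at that file
projects:
  default: 12345
requirements_file: requirements.txt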

0
Fixed
shweta.kumar 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 4 weeks ago 2

When I select the "</> Download as Scrapy" option, a new tab opens with the message "A server error occurred. Please contact the administrator.", although the same does not happen with the "Download as Portia" option.

Answer

Dear Shweta,


I hope you are satisfied with our response.

Don't hesitate to ask again if you need further assistance.


Best regards,

Pablo

0
Answered
mouch 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 2

Hi,


I'm trying to upload my first Scrapy project to Scrapinghub.

Locally it is running fine but I cannot deploy it. Each time, I am stuck on

"ImportError: No module named parse"


I actually use the urllib library within my spider. It seems that the line

"from urllib.parse import urlparse, parse_qs" generates the issue.


How should I proceed then? Avoid using that library? Try uploading an egg (I saw some topics about that, but I cannot see anything related to eggs in my Scrapinghub profile)?


Thanks for your support on that one :)

Answer

Hey Mouch, glad you could find the solution.


And thanks for sharing it with our community. Your contribution helps other users with similar issues.


Kind regards,


Pablo Vaz

Support Team
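
A note for other readers hitting the same ImportError: urllib.parse only exists on Python 3, so this error usually means the spider is being run under Python 2 (the thread above does not say which fix Mouch actually used). One common, version-agnostic workaround is to fall back to the Python 2 module name; alternatively, deploying the project on a Python 3 stack avoids the issue entirely.

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2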

0
Not a bug
gianghi1985 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

I'm using the Portia trial. I can't get items, and it becomes very slow when I enable JavaScript.

Answer

Hi Gianghi!

Please check the Portia articles in our help center for more tips on how to use Portia:

http://help.scrapinghub.com/portia


About your question: some sites don't interact correctly with Portia. If you want to pursue more complex extractions, please consider using other tools like Scrapy. If you are interested, our experts can help you.


Take a minute to explore this option.


Kind regards,

Pablo

0
Answered
shweta.kumar 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

My ultimate goal is to scrape some information, like the title, from a sample page and save it along with its URL. I also want to do this using the Portia visual interface online, without having to install Portia.

Answer

Dear Shweta,


You can do this without any problems. Have you checked this section in our help center?

http://help.scrapinghub.com/portia


Kind regards,

Pablo

0
Answered
ganesh.nayak 1 month ago in Crawlera • updated by Nayak 4 weeks ago 2

Hi,


I have integrated my application with Crawlera using C#. When I make an HTTPS request I get the exception (407) Proxy Authentication Required. I have installed the certificates and am also sending the request through them.


If I send the request over HTTP using the 'x-crawlera-use-https' request header it works fine, but as per the documentation this header is deprecated. Please let me know how to make HTTPS requests without it.


I tried the same code as mentioned in the documentation, but it is still throwing exceptions:



using System.IO;
using System;
using System.Net;

namespace ProxyRequest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
            myProxy.Credentials = new NetworkCredential("<API KEY>", "");

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://twitter.com");
            request.Proxy = myProxy;
            request.PreAuthenticate = true;

            WebResponse response = request.GetResponse();
            Console.WriteLine("Response Status: "
                + ((HttpWebResponse)response).StatusDescription);
            Console.WriteLine("\nResponse Headers:\n"
                + ((HttpWebResponse)response).Headers);

            Stream dataStream = response.GetResponseStream();
            var reader = new StreamReader(dataStream);
            string responseFromServer = reader.ReadToEnd();
            Console.WriteLine("Response Body:\n" + responseFromServer);
            reader.Close();

            response.Close();
        }
    }
}


Regards,

Ganesh Nayak K

Answer

Hi Ganesh,


NetworkCredential will not work in this case. Just set your API key as a string.

In your request headers do something like:

var encodedApiKey = Base64Encode(apiKey);

request.Headers.Add("Proxy-Authorization", "Basic " + encodedApiKey);

request.Proxy = proxy;

request.PreAuthenticate = true;

Also keep the ":" appended to your API key (i.e. encode "<apikey>:") for this to work.

0
Waiting for Customer
jcdeesign 1 month ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 3

Hi

When I use curl as in the example, everything works.

If I try to use PhantomJS in Selenium, I receive a 407.

service_args = [
'--proxy=proxy.crawlera.com:8010',
'--proxy-auth=XXXXXXXXXXXXXXX:',
'--proxy-type=http',
'--load-images=false',
'--ssl-protocol=any',
'--webdriver-logfile=phantom.log',
'--ssl-client-certificate-file='+CRAWLERA_SERT,
'--ignore-ssl-errors=true',
'--webdriver-loglevel=DEBUG'
]


driver = webdriver.PhantomJS(executable_path=settings.PHANTOMJS, desired_capabilities=dcap,service_args=service_args)


I receive:

'<html><head></head><body></body></html>'


In the log:


{"name":"X-Crawlera-Error","value":"bad_proxy_auth"}


The same key works with curl.
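
(No resolution is recorded in this thread, but a workaround often suggested for bad_proxy_auth with PhantomJS is to drop --proxy-auth and send the Proxy-Authorization header yourself through PhantomJS's custom-headers capability. The sketch below is an assumption about that approach, not a confirmed fix; the key is a placeholder.)

import base64
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

CRAWLERA_APIKEY = 'XXXXXXXXXXXXXXX'  # placeholder, same key as above

service_args = [
    '--proxy=proxy.crawlera.com:8010',
    '--proxy-type=http',
    '--ssl-protocol=any',
    '--ignore-ssl-errors=true',
]

dcap = dict(DesiredCapabilities.PHANTOMJS)
# Basic credentials are base64("<apikey>:"), the same thing curl sends for -U <apikey>:
dcap['phantomjs.page.customHeaders.Proxy-Authorization'] = (
    'Basic ' + base64.b64encode((CRAWLERA_APIKEY + ':').encode()).decode())

driver = webdriver.PhantomJS(desired_capabilities=dcap, service_args=service_args)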

0
Answered
mescalante1988 1 month ago in Portia • updated by Thriveni Patil (Support Engineer) 1 month ago 1

Hello, I am doing a project and I think Portia is great!

I have a question: I am extracting data from a webpage and I want to include the category on all items I extract, but each item only has the image, price and description.

What I want to do is manually add a category.

For example now I am receiving:


[ { "image": ["urlImage" ], "description": [ "TV LED " ], "price": [ "565" ] },[ { "image": [urlImage1], "description": [ "TV1" ], "price": [ "867" ] },


I want to manually add a category called "TV" and obtain the following result:


[ { "image": ["urlImage" ], "description": [ "TV LED " ], "price": [ "565" ], "category": ["TV"] },[ { "image": [urlImage1], "description": [ "TV1" ], "price": [ "867" ], "category": ["TV"] },

Could anyone help me with this?

I only know how to work with Portia through the graphical web interface.

Thanks!

Answer

Good to know that you are liking Portia :)


To add a field to every item you can make use of the Magic Fields addon. Please refer to http://help.scrapinghub.com/scrapy-cloud/addons/magic-fields-addon to learn more about Magic Fields.
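
If you later move the project from Portia to plain Scrapy, the same result can also be achieved without the addon. The snippet below is a sketch of an ordinary item pipeline (the names are illustrative and it assumes the item accepts a 'category' field), not the Magic Fields addon itself:

# pipelines.py - attach a constant category to every item
class AddCategoryPipeline(object):
    def process_item(self, item, spider):
        item['category'] = ['TV']  # list-valued, matching the other Portia fields
        return item

# settings.py
ITEM_PIPELINES = {'myproject.pipelines.AddCategoryPipeline': 300}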


Regards,

Thriveni

0
Answered
tofunao1 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 1

I want to download a website which uses AJAX and JS, so I use Selenium and PhantomJS in Scrapy. It runs successfully on my local PC, but when I upload it to Scrapinghub it stops with some errors.

How can I solve this error, or how else can I download the JS website? Thanks.



Answer

Hi Tofunao,


I'm not familiar with Selenium or PhantomJS, but you can enable Crawlera and use it with both:


Crawlera - Selenium:

https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-selenium-and-polipo

Crawlera - PhantomJS

https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-casperjs-phantomjs-and-spookyjs


To know more about Crawlera:

https://scrapinghub.com/crawlera/


Best regards,


Pablo Vaz

Support Team

0
Answered
simon.nizov 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 4

Hi,

Is it possible to limit a job's runtime? My spider's runtime can change drastically depending on its arguments, and at some point I'd rather have the job just stop and continue to the next one.


Thanks!

Simon.
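
(The replies are not reproduced here, but one built-in option worth noting is Scrapy's CloseSpider extension, which stops the spider gracefully after a fixed number of seconds; the value below is only an example and can also be passed per job as a spider setting.)

# settings.py (or pass as -s CLOSESPIDER_TIMEOUT=3600, or as a job setting in Scrapy Cloud)
CLOSESPIDER_TIMEOUT = 3600  # close the spider after one hour of crawling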

0
jarek 1 month ago in Crawlera 0

Hello, I'm using phantomjs-node from amir20: https://github.com/amir20/phantomjs-node

and from what I read, I could enable a session if I pass true in: currentPage.on('onResourceRequested', true, onResourceRequested);


Now I'm no longer getting the error "networkRequest.setHeader is not a function". My function:



function onResourceRequested(requestData, networkRequest) {
    requestCounter += 1;
    // yellow color for the requests:
    if (printLogs) console.log('\x1b[33m%s\x1b[0m: ', '> ' + requestData.id + ' - ' + requestData.url);
    // ask Crawlera to create a session on the first request
    if (!this.crawleraSessionId) {
        networkRequest.setHeader('X-Crawlera-Session', 'create');
    }
}



BUT onResourceReceived is now not working because it's not returning the HTML data, whereas before I passed true to onResourceRequested it worked.


Any advice?



0
Fixed
Uptown Found 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

When I try to access my Portia project using Chrome, I get a blank page. Opening the Chrome Inspector shows there are several CSS and JS files that cannot be loaded (404 errors):


Answer

Hi Uptown found,


We have been doing some maintenance work; it should be working now.

Please be sure to clear your cache to avoid related issues.


Best regards,

Pablo

0
Fixed
Roney Hossain 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 month ago 4

I was testing the new Portia beta. When I run the spider it always fails, and the error message is "[root] Script initialization failed : IOError: [Errno 2] No such file or directory: 'project-slybot.zip'"

Answer

The issue has been fixed; you can now run the job from the dashboard.