Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
Jean Maynier 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 2
When do you expect to replace the current Autoscraping with Portia?
Answer
In the meantime, we will soon provide a utility to import projects made with Portia into Scrapinghub.
0
Answered
Yu Paul 3 years ago in Crawlera • updated by Sergey Sinkovskiy 3 years ago 2
I just signed up for Crawlera, but I don't see how I can pay for the service.
0
Answered
Mirosław Sadowski 3 years ago in Crawlera • updated by Mark Farrar 2 years ago 2
Hi!
I registered on Mashape and I'm trying, for example, to fetch the page http://site.pl by means of the following PHP script:

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://MY_NICK:MY_PASS@api.crawlera.com/fetch?url=http://site.pl');
print_r(curl_exec($curl));
curl_close($curl);

In response I get only 1.
What am I doing wrong?
How do I fetch the page in PHP?
0
Answered
Beto Lopez 3 years ago • updated by Pablo Hoffman (Director) 1 year ago 1
I'm currently a user of Scrapinghub, and I have 2 years of experience scraping big sites.

I can reply to tickets in English, Spanish and German. I would love to work on the support team of Scrapinghub.

Where can I send a message to apply? Thank you.
0
Answered
96levels 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 3
I don't understand why, when scraping more than 50k pages from a site without using Crawlera, I don't get banned.

I ordered Crawlera, but I'm not sure if I really need it, since I'm not getting banned.

Thoughts?
0
Answered
Guyi Shen 3 years ago in Crawlera • updated by Andrés Pérez-Albela H. 3 years ago 2
I'm trying to test out how to consume the said API from Scrapinghub, as I need to crawl an HTTPS website. However, the usage docs seem to only cover the non-Mashape options.

The question is: will the Crawlera middleware work the same even if I'm using the Crawlera Mashape API? If not, how do I do this?
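For reference, the non-Mashape setup mentioned in the docs enables the middleware through the project settings. Below is a minimal sketch, assuming the scrapy-crawlera middleware; the API key is a placeholder, and the Mashape credentials and exact setting names may differ:

# settings.py (sketch, not taken from the official docs)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'YOUR_CRAWLERA_APIKEY'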
0
Answered
Caio Iglesias 3 years ago in Crawlera • updated by Martin Olveyra (Engineer) 3 years ago 1
Do you support GET/POST and other methods?
Can I send POST/GET parameters with my request?
0
Fixed
arunes 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 1
When I try to retrieve scraped items from a job, I receive this error:

503 Service Unavailable

No server is available to handle this request.
Answer
Hi arunes,

We are experiencing some problems in our infrastructure and are working on them.

Please check this link for the status of our components:

http://status.scrapinghub.com/
0
Answered
awilliams 3 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 3 years ago 3
I have copied the text and added it to my scrapy.cfg file, and I have uploaded an egg of psycopg2, but when I run scrapyd-deploy I get the following error:

ImportError: No module named psycopg2
ERROR:root:Script initialization failed
Answer
Hi awilliams,

The psycopg2 egg contains nothing. Also, I don't think it is possible to build a valid egg with this library: it is not pure Python, so it needs platform-dependent C PostgreSQL libraries.
0
Answered
boganr 3 years ago in Portia • updated by Andrés Pérez-Albela H. 3 years ago 22
I'm doing a trial run of the software. I set up a spider that was running perfectly fine earlier. I went to run it again and now the job completes in about 16 seconds and returns only 1 page. Is there something that I need to modify to get this to run like it was before?
0
Answered
Rodolpho Ramirez 3 years ago in Portia • updated by Martin Olveyra (Engineer) 3 years ago 0
Hi. I am trying to set up the images add-on with Autoscraping, and I am struggling with 3 things:
1. S3 is up, but I can't figure out how to find this info: s3://<bucket name>/<base path>/ in Amazon's control panel. I've manually uploaded an image, and when I click on it I get this URL: https://s3-sa-east-1.amazonaws.com/cookpedia/1.jpg...
Can I get this info from there?
2. I can't find in the Autoscraping settings where to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If I click on Settings/Scrapy Settings and look for the images add-on, clicking on the + only shows the other options like IMAGES_EXPIRES and so on...
3. I also seem to be unable to find the AWS access key and AWS secret access key; is that the info in the link above (to the image)?

Thanks a lot :)
Answer
Hi Rodolpho.

If you have already uploaded an image, then you have used an S3 bucket that you created (you can create as many buckets as you want, although their names are global, so you cannot create a bucket with the same name as another bucket created by anyone else in the S3 cloud).

The base path can be anything you want. It is just for classifying your data inside a bucket, like a folder. But you don't need to create it, because folders are not really such in S3: they are just file name prefixes, and the images add-on will include the prefix in the name of the file it creates for each uploaded image.

The AWS keys must be created from the AWS control panel (if none were already created by default). They are needed for accessing your storage from outside. You should probably read an AWS S3 tutorial to better understand how it works. Check, for example:

http://www.hongkiat.com/blog/amazon-s3-the-beginne...

Regarding question 2, the AWS_ settings are in the list of general settings because they are not specific to the images add-on; other components may also need S3 storage.
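To illustrate, the relevant settings might look roughly like this (the bucket name, base path and keys below are placeholders, not values from your account):

IMAGES_STORE = 's3://my-bucket/my-base-path/'    # s3://<bucket name>/<base path>/
AWS_ACCESS_KEY_ID = 'YOUR_AWS_ACCESS_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET_ACCESS_KEY'
IMAGES_EXPIRES = 90    # optional: days before an already downloaded image expires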

If you need more interactive help from us, you can use the support chat http://www.hipchat.com/gJog3cSUL
0
Answered
Virgil 3 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 3 years ago 3
Hi,
I made a crawler using Scrapy. The problem is that the sitemap URLs (in sitemap index files) end with ".xml.gz" but are not actually gzipped (and when Scrapy fetches them and attempts to gunzip them, it raises an error and refuses to continue).

I made a modified version of Scrapy that fixes this problem, but can I upload my own Scrapy library to your cloud? (If yes, how? If not, do you have any suggested fix for my problem?)

Thanks.
Answer
Hi Virgil,

Actually, Scrapy tries to gunzip based on the response header, not the file name. Aside from that, the decompression is handled by the HttpCompression middleware:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/httpcompression.py

which can be disabled with the setting COMPRESSION_ENABLED=0
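For example, a minimal sketch of how that setting could go in the project's settings.py:

# settings.py (sketch)
COMPRESSION_ENABLED = False    # disables the HttpCompression middleware linked above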
+1
Declined
Paul Tremberth (Engineer) 3 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 1 year ago 1

When on the go, one sometimes wants to check the status of long-running spider jobs and other periodic jobs.


The current Scrapinghub site on smartphones or tablets makes links and buttons quite small and difficult to select.


It would be cool to have a stripped-down version of Scrapinghub that makes better use of small(er) screens, for example by moving tall menus and advanced options to top-bar menus or other off-canvas techniques (http://foundation.zurb.com/docs/components/offcanv...).

Answer

There are no plans to have a mobile-specific version of the Scrapinghub dashboard, although its fluid/responsive layout should make it render acceptably in modern mobile browsers.

0
Answered
Vladimir 3 years ago in Crawlera • updated by Andrés Pérez-Albela H. 3 years ago 2
Can I choose region of IPs that will be used in Crawlera?
0
Answered
drsumm 3 years ago in Portia • updated by Pablo Hoffman (Director) 1 year ago 2
I have a large custom list of start_urls (in the lakhs) which I can't paste into the start form for Autoscraping. The form hangs due to the huge number of URLs. So how can I use a spiderlet to do this? Is there another way?
Answer
Hi drsumm. You already asked this and I have already answered:

http://support.scrapinghub.com/topic/411741-large-number-of-start-urls/