Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
Conor Lee 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 2

I'm using the new interface, dash.scrapinghub.com, and I can't get the autoscraper to work with my templates. In the inspection view (the box icon) in the template builder, everything is listed under "base" and seems to work. When I run the spider, nothing is scraped.

0
Answered
hsantos 4 years ago in Scrapy Cloud • updated by korka 4 months ago 3

I want to know how we can download JPEG images in Scrapinghub, as we use PIL (Python Imaging Library) in custom scraping.

Answer

Hi,


What you should do instead is upload the images to storage of your own; we don't store images. We usually use the Amazon S3 service for that. Are you talking about autoscraping? If not, you can find help here:

http://doc.scrapy.org/en/0.16/topics/images.html
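
For reference, here is a minimal sketch of what that setup can look like with Scrapy's images pipeline and S3 storage. It assumes a recent Scrapy version (the module path differs in older releases such as the 0.16 docs linked above), and the bucket name and AWS credentials are placeholders:

# settings.py -- sketch only; bucket and credentials are placeholders
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 's3://my-bucket/images/'
AWS_ACCESS_KEY_ID = 'YOUR_AWS_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET'

# items.py -- the pipeline downloads every URL in image_urls and records results in images
import scrapy

class ProductItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()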

0
Answered
Benkert Johannes 4 years ago in Scrapy Cloud • updated by Paul Tremberth (Engineer) 3 years ago 11
Hi everyone,


I'm trying to set up a connection with PHP and cURL.
My code looks like this:


$host = 'http://dash.scrapinghub.com/api/schedule.json';
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_HEADER => 1,
    CURLOPT_URL => $host,
    CURLOPT_USERAGENT => 'my-scheduler-client',
    CURLOPT_POST => 1,
    CURLOPT_POSTFIELDS => array(
        'project' => 'projectid',
        'spider' => 'spidername'
    ),
    CURLOPT_USERPWD => 'apikey'
));
$resp = curl_exec($curl);
curl_close($curl);
echo $resp;


The connection is working. If I change the API key I get:
{"status": "error", "message": "Authentication failed"}

If I change the project ID or leave it empty:
{"status": "badrequest", "message": "invalid value for project: asdkkj"}



But if everything is correct, nothing is returned: the response is just empty and nothing is scheduled. The same happens for GET requests.


Can somebody help me find my mistake?


Thank you very much!


Greetings from Germany
Johannes
Answer

With curl you have to use the -L option in order to follow redirects (in PHP's cURL, the equivalent is setting the CURLOPT_FOLLOWLOCATION option to true).


Also check this link for PHP cURL:


http://stackoverflow.com/questions/3519939/make-curl-follow-redirects
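
For comparison, here is a minimal sketch of the same scheduling call in Python with the requests library, posting to the https endpoint directly so there is no redirect to follow (that the redirect is the usual http-to-https one is an assumption; the project ID, spider name, and API key are placeholders, as in the PHP snippet above):

import requests

resp = requests.post(
    'https://dash.scrapinghub.com/api/schedule.json',  # https directly; assumed to avoid the redirect
    data={'project': 'projectid', 'spider': 'spidername'},  # placeholders
    auth=('apikey', ''),  # API key as the username, empty password
)
print(resp.status_code, resp.text)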


+11
Completed
Michael Bernstein 4 years ago in Portia • updated by Tomas Rinke (Support Engineer) 8 months ago 7

When looking at the pages that a spider has produced, it would be useful to only see those that have not had any items extracted, to more easily identify pages that need new templates defined.

Answer

Currently you can check the "items" field on the Requests tab to see whether a request extracted an item or not.

(The original answer included screenshots comparing a request with items and a request with no items.)


0
Answered
Michael Bernstein 4 years ago in Scrapy Cloud • updated by Nicolas Ramírez 4 years ago 5

Hi. I am trying to crawl a website that sometimes gives a '503 Service Unavailable' response. Most of these are resolved on subsequent requests, but a few manage to fail 3 times and the crawler gives up. I would like a setting to increase the number of retries to 5 for particular spiders.

Answer
Nicolas Ramírez 4 years ago
You can use RETRY_TIMES, but I recommend AutoThrottle, because increasing the retries may not solve the problem and can even aggravate it.
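
A sketch of how both options could be set for one particular spider (the spider name is hypothetical, and per-spider custom_settings requires a reasonably recent Scrapy; note that 503 is already in the default RETRY_HTTP_CODES):

import scrapy

class FlakySiteSpider(scrapy.Spider):
    name = 'flaky_site'  # hypothetical spider
    start_urls = ['http://example.com/']

    custom_settings = {
        'RETRY_TIMES': 5,              # retry up to 5 times instead of the default 2
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically when the site returns errors
        'AUTOTHROTTLE_START_DELAY': 1.0,
        'AUTOTHROTTLE_MAX_DELAY': 30.0,
    }

    def parse(self, response):
        pass  # extraction logic goes here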


0
Answered
Umair Ashraf 4 years ago in Scrapy Cloud • updated by Oleg Tarasenko (Support Engineer) 3 years ago 6
I am working on a project where I have 2 separate Scrapyd deploy configs, and I want to send separate spiders to each of them, divided into 2 folders (spiders and store_spiders).

From the project's settings.py I can set SPIDER_MODULES to include both folders, but then both folders go to both Scrapyd configs. Is there a way to set SPIDER_MODULES under a Scrapyd deploy config, so that only the specified folder goes to the server being deployed to?
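
For context, the settings.py setup described above would look roughly like this (the project package name is a placeholder):

# settings.py -- both spider folders are registered, so both get deployed to every target
SPIDER_MODULES = ['myproject.spiders', 'myproject.store_spiders']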
0
Answered
Jean Maynier 4 years ago in Portia • updated by Paul Tremberth (Engineer) 3 years ago 5
I would like to update the start URLs for an autoscraping spider. I tried:


curl http://dash.scrapinghub.com/api/schedule.json -d project=155 -d spider=myspider -u <your api key>: -d start_urls="$(cat start_urls.txt)"

but it schedules a job and only uses those start URLs for that execution. My AS spider is a periodic job, and I want my start URLs to persist. Is that possible?
Thanks

Answer

Hi, Jean,

You can edit the start URLs of a spider at any moment by editing the spider's autoscraping properties in the panel. At the top right of the panel there is a red button to view a spider's autoscraping properties, and once there, a red button to edit them.

The URL you used is not for that. As you said, it is just for scheduling a spider and setting the start URLs for that job only.


0
Fixed
Jean Maynier 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 2

One of my periodic jobs stopped working for the last 6 days without notice. It appears that a running job was still active for those 6 days (usually the job takes a few minutes to complete).

0
Declined
Serge 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 7

Hello to ScrapingHub Team !

Making my first tests, I've seen that the spider has no INCLUDE pattern option (for URLs).

I suppose that, following scraping logic, a user will have examined the website they plan to spider and scrape.

I also suppose that in 95% of cases the pages to be scraped follow a single URL standard on a website.

Let's say all products will usually be under a /product/*.html URL pattern.


So it seems logical to have an INCLUDE pattern option for spidering, meaning the spider goes through all pages but collects and scrapes only addresses whose URL contains

/product/

but ignores all others like

/contact/

/aboutus/

/news/

and so on...


It's easier to adjust in the spider settings and maybe even better for spider speed.

I would be glad to know your opinions about this feature.

Have a nice day !

Answer

Hi Serge,


You already have that: in "Links to follow", you have the option "Follow links that matches the following patterns".


Also check the documentation:


http://help.scrapinghub.com/autoscraping.html#url-filters
0
Answered
Pavel Liubinski 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 0

I have scraped a site and set up the template. On the "Items" page I see many scraped pages. Many of them have only "url" and "body" fields. There are also pages with the scraped items that I set up in the template. How can I export the data with the structured items only?

There is an image attached to the post explaining what I want.


Answer

Check this thread:


http://support.scrapinghub.com/topic/171296-scraped-items-dont-show-up-in-json-only-url-body-and-cookies/


(Note that you can search for similar questions from other users before asking a new one.)

+3
Answered
Pavel Liubinski 4 years ago in Portia • updated by Paul Tremberth (Engineer) 3 years ago 38

I would like to parse, every day, a site about upcoming concerts in my town. For example this page: http://www.samru.ru/?module=article&action=showAll&id=198&subrazdel_id=41

All concerts are simply listed in a table; there is no separate page for each concert. I would like each concert to be an item in autoscraping. How should I set up autoscraping for scraping lists of items?

Thank you


Answer
At the moment the method is indirect. You annotate the products as variants:


http://help.scrapinghub.com/autoscraping.html#variants


and then use a post-processor to split the variants into separate products (we can deploy a split-variants post-processor to your project).


In the future we will allow annotating separate products directly.

+2
Answered
Dimitry Izotov 4 years ago in Portia • updated by Pablo Hoffman (Director) 1 year ago 11

Hi, I have a website that sells 10,000 parts; they all have part numbers, and I would like to scrape additional details for every part. Can I feed in a .csv file with all the parts and return defined fields (image, description, weight, price, etc.)? I cannot seem to find an option to scrape from a list...

Answer

We are thinking of allowing a URL to contain the start URLs to seed the crawl. I guess we could extend this idea to allow the URL to point to a CSV document and have a pattern to build URLs from it, e.g. http://mysite.com/part-{0} would create start URLs from the first field in the CSV.
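
In the meantime, the same idea can be sketched in a regular (non-autoscraping) Scrapy spider, reusing the pattern from the answer above; the spider name and CSV filename are hypothetical:

import csv
import scrapy

class PartsSpider(scrapy.Spider):
    name = 'parts'  # hypothetical spider

    def start_requests(self):
        # Build one start URL per row, from the first CSV field, using the pattern above
        with open('parts.csv') as f:
            for row in csv.reader(f):
                yield scrapy.Request('http://mysite.com/part-{0}'.format(row[0]))

    def parse(self, response):
        # Extract image, description, weight, price, etc. here
        pass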

0
Thanks
Juan Catalano 4 years ago in Crawlera • updated by Martin Olveyra (Engineer) 4 years ago 0

It really helps me throttle my requests and avoid being banned by servers. It's incredibly easy to use and totally transparent! Bravo!

0
Answered
Rodolpho Ramirez 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 2

Hi guys.


First of all, Autoscraping is awesome. After learning the basic concepts, it has everything one needs to scrape the hell out of the web.


I am running a spider on a website, and the coder on the other side forgot to prefix some external links with "http://" on some pages (there are over 13,000 pages/products on that website).


So what happens is that the spider interprets it as an internal link, like this:

http://www.beeingscrapedwebsite.com/www.externallink.com


The problem is that this renders (loads) the exact same webpage, with the same broken link, which the spider again interprets as an internal link, and so it tries to scrape the following page:


http://www.beeingscrapedwebsite.com/www.externallink.com/www.externallink.com


And that goes on in an infinite loop until the spider stops for "few items scraped".


I would like to know if there is a way to stop the spider from doing that.


Thanks in advance.


Rodolpho

Answer
Hi, Rodolpho.

You can add the pattern

www.externallink.com

into the excluded-patterns property of the spider.

0
Answered
Umair Ashraf 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 4

Can we send signals in our spider?


For example, there is a case where I want to collect data from different pages, but only one item should be returned when spider_idle occurs.
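
A minimal sketch of hooking the spider_idle signal from inside a spider, assuming a reasonably recent Scrapy; the spider name and URLs are hypothetical. Note that a signal handler cannot yield items directly, so a common approach is to schedule one final request there whose callback yields the single aggregated item:

import scrapy
from scrapy import signals

class AggregatingSpider(scrapy.Spider):
    name = 'aggregate'  # hypothetical spider
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super(AggregatingSpider, self).__init__(*args, **kwargs)
        self.collected = []  # data accumulated across pages

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(AggregatingSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Connect a handler to the spider_idle signal, fired when no requests are pending
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def parse(self, response):
        # Collect data from each page instead of yielding an item per page
        self.collected.append(response.url)

    def on_idle(self, spider):
        # Runs when the crawl has nothing left to do; aggregate the collected data here
        # (e.g. schedule one last request whose callback yields the combined item).
        self.logger.info('Collected data from %d pages', len(self.collected))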