Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Sergey Sinkovskiy 4 years ago in Crawlera • updated 4 years ago 0


Answer
Sergey Sinkovskiy 4 years ago

No. Crawlera doesn't implement any caching.


0
Answered
Edwin Shao 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 0
Hello,

I can't seem to see the DEBUG log level in your web log, which would help me debug some problems getting my spiders working in your production environment.

For example, the following log message (that I see on my development machine) would help me:

2013-08-23 10:34:08+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'miner.spiders', 'FEED_URI': 'stdout:', 'SPIDER_MODULES': ['miner.spiders'], 'BOT_NAME': 'kites-miner-bot/0.1.0', 'ITEM_PIPELINES': ['miner.pipelines.AddressPipeline', 'miner.pipelines.GeoPipeline', 'miner.pipelines.MergesPipeline', 'miner.pipelines.HoursPipeline', 'miner.pipelines.CategoriesPipeline', 'miner.pipelines.WidgetPipeline', 'miner.pipelines.BasePipeline', 'miner.pipelines.ItemToBSONPipeline', 'scrapy_mongodb.MongoDBPipeline', 'miner.pipelines.CouchDBPipeline', 'miner.pipelines.BSONToItemPipeline'], 'USER_AGENT': 'kites-miner-bot/0.1.0 (+http://kites.hk)', 'FEED_FORMAT': 'json'}


I've already tried setting LOG_LEVEL to 'DEBUG' in settings.py. Is there anything else I should do?


Answer

You need to set LOG_LEVEL = DEBUG in Settings -> Scrapy settings.


The default log level on Scrapy Cloud is INFO.
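
For local runs, the equivalent line in your project's settings.py would be the following minimal sketch (on Scrapy Cloud, the dashboard setting above is what takes effect):

# settings.py -- local runs; on Scrapy Cloud use Settings -> Scrapy settings instead
LOG_LEVEL = 'DEBUG'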

0
Answered
Edwin Shao 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 2

I am using scrapy_mongodb to store scraped items into my Mongo database. Everything works fine on my development machine, but when I deploy to Scrapinghub, I am having a hard time configuring the MONGODB_DATABASE setting that it depends on.


Regardless of whether I put in a project or spider override, it keeps using the MONGODB_DATABASE that is in settings.py. Why is this?

Answer

Hi, Edwin,


The current behaviour is: spider settings have the highest priority, then project settings, then the settings in the settings.py file, so it should work that way. If it does not, I cannot say why without knowing what your code does. Are you sure your code is reading the MONGODB_DATABASE setting and not a hard-coded value?
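
As an aside, here is a minimal sketch of a pipeline that reads the setting at runtime instead of hard-coding it (illustrative only, not scrapy_mongodb's actual code; MONGODB_URI is an assumed setting name):

import pymongo

class MongoPipeline(object):
    # Illustrative pipeline: reads MONGODB_DATABASE from the crawler settings,
    # so spider/project overrides on Scrapy Cloud take effect.

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGODB_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        return item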

0
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 2 years ago 0


Answer

Even when no code changes are made, jobs can run slower depending on how busy their assigned cloud server is.


This variability can be reduced by purchasing dedicated servers. Check the Pricing page, and contact sales@scrapinghub.com to request them.

0
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 4 years ago 0

So that, given a range, we always obtain the same set of data.

Answer

Job items are always kept in order (the order in which they were extracted and stored). The same is true in the API, even when you filter or request ranges.
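
For example, fetching the same range twice should return the same items in the same order. A hedged sketch follows; the endpoint and the offset/count parameter names are assumptions, so check the API docs for the exact ones:

import requests

# All of these values and parameter names are illustrative assumptions.
APIKEY = 'YOUR_API_KEY'
url = 'https://dash.scrapinghub.com/api/items.json'
params = {'project': '889', 'job': '889/1/1', 'offset': 100, 'count': 50}

first = requests.get(url, params=params, auth=(APIKEY, '')).json()
second = requests.get(url, params=params, auth=(APIKEY, '')).json()
assert first == second  # same range => same items, same order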

+1
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 4 years ago 0


Answer

The Scrapy process gets the stop signal within a hundred milliseconds. It does a graceful stop, which means it finishes pending HTTP requests, flushes items and logs, etc., which takes some time. Probably most of the time people killing jobs don't care, and we could provide a quick-kill mechanism.
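
As an illustrative sketch (names here are examples, not platform API), a spider can hook Scrapy's spider_closed signal to flush its own state during that graceful stop:

import scrapy
from scrapy import signals

class GracefulSpider(scrapy.Spider):
    name = 'graceful_example'
    start_urls = ['http://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(GracefulSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)
        return spider

    def on_closed(self, spider, reason):
        # reason is 'shutdown' on a graceful engine stop (e.g. SIGTERM)
        self.logger.info('Spider closed (%s), flushing state', reason)

    def parse(self, response):
        pass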

0
Answered
Edwin Shao 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 0
I don't see an option to do so in the web UI.

When I try to deploy, I get the following error:

deshao-mbp (1)~/miner> bin/scrapy deploy
Packing version 1376539087
Deploying to project "889" in http://dash.scrapinghub.com/api/scrapyd/addversion.json
Deploy failed (400):
{"status": "badrequest", "message": "Duplicated spider name: there is an autospider named 'burgerking'"}


Thus, I'd like to delete the autospider named 'burgerking'.

Answer

You have to go to the Autoscraping properties of the spider (the red 'Autoscraping' button), and there you have a 'Delete' button.

0
Not a bug
Dominici 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 2

Hello, 

I have tried to export my "completed job".

But when I click on "CSV", a new window opens and nothing happens.

Is that a bug?

0
Answered
Max Kraszewski 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 3

I can't figure out what is wrong, but when I attempt to download scraped items in CSV format, it redirects me to a blank page with the following message:

Need to indicate fields for the CSV file
in the request parameter fields
I think I'm doing something wrong, but could you help me? Thanks in advance.

Answer

That is because that job does not have any items (you can see the items counter at 0). If you do the same on other jobs, you will get a CSV.

0
Answered
Conor Lee 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 2

I'm using the new interface, dash.scrapinghub.com, and I can't get the autoscraper to work using my templates. In the inspection view (box icon) in the template builder, everything is listed under base and seems to work. When I run the spider, nothing scrapes.

0
Answered
hsantos 4 years ago in Scrapy Cloud • updated by korka 1 month ago 3

I want to know how we can download JPEG images in Scrapinghub, as we use PIL (Python Imaging Library) in custom scraping.

Answer

Hi,


What you must do instead is upload the images to a storage of your own; we don't store images. We usually use the Amazon S3 service for that. Are you talking about Autoscraping? If not, you can find help here:

http://doc.scrapy.org/en/0.16/topics/images.html
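
For a regular (non-Autoscraping) Scrapy project, a minimal settings.py sketch of the Images Pipeline writing to your own S3 bucket would look like this (bucket name and credentials are placeholders; the pipeline path follows the Scrapy 0.16 docs linked above):

# settings.py sketch -- bucket name and credentials are placeholders
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = 's3://your-bucket/images/'   # downloaded images are uploaded to your own S3 storage
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'
# Items must expose an 'image_urls' field; download results are stored in an 'images' field.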

0
Answered
Benkert Johannes 4 years ago in Scrapy Cloud • updated by Paul Tremberth (Engineer) 2 years ago 11
Hi everyone,


I'm trying to set up a connection with PHP and cURL.
My function looks like this:


$host = 'http://dash.scrapinghub.com/api/schedule.json';
$curl = curl_init();
curl_setopt_array($curl, array(
   CURLOPT_RETURNTRANSFER => 1,
   CURLOPT_HEADER => 1,
   CURLOPT_URL => $host,
   CURLOPT_USERAGENT => 'Penis',
   CURLOPT_POST => 1,
   CURLOPT_POSTFIELDS => array(
       'project' => 'projectid',
       'spider' => 'spidername'
   ),
   CURLOPT_USERPWD => "apikey")
);
$resp = curl_exec($curl);
curl_close($curl);
echo $resp;


The connection is working. If I change the apikey I get:
{"status": "error", "message": "Authentication failed"}

If I change the project ID or leave it empty:
{"status": "badrequest", "message": "invalid value for project: asdkkj"}



But if everything is correct, nothing is returned. It's just empty and nothing gets scheduled. The same goes for GET requests.


Can somebody help me find my mistake?


Thank you very much!


Greetings from Germany
Johannes
Answer

With curl you have to use the -L option in order to follow redirects; in PHP's cURL bindings the equivalent is setting the CURLOPT_FOLLOWLOCATION option to 1.


Also check this link for PHP curl


http://stackoverflow.com/questions/3519939/make-curl-follow-redirects


+11
Completed
Michael Bernstein 4 years ago in Portia • updated by Tomas Rinke (Support Engineer) 5 months ago 7

When looking at the pages that a spider has produced, it would be useful to only see those that have not had any items extracted, to more easily identify pages that need new templates defined.

Answer

Currently you can check on the Requests tab, under the items field, whether a request extracted an item or not.

Request with items: (screenshot)

Request with no items: (screenshot)


0
Answered
Michael Bernstein 4 years ago in Scrapy Cloud • updated by Nicolas Ramírez 4 years ago 5

Hi. I am trying to crawl a website that sometimes gives a '503 Service Unavailable' response. Most of these are resolved on subsequent requests, but a few manage to fail 3 times and the crawler gives up. I would like a setting to increase the number of retries to 5 for particular spiders.

Answer
Nicolas Ramírez 4 years ago
You can use RETRY_TIMES, but I recommend you use AutoThrottle, because increasing the retries may not solve the problem and can even aggravate it.
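
For reference, a settings.py sketch (values are illustrative; these can also be set as per-spider setting overrides in the dashboard):

RETRY_ENABLED = True
RETRY_TIMES = 5                               # default is 2 retries after the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # 503 is already in the default list
AUTOTHROTTLE_ENABLED = True                   # back off automatically instead of retrying harder
AUTOTHROTTLE_START_DELAY = 5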


0
Answered
Umair Ashraf 4 years ago in Scrapy Cloud • updated by Oleg Tarasenko (Support Engineer) 3 years ago 6
I am working on a project where I have 2 separate Scrapyd deploy configs, and I want to send separate spiders to each of them, divided into 2 folders (spiders and store_spiders).

From the project's settings.py I can set SPIDER_MODULES to include both folders, but then both folders will go to both Scrapyd configs. Is there a way to set SPIDER_MODULES under a Scrapyd deploy config, so that only the specified folder goes to the server it deploys to?
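
For reference, the setup described looks roughly like this (target names, module paths and project IDs are illustrative):

# scrapy.cfg -- two deploy targets
[settings]
default = myproject.settings

[deploy:main]
url = http://dash.scrapinghub.com/api/scrapyd/
project = 11111

[deploy:store]
url = http://dash.scrapinghub.com/api/scrapyd/
project = 22222

# settings.py -- both spider folders are listed here, so both ship with every deploy
SPIDER_MODULES = ['myproject.spiders', 'myproject.store_spiders']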