Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
EdoPut 2 days ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 days ago 1

Is there any limit on writing files to disk? After uploading all my items to an ephemeral collection, I would like to move them to a CSV file and upload it to an external service (say S3).

Answer

Hey EdoPut,

The storage limit depends on your account and the number of containers, but you can check this article on how to export your items to S3:

http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode
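For reference, on the Scrapy side this kind of export boils down to the feed-export settings below. This is only a sketch: the bucket name and credentials are placeholders, and the article above covers doing the same from the UI.

# settings.py - sketch: export items as CSV directly to S3 via Scrapy feed exports
FEED_FORMAT = 'csv'
FEED_URI = 's3://your-bucket/items/%(name)s/%(time)s.csv'  # placeholder bucket
AWS_ACCESS_KEY_ID = 'YOUR_AWS_ACCESS_KEY'       # placeholder credentials
AWS_SECRET_ACCESS_KEY = 'YOUR_AWS_SECRET_KEY'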

Kind regards,

Pablo

0
Answered
ghostmou 2 days ago in Scrapy Cloud • updated yesterday at 1:58 a.m. 2

Hi :)


I have been working with Scrapinghub with the goal of automating how we build some reports for our customers. I am extremely happy with the results!! :))


Currently I am trying to store the number of inlinks and outlinks of each URL across a collection of websites (inlinks being the number of links pointing to each item; outlinks being the number of links detected on each item).


The outlinks are very easy to store, as you only have to store the number of links captured by the link extractor, but I have no idea how to store the inlinks of each item.


On my own machine, I have created a dict keyed by URL in which I increment the number of links pointing to each URL. Then, using a pipeline, I add the information to the items. It works perfectly and is very easy to accomplish :).
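A minimal sketch of that approach (the spider, pipeline and field names here are made up for illustration):

# sketch: count inbound links per URL in the spider, then copy the counts onto items
from collections import defaultdict

import scrapy
from scrapy.linkextractors import LinkExtractor


class LinkCountSpider(scrapy.Spider):
    name = 'linkcount'
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super(LinkCountSpider, self).__init__(*args, **kwargs)
        self.inlinks = defaultdict(int)   # url -> number of pages linking to it
        self.link_extractor = LinkExtractor()

    def parse(self, response):
        links = self.link_extractor.extract_links(response)
        for link in links:
            self.inlinks[link.url] += 1
            yield scrapy.Request(link.url, callback=self.parse)
        yield {'url': response.url, 'outlinks': len(links)}


class InlinkPipeline(object):
    # enable via ITEM_PIPELINES in settings.py
    def process_item(self, item, spider):
        # copy the running inlink count for this URL onto the item
        item['inlinks'] = spider.inlinks.get(item['url'], 0)
        return item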


However... how can I do this on Scrapinghub? Is there a way to add information through a pipeline or in the close_spider method and still be able to request the items through the Items API? Or should I consider using the pipeline to send the results to another server (S3, FTP, or similar)?


Thank you for your help!!

Answer

Hey Ghostmou!

First, thanks for your kind and constructive feedback; it's great to know you are pleased with the results you're getting on the Scrapinghub platform.


Regarding your inquiry, I can think of two options. First, you could try the Magic Fields add-on, which allows you to automatically add extra fields to each scraped item:

http://help.scrapinghub.com/scrapy-cloud/addons/magic-fields-addon


Second, since you mention using a pipeline to send results to S3, you could consider exporting your items as described in this article: http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode


Hope you find this information useful,

Kind regards!

Pablo

0
Answered
trustyao 3 days ago in Scrapy Cloud • updated 3 days ago 3

hello, sir


I have written a Scrapy spider. There are a lot of URLs to be visited in this job.


I didn't use the items module; I just insert the records into a MySQL database directly in the 'parse' function. I don't want duplicate records in my database, and the record at the same page URL may change, so every time I need to select the old data from the database by a unique key before I insert the new data. If the current data is not the same as the historical data, the old row is deleted and the new one is inserted. But this way the spider takes too much time to run.

Then I found the DeltaFetch add-on, which seems useful to me. But the documentation at https://github.com/scrapy-plugins/scrapy-deltafetch is only a very brief introduction, and googling didn't turn up much more. Can you help me find more detailed documentation?

And will DeltaFetch help in my case? Can the add-on filter by the whole record instead of just the request URL?


Answer

Hi Trustyao!

I think you'll find it useful to enable the DeltaFetch add-on for your purposes. We have written a blog post sharing some tips about this add-on:

https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016/

Also remember to enable the DotScrapy Persistence add-on for DeltaFetch to work, as described here:

http://help.scrapinghub.com/scrapy-cloud/addons/delta-fetch-addon
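For a local Scrapy project, enabling it typically comes down to a couple of settings (a sketch based on the scrapy-deltafetch README; on Scrapy Cloud you would instead toggle the add-ons in the Addons page):

# settings.py - sketch: enable DeltaFetch for a local project
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True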

Regarding your last questions: DeltaFetch only filters out requests for URLs that have already yielded items, so it works at the level of request URLs rather than whole records. Requests to URLs that haven't yielded items will still be revisited in subsequent crawls, and start URLs will also be revisited.

Hope this information can be useful.

Kind regards,

Pablo

0
Answered
jbothma 5 days ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 days ago 2

I can't tell what's using too much memory, and the stats page doesn't show that I went near the 1 GB limit of the free tier.

Answer

I think I found the issue shortly after posting:


The site I'm scraping added a really, really large web page which the spider tried to download and parse. I didn't see it in the log; perhaps page requests are only logged at debug level and I didn't think of lowering the log level. In the end I found the culprit by looking at the last few requests on the Requests tab of a few crawls that exited with this reason. This new big page was in the last few on each.


In my case I could solve it by completely ignoring that page.
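One way to guard against this kind of runaway page, besides ignoring the URL, is Scrapy's built-in response size limits; a sketch, with thresholds chosen arbitrarily:

# settings.py - sketch: cap response sizes so a single huge page can't exhaust memory
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # abort downloads larger than 10 MB
DOWNLOAD_WARNSIZE = 1 * 1024 * 1024   # log a warning for downloads above 1 MB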

0
Answered
Seok-chan Ahn 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 6 days ago 2

Hi, I am using Scrapy Cloud

I want to customize my Slack Incoming WebHooks to get error information during execution of jobs in scrapy cloud.

Could I get error notifications in real time or after job completion? I'd like some kind of polling, webhooks, or something similar.


Answer

Hey Seok, here are more updates for you from our developers.

Regarding Notifications, they are probably not the right option here, due to the limitation that comments are added by people rather than by the system.

The Jobs API only allows probing the status of a job, so if you want to know what's happening with a job, you currently have to poll regularly, which is the opposite of what you want (a webhook fired when the event happens).

So, rather than an answer, I'm probably giving you more headaches :) Sorry about that.

That said, there are some projects to review things like what you propose, and our developers may well be motivated by seeing users ask for real-time notification improvements.

Another suggestion, from some of the veterans here at SH, is to simply implement a Python script that checks job status, as posted here: https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/
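As a rough sketch of that idea, assuming the python-scrapinghub client and a Slack incoming webhook (the project ID, API key and webhook URL are placeholders; check the client docs for the exact job-metadata fields):

# check_jobs.py - sketch: poll recent Scrapy Cloud jobs and post failures to Slack
import requests
from scrapinghub import ScrapinghubClient

APIKEY = 'YOUR_SH_APIKEY'                 # placeholder
PROJECT_ID = 12345                        # placeholder project id
SLACK_WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

client = ScrapinghubClient(APIKEY)
project = client.get_project(PROJECT_ID)

# look at the most recently finished jobs and report any that did not close cleanly
for job in project.jobs.iter(state='finished', count=10):
    close_reason = job.get('close_reason')
    if close_reason and close_reason != 'finished':
        requests.post(SLACK_WEBHOOK, json={
            'text': 'Job %s ended with close_reason=%s' % (job['key'], close_reason),
        })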

I think this approach will be the most useful for now.

Hope this gives you a better solution this time.

Kind regards!

0
Answered
Алексей 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 3

I'm using this spider configuration:

CONCURRENT_REQUESTS = 1

DOWNLOAD_TIMEOUT = 300

AUTOTHROTTLE_MAX_DELAY = 60

AUTOTHROTTLE_ENABLED = false

CONCURRENT_REQUESTS_PER_DOMAIN = 1

AUTOTHROTTLE_START_DELAY = 20

And I get these results after 1 minute of scraping:


downloader/response_status_count/200: 16
downloader/response_status_count/301: 11
downloader/response_status_count/503: 663

How can I slow down the scraping? I need about 1 page every 2-5 seconds.

Answer

Hey Алексей!

You can set, for instance, DOWNLOAD_DELAY = 5 (which adds a 5-second delay between requests).
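A sketch of the relevant settings for roughly one request every few seconds (the values are illustrative):

# settings.py - sketch: slow the spider to roughly one request every 2-5 seconds
DOWNLOAD_DELAY = 5                     # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True        # vary the actual delay between 0.5x and 1.5x
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = False           # let the fixed delay apply, unadjusted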

Have you considered using Crawlera? It can improve your crawling and give you more concurrent requests.

Let me know if you need further help.

Regards,

Pablo

0
Answered
tofunao1 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

After I upload the Scrapy code with shub, how can I see the code in detail? Where is the code shown?

Answer

Hi Tofunao,

You can check the deploy information in the Code & Deploys section of your project in Scrapy Cloud, but at the moment you can't view the spider's code on the dashboard.

Regards,

Pablo

0
Answered
ihoekstra 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 3

I've got this in my settings.py:


FEED_URI = 'search_results.csv'
FEED_FORMAT = 'csv'


This works OK locally but I wonder if it has any effect in the scrapinghub cloud. I know I can view the items and export them as CSV but the order of the fields is different from what I intended.


My spider is running now and I am wondering if there will actually be a file called search_results.csv at the end. If so, where might I find it?

Answer

Our more experienced support engineers suggest checking:

https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields

Use the suggested setting (FEED_EXPORT_FIELDS) and run the job.
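A sketch of that setting (the field names are placeholders for your item fields, listed in the order you want them exported):

# settings.py - sketch: fix the column order of exported items
FEED_EXPORT_FIELDS = ['title', 'url', 'price']   # hypothetical field names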

Then, when you export to CSV, the fields should be in the order you specified.

Regards!

0
Answered
ihoekstra 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 5

Hi,


I am trying to deploy my first project to scrapinghub and I'm very confused. I tried to follow the instructions here. It says that if you type "shub deploy" you will "be guided through a wizard that will set up the project configuration file (scrapinghub.yml) for you."


But this never happens. I just get "ImportError: No module named bs4". I suppose that is because the dependency (on beautifulsoup4) needs to be set in the scrapinghub.yml file, which hasn't been created.


Should I try to write the file by hand? If so, what needs to be in it and in what folder should it be stored? If I shouldn't write the file by hand, is there something I can do to meet this mysterious wizard so he can do it for me??

Answer
ihoekstra 1 week ago

OK, I finally figured it out. I manually created a scrapinghub.yml file, in the same directory that scrapy.cfg was in. Then I created a file called requirements.txt, as explained here. This file should be in that same directory!


scrapinghub.yml looks like this:


projects:
  default: [yourprojectid]

requirements_file: requirements.txt


And requirements.txt looks like this:


beautifulsoup4==4.5.1


On to the next error... but I will leave that for another thread.

0
Answered
Zahar 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 4

I added a new spider to my project with a dependency on the scrapinghub API.

shub version 2.5.0

scrapinghub (1.9.0)

When I try to deploy my project I'm getting the following error:

ImportError: No module named scrapinghub
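The same requirements.txt approach described in the thread above would likely apply here; a sketch, assuming the error simply means the scrapinghub package isn't declared as a deploy dependency:

# requirements.txt (referenced from scrapinghub.yml via requirements_file)
scrapinghub==1.9.0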

0
Answered
tee 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 5

Hi there, I only just joined Scrapinghub today. I've deployed my spider to the server, but I'm finding it difficult to create or configure an API for it so that I can call it from an external application: the call would pass the API key and the website to scrape, the spider would scrape that website, and the scraped data would be sent back to the external client I'm using. How do I go about that?

Answer

Hi Tee, sorry for the late answer. Retrieving the last 3 jobs is just an example of how you can use this API.
Please check whether this article about sharing data between spiders can help you achieve your goals:

http://help.scrapinghub.com/scrapy-cloud/sharing-data-between-spiders
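As a rough sketch of the overall flow described in the question, using the python-scrapinghub client (the spider name, project ID and argument names are placeholders, and an external app in another language could do the same through the HTTP API):

# external_client.py - sketch: schedule a run for one URL, then fetch its items
import time
from scrapinghub import ScrapinghubClient

APIKEY = 'YOUR_SH_APIKEY'              # placeholder
client = ScrapinghubClient(APIKEY)
project = client.get_project(12345)    # placeholder project id

# 1. start a job, passing the page to scrape as a spider argument
job = project.jobs.run('myspider', job_args={'url': 'http://example.com/page'})

# 2. wait for the job to finish (a real app would poll asynchronously)
while project.jobs.get(job.key).metadata.get('state') != 'finished':
    time.sleep(10)

# 3. collect the scraped items and hand them back to the caller
items = list(job.items.iter())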

Regards.

0
Answered
luhaoz 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 4

I have a project. It crawls search results from Google Search and stores them in a database, but after crawling for a while the responses start returning 503.

Can I use the Scrapy Cloud API to keep fetching data smoothly? I would also like to ask whether Scrapy Cloud offers a trial of its services, as I would like to test whether my project can be greatly improved by Crawlera.
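For reference, trying Crawlera from a Scrapy project is typically done through the scrapy-crawlera middleware; a sketch, with the API key as a placeholder:

# settings.py - sketch: route requests through Crawlera
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'YOUR_CRAWLERA_APIKEY'   # placeholder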
0
Answered
hydroscraping 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 2

I have a list of 100 start URLs. When I crawl the full list, it appears that half are ignored (i.e. not crawled). I've isolated some of the URLs and tried running the spider with just two of them. My stats are below. This crawl yields 0 requests, 0 errors, 0 items. How can I figure out why they are being ignored? How can I get Scrapy Cloud to crawl these sites?


downloader/exception_count: 2
downloader/exception_type_count/scrapy.exceptions.IgnoreRequest: 2
downloader/request_bytes: 446
downloader/request_count: 2
downloader/request_method_count/GET: 2
downloader/response_bytes: 1534
downloader/response_count: 2
downloader/response_status_count/200: 2
finish_reason: finished
finish_time: 1481369384706
log_count/INFO: 10
memusage/max: 57884672
memusage/startup: 57884672
response_received_count: 2
scheduler/dequeued: 2
scheduler/dequeued/disk: 2
scheduler/enqueued: 2
scheduler/enqueued/disk: 2
start_time: 1481369383995

Answer

Hey hydroscraping!

It doesn't seem to be a problem with Scrapy Cloud itself, since the other URLs are crawled; maybe those sites have strengthened their security to keep crawlers out.

Try creating separate projects, one of them targeting the problematic URLs, and run that project first. Enable AutoThrottle and keep the delay a little higher.
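A sketch of the AutoThrottle settings this refers to (the values are only a starting point):

# settings.py - sketch: enable AutoThrottle with a somewhat higher delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10          # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # ceiling for the delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight at a time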

Hope this helps.

Regards!

0
Answered
seble 2 months ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 2 months ago 3

Whenever I deploy my spider written in Python 3, scrapinghub thinks that it is written in Python 2 (I can see that the python2.7 interpreter is used by looking at the spider's logs). Due to differences in how Python 2 and Python 3 handle strings, this results in a UnicodeDecodeError. How can I tell scrapinghub to use Python 3 to execute my spider?

Answer

Solved it by specifying a stack in the scrapinghub.yml config file:

projects:
  default:
    id: XYZ
    stack: scrapy:1.1-py3

0
Answered
Ahmed 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 5

I used Portia to set up a Scrapy Cloud project and tested it on a couple of links from the website, and it works great. Now, my question is: I want to retrieve data from pages on this website on demand, one page at a time, similar to how Pinterest users add a page and Pinterest pulls in the title and image of that page. The users on my website will do the same, entering the URL of the page they want info from; my API sends the link to an API on Scrapinghub through a GET request, Scrapinghub extracts the info from that page and sends it back.

Is this something that can be done? If yes, can you please direct me on how this can be done?

Answer

You are very welcome Ahmed!