Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Remember to check the Help Center!
Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.
It seems like all the proxies from outside Sweden are blocked, which makes my scrapes hard to do.
Can you help?
Hi Gustav, perhaps this was something related to the target domain. I checked recently, and many proxies from "all around the world" (as you correctly pointed out =)) are making successful requests.
Don't hesitate to reach us through Intercom or here again if you have further questions.
Is there any limit on writing files to disk? After uploading all my items to an ephemeral collection I would like to move them to a csv file and upload them to an external service (say S3).
The storage limit depends on the account and the number of containers, but you can check this article on exporting items to S3:
I have been working with Scrapinghub with the goal to automate how we build some reports for our customers. I am extremely happy with the results!! :))
Currently I am trying to store the number of inlinks and outlinks of each URL on a collection of websites (inlinks, as the number of links pointing to each item; outlinks, as the number of links detected on each item).
The outlinks are very easy to store, as you only have to store the number of links captured by the link extractor, but I don't have any idea about how to store the inlinks of each item.
On my own machine, I have created a dict using as key each URL and, as value, I am incrementing the number of links pointing to each URL. Then, using a pipeline, I add the information to the items. Works perfectly, it is very easy to accomplish :).
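For reference, the local approach described above can be sketched roughly as follows. This is only an illustration in plain Python: the `pages` mapping, the `count_inlinks` and `annotate` helpers, and the example URLs are hypothetical stand-ins for what the link extractor and the pipeline would provide.

```python
from collections import Counter


def count_inlinks(pages):
    """Count how many links point to each URL.

    `pages` maps a page URL to the list of URLs it links to
    (a hypothetical stand-in for the link extractor output).
    """
    inlinks = Counter()
    for targets in pages.values():
        for target in targets:
            inlinks[target] += 1
    return inlinks


def annotate(items, inlinks):
    """Pipeline-style step: add the inlink count to every item dict."""
    # Counter returns 0 for URLs that nothing links to.
    return [dict(item, inlinks=inlinks[item["url"]]) for item in items]


pages = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}
items = [{"url": url, "outlinks": len(links)} for url, links in pages.items()]
annotated = annotate(items, count_inlinks(pages))
```

The counts are only complete once every page has been visited, which is why the annotation runs as a final step rather than per item.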
However... How can I do this on Scrapinghub? Is there a way to add information through a pipeline or using the close_spider method and still be able to request the items through the Item API? Or should I consider using the pipeline to send the results to another server (S3, FTP, or similar)?
Thank you for your help!!
First, thanks for your nice and constructive feedback. It's great to know you are pleased with the results of using the Scrapinghub platform.
Regarding your inquiry, I can think of two options. First, you could try the Magic Fields addon, which allows you to create a new custom item for each request:
Second, as you mention using a pipeline to send results to S3, you can consider exporting your items as described in this article: http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode
Hope you find this information useful,
I have now written a Scrapy spider. There are a lot of URLs to visit in this job.
I didn't use the items module; I just put the records into a MySQL database directly in the 'parse' function. I don't want duplicate records in my database, and the record on a given page URL may change, so every time I need to select the old data from the database by some unique key before I insert the new data. If the current data is not the same as the historical data, the old data is deleted and the new data is inserted. But done this way, the spider takes too much time to run.
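One possible way to cut down on this work, which is not something proposed in the thread, is to store a content hash next to each record and skip the write when the hash is unchanged. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite database as a stand-in for MySQL, and the table layout and the `fingerprint` and `upsert_if_changed` helpers are hypothetical.

```python
import hashlib
import json
import sqlite3


def fingerprint(record):
    """Stable hash of a record's content (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def upsert_if_changed(conn, key, record):
    """Write the row only when its content hash differs from the stored one.

    Returns True if the database was written, False if nothing changed.
    """
    new_hash = fingerprint(record)
    row = conn.execute(
        "SELECT content_hash FROM records WHERE unique_key = ?", (key,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # same content as before: skip the delete/insert
    conn.execute(
        "INSERT OR REPLACE INTO records (unique_key, content_hash, data) "
        "VALUES (?, ?, ?)",
        (key, new_hash, json.dumps(record)),
    )
    return True


# SQLite stands in for MySQL here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "unique_key TEXT PRIMARY KEY, content_hash TEXT, data TEXT)"
)
```

Comparing one short hash column is usually cheaper than fetching and comparing the whole old record, though the lookup by unique key still happens once per record.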
Would DeltaFetch help me here? Can the addon filter by whole records instead of by request URL?
I think you may find it useful to enable the DeltaFetch addon for your purposes. We have written a blog post to share some tips about this addon:
Also remember to enable the DotScrapy Persistence addon for DeltaFetch to work, as described here:
Regarding your last questions, DeltaFetch only checks for duplicate URLs of requests that contain items. Requests to URLs that haven’t yielded items will still be revisited in subsequent crawls. Start URLs will also be revisited.
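On Scrapy Cloud the addon is switched on from the dashboard. For a project you run yourself, the equivalent is assumed to be the open-source scrapy-deltafetch package, enabled through settings roughly like this (a sketch, assuming that package is installed; the persistence addon needs its own configuration, which is not shown):

```python
# settings.py (sketch): enable the DeltaFetch spider middleware.
# Assumes the scrapy-deltafetch package is installed in the project.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```

As the answer above notes, this skips requests whose URLs already yielded items in earlier runs; it does not compare record contents.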
Hope this information can be useful.
I can't tell what's using too much memory, and the stats page doesn't show that I went near the 1 GB limit of the free tier.
I think I found the issue shortly after posting:
The site I'm scraping added a really really really large web page which the spider tried to download and parse. I didn't see it in the log - perhaps page requests are only debug logs and I didn't think of dropping log level. In the end I found the culprit by looking at the last few requests on the Requests tab of a few crawls that exited with this reason. This new big page was in the last few on each.
In my case I could solve it by completely ignoring that page.
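For cases where the oversized page cannot simply be skipped by URL, Scrapy also has size limits for downloads. The snippet below is a minimal sketch; the setting names are Scrapy's, but the thresholds are arbitrary example values, not recommendations from this thread.

```python
# settings.py (sketch): cap response sizes so one huge page cannot
# exhaust the job's memory. Threshold values are hypothetical examples.
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # cancel responses larger than 10 MiB
DOWNLOAD_WARNSIZE = 2 * 1024 * 1024   # log a warning above 2 MiB
```

With a warning threshold in place, the large page would also show up in the log at a level above debug.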
What is the best way to deal with robots.txt and crawl blocking?
This is for a site that wants to approve crawling.
Just do something like this:
Or something more.
According to our more experienced support agents, you could check this article posted on our blog:
It gives some ideas and tips on the content of those types of files and on how to handle them (or not).
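To check how a given robots.txt would be interpreted by a well-behaved crawler, Python's standard `urllib.robotparser` can be used. The sketch below is only an illustration: the rules, the `approved-bot` user agent and the `is_allowed` helper are hypothetical, and they show one way a site could allow a single approved crawler while disallowing everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one approved crawler may fetch everything,
# all other user agents are disallowed.
ROBOTS_TXT = """\
User-agent: approved-bot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


approved = is_allowed("approved-bot", "https://example.com/page")
other = is_allowed("other-bot", "https://example.com/page")
```

Keep in mind that robots.txt is advisory: it only restrains crawlers that choose to obey it.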
Hi, I am using Scrapy Cloud
I want to customize my Slack Incoming WebHooks to get error information during execution of jobs in scrapy cloud.
Could I get error notifications in real-time or after job completion?
I want some polling web hooks or something else
Hey Seok, more updates for you from our best developers.
As for Notifications, it is perhaps not a suitable option, since it is limited to comments added by people rather than by the system.
The Jobs API only allows for probing the status of a job, so if you want to know what's up with a job, you currently have to do regular polling, which is the opposite of what you want (a webhook for when the event happens).
So basically, rather than an answer, I have probably given you more headaches :) Sorry about that.
I have been told that there are some projects under way to review things like what you propose, and some of our developers may be motivated by seeing our users ask for improved real-time notifications.
Another suggestion, provided by some of the ancient masters here at SH, is to simply implement a Python script to check the job status, as posted here: https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/
I think that this approach will be very useful for now...
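A rough sketch of such a polling script is shown below. It is not an official implementation: `get_status` is a caller-supplied function (how it queries the Jobs API is deliberately left out), the job state name "finished" and the `close_reason` value are assumptions about what that function returns, and the webhook URL is whatever Slack issued for your incoming webhook.

```python
import json
import time
import urllib.request


def slack_payload(job_id, state, close_reason):
    """Build the JSON body for a Slack incoming-webhook message."""
    text = f"Job {job_id} is {state} (close reason: {close_reason})"
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url, payload):
    """POST the payload to the Slack webhook URL supplied by the caller."""
    request = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def poll_until_finished(job_id, get_status, notify, interval=60, sleep=time.sleep):
    """Poll get_status(job_id) until the job is finished, then notify once.

    get_status is a hypothetical callable returning (state, close_reason).
    """
    while True:
        state, close_reason = get_status(job_id)
        if state == "finished":
            notify(slack_payload(job_id, state, close_reason))
            return close_reason
        sleep(interval)
```

In practice `notify` would be something like `lambda payload: post_to_slack(webhook_url, payload)`. This gives a notification after job completion, not in real time.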
Hope this gives you a better solution this time.
I'm using this spider configuration:
CONCURRENT_REQUESTS = 1
DOWNLOAD_TIMEOUT = 300
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_ENABLED = false
CONCURRENT_REQUESTS_PER_DOMAIN = 1
And I get this result for 1 minute of scraping:
How can I slow down the scraping? I need about 1 page every 2-5 seconds.
You can set, for instance, DOWNLOAD_DELAY = 5 (a delay of 5 seconds).
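As a sketch of how this plays out: by default Scrapy randomizes the wait to between 0.5x and 1.5x of DOWNLOAD_DELAY, so a base value around 3.5 lands close to the requested 2-5 second window. The `delay_range` helper below is hypothetical and only illustrates the arithmetic; it is not part of Scrapy.

```python
# settings.py (sketch)
DOWNLOAD_DELAY = 3.5              # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # Scrapy default: wait 0.5x to 1.5x the base


def delay_range(base_delay, randomize=True):
    """Return the (min, max) wait in seconds these settings produce."""
    if randomize:
        return (0.5 * base_delay, 1.5 * base_delay)
    return (base_delay, base_delay)


low, high = delay_range(DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY)
```

With the values above the wait falls between 1.75 and 5.25 seconds per request.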
Have you considered using Crawlera? It can improve your crawling and give you more concurrent requests.
Let me know if you need further help.
After I upload the Scrapy code with shub, how can I see the code in detail? Where is the code shown?
You could check the deploy information on the Code & Deploys section of your project in Scrapy Cloud, but at this moment you can't check the code of the spider on the dashboard.
After we started to use version 2 of Portia, we are experiencing unwanted deduplication of similar items from crawls. Looking through the logs of these crawls reveals that these items do indeed have different values for at least one field in each item. As we see it, these items are not duplicates and should not be discarded.
As a note: all fields in the item are configured with the Vary option enabled and both Required options disabled.
Crawl logs were read on the web interface from https://app.scrapinghub.com/p/110257/36/4/log
Our experts suggest disabling the Vary option. This should improve your crawling for this particular case. All the fields in the data format used by that spider have vary = True set, so they're ignored when checking for duplicates.
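The effect can be illustrated with a small sketch. This is not Portia's actual code; `dedup_key` and `drop_duplicates` are hypothetical helpers that only mimic the described behaviour, where fields marked as vary are left out of the duplicate check.

```python
def dedup_key(item, vary_fields):
    """Build the duplicate-check key from the fields NOT marked as vary."""
    return tuple(sorted((k, v) for k, v in item.items() if k not in vary_fields))


def drop_duplicates(items, vary_fields):
    """Keep the first item seen for each key and discard the rest."""
    seen, kept = set(), []
    for item in items:
        key = dedup_key(item, vary_fields)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept


items = [
    {"name": "Widget", "price": "10"},
    {"name": "Gadget", "price": "12"},
]
# Every field is vary: all keys are empty, so only the first item survives.
all_vary = drop_duplicates(items, vary_fields={"name", "price"})
# No field is vary: the items differ, so both are kept.
none_vary = drop_duplicates(items, vary_fields=set())
```

When every field is excluded from the comparison, all items look identical to the duplicate check, which matches the symptom reported above.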
Let me know if this was helpful.
I've got this in my settings.py:
FEED_URI = 'search_results.csv'
FEED_FORMAT = 'csv'
This works OK locally but I wonder if it has any effect in the scrapinghub cloud. I know I can view the items and export them as CSV but the order of the fields is different from what I intended.
My spider is running now and I am wondering if there will actually be a file called search_results.csv file at the end. If so, where might I find it?
Our more experienced support engineers suggest checking:
Use the suggested setting and run the job.
Then, when you export the CSV, the fields should be in order.
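The setting the engineers pointed to is not preserved in this thread. One Scrapy setting that does control CSV column order is FEED_EXPORT_FIELDS; assuming that is what was meant, a sketch looks like this (the field names are hypothetical):

```python
# settings.py (sketch): the field names below are hypothetical examples.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["title", "url", "price"]  # columns are written in this order
```

Fields not listed here are left out of the export, so the list should name every column you want.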
I am trying to deploy my first project to Scrapinghub and I'm very confused. I tried to follow the instructions here. It says that if you type "shub deploy" you will "be guided through a wizard that will set up the project configuration file (scrapinghub.yml) for you."
But this never happens. I just get "ImportError: No module named bs4". I suppose that is because the dependency (on beautifulsoup4) needs to be set in the scrapinghub.yml file, which hasn't been created.
Should I try to write the file by hand? If so, what needs to be in it and in what folder should it be stored? If I shouldn't write the file by hand, is there something I can do to meet this mysterious wizard so he can do it for me??
OK, I finally figured it out. I manually created a scrapinghub.yml file, in the same directory that scrapy.cfg was in. Then I created a file called requirements.txt, as explained here. This file should be in that same directory!
scrapinghub.yml looks like this:
And requirements.txt looks like this:
On to the next error... but I will leave that for another thread.
I searched for this topic but had no luck; apologies if this is a duplicate.
I'm scraping a few pages where I need to extract a few non-visible pieces of data. Specifically, the `src` attributes on images, some `data-*` attributes on misc. `html` tags, and some raw text from the content of a few `<script>` tags.
Is this possible to do in Portia? I haven't been able to figure it out on my own.
If not possible, is it possible to augment a Portia scraper with custom python? Or does a job have to be either all-Scrapy/Python or All-Portia?
Yes Brandon! You can add an extractor to the annotation. In the same options where you configured the CSS selector you can add an extractor which will process the text with your pre-defined regex.
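Before entering a regex into an extractor, it can help to try it out in plain Python. The sketch below is only an illustration: the sample markup, the field names and the patterns are hypothetical, and whether Portia applies the extractor to attribute values in exactly this way is an assumption.

```python
import re

# Hypothetical sample inputs resembling the non-visible data in question.
SCRIPT_TEXT = 'window.product = {"sku": "AB-123", "stock": 7};'
IMG_TAG = '<img class="hero" src="/images/hero.png" data-id="42">'

# In each pattern, capture group 1 is the value to extract.
SKU_RE = re.compile(r'"sku":\s*"([^"]+)"')
SRC_RE = re.compile(r'\bsrc="([^"]+)"')
DATA_ID_RE = re.compile(r'\bdata-id="([^"]+)"')

sku = SKU_RE.search(SCRIPT_TEXT).group(1)
src = SRC_RE.search(IMG_TAG).group(1)
data_id = DATA_ID_RE.search(IMG_TAG).group(1)
```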
I'm having massive troubles with Portia today: Almost every action triggers a backend error (red notifications). Interacting with a website doesn't work, neither does deleting data formats.
It appears as if you've switched my account from the old Portia to Portia 2.0, because yesterday the interface was totally different.
I also have a suggestion for an improvement: Your error messages are not very helpful :(.
Can anyone help me out or let me know what else I should post in order to resolve these problems?
Thanks, Ched, for your feedback! I will forward your suggestion to our Portia team; what you propose makes a lot of sense.
I have a client whom I referred to Crawlera, and I had him provide the API Key to me to run the routine. When I input it and test, it will work, but within a day or so I get the error: "Failed to match the login check" and if I have him log in and send me the API Key again, it has changed, and the new one works. How can I prevent it from changing?
Hey Jason, our Crawlera engineers inform us that no changes can occur on the API key without your client's authorization.
Don't hesitate to ask if you need further assistance.