Hi folks, I was just trying the service before making a purchase, so I made a GET request to http://httpbin.org/ip just to make sure that the IP I get in the response is not mine, but I am getting the following error.
requests.exceptions.ConnectionError: HTTPConnectionPool(host='proxy.crawlera.com', port=8010): Max retries exceeded with url: http://httpbin.org/ip (Caused by : [Errno 54] Connection reset by peer)
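For reference, this is roughly how I am making the request (a minimal sketch; the API key placeholder stands in for my trial key):

import requests

# Crawlera is used as a regular HTTP proxy: the API key goes in as the
# proxy username with an empty password (placeholder below, not a real key)
proxies = {
    "http": "http://<CRAWLERA_API_KEY>:@proxy.crawlera.com:8010/",
}

response = requests.get("http://httpbin.org/ip", proxies=proxies)
print(response.text)  # should show a Crawlera IP rather than my own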
Please suggest how I can test the service successfully before making a purchase.
I'm looking for a way to split the breadcrumbs field into separate categories. I don't think I can accomplish it with regexes in a robust way. Is there a way to extend Portia in order to supply a custom extractor to it?
What are the general possibilities for extending Portia while still using Scrapinghub to host it?
Even though Crawlera has a lot of IPs, if all of them get banned, the IPs will need to be renewed. Does Crawlera do that?
Crawlera works differently: it protects the IPs from getting banned rather than just rotating them the way regular proxy providers do.
The items my spiders are scraping are stored in the job in alphabetical order.
How can I change this?
I want to store them in a custom, more logical order, based on the left-to-right order of the fields in the CSV.
Hi Robert, items are displayed in alphabetical order in the Items tab, but at the storage level there is no order set in place; the structure is like a dictionary or JSON object. When looking at the Items tab in the job, there is no way to alter that order, but I could tag this conversation as a feature request.
In other words, it makes sense to download a CSV file with a logical column order as you stated, and that can be done by modifying CSV Fields in the project settings.
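If the project is a regular Scrapy project rather than a Portia one, the equivalent would be the FEED_EXPORT_FIELDS setting, which controls the CSV column order; a minimal sketch with placeholder field names:

# settings.py -- placeholder field names; list your own item fields
# in the desired left-to-right CSV column order
FEED_EXPORT_FIELDS = ["name", "category", "price", "url"]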
Let me know if this is what you are looking for,
I have this project 111149
It runs extremely fast on my localhost, scraping 40 items/minute.
BUT it is scraping only 0.5 items/minute on Scrapinghub. I am using Crawlera, and even when I tried without Crawlera it scraped only 5 items/minute.
What is the reason for the slowness?
All projects in Scrapy Cloud have the AutoThrottle addon enabled by default. Since you are already using Crawlera, I would suggest turning this addon off by adding AUTOTHROTTLE_ENABLED = False to your job settings in the UI.
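For reference, in a standalone Scrapy project the same change would go into settings.py; a minimal sketch (the extra values are only illustrative suggestions, not requirements):

# settings.py -- let Crawlera manage the crawl rate instead of AutoThrottle
AUTOTHROTTLE_ENABLED = False

# Illustrative values only: higher concurrency and a generous timeout
# are usually reasonable when requests go through Crawlera
CONCURRENT_REQUESTS = 32
DOWNLOAD_TIMEOUT = 600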
By the way, if you click on Help in the top bar there's a Talk to support button for a messaging app.
I created an account on scrapinghub.com and invited a colleague to join the project, but when he tries to create a spider in Portia, he cannot open it (by clicking on the spider name on the left).
The chrome developer console shows an error 500 response for page "https://portia-beta.scrapinghub.com/api/projects/111XXX/spiders/www.example.com/samples", and this is the response: [Errno 2] No file or directory: 'spiders/www.example.com.json'
(Not sure about security issues so I redacted the website name and ID)
Any idea if this is a bug or an issue on our side?
I need to know all of the IPs of proxy.crawlera.com, as our firewall can only set access rules by IP; it does not support domain-based rules. Thanks.
Hi, regarding the IPs for the proxy.crawlera.com sub-domain:
They are not static, so they are likely to change, though not that often.
An approach to white-list the IPs in your firewall could be to flush the DNS on any machine and get the new IPs by resolving the sub-domain whenever needed.
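For example, a small script (just a sketch, not an official tool) that resolves the sub-domain and prints the IPs to white-list at that moment:

import socket

# Resolve proxy.crawlera.com and print the IPs currently behind it.
# The IPs are not static, so re-run this and update the firewall rules
# whenever connections start failing.
hostname, aliases, ips = socket.gethostbyname_ex("proxy.crawlera.com")
for ip in sorted(ips):
    print(ip)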
I already wrote out the full question with code samples here: http://stackoverflow.com/questions/40032826/scrapy-fails-to-filter-out-duplicate-urls. Essentially the middleware I wrote to filter out duplicates is not being called for every request. Any ideas why?
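In short, the setup looks roughly like this (a simplified sketch with illustrative names; the real code is in the Stack Overflow post):

from scrapy.exceptions import IgnoreRequest

class DuplicateUrlFilterMiddleware(object):
    # Downloader middleware that drops URLs it has already seen
    def __init__(self):
        self.seen_urls = set()

    def process_request(self, request, spider):
        if request.url in self.seen_urls:
            raise IgnoreRequest("duplicate URL: %s" % request.url)
        self.seen_urls.add(request.url)
        return None  # let Scrapy continue processing the request

# settings.py -- the middleware is only called if it is registered here
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DuplicateUrlFilterMiddleware": 543,
}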
I'm new to scrapy cloud.
I have a spider that uses a small SQLite backend.
Is it possible to use SQLite with Scrapy Cloud?
Will I have an SSH connection to get my data?
Thanks a lot for your time.
I see. Scrapy Cloud has limited disk space per unit, and it only lasts for the duration of the job run, so that could be a problem for long-term persistence.
You cannot get SSH access to the data; you can only open the job console from the Dashboard while the job is running.
Options to consider:
- tune your Scrapy project to add the calculated fields as additional scrapy.Item fields, so you can access/download them using the Dashboard or the Items API (see the sketch after this list)
- use the DotScrapy Persistence addon (which requires an S3 bucket setup) and keep the SQLite solution; SSH access would depend on your setup
- use an external database instance to decouple the solution from Scrapy Cloud
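For the first option, once the calculated fields are part of the item they can be read back through the Items API; a rough sketch using the HTTP endpoint (the API key and job ID are placeholders):

import requests

API_KEY = "YOUR_SCRAPINGHUB_API_KEY"   # placeholder
JOB_ID = "123456/1/8"                  # placeholder: project/spider/job

# The Items API authenticates with the API key as the username
url = "https://storage.scrapinghub.com/items/{}".format(JOB_ID)
response = requests.get(url, auth=(API_KEY, ""), params={"format": "json"})

for item in response.json():
    print(item)  # each item is a dict, including any calculated fields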
Hope it helps,
I am having trouble using DotScrapy Persistence. The Scrapinghub documentation says "...by calling the scrapy.utils.project.data_path function...". Well, I am not able to find that function in the Scrapy documentation.
Could anybody please let me know how to use that function? I'd like to store a date between two runs of my spider...
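For context, what I am hoping to do is roughly this, assuming data_path simply joins the given path onto the project's .scrapy directory (the directory the DotScrapy Persistence addon keeps between runs):

import datetime
import os

from scrapy.utils.project import data_path

# Assumption: data_path() returns a path inside the project's .scrapy
# data directory; with createdir=True the directory is created if missing
state_dir = data_path("spider_state", createdir=True)
last_run_file = os.path.join(state_dir, "last_run.txt")

# Read the date stored by the previous run, if there was one
if os.path.exists(last_run_file):
    with open(last_run_file) as f:
        print("previous run:", f.read().strip())

# Store today's date for the next run
with open(last_run_file, "w") as f:
    f.write(datetime.date.today().isoformat())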
I have proxies from http://stormproxies.com/. The way these work is that you tell them the one IP where your scraper runs, and then they allow that IP to connect to their server and use the proxies.
BUT I have an amazon-crawler project on Scrapinghub, and I have confirmed that each time the spider runs it runs from a new IP, hence I am not able to use the http://stormproxies.com/ service.
Can you make my amazon-crawler project run from the same IP each time? Is that possible?
Hi, the short answer is that it's not possible; please check this support note: https://support.scrapinghub.com/topics/2012-what-are-the-scrapy-cloud-ip-addresses/
Hi, how can I delete projects or change their names?
Hi! Please refer to this article in our Knowledge Base: https://support.scrapinghub.com/topics/2280-how-to-delete-a-project-in-scrapy-cloud/
I see these are options you can set (https://github.com/scrapinghub/portia/blob/master/docs/spiders.rst) to avoid the slybot_fewitems_scraped error which stops your scrape. However, I do not see a way to set these directly in the spider settings through the UI. Am I missing something?
To set SLYCLOSE_SPIDER_CHECK_PERIOD and SLYCLOSE_SPIDER_PERIOD_ITEMS you would need to navigate to the spider's job settings and then set the values under the Scrapy raw settings, as shown in the example below.
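For example, the raw settings could look roughly like this (the values are only illustrative; pick whatever fits your crawl):

# Illustrative values only. A longer check period and a low item threshold
# make the slybot_fewitems_scraped check less likely to stop the spider
SLYCLOSE_SPIDER_CHECK_PERIOD = 3600   # seconds between checks
SLYCLOSE_SPIDER_PERIOD_ITEMS = 1      # minimum items scraped per period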
Then click on SAVE. After saving, the settings will appear under the Scrapy settings as well.
The settings will be overridden for that spider, and when the spider runs it will take these settings into account.
I was able to successfully create a Portia 2.0 spider from a sample page. However, one of the fields I would like to extract from the sample looks like this:
<body onload="set_map(14,33.911208,-118.165676,'map_canvas');" id="second">
where 14,33.911208,-118.165676 is a latitude / longitude value.
I get the sense I could extract this field using a regex extractor but I cannot figure out a way to do so using Portia. I cannot seem to select the body element at all using the Portia UI.
ohh I see the issue now :).
Then select the CSS mode and use the "href" selector.
After that, use whatever regex magic you need to access those coordinates.
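For the regex part, something along these lines should pull the coordinates out of the onload value (a small sketch based on the sample above):

import re

onload = "set_map(14,33.911208,-118.165676,'map_canvas');"

# Capture the two signed decimal numbers that follow set_map's first argument
match = re.search(r"set_map\(\d+,(-?\d+\.\d+),(-?\d+\.\d+)", onload)
if match:
    latitude, longitude = match.groups()
    print(latitude, longitude)  # 33.911208 -118.165676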