Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Not a bug
ssobanana 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 4 weeks ago 7

Hello,

This morning, after logging in to Scrapinghub, I see that my account is missing two projects.

I'm on a free account. Are projects deleted without warning after a while?

Thanks!

Answer

You are welcome :)

Yes, deletion is an irreversible action, so the project cannot be brought back.

0
Started
terry.zeng 4 weeks ago in Scrapy Cloud • updated 4 weeks ago 2

Hi,


I am using the Scrapinghub API to fetch project summary data, but the response only tells me the status and count of finished jobs; for pending and running jobs there is never any data.


Can anyone help me find out the reason?


Cheers,

Terry
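
As a starting point for debugging, here is a minimal sketch using the python-scrapinghub client (an assumption; the original post does not say which client is used) with a placeholder API key and project ID, filtering jobs by state:

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")   # placeholder API key
project = client.get_project(123456)         # placeholder project ID

# jobs.iter() accepts a state filter: 'pending', 'running', 'finished' or 'deleted'.
for state in ("pending", "running", "finished"):
    jobs = list(project.jobs.iter(state=state))
    print(state, len(jobs))

If the pending/running counts are always zero here as well, the issue is likely with when the summary is queried rather than with the API itself.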

0
Answered
Johan Hansson 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

I have a few modules with functions that I created myself to help with certain tasks in some spiders. When running a crawl locally on my computer it all works fine, but when I upload to Scrapy Cloud it doesn't work at all. I've put the modules in a modules/ directory and try to import them from a spider with: from testproject.modules.testmodule import TestModule


Are there any other settings I need, besides just importing the module like I normally would in Python 3?


Directory structure is like:


testproject/
    testproject/
        modules/
            testmodule.py
        spiders/

Answer

Hi Johan, to deploy custom spiders please check this article:

https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/


It explains how to deploy additional files or modules with a normal shub deploy.


Best regards,

Pablo
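
A common cause of this symptom (works locally, fails after deploying) is that the modules directory is not packaged into the deployed egg. A minimal sketch of the standard shub-generated setup.py, under the assumption that both testproject/ and testproject/modules/ contain an __init__.py so find_packages() picks them up:

# setup.py (sketch) -- find_packages() only picks up directories that
# contain an __init__.py, so make sure both testproject/ and
# testproject/modules/ have one.
from setuptools import setup, find_packages

setup(
    name='testproject',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = testproject.settings']},
)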

0
Answered
Tristan 4 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 3

Hi

Which types of regex do Portia/Scrapy support for include/exclude URLs when added in the Portia interface?


Does it support \d, \s, [0-9], [^0-9] and similar regex constructs?


Is there a library reference for this? I see a page about the query cleaner on your site, but nothing about regex support in general.


Also, I want to figure out how to make the query case-insensitive. Is there a setting, or should I just use

/[Ff]older/[Pp]age.html

for:

/Folder/Page.html

/folder/page.html
/folder/Page.html

Thanks


tristan

Answer

Hi Tristan!

For example, if you want to configure a URL pattern for:

https://www.kickstarter.com/projects/1865494715/apollo-7-worlds-most-compact-true-wireless-earphon/comments?cursor=14162300


you should use:


(https:\/\/www\.kickstarter\.com\/projects\/1865494715\/apollo-7-worlds-most-compact-true-wireless-earphon\/comments\?cursor=14162300)+


Best,

Pablo
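
As a side note on the case-insensitivity question, assuming Portia/Scrapy follow patterns are evaluated as ordinary Python regular expressions, the (?i) inline flag (or character classes like [Ff]) can be tested locally like this:

import re

# (?i) makes the whole pattern case-insensitive.
pattern = re.compile(r"(?i)/folder/page\.html")

for url in ("/Folder/Page.html", "/folder/page.html", "/folder/Page.html"):
    print(url, bool(pattern.search(url)))   # True for all three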

0
Answered
Mani Zia 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Hi All,

Please help me out. I am trying to crawl the site at the given link: https://en-ae.wadi.com/home_entertainment-televisions/?ref=navigation
but Scrapy (Python) is unable to crawl it. I also used these classes:


import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


but it still returns only null on one line, nothing else. Looking forward to your kind response.

Thanks

Regards zia.
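
For context, the imports in the question are usually combined along these lines: render the JavaScript-heavy page with Selenium, then wrap the rendered HTML in a TextResponse so Scrapy selectors can be used. This is only a hypothetical sketch; the spider name and selector are placeholders, and a matching browser driver must be installed.

import scrapy
from selenium import webdriver
from scrapy.http import TextResponse


class WadiSpider(scrapy.Spider):
    # Hypothetical sketch, not the poster's actual spider.
    name = "wadi_tv"
    start_urls = ["https://en-ae.wadi.com/home_entertainment-televisions/?ref=navigation"]

    def parse(self, response):
        driver = webdriver.Firefox()   # requires geckodriver on PATH
        driver.get(response.url)
        # Wrap the rendered HTML so Scrapy selectors work on it.
        rendered = TextResponse(url=response.url,
                                body=driver.page_source,
                                encoding="utf-8")
        driver.quit()
        # Placeholder selector: collect every link on the rendered page.
        for href in rendered.css("a::attr(href)").extract():
            yield {"url": href}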

Answer

Hi Mani!


We can't provide Scrapy assistance here, but let me suggest other channels where you can ask:

1. StackOverflow - Scrapy

There's a vast community of Scrapy developers and users contributing actively there.

2. Github - Scrapy

The same applies there.


If your project requires urgent attention, please share your needs with us through:

https://scrapinghub.com/quote

to receive sales assistance if you are considering hiring our professional services.


Best of luck with your project!


Pablo

0
Not a bug
San 4 weeks ago in Portia • updated 4 weeks ago 1

I am trying to scrape a super simple table on a webpage, but whatever I try, I keep getting errors:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://portia.scrapinghub.com/api/projects/174585/download/.....

Really weird because this spider is as simple as can be.


Answer

Hi San.


Even though this seems like a simple task, Portia is not recommended for parsing tables.

Perhaps you can try using Scrapy instead.

Check these tutorials:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201028-learn-scrapy-video-tutorials-

Best regards,

Pablo
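
If the table really is that simple, a plain Scrapy spider covers it in a few lines. A hypothetical sketch with a placeholder spider name and URL:

import scrapy


class TableSpider(scrapy.Spider):
    # Hypothetical example: yield every row of an HTML table as an item.
    name = "simple_table"
    start_urls = ["http://example.com/page-with-table"]   # placeholder URL

    def parse(self, response):
        for row in response.css("table tr"):
            cells = row.css("td::text").extract()
            if cells:
                yield {"cells": cells}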


0
Answered
Art 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 4

Hi.

1. My spiders return inconsistent results when run in the same project with the same settings. There are no errors in the log; they just make a different number of requests and return a different number of results. Moreover, later runs usually return more results. Why is that? What can be done?

2. Some pages return empty results even though there are results on the page and the page is accessible. There are no errors in the log, and the requests show code 200 with some data downloaded. How can I find out the reason? Is there a setting for page-load delay? Maybe the page does not load fully, or something else; I do not know. Please share your thoughts.

Answer

Hey Art,


Let's try to change that dissatisfaction mark =)


About:


"If i got blocked, I would expect to get some sort of error message. Moreover, when I use autothrotle and low threads, results are even worse, which should not be the case if it was antibot malware or smth of a sort. What do you think?"


As I mentioned, when using Crawlera you receive ban notifications. In fact, you are only charged for requests with a 200 status (successful requests). Moreover, Crawlera rotates outgoing IPs to avoid bans.


If you deploy a spider to Scrapy Cloud and run it without Crawlera, the outgoing IP will always be the same, and anti-bot systems can detect it easily. Of course, using AutoThrottle can help, but not much if your project is more complex or the sites you crawl are well protected.


Thanks for reporting the broken link; I replaced it with a nice article that my friend Valdir wrote on our blog. I hope you enjoy it.


Kind regards,


Pablo Vaz
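
For reference, enabling Crawlera and AutoThrottle in a Scrapy project comes down to a few settings. A minimal sketch, assuming the scrapy-crawlera middleware is listed in the project's requirements and using a placeholder API key:

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'   # placeholder

# AutoThrottle as a fallback for runs without Crawlera.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0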

0
Answered
Art 4 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 4 weeks ago 3

Hi, shub always returns my CSV with an additional column, _type, which is the name of the project. How do I get rid of it?

Answer

If you go to the project settings ("https://app.scrapinghub.com/p/<projectid>/settings"), you'll find a CSV Fields option where you can type in all the fields you want to appear (in the order you desire), separated with a ",".

Note that this applies to all the spiders in that project.


Alternatively, you can right click the export link, copy it and paste it in a new tab/window, remove _type from the URL and hit enter.
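
If you export with plain Scrapy feed exports instead of the Scrapy Cloud CSV link, the analogous control is Scrapy's FEED_EXPORT_FIELDS setting. A minimal sketch with hypothetical field names:

# settings.py (sketch) -- only the listed fields appear in the exported CSV,
# so extra columns such as _type are dropped. Field names are placeholders.
FEED_EXPORT_FIELDS = ['name', 'price', 'url']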

0
Nayak 4 weeks ago in Crawlera 0

Hi,


I want to make a web request to the Google site using the Crawlera proxy along with the Session API. After integrating, I observed that a few times we are unable to get a session ID.

The code below shows the web request for Google:

C# Code
private static void GetResponse()
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
    ServicePointManager.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;

    //Proxy API Call
    var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
    myProxy.Credentials = new NetworkCredential("< C10 plan API key >", "");

    // Session API Call
    HttpWebRequest sessionRequest = (HttpWebRequest)WebRequest.Create("http://proxy.crawlera.com:8010/sessions");
    sessionRequest.Credentials = new NetworkCredential("< C10Plan API key >", "");
    sessionRequest.Method = "POST";
    HttpWebResponse sessionResponse = (HttpWebResponse)sessionRequest.GetResponse();
    StreamReader srSession = new StreamReader(sessionResponse.GetResponseStream());
    string sessionId = srSession.ReadToEnd();

    // Google Request
    string searchResults = "http://google.com/search?q=Ganesh Nayak K&num=100";
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(searchResults);
    request.Proxy = myProxy;
    request.PreAuthenticate = true;
    request.Headers.Add("x-crawlera-use-https", "1");
    request.Headers.Add("X-Crawlera-Session", sessionId);
    request.ServerCertificateValidationCallback += (sender, certificate, chain, sslPolicyErrors) => true;

    HttpWebResponse response = (HttpWebResponse)request.GetResponse(); // To get the response from server it is taking lot of time

    Stream resStream = response.GetResponseStream();
    StreamReader sr = new StreamReader(resStream);
    sr.ReadToEnd();
}

If we implement the same code without a session, we get the response much more quickly and are able to process requests much faster.

Without a session I have processed a larger number of requests, but a few times I got the exception below:

--> Unable to connect to the remote server: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 64.58.114.15:8010

Regards,
Ganesh Nayak K
0
Started
Mykhailo Kuznietsov 1 month ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 weeks ago 4

I can't deploy spiders. Can anyone help? Thanks

Login succeeded
Pulling a base image
1.1: Pulling from scrapinghub/scrapinghub-stack-scrapy
Digest: sha256:7d91fec51fece61ce4b3464c042f95c56e4378f3df277232c4468f8f082a2957
Status: Image is up to date for scrapinghub/scrapinghub-stack-scrapy:1.1
Building an image:
Step 1 : FROM scrapinghub/scrapinghub-stack-scrapy:1.1
# Executing 2 build triggers...
Step 1 : ENV PIP_TRUSTED_HOST $PIP_TRUSTED_HOST PIP_INDEX_URL $PIP_INDEX_URL ---> Using cache
Step 1 : RUN test -n $APT_PROXY && echo 'Acquire::http::Proxy \"$APT_PROXY\";' >/etc/apt/apt.conf.d/proxy ---> Using cache ---> c6c57ea4da6c
Step 2 : ENV PYTHONUSERBASE /app/python ---> Using cache ---> 4d10e43f61ae
Step 3 : ENTRYPOINT /usr/local/sbin/kumo-entrypoint ---> Using cache ---> dfa4b1ae177a
Step 4 : ADD kumo-entrypoint eggbased-entrypoint /usr/local/sbin/ ---> Using cache ---> 4e3e6d59beb2
Step 5 : ADD run-pipcheck /usr/local/bin/ ---> Using cache ---> 2ec58858cf2e
Step 6 : RUN chmod +x /usr/local/bin/run-pipcheck ---> Using cache ---> 2e5fe00582e8
Step 7 : RUN chmod +x /usr/local/sbin/kumo-entrypoint /usr/local/sbin/eggbased-entrypoint && ln -sf /usr/local/sbin/eggbased-entrypoint /usr/local/sbin/start-crawl && ln -sf /usr/local/sbin/eggbased-entrypoint /usr/local/sbin/scrapy-list && ln -sf /usr/local/sbin/eggbased-entrypoint /usr/local/sbin/run-pipcheck ---> Using cache ---> 983114ef04ec
Step 8 : ADD requirements.txt /app/requirements.txt ---> Using cache ---> 97cd6bcf24ed
Step 9 : RUN mkdir /app/python && chown nobody:nogroup /app/python ---> Using cache ---> 0895cdb5b56a
Step 10 : RUN sudo -u nobody -E PYTHONUSERBASE=$PYTHONUSERBASE pip install --user --no-cache-dir -r /app/requirements.txt ---> Using cache ---> 32a357097f24
Step 11 : COPY *.egg /app/ ---> 1692cabbada6 Removing intermediate container c36c7e98abcd
Step 12 : RUN if [ -d "/app/addons_eggs" ]; then rm -f /app/*.dash-addon.egg; fi ---> Running in ed56a51e5dcd ---> 79811aeaf4ea Removing intermediate container ed56a51e5dcd
Step 13 : ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ---> Running in f1a39a2a8a67 ---> 4f94437c3d1b Removing intermediate container f1a39a2a8a67
Successfully built 4f94437c3d1b
>>> Checking python dependencies
No broken requirements found.
>>> Getting spiders list:
100percentoptical aaos_american_academy_of_orthopaedic_surgeons amwc_aesthetic_and_anti_aging_medicine_world_congress ced_centre_europeen_de_dermocosmetique cphi_istanbul cphi_japan cphi_north_america_and_informex cphi_russian cphi_south_east_asia dentalexpo efort_european_federation_of_national_association_of_orthopaedics_and_traumatology himss hospital ids_international_dental_show kimes lsie_life_science_industry_event medical_japan medizin medtec_europe medtrade_spring mido miof_moscow_international_optical_fair pharmapack pharmatech yankee_dental_congress
WARNING: There're some errors when listing spiders: /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:7: HubstorageDeprecationWarning: python-hubstorage is deprecated, please use python-scrapinghub >= 1.9.0 instead (https://pypi.python.org/pypi/scrapinghub). from hubstorage import ValueTooLarge
>> Found 25 spider(s).
Pushing the image to a repository: a2b90d1283e6: Preparing 298fdca015b6: Preparing 65509a58cfcf: Preparing 1243299d760a: Preparing 560cdcdf7b97: Preparing 969ae42bbcb9: Preparing 50d1d71ba8a5: Preparing 106f29618414: Preparing 3b50b419f2ed: Preparing 1d4b43e97b29: Preparing 8472d83e77cb: Preparing 50d1d71ba8a5: Waiting 969ae42bbcb9: Waiting d066f16d7ad7: Preparing 3b50b419f2ed: Waiting ef16f690993d: Preparing 1d4b43e97b29: Waiting fd72672f2ca3: Preparing 8472d83e77cb: Waiting d066f16d7ad7: Waiting ef16f690993d: Waiting 0a57ccde27fc: Preparing e8ae77a53ed2: Preparing fd72672f2ca3: Waiting 0a57ccde27fc: Waiting 6fcb26a859b5: Preparing e8ae77a53ed2: Waiting 3403d9941e1b: Preparing 76318a4cf363: Preparing 6fcb26a859b5: Waiting 3403d9941e1b: Waiting 725b4ef7ffce: Preparing 41ef8cc0bccb: Preparing 100396c46221: Preparing 7b4b54c74241: Preparing 76318a4cf363: Waiting d17d48b2382a: Preparing 725b4ef7ffce: Waiting 7b4b54c74241: Waiting 41ef8cc0bccb: Waiting d17d48b2382a: Waiting 100396c46221: Waiting a2b90d1283e6: Pushing [> ] 1.024 kB/70.55 kB a2b90d1283e6: Pushing [=======================> ] 33.79 kB/70.55 kB a2b90d1283e6: Pushing [===============================================> ] 66.56 kB/70.55 kB a2b90d1283e6: Pushing [==================================================>] 71.57 kB a2b90d1283e6: Pushing [==================================================>] 72.7 kB a2b90d1283e6: Pushing [==================================================>] 72.7 kB 298fdca015b6: Layer already exists 298fdca015b6: Layer already exists 1243299d760a: Layer already exists 1243299d760a: Layer already exists 65509a58cfcf: Layer already exists 65509a58cfcf: Layer already exists 560cdcdf7b97: Layer already exists 560cdcdf7b97: Layer already exists 50d1d71ba8a5: Layer already exists 50d1d71ba8a5: Layer already exists 3b50b419f2ed: Layer already exists 3b50b419f2ed: Layer already exists 106f29618414: Layer already exists 106f29618414: Layer already exists 969ae42bbcb9: Layer already exists 969ae42bbcb9: Layer already exists 1d4b43e97b29: Layer already exists 1d4b43e97b29: Layer already exists ef16f690993d: Layer already exists ef16f690993d: Layer already exists 8472d83e77cb: Layer already exists 8472d83e77cb: Layer already exists d066f16d7ad7: Layer already exists d066f16d7ad7: Layer already exists 6fcb26a859b5: Layer already exists 6fcb26a859b5: Layer already exists a2b90d1283e6: Pushed a2b90d1283e6: Pushed 0a57ccde27fc: Layer already exists 0a57ccde27fc: Layer already exists e8ae77a53ed2: Layer already exists e8ae77a53ed2: Layer already exists 76318a4cf363: Layer already exists 76318a4cf363: Layer already exists 725b4ef7ffce: Layer already exists 725b4ef7ffce: Layer already exists 3403d9941e1b: Layer already exists 3403d9941e1b: Layer already exists fd72672f2ca3: Layer already exists fd72672f2ca3: Layer already exists 41ef8cc0bccb: Layer already exists 41ef8cc0bccb: Layer already exists 7b4b54c74241: Layer already exists 7b4b54c74241: Layer already exists 100396c46221: Layer already exists 100396c46221: Layer already exists d17d48b2382a: Layer already exists d17d48b2382a: Layer already exists 53: digest: sha256:bace8532d5e90341006bc74e9c96f8e2504b31ed8eeeac1cfaed5d136601e125 size: 5327 {"ok": ["100percentoptical", "aaos_american_academy_of_orthopaedic_surgeons", "amwc_aesthetic_and_anti_aging_medicine_world_congress", "ced_centre_europeen_de_dermocosmetique", "cphi_istanbul", "cphi_japan", "cphi_north_america_and_informex", "cphi_russian", "cphi_south_east_asia", "dentalexpo", 
"efort_european_federation_of_national_association_of_orthopaedics_and_traumatology", "himss", "hospital", "ids_international_dental_show", "kimes", "lsie_life_science_industry_event", "medical_japan", "medizin", "medtec_europe", "medtrade_spring", "mido", "miof_moscow_international_optical_fair", "pharmapack", "pharmatech", "yankee_dental_congress"]}

Answer

Hello,


Apologies for the late reply. The deploy failed because of the long name of one of your spiders; currently there is a limit of 60 characters per spider name. Please rename any spider whose name is longer than this limit and try deploying again.
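
A quick way to spot offending names locally before deploying is to list them with Scrapy's spider loader. A minimal sketch, assuming it is run from the project root with Scrapy installed:

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

# Print any spider names longer than the 60-character limit mentioned above.
loader = SpiderLoader.from_settings(get_project_settings())
print([name for name in loader.list() if len(name) > 60])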

0
Under review
Chris Fankhauser 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Currently, I have several wide-open API keys which I use on our production servers.


It would be very beneficial to be able to use a restricted development-only API key (limited to x requests per day, limited by IP address(es), etc.), which could be shared more widely than the production keys.


Is this possible to set up? If not, is it functionality which seems reasonable to implement?

Answer

Hi Chris, I'm not sure I understand correctly, but the idea behind the different API keys for Crawlera is that you can set up different Crawlera accounts, with different regions for example, as shown in:

https://helpdesk.scrapinghub.com/solution/articles/22000188398-regional-ips-in-crawlera

You can also share your needs so our engineers can evaluate whether it's possible to implement this in an Enterprise account project. If interested, please share as many details as possible through: https://scrapinghub.com/quote.


Thanks, Chris, for always providing good questions and ideas that help us provide a better service.


Best regards,


Pablo

0
Under review
Maynier, Jean 1 month ago in Scrapy Cloud • updated by Rashmi Vasudevan (Support Engineer) 4 weeks ago 1

We have more and more spiders and people working on them.

It is a pain not having a way to attach metadata per project and per spider, for example a long name, a description, a link to a specification doc, etc.


It would be good to have a way to set these in the front end, or even better, to extract them from code (settings.py or metadata in the spider class).


Answer

Hi Jean,


Thank you for the suggestion.

It is a good feature to incorporate, and our product team will look into doing so in one of our future releases.


Thanks,

Rashmi

0
Answered
FHERTZ 1 month ago in Crawlera • updated by Rashmi Vasudevan (Support Engineer) 1 month ago 1

Hello,


We need to crawl an NL website that is restricted to Netherlands IPs.

When we test, we get the message: "No available proxies".


Does this mean that Crawlera doesn't have Netherlands IP addresses?


Thanks


Answer

Hello,


We do not have Netherlands proxies to add at the moment.

Apologies for the inconvenience caused.


Rashmi

0
Waiting for Customer
DPr 1 month ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 4 weeks ago 1

Hello,


I am using Crawlera. I have tested it by making requests to my own website. In my HTTP log, I see the IPs from Crawlera instead of mine. That's OK; however, Google Analytics says that these visits are from my city and not from the different countries of Crawlera's IPs.


Is this normal? Does anyone know what could be the problem?


Thank you,


David

0
Answered
terry.zeng 1 month ago in Scrapy Cloud • updated 4 weeks ago 3

Hi,

I am building a crawling project for my company and we decided to use Scrapinghub, but we are currently facing an issue: we cannot upload the item results to S3.


Can you please give me a tutorial that shows how to do it?


Cheers,

Terry

Answer

Hello Terry,


You can accomplish this with Feed Exports; please check this tutorial: http://help.scrapinghub.com/scrapy-cloud/how-to-export-my-items-to-a-awss3-account-ui-mode
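
For reference, the Scrapy-level configuration behind that tutorial is a feed export pointed at an S3 URI. A minimal sketch with a placeholder bucket and credentials (botocore or boto must be available in the project's requirements):

# settings.py (sketch) -- export items to S3 via Scrapy feed exports.
FEED_URI = 's3://your-bucket/%(name)s/%(time)s.json'   # placeholder bucket
FEED_FORMAT = 'json'
AWS_ACCESS_KEY_ID = '<your AWS access key>'
AWS_SECRET_ACCESS_KEY = '<your AWS secret key>'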