Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
kzrster 1 week ago in Scrapy Cloud • updated 1 week ago 1

Hi!
I needed to scrape a site that has a lot of JS code, so I used Scrapy + Selenium. It also had to run on Scrapy Cloud.
I wrote a spider that uses scrapy + selenium + PhantomJS and ran it on my local machine. All is OK.
Then I deployed the project to Scrapy Cloud using shub-image. The deployment is OK, but the result of
webdriver.page_source differs: it's OK locally, but not in the cloud (the HTML shows a 403 message, although the request returns HTTP 200).
Then I decided to use a Crawlera account. I added it with:

service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]


For Windows (local):

self.driver = webdriver.PhantomJS(
    executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe',
    service_args=service_args)

For Docker:

self.driver = webdriver.PhantomJS(
    executable_path=r'/usr/bin/phantomjs',
    service_args=service_args,
    desired_capabilities=dcap)

Again, locally all is OK; in the cloud it is not.
I've checked the Crawlera info; it's OK, and requests are sent from both (local and cloud).

I don't get what's wrong.
I think it might be a difference between the PhantomJS versions (Windows vs. Linux).

Any ideas?
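
One thing worth double-checking (an assumption, not something confirmed in this thread): PhantomJS receives each service_args entry verbatim, so the embedded double quotes in --proxy and --proxy-auth become part of the values, and --proxy-type=https is not among PhantomJS's documented proxy types (Crawlera is an HTTP proxy that tunnels HTTPS via CONNECT). A minimal sketch without the quotes, with the API key as a placeholder:

    from selenium import webdriver

    # Crawlera authenticates with the API key as the proxy username and
    # an empty password, hence the trailing ":".
    service_args = [
        '--proxy=proxy.crawlera.com:8010',
        '--proxy-type=http',
        '--proxy-auth=APIKEY:',
        # Assumption: needed for HTTPS sites unless the Crawlera CA
        # certificate is installed in the image.
        '--ignore-ssl-errors=true',
    ]
    driver = webdriver.PhantomJS(executable_path='/usr/bin/phantomjs',
                                 service_args=service_args)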

0
Answered
jasonhousman2 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Sorry if this is a repeat question; I just recently moved to Scrapinghub. After reconfiguring my project while keeping my pipeline, I noticed that the item count remains at zero. This makes sense to me given that I am exporting to a CSV, but I am curious: is that the proper usage? The order of the fields is important for this specific project, so this would be best. If this is doable, how exactly will I receive the CSV?


Thanks

Answer

Hey Jason!


We wrote an article about how to extract CSV data: https://helpdesk.scrapinghub.com/support/solutions/articles/22000200409-fetching-latest-spider-data


I'm not sure if that is what you are looking for; we have also created some interesting tutorials that can give you ideas on how to work properly with Scrapy Cloud: https://helpdesk.scrapinghub.com/support/solutions/articles/22000200392-scrapy-cloud-video-tutorials-

Please let us know if this helps or if we can help you further.


Best regards,


Pablo
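
For reference, the fetch described in that article can also be scripted against the Items API; a minimal sketch (the job key and field names are placeholders, and the fields parameter fixes the column order, which addresses the ordering concern):

    import requests

    APIKEY = 'YOUR_API_KEY'
    JOB = 'PROJECT_ID/SPIDER_ID/JOB_ID'

    # format=csv returns the items as CSV; fields selects the columns
    # and their order; include_headers=1 adds the header row.
    resp = requests.get(
        'https://storage.scrapinghub.com/items/' + JOB,
        params={'format': 'csv',
                'fields': 'name,price,url',
                'include_headers': '1'},
        auth=(APIKEY, ''),
    )
    resp.raise_for_status()
    with open('items.csv', 'wb') as f:
        f.write(resp.content)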

0
Answered
Jazzity 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 days ago 4

Hey everybody,


I am trying to store scraped images to S3. However, I get the following error message:


The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.


Initial research tells me that AWS4-HMAC-SHA256 (aka "V4") is one of two authentication schemes in S3, the other being the older scheme, Signature Version 2 ("V2").


Does anybody know how I can switch to V4, or have any other hints that would help me upload my images to S3?


My test project is called "S3_test" and has the ID 178090.


Any help is greatly appreciated.


Best wishes


Sebastian

Answer

Hey, glad to hear it works!

It was a pleasure to help, Sebastian!

Thanks for your nice feedback,


Best,


Pablo Vaz
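
The thread doesn't record the final fix, but this error typically means the bucket lives in a region that accepts only V4 signatures. A sketch of the relevant spider settings, assuming a Scrapy version whose S3 storage honours the AWS_REGION_NAME setting (bucket name, region, and credentials are placeholders):

    # settings.py
    IMAGES_STORE = 's3://my-bucket/images/'
    AWS_ACCESS_KEY_ID = 'YOUR_KEY'
    AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET'
    # A V4-only region; request signing follows the configured region.
    AWS_REGION_NAME = 'eu-central-1'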

0
e.rawkz 2 weeks ago in Scrapy Cloud 0

I have a pet project that scrapes video hosting sites and returns as items the title, video source URL (stream), and category (depending on the website being scraped). Using Scrapinghub's Python API client, I then manually have to insert the project ID and the specific job ID, and iterate through the items to create a .m3u playlist. The purpose of the project is to aggregate videos into one playlist that can be played in VLC (or a program of your choice).


Here's a quick write-up showing, more or less, how I have been iterating through each project:
...
list = conn.project_ids()
print("PROJECTS")
print("-#-" * 30)
for index, item in enumerate(list[1::]):
    index = str(index)
    item = str(item)
    project = conn[item]
    pspi = project.spiders()
    jobs = project.jobs()
    for x in pspi:
        print("[" + index + "] | PROJECT ID " + item, x['id'], x['tags'])
....
The issue is that I am then unable to iterate through the jobs to call each one (I'm aware that using "list" as a name is not recommended, since it shadows a Python built-in; this is just an example of the process I go through, more or less).


I understand that I'm not being very clear, as English is not my native language. Ultimately, all I wish to do is iterate through a project's jobs so that I can get all job.items from every job in the given project.
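
A minimal sketch of that iteration with the same legacy client (assuming the Connection API from python-scrapinghub; the API key and project ID are placeholders):

    from scrapinghub import Connection

    conn = Connection('YOUR_API_KEY')
    project = conn['PROJECT_ID']

    # Each job yielded by project.jobs() exposes its scraped items
    # through job.items(), so this walks every item of every job.
    for job in project.jobs():
        for item in job.items():
            print(item.get('title'), item.get('url'))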

0
Answered
g4s.evry 2 weeks ago in Crawlera • updated by Thriveni Patil (Support Engineer) 2 weeks ago 1

Hi,


I was able to access the URL below. Today I am unable to access it.


http://help.scrapinghub.com/crawlera/


It says 404 Not found.

Answer

Hello,


We have moved to a new interface; you can find the Crawlera KB articles at https://helpdesk.scrapinghub.com/solution/folders/22000131039 .


Regards,

Thriveni Patil

0
Answered
jkluv000 2 weeks ago in Portia • updated by Thriveni Patil (Support Engineer) 2 weeks ago 1

Is Portia natively using Crawlera, or is there an integration between the two?

Answer

Hello,


By default, Portia doesn't use Crawlera. You would need to subscribe to Crawlera, enable it for the project through the addon settings (https://helpdesk.scrapinghub.com/solution/articles/22000200395-scrapy-cloud-addons), and then run the Portia spider. The spider will then use Crawlera while crawling.


Regards,

Thriveni Patil

0
Waiting for Customer
ayushabesit 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 3

Hi, I created a project in Scrapinghub, then deleted it, then created a new project and tried to run the spider. But when I give the command "shub deploy", it goes to the previous project ID and gives the error: Deploy failed (404): Project: non_field_errors. It shows that it is deploying to the previous ID, but the current ID is different. Please suggest a solution.
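
In case it helps: shub reads the target project ID from the scrapinghub.yml file in the project root, so a stale ID left over from the deleted project would explain this behavior. A minimal sketch (12345 stands in for the new project's ID):

    # scrapinghub.yml
    projects:
      default: 12345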

0
tofunao1 2 weeks ago in Scrapy Cloud 0

I found a topic: https://support.scrapinghub.com/topics/708-api-for-periodic-jobs/

That topic explains how to use the API to add periodic jobs:

curl -X POST -u APIKEY: "http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID" -d '{"hour": "0", "minutes_shift": "0", "month": "*", "spiders": [{"priority": "2", "args": {}, "name": "SPIDER"}], "day": "*"}'

Recently I found that it always returns an error: https://support.scrapinghub.com/topics/2479-api-for-periodic-jobs-cannot-be-used/

Then I found that when I replace each '"' with '\"' (the Windows command prompt does not treat single quotes as quoting characters), the command works, such as:

curl -X POST -u APIKEY: \"http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID\" -d "{\"hour\": \"0\", \"minutes_shift\": \"0\", \"month\": \"*\", \"spiders\": [{\"priority\": \"2\", \"args\": {}, \"name\": \"SPIDER\"}], \"day\": \"*\"}"

So I successfully added the periodic jobs.

But when I need to set the 'day of month', I modified the command:

curl -X POST -u APIKEY: \"http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID\" -d "{\"hour\": \"0\", \"minutes_shift\": \"0\", \"month\": \"*\", \"spiders\": [{\"priority\": \"2\", \"args\": {}, \"name\": \"SPIDER\"}], \"day\": \"*\", \"dayofmonth\": \"7\"}"

But the result always shows 'every day' for 'day of month'. (The '7' in the screenshot, not reproduced here, was edited in manually.)

Is there a bug in the API code? How can I solve it?
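
Shell quoting aside, the same request can be issued from Python, which sidesteps the escaping problem entirely. A sketch using the endpoint from this thread (APIKEY and PROJECTID are placeholders; whether the endpoint honours a "dayofmonth" field is exactly the open question above):

    import requests

    payload = {
        "hour": "0", "minutes_shift": "0", "month": "*", "day": "*",
        "spiders": [{"priority": "2", "args": {}, "name": "SPIDER"}],
    }
    resp = requests.post(
        "http://dash.scrapinghub.com/api/periodic_jobs",
        params={"project": "PROJECTID"},
        json=payload,
        auth=("APIKEY", ""),
    )
    print(resp.status_code, resp.text)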

0
Answered
Regan 3 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 5 days ago 1

I get the following error when trying to access an SSL website through the proxy in C#: The remote server returned an error: (407) Proxy Authentication Required.


I have installed the certificate and tried the following two code methods:


1.

    var key = _scrapingApiKey;
    var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

    var encodedApiKey = Base64Encode(key);
    request.Headers.Add("Proxy-Authorization", "Basic " + encodedApiKey);

    request.Proxy = myProxy;
    request.PreAuthenticate = true;

    WebResponse response = request.GetResponse();


2.

    var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
    myProxy.Credentials = new NetworkCredential(_scrapingApiKey, "");

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Proxy = myProxy;
    request.PreAuthenticate = true;

    WebResponse response = request.GetResponse();


What is the correct way to make the proxy work when accessing SSL websites?

Answer

Hello,


The top code should work, but make sure to include the ":" after the API key when encoding it, i.e. Base64Encode(key + ":").

0
dyv 3 weeks ago in Scrapy Cloud 0

My spiders return inconsistent results. I am working on the website https://www.ifixit.com/. How can I get the same number of items every time I run the spider?

0
Waiting for Customer
Base79 3 weeks ago in Portia • updated by Nestor Toledo Koplin (Support Engineer) 2 weeks ago 11

Hi there,


This tool is new to me, but I keep running into a problem right from the start.

The New Sample button doesn't show anywhere after I have created a new spider,

so I cannot select any data.

0
terry.zeng 3 weeks ago in Scrapy Cloud 0

Hi,


I found a mismatched count when I run:

    from scrapinghub import ScrapinghubClient

    project = ScrapinghubClient(APIKEY).get_project(PROJECT_ID)

    project.jobs.summary()


In the summary, the pending count shows 7, but the list has only 5 entries:

    {"count": 7, "name": "pending", "summary": [
        {"ts": 1491548459773, "spider": "test_quotes9", "elapsed": 16852, "state": "pending", "version": "e1fe743-master", "key": "168276/16/3", "pending_time": 1491548459773},
        {"ts": 1491548459498, "spider": "test_quote", "elapsed": 17127, "state": "pending", "version": "e1fe743-master", "key": "168276/8/8", "pending_time": 1491548459498},
        {"ts": 1491548459480, "spider": "test_quotes8", "elapsed": 17145, "state": "pending", "version": "e1fe743-master", "key": "168276/15/3", "pending_time": 1491548459480},
        {"ts": 1491548459467, "spider": "test_quotes7", "elapsed": 17158, "state": "pending", "version": "e1fe743-master", "key": "168276/14/3", "pending_time": 1491548459467},
        {"ts": 1491548459451, "spider": "test_quotes5", "elapsed": 17174, "state": "pending", "version": "e1fe743-master", "key": "168276/12/3", "pending_time": 1491548459451}]}


cheers,

Terry
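
No answer was recorded in this thread, but a plausible explanation (an assumption, not confirmed here) is that summary() caps how many jobs it lists per queue, while count reports the queue's total. Enumerating the queue directly avoids any such cap; a sketch with the same client:

    from scrapinghub import ScrapinghubClient

    project = ScrapinghubClient(APIKEY).get_project(PROJECT_ID)

    # jobs.iter() yields one metadata dict per job; filtering by state
    # walks the full pending queue rather than the summary excerpt.
    for job in project.jobs.iter(state='pending'):
        print(job['key'], job['spider'])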

0
Started
robi9011235 3 weeks ago in Portia • updated 1 week ago 7

This article gives me a bit of information, but I still don't get what I need to do in order for it to work, or why it's not working:

http://help.scrapinghub.com/portia/annotations-and-data-extraction

Answer

Hey Robi, sorry to hear you have experienced problems using Portia.

When you said "And I'm paying for this thing": that's strange, as we offer Portia as a free service; you shouldn't pay for it. Please let us know if some third party is charging you for using Portia.


About bugs and issues: unfortunately yes, we have been experiencing some since our release of Portia v2, and we are trying to solve them as soon as possible. Again, we offer Portia as a free service, and your contributions and constructive feedback are always welcome.


Please check https://helpdesk.scrapinghub.com/support/solutions/articles/22000200446-troubleshooting-portia to learn more.


Best regards,


Pablo

Support team

0
Answered
Jazzity 3 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 week ago 6

Dear Forum,


I am trying to store scraped images to S3.

However, when launching the scraper I get the following error message:


ValueError: Missing scheme in request url: h


The message no longer appears when I deactivate the images addon, so it would seem that the problem is not actually the request URL.


These are my spider settings (screenshot not reproduced here):


Any help is greatly appreciated!


Regards,


Sebastian

Answer

Hi Sebastian, please check whether you are setting the item field as a list rather than a string in your spider. The images pipeline iterates over the field's value, so a bare string is consumed character by character, which is why the reported "URL" is just "h". For example, if you are yielding:

    yield {
        'image': response.css('example').extract_first(),
    }

use

    yield {
        'image': [response.css('example').extract_first()],
    }

To know more, please check the example provided in this excellent blog post:

https://blog.scrapinghub.com/2016/02/24/scrapy-tips-from-the-pros-february-2016-edition/


Best,


Pablo

0
Answered
robi9011235 3 weeks ago in Portia • updated 3 weeks ago 4

I'm trying to crawl this website: https://www.fxp.co.il/

But I always get the message: "Frames are not supported by portia"

The thing is, it worked a few days ago with the same project.


Also, unfortunately I'm having a really bad experience with Portia: I keep getting different errors when creating new projects and loading existing ones, and it is always trying to reconnect to the Portia server. Your product is really buggy, and this results in a bad experience for me.

I wish there were a better alternative, but everything I found is just not as easy, simple, and fast.

Answer

Hey Robi,


About:

"I wish there were a better alternative, but everything I found is just not as easy, simple, and fast"


That's the trade-off of making it more UX-friendly:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000200446-troubleshooting-portia


Our team is working hard to fix all the bugs and misbehavior in Portia, but unfortunately not everything depends on Portia alone. If a site tightens its security, Portia won't work as it used to; even any change in the site can affect how Portia interacts with it.


If your project becomes more ambitious, my suggestion is to consider a more powerful crawling framework like Scrapy. Check this comparison table:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201026-portia-or-scrapy

If you are interested in learning Scrapy, please check these excellent videos provided by Valdir:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201028-learn-scrapy-video-tutorials-


If your project requires urgent attention, you can also consider hiring our experts. It can save you a lot of time and resources: https://scrapinghub.com/quote


Regardless of the above suggestions, thanks for your feedback; I will share it with our Portia team as well.


Best regards,


Pablo Vaz

Support team