Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Gabriel Munits 10 hours ago in Scrapy Cloud 0

Hey everyone,

I am launching a new service that bypasses reCAPTCHA, with multi-language support.

hello 17 hours ago in Scrapy Cloud 0

I am using the following code to try to bring back all fields within a job using the Items API:


$sch_id = "172/73/3"; // job ID

$ch = curl_init();

curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => "https://storage.scrapinghub.com/items/" . $sch_id . "?format=json&fields=_type,design,price,material",
    CURLOPT_CUSTOMREQUEST => "GET",
    CURLOPT_HTTPHEADER => array(
        'Accept: application/x-jsonlines',
    ),
    CURLOPT_USERPWD => "e658eb1xxxxxxxxxxxx4b42de6fd" . ":" . "",
));

$result = curl_exec($ch);
print_r(json_decode($result));

curl_close ($ch);

There are 4 fields I am trying to get as JSON, but the request only brings back "_type" and "price". I have tried various things with different headers and the request URL, but no luck.
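One way to check whether the missing fields exist in the stored items at all is to fetch the job without the fields filter and inspect each item's keys. A minimal sketch in Python (the API key is a placeholder):

import requests

# Fetch the items of job 172/73/3 as a JSON array and list the keys each item actually has.
url = "https://storage.scrapinghub.com/items/172/73/3"
resp = requests.get(url, params={"format": "json"}, auth=("APIKEY", ""))
for item in resp.json():
    print(sorted(item.keys()))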


Any advice would be appreciated.


Cheers,

Adam

Alex L 4 days ago in Scrapy Cloud 0

Currently, scripts can only be deployed by using the shub deploy command. When we push scripts to git, the app doesn't seem to pull the scripts from our repo.


Will pulling scripts via a git hook be supported in the future, or do you intend to stick with shub deploy for now?

shamily23v 5 days ago in Scrapy Cloud • updated 5 days ago 0

I would like to write/update data in MongoDB with the items crawled on Scrapinghub.
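One common approach is a Scrapy item pipeline that upserts each crawled item into MongoDB. A minimal sketch, assuming pymongo is listed in the project's requirements; the connection URI, database, collection and the "url" key used for upserts are placeholders:

import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        # Connection details would normally come from settings; hard-coded here for brevity.
        self.client = pymongo.MongoClient("mongodb://user:password@host:27017")
        self.collection = self.client["mydb"]["items"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert on a unique field so re-runs update existing documents instead of duplicating them.
        self.collection.update_one({"url": item["url"]}, {"$set": dict(item)}, upsert=True)
        return item

The pipeline would then be enabled through ITEM_PIPELINES in the project's settings.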

mattegli 1 week ago in Scrapy Cloud 0

I am trying to deploy a (local) Portia project to Scrapinghub. After adding "slybot" to requirements.txt I can deploy successfully, but when running the spider the following error occurs:


Traceback (most recent call last):

  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spidermanager.py", line 51, in __init__
    **kwargs)
  File "/app/python/lib/python2.7/site-packages/slybot/spider.py", line 44, in __init__
    ((t['scrapes'], t) for t in spec['templates']
KeyError: 'templates'
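The KeyError suggests slybot is loading a spider spec that has no "templates" entry, i.e. a spider without any annotated sample pages. A quick way to check the local project before deploying; a sketch that assumes the Portia spiders are stored as JSON files under a spiders/ directory (the layout can differ between Portia versions):

import glob
import json

# List each spider spec and how many templates (annotated samples) it defines.
for path in glob.glob("spiders/*.json"):
    with open(path) as f:
        spec = json.load(f)
    print(path, "->", len(spec.get("templates", [])), "template(s)")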
BobC 1 week ago in Scrapy Cloud 0

The following URL renders fine in all browsers EXCEPT the Scrapinghub browser:

https://tenforward.social/@Redshirt27

I'd like to find out why, but no clues are given. Help?

kzrster 1 week ago in Scrapy Cloud • updated 1 week ago 1

Hi!
I needed to scrape a site with a lot of JS code, so I used scrapy + selenium. It also has to run on Scrapy Cloud.
I wrote a spider which uses scrapy + selenium + phantomjs and ran it on my local machine. Everything is ok.
Then I deployed the project to Scrapy Cloud using shub-image. The deployment is ok, but the result of
webdriver.page_source is different: it's fine locally, but not in the cloud (the HTML contains a 403 message, although the request returns HTTP 200).
Then I decided to use my Crawlera account. I've added it with:

service_args = [
    '--proxy="proxy.crawlera.com:8010"',
    '--proxy-type=https',
    '--proxy-auth="apikey"',
]


For Windows (local):
self.driver = webdriver.PhantomJS(executable_path=r'D:\programms\phantomjs-2.1.1-windows\bin\phantomjs.exe',service_args=service_args)


For Docker:

self.driver = webdriver.PhantomJS(executable_path=r'/usr/bin/phantomjs', service_args=service_args, desired_capabilities=dcap)

Again, locally everything is ok; in the cloud it is not.
I've checked the Crawlera info. It's ok, and requests are sent from both (local and cloud).

I don't get what's wrong.
I think it might be a difference between the PhantomJS versions (Windows vs. Linux).
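One way to narrow it down may be to log the PhantomJS build and what the proxy actually returns in both environments. A rough sketch (httpbin.org is just a convenient echo service, not part of the original setup):

print("PhantomJS:", self.driver.capabilities.get("version"))              # compare local vs. cloud build
print("User agent:", self.driver.execute_script("return navigator.userAgent"))
self.driver.get("https://httpbin.org/ip")                                 # should show a Crawlera IP if the proxy is applied
print(self.driver.page_source)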

Any ideas?
Answered
jasonhousman2 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Sorry if this is a repeat question; I just recently moved to Scrapy Cloud. After re-configuring my project while keeping my pipeline, I noticed that the item count remains at zero. This makes sense to me given that I am exporting to a CSV, but I am curious: is that the proper usage? The order of fields is important for this specific project, so this would be best. If this is doable, how exactly will I receive the CSV?


Thanks

Answer

Hey Jason!


We wrote an article about how to extract CSV data: https://helpdesk.scrapinghub.com/support/solutions/articles/22000200409-fetching-latest-spider-data
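If the goal is a CSV with a fixed column order, the Items API can also return CSV directly through the fields and include_headers parameters. A minimal sketch in Python (project/spider/job IDs, field names and API key are placeholders):

import requests

# Download a job's items as CSV with an explicit column order.
url = "https://storage.scrapinghub.com/items/PROJECT_ID/SPIDER_ID/JOB_ID"
params = {"format": "csv", "fields": "field1,field2,field3", "include_headers": 1}
resp = requests.get(url, params=params, auth=("APIKEY", ""))
with open("items.csv", "wb") as f:
    f.write(resp.content)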


Not sure if that is exactly what you are looking for. We have also created some tutorials that may give you ideas on how to work with Scrapy Cloud: https://helpdesk.scrapinghub.com/support/solutions/articles/22000200392-scrapy-cloud-video-tutorials-

Please let us know if this helps or if we can help you further.


Best regards,


Pablo

Answered
Jazzity 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 days ago 4

Hey everybody,


I am trying to store scraped images to S3. However, I get the following error message:


The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.


Initial research tells me that this means that AWS4-HMAC-SHA256 (aka "V4") is one of two authentication schemes in S3, the other one being the older scheme, Signature Version 2 ("V2").
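For context, the S3 image storage settings involved typically look like this in Scrapy (a sketch; the bucket, prefix and credentials are placeholders, and whether V2 or V4 signing is used depends on the boto/botocore library underneath the Scrapy version in use):

# settings.py (sketch): store files produced by the images pipeline in S3
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "s3://my-bucket/images/"   # placeholder bucket/prefix
AWS_ACCESS_KEY_ID = "..."                 # placeholder credentials
AWS_SECRET_ACCESS_KEY = "..."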


Does anybody know how I can switch to V4 - or any other hints that help me upload my images to S3?


My test project is called "S3_test" and has the ID 178090.


Any help is greatly appreciated.


Best wishes


Sebastian

Answer

Hey, glad to hear that works!

It was a pleasure to help, Sebastian!

Thanks for your nice feedback,


Best,


Pablo Vaz

e.rawkz 2 weeks ago in Scrapy Cloud 0

I have a pet project that scrapes video hosting sites and returns as items the title, the video source URL (stream), and the category (depending on the website being scraped). Using Scrapinghub's Python API client, I then manually have to insert the project ID and the specific job ID and iterate through the items to create an .m3u playlist. The purpose of the project is to aggregate videos into one playlist that can be played with VLC (or the program of your choice).


Here's a quick sample of more or less how I have been iterating through each project:
...
list = conn.project_ids()
print("PROJECTS")
print("-#-" * 30)
for index, item in enumerate(list[1::]):
    index = str(index)
    item = str(item)
    project = conn[item]
    pspi = project.spiders()
    jobs = project.jobs()
    for x in pspi:
        print("[" + index + "] | PROJECT ID " + item, x['id'], x['tags'])
....
The issue is that I am then unable to iterate through the jobs and call each job. (I'm aware that using "list" as a name is not recommended because it shadows a Python built-in; this is just an example of the process I go through, more or less.)


I understand that I'm not being very clear, as English is not my native language. Ultimately, all I want to do is iterate through a project's jobs so that I can get job.items from every job in the given project.
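Assuming the legacy scrapinghub.Connection client shown above, iterating over a project's jobs and their items might look roughly like this (the API key and project ID are placeholders):

from scrapinghub import Connection

conn = Connection("APIKEY")      # placeholder API key
project = conn["123456"]         # placeholder project ID

for job in project.jobs():       # every job in the project
    for item in job.items():     # every item scraped by that job
        print(item.get("title"), item.get("url"))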

Waiting for Customer
ayushabesit 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 3

Hi, I created a project in Scrapinghub, then I deleted it, and then I created a new project and tried to run the spider. But when I run "shub deploy", it deploys to the previous project ID and gives the error: Deploy failed (404), Project: non_field_errors. It shows that it is deploying to the previous ID even though the current ID is different. Please suggest a solution.
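shub normally remembers the deploy target in the scrapinghub.yml file at the project root (older versions stored it in scrapy.cfg), so one likely fix is to point the default project at the new ID. A sketch with a placeholder ID:

# scrapinghub.yml
projects:
  default: 123456   # replace with the new project's ID

Alternatively, the target project can be given explicitly on the command line, e.g. shub deploy 123456.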

tofunao1 2 weeks ago in Scrapy Cloud 0

I found a topic: https://support.scrapinghub.com/topics/708-api-for-periodic-jobs/

This topic describes how to use the API to add periodic jobs, using:

curl -X POST -u APIKEY: "http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID" -d '{"hour": "0", "minutes_shift": "0", "month": "*", "spiders": [{"priority": "2", "args": {}, "name": "SPIDER"}], "day": "*"}'

Recently I found that it always returns an error: https://support.scrapinghub.com/topics/2479-api-for-periodic-jobs-cannot-be-used/

Now I have found that when I change this command by replacing '"' with '\"', it works, for example:

curl -X POST -u APIKEY: \"http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID\" -d "{\"hour\": \"0\", \"minutes_shift\": \"0\", \"month\": \"*\", \"spiders\": [{\"priority\": \"2\", \"args\": {}, \"name\": \"SPIDER\"}], \"day\": \"*\"}"

So I can successfully add periodic jobs.

But when I need to set the 'day of month', I modify the command:

curl -X POST -u APIKEY: \"http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID\" -d "{\"hour\": \"0\", \"minutes_shift\": \"0\", \"month\": \"*\", \"spiders\": [{\"priority\": \"2\", \"args\": {}, \"name\": \"SPIDER\"}], \"day\": \"*\", \"dayofmonth\": \"7\"}"

But the result always shows 'every day' for 'day of month'. (The '7' in the picture was edited in manually.)

Is there a bug in the API? How can I solve this?
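As a side note on the quoting: needing to escape the double quotes suggests the command is being run from the Windows cmd shell. Sending the same payload from Python avoids shell escaping entirely; a sketch reusing the JSON from above (note that the extra day-of-month key is only the poster's guess, not a documented parameter):

import requests

# Same request as the curl command above, without any shell quoting issues.
url = "http://dash.scrapinghub.com/api/periodic_jobs?project=PROJECTID"
payload = {
    "hour": "0",
    "minutes_shift": "0",
    "month": "*",
    "day": "*",
    "spiders": [{"priority": "2", "args": {}, "name": "SPIDER"}],
}
resp = requests.post(url, json=payload, auth=("APIKEY", ""))
print(resp.status_code, resp.text)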




dyv 3 weeks ago in Scrapy Cloud 0

My spiders return inconsistent results. I am working on the website https://www.ifixit.com/. How can I get the same number of items every time I run the spider?
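Inconsistent item counts are often caused by transient failures (timeouts, 5xx responses, throttling) rather than by the spider logic itself. A few settings that may make runs more repeatable, as a sketch with illustrative values:

# settings.py (sketch): retry transient failures and slow the crawl down a bit
RETRY_ENABLED = True
RETRY_TIMES = 5                                # retry a few more times than the default
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
DOWNLOAD_TIMEOUT = 60
AUTOTHROTTLE_ENABLED = True                    # back off automatically when the site slows down
CONCURRENT_REQUESTS_PER_DOMAIN = 4

Comparing the request, retry and error counts in the job stats between two runs can also show where items are being lost.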

terry.zeng 3 weeks ago in Scrapy Cloud 0

Hi,


I found a count mismatch when I run:

from scrapinghub import ScrapinghubClient

project = ScrapinghubClient(APIKEY).get_project(PROJECT_ID)

project.jobs.summary()


In the summary, the pending count shows 7, but the list only contains 5 entries:

    {"count": 7, "name": "pending", "summary": [{"ts": 1491548459773, "spider": "test_quotes9", "elapsed": 16852, "state": "pending", "version": "e1fe743-master", "key": "168276/16/3", "pending_time": 1491548459773}, {"ts": 1491548459498, "spider": "test_quote", "elapsed": 17127, "state": "pending", "version": "e1fe743-master", "key": "168276/8/8", "pending_time": 1491548459498}, {"ts": 1491548459480, "spider": "test_quotes8", "elapsed": 17145, "state": "pending", "version": "e1fe743-master", "key": "168276/15/3", "pending_time": 1491548459480}, {"ts": 1491548459467, "spider": "test_quotes7", "elapsed": 17158, "state": "pending", "version": "e1fe743-master", "key": "168276/14/3", "pending_time": 1491548459467}, {"ts": 1491548459451, "spider": "test_quotes5", "elapsed": 17174, "state": "pending", "version": "e1fe743-master", "key": "168276/12/3", "pending_time": 1491548459451}]}


cheers,

Terry

Answered
Jazzity 3 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 1 week ago 6

Dear Forum,


I am trying to store scraped images to S3.

However, when launching the scraper I get the following error message:


ValueError: Missing scheme in request url: h



The message no longer appears when I deactivate the images addon, so it would seem that the problem is not actually the request url.


These are my spider settings:



Any help is greatly appreciated!


Regards,


Sebastian

Answer

Hi Sebastian, please check whether you are setting the item field as a list and not as a string in your spider. If the field is a plain string, the images pipeline iterates over it character by character, which is why the error only shows the "h". For example, if you are yielding:


yield {
    'image': response.css('example').extract_first(),
}

use

yield {
    'image': [response.css('example').extract_first()],
}

To know more, please check the example provided in this excellent blog post:

https://blog.scrapinghub.com/2016/02/24/scrapy-tips-from-the-pros-february-2016-edition/


Best,


Pablo