Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
jasonhousman2 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Sorry if this is a repeat question, but I just recently moved to Scrapy Cloud. After re-configuring my project while keeping my pipeline, I noticed that the item count remains at zero. That makes sense to me given that I am exporting to a CSV, but I am curious: is that the proper usage? The order of the fields is important for this specific project, so this approach would be best. If this is doable, how exactly will I receive the CSV?


Thanks

Answer

Hey Jason!


We wrote an article about how to fetch spider data as CSV: https://helpdesk.scrapinghub.com/support/solutions/articles/22000200409-fetching-latest-spider-data


I'm not sure if that is exactly what you are looking for, but we have also created some tutorials that may give you ideas on how to work with Scrapy Cloud: https://helpdesk.scrapinghub.com/support/solutions/articles/22000200392-scrapy-cloud-video-tutorials-
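If you just need a job's items as a CSV with the columns in a fixed order, a minimal sketch using the items storage API could look like the following (the API key, job ID and field names are placeholders; the "fields" parameter is what fixes the column order):

import requests

API_KEY = 'YOUR_APIKEY'        # placeholder API key
JOB_ID = '123456/1/2'          # placeholder project/spider/job ID

# Ask the items storage API for CSV output; "fields" fixes the column order.
response = requests.get(
    'https://storage.scrapinghub.com/items/' + JOB_ID,
    params={'format': 'csv', 'fields': 'name,price,url', 'include_headers': 1},
    auth=(API_KEY, ''),
)
with open('items.csv', 'wb') as f:
    f.write(response.content)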

Please let us know if that helps or if we can help you further.


Best regards,


Pablo

0
Answered
Jazzity 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 4

Hey everybody,


I am trying to store scraped images to S3. However, I get the following error message:


The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.


Initial research tells me that this means that AWS4-HMAC-SHA256 (aka "V4") is one of two authentication schemes in S3, the other one being the older scheme, Signature Version 2 ("V2").


Does anybody know how I can switch to V4, or have any other hints that would help me upload my images to S3?


My test project is called "S3_test" and has the ID 178090.


Any help is greatly appreciated.


Best wishes


Sebastian

Answer

Hey, glad to hear that it works!

It was a pleasure to help, Sebastian!

Thanks for your nice feedback,


Best,


Pablo Vaz
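For readers who hit the same AWS4-HMAC-SHA256 error: one way to force Signature Version 4 is to upload with boto3 and an explicit SigV4 config. A minimal standalone sketch (the bucket, region and file names are placeholders, and this is not necessarily how the images pipeline was configured in this case):

import boto3
from botocore.client import Config

# Force AWS4-HMAC-SHA256 ("SigV4") signing, which newer S3 regions such as eu-central-1 require.
s3 = boto3.client(
    's3',
    region_name='eu-central-1',                 # placeholder region
    config=Config(signature_version='s3v4'),
)
s3.upload_file('local_image.jpg', 'my-bucket', 'images/local_image.jpg')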

0
Answered
e.rawkz 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I have a pet project that scrapes video hosting sites and returns as items the title, the video source URL (stream), and the category (depending on the website being scraped). Using Scrapinghub's Python API client I then manually have to insert the project ID and the specific job ID and iterate through the items to create an .m3u playlist... The purpose of the project is to aggregate the videos into one playlist that can be played with VLC (or a program of your choice).


Here's a quick sample of more or less how I have been iterating through each project
...
list = conn.project_ids()
print("PROJECTS")
print("-#-" * 30)
for index, item in enumerate(list[1::]):
    index = str(index)
    item = str(item)
    project = conn[item]
    pspi = project.spiders()
    jobs = project.jobs()
    for x in pspi:
        print("[" + index + "] | PROJECT ID " + item, x['id'], x['tags'])
....
The issue is that I am unable to then iterate through the jobs to call each one (I am aware that using "list" is not recommended as it shadows a Python built-in; this is just an example of the process I more or less go through)...


I also understand that I'm not being very clear, as English is not my native language. Ultimately, all I wish to do is iterate through a project's jobs so that I can call job.items for all jobs in the given project...

Answer

Hi!


We can't provide coding support through the community forum. Perhaps the best channels to ask on are Stack Overflow and GitHub, where many of our best developers contribute actively.
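For readers with the same goal, a minimal sketch using the legacy python-scrapinghub Connection client might look like this (the API key and project ID are placeholders, and job.items() is assumed to behave as in that legacy client):

from scrapinghub import Connection

conn = Connection('YOUR_APIKEY')     # placeholder API key
project = conn['123456']             # placeholder project ID

for job in project.jobs():           # every job in the project
    for item in job.items():         # every item scraped by that job
        print(item)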


Best,


Pablo

0
Answered
g4s.evry 2 months ago in Crawlera • updated by Thriveni Patil (Support Engineer) 2 months ago 1

Hi,


I was able to access the URL below, but today I am unable to access it.


http://help.scrapinghub.com/crawlera/


It says 404 Not found.

Answer

Hello,


We have moved to a new interface; you can find the Crawlera KB articles at https://helpdesk.scrapinghub.com/solution/folders/22000131039 .


Regards,

Thriveni Patil

0
Answered
jkluv000 2 months ago in Portia • updated by Thriveni Patil (Support Engineer) 2 months ago 1

Does Portia natively use Crawlera, or is there an integration between the two?

Answer

Hello,


By default Portia doesn't use Crawlera. You need to subscribe to Crawlera, enable it for the project through the addon settings (https://helpdesk.scrapinghub.com/solution/articles/22000200395-scrapy-cloud-addons), and then run the Portia spider. The spider will then use Crawlera while crawling.
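For a code-based Scrapy project, the equivalent of the addon is the scrapy-crawlera middleware; a minimal settings sketch (the API key is a placeholder) would be:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'YOUR_CRAWLERA_APIKEY'   # placeholder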


Regards,

Thriveni Patil

0
Answered
ayushabesit 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 4

Hi, I created a project in Scrapinghub, then deleted it, then created a new project and tried to run the spider. But when I run the command "shub deploy" it targets the previous project ID and gives the error: Deploy failed (404): Project: non_field_errors. It shows that it is deploying to the previous ID even though the current ID is different, so please suggest a solution.
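shub reads the target project ID from the scrapinghub.yml file in the project root, so it will keep deploying to the old ID until that file is updated. A minimal scrapinghub.yml sketch (the project ID is a placeholder):

projects:
  default: 123456

Updating that ID (or deleting the file and letting shub deploy prompt for a new target) should point the deploy at the new project.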

0
Answered
Regan 3 months ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 2 months ago 1

I get the following error when trying to access an SSL website through the proxy in C#: The remote server returned an error: (407) Proxy Authentication Required.


I have installed the certificate and tried the two code approaches below:


1.

var key = _scrapingApiKey;
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

var encodedApiKey = Base64Encode(key);
request.Headers.Add("Proxy-Authorization", "Basic " + encodedApiKey);

request.Proxy = myProxy;
request.PreAuthenticate = true;

WebResponse response = request.GetResponse();


2.

var myProxy = new WebProxy("http://proxy.crawlera.com:8010");

myProxy.Credentials = new NetworkCredential(_scrapingApiKey, "");

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Proxy = myProxy;

request.PreAuthenticate = true;

WebResponse response = request.GetResponse();


What is the correct way to make the proxy work when accessing SSL websites?

Answer

Hello,


The top code should work, but make sure to append ":" to the API key before Base64-encoding it, i.e. encode "APIKEY:" rather than just "APIKEY" (the colon separates the key from an empty password).
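For comparison, a Python requests sketch of the same authentication convention (the API key and CA-certificate path are placeholders; the trailing ":" is the empty password mentioned above):

import requests

proxies = {
    'http': 'http://YOUR_APIKEY:@proxy.crawlera.com:8010/',
    'https': 'http://YOUR_APIKEY:@proxy.crawlera.com:8010/',
}
# The colon after the API key separates it from an empty password.
response = requests.get('https://example.com', proxies=proxies,
                        verify='/path/to/crawlera-ca.crt')
print(response.status_code)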

0
Answered
dyv 3 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

My spiders return inconsistent results. I am working on the website https://www.ifixit.com/. How can I get the same number of items every time I run the spider?

Answer

Hi dyv, 


Unfortunately this cannot be guaranteed. Even with the best plans offered and the strictest monitoring, some items go missing due to factors beyond our control, mainly backend issues or changes on the target domain's side.


Many of our clients use the DotScrapy Persistence add-on to have more robust control over the items extracted.


Best regards,


Pablo

0
Answered
Base79 3 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 12

Hi there,


This tool is new to me, but I keep running into a problem right from the start.

The New Sample button doesn't show up anywhere after I have created a new spider.

Because of this I cannot select any data.

0
Answered
Jazzity 3 months ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 months ago 6

Dear Forum,


I am trying to store scraped images to S3.

However, when launching the scraper I get the following error message:


ValueError: Missing scheme in request url: h



The message no longer appears when I deactivate the images addon, so it would seem that the problem is not actually the request url.


These are my spider settings:



Any help is greatly appreciated!


Regards,


Sebastian

Answer

Hi Sebastian, please check whether you are setting the image field as a list rather than a string in your spider. The images pipeline iterates over that field, so a plain string gets iterated character by character, which is why the error shows only the "h" of the URL. For example, if you are yielding:


yield {
    'image': response.css('example').extract_first(),
}

use

yield {
    'image': [response.css('example').extract_first()],
}

To learn more, please check the example provided in this excellent blog post:

https://blog.scrapinghub.com/2016/02/24/scrapy-tips-from-the-pros-february-2016-edition/


Best,


Pablo

0
Answered
robi9011235 3 months ago in Portia • updated 3 months ago 4

I'm trying to crawl this website: https://www.fxp.co.il/

But I always get the message: "Frames are not supported by portia"

But the thing is, it worked a few days ago with the same project.


Also, unfortunately I'm having a really bad experience with Portia. I keep getting different errors when creating new projects and loading existing projects, and it is constantly trying to reconnect to the Portia server. Your product is really buggy and this makes for a bad experience.

I wish there were a better alternative, but everything I have found is just not as easy, simple and fast.

Answer

Hey Robi,


About:

"I wish there would be better alternative but all I found is just not as easy, simple and fast"


That's the cost of making the tool more UX friendly:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000200446-troubleshooting-portia


Our team is working hard to fix all the bugs and misbehavior in Portia, but unfortunately not everything depends on Portia itself. If a site tightens its security, Portia may stop working as it used to; even a small change to the site can affect how Portia interacts with it.


If your project becomes more ambitious, my suggestion is to consider a more powerful crawler like Scrapy. Check this comparison table:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201026-portia-or-scrapy

If you are interested in learning Scrapy, please check these excellent videos provided by Valdir:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201028-learn-scrapy-video-tutorials-


If your project requires urgent attention, you can also consider hiring our experts; it can save you a lot of time and resources: https://scrapinghub.com/quote


Regardless of the above suggestions, thanks for your feedback; I will share it with our Portia team as well.


Best regards,


Pablo Vaz

Support team

0
Not a bug
ssobanana 3 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 months ago 7

Hello,

This morning, after logging in to Scrapinghub, I see that my account is missing 2 projects.

I'm on a free account; are projects deleted without warning after a while?

Thanks!

Answer

You are welcome :)

Yes, deletion is an irreversible action, so the projects cannot be brought back.

0
Started
terry.zeng 3 months ago in Scrapy Cloud • updated 3 months ago 2

Hi,


I am using the Scrapinghub API to fetch project summary data, but the response only tells me the status and count of finished jobs; for pending and running jobs there is never any data.


Can anyone help me find out the reason?


Cheers,

Terry
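For reference, one way to query per-state job counts is the legacy Jobs API; a rough sketch with requests (the project ID and API key are placeholders, and the response layout is assumed from that legacy API):

import requests

API_KEY = 'YOUR_APIKEY'      # placeholder
PROJECT_ID = '123456'        # placeholder

# Ask for jobs in each state separately instead of relying on the summary.
for state in ('pending', 'running', 'finished'):
    resp = requests.get(
        'https://app.scrapinghub.com/api/jobs/list.json',
        params={'project': PROJECT_ID, 'state': state},
        auth=(API_KEY, ''),
    )
    jobs = resp.json().get('jobs', [])
    print(state, len(jobs))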

0
Answered
Johan Hansson 3 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 months ago 1

I have a few modules with functions that I created myself to help with certain tasks in some spiders. When running a crawl locally on my computer it all works fine, but after uploading to Scrapy Cloud it doesn't work at all. I've put the modules in a modules/ directory and try to import them from a spider with: from testproject.modules.testmodule import TestModule


Are there any other settings I have to configure, apart from just importing the module like I normally would in Python 3?


The directory structure is like:


testproject/
    testproject/
        modules/
            testmodule.py
        spiders/

Answer

Hi Johan, to deploy custom code along with your spiders, please check this article:

https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/


It shows how to deploy additional files or modules with a normal shub deploy.
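In practice, import errors like this often come down to a missing __init__.py in the modules/ directory or a setup.py that doesn't include the package, since shub builds the project as a Python package. A minimal setup.py sketch (module names follow the structure above):

from setuptools import setup, find_packages

setup(
    name='testproject',
    version='1.0',
    packages=find_packages(),   # picks up testproject.modules, provided it has an __init__.py
    entry_points={'scrapy': ['settings = testproject.settings']},
)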


Best regards,

Pablo

0
Answered
Tristan 3 months ago in Portia • updated by Pablo Vaz (Support Engineer) 3 months ago 3

Hi

Which types of Regex does Portia/Scrapy support for include/exclude urls, when added in Portia interface?


Does it support \d \s [0-9] [^0-9] sorts of regex?


Is there maybe a library reference for this? I see a page on the query cleaner on your site, but nothing more general.


Also, I want to figure out how to make the query case-insensitive. Is there a setting, or should I just use

/[Ff]older/[Pp]age.html

for:

/Folder/Page.html

/folder/page.html
/folder/Page.html

Thanks


tristan

Answer

Hi Tristan!

For example, if you want to configure a URL pattern for:

https://www.kickstarter.com/projects/1865494715/apollo-7-worlds-most-compact-true-wireless-earphon/comments?cursor=14162300


you should use:


(https:\/\/www\.kickstarter\.com\/projects\/1865494715\/apollo-7-worlds-most-compact-true-wireless-earphon\/comments\?cursor=14162300)+
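As far as I know, Portia-generated spiders run on Scrapy, and the include/exclude patterns are standard Python re expressions, so \d, \s and character classes like [0-9] work, and case-insensitive matching can be done with the inline (?i) flag instead of spelling out both cases. A quick standalone sketch of the idea (not Portia-specific):

import re

# (?i) makes the whole pattern case-insensitive.
pattern = re.compile(r'(?i)/folder/page\.html')

for url in ('/Folder/Page.html', '/folder/page.html', '/folder/Page.html'):
    print(url, bool(pattern.search(url)))   # all three match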


Best,

Pablo