Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Rubhan 5 days ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 2 days ago 3

I am trying to deploy a Scrapy app on Scrapy Cloud, but after deploying, the spider is not available to run. After the deploy I get this status:
{"project": 169397, "version": "f79741f-master", "spiders": 0, "status": "ok"}
On running scrapy list locally, it shows the spider.

Answer

Thanks for the feedback!

Basic deployment does not need a requirements.txt unless you depend on libraries that are not available in Scrapy Cloud. You can find that information in the Deploying Dependencies section of the shub documentation: https://shub.readthedocs.io/en/latest/deploying.html#deploying-dependencies
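If you do need extra libraries, shub reads a scrapinghub.yml next to your scrapy.cfg. The exact keys are described in the docs linked above, but a typical layout (using the project ID from your status output; adjust to your own) looks roughly like this:

```yaml
# scrapinghub.yml (in the same directory as scrapy.cfg)
projects:
  default: 169397        # your Scrapy Cloud project ID
requirements:
  file: requirements.txt # extra libraries to install on deploy
```

Each line of requirements.txt pins one dependency (e.g. `beautifulsoup4==4.5.3`), and shub installs them when you run `shub deploy`.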

0
Answered
saurabh9 6 days ago in Portia • updated by adebar 4 days ago 4

Hi,

I am trying Portia on a few sites, but adding new annotations has been a problem. About 80% of the time I get the error "Resource 'api/projects/.........' not found". I tried different browsers, cleared the cache, used incognito mode, but nothing helped. Any clues?

Answer

Hi Saurabh.

Thanks for reporting this issue. Some sites are hard to crawl with Portia due to their complexity. We suggest trying a new project and creating the annotations again. If the problem persists, Portia may not work well with that site.


In that case, it is recommended to try with Scrapy:

https://doc.scrapy.org/en/latest/intro/tutorial.html

You can run your Scrapy spiders on our platform for free! And there's a vast community willing to help on Stack Overflow and GitHub.

Finally, you can always ask our sales team about our data on demand services. We can extract the data you need and deliver it to you in the most useful format. If interested, don't hesitate to contact us through:

https://scrapinghub.com/quote


Kind regards,

Pablo

0
Answered
Dave 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 6 days ago 1

Am I able to create a new column based on a regex extractor with Portia?

Answer

Hi Dave, using regex you can:


1. Configure URL patterns and use the Query Cleaner addon:

http://help.scrapinghub.com/portia/using-regular-expressions-and-query-cleaner-addon


2. Use regex for more complex actions, like crawling paginated listings:

https://portia.readthedocs.io/en/latest/examples.html#crawling-paginated-listings


3. Use regular expressions to extract a portion of a field's value.

For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The normal syntax, { "sku": "$field:url" }, will store the full URL in the sku field. If we want to extract only the item_no value, we can use a regex like this:

{ "sku": "$field:url,r'item_no=(\d+)'" }
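To illustrate what that expression captures, here is the same regex applied in plain Python (outside Portia):

```python
import re

url = "http://www.example.com/product.html?item_no=345"

# The capturing group (\d+) grabs only the digits after "item_no="
match = re.search(r"item_no=(\d+)", url)
sku = match.group(1) if match else None
print(sku)  # prints "345"
```

Portia applies the same pattern to the field's extracted value, keeping only the captured group.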

Not sure if the above suggestions help, but you can find more information in the Portia docs:
https://portia.readthedocs.io/en/latest/index.html


Kind regards,

Pablo

0
Answered
vicente.tronco 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 6 days ago 1

For instance, I am extracting product information from a website, but I would like to include a column with the name of the brand; all the elements of this column should have the value XXX (and it is not on the element's page).

Answer

Hi Vicente, what if you go to the page you want (the one with those products) using the Portia browser, and then set it as the start page?

I think that could be a simple solution.

Kind regards,

Pablo

0
Answered
Nayak 1 week ago in Crawlera • updated by Pablo Vaz (Support Engineer) 6 days ago 1

Hi,


We want to make web requests to a popular domain continuously; we created a background service that requests the domain using .NET HttpWebRequest.
We ran into the problem of IP bans, and then I came across Crawlera.
I saw the post "How to use Crawlera in C# .NET":
var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
myProxy.Credentials = new NetworkCredential("<API KEY>", "");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("some domain url");
request.Proxy = myProxy;
request.PreAuthenticate = true;

If I make web requests to a website repeatedly and an IP ban occurs, will Crawlera automatically handle the request with a different IP, or do we need to send the same request again?

If it handles it automatically, do we need to send any header information apart from the code listed above?

Regards,

Nayak


Answer

Hi Nayak, yes, Crawlera will retry automatically. After 5 attempts (by default), if the site is still banning IPs, Crawlera will return a ban status and continue with the next request.

Even though Crawlera should protect you against bans, sometimes it runs out of capacity and will return a 503 response. Because of this, we recommend you retry 503 responses up to 5 times. Consider using the x-crawlera-next-request-in header to retry more efficiently.
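As a sketch of that retry loop in Python (the header name comes from the advice above; treating its value as milliseconds is an assumption to verify against the Crawlera docs), you could wrap whatever request function you use like this:

```python
import time

def fetch_with_retry(get, url, max_retries=5):
    """Call get(url); on a 503, wait and retry up to max_retries times.

    Honors the x-crawlera-next-request-in response header (assumed to be
    in milliseconds) when Crawlera provides it, else waits 1 second.
    """
    resp = get(url)
    for _ in range(max_retries):
        if resp.status_code != 503:
            break
        delay_ms = resp.headers.get("x-crawlera-next-request-in")
        time.sleep(int(delay_ms) / 1000.0 if delay_ms else 1.0)
        resp = get(url)
    return resp
```

With the requests library, `get` could be `lambda u: requests.get(u, proxies={"http": "http://<API KEY>:@proxy.crawlera.com:8010"})`; the same idea translates directly to the .NET HttpWebRequest loop in your service.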

Kind regards,

Pablo

0
Answered
abbyinohio 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Help! I deployed a Portia project (https://portia.scrapinghub.com/#/projects/167699) to scrape Rotten Tomatoes. But when I run the job on Scrapy Cloud, it crawls hundreds of pages and scrapes only the first ten. I verified that none of my fields are required, and I can't find any error messages in the log file. Thank you!

Answer

Hi Abby,

If Portia scrapes the first pages successfully but then starts to fail, it could be a ban issue.

When you start to crawl, Portia crawls from a fixed IP, and the site can detect your requests and start banning you.
We suggest using Crawlera, our intelligent proxy rotator. It can help you crawl more efficiently.

https://scrapinghub.com/crawlera/


Also, if the site is complex to scrape, we recommend starting with Scrapy:

https://doc.scrapy.org/en/latest/intro/tutorial.html

Finally, you can always ask our sales team about our data on demand services. We can extract the data you need and deliver it to you in the most useful format.


I hope these suggestions are helpful.

Kind regards,


Pablo

0
Answered
vicente.tronco 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1
Answer

Hi Vicente, to delete any project please check this article:

http://help.scrapinghub.com/scrapy-cloud/deleting-projects

Kind regards,

Pablo

0
Answered
seschulz 1 week ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Hello,

I am new to the scraping universe and would like to ask which scraping method / product would suit my needs best: I would like to find out whether certain brands are being discussed at all on forums and social platforms, and if so, with which adjectives they are mentioned. Am I on the right track, trying to solve this via scraping? Thanks a lot for your help.


Seb

Answer

Hi Seschulz,

Your project seems very interesting, and I think there's no single simple answer to your inquiry.

Currently we have two options to offer: use Portia, or deploy your own Scrapy projects.


Portia is our open source visual scraper. It is made for scraping simple sites, learning the beautiful "art of crawling", and carrying out small and mid-size projects with almost no programming knowledge.

You can start by reading: http://help.scrapinghub.com/portia/using-portia-20-the-complete-beginners-guide

Scrapy is a powerful crawler developed and maintained by our founders. It requires more programming knowledge, but the results can be outstanding. The best part is that you can deploy your projects for free on our platform and automate them in a very nice and easy way. I suggest you start with the tutorial:

https://doc.scrapy.org/en/latest/intro/tutorial.html#scrapy-tutorial

If you need further assistance, you can always hire our experts or ask about our datasets service:

https://scrapinghub.com/quote

They can save you a huge amount of time and resources.


Kind regards,

Pablo

0
Answered
Herman 1 week ago in Portia • updated 1 week ago 4

(1)

[scrapy.core.scraper] Error downloading <GET https://www.openrice.com/en/hongkong/restaurants>:[<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]


3 ScrapyDeprecationWarning

[py.warnings] /src/slybot/slybot/slybot/closespider.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762

Answer

Hi Herman, the error reported seems to be a connection failure, according to our experts.

Please try to run it again.

As for the warning, it shouldn't cause any problems. Our developers will update the necessary libraries when required.

Kind regards,

Pablo

0
Not a bug
Firdaus AD 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

Hi all,

I have been using Portia successfully to extract data from my target website for two months.

But since last week, Portia does not extract the data I want.

Can anybody help me figure out what happened?

http://m.mudah.my/view?q=&ca=1_3_a&sa=&cg=1000&catname=VEHICLES&o=1&f=a&srch=1&so=1&ad_id=53011516

I want to extract the CALL | SMS data.

Answer

Hi Firdaus,


Have you tried other similar sites? If it works for other sites, please check:

http://help.scrapinghub.com/portia/troubleshooting-portia

Kind regards,

Pablo

0
Answered
brandonmp 2 weeks ago in Portia • updated 1 week ago 2

I've looked in the forums for a similar question, but the only relevant questions are a few years old.


I have a site with a paginated list of links I want to crawl & scrape.


The pagination links are of the form:


`<a href="javascript:void(0);" onclick="changePage(1);">2</a>`


The links I want to crawl and scrape are of the pattern:


`<div class="someClass" onclick="window.location='/detailsLite?acId=8545&listingId=1037524'"> // content </div>`


Can Portia handle either of these types of navigation?


I've tried activating Javascript for the crawler on all pages, as well as configuring URL patterns to follow all links, but Portia still indicates it can't find any links to crawl.


Answer

Hi Brandon,

You can find how to set paths for crawling on this article:

http://help.scrapinghub.com/portia/using-regular-expressions-and-query-cleaner-addon
and also please check:

http://help.scrapinghub.com/portia/how-do-you-extract-data-from-a-list-of-urls
I hope you find this helpful.

Kind regards,

Pablo

0
Not a bug
robi9011235 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 1

HI there. I always get this error when trying to run spider.

[twisted] Unhandled error in Deferred

Traceback (most recent call last):

File "/usr/local/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run

	    self.crawler_process.crawl(spname, **opts.spargs)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 163, in crawl
	    return self._crawl(crawler, *args, **kwargs)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 167, in _crawl
	    d = crawler.crawl(*args, **kwargs)
	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
	    return _inlineCallbacks(None, gen, Deferred())
	--- <exception caught here> ---
	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
	    result = g.send(result)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 90, in crawl
	    six.reraise(*exc_info)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 71, in crawl
	    self.spider = self._create_spider(*args, **kwargs)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 94, in _create_spider
	    return self.spidercls.from_crawler(self, *args, **kwargs)
	  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
	    spider = cls(*args, **kwargs)
	  File "/src/slybot/slybot/slybot/spidermanager.py", line 54, in __init__
	    **kwargs)
	  File "/src/slybot/slybot/slybot/spider.py", line 58, in __init__
	    settings, spec, item_schemas, all_extractors)
	  File "/src/slybot/slybot/slybot/spider.py", line 226, in _configure_plugins
	    self.logger)
	  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/annotations.py", line 89, in setup_bot
	    self.extractors.append(SlybotIBLExtractor(list(group)))
	  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/extractors.py", line 61, in __init__
	    for p, v in zip(parsed_templates, template_versions)
	  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/extractors.py", line 70, in build_extraction_tree
	    basic_extractors = ContainerExtractor.apply(template, basic_extractors)
	  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/container_extractors.py", line 65, in apply
	    extraction_tree = cls._build_extraction_tree(containers)
	  File "/src/slybot/slybot/slybot/plugins/scrapely_annotations/extraction/container_extractors.py", line 144, in _build_extraction_tree
	    parent = containers[parent_id]
	exceptions.KeyError: u'5e88-4205-901a#parent'
Answer

Hi Robi, please review the fields you are trying to scrape.

I replicated a similar spider for the site you are trying to scrape using Portia, and after 2 minutes more than 300 items were scraped successfully.

Kind regards,

Pablo

0
Answered
trustyao 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 week ago 2

I am in China now, and China uses the GMT+8 time zone.

Every time I log in to Scrapinghub, it shows GMT, which puzzles me.

So I want to change the timezone from GMT to GMT+8. How can I do that?

Thanks

Answer

Hi Trustyao,

Our platform displays all times in UTC, and at the moment this can't be changed. But it's a good suggestion.
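As a workaround, you can convert the UTC timestamps from exported job data to GMT+8 on your side; a minimal Python sketch:

```python
from datetime import datetime, timedelta, timezone

GMT8 = timezone(timedelta(hours=8))

def utc_to_gmt8(dt):
    """Interpret a naive datetime (as shown in the dashboard) as UTC
    and convert it to GMT+8 (China Standard Time)."""
    return dt.replace(tzinfo=timezone.utc).astimezone(GMT8)

print(utc_to_gmt8(datetime(2017, 3, 10, 12, 0)))  # 2017-03-10 20:00:00+08:00
```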

Kind regards,

Pablo

0
Not a bug
Håkan Waara 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 2

I've created a Portia spider for a site that is rendered with javascript and React. It works great in the UI and finds the right items. But when run, it returns 0 items. Any idea how I can debug this? Does Portia support "dynamic" pages?

Answer

Hi Hakan,

Could you share the site you are trying to scrape? Have you tried other similar sites? If it works for other sites, please check:

http://help.scrapinghub.com/portia/troubleshooting-portia


Kind regards,

Pablo

0
Started
Chris Fankhauser 2 weeks ago in Crawlera • updated 11 hours ago 3

I'm noticing some unusual sudden behavior from Crawlera lately...


First off, I log all outgoing requests/responses locally and build a dashboard using that log information so that I can see if something's gone horribly wrong on my end. I've noticed a couple of things which are disconcerting:


1) It looks like, as of March 3rd, `X-Crawlera-Version`, `X-Crawlera-Debug-Request-Time`, and `X-Crawlera-Debug-UA` are no longer being returned? This isn't a deal-breaker, but it makes me suspicious that something substantial has changed in regards to:


2) The number of requests seems to be dramatically underreported in the Crawlera dashboard as of, surprise, March 3rd. The number of "Failed" requests jumps up on the same day.


Here's my internal dashboard's graph of all requests/responses using Crawlera:


...and here's the graph in the Crawlera dashboard:



So... something seems to be clearly off here. Is anyone able to shine some light on the discrepancies I'm seeing? Thanks.

Answer
Pablo Vaz (Support Engineer) yesterday at 7:11 p.m.

Hi Chris,


Several fixes have been released regarding the issue you reported.

How is the performance now?


Kind regards,

Pablo Vaz

Support team