Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Laurent Ades 17 hours ago in Portia 0

Hi,

New to all this... very exciting!

To give it a try, I am scraping the Open-Mesh website to get all the products (fewer than 30 in total).

I have created a spider and defined a sample page whose pattern is followed by all product pages, as expected...

http://www.open-mesh.com/products/s48-48-port-poe-cloud-managed-switch.html

It works pretty well, except for the price, which is sometimes scraped and sometimes not... and I can't pin down a specific reason or a difference between one page and another. I have tried defining the fields with CSS or XPath, but it doesn't change anything...
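(For reference, a quick way to sanity-check a price selector outside Portia is to fetch one of the product pages and run the selector with parsel. This is only a minimal sketch; the selectors below are guesses at a price element, not the site's actual markup.)

import requests
from parsel import Selector

# Product page from the question; the selectors are illustrative guesses, not the real open-mesh.com markup.
url = "http://www.open-mesh.com/products/s48-48-port-poe-cloud-managed-switch.html"
sel = Selector(text=requests.get(url).text)
print(sel.css("span.price::text").get())                             # CSS attempt
print(sel.xpath("//span[contains(@class, 'price')]/text()").get())   # XPath attempt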

I have read other posts that sound somewhat like my issue - but not exactly - where extraction does not always come up as expected...

Is this a bug to be corrected in the coming version (as I have read), or am I doing something stupid?


Thanks

0
Answered
Adam 4 days ago in Portia • updated by Laurent Ades 18 hours ago 2

Hi guys,


My spider doesn't capture all of the fields that I've specified, even though it seems to work in the "Extracted Items" preview. I've tried different things and still no luck.


Some facts:


  • Data is available on the page load (it's not loaded with AJAX).
  • It's happening for all of the scraped pages.
  • I have 100% match on 3 out of 7 fields and none on the remaining 4.
  • I have tried setting up a new sample page from scratch using a new schema, but I still have the same issue.


There's nothing unusual in the log:


0: 2017-02-16 08:02:39 INFO Log opened.
1: 2017-02-16 08:02:39 INFO [scrapy.log] Scrapy 1.2.2 started
2: 2017-02-16 08:02:40 INFO [stderr] /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:7: HubstorageDeprecationWarning: python-hubstorage is deprecated, please use python-scrapinghub >= 1.9.0 instead (https://pypi.python.org/pypi/scrapinghub).
3: 2017-02-16 08:02:40 INFO [stderr] from hubstorage import ValueTooLarge
4: 2017-02-16 08:02:40 INFO [stderr] /usr/local/lib/python2.7/site-packages/scrapy/crawler.py:129: ScrapyDeprecationWarning: SPIDER_MANAGER_CLASS option is deprecated. Please use SPIDER_LOADER_CLASS.
5: 2017-02-16 08:02:40 INFO [stderr] self.spider_loader = _get_spider_loader(settings)
6: 2017-02-16 08:02:40 INFO [root] Slybot 0.13.0b30 Spider
7: 2017-02-16 08:02:40 INFO [stderr] /src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py:334: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
8: 2017-02-16 08:02:40 INFO [stderr] elems = [elem._root for elem in page.css(selector)]
9: 2017-02-16 08:02:40 INFO [scrapy.utils.log] Scrapy 1.2.2 started (bot: scrapybot)
10: 2017-02-16 08:02:40 INFO [scrapy.utils.log] Overridden settings: {'LOG_LEVEL': 'INFO', 'AUTOTHROTTLE_ENABLED': True, 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'MEMUSAGE_LIMIT_MB': 950, 'TELNETCONSOLE_HOST': '0.0.0.0', 'LOG_ENABLED': False, 'MEMUSAGE_ENABLED': True}
11: 2017-02-16 08:02:40 WARNING [py.warnings] /src/slybot/slybot/slybot/closespider.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
12: 2017-02-16 08:02:40 INFO [scrapy.log] HubStorage: writing items to https://storage.scrapinghub.com/items/156095/5/17
13: 2017-02-16 08:02:40 INFO [scrapy.middleware] Enabled extensions:
14: 2017-02-16 08:02:40 INFO [scrapy.middleware] Enabled downloader middlewares:
15: 2017-02-16 08:02:40 WARNING [py.warnings] /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:50: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
16: 2017-02-16 08:02:40 INFO [scrapy.log] HubStorage: writing pages to https://storage.scrapinghub.com/collections/156095/cs/Pages
17: 2017-02-16 08:02:41 INFO [scrapy.middleware] Enabled spider middlewares:
18: 2017-02-16 08:02:41 INFO [scrapy.middleware] Enabled item pipelines:
19: 2017-02-16 08:02:41 INFO [scrapy.core.engine] Spider opened
20: 2017-02-16 08:02:41 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
21: 2017-02-16 08:02:41 INFO TelnetConsole starting on 6023
22: 2017-02-16 08:02:51 WARNING [py.warnings] /src/slybot/slybot/slybot/plugins/scrapely_annotations/processors.py:226: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
23: 2017-02-16 08:02:51 WARNING [py.warnings] /src/slybot/slybot/slybot/plugins/scrapely_annotations/processors.py:213: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
24: 2017-02-16 08:03:42 INFO [scrapy.extensions.logstats] Crawled 149 pages (at 149 pages/min), scraped 91 items (at 91 items/min)
25: 2017-02-16 08:04:33 INFO [scrapy.crawler] Received SIGTERM, shutting down gracefully. Send again to force
26: 2017-02-16 08:04:33 INFO [scrapy.core.engine] Closing spider (shutdown)
27: 2017-02-16 08:04:41 INFO [scrapy.extensions.logstats] Crawled 188 pages (at 39 pages/min), scraped 126 items (at 35 items/min)
28: 2017-02-16 08:05:11 INFO [scrapy.statscollectors] Dumping Scrapy stats:
29: 2017-02-16 08:05:12 INFO [scrapy.core.engine] Spider closed (shutdown)
30: 2017-02-16 08:05:12 INFO (TCP Port 6023 Closed)
31: 2017-02-16 08:05:12 INFO Main loop terminated.


Any ideas why this happened?


Update:


I've tried setting up a brand new spider from scratch; the same problem occurred.


On the original spider, I've added some random fields that are always on the page (such as the login link or telephone number), but they don't seem to get picked up.


I've tried renaming one of the 3 fields that work to see if my changes are actually deployed successfully. This worked - I could see the renamed field in the scraped data - but the other 4 fields are still missing.
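For reference, one way to see exactly which fields are coming back empty is to pull the job's items and count how often each field appears. This is a rough sketch that assumes a recent python-scrapinghub client; the API key is a placeholder and the job ID is taken from the log above.

from collections import Counter
from scrapinghub import ScrapinghubClient  # assumes a recent python-scrapinghub release

client = ScrapinghubClient("YOUR_API_KEY")   # placeholder API key
job = client.get_job("156095/5/17")          # job ID from the log above
total = 0
coverage = Counter()
for item in job.items.iter():
    total += 1
    coverage.update(item.keys())
for field, count in coverage.most_common():
    print(field, "->", count, "of", total)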


Thanks,

Adam

Answer

Hi Adam, our Portia team is about to release a new version of Portia with fixes for most of the bugs reported by our users, including this one.

We will post an update in our community when this new release is out.

Kind regards,

Pablo

0
Completed
Tristan Bailey 6 days ago in Portia • updated by Pablo Vaz (Support Engineer) 5 days ago 1

I see that in Portia 2.0 there is an option for FeedUrl as a starting page list type - a text feed with one link per line.

Is it possible to pass this feed URL in the API when starting a new spider, like "start_urls"?
(It looks like maybe not?)


Second part: there is another post that mentions you can do this with an RSS feed or XML sitemap.

I cannot find the docs for this. It looks like it might work, but is there a spec for these formats, since they can vary?


Third part: is there any limit to the number of URLs in these bulk methods for seeding?


thanks


tristan


Answer

Hi Tristan,


For the first question, the feed refers to a URL, so if you can update the data served at that URL and schedule the spider in Scrapy Cloud, you could solve this. Perhaps there is a more efficient solution that our community members would like to share.
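As a rough sketch of that idea: Scrapy Cloud jobs can be scheduled over HTTP, and with the classic schedule endpoint extra parameters are passed to the spider as arguments. The behaviour below reflects my understanding of that API, and whether a Portia spider honours a start_urls argument passed this way is an assumption you would need to test on your own project.

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
resp = requests.post(
    "https://app.scrapinghub.com/api/schedule.json",
    auth=(API_KEY, ""),                              # API key as basic-auth username
    data={
        "project": "12345",                          # placeholder project ID
        "spider": "my-portia-spider",                # placeholder spider name
        "start_urls": "http://example.com/page-1",   # assumed spider argument
    },
)
print(resp.json())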

I think the second question is related to the first one, but feel free to elaborate a bit more on what you want to achieve so we can find a possible solution.


For the last question, according to our Portia developers there is no limit on the number of URLs, but keep in mind that pushing Portia to its limits can have uncomfortable consequences, as you may experience, due to memory usage and the capacity of our free storage.


Feel free to explore using Portia and share with us what you find. Your contributions are very helpful.


Kind regards,

Pablo

0
Answered
19dc60 6 days ago in Portia • updated by Pablo Vaz (Support Engineer) 5 days ago 1

I am getting the following when attempting to open my spider in Portia. Please advise why.

"Failed to load resource: the server responded with a status of 403 (Forbidden)"

Answer

Hi 19dc60!


This was possibly a network issue; it is working fine now. Feel free to ask if you need further assistance.


Kind regards,

Pablo

0
Answered
Tristan Bailey 2 weeks ago in Portia • updated by Nestor Toledo Koplin (Support Engineer) 1 week ago 3

Hi


I want to crawl a client's website, but Portia just stops at page 1. I can't see robots.txt blocking it, so how can I see what might be blocking this spider?



Time (UTC) Level Message
0: 2017-02-09 16:38:21 INFO Log opened.
1: 2017-02-09 16:38:21 INFO [scrapy.log] Scrapy 1.2.2 started
2: 2017-02-09 16:38:21 INFO [stderr] /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:7: HubstorageDeprecationWarning: python-hubstorage is deprecated, please use python-scrapinghub >= 1.9.0 instead (https://pypi.python.org/pypi/scrapinghub).
3: 2017-02-09 16:38:21 INFO [stderr] from hubstorage import ValueTooLarge
4: 2017-02-09 16:38:21 INFO [stderr] /usr/local/lib/python2.7/site-packages/scrapy/crawler.py:129: ScrapyDeprecationWarning: SPIDER_MANAGER_CLASS option is deprecated. Please use SPIDER_LOADER_CLASS.
5: 2017-02-09 16:38:21 INFO [stderr] self.spider_loader = _get_spider_loader(settings)
6: 2017-02-09 16:38:21 INFO [root] Slybot 0.13.0b30 Spider
7: 2017-02-09 16:38:22 INFO [stderr] /src/slybot/slybot/slybot/plugins/scrapely_annotations/builder.py:334: ScrapyDeprecationWarning: Attribute `_root` is deprecated, use `root` instead
8: 2017-02-09 16:38:22 INFO [stderr] elems = [elem._root for elem in page.css(selector)]
9: 2017-02-09 16:38:22 INFO [scrapy.utils.log] Scrapy 1.2.2 started (bot: scrapybot)
10: 2017-02-09 16:38:22 INFO [scrapy.utils.log] Overridden settings: {'CLOSESPIDER_ITEMCOUNT': 1000, 'LOG_LEVEL': 'INFO', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_ENABLED': False, 'MEMUSAGE_LIMIT_MB': 950, 'TELNETCONSOLE_HOST': '0.0.0.0', 'CLOSESPIDER_PAGECOUNT': 1000, 'AUTOTHROTTLE_ENABLED': True, 'MEMUSAGE_ENABLED': True}
11: 2017-02-09 16:38:22 WARNING [py.warnings] /src/slybot/slybot/slybot/closespider.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
12: 2017-02-09 16:38:22 INFO [scrapy.log] HubStorage: writing items to https://storage.scrapinghub.com/items/135909/1/12524
13: 2017-02-09 16:38:22 INFO [scrapy.middleware] Enabled extensions:
14: 2017-02-09 16:38:22 INFO [scrapy.middleware] Enabled downloader middlewares:
15: 2017-02-09 16:38:22 WARNING [py.warnings] /usr/local/lib/python2.7/site-packages/scrapy_pagestorage.py:50: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
16: 2017-02-09 16:38:22 INFO [scrapy.log] HubStorage: writing pages to https://storage.scrapinghub.com/collections/135909/cs/Pages
17: 2017-02-09 16:38:23 INFO [scrapy.middleware] Enabled spider middlewares:
18: 2017-02-09 16:38:23 INFO [scrapy.middleware] Enabled item pipelines:
19: 2017-02-09 16:38:23 INFO [scrapy.core.engine] Spider opened
20: 2017-02-09 16:38:24 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
21: 2017-02-09 16:38:24 INFO TelnetConsole starting on 6023
22: 2017-02-09 16:38:37 ERROR [scrapy.core.scraper] Error downloading : []
23: 2017-02-09 16:38:37 INFO [scrapy.core.engine] Closing spider (finished)
24: 2017-02-09 16:38:38 INFO [scrapy.statscollectors] Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
'downloader/request_bytes': 699,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 9, 16, 38, 37, 649231),
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 2,
'memusage/max': 76365824,
'memusage/startup': 76365824,
'scheduler/dequeued': 3,
'scheduler/dequeued/disk': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/disk': 3,
'start_time': datetime.datetime(2017, 2, 9, 16, 38, 24, 453847)}
25: 2017-02-09 16:38:39 INFO [scrapy.core.engine] Spider closed (finished)
26: 2017-02-09 16:38:39 INFO Main loop terminated.
Answer

Hi Tristan,


I looked at line 22 (2017-02-09 16:38:37 ERROR [scrapy.core.scraper] Error downloading : []) in your project.

Try changing your start URL to https://www... (not posting the complete URL here because I assume you removed it from the logs in this post on purpose).
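A quick way to check that suggestion outside Portia is to request both variants of the start URL and see which one actually responds, and where any redirects end up. Minimal sketch with placeholder domains:

import requests

# Placeholder domains -- substitute the client's real site.
for url in ("https://example.com/", "https://www.example.com/"):
    try:
        resp = requests.get(url, timeout=10)
        print(url, "->", resp.status_code, resp.url)  # resp.url shows where redirects land
    except requests.RequestException as exc:
        print(url, "failed:", exc)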

0
Answered
Andreas Dreyer Hysing 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 5 days ago 2

I have finally managed to get a full application with end users based on Portia. It is natural to have a separate scrapinghub.com project for the deployed application (as a production environment). I wish to configure, test, and develop Portia for new websites in one project, and move the Portia configuration to a separate project when the spiders are tested and ready. This will enable me to have different settings per project and save the data to separate buckets in Amazon S3.

There used to be a button for moving Portia spiders between projects in Portia 1.0. In the new UI I cannot find any such thing. Creating an identical spider configuration every time I want to move data is a cumbersome and error-prone extra step.

Please consider adding such a feature and following up on its status.

Answer

The feature to copy spiders between projects is currently in QA testing and will be released in an upcoming patch.

0
Started
Jay 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I love Portia, but for about a week all links in the UI have shown as non-follow (marked red). This happens even if the option "follow all in-domain links" is selected, and also in previous projects where the links were previously marked green. The scrape works just fine, but it makes it really hard to make sure Portia follows the right links before running the whole spider.

Answer

Hi Jay, thanks for sharing your valuable feedback. Fixes for minor issues like this are in the QA stage and will be released soon.

Best regards!

0
Answered
Andreas Dreyer Hysing 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

Dear Scrapinghub Support


We have converted all our spiders from previous versions to Portia 2.

We are very happy with how easy it is to set up and configure new spiders with the new interface. Great job!


After working with the system for some time we have noticed that the periodic jobs stopped working. We suspect that any configuration change in Portia triggers the periodic job to lose its relation to the spider. We observe that most of the periodic jobs have no value in the "Spiders" field, as shown in the attached screenshot.


What can we do to detect and prevent periodic jobs from stopping?


This is a minor issue for us as long as our scrapinghub.com setup is under active development in my organisation.

Answer

Hi Andreas, glad you find the new interface easy to use. Since the release of Portia 2.0 and the discontinuation of Portia 1.0 last month, we have released some bug fixes, and these could have impacted the performance of your jobs as well as the settings you made.

We will be releasing new fixes very soon and will forward this inquiry to our developers for further testing.

Thanks for your feedback!

0
Answered
shivam6294 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I want to crawl data from a JSP website that contains JavaScript links, something like: <a href="javascript:login_as_guest()"><image></a>. I know this is possible using a tool like PhantomJS or Selenium, but is it possible using Portia? Can I define a sequence of JS links to be clicked from the starting page in order to get to the page that contains the data to be extracted?

Answer

Hi Shivam, you can enable JS in Portia projects and define which links to follow using regular expressions.

This can be done by clicking the config wheel next to the spider name in Portia.

You can also use CSS selectors to extract some data, and finally, if you know the URLs that the sequence of JS links leads to, you can use a rule for the link follower, either by generating URLs or with follow patterns:
http://help.scrapinghub.com/portia/how-do-you-extract-data-from-a-list-of-urls
http://help.scrapinghub.com/portia/using-regular-expressions-and-query-cleaner-addon
Hope this information helps you crawl successfully,

Kind regards!

Pablo

0
Answered
norman 4 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

Hey folks,


I'm trying to extract some information from Pinterest, more specifically from ads.pinterest.com. Here's what I've done so far:

* set up the Portia project

* in the spider properties I enabled login with the following URL: https://www.pinterest.com/login

* the start URL is set to one specific page under https://ads.pinterest.com/xyz/traffic

* created the sample


I can't get the login part to work, though. I tried with JS enabled and disabled. In the job, two requests are made: both target the login URL, and the second is triggered by the first. The second one returns a 403, so I assume Pinterest doesn't like it?


Any help would be greatly appreciated. I'm amazed by what you folks have done with Scrapy! I was using it a couple of years ago, and Portia seems like a very nice addition to the toolset.

Answer

Hi Norman, I've checked your settings and they seem fine to me.

I think Pinterest is a hard site to crawl due to its complexity and popularity (which make it more sophisticated in terms of security).
Try first with another site on which you have to enable login; if you can successfully extract items there, it is likely a problem of Pinterest blocking the crawler. If you can't extract any items on different domains, it could be a bug in Portia and you can report it.
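If it helps, a plain Scrapy spider with FormRequest.from_response is a simple way to test login handling on an easier site before blaming Pinterest. This sketch targets the quotes.toscrape.com login demo (which accepts any credentials), not Pinterest, and the field names belong to that demo.

import scrapy

class LoginTestSpider(scrapy.Spider):
    # Generic login test against the quotes.toscrape.com demo, not Pinterest.
    name = "login_test"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # from_response picks up hidden form fields (e.g. the CSRF token) automatically
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "test", "password": "test"},
            callback=self.after_login,
        )

    def after_login(self, response):
        logged_in = b"Logout" in response.body  # the demo shows a Logout link after login
        self.logger.info("Login worked: %s", logged_in)
        yield {"logged_in": logged_in}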
Also, I want to thank you for your nice feedback regarding Scrapy! Comments like that give us the strength to move forward and keep improving our services.

Kind regards.

Pablo

Answer

Hi Matt, our team solved those issues and everything should be fine now.

Thanks for your feedback!

0
Answered
Tristan Bailey 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

What is the best way to deal with robots.txt and crawl blocking?


This is for a site that wants to approve crawling.


Just do something like this:
-- robots.txt

User-agent: Scrapy

Disallow:


Or is something more needed?

Answer

Hey Tristan!

According to our more experienced support agents, you could check this article posted on our blog:
https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/

It gives some ideas and tips on the content of those types of files and how to handle them (or not).
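On the Scrapy side, the settings most relevant to polite crawling are the standard ones below. This is a minimal settings.py sketch with illustrative values, not recommendations taken from the article.

# settings.py fragment -- illustrative values only
ROBOTSTXT_OBEY = True                                 # respect robots.txt rules
USER_AGENT = "mybot (+http://example.com/bot-info)"   # identify your crawler; placeholder URL
DOWNLOAD_DELAY = 1.0                                  # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2                    # keep per-domain concurrency low
AUTOTHROTTLE_ENABLED = True                           # back off automatically when the site slows down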

Best regards!

0
Answered
Andreas Dreyer Hysing 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

After we started using version 2 of Portia, we have been experiencing unwanted deduplication of similar items in our crawls. Looking through the logs of these crawls reveals that the items do indeed have different values for at least one field. As we see it, these items are not duplicates and should not be discarded.


As a note: all fields in the item are configured with the Vary option enabled and both Required options disabled.



Crawl logs were read on the web interface from https://app.scrapinghub.com/p/110257/36/4/log


Answer

Hi Andreas!

Our experts suggest disabling the Vary option; this should improve your crawling for this particular case. All the fields in the data format used by that spider have vary = True set, so they are ignored when checking for duplicates, which makes items that differ only in those fields look like duplicates.
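To illustrate why that happens, here is a conceptual sketch (not Slybot's actual code): fields flagged as vary are left out of the key used to detect duplicates, so two items that differ only in a vary field collapse into one.

# Conceptual sketch only -- not Slybot's real implementation.
def dedup_key(item, vary_fields):
    # Build the duplicate-detection key from the non-vary fields only.
    return tuple(sorted((k, v) for k, v in item.items() if k not in vary_fields))

seen = set()
items = [
    {"title": "Widget", "url": "http://example.com/a"},
    {"title": "Widget", "url": "http://example.com/b"},
]
for item in items:
    key = dedup_key(item, vary_fields={"url"})
    if key in seen:
        print("dropped as duplicate:", item)  # differs only in the vary field
    else:
        seen.add(key)
        print("kept:", item)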

Let me know if this was helpful.

Kind regards,

Pablo

0
Answered
brandonmp 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 3

Searched for this topic but no luck, apologies if this is a duplicate


I'm scraping a few pages where I need to extract a few non-visible pieces of data. Specifically, the `src` attributes on images, some `data-*` attributes on misc. `html` tags, and some raw text from the content of a few `<script>` tags.


Is this possible to do in Portia? I haven't been able to figure it out on my own.


If not, is it possible to augment a Portia scraper with custom Python? Or does a job have to be either all Scrapy/Python or all Portia?

Answer

Yes, Brandon! You can add an extractor to the annotation. In the same options panel where you configured the CSS selector, you can add an extractor that will process the text with your predefined regex.
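For comparison, the same non-visible data is straightforward to pull with plain parsel (the selector library Scrapy uses) if you ever need to drop down from Portia. The markup, attribute names, and regex below are made up for illustration.

from parsel import Selector

html = '''
<img src="/img/product.jpg" data-sku="ABC-123">
<script>var productData = {"price": 19.99};</script>
'''
sel = Selector(text=html)
print(sel.css("img::attr(src)").get())       # src attribute of an image
print(sel.css("img::attr(data-sku)").get())  # a data-* attribute
# Raw text inside a <script> tag, then a regex over it
print(sel.xpath("//script/text()").re_first(r'"price":\s*([\d.]+)'))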

+1
Fixed
ched 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 5

I'm having massive trouble with Portia today: almost every action triggers a backend error (red notifications). Interacting with a website doesn't work, and neither does deleting data formats.


It appears as if you've switched my account from the old Portia to Portia 2.0, because yesterday the interface was totally different.


I also have a suggestion for an improvement: Your error messages are not very helpful :(.


Can anyone help me out or let me know what else I should post in order to resolve these problems?

Answer

Thanks for your feedback, Ched! I will forward your suggestion to our Portia team; what you propose makes a lot of sense.
Best regards!