Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
Inva 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

Trying to scrape the buylist information from this website's buy list http://www.coolstuffinc.com/main_buylist_display.php


Having difficulty figuring out how to even begin with it. It all happens in that URL; there are multiple steps in the process to get to a particular game's buylist, with multiple sets per game.

Answer

Inva,


Please work through the material provided in our Learn center and then adapt your spider to your needs:

https://learn.scrapinghub.com/portia/

We don't provide case-by-case assistance here, but we can offer our experts' assistance. If you're interested, let us know by providing more information through:

https://scrapinghub.com/quote

Once we know more, our engineers will gladly check whether it's possible to extract data from that site and propose a solution tailored to your needs.


Best,


Pablo

0
Answered
steven.sank 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Looking for a way to set up a periodic job run.

Answer

Hi Steven,


I think you can use our API. Please take a moment to explore: https://doc.scrapinghub.com/scrapy-cloud.html
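
For example, a periodic run can be triggered from any scheduler (cron, for instance) by calling the Jobs API. A rough sketch; the run.json endpoint and its parameters should be double-checked against the docs above, and the API key, project ID and spider name are placeholders:

import requests

APIKEY = 'YOUR_API_KEY'      # placeholder: your Scrapinghub API key
PROJECT_ID = '123456'        # placeholder: numeric project id
SPIDER_NAME = 'myspider'     # placeholder: spider name within the project

# Start one job; run this script from cron (or any other scheduler) to get periodic runs.
response = requests.post(
    'https://app.scrapinghub.com/api/run.json',
    data={'project': PROJECT_ID, 'spider': SPIDER_NAME},
    auth=(APIKEY, ''),       # the API key is sent as the username, empty password
)
response.raise_for_status()
print(response.json())       # the response includes the id of the new job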

If the API alone doesn't cover it, perhaps you can run your own .py scripts on Scrapy Cloud; check this blog post from Elías:

https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/ 


I hope these suggestions were helpful,


Best,


Pablo


0
Answered
mark80 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

http://www.ancebergamo.it/elencoimprese.asp?lettera=C


I need to extract the company name and email info for every company, paginated in alphabetical order.

I've set up URL generation, specifying every letter of the alphabet, but I only get 3 items after thousands of requests. It's strangely slow.

Can you take a look?


Pablo, I've tried to add you to the project. Can you see my invitation?

Answer

Hi Mark,


As you can see, every page is like:

http://www.ancebergamo.it/elencoimprese.asp?lettera=1
http://www.ancebergamo.it/elencoimprese.asp?lettera=A
http://www.ancebergamo.it/elencoimprese.asp?lettera=B

and so on...
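
If it helps, the full start-URL list for the "list of URLs" approach below can be generated in a few lines of Python (a quick sketch; adjust the letters to whatever the site actually uses):

import string

# Print one URL per letter to paste into Portia's start URL list.
base = 'http://www.ancebergamo.it/elencoimprese.asp?lettera={}'
for letter in ['1'] + list(string.ascii_uppercase):   # '1' plus A-Z, as in the examples above
    print(base.format(letter))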


Please try to use pagination or URL list generation as seen on:

1. Extract data from a List of URLs

2. Handle pagination in Portia


I've noticed you have "Follow all domain links" turned on in the pagination settings, at least for link_name_ink_extraction.


I hope you find this helpful.


Best regards!


Pablo

0
Answered
Вася Местный 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I need to run the same spider over and over again and write (append) the results to the same CSV or Google Sheets doc. Is it possible here? How do I proceed?

Answer

I think you can use our API. Please take a moment to explore: https://doc.scrapinghub.com/scrapy-cloud.html

If that's not possible with the API alone, perhaps you can run your own .py scripts on Scrapy Cloud; check this blog post from Elías:

https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/ 
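
As an illustration of the API route, here is a rough sketch that downloads a finished job's items and appends them to a local CSV. The storage endpoint, its parameters and the field names are assumptions to verify against the docs above; the API key and job id are placeholders:

import csv
import requests

APIKEY = 'YOUR_API_KEY'    # placeholder
JOB_ID = '123456/1/7'      # placeholder: project/spider/job

# Fetch all items of one finished job as JSON.
items = requests.get(
    'https://storage.scrapinghub.com/items/' + JOB_ID,
    params={'format': 'json'},
    auth=(APIKEY, ''),
).json()

# Append them to an existing CSV; adjust the columns to your own item fields.
with open('results.csv', 'a') as f:
    writer = csv.writer(f)
    for item in items:
        writer.writerow([item.get('name'), item.get('url')])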

I hope these suggestions were helpful,

Best,

Pablo

0
Completed
Вася Местный 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Hi there. I need to rename the items while exporting. How do I do this? The only solution I can see is to change the Items everywhere in the project, but that is not an easy task and obviously not pythonic. Any other solution, please?

Answer

Hi, please check the other suggestion I made about using Python scripts or our API.
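
Along the lines of the Python scripts suggestion, one lightweight option is to rename the columns while post-processing the exported CSV instead of changing the items everywhere. A rough sketch; the file names and field mapping are just placeholders:

import csv

# Hypothetical mapping from the spider's field names to the names you want in the export.
RENAME = {'old_field': 'new_field'}

with open('items.csv') as src, open('items_renamed.csv', 'w') as dst:
    reader = csv.DictReader(src)
    fieldnames = reader.fieldnames
    writer = csv.writer(dst)
    writer.writerow([RENAME.get(f, f) for f in fieldnames])   # renamed header row
    for row in reader:
        writer.writerow([row[f] for f in fieldnames])          # data rows unchanged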


Best,


Pablo

0
Not a bug
smi 4 weeks ago in Portia • updated 3 weeks ago 2

Hi!


I'm using Portia to get data from a list of search results (businesses).


In that list are landline numbers, mobile numbers and so on.


Using .result-container > .row > .col2 > .icon-mobile > a, it should return all numbers that have a mobile icon in front of them (when searching for mobile numbers), and in the GUI it works as it should.

But when the crawler runs and finds an item in the list that does not contain a mobile number, it picks up the mobile numbers from all the other items on the page.


How can I prevent that?


Thanks! 

Answer

Hey smi, 


Try playing a bit with the CSS or XPath selectors used to fetch that information.



Once you make the annotation to extract items, just click on the wheel next to the annotation type to configure this.
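
For instance, scoping the query to each result container (instead of the whole page) keeps an item from borrowing numbers from its neighbours. A minimal sketch using parsel (the selector library Scrapy uses); the HTML is made up and only mirrors the classes you mentioned:

from parsel import Selector

# Made-up HTML: the first result has a mobile number, the second only a landline.
html = '''
<div class="result-container"><div class="row"><div class="col2">
  <div class="icon-mobile"><a>0170 1234567</a></div>
</div></div></div>
<div class="result-container"><div class="row"><div class="col2">
  <div class="icon-phone"><a>030 7654321</a></div>
</div></div></div>
'''

sel = Selector(text=html)
for result in sel.css('.result-container'):
    # Relative query: it only looks inside this result, so a listing without a
    # mobile icon yields None instead of picking up another listing's number.
    mobile = result.css('.row > .col2 > .icon-mobile > a::text').extract_first()
    print(mobile)   # '0170 1234567' for the first result, None for the second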


I hope this information helps.


Best regards,


Pablo Vaz

0
Answered
devin 4 weeks ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Does anyone have any experience using nightmareJS with crawlera? I am having trouble specifying the proxy server using just the electron switches.

Answer

Hey Devin,


Our team is actively working on providing better Crawlera integration with different languages and browsers. At the moment, the closest approach we can offer is in our KB, for some spooky cousins of NightmareJS:

Crawlera with CasperJS, PhantomJS...


Please search our forum for similar inquiries about NightmareJS, and also consider asking on our Stack Overflow - Crawlera channel. Many of our best developers contribute there actively.


Best regards!


Pablo

0
Answered
ks446 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I have a Scrapy script running on Scrapinghub. The scraper takes one argument, a CSV file where the URLs are stored. The script runs without error, but the problem is that it isn't scraping all the items from the URLs. I have no idea why this is happening.

Answer

Hey ks446,


This could happen for many reasons. To rule out any issue with your deployment, run your script locally and check whether it extracts all items correctly.


If it works fine locally, check how long the spider takes to run, and look in the script for infinite loops or anything similar that could stretch the run time until the job is cancelled because no new items are being extracted.


Finally, consider that the site itself could be banning your spider. The only solution in that case is to use our proxy rotator, Crawlera, to make requests from different IPs. If you're interested in knowing more, please check:

What is Crawlera?


Best regards!


Pablo

0
Answered
lucse11 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

I'm having a problem following the pagination of this website: http://gamesurf.tiscali.it/ps4/recensioni.html

My spider part of code :

for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next=='»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)


If I run my spider in the terminal it works on all pages of the website, but when I deploy it to Scrapinghub and run it from the button in the dashboard, the spider scrapes only the first page of the website.

Between log messages there is a warning: [py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.


I have checked that the problem is not caused by robots.txt.

How can I fix this?

Thanks


Answer

Hey Lucse,


Please check this post, which seems related to your issue.

https://stackoverflow.com/questions/18193305/python-unicode-equal-comparison-failed


Basically, your program seems to be comparing unicode objects with str objects, and the contents of the str object are not valid UTF-8. I'm not fully convinced it will work, but did you try comparing against a unicode literal, something like:


if next == u'»':


or similar?
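
A slightly fuller sketch of the same idea, assuming Python 2 (which the warning suggests) and reusing the selectors from your snippet; the spider name is just a placeholder:

# -*- coding: utf-8 -*-
import scrapy

# Comparing against a unicode literal keeps both sides of the comparison unicode,
# which is what the UnicodeWarning complains about.
NEXT_ARROW = u'\u00bb'   # the '»' character


class RecensioniSpider(scrapy.Spider):
    name = 'recensioni_sketch'   # placeholder
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    def parse(self, response):
        # ... item extraction goes here ...
        for pag in response.css('li.square-nav'):
            label = pag.css('li.square-nav > a > span::text').extract_first()
            if label == NEXT_ARROW:
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    yield scrapy.Request(response.urljoin(next_page_url),
                                         callback=self.parse)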


Best,


Pablo

0
Answered
signedup88 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Hey guys, does anybody have experience connecting to Crawlera using WebDriver / Selenium?


I am running a project on a WIN PC

Answer

Hey Signedup88,


Since it’s not so trivial to set up proxy authentication in Selenium, a popular option is to employ Polipo as a proxy. Update the Polipo configuration file /etc/polipo/config to include your Crawlera credentials (if the file is not present, copy and rename config.sample from the Polipo source folder):

parentProxy = "proxy.crawlera.com:8010"
parentAuthCredentials = "<API key>:"

For password safety reasons this content is displayed as (hidden) in the Polipo web interface manager. The next step is to specify Polipo proxy details in the Selenium automation script, e.g. for Python and Firefox:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.proxy import Proxy, ProxyType

# Route all browser traffic through the local Polipo instance, which in turn
# forwards it to Crawlera using the credentials configured above.
polipo_proxy = "localhost:8123"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': polipo_proxy,
    'ftpProxy': polipo_proxy,
    'sslProxy': polipo_proxy,
    'noProxy': ''
})

driver = webdriver.Firefox(proxy=proxy)
driver.get("http://scrapinghub.com")
assert "Scrapinghub" in driver.title

# Click an element to confirm the proxied session is fully interactive.
elem = driver.find_element_by_class_name("portia")
actions = ActionChains(driver)
actions.click(on_element=elem)
actions.perform()
print("Clicked on Portia!")
driver.close()

Best regards,


Pablo

0
Answered
mark80 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 5

http://vetrina.assimpredilance.it/results.aspx?rs=

I need to go through 113 pages, scraping the address and email for every name.

How can I set up multipage scraping?


Answer

Hey Mark, I hope you find the solution by checking these two approaches:


1. Extract data from a List of URLs


2. Handle pagination in Portia


If it's not possible with those two approaches, perhaps Portia is not the right solution for your needs, as mouch1 suggested.


mouch1, thanks for your suggestions.


Best regards!


Pablo

0
Answered
mouch 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 4

Hi there,



I have a spider that is running perfectly well without proxy - on ScrapingHub also.

I then implemented a rotating proxy and bought a few proxies for my own use. Locally, it runs like a charm.


So, I decided to move this to ScrapingHub but the spider is not working anymore. It actually never ends.


See below my logs

2017-05-28 14:07:27 INFO [scrapy.core.engine] Spider opened
2017-05-28 14:07:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:07:27 INFO TelnetConsole starting on 6023
2017-05-28 14:07:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:07:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:08:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:08:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:08:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:09:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:09:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:09:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:10:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:10:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:10:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:11:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:11:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:11:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)



I'm still wondering what is going wrong here. Why is the rotating proxy extension doing... nothing?

Could it be that ScrapingHub is actually blocking the use of proxy extensions to ensure we use Crawlera instead? Still, it is hard for me to understand how it could technically detect this :)


Thank you for your feedback on this,

Cyril

Answer

Hey Cyril,


Nice post! Your contributions help improve this forum and we encourage you to keep it up. Well done!


About your last question: indeed, your own proxies won't be used. We use our own proxies with Scrapy Cloud projects (Scrapy or Portia), and of course when Crawlera is enabled (all requests are made from a pool of proxies).


Best regards,


Pablo

0
Planned
v2065925 1 month ago in Portia • updated by Thriveni Patil (Support Engineer) 4 weeks ago 6

Starting from yesterday, in URL Generation, the GENERATION LIST contains errors.


Before the bug

I entered a fixed part x and the list 1 2 3

The GENERATION LIST consisted of correct URLs:

x1

x2

x3


But since yesterday the URLs have become incorrect and contain an extra %20:

x%201

x%202

x%203


How can I fix this?


Answer

The URLs rendering with %20 in the browser is a bug, and the Portia team will be working on a fix.


But when the spider runs, the URLs in the requests are rendered correctly. They seem to be blocked by the site with a 403 HTTP code. You may need to use a proxy rotator like Crawlera to get around the bans.

0
Answered
tofunao1 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Now I need to write a new spider. The spider needs to:

  1. Download a zip file from a website, about 3GB per file.
  2. Unzip the downloaded file to get many XML files.
  3. Parse the XML and put the information I need into one item or MySQL tables.

But some questions come up with the steps above:

  1. Where can I put the downloaded files? Amazon S3?
  2. How can I unzip the file if I put it in S3?
  3. If a file in S3 is very big, such as 3GB, how can I open it from Scrapinghub?
  4. Can I use FTP instead of Amazon S3 if the file is 3GB?

Thank you.

Answer

Hi Tofunao, we don't provide coding assistance through this forum.


I suggest visiting our Reddit - Scrapy channel:

https://www.reddit.com/r/scrapy/

and posting any inquiries related to the spider there.


Besides these suggestions, you can find more information on managing your items and fetching data in our Scrapy Cloud API docs:

https://doc.scrapinghub.com/scrapy-cloud.html

Regards,


Pablo

0
Answered
Chris Fankhauser 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I've recently noticed in the Crawlera documentation that the Fetch API is now deprecated and going away "soon". Considering that we use it almost exclusively, I'm a little concerned about timing re: moving away from it.


Is it possible to get a more definitive timeline of the retirement of the Fetch API?  At some organizations "soon" can mean "next week" and at others it can mean "2019".  Thanks!

Answer

Hi Chris, I've escalated your question, but there is no timeline for the retirement yet. Feel free to ask again anytime.


As you know, even when planned, those kinds of things sometimes get delayed due to our developers' other projects.


Best regards,


Pablo