Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
1044204605 23 minutes ago in Crawlera 0

I bought the Crawlera service just now and successfully ran my Scrapy project using the Crawlera proxy, but the usage of my Crawlera service is always 0% and there's nothing in the histogram on the website. Why?

0
Answered
han 2 days ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 11 hours ago 1

Hi, I know that there is a 24-hour limit for free accounts.

If I have a large scraping job that will definitely take more than 24 hours to run, is there a way I can continue from where the scraping has stopped?

Answer

Hey Han, yes you can.

When you upgrade your Scrapy units, you can crawl for as long as you need.


Here are more details:

Scrapy Units


Best regards,

Pablo 

0
Answered
Sergej 5 days ago in Crawlera • updated 4 hours ago 2

Hello,


I am experiencing issues using Crawlera on HTTPS sites with Scrapy and PhantomJS. My config is:


        service_args = [
            '--proxy=proxy.crawlera.com:8010',
            '--proxy-type=http',
            '--proxy-auth=XXXXXXXXXXXXXXXXX:',
            '--webdriver-logfile=phantom.log',
            '--webdriver-loglevel=DEBUG',
            '--ssl-protocol=any',
            '--ssl-client-certificate-file=crawlera-ca.crt',
            '--ignore-ssl-errors=true',
            ]


Though I always get this error and the result is empty:

"errorCode":99,"errorString":"Cannot provide a certificate with no key, "


I have been stuck on this problem for hours. Any help is much appreciated.


Thank you!

Sergej

Answer

Hi Sergej,


This is a common issue and our team is working to provide a prompt solution in upcoming releases of Crawlera. An upgrade is planned for the next sprint, which should fix this problem.
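In the meantime, one thing that may be worth checking on your side: the --ssl-client-certificate-file option expects a client certificate paired with a private key (which matches the "Cannot provide a certificate with no key" message), while crawlera-ca.crt is a CA certificate. Loading it through --ssl-certificates-path instead may avoid the error. A rough, untested sketch of the adjusted arguments, where certs/ is a placeholder directory containing crawlera-ca.crt:

service_args = [
    '--proxy=proxy.crawlera.com:8010',
    '--proxy-type=http',
    '--proxy-auth=XXXXXXXXXXXXXXXXX:',
    '--ssl-protocol=any',
    # Load the Crawlera CA certificate as a trusted CA instead of passing it
    # as a client certificate; certs/ is a placeholder directory holding it.
    '--ssl-certificates-path=certs/',
    '--ignore-ssl-errors=true',
]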


Best regards,

Pablo

0
Answered
oday_merhi 6 days ago in Portia • updated by Pablo Vaz (Support Engineer) 11 hours ago 1

Hello guys,


So I'm trying to scrape articles with certain text in them. Can I teach my spider to scrape specific articles? For example, if I was on a food site that had articles and I only wanted recipes with banana, is there a way for me to set up the spider to only scrape the articles with the keyword "banana"?


Thank you for your help!

Answer

Hey Oday, yes, I think it is possible to set up some kind of extractor using regular expressions.


If not possible with Portia, you can try with Scrapy:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201028-learn-scrapy-video-tutorials-
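As a rough illustration of the Scrapy route, a spider can follow the article links and only keep pages whose text contains the keyword. The start URL and CSS selectors below are placeholders, not taken from a real site:

import scrapy

class BananaSpider(scrapy.Spider):
    # Sketch only: start_urls and the selectors are placeholders.
    name = "banana_articles"
    start_urls = ["http://example.com/articles"]

    def parse(self, response):
        # Follow every article link found on the listing page.
        for href in response.css("a.article::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        text = " ".join(response.css("body ::text").extract()).lower()
        # Only yield articles that mention the keyword.
        if "banana" in text:
            yield {
                "url": response.url,
                "title": response.css("h1::text").extract_first(),
            }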


Best regards,


Pablo

0
MikeLambert 6 days ago in Splash 0

It seems Distil is blocking my use of Splash now. I'm not sure if the site I'm accessing just started using Distil, or if Distil just started detecting Splash.


Is there any information as to how Distil is detecting Splash? I've seen examples for Selenium needing to delete certain properties from document.window, but am unclear as to exactly how Splash is automating things, and what it might stick in the browser that makes it detectable.


I did find https://www.quora.com/How-do-you-crawl-Crunchbase-and-bypass-the-bot-protection , where Pablo Hoffman (Scrapinghub co-founder) suggests contacting Scrapinghub for help with crawling. I'm not sure what the cost of a full consulting gig to do this would be (any estimates?)


I'm already using scrapinghub/splash and pay for an instance, but if it's impossible to get through Distil, I'll just have to turn off my Splash JS instance and remove the feature, so any pointers (public, or private to mlambert@gmail.com) would be appreciated!

0
Under review
joaquin 7 days ago in Crawlera • updated 2 days ago 13

I am running some crawlers on a site, all of them using Crawlera, and I am getting several 429 error statuses (which mean "too many requests"), so Crawlera doesn't seem to be adjusting its throttling to accommodate these errors.


Does your throttling algorithm consider 429 status codes?


I am using Scrapy plus the Crawlera middleware, by the way.
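(For reference, the stock scrapy-crawlera configuration in settings.py looks like this; the API key is a placeholder:)

# Standard scrapy-crawlera middleware setup; the API key is a placeholder.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API key>'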

0
Answered
ysappier 1 week ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 5 days ago 2
0
Waiting for Customer
Daryl Tavernor 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 11 hours ago 1

Hi


I need to scrape supplier websites which have their product prices behind a login wall, so I've set up the spider to log in, but when I run the job, it doesn't actually log in and collect the data.


Any ideas?

Answer

Hi Daryl, it would be great if you could describe in more detail the errors or issues you experienced, so we can help you better.


Here's a guide to the issues you may experience and the correct channel to ask in if you need help.

Portia Troubleshooting


Best regards,

Pablo

0
Vinc 1 week ago in Crawlera 0

Hello


I know this is not a normal use for Crawlera. I'm crawling a site that's very picky about the IP one crawls from, up until one manages to register and log in.


So I need to manually log in to the site and get the session cookies before I can start crawling, and since my IP is generally blocked, I wanted to use Crawlera with my browser.


So far I have managed to get Chrome + FoxyProxy to work with HTTP connections; however, whenever I get to an HTTPS page, Chrome returns an ERR_CONNECTION_CLOSED error.


Does anybody know how to get this to work? It doesn't necessarily need to be Chrome with FoxyProxy; anything that would allow me to browse a simple page would work (Polipo? Firefox? Squid?).
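In fact, even a small script that performs the login through Crawlera once and dumps the session cookies would do. A rough sketch of that idea with the Python requests library; the API key, the login URL, the form fields and the certificate path are placeholders:

import requests

# Sketch only: APIKEY, the login URL and the form field names are placeholders.
CRAWLERA = "http://APIKEY:@proxy.crawlera.com:8010"

session = requests.Session()
session.proxies = {"http": CRAWLERA, "https": CRAWLERA}
session.verify = "crawlera-ca.crt"  # Crawlera CA certificate, needed for HTTPS

response = session.post("https://example.com/login",
                        data={"user": "me", "password": "secret"})
print(response.status_code)
print(session.cookies.get_dict())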


Thanks

0
han 1 week ago in Scrapy Cloud 0

Hi there, I started using Scrapinghub a week ago and have been using it to scrape some e-commerce websites.


I notice that the crawl job for a particular website keeps ending prematurely without any error logs.

On some occasions I tried to visit the website and found out that I had been blocked.

So I activated Crawlera, and the result is the same.


What could I be missing?

0
Answered
Rodrigo Barria 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Hi,


I would like to know how to make the spider go inside each one of the real-estate properties listed on this page:


http://www.portalinmobiliario.com/venta/departamento/santa-isabel-santiago-santiago-metropolitana?tp=2&op=1&ca=2&ts=1&dd=2&dh=6&bd=0&bh=6&or=f-des&mn=1&sf=1&sp=0&sd=47%2C00



In addition, that main listing spans several pages; how can I make the spider go to the 2nd page and so on?


thank you very much

Answer

Hi Rodrigo,


You have many options for how Portia crawls the site; try "Follow all in-domain links".



If this doesn't work, try different alternatives. Take a few minutes to explore these articles:

Portia > List of URLs

Portia > Pagination
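If you end up building this as a Scrapy spider instead, the usual pattern is a CrawlSpider with one rule for the pagination links and one for the property pages. A minimal sketch; the start URL, link patterns and selectors are placeholders rather than values taken from the actual site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PropertiesSpider(CrawlSpider):
    # Sketch only: the URL and the allow/restrict patterns are placeholders.
    name = "properties"
    start_urls = ["http://www.portalinmobiliario.com/venta/departamento/"]

    rules = (
        # Follow the pagination links on the listing pages.
        Rule(LinkExtractor(restrict_css=".pagination")),
        # Follow each property link and parse the detail page.
        Rule(LinkExtractor(allow=r"/propiedad/"), callback="parse_property"),
    )

    def parse_property(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").extract_first(),
        }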


Best regards,


Pablo


0
Not a bug
parulchouhan1990 2 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

I am trying to insert input URLs using a text file.

def start_requests(self):
    # read file data
    with open(self.url, 'r') as f:
        content = f.readlines()

    for url in content:
        yield scrapy.Request(url)


I am using the above code but getting this error:

IOError: [Errno 2] No such file or directory
Answer

Hi Parul, 


As seen in our Article:


You need to declare the files in the package_data section of your setup.py file.

For example, if your Scrapy project has the following structure:

myproject/
  __init__.py
  settings.py
  resources/
    cities.txt
scrapy.cfg
setup.py

You would use the following in your setup.py to include the cities.txt file:


setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt']
    },
    entry_points={
        'scrapy': ['settings = myproject.settings']
    },
    zip_safe=False,
)

Note that the zip_safe flag is set to False, as this may be needed in some cases.

Now you can access the cities.txt file content from settings.py like this:

import pkgutil
data = pkgutil.get_data("myproject", "resources/cities.txt")

Note that this code works for the example Scrapy project structure defined at the beginning of the article. If your project has a different structure, you will need to adjust the package_data section and your code accordingly.

For advanced resource access, take a look at the setuptools pkg_resources module.
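Applied to your start_requests, that means reading the packaged file with pkgutil instead of opening a path on disk. Roughly (the package and file names follow the example structure above; adjust them to your project):

import pkgutil

import scrapy

class MySpider(scrapy.Spider):
    # Sketch only: "myproject" and "resources/cities.txt" follow the example above.
    name = "myspider"

    def start_requests(self):
        data = pkgutil.get_data("myproject", "resources/cities.txt")
        for url in data.decode("utf-8").splitlines():
            url = url.strip()
            if url:
                yield scrapy.Request(url)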


Best regards,


Pablo

0
Answered
Matthew Sealey 2 weeks ago in Portia • updated 2 weeks ago 5

I'm trying to use Portia to pull data from a set of possible pages based on a list. I know a lot of the pages don't exist, but I don't know which ones.


So far Portia gets stuck in a loop of reattempting pages multiple times, which eats into my request limit unnecessarily. Is there a way of limiting Portia to, say, two attempts at a single page before it stops retrying it?

Answer

Hi Matthew!

Have you tried some extra settings using regex? Perhaps you don't know exactly which pages are unnecessary, but you know something about their URLs and can exclude them.


Check this article:

Portia > Regex
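Also, since Portia spiders run on Scrapy under the hood, the retry behaviour itself can be capped with the standard Scrapy retry settings (which you can set in your project's spider settings). A small sketch; RETRY_TIMES = 1 means one retry, i.e. two attempts in total:

# Sketch only: cap retries so each page is attempted at most twice.
RETRY_ENABLED = True
RETRY_TIMES = 1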


Best regards!

Pablo

0
Answered
mark80 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 5

It's my first project and it seems to me a great app!!

Here http://www.sia.ch/it/affiliazione/elenco-dei-membri/socii-individuali/ we have 255 pages (the little number on top of the list) and I need to extract not only these 4 visible columns but also the email and telephone inside every name on the list.

I've already extracted the 255 pages with the main 4 columns from a sample of the link, but I don't know how to go one level deeper into every name.

Can I do the whole job with a single crawler project?

Answer

Hey Mark, I think I could make it work. I sent you an invitation to take a look at the project.

Feel free to open Portia to check the settings I made.


Best,


Pablo

0
Answered
Jorge 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 1

It's possible to apply a filter to scraped data (I know it is possible), but I would like to download the JSON with the filter criteria applied and leave out the rest of the data. Is that possible?


thanks in advance

Answer

Hola Jorge!


I think you can experiment a bit with sharing data between spiders, as shown in this article:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000200420-sharing-data-between-spiders

but I'm not sure if this is efficient for your purposes.


I would prefer to filter locally, but of course that depends on the project.
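For the local option, a small script along these lines would do; the file names, the field and the criterion are placeholders:

import json

# Sketch only: filter a downloaded items file locally.
with open("items.json") as f:
    items = json.load(f)

# Keep only the items matching the filter criteria (placeholder condition).
filtered = [item for item in items if item.get("price", 0) > 100]

with open("filtered.json", "w") as f:
    json.dump(filtered, f, indent=2)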


Best,


Pablo