Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
steven.sank 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Looking for a way to set up periodic job runs.

Answer

Hi Steven,


I think you can use our API. Please take a moment to explore: https://doc.scrapinghub.com/scrapy-cloud.html

If that's not possible with the API, perhaps you can run your own .py scripts on Scrapy Cloud; check this blog post from Elías:

https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/ 
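
For illustration, here is a minimal sketch of triggering a run through the run.json endpoint from a plain Python script (the project id and spider name are placeholders); calling something like this from cron or another scheduler gives you a periodic run:

import requests

API_KEY = 'YOUR_SCRAPINGHUB_API_KEY'   # placeholder
PROJECT_ID = 12345                     # placeholder project id
SPIDER_NAME = 'myspider'               # placeholder spider name

# Schedule one run of the spider via the Jobs API.
response = requests.post(
    'https://app.scrapinghub.com/api/run.json',
    auth=(API_KEY, ''),
    data={'project': PROJECT_ID, 'spider': SPIDER_NAME},
)
print(response.json())  # contains the new job id on success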


I hope these suggestions were helpful,


Best,


Pablo


0
Answered
Вася Местный 3 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I need to run the same spider over and over again and write (append) the results to the same CSV or Google Sheets doc. Is that possible here? How do I proceed?

Answer

I think you can use our API. Please take a moment to explore: https://doc.scrapinghub.com/scrapy-cloud.html

If that's not possible with the API, perhaps you can run your own .py scripts on Scrapy Cloud; check this blog post from Elías:

https://blog.scrapinghub.com/2016/09/28/how-to-run-python-scripts-in-scrapy-cloud/ 
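
For illustration, a rough sketch of the appending part using the Items API (the job id, field names, and output file are placeholders):

import csv
import requests

API_KEY = 'YOUR_SCRAPINGHUB_API_KEY'   # placeholder
JOB_ID = '12345/1/7'                   # placeholder <project>/<spider>/<job>
FIELDS = ['name', 'price', 'url']      # placeholder field names

# Fetch the finished job's items as a JSON list.
items = requests.get(
    'https://storage.scrapinghub.com/items/%s' % JOB_ID,
    auth=(API_KEY, ''),
    params={'format': 'json'},
).json()

# Append the rows to one CSV file; run this after every job so the
# results keep accumulating in the same file.
with open('results.csv', 'a') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction='ignore')
    for item in items:
        writer.writerow({k: item.get(k, '') for k in FIELDS})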

I hope these suggestions were helpful,

Best,

Pablo

0
Completed
Вася Местный 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Hi there. I need to rename the items while exporting. How do I do this? The only solution I can see is to change Items everywhere in the project, but that is not an easy task and obviously not pythonic. Any other solution please?

Answer

Hi, please check the other suggestion I made about using Python scripts or our API.
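
For illustration only, a rough sketch of that idea: pull the items through the Items API and rename the keys while writing the export, so you don't have to touch the project code (the job id, key mapping, and file name are hypothetical):

import csv
import requests

API_KEY = 'YOUR_SCRAPINGHUB_API_KEY'                 # placeholder
JOB_ID = '12345/1/7'                                 # placeholder <project>/<spider>/<job>
RENAME = {'old_name': 'name', 'old_cost': 'price'}   # hypothetical key mapping

# Download the job's items as a JSON list.
items = requests.get(
    'https://storage.scrapinghub.com/items/%s' % JOB_ID,
    auth=(API_KEY, ''),
    params={'format': 'json'},
).json()

# Write a CSV whose column names use the renamed keys.
fieldnames = [RENAME.get(k, k) for k in items[0].keys()]
with open('export.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for item in items:
        writer.writerow({RENAME.get(k, k): v for k, v in item.items()})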


Best,


Pablo

0
Answered
ks446 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

I have a scrapy script running on scrapinghub. The scraper takes one argument, a csv file where the urls have been stored. The script runs without error, but the problem is that it isn't scraping all the items from the urls. I have no idea why this is happening.

Answer

Hey ks446,


It could be for many reasons. To rule out any issue with your deploy, you can run your script locally and check whether it extracts all items correctly.


If it works fine locally, check how long the spider takes to run, and look in the script for infinite loops or anything similar that could extend the run time and cause the job to be cancelled because no new items are being extracted.


Finally, consider that the site itself could be banning your spider. The only solution in that case is to use our proxy rotator, Crawlera, to make requests from different IPs. If you are interested in knowing more, please check:

What is Crawlera?


Best regards!


Pablo

0
Answered
lucse11 4 weeks ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

I'm having a problem following pagination on this website: http://gamesurf.tiscali.it/ps4/recensioni.html

Here is the relevant part of my spider code:

for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next=='»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)


If I run my spider in the terminal it works on all pages of the website, but when I deploy to Scrapinghub and run from the button in the dashboard, the spider scrapes only the first page of the website.

Among the log messages there is a warning: [py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.


I have checked that the problem is not caused by robots.txt.

How can I fix this?

Thanks


Answer

Hey Lucse,


Please check this post; it seems related to your issue.

https://stackoverflow.com/questions/18193305/python-unicode-equal-comparison-failed


Basically, your program seems to be comparing unicode objects with str objects, and the contents of the str object are not valid UTF-8. I'm not fully convinced this will work, but did you try using something like:


if next == u'»':


or related?
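
For illustration, a slightly fuller sketch of that idea (this assumes Python 2, which the UnicodeWarning suggests the job is running on; the spider class here is just a placeholder):

# -*- coding: utf-8 -*-
import scrapy


class ReviewsSpider(scrapy.Spider):   # placeholder spider, for illustration
    name = 'reviews'
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    def parse(self, response):
        # ... item extraction goes here ...
        for pag in response.css('li.square-nav'):
            label = pag.css('li.square-nav > a > span::text').extract_first()
            # Compare two unicode objects so the UnicodeWarning disappears
            # and the equality test behaves the same locally and on Scrapy Cloud.
            if label == u'\u00bb':  # the '»' next-page arrow
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    yield scrapy.Request(response.urljoin(next_page_url),
                                         callback=self.parse)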


Best,


Pablo

0
Answered
mouch 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 4

Hi there,



I have a spider that runs perfectly well without a proxy - on ScrapingHub too.

I then implemented a rotating proxy and bought a few proxies for my own use. Locally, it runs like a charm.


So, I decided to move this to ScrapingHub but the spider is not working anymore. It actually never ends.


See my logs below:

2017-05-28 14:07:27 INFO [scrapy.core.engine] Spider opened
2017-05-28 14:07:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:07:27 INFO TelnetConsole starting on 6023
2017-05-28 14:07:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:07:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:08:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:08:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:08:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:09:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:09:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:09:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:10:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:10:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:10:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:11:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:11:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:11:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)



I'm still wondering what is going wrong here. Why is the rotating proxy extension doing... nothing?

Could it be that ScrapingHub is actually blocking the use of proxy extensions to ensure we use Crawlera instead? Still, it is hard for me to understand how it could technically detect this :)


Thank you for your feedback on this,

Cyril

Answer

Hey Cyril,


Nice post, your contributions help to improve this forum and we encourage you to continue doing so. Well done!


About your last question: indeed, your own proxies won't be used. We use our own proxies with Scrapy Cloud projects (Scrapy or Portia), and of course when Crawlera is enabled (making all requests from a pool of proxies).


Best regards,


Pablo

0
Answered
tofunao1 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago 1

Now I need to write a new spider. The spider needs to:

  1. Download a zip file from a website, about 3GB per file.
  2. Unzip the downloaded file, which gives me many XML files.
  3. Parse the XML and put the information I need into items or MySQL tables.

But I have some questions about the above steps:

  1. Where can I put the downloaded files? Amazon S3?
  2. How can I unzip the file if I put it in S3?
  3. If the files in S3 are very big, such as 3GB, how can I open them from Scrapinghub?
  4. Can I use FTP instead of Amazon S3 if the file is 3GB?

Thank you.

Answer

Hi Tofunao, we don't provide coding assistance through this forum.


I suggest visiting our Reddit Scrapy channel:

https://www.reddit.com/r/scrapy/

and posting there any inquiries related to the spider.


In addition to these suggestions, you can find more information about managing your items and fetching data in our Scrapy Cloud API docs:

https://doc.scrapinghub.com/scrapy-cloud.html

Regards,


Pablo

0
Answered
han 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hi, I know that there is a 24 hour limit for free accounts.

If I have a large scraping job that will definitely take more than 24 hours to run, is there a way I can continue from where the scraping has stopped?

Answer

Hey Han, yes you can.

When you upgrade your Scrapy units, you can crawl for as long as you need.


Here are more details:

Scrapy Units


Best regards,

Pablo 

0
Answered
ysappier 1 month ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 1 month ago 2
0
Answered
han 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hi there, I started using scrapinghub a week ago, and have been using it to scrape some ecommerce websites.


I notice that the crawl job for a particular website keeps ending prematurely without any error logs.

On some occasions I tried to visit the website myself and found that I had been blocked.

So I activated Crawlera, and the result is the same.


What could I be missing?

Answer

Hi Han, even though we can't provide ban assistance or crawl tuning for standard accounts, there are some best practices you can keep in mind when enabling Crawlera for a particular project.


Please take a few minutes to explore more details in:

Crawlera best practices
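
For reference, enabling Crawlera in a Scrapy project is typically just a settings change through the scrapy-crawlera middleware; a minimal sketch (the API key is a placeholder, and the extra settings are the usual recommendations rather than required values):

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'YOUR_CRAWLERA_API_KEY'   # placeholder

# Crawlera handles throttling and bans itself, so keep Scrapy's own
# throttling off and allow slower responses.
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 600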


Best regards,

Pablo

0
Not a bug
parulchouhan1990 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am trying to load input URLs from a text file.

def start_requests(self):
    # read file data
    with open(self.url, 'r') as f:
        content = f.readlines()

    for url in content:
        yield scrapy.Request(url)


I am using the above code but getting this error:

IOError: [Errno 2] No such file or directory
Answer

Hi Parul, 


As seen in our Article:


You need to declare the files in the package_data section of your setup.py file.

For example, if your Scrapy project has the following structure:

myproject/
  __init__.py
  settings.py
  resources/
    cities.txt
scrapy.cfg
setup.py

You would use the following in your setup.py to include the cities.txt file:


setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt']
    },
    entry_points={
        'scrapy': ['settings = myproject.settings']
    },
    zip_safe=False,
)

Note that the zip_safe flag is set to False, as this may be needed in some cases.

Now you can access the cities.txt file content from settings.py like this:

import pkgutil
data = pkgutil.get_data("myproject", "resources/cities.txt")

Note that this code works for the example Scrapy project structure shown above. If your project has a different structure, you will need to adjust the package_data section and your code accordingly.

For advanced resource access, take a look at the setuptools pkg_resources module.
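
To tie this back to your start_requests, here is a minimal sketch that reads the bundled file instead of opening a filesystem path (it assumes the URL list is shipped as resources/urls.txt inside a package named myproject; the spider name is a placeholder):

import pkgutil

import scrapy


class UrlFileSpider(scrapy.Spider):   # placeholder spider, for illustration
    name = 'urlfile'

    def start_requests(self):
        # Read the file that was bundled via package_data instead of
        # opening a path that doesn't exist on Scrapy Cloud.
        data = pkgutil.get_data('myproject', 'resources/urls.txt')
        for url in data.decode('utf-8').splitlines():
            url = url.strip()
            if url:
                yield scrapy.Request(url)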


Best regards,


Pablo

0
Answered
Kalo 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Is there a way to load just a small element from a page's DOM instead of loading the whole page? I'd like to scrape just a few divs out of the whole html page. The idea is to increase the speed and minimize transfer bandwidth.

Answer

Hi Kalo, this can be done with a headless browser like Splash.

For example, you can turn off images or execute custom JavaScript.

To learn more about it, please check the documentation at: https://splash.readthedocs.io/en/stable/
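
For illustration only, a sketch of requesting a page through Splash from Scrapy with image loading turned off, using the scrapy-splash package (the URL, selector and spider name are placeholders; the scrapy-splash middlewares and SPLASH_URL still need to be configured as its README describes):

import scrapy
from scrapy_splash import SplashRequest


class DivSpider(scrapy.Spider):   # placeholder spider, for illustration
    name = 'divs'

    def start_requests(self):
        yield SplashRequest(
            'http://example.com/page.html',   # placeholder URL
            self.parse,
            endpoint='render.html',
            args={'images': 0, 'wait': 0.5},  # skip image downloads
        )

    def parse(self, response):
        # Parse only the divs you care about from the rendered page.
        for div in response.css('div.product'):   # placeholder selector
            yield {'text': div.css('::text').extract_first()}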

Kind regards,

Pablo Vaz

0
Completed
hello 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am using the following code to try and bring back all fields within a job using the Items API:


$sch_id = "172/73/3"; //job ID

$ch = curl_init();

curl_setopt_array($ch, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_URL => "https://storage.scrapinghub.com/items/". $sch_id ."?format=json&fields=_type,design,price,material",
CURLOPT_CUSTOMREQUEST =>"GET",
CURLOPT_HTTPHEADER => array(
'Accept: application/x-jsonlines',
),
CURLOPT_USERPWD => "e658eb1xxxxxxxxxxxx4b42de6fd" . ":" . "",
));

$result = curl_exec($ch);
print_r(json_decode($result));

curl_close ($ch);

There are 4 fields I am trying to get as JSON, but the request only brings back "_type" and "price". I have tried various things with different headers and the request URL, but no luck.


Any advice would be appreciated.


Cheers,

Adam

Answer

Hi,


We suggest using the script provided in our docs:


<?php
$ch = curl_init();
$url = 'https://twitter.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '<API KEY>:';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');
$scraped_page = curl_exec($ch);
curl_close($ch);
echo $scraped_page;
?> 



But if that is not possible, try adding the reference to the certificate for fetching HTTPS. In the code provided above, this is done by:


curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');
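
For reference, the same items request can be made from Python, which may be easier to debug (a minimal sketch; the job id and field names are taken from the question above, and the API key is a placeholder):

import requests

API_KEY = 'YOUR_SCRAPINGHUB_API_KEY'   # placeholder
JOB_ID = '172/73/3'                    # job ID from the question

resp = requests.get(
    'https://storage.scrapinghub.com/items/%s' % JOB_ID,
    auth=(API_KEY, ''),
    params={'format': 'json', 'fields': '_type,design,price,material'},
)
resp.raise_for_status()

for item in resp.json():
    # A requested field only appears in the output if the item actually
    # contains it, so missing fields are worth checking for.
    print(item)
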
0
Answered
Alex L 2 months ago in Scrapy Cloud • updated 2 months ago 2

Currently, scripts can only be deployed by using the shub deploy command. When we push scripts to git, the app doesn't seem to pull the scripts from our repo.


Will pulling scripts via a git hook be supported in the future, or do you guys intend to stick with shub deploy for now?

Answer

Hi Alex, we currently support deploying from a GitHub repository.

Please take a moment to check this tutorial:

Deploying a Project from a Github Repository


Kind regards,


Pablo

0
Answered
shamily23v 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I would like to write/update data in MongoDB with the items crawled from Scrapinghub.

Answer

Hi Shamily,


It sounds like you have a MongoDB server that you would like your spiders to write to, and you would like to open access to that server only from Scrapy Cloud IPs.

Unfortunately, this is not possible. We cannot provide a reliable range of IP addresses for our Scrapy Cloud crawling servers because they are not static; they change frequently. So even if we were to provide you the list we have now, it would soon change and your spiders' connection with Mongo would break.

Here are a couple of alternatives to consider:

  • Write a script that pulls the data from Scrapinghub (using the API) and writes it to your Mongo server; see the sketch below. This script can run on your Mongo server or on any other server (and you only need to whitelist that IP)
  • Use authentication in Mongo

To know more about our API: https://doc.scrapinghub.com/api/items.html#items-api
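
For the first alternative, a minimal sketch of such a script (the job id, connection string, and database/collection names are placeholders; it assumes the Items API linked above and the pymongo client):

import requests
from pymongo import MongoClient

API_KEY = 'YOUR_SCRAPINGHUB_API_KEY'   # placeholder
JOB_ID = '12345/1/7'                   # placeholder <project>/<spider>/<job>

# Pull the crawled items from the Items API as a JSON list.
items = requests.get(
    'https://storage.scrapinghub.com/items/%s' % JOB_ID,
    auth=(API_KEY, ''),
    params={'format': 'json'},
).json()

# Write them to MongoDB. Because this script runs on (or near) your own
# server, only that server's IP needs to be whitelisted.
client = MongoClient('mongodb://localhost:27017')   # placeholder connection string
collection = client['scraping']['items']            # placeholder db/collection
if items:
    collection.insert_many(items)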

Kind regards,


Pablo