Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Tolga 8 hours ago in Portia 0

I have a shared web hosting account and I wonder if I can install Portia on it for scraping? I have checked the Portia installation documents, but they only explain how to install Portia locally.

0
Answered
Umair 12 hours ago in Scrapy Cloud • updated by Tomas Rinke (Support Engineer) 9 hours ago 1

Is it possible to download the project code from a Scrapinghub account? I deployed the code using the shub command.

Answer

Hi, if you're using shub, then fetch-eggs returns a zip file with all the eggs, along with __main__.egg, which contains the code.

shub fetch-eggs --help
Usage: shub fetch-eggs [OPTIONS] [TARGET]

  Download all eggs deployed to a Scrapy Cloud project into a zip file.

  You can either fetch to your default target (as defined in
  scrapinghub.yml), or explicitly supply a numerical project ID or a target
  defined in scrapinghub.yml (see shub deploy).

Options:
  --help  Show this message and exit.
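
For illustration, a minimal sketch of unpacking the result with Python's standard zipfile module; the file and directory names are placeholders, and since an egg is itself just a zip archive, the project code inside __main__.egg can be extracted the same way:

import zipfile

# Placeholder names: use the zip file that "shub fetch-eggs" actually wrote.
with zipfile.ZipFile("eggs.zip") as bundle:
    bundle.extractall("eggs")  # __main__.egg plus any dependency eggs

# An egg is just a zip archive, so the project source can be unpacked too.
with zipfile.ZipFile("eggs/__main__.egg") as main_egg:
    main_egg.extractall("project_code")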

0
Not a bug
Umair yesterday at 8:56 a.m. in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) yesterday at 11:59 a.m. 1

I have a Scrapy project deployed on ScrapingHub


I am getting `exceptions.ImportError: No module named MySQLdb` when I run that code.

I am not sure how to install the MySQLdb module there on ScrapingHub.

Note:

I can run the same project under my personal profile on ScrapingHub, but there is also an option to create a `Company` in Scrapinghub, and I have been added under a company by my client. I am getting that error when I try to run the same project under the company's profile.
Answer
Nestor Toledo Koplin (Support Engineer) yesterday at 11:59 a.m.

Hello,


We've had a look at your projects and found that the reason it runs in your personal project but not in the company's project is the Scrapy Cloud stack (https://support.scrapinghub.com/topics/1962-scrapy-cloud-stacks/). Your personal project uses the hworker stack, which includes MySQLdb; your company's project uses the Scrapy stack, and MySQLdb is not provided for that stack.


To solve this, please remove the MySQLdb egg from the Code & Deploys section of the company's project and use requirements to install MySQL-python.

To use requirements to install a dependency, please see: https://support.scrapinghub.com/topics/1970-deploying-dependencies-to-your-scrapinghub-project/
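
For illustration, a minimal sketch of what that could look like; the project ID and the version pin are placeholders, and the exact scrapinghub.yml key is described on the page above:

requirements.txt:

MySQL-python==1.2.5

scrapinghub.yml (excerpt, next to your project):

projects:
  default: 12345
requirements_file: requirements.txt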

0
Answered
Aditya yesterday at 2:57 a.m. in Scrapy Cloud • updated by Tomas Rinke (Support Engineer) yesterday at 8:19 a.m. 1

Let's say I want to run

scrapy crawl bot -a start="some url"


How do I add those arguments when running the job on scrapinghub.com or Scrapy Cloud?

I tried adding the argument as '-a start' with the value "some url",

and the argument as '-a' with the value 'start="some url"'.


Answer
Tomas Rinke (Support Engineer) yesterday at 8:19 a.m.

Hi, in Scrapy Cloud you don't need to precede the argument with "-a".
When clicking "Run" in your Job Dashboard, just enter the argument name (e.g. start) and its value (e.g. some url).
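
For illustration, a minimal sketch of how the spider side might pick up that argument, assuming the standard Scrapy behaviour of passing each job argument to the spider as a keyword argument; the spider and attribute names below are placeholders matching the example above:

import scrapy

class BotSpider(scrapy.Spider):
    name = "bot"  # placeholder spider name matching "scrapy crawl bot"

    def __init__(self, start=None, *args, **kwargs):
        # "start" arrives here whether it was passed as "-a start=..." on the
        # command line or as name/value in the Scrapy Cloud run dialog.
        super(BotSpider, self).__init__(*args, **kwargs)
        self.start_urls = [start] if start else []

    def parse(self, response):
        # placeholder parse method
        pass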

0
Ehtisham 5 days ago in Portia • updated 5 days ago 0

I created a project, then created a spider and published the project. However, I am unable to create a sample page: whenever I click on "New Sample" it generates an error. Please review the attached screenshot. Let me know if you can help, please?


Thanks.

0
Waiting for Customer
Hello!! 1 week ago in Portia • updated by Thriveni Patil (Support Engineer) yesterday at 8:55 a.m. 3

Hello,

XPath seems to be read-only in Portia (I can change the CSS selector). Am I missing something?


Thanks


0
Waiting for Customer
Tall Steve 1 week ago in Crawlera • updated yesterday at 8:53 a.m. 2

I have a page that requires .NET's __VIEWSTATE variable, and a few others, to be POSTed to it. It is therefore a 2-stage process.

1. Grab the first page with a GET and extract the __VIEWSTATE variable contents, then

2. POST these to the next page to view the contents.

Part 1 works and I get the viewstate var contents OK, but part 2 then times out after 30 seconds. I am using the same curl handle and not re-initialising curl after part 1; this works fine when not using the Crawlera proxy. It only fails when I use the proxy.


Code (curl and the proxy are already initialised, just doing a second call with the POST vars):

curl_setopt($curl, CURLOPT_POST, 1); // 0 for a GET request, 1 for POST
$postData = array(
    '__VIEWSTATE' => $viewstate,
    '__VIEWSTATEGENERATOR' => $viewstategenerator,
    '__EVENTVALIDATION' => $eventvalidation,
    'ctl00$cphMainContent$tabDocuments' => 'Documents'
);
curl_setopt($curl, CURLOPT_POSTFIELDS, $postData);
curl_setopt($curl, CURLOPT_REFERER, "https://websiteURL.com");
$contents = curl_exec($curl); // Grab the page


Do POST vars work with Crawlera?

Thanks,

Steve

0
Willy 1 week ago in Portia • updated 1 week ago 2

When I view the example output of a sample page (extracted items), all fields appear to be properly returned. However, when I run the spider, certain fields are consistently missing, specifically the 'website' field, which is one of the most important for us. Does anyone know why this is happening and how to fix it? I made sure all changes in Portia have been pushed to the cloud, and I deleted all extra spiders and sample pages so there is only one version of each.




0
Kirimaks 1 week ago in Crawlera • updated 1 week ago 0

Hello. I changed my plan from C10 to C50 and Crawlera started working very strangely. Before this I used up two C10 plans, and by now 50,000+ pages have been crawled. When I changed the plan it started to work slowly, and there seem to be 2 or 3 times more banned requests and responses with captchas. Right now maybe 1 in 5 requests works. I didn't change the scraper's code (I use CasperJS). Also, I'm not sure which user agent is used if I don't set any user agent in the Crawlera headers?

0
Answered
Jesse521 2 weeks ago in Scrapy Cloud • updated by Tomas Rinke (Support Engineer) 8 hours ago 2

Hi,

I wonder, is there any API rate limit for creating jobs or retrieving jobs?


Since you don't have job notifications/webhooks (or did I overlook them?), I have to check the job status from time to time.


Therefore, I'm very concerned about the rate limit.



Answer

Thanks for your concern; the limits are in place to support such requirements, e.g. checking the job status every minute.
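
For illustration, a minimal once-a-minute polling sketch with Python's requests library; it assumes the Jobs API's jobs/list.json endpoint and its "state" field, and the project ID, job ID and API key are placeholders:

import time
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
PROJECT_ID = "12345"       # placeholder
JOB_ID = "12345/1/7"       # placeholder

def job_state():
    # Assumes the jobs/list.json endpoint of the Jobs API.
    resp = requests.get(
        "https://app.scrapinghub.com/api/jobs/list.json",
        params={"project": PROJECT_ID, "job": JOB_ID},
        auth=(API_KEY, ""),
    )
    resp.raise_for_status()
    return resp.json()["jobs"][0]["state"]

# One request per minute stays well within a per-minute check requirement.
while job_state() != "finished":
    time.sleep(60)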

Regards,


Tomas

0
Answered
Eyeless 2 weeks ago in Crawlera • updated by Tomas Rinke (Support Engineer) 8 hours ago 1

Hello,


As I understand it, Crawlera selects a new proxy for each request, but I would like to make, for instance, 10 requests from one IP and then 10 from another one, rather than 20 requests from 20 different IPs. I found a 3-year-old topic saying that I could pass the 'X-Crawlera-Slave' response header into my next request's headers to use the same proxy, but Martin also wrote that this feature was not available. Has the situation changed, so that I can now use the same proxy for several requests somehow?

Answer

Hi, reading your post it seems sessions fit your requirement: https://doc.scrapinghub.com/crawlera.html#sessions
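
For illustration, a minimal sketch with Python's requests library of the session workflow from that page (create a session, then reuse the returned session ID so requests keep the same outgoing IP); the API key and URLs are placeholders:

import requests

API_KEY = "YOUR_CRAWLERA_API_KEY"  # placeholder
proxies = {"http": "http://%s:@proxy.crawlera.com:8010/" % API_KEY}

# The first request asks Crawlera to create a session, i.e. pin an outgoing IP.
resp = requests.get("http://example.com/page1",
                    headers={"X-Crawlera-Session": "create"},
                    proxies=proxies)
session_id = resp.headers["X-Crawlera-Session"]

# Later requests send the returned session ID to keep using the same IP.
for url in ["http://example.com/page2", "http://example.com/page3"]:
    requests.get(url,
                 headers={"X-Crawlera-Session": session_id},
                 proxies=proxies)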


Let us know if it helps,


Tomas

0
Manthiran 2 weeks ago in Scrapy Cloud 0

Please check the link below for my question.

This is a tricky webpage, please help us.

0
Jan De Wit 3 weeks ago in Portia 0

I did the migration from Kimono to Portia as advertised at https://blog.scrapinghub.com/2016/02/17/portia-alternative-to-kimono/. Or rather, you guys did it for me, for which I send you my gratitude. Thank you very, very much!

I started running some spiders, but to my disappointment, there were no results.

I cannot find out why, because when I check the Portia setup, the fields are correctly selected.

0
Fixed
Ben Həɍila 3 weeks ago in Scrapy Cloud • updated by Tomas Rinke (Support Engineer) 8 hours ago 2

I created a crawler using Portia, then tried running it and got this error message:


Traceback (most recent call last):

  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 130, in _run_usercode
    _run(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 90, in _run
    _run_scrapy(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 98, in _run_scrapy
    execute(settings=settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/src/slybot/slybot/slybot/spidermanager.py", line 91, in from_settings
    return cls(datadir, zipfile, spider_cls, settings=settings)
  File "/src/slybot/slybot/slybot/spidermanager.py", line 81, in __init__
    ZipFile(zipfile).extractall(datadir)
  File "/usr/local/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
Answer

Hi, thanks for reporting this. Let us know if you are still experiencing the issue; it seems to have been a temporary instance issue.


Tomas

0
Elie VANGU 3 weeks ago in Portia 0

Hello,


I don't understand this error message when trying to run my crawl in order to retrieve data:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 130, in _run_usercode
    _run(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 90, in _run
    _run_scrapy(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 98, in _run_scrapy
    execute(settings=settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/src/slybot/slybot/slybot/spidermanager.py", line 91, in from_settings
    return cls(datadir, zipfile, spider_cls, settings=settings)
  File "/src/slybot/slybot/slybot/spidermanager.py", line 81, in __init__
    ZipFile(zipfile).extractall(datadir)
  File "/usr/local/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file

Please help me!

Many thanks.