My local project used ITEM_PIPELINES to pass all items to a custom XLSX export pipeline in the format desired (using openpyxl to specify the format). After deploying it to the cloud I was also able to deploy the egg for openpyxl, so the project runs and completes successfully. However, I can't find where to download the created XLSX file.
Sorry if I'm missing something obvious; after combing through the dashboard for a way to download it, I also searched the knowledge base and the forums here and couldn't find an explanation. Thanks in advance!
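For reference, a minimal sketch of the kind of openpyxl export pipeline described in the question above; the class name, item fields, and output path are hypothetical placeholders, not the poster's actual code:

from openpyxl import Workbook

class XlsxExportPipeline:
    FIELDS = ["title", "price", "url"]  # hypothetical item fields

    def open_spider(self, spider):
        self.workbook = Workbook()
        self.sheet = self.workbook.active
        self.sheet.append(self.FIELDS)  # header row

    def process_item(self, item, spider):
        self.sheet.append([item.get(f, "") for f in self.FIELDS])
        return item

    def close_spider(self, spider):
        self.workbook.save("output.xlsx")  # written to the job's local disk

Such a pipeline would be enabled via ITEM_PIPELINES = {"myproject.pipelines.XlsxExportPipeline": 300} in settings.py; note that the file is written to the local filesystem of whatever machine runs the job.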
I'm new to Scrapinghub. I have just started testing Portia 2 (beta). On the starting page I want to add more URLs. In the beta version I have to add URLs one by one, which is not convenient for hundreds of start pages. (Note: I don't want to crawl all links, just those starting pages.)
It seems that in the current stable version of Portia I can paste bulk links as starting pages, which is missing in the Portia 2 beta. Can you please give us an option to paste bulk URLs?
I was using Crawlera to scrape a LinkedIn page, but I keep getting a 503 error like the following:
< HTTP/1.1 503 Service Unavailable
Does Crawlera not support LinkedIn anymore?
Here's an example: I'm trying to scrape data about the courses on this university site. How do I make portia follow the links from the starting page to a page with course data, but not follow any more links? After it gets the course page, I want it to go back to the starting page and try the next link. In the example site above, I would want it to click "Aerospace Studies" and scrape that page, but then go back to the start page and click "African American Studies".
Hi, first of all you could configure your link crawling with a URL pattern: set a regular expression that matches the course pages, like /course/, and toggle the link highlighting to see which links are being followed (green) and which are not (red).
The other option is to follow all domain links and set the Scrapy setting DEPTH_LIMIT to 1 in your project, but this is going to match all the domain links present on that site.
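For reference, a rough plain-Scrapy equivalent of the first option; the spider name, start URL, and extracted field below are hypothetical:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CoursesSpider(CrawlSpider):
    name = "courses"
    start_urls = ["https://www.example.edu/courses/"]  # placeholder

    # Follow only links matching /course/, scrape each one, and do not
    # follow any further links from the course pages themselves.
    rules = (
        Rule(LinkExtractor(allow=r"/course/"),
             callback="parse_course", follow=False),
    )

    def parse_course(self, response):
        yield {"title": response.css("h1::text").get()}  # hypothetical field

The second option corresponds to following all links but setting DEPTH_LIMIT = 1 in the project's settings.py instead.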
Just a noob question, but I found no solution; I've tried everything and nothing works.
I tried to scrape this link:
https://www.example.com/search?q=&page=1 to 4000
I made crawling rules like this
I also tried
Still, it doesn't work at all.
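For what it's worth, a minimal sketch of generating those start URLs in a plain Scrapy spider, assuming the page parameter is the only part of the URL that changes (the selector in parse() is a hypothetical placeholder):

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"

    def start_requests(self):
        # Generate the search URLs for pages 1 through 4000.
        for page in range(1, 4001):
            yield scrapy.Request(
                f"https://www.example.com/search?q=&page={page}",
                callback=self.parse,
            )

    def parse(self, response):
        for row in response.css("div.result"):  # hypothetical selector
            yield {"text": row.css("::text").get()}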
I'm very new to this stuff, so bear with me. I'm trying to get data from this website.
I want the hero (heroName), games played, games banned, popularity, and win %. When I run the spider I created with Portia, I only get the heroName field back, and it contains the combined data for the entire table in one field.
Why is this happening when the "test spider" button seems to return exactly what I want?
1. Is there a paid option that I can buy that will increase the speed of my spider? Some of my sites take more than a day to crawl.
2. Is there a paid option that will allow me to use IP addresses from the United States, or a way to set the IPs to use a source like trustedproxies.com?
No, it's not the same.
All the spiders you develop in a Scrapy project can be deployed and run in Scrapy Cloud.
The resources allocated to each job run are called job units.
Each job unit provides 1 GB of RAM, 2.5 GB of storage, a given amount of CPU capacity, and one concurrent crawl.
So depending on the number of job units available in your plan, you can:
- run concurrent jobs (which requires more than one job unit)
- run a single job with multiple job units (if it needs more resources)
How do I send the authorization token? It doesn't work with wget.
Hi, here's a sample adapted from this post:
wget -e use_proxy=yes -e http_proxy=http://CRAWLERA_APIKEY:@proxy.crawlera.com:8010/ http://httpbin.org/ip
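The same request through the proxy with Python requests, assuming the API key is sent as the proxy username with an empty password, exactly as in the wget example:

import requests

# Route both HTTP and HTTPS traffic through the Crawlera proxy.
proxies = {
    "http": "http://CRAWLERA_APIKEY:@proxy.crawlera.com:8010/",
    "https": "http://CRAWLERA_APIKEY:@proxy.crawlera.com:8010/",
}
response = requests.get("http://httpbin.org/ip", proxies=proxies)
print(response.text)  # should show the proxy's IP rather than yours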
I purchased a "Scrapy Cloud Container Unit" thinking that it would allow me to run more jobs at a time, but my second job appears to still be pending the first job's completion. What option do I need to purchase to be able to run more than one job at a time?
In order to run 2+ jobs at a time, you'd need to purchase additional Scrapy Cloud Units. If you only need to run 2 jobs at a time, then only 2 Units would need to be purchased.
Please note that the free unit is replaced by the purchased one. See: http://doc.scrapinghub.com/scrapy-cloud.html#pricing for more information.
This is the first time I've ever noticed this, but there is a "wait time" ticking away next to a job that I started, and it shows the job in pending status. Why is there a "wait time" before the job starts? Is there a way to bypass it, a way to determine how long I will have to wait, or a way to upgrade so I don't have to wait? Thanks.
The wait time is the time the job has spent in pending status waiting for a free unit to become available to run it. In order to run 2+ jobs at a time, you'd need to purchase additional units; you need at least one unit per running job.
I'm trying to use Crawlera and Scrapy to scrape data from Amazon (which uses HTTPS).
When I run the script from my PC, I get error code 407 (Bad Auth).
I've been searching for hours in many different places trying to learn how to get this to work. I've seen the example that goes with Python Requests, but how do I get HTTPS to work with Scrapy (and Crawlera)?
A 407 error code from Crawlera is an authentication error; there's probably a typo in the API key, or perhaps you are not using the correct one. Please make sure you are using the one displayed on your account page (click on the gear icon next to your user).
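For a Scrapy project, a minimal sketch of enabling Crawlera through the scrapy-crawlera middleware in settings.py; replace the placeholder with the key shown on that page:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your API key>"  # a 407 usually means this value is wrong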
My scenario is as simple as this: there are multiple pages of data, and a single "next" button triggers an AJAX call to load the next page when clicked. How can I crawl all of the data using Splash? It seems that Splash is stateless, i.e. it doesn't remember the page from the previous call, so it loses its place in the pagination. Any hints are appreciated.
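One common approach is to do all of the paging inside a single Splash call, so no state needs to survive between requests. A minimal sketch using scrapy-splash with an embedded Lua script; the "a.next" selector, URL, page count, and extracted field are hypothetical:

import scrapy
from scrapy_splash import SplashRequest

LUA_PAGINATE = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1.0)
    local pages = {splash:html()}
    for i = 1, args.max_pages - 1 do
        -- click the "next" button and wait for the AJAX content to load
        splash:runjs("document.querySelector('a.next').click()")
        splash:wait(1.5)
        pages[#pages + 1] = splash:html()
    end
    return {pages = pages}
end
"""

class PaginatedSpider(scrapy.Spider):
    name = "paginated"

    def start_requests(self):
        yield SplashRequest(
            "http://www.example.com/list",  # placeholder URL
            self.parse,
            endpoint="execute",
            args={"lua_source": LUA_PAGINATE, "max_pages": 5},
        )

    def parse(self, response):
        # response.data["pages"] holds the HTML of each page, in order.
        for html in response.data["pages"]:
            sel = scrapy.Selector(text=html)
            yield {"first_row": sel.css("div.row::text").get()}  # hypothetical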
I have checked some related threads, and more people are asking for this, but it seems there is no way to tell the scraping system to use a certain file (sitemap, RSS feed, single HTML file with links) in the scraping process.
The only way I found was to add all URLs to Portia, manually or by command, but there is no way to tell the scraping system that it should use a file with all the start URLs defined.
I hope someone is willing to implement this. A possible workaround in a plain Scrapy spider (not Portia) is sketched below.
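A minimal sketch of that workaround, reading the start URLs from a file; the filename and extracted fields are hypothetical placeholders:

import scrapy

class FileSeededSpider(scrapy.Spider):
    name = "file_seeded"

    def start_requests(self):
        # One URL per line in the seed file.
        with open("start_urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}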