Since setting up proxy authentication in Selenium is not trivial, a simple option is to employ crawlera-headless-proxy as a middle layer between Crawlera and the browser. To run the headless-proxy, we need it available on the system; the project link above contains all the necessary information to get it running. Here we chose the Docker image for added simplicity.
$ docker run -p 3128:3128 scrapinghub/crawlera-headless-proxy -d -a $CRAWLERA_API_KEY
Let's digest that command: we're using docker run, so we're creating (if required) and running a container from the scrapinghub/crawlera-headless-proxy image. The -d flag puts the headless-proxy in debug mode, so we can see all the activity going through it. Finally, the required -a flag specifies your Crawlera API key, which in this case is stored in an environment variable.
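If you haven't exported the key yet, it can be set for the current shell session first (the value below is a placeholder for your own Crawlera API key):

$ export CRAWLERA_API_KEY=<your-api-key>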
The docker run command should produce a lot of output and stay attached to your terminal session. Keep it running: the browser needs the proxy available to connect to.
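Before wiring up the browser, it's worth a quick sanity check that the proxy accepts connections. A minimal sketch using curl, assuming the container is listening on localhost:3128; the -k flag is included because headless-proxy intercepts HTTPS traffic and re-signs it with its own certificate, which curl would otherwise reject:

$ curl -k -x localhost:3128 https://www.whatismyip.com/

If that returns HTML, the proxy is working, and the next step is writing the Selenium script.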
For Python and Chrome:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Route all browser traffic through the local headless-proxy instance
headless_proxy = "127.0.0.1:3128"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': headless_proxy,
    'ftpProxy': headless_proxy,
    'sslProxy': headless_proxy,
    'noProxy': ''
})

chrome_options = webdriver.ChromeOptions()
# Disable images to speed up page loading
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)

driver = webdriver.Chrome(desired_capabilities=capabilities, options=chrome_options)
driver.set_page_load_timeout(600)
driver.get("https://www.whatismyip.com/")

elem = driver.find_element_by_css_selector("a.btn.btn-success.btn-md.btn-block")
actions = ActionChains(driver)
actions.click(on_element=elem)
actions.perform()
print("Clicked on a button!")

driver.close()
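If you are running this on a machine without a display, Chrome can also be started headless. Adding the flag below to the same chrome_options object before constructing the driver is enough; the rest of the script stays unchanged:

# Optional: run Chrome without a visible window (e.g. on a CI server)
chrome_options.add_argument("--headless")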
For Java and Chrome:
import org.openqa.selenium.By;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class CrawleraTest {
    public static void main(String[] args) {
        // Route all browser traffic through the local headless-proxy instance
        Proxy proxy = new Proxy();
        proxy.setHttpProxy("127.0.0.1:3128");
        proxy.setSslProxy("127.0.0.1:3128");

        DesiredCapabilities capabilities = DesiredCapabilities.chrome();
        capabilities.setCapability("proxy", proxy);

        ChromeOptions options = new ChromeOptions();
        options.addArguments("start-maximized");
        capabilities.setCapability(ChromeOptions.CAPABILITY, options);

        WebDriver driver = new ChromeDriver(capabilities);
        driver.get("https://www.whatismyip.com/");

        WebDriverWait wait = new WebDriverWait(driver, 60);
        wait.until(ExpectedConditions.visibilityOfElementLocated(
                By.xpath("//a[@role='button'][contains(text(),'Questions & Answers')]")));

        driver.quit();
    }
}
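One practical note for the Java version: Selenium needs to locate the chromedriver binary. If it is not on your PATH, its location can be supplied via a system property before the driver is constructed; the path below is an illustrative placeholder, not part of the original example:

// Hypothetical path; point this at your local chromedriver binary
System.setProperty("webdriver.chrome.driver", "/usr/local/bin/chromedriver");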