When it comes to web scraping, rvest (check out this official tutorial) is perhaps the best option available for scraping semi-static webpages. I call them “semi-static” because rvest lets you interact a little with dynamic webpages; for a quick reference, check this out.
However, if you have to crawl a page that dynamically changes its content based on user input and interaction, you’ll probably end up using RSelenium. The RSelenium GitHub page clearly states that it’s meant for Selenium 2. Since then, Selenium 3 has been released and Firefox has moved to Marionette (which I know very little about). As a result, my older code usually failed, and I started researching how to use RSelenium again with the updated dependencies. Here is what you need to do to get things working.
Step 1: Get the Selenium standalone server from the official Selenium website. Here is the direct link for Selenium 3.
Step 2: Place it somewhere PATH will detect, or in a known directory. [I usually keep it in C:\Dev.]
Step 3: Download GeckoDriver and keep it in a folder (it can be the same folder you used in step 2). Install the driver version that matches your OS.
Step 4: Install RSelenium in R by issuing
install.packages("RSelenium")
Step 5: Use RSelenium in R with the following code
rm(list = ls())
options(stringsAsFactors = FALSE)
library(RSelenium)
# The terminal can be made invisible, but keeping it visible at first helps in detecting potential problems
# Use the proper paths for Selenium and geckodriver.exe.
# Remember to rename the downloaded Selenium file
sel <- startServer(dir = "C:/Dev/Selenium/",
                   javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\""),
                   invisible = FALSE)
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox",
                      extraCapabilities = list(marionette = TRUE))
remDr$open()
# test
remDr$navigate("https://www.google.com/")
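If startServer fails right away, one thing worth ruling out is a missing or misnamed file. The following is a minimal sketch in base R, not part of RSelenium: check_selenium_setup is a hypothetical helper, and the jar name selenium-server-standalone.jar is my assumption about what the renamed Selenium file should be called, so adjust it if yours differs.

```r
# Hypothetical helper (not an RSelenium function): reports whether the files
# the script depends on are present and whether java is on the PATH.
check_selenium_setup <- function(selenium_dir, gecko_path,
                                 jar_name = "selenium-server-standalone.jar") {
  # jar_name is an assumed file name; change it to match your renamed jar
  jar_path <- file.path(selenium_dir, jar_name)
  c(selenium_jar = file.exists(jar_path),
    geckodriver  = file.exists(gecko_path),
    java_on_path = nzchar(Sys.which("java")[[1]]))
}

# Paths taken from the script above
status <- check_selenium_setup("C:/Dev/Selenium",
                               "C:/Dev/Selenium/geko/geckodriver.exe")
print(status)
```

Any FALSE entry in the printed vector points at the piece to fix before starting the server.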
The most important lines of the RSelenium script are highlighted below
# in startServer
javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\"")
# in remoteDriver
extraCapabilities = list(marionette = TRUE)
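The escaped quotes in the javaargs string are easy to get wrong when the path changes. As a small sketch, the string can be built from a single path variable; gecko_javaarg is a hypothetical helper name of mine, not something RSelenium provides.

```r
# Hypothetical helper: wraps the geckodriver path in the -D system property
# with the same embedded double quotes as the hand-written string.
gecko_javaarg <- function(gecko_path) {
  sprintf("-Dwebdriver.gecko.driver=\"%s\"", gecko_path)
}

arg <- gecko_javaarg("C:/Dev/Selenium/geko/geckodriver.exe")
cat(arg, "\n")
```

The result can then be passed as javaargs = c(arg) in the startServer call, so the path only has to be edited in one place.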
Here are a few references from which I gathered my knowledge:
- Stack Overflow question addressing the actual solution for version compatibility
- RSelenium : Headless browsing
- Phantomjs & rvest [just intro]
- WebDriver <-> Marionette proxy
Extras
A few other aspects need to be considered when running RSelenium for the first time. Frequently used options, including the marionette option mentioned above, are listed below.
sel <- startServer(dir = "C:/Dev/Selenium/",
                   args = c("-port 4455"),
                   javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\""),
                   invisible = FALSE)
firefox_profile.me <- makeFirefoxProfile(list(marionette = TRUE,
    webdriver_accept_untrusted_certs = TRUE, # for sites that have expired certificates (sometimes required for internal sites)
    webdriver_assume_untrusted_issuer = TRUE, # for the same reason
    browser.download.dir = "C:/temp", # download directory; however, it has no effect as of now
    network.proxy.socks = "<proxy ip>", # for proxy settings, specify the proxy host IP
    network.proxy.socks_port = 3128L, # proxy port; the trailing "L" marks an integer and is very important, without it the setting has no impact
    network.proxy.type = 1L)) # 1 for manual and 2 for automatic configuration script; the "L" is important here as well
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4455,
                      browserName = "firefox",
                      extraCapabilities = firefox_profile.me)
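The “L” suffix matters because a bare number such as 3128 is stored by R as a double, while 3128L is an integer. The sketch below only demonstrates that difference in base R; as_integer_prefs is a hypothetical convenience helper of mine, and how makeFirefoxProfile treats each type internally is not something I have verified.

```r
port_double  <- 3128
port_integer <- 3128L
print(typeof(port_double))   # "double"
print(typeof(port_integer))  # "integer"

# Hypothetical helper: converts whole-number doubles in a preference list to
# integers and leaves every other value (logicals, strings) untouched.
as_integer_prefs <- function(prefs) {
  lapply(prefs, function(v) {
    is_whole <- is.double(v) && length(v) == 1 && !is.na(v) && v == round(v)
    if (is_whole) as.integer(v) else v
  })
}

prefs <- as_integer_prefs(list(marionette = TRUE,
                               network.proxy.socks_port = 3128,
                               network.proxy.type = 1))
str(prefs)
```

Passing a list through such a helper before makeFirefoxProfile is one way to avoid forgetting the suffix, though writing the “L” explicitly works just as well.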
For other configurable options, check out the GitHub source of Selenium and, after opening it, search the page (using the browser’s in-page search) for “set_preference”.
Also check out the RSelenium interactive docs at rdrr.io.