When it comes to web scraping, rvest (check out this official tutorial) is perhaps the best option available for scraping semi-static webpages. I call them “semi-static” because rvest lets you interact a little with dynamic webpages. For a quick reference, check this out.

However, if you have to crawl a page that dynamically changes its content based on user input and interaction, you will probably end up using RSelenium. The GitHub page clearly states that it is meant for Selenium 2. Since then Selenium 3 has been released, and Firefox has been updated to use Marionette (which I know very little about). As a result my older code usually failed, so I started researching how to make RSelenium work again with the updated dependencies. Here is what you need to do to get things done.

Step 1: Get the Selenium standalone server from the official Selenium website. Here is the direct link for Selenium 3.

Step 2: Place it somewhere on your PATH, or in a known directory. [I usually keep it under C:\Dev.]

Step 3: Download GeckoDriver and keep it in a folder (you can use the same folder as in Step 2). Pick the driver build that matches your OS.

Step 4: Install RSelenium in R by running

install.packages("RSelenium")

Step 5: Use RSelenium in R with the following code


rm(list=ls())
options(stringsAsFactors = FALSE)
library(RSelenium)
# The terminal can be made invisible, but keeping it visible at first helps in detecting potential problems.
# Use the proper paths for Selenium and geckodriver.exe.
# Remember to rename the downloaded Selenium jar to selenium-server-standalone.jar,
# the file name startServer looks for in dir.
sel <- startServer(dir = "C:/Dev/Selenium/",
                   javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\""),
                   invisible = FALSE)
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox",
                      extraCapabilities = list(marionette = TRUE))
remDr$open()
# test
remDr$navigate("https://www.google.com/")
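
Once the browser is open, a quick sanity check can confirm that the session really responds. What follows is only a minimal sketch: smoke_test is a hypothetical helper name of mine (it is not part of RSelenium), and it assumes remDr is the session opened above. It relies on the navigate, getTitle and getCurrentUrl methods of the remote driver; the latter two return lists, hence the [[1]].

```r
# Hypothetical helper (not part of RSelenium): load a page and report what the browser sees.
# `remDr` is assumed to be an already opened remoteDriver session.
smoke_test <- function(remDr, url) {
  remDr$navigate(url)
  list(
    title = remDr$getTitle()[[1]],      # page title as a character string
    url   = remDr$getCurrentUrl()[[1]]  # URL the browser ended up on (after redirects)
  )
}

# Usage, with the session opened above:
# res <- smoke_test(remDr, "https://www.google.com/")
# print(res$title)
```

If the title comes back empty or the call hangs, the visible Selenium terminal is usually the first place to look.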

The most important lines of the startup code are highlighted below

# in startServer
javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\"")

# in remoteDriver
extraCapabilities = list(marionette = TRUE)
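
Since a wrong geckodriver path is an easy mistake to make in the first of these two lines, it may help to build that argument with a small function that checks the file first. This is a sketch under my own assumptions: gecko_javaarg is a hypothetical name, and the only thing it does is reproduce the quoted string shown above after confirming that the file exists.

```r
# Hypothetical helper: build the javaargs entry that tells Selenium where geckodriver lives.
gecko_javaarg <- function(driver_path) {
  # Fail early with a readable message instead of a Selenium error later on.
  if (!file.exists(driver_path)) {
    stop("geckodriver not found at: ", driver_path)
  }
  # The path is wrapped in double quotes, as in the startServer call above.
  sprintf("-Dwebdriver.gecko.driver=\"%s\"", driver_path)
}

# Usage (the path is the example location used in this post):
# javaargs = c(gecko_javaarg("C:/Dev/Selenium/geko/geckodriver.exe"))
```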

Here are a few references from which I gathered my knowledge:

  1. Stack Overflow question addressing the actual solution for version compatibility
  2. RSelenium : Headless browsing
  3. Phantomjs & rvest [just intro]
  4. WebDriver <-> Marionette proxy

Extras

A few other aspects need to be considered when running RSelenium for the first time. Frequently used options, including the marionette option mentioned above, are listed below.


sel <- startServer(dir = "C:/Dev/Selenium/",
                   args = c("-port 4455"),
                   javaargs = c("-Dwebdriver.gecko.driver=\"C:/Dev/Selenium/geko/geckodriver.exe\""),
                   invisible = FALSE)
firefox_profile.me <- makeFirefoxProfile(list(
  marionette = TRUE,
  webdriver_accept_untrusted_certs = TRUE,   # for sites with expired certificates (sometimes required for internal sites)
  webdriver_assume_untrusted_issuer = TRUE,  # for the same reason
  browser.download.dir = "C:/temp",          # download directory; it does not have any effect as of now
  network.proxy.socks = "<proxy ip>",        # proxy settings: the proxy host IP
  network.proxy.socks_port = 3128L,          # proxy port; the trailing "L" (integer) is essential, without it the setting has no effect
  network.proxy.type = 1L                    # 1 for manual, 2 for an automatic configuration script; the "L" matters here too
))
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4455,
                      browserName = "firefox",
                      extraCapabilities = firefox_profile.me)
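
The remark about the trailing "L" on the proxy port comes down to how R types numbers. Here is a short illustration in plain R (no Selenium needed): a bare number literal is a double, and only the "L" suffix makes it an integer. The variable names are mine, and I am only showing the R side; that the profile ignores non-integer values is the observation from the comments above, not something this snippet demonstrates.

```r
port_double  <- 3128    # a plain number literal is stored as a double
port_integer <- 3128L   # the L suffix makes it an integer

typeof(port_double)     # "double"
typeof(port_integer)    # "integer"

# A value read from elsewhere (e.g. a config file) can be converted explicitly:
port_from_config <- as.integer("3128")
is.integer(port_from_config)  # TRUE
```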

For other configurable options, check out the Selenium source on GitHub and, once it is open, search within the page (using the browser's in-page search) for “set_preference”.

Also check out the RSelenium interactive docs at rdrr.io.