Goal

The goal of this session is to learn how to get data from the World Wide Web using R. Although we will cover a few concepts first, the core of this session will be spent on getting data from websites that do not offer any interface to automate information retrieval, such as Web services (REST, SOAP) or application programming interfaces (APIs). In that case it is necessary to scrape the information embedded in the website itself.

When you want to extract information or download data from a website, and the data is too large to download manually in an efficient way or needs to be updated frequently, you should first:

  1. Check if the website has any available Web services or if APIs have been developed to this end
  2. Check if an R package (or a package in another language you know) has already been developed by others as a wrapper around the API to facilitate the use of these Web services
  3. Nothing found? Well let’s code this ourselves then!

Which R packages are available?

As usual, a good place to start is the CRAN Task View on Web Technologies to get an idea of the available R packages: https://CRAN.R-project.org/view=WebTechnologies

Here are some of the key packages:

  • httr: tools for working with HTTP requests (GET, POST, authentication, …)
  • curl: a modern binding to libcurl, with full https support
  • xml2: parse and navigate XML and HTML documents
  • rvest: high-level web scraping, built on top of xml2 and httr

There are also functions in the utils package, such as download.file(). Note that these functions offer only limited https support, which was one of the motivations behind the curl R package. A short sketch of both follows.
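
For example, a single file can be fetched with download.file(); the curl package's curl_download() does the same with robust https support. The URL and file names below are only placeholders:

    # Base R: fine for simple downloads (placeholder URL)
    download.file("http://example.org/data/observations.csv",
                  destfile = "observations.csv")

    # curl: same idea, with robust https support
    library(curl)
    curl_download("https://example.org/data/observations.csv",
                  destfile = "observations.csv")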

In this session we are going to use rvest: first for a simple tutorial, followed by a challenge in groups.

Some background

HTTP: Hypertext Transfer Protocol

URL

At the heart of web communications is the request message, which is sent via a Uniform Resource Locator (URL). The basic URL structure is:

    protocol://hostname:port/resource-path?query

The protocol is typically http, or https for secure communications. The default port is 80, but a different one can be set explicitly, as in the structure above. The resource path is the local path to the resource on the server.
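
As a quick check of this structure, httr's parse_url() (we will meet httr again below) splits a URL into its components; the URL used here is just an illustrative example:

    library(httr)

    # Split an example URL into its components
    parts <- parse_url("https://example.org:8080/data/observations?year=2016")
    parts$scheme    # "https"
    parts$hostname  # "example.org"
    parts$port      # "8080"
    parts$path      # "data/observations"
    parts$query     # a named list: year = "2016"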

Request

The actions that should be performed on the host are specified via HTTP verbs. Today we are going to focus on two verbs that are often used in web forms (a short example in R follows the list):

  • GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
  • POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
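
As a small sketch of what these two verbs look like from R, here is how the httr package issues a GET and a POST request; httpbin.org is a public testing service used here purely as an example endpoint:

    library(httr)

    # GET: everything the server needs is in the URL, including the query string
    resp_get <- GET("https://httpbin.org/get", query = list(city = "Santa Barbara"))

    # POST: the data travels in the body of the request, here encoded as a web form
    resp_post <- POST("https://httpbin.org/post",
                      body = list(name = "Funk Zone", type = "neighborhood"),
                      encode = "form")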

Response

Status codes (a quick way to check them from R follows the list):

  • 1xx: Informational Messages
  • 2xx: Successful; the best known is 200 (OK): the request was successfully processed
  • 3xx: Redirection
  • 4xx: Client Error; the famous 404: resource not found
  • 5xx: Server Error
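
As a minimal sketch, again assuming the httpbin.org test service is reachable, the status code of a response can be inspected with httr:

    library(httr)

    # This test endpoint deliberately answers with a 404
    resp <- GET("https://httpbin.org/status/404")

    status_code(resp)         # 404
    http_status(resp)$reason  # "Not Found"

    # stop_for_status() turns 4xx/5xx responses into R errors, handy in scripts
    # stop_for_status(resp)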

HTML

The HyperText Markup Language (HTML) describes and defines the content of a webpage. Other technologies besides HTML are generally used to describe a webpage’s appearance/presentation (CSS) or functionality (JavaScript).

“Hyper Text” in HTML refers to links that connect webpages to one another, either within a single website or between websites. Links are a fundamental aspect of the Web.

HTML uses “markup” to annotate text, images, and other content for display in a Web browser. HTML markup includes special “elements” such as <head>, <title>, <body>, <header>, <footer>, <article>, <section>, <p>, <div>, <span>, <img>, and many others.

Using your web browser, you can inspect the HTML content of any webpage on the World Wide Web.
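
The same inspection can also be done from R. As a small self-contained sketch, rvest's read_html() parses markup into a document whose elements can then be listed; the inline HTML string is invented for the example:

    library(rvest)
    library(xml2)

    # A tiny, self-contained HTML document
    page <- read_html("<html>
                         <head><title>Funk Zone guide</title></head>
                         <body>
                           <h1>Places</h1>
                           <p>Wineries, bars and restaurants.</p>
                         </body>
                       </html>")

    xml_name(page)                        # "html"
    xml_name(xml_children(page))          # "head" "body"
    html_text(html_nodes(page, "title"))  # "Funk Zone guide"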

XML

The eXtensible Markup Language (XML) provides a general approach for representing all types of information, such as data sets containing numerical and categorical variables. XML provides the basic, common, and quite simple structure and syntax for all “dialects” or vocabularies. For example, HTML, SVG and EML are specific vocabularies of XML.
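
As an illustration, here is a tiny XML "dialect" of our own, parsed with the xml2 package; the document and its element names are invented for the example:

    library(xml2)

    venues <- read_xml('<venues>
                          <venue type="winery">Example Winery</venue>
                          <venue type="brewery">Example Brewery</venue>
                        </venues>')

    xml_name(venues)                        # "venues"
    xml_text(xml_children(venues))          # the two venue names
    xml_attr(xml_children(venues), "type")  # "winery" "brewery"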

XPath

XPath is quite simple, yet very powerful. With a syntax similar to a file system hierarchy, it allows you to identify nodes of interest by specifying paths through the tree, based on node names, node content, and a node's relationship to other nodes in the hierarchy. We typically use XPath to locate nodes in a tree and then use R functions to extract data from those nodes and bring the data into R.
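
For instance, reusing a small invented XML document, xml2's xml_find_all() takes an XPath expression:

    library(xml2)

    venues <- read_xml('<venues>
                          <venue type="winery">Example Winery</venue>
                          <venue type="brewery">Example Brewery</venue>
                        </venues>')

    # "//venue" means: any <venue> node, anywhere in the tree
    xml_find_all(venues, "//venue")

    # Predicates ([...]) filter on attributes or content
    xml_text(xml_find_all(venues, "//venue[@type = 'winery']"))  # "Example Winery"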

CSS

Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. CSS describes how elements should be rendered on screen, on paper, in speech, or on other media. In CSS, selectors are used to target the HTML elements on a web page that we want to style. There are a wide variety of CSS selectors available, allowing for fine-grained precision when selecting elements to style.
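
For web scraping, the same selectors can be used to pick elements out rather than style them. As a sketch, with another invented inline page, rvest accepts either a CSS selector or an XPath expression:

    library(rvest)

    page <- read_html('<body>
                         <p class="place">Example Winery</p>
                         <p class="place">Example Brewery</p>
                         <p class="note">Open late on Fridays</p>
                       </body>')

    # CSS selector: <p> elements with class "place"
    html_text(html_nodes(page, "p.place"))

    # The equivalent XPath expression
    html_text(html_nodes(page, xpath = "//p[@class = 'place']"))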

Web scraping workflow

  1. Check for an existing API and existing R packages
  2. Information identification: use your web browser inspector and/or http://selectorgadget.com/ to inspect the content and structure of the webpages you want to extract information from
  3. Choice of strategy: e.g. XPath, CSS selectors, …
  4. Information extraction: choose the relevant R package(s) to accomplish your data extraction and write the code


rvest

rvest is a set of wrapper functions around the xml2 and httr packages.

Main functions

  • read_html: read a webpage into R as XML (document and nodes)
  • html_nodes: extract pieces out of HTML documents using XPath and/or CSS selectors
  • html_attr: extract attributes from HTML, such as href
  • html_text: extract text content

For more information, see the rvest package documentation.
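
Putting these functions together, here is a minimal sketch of a typical extraction; the inline HTML stands in for a real webpage, and the pipe (%>%) re-exported by rvest chains the steps:

    library(rvest)

    page <- read_html('<body>
                         <div class="venue"><a href="http://example.org/winery">Example Winery</a></div>
                         <div class="venue"><a href="http://example.org/brewery">Example Brewery</a></div>
                       </body>')

    page %>% html_nodes("div.venue a") %>% html_text()        # the venue names
    page %>% html_nodes("div.venue a") %>% html_attr("href")  # the venue websites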

Quick example

Let us get started with organizing your evenings. The Funk Zone is a pretty fun part of Santa Barbara, where you can find wineries, bars and restaurants. Check this out: http://santabarbaraca.com/explore-and-discover-santa-barbara/neighborhoods-towns/santa-barbara/the-funk-zone/

We are going to scrape the names of the places and their websites out of this webpage and compile this information into a CSV file, so you will be able to quickly choose where to go to relax at the end of the day. A rough sketch of the full pipeline is shown below, before we walk through the steps.
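
Before going step by step, here is a sketch of the shape of the final script. The CSS selector ".listing a" is a placeholder assumption; the real selector has to be identified by inspecting the page, which is exactly what the next steps do:

    library(rvest)

    url  <- "http://santabarbaraca.com/explore-and-discover-santa-barbara/neighborhoods-towns/santa-barbara/the-funk-zone/"
    page <- read_html(url)

    # Placeholder selector -- replace it with the one found via the inspector / SelectorGadget
    links <- html_nodes(page, ".listing a")

    venues <- data.frame(name    = html_text(links),
                         website = html_attr(links, "href"),
                         stringsAsFactors = FALSE)

    write.csv(venues, "funk_zone.csv", row.names = FALSE)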

1. Look at the website structure using your web browser inspector: