The goal of this session is to learn how to mine PDFs in order to extract information from them. Text mining encompasses a vast field of theoretical approaches and methods with one thing in common: text as input information (Feinerer et al., 2008).
Looking at the Natural Language Processing (NLP) CRAN Task View, you will see that there are many packages available to accomplish this complex task: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.
Here are some important packages:

tm
: provides a framework and the algorithmic background for mining text

quanteda
: a fast and flexible framework for the management, processing, and quantitative analysis of textual data in R. It has very nice features, among which is the ability to find specific words and their context in the text

tidytext
: provides means for text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools

In this quick introduction we are going to use quanteda.
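If they are not on your system yet, both of the packages we rely on below can be installed from CRAN:

# run once to install the packages used in this lesson
install.packages(c("readtext", "quanteda"))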
First, let us load the necessary packages:
library("readtext")
library("quanteda")
# set the path to the folder containing the PDFs (here on Aurora)
pdf_path <- "/tmp/oil_spill_pdfs"
# List the PDFs about the BP oil spill
pdfs <- list.files(path = pdf_path, pattern = 'pdf$', full.names = TRUE)
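A quick sanity check that the files were found; note that the filenames encode the first author and publication year, which we will exploit next:

# list the file names without their full path
basename(pdfs)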
# Import the PDFs into R
spill_texts <- readtext(pdfs,
                        docvarsfrom = "filenames",
                        sep = "_",
                        docvarnames = c("First_author", "Year"))
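readtext returns a data frame containing the extracted text plus the document variables parsed from the filenames; a quick way to inspect its structure (output not shown):

# check the structure; nchar.max truncates the display of the long text field
str(spill_texts, nchar.max = 80)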
# Transform the journal articles into a corpus object
spill_corpus <- corpus(spill_texts)
# Some stats about the journal articles
tokenInfo <- summary(spill_corpus)
## Corpus consisting of 11 documents.
##
## Text Types Tokens Sentences First_author Year
## Arora_2017.pdf 2459 14888 617 Arora 2017
## Harding_2016.pdf 1570 6593 322 Harding 2016
## John_2016.pdf 2364 14363 575 John 2016
## Kolian_2015.pdf 2290 9405 278 Kolian 2015
## Mitch_2016.pdf 692 2066 88 Mitch 2016
## Olson_2016.pdf 2145 8357 316 Olson 2016
## Pietroski_2015.pdf 1767 7253 370 Pietroski 2015
## Turner_2016.pdf 2163 9767 481 Turner 2016
## Vallarino_2017.pdf 2306 9752 401 Vallarino 2017
## wade_2016.pdf 1902 8884 424 wade 2016
## Wilson_2014.pdf 876 2503 79 Wilson 2014
##
## Source: /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
Metadata can also be attached to a corpus. For example, we can add the information that these texts are written in English.
# add metadata to files, in this case that they are written in english
metadoc(spill_corpus, 'language') <- "english"
# visualize corpus structure and contents, now with added metadata
summary(spill_corpus, showmeta = TRUE)
## Corpus consisting of 11 documents.
##
## Text Types Tokens Sentences First_author Year _language
## Arora_2017.pdf 2459 14888 617 Arora 2017 english
## Harding_2016.pdf 1570 6593 322 Harding 2016 english
## John_2016.pdf 2364 14363 575 John 2016 english
## Kolian_2015.pdf 2290 9405 278 Kolian 2015 english
## Mitch_2016.pdf 692 2066 88 Mitch 2016 english
## Olson_2016.pdf 2145 8357 316 Olson 2016 english
## Pietroski_2015.pdf 1767 7253 370 Pietroski 2015 english
## Turner_2016.pdf 2163 9767 481 Turner 2016 english
## Vallarino_2017.pdf 2306 9752 401 Vallarino 2017 english
## wade_2016.pdf 1902 8884 424 wade 2016 english
## Wilson_2014.pdf 876 2503 79 Wilson 2014 english
##
## Source: /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
Do you want only the articles published before 2017? You can subset the corpus using the document variables:
summary(corpus_subset(spill_corpus, Year < 2017))
## Corpus consisting of 9 documents.
##
## Text Types Tokens Sentences First_author Year
## Harding_2016.pdf 1570 6593 322 Harding 2016
## John_2016.pdf 2364 14363 575 John 2016
## Kolian_2015.pdf 2290 9405 278 Kolian 2015
## Mitch_2016.pdf 692 2066 88 Mitch 2016
## Olson_2016.pdf 2145 8357 316 Olson 2016
## Pietroski_2015.pdf 1767 7253 370 Pietroski 2015
## Turner_2016.pdf 2163 9767 481 Turner 2016
## wade_2016.pdf 1902 8884 424 wade 2016
## Wilson_2014.pdf 876 2503 79 Wilson 2014
##
## Source: /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
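corpus_subset() works with any of the document variables; for example, a hypothetical selection of a single author's paper (output not shown):

# keep only the paper(s) whose first author is Turner
summary(corpus_subset(spill_corpus, First_author == "Turner"))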
A very handy quanteda function is kwic() (keywords-in-context), which finds a word and displays it with its surrounding context, here 4 words on each side:

kwic(spill_corpus, "dispersant", 4)
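The third argument controls the size of the context window, and recent quanteda versions also accept multi-word patterns via phrase(). A quick sketch (output not shown):

# widen the context to 10 words on each side of the keyword
kwic(spill_corpus, "dispersant", window = 10)
# match a multi-word pattern (requires a recent quanteda version)
kwic(spill_corpus, phrase("oil spill"), window = 4)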
To compute word frequencies, we first transform the corpus into a document-feature matrix (DFM): a table whose rows are the documents and whose columns count the occurrences of each feature (word). More information about DFMs can be found in quanteda's quick start vignette: http://quanteda.io/articles/quickstart.html. In a nutshell, additional rules can be applied on top of the tokenization process, such as ignoring certain words, punctuation, or case.
# construct the DFM, which is the base object to further analyze the journal articles
spills_DFM <- dfm(spill_corpus, tolower = TRUE, stem = FALSE,
                  remove = c("et", "al", "fig", "table", "ml", "http",
                             stopwords("SMART")),
                  remove_punct = TRUE, remove_numbers = TRUE)
# returns the top 20 frequent words
topfeatures(spills_DFM, 20)
## oil samples spill pahs gulf
## 652 371 305 246 240
## sample bp environmental mexico data
## 193 191 182 168 164
## water study social reputation pah
## 154 147 138 135 135
## concentrations weathering deepwater horizon total
## 133 124 123 122 118
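If you want to restrict further analyses to reasonably common words, the DFM can be trimmed; a minimal sketch (the argument is named min_count in older quanteda versions and min_termfreq in newer ones):

# keep only the words occurring at least 50 times across all articles
spills_DFM_common <- dfm_trim(spills_DFM, min_count = 50)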
Note: You can check which words are included in a stopword list by default:
head(stopwords("english"), 20)
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
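Note that the "SMART" list we used when building the DFM is considerably longer than the default English list; you can compare their sizes:

# number of words in each stopword list
length(stopwords("english"))
length(stopwords("SMART"))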
Quickly visualize the most frequent words:
# set the seed for wordcloud
set.seed(1)
# plots wordcloud
textplot_wordcloud(spills_DFM, min.freq = 60, random.order = FALSE,
                   rot.per = 0.10,
                   colors = RColorBrewer::brewer.pal(8, "Dark2"))
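If you prefer exact numbers over a wordcloud, the same information can be drawn as a simple barplot of the top features:

# barplot of the 20 most frequent words; las = 2 rotates the labels
barplot(topfeatures(spills_DFM, 20), las = 2, cex.names = 0.7)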
Here we group the documents by year of publication; note that this time we also stem the words (stem = TRUE), which is why the features below appear truncated:
spills_DFM_yearly <- dfm(spill_corpus, groups = "Year", tolower = TRUE, stem = TRUE,
                         remove = c("et", "al", "fig", "table", "ml", "http",
                                    stopwords("SMART")),
                         remove_punct = TRUE, remove_numbers = TRUE)
# Sort the features by overall frequency and show the 20 most frequent words
dfm_sort(spills_DFM_yearly)[,1:20]
## Document-feature matrix of: 4 documents, 20 features (11.3% sparse).
## 4 x 20 sparse Matrix of class "dfmSparse"
## features
## docs oil sampl pah spill bp gulf studi concentr water environment
## 2014 13 35 12 13 6 8 3 3 11 13
## 2015 171 137 21 85 15 35 25 19 40 30
## 2016 486 451 304 143 29 128 73 161 132 20
## 2017 99 13 44 96 215 70 102 19 11 123
## features
## docs mexico weather data environ pollut hydrocarbon reput social rate
## 2014 6 0 23 0 4 9 0 0 0
## 2015 25 34 10 43 21 25 0 0 45
## 2016 68 129 114 70 74 100 0 1 88
## 2017 69 2 17 47 51 16 144 137 1
## features
## docs manag
## 2014 0
## 2015 4
## 2016 9
## 2017 116
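Since the years contribute very different numbers of articles (and therefore of words), raw counts can be misleading; one option is to weight the DFM into within-year proportions. A sketch (older quanteda versions take the weighting type as shown below; newer ones use scheme = "prop"):

# turn the counts into proportions of each year's total word count
spills_DFM_prop <- dfm_weight(spills_DFM_yearly, "relfreq")
dfm_sort(spills_DFM_prop)[, 1:10]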
One very powerful feature of quanteda is the ability to group keywords into dictionaries and use them to mine texts:
myDict <- dictionary(list(
  pollution = c("oil", "oiled", "crude", "petroleum", "pahs", "pah", "tph",
                "benzo", "hydrocarbons", "pollution"),
  measurement = c("data", "sample", "samples", "sampling", "study")
))
spills_DFM <- dfm(spill_corpus, dictionary = myDict)
spills_DFM
## Document-feature matrix of: 11 documents, 2 features (0% sparse).
## 11 x 2 sparse Matrix of class "dfmSparse"
## features
## docs pollution measurement
## Arora_2017.pdf 88 69
## Harding_2016.pdf 125 123
## John_2016.pdf 375 214
## Kolian_2015.pdf 196 135
## Mitch_2016.pdf 5 1
## Olson_2016.pdf 180 66
## Pietroski_2015.pdf 48 32
## Turner_2016.pdf 157 32
## Vallarino_2017.pdf 106 40
## wade_2016.pdf 236 172
## Wilson_2014.pdf 55 61
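To take these dictionary counts into other tools (plots, tables), you can coerce the small DFM into a regular data frame with base R; a minimal sketch:

# coerce the sparse matrix into a plain data frame
spills_dict_df <- as.data.frame(as.matrix(spills_DFM))
spills_dict_df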
The above text manipulations are the necessary steps to enable more advanced text analysis, such as topic modeling or computing similarities between texts.
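For example, quanteda can compute document similarities directly from a DFM; a minimal sketch using textstat_simil() (moved to the quanteda.textstats package in recent versions). We rebuild a word-frequency DFM first, since spills_DFM was overwritten by the dictionary counts above:

# rebuild a DFM on the full vocabulary, without stopwords
spills_words_DFM <- dfm(spill_corpus, tolower = TRUE,
                        remove = stopwords("SMART"),
                        remove_punct = TRUE, remove_numbers = TRUE)
# cosine similarity between each pair of articles
textstat_simil(spills_words_DFM, method = "cosine", margin = "documents")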
To learn more:

tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
quanteda package: http://quanteda.io/articles/quickstart.html