Goal

The goal of this session is to learn how to mine PDFs and extract information from them. Text mining encompasses a vast field of theoretical approaches and methods with one thing in common: text as input information (Feiner et al., 2008).

Which R packages are available?

Looking at the Natural Language Processing (NLP) CRAN Task View, you will realize there are a lot of different packages to accomplish this complex task: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.

Among these many options, in this quick introduction we are going to use quanteda.

Analyzing peer-reviewed journal articles about BP's Deepwater Horizon oil spill

First, let us load the necessary packages:

library("readtext")
library("quanteda")

1. Import the PDFs into R

# set the path to the PDFs (here on Aurora)
pdf_path <- "/tmp/oil_spill_pdfs"

# List the PDFs about the BP oil spill
pdfs <- list.files(path = pdf_path, pattern = "pdf$", full.names = TRUE)

# Import the PDFs into R
spill_texts <- readtext(pdfs, 
                        docvarsfrom = "filenames", 
                        sep = "_", 
                        docvarnames = c("First_author", "Year"))
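
Because we used docvarsfrom = "filenames", readtext splits each file name at the underscore and stores the pieces as document variables. As a quick check (a minimal sketch using the spill_texts object just created):

# readtext objects behave like data frames, so the document variables
# parsed from the file names (e.g. "Arora_2017.pdf" -> First_author = "Arora",
# Year = 2017) are available as regular columns
spill_texts$First_author
spill_texts$Year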

2. Create the Corpus object needed for the text analysis

# Transform the journal articles into a corpus object
spill_corpus  <- corpus(spill_texts)

# Some stats about the journal articles
tokenInfo <- summary(spill_corpus)
## Corpus consisting of 11 documents.
## 
##                Text Types Tokens Sentences First_author Year
##      Arora_2017.pdf  2459  14888       617        Arora 2017
##    Harding_2016.pdf  1570   6593       322      Harding 2016
##       John_2016.pdf  2364  14363       575         John 2016
##     Kolian_2015.pdf  2290   9405       278       Kolian 2015
##      Mitch_2016.pdf   692   2066        88        Mitch 2016
##      Olson_2016.pdf  2145   8357       316        Olson 2016
##  Pietroski_2015.pdf  1767   7253       370    Pietroski 2015
##     Turner_2016.pdf  2163   9767       481       Turner 2016
##  Vallarino_2017.pdf  2306   9752       401    Vallarino 2017
##       wade_2016.pdf  1902   8884       424         wade 2016
##     Wilson_2014.pdf   876   2503        79       Wilson 2014
## 
## Source:  /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
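
Beyond summary(), quanteda also provides accessor functions to query the corpus directly. A quick sketch:

# number of documents in the corpus
ndoc(spill_corpus)

# number of tokens in each document
ntoken(spill_corpus)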

Add metadata to the Corpus object

For example, we can add the information that these texts are written in English.

# add metadata to the documents, in this case that they are written in English
metadoc(spill_corpus, 'language') <- "english" 

# visualize corpus structure and contents, now with added metadata
summary(spill_corpus, showmeta = TRUE)
## Corpus consisting of 11 documents.
## 
##                Text Types Tokens Sentences First_author Year _language
##      Arora_2017.pdf  2459  14888       617        Arora 2017   english
##    Harding_2016.pdf  1570   6593       322      Harding 2016   english
##       John_2016.pdf  2364  14363       575         John 2016   english
##     Kolian_2015.pdf  2290   9405       278       Kolian 2015   english
##      Mitch_2016.pdf   692   2066        88        Mitch 2016   english
##      Olson_2016.pdf  2145   8357       316        Olson 2016   english
##  Pietroski_2015.pdf  1767   7253       370    Pietroski 2015   english
##     Turner_2016.pdf  2163   9767       481       Turner 2016   english
##  Vallarino_2017.pdf  2306   9752       401    Vallarino 2017   english
##       wade_2016.pdf  1902   8884       424         wade 2016   english
##     Wilson_2014.pdf   876   2503        79       Wilson 2014   english
## 
## Source:  /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:

Subset the corpus

Do you want only articles before 2017?

summary(corpus_subset(spill_corpus, Year < 2017))
## Corpus consisting of 9 documents.
## 
##                Text Types Tokens Sentences First_author Year
##    Harding_2016.pdf  1570   6593       322      Harding 2016
##       John_2016.pdf  2364  14363       575         John 2016
##     Kolian_2015.pdf  2290   9405       278       Kolian 2015
##      Mitch_2016.pdf   692   2066        88        Mitch 2016
##      Olson_2016.pdf  2145   8357       316        Olson 2016
##  Pietroski_2015.pdf  1767   7253       370    Pietroski 2015
##     Turner_2016.pdf  2163   9767       481       Turner 2016
##       wade_2016.pdf  1902   8884       424         wade 2016
##     Wilson_2014.pdf   876   2503        79       Wilson 2014
## 
## Source:  /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
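
The same logic works with any document variable, not just Year. For example, to keep a single author's article (a minimal sketch):

# keep only the article whose First_author is "Turner"
summary(corpus_subset(spill_corpus, First_author == "Turner"))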

Search for words with context: 4 words on each side of the keyword

# keyword-in-context: show each occurrence of "dispersant" with 4 words of context
kwic(spill_corpus, "dispersant", 4)
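
kwic() returns a table with one row per match, showing the words before and after each hit. Multi-word patterns can be searched as well; a sketch, assuming a quanteda version that provides the phrase() helper:

# keyword-in-context for a two-word phrase
kwic(spill_corpus, phrase("oil spill"), 4)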

3. Build a Document-Feature Matrix (DFM)

More information about DFMs can be found in quanteda's quick-start vignette: http://quanteda.io/articles/quickstart.html. In a nutshell, additional rules can be applied on top of the tokenization process, such as ignoring certain words, punctuation, or case.

# construct the DFM, which is the base object to further analyze the journal articles
spills_DFM <- dfm(spill_corpus, tolower = TRUE, stem = FALSE, 
                  remove = c("et", "al", "fig", "table", "ml", "http",
                             stopwords("SMART")),
                  remove_punct = TRUE, remove_numbers = TRUE)

# list the top 20 most frequent words
topfeatures(spills_DFM, 20) 
##            oil        samples          spill           pahs           gulf 
##            652            371            305            246            240 
##         sample             bp  environmental         mexico           data 
##            193            191            182            168            164 
##          water          study         social     reputation            pah 
##            154            147            138            135            135 
## concentrations     weathering      deepwater        horizon          total 
##            133            124            123            122            118
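
If you want to filter by frequency rather than rank, dfm_trim() keeps only the features above a threshold. A sketch (the argument is called min_termfreq in recent quanteda versions, min_count in older ones):

# keep only words occurring at least 100 times across all articles
dfm_trim(spills_DFM, min_termfreq = 100)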

Note: you can check which words are included in a given stopword list:

head(stopwords("english"), 20)
##  [1] "i"          "me"         "my"         "myself"     "we"        
##  [6] "our"        "ours"       "ourselves"  "you"        "your"      
## [11] "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"

4. Extract information from a Document-Feature Matrix (DFM)

Word cloud

Quickly visualize the most frequent words:

# set the seed for wordcloud
set.seed(1)

# plot the word cloud (words occurring at least 60 times)
textplot_wordcloud(spills_DFM, min.freq = 60, random.order = FALSE,
                   rot.per = .10,
                   colors = RColorBrewer::brewer.pal(8, "Dark2"))

Grouping documents by metadata

Here we are grouping the documents by year of publication:

spills_DFM_yearly <- dfm(spill_corpus, groups = "Year", tolower = TRUE, stem = TRUE, 
                  remove = c("et", "al", "fig", "table", "ml", "http",
                             stopwords("SMART")),
                  remove_punct = TRUE, remove_numbers = TRUE)

# sort features by overall frequency and show the 20 most frequent words
dfm_sort(spills_DFM_yearly)[, 1:20]
## Document-feature matrix of: 4 documents, 20 features (11.3% sparse).
## 4 x 20 sparse Matrix of class "dfmSparse"
##       features
## docs   oil sampl pah spill  bp gulf studi concentr water environment
##   2014  13    35  12    13   6    8     3        3    11          13
##   2015 171   137  21    85  15   35    25       19    40          30
##   2016 486   451 304   143  29  128    73      161   132          20
##   2017  99    13  44    96 215   70   102       19    11         123
##       features
## docs   mexico weather data environ pollut hydrocarbon reput social rate
##   2014      6       0   23       0      4           9     0      0    0
##   2015     25      34   10      43     21          25     0      0   45
##   2016     68     129  114      70     74         100     0      1   88
##   2017     69       2   17      47     51          16   144    137    1
##       features
## docs   manag
##   2014     0
##   2015     4
##   2016     9
##   2017   116
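
Raw counts are hard to compare across years because each year groups a different number of articles. One option is to convert the counts into within-year proportions; a minimal sketch using dfm_weight() (the scheme argument is called "prop" in recent quanteda versions):

# express each year's counts as proportions of that year's tokens
spills_DFM_prop <- dfm_weight(spills_DFM_yearly, scheme = "prop")
dfm_sort(spills_DFM_prop)[, 1:10]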

Searching for concepts using sets of keywords

One very powerful feature of quanteda is the ability to group keywords into dictionaries and use them to mine texts.

myDict <- dictionary(list(pollution = c("oil", "oiled", "crude", "petroleum", "pahs", "pah", "tph", "benzo", "hydrocarbons", "pollution"),
                          measurement = c("data", "sample", "samples", "sampling", "study")))
spills_DFM <- dfm(spill_corpus, dictionary = myDict)

spills_DFM
## Document-feature matrix of: 11 documents, 2 features (0% sparse).
## 11 x 2 sparse Matrix of class "dfmSparse"
##                     features
## docs                 pollution measurement
##   Arora_2017.pdf            88          69
##   Harding_2016.pdf         125         123
##   John_2016.pdf            375         214
##   Kolian_2015.pdf          196         135
##   Mitch_2016.pdf             5           1
##   Olson_2016.pdf           180          66
##   Pietroski_2015.pdf        48          32
##   Turner_2016.pdf          157          32
##   Vallarino_2017.pdf       106          40
##   wade_2016.pdf            236         172
##   Wilson_2014.pdf           55          61

The above text manipulations are the necessary steps to enable more advanced text analysis, such as topic modeling or computing similarities between texts.
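
As a first taste of such analyses, document similarities can be computed directly from a DFM. A minimal sketch, assuming a quanteda version where textstat_simil() is available (in recent releases it has moved to the quanteda.textstats package):

# cosine similarity between the yearly groups, based on the grouped DFM
textstat_simil(spills_DFM_yearly, method = "cosine")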

References and sources