The goal of this session is to learn how to mine PDFs in order to extract information from them. Text mining encompasses a vast field of theoretical approaches and methods with one thing in common: text as input information (Feinerer et al., 2008).
Looking at the Natural Language Processing (NLP) CRAN Task View, you will see that there are many packages available to accomplish this complex task: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.
Here are some important packages:

tm
: provides a framework and the algorithmic background for mining text

quanteda
: a fast and flexible framework for the management, processing, and quantitative analysis of textual data in R. It has very nice features, among which is the ability to find specific words and their context in the text

tidytext
: provides means for text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools

In this quick introduction we are going to use quanteda.
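If they are not on your system yet, both of the packages we rely on below can be installed from CRAN:

# run once to install the packages used in this lesson
install.packages(c("readtext", "quanteda"))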
First, let us load the necessary packages:
library("readtext")
library("quanteda")
# set the path to the folder containing the PDFs (here on Aurora)
pdf_path <- "/tmp/oil_spill_pdfs"
# List the PDFs about the BP oil spill
pdfs <- list.files(path = pdf_path, pattern = 'pdf$', full.names = TRUE)
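A quick sanity check that the files were found; note that the filenames encode the first author and publication year, which we will exploit next:

# list the file names without their full path
basename(pdfs)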
# Import the PDFs into R
spill_texts <- readtext(pdfs,
                        docvarsfrom = "filenames",
                        sep = "_",
                        docvarnames = c("First_author", "Year"))
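readtext returns a data frame containing the extracted text plus the document variables parsed from the filenames; a quick way to inspect its structure (output not shown):

# check the structure; nchar.max truncates the display of the long text field
str(spill_texts, nchar.max = 80)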
# Transform the journal articles into a corpus object
spill_corpus <- corpus(spill_texts)
# Some stats about the journal articles
tokenInfo <- summary(spill_corpus)
## Corpus consisting of 11 documents.
##
## Text Types Tokens Sentences First_author Year
## Arora_2017.pdf 2459 14888 617 Arora 2017
## Harding_2016.pdf 1570 6593 322 Harding 2016
## John_2016.pdf 2364 14363 575 John 2016
## Kolian_2015.pdf 2290 9405 278 Kolian 2015
## Mitch_2016.pdf 692 2066 88 Mitch 2016
## Olson_2016.pdf 2145 8357 316 Olson 2016
## Pietroski_2015.pdf 1767 7253 370 Pietroski 2015
## Turner_2016.pdf 2163 9767 481 Turner 2016
## Vallarino_2017.pdf 2306 9752 401 Vallarino 2017
## wade_2016.pdf 1902 8884 424 wade 2016
## Wilson_2014.pdf 876 2503 79 Wilson 2014
##
## Source: /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
Metadata can also be attached to a corpus. For example, we can add the information that these texts are written in English.
# add metadata to files, in this case that they are written in english
metadoc(spill_corpus, 'language') <- "english"
# visualize corpus structure and contents, now with added metadata
summary(spill_corpus, showmeta = TRUE)
## Corpus consisting of 11 documents.
##
## Text Types Tokens Sentences First_author Year _language
## Arora_2017.pdf 2459 14888 617 Arora 2017 english
## Harding_2016.pdf 1570 6593 322 Harding 2016 english
## John_2016.pdf 2364 14363 575 John 2016 english
## Kolian_2015.pdf 2290 9405 278 Kolian 2015 english
## Mitch_2016.pdf 692 2066 88 Mitch 2016 english
## Olson_2016.pdf 2145 8357 316 Olson 2016 english
## Pietroski_2015.pdf 1767 7253 370 Pietroski 2015 english
## Turner_2016.pdf 2163 9767 481 Turner 2016 english
## Vallarino_2017.pdf 2306 9752 401 Vallarino 2017 english
## wade_2016.pdf 1902 8884 424 wade 2016 english
## Wilson_2014.pdf 876 2503 79 Wilson 2014 english
##
## Source: /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
Do you want only the articles published before 2017? You can subset the corpus using the document variables:
summary(corpus_subset(spill_corpus, Year < 2017))
## Corpus consisting of 9 documents.
##
## Text Types Tokens Sentences First_author Year
## Harding_2016.pdf 1570 6593 322 Harding 2016
## John_2016.pdf 2364 14363 575 John 2016
## Kolian_2015.pdf 2290 9405 278 Kolian 2015
## Mitch_2016.pdf 692 2066 88 Mitch 2016
## Olson_2016.pdf 2145 8357 316 Olson 2016
## Pietroski_2015.pdf 1767 7253 370 Pietroski 2015
## Turner_2016.pdf 2163 9767 481 Turner 2016
## wade_2016.pdf 1902 8884 424 wade 2016
## Wilson_2014.pdf 876 2503 79 Wilson 2014
##
## Source: /home/brun/oss/oss-lessons/data-liberation/* on x86_64 by brun
## Created: Wed Sep 20 14:41:23 2017
## Notes:
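corpus_subset() works with any of the document variables; for example, a hypothetical selection of a single author's paper (output not shown):

# keep only the paper(s) whose first author is Turner
summary(corpus_subset(spill_corpus, First_author == "Turner"))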
A very handy quanteda function is kwic() (keywords-in-context), which finds a word and displays it with its surrounding context, here 4 words on each side:

kwic(spill_corpus, "dispersant", 4)
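The third argument controls the size of the context window, and recent quanteda versions also accept multi-word patterns via phrase(). A quick sketch (output not shown):

# widen the context to 10 words on each side of the keyword
kwic(spill_corpus, "dispersant", window = 10)
# match a multi-word pattern (requires a recent quanteda version)
kwic(spill_corpus, phrase("oil spill"), window = 4)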
To compute word frequencies, we first transform the corpus into a document-feature matrix (DFM): a table whose rows are the documents and whose columns count the occurrences of each feature (word). More information about DFMs can be found in quanteda's quick start vignette: http://quanteda.io/articles/quickstart.html. In a nutshell, additional rules can be applied on top of the tokenization process, such as ignoring certain words, punctuation, or case.
# construct the DFM, which is the base object to further analyze the journal articles
spills_DFM <- dfm(spill_corpus, tolower = TRUE, stem = FALSE,
                  remove = c("et", "al", "fig", "table", "ml", "http",
                             stopwords("SMART")),
                  remove_punct = TRUE, remove_numbers = TRUE)
# returns the top 20 frequent words
topfeatures(spills_DFM, 20)
## oil samples spill pahs gulf
## 652 371 305 246 240
## sample bp environmental mexico data
## 193 191 182 168 164
## water study social reputation pah
## 154 147 138 135 135
## concentrations weathering deepwater horizon total
## 133 124 123 122 118
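If you want to restrict further analyses to reasonably common words, the DFM can be trimmed; a minimal sketch (the argument is named min_count in older quanteda versions and min_termfreq in newer ones):

# keep only the words occurring at least 50 times across all articles
spills_DFM_common <- dfm_trim(spills_DFM, min_count = 50)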
Note: You can check which words are included in a stopword list by default:
head(stopwords("english"), 20)
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
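Note that the "SMART" list we used when building the DFM is considerably longer than the default English list; you can compare their sizes:

# number of words in each stopword list
length(stopwords("english"))
length(stopwords("SMART"))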
Quickly visualize the most frequent words:
# set the seed for wordcloud
set.seed(1)
# plots wordcloud
textplot_wordcloud(spills_DFM, min.freq = 60, random.order = FALSE,
                   rot.per = 0.10,
                   colors = RColorBrewer::brewer.pal(8, "Dark2"))
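If you prefer exact numbers over a wordcloud, the same information can be drawn as a simple barplot of the top features:

# barplot of the 20 most frequent words; las = 2 rotates the labels
barplot(topfeatures(spills_DFM, 20), las = 2, cex.names = 0.7)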
Here we group the documents by year of publication; note that this time we also stem the words (stem = TRUE), which is why the features below appear truncated:
spills_DFM_yearly <- dfm(spill_corpus, groups = "Year", tolower = TRUE, stem = TRUE,
                         remove = c("et", "al", "fig", "table", "ml", "http",
                                    stopwords("SMART")),
                         remove_punct = TRUE, remove_numbers = TRUE)
# Sort the features by overall frequency and show the 20 most frequent words
dfm_sort(spills_DFM_yearly)[,1:20]
## Document-feature matrix of: 4 documents, 20 features (11.3% sparse).
## 4 x 20 sparse Matrix of class "dfmSparse"
## features
## docs oil sampl pah spill bp gulf studi concentr water environment
## 2014 13 35 12 13 6 8 3 3 11 13
## 2015 171 137 21 85 15 35 25 19 40 30
## 2016 486 451 304 143 29 128 73 161 132 20
## 2017 99 13 44 96 215 70 102 19 11 123
## features
## docs mexico weather data environ pollut hydrocarbon reput social rate
## 2014 6 0 23 0 4 9 0 0 0
## 2015 25 34 10 43 21 25 0 0 45
## 2016 68 129 114 70 74 100 0 1 88
## 2017 69 2 17 47 51 16 144 137 1
## features
## docs manag
## 2014 0
## 2015 4
## 2016 9
## 2017 116
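Since the years contribute very different numbers of articles (and therefore of words), raw counts can be misleading; one option is to weight the DFM into within-year proportions. A sketch (older quanteda versions take the weighting type as shown below; newer ones use scheme = "prop"):

# turn the counts into proportions of each year's total word count
spills_DFM_prop <- dfm_weight(spills_DFM_yearly, "relfreq")
dfm_sort(spills_DFM_prop)[, 1:10]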
One very powerful feature of quanteda is the ability to group keywords into dictionaries and use them to mine texts:
myDict <- dictionary(list(
  pollution = c("oil", "oiled", "crude", "petroleum", "pahs", "pah", "tph",
                "benzo", "hydrocarbons", "pollution"),
  measurement = c("data", "sample", "samples", "sampling", "study")
))
spills_DFM <- dfm(spill_corpus, dictionary = myDict)
spills_DFM
## Document-feature matrix of: 11 documents, 2 features (0% sparse).
## 11 x 2 sparse Matrix of class "dfmSparse"
## features
## docs pollution measurement
## Arora_2017.pdf 88 69
## Harding_2016.pdf 125 123
## John_2016.pdf 375 214
## Kolian_2015.pdf 196 135
## Mitch_2016.pdf 5 1
## Olson_2016.pdf 180 66
## Pietroski_2015.pdf 48 32
## Turner_2016.pdf 157 32
## Vallarino_2017.pdf 106 40
## wade_2016.pdf 236 172
## Wilson_2014.pdf 55 61
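To take these dictionary counts into other tools (plots, tables), you can coerce the small DFM into a regular data frame with base R; a minimal sketch:

# coerce the sparse matrix into a plain data frame
spills_dict_df <- as.data.frame(as.matrix(spills_DFM))
spills_dict_df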
The above text manipulations are the necessary steps to enable more advanced text analysis, such as topic modeling or computing similarities between texts.
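For example, quanteda can compute document similarities directly from a DFM; a minimal sketch using textstat_simil() (moved to the quanteda.textstats package in recent versions). We rebuild a word-frequency DFM first, since spills_DFM was overwritten by the dictionary counts above:

# rebuild a DFM on the full vocabulary, without stopwords
spills_words_DFM <- dfm(spill_corpus, tolower = TRUE,
                        remove = stopwords("SMART"),
                        remove_punct = TRUE, remove_numbers = TRUE)
# cosine similarity between each pair of articles
textstat_simil(spills_words_DFM, method = "cosine", margin = "documents")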
To learn more:

tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
quanteda package: http://quanteda.io/articles/quickstart.html