For example, if a word cloud is generated from the lyrics of a
love song, words like ‘love’, ‘think’ and ‘beautiful’ will be shown in a large font size.
I like word clouds because they are a quick and intuitive way
of getting the general idea, or gist, of a body of text.
In this post, I will use a word cloud to visualize the last words said by prisoners on death row before execution.
For this post, I am using the last words of death row inmates
in Texas, which are available here. This dataset contains detailed information
(including last statements) about death row inmates in Texas since 1982.
I used two R packages:
- tm – For text mining. This package requires another package called NLP.
- wordcloud – For creating the word cloud image. This package requires another package called RColorBrewer.
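Both packages are on CRAN. A quick way to see whether anything still needs installing (the package names come from the list above; nothing else is assumed):

```r
# Which of the two packages (and hence their dependencies) are not installed yet?
pkgs <- c("tm", "wordcloud")
missing <- setdiff(pkgs, rownames(installed.packages()))
missing  # anything listed here can be installed with install.packages(missing)
```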
The code for creating the word cloud in R is shown at the end of this post (it
has comments so you can ‘read’ it). The word cloud is shown below:
Word cloud of last words for Texas death-row inmates
INTERPRETING THE WORD CLOUD
From the word cloud, some of the most frequent words are ‘love’,
‘know’, ‘will’, ‘god’, ‘family’ and ‘sorry’. Based on this, one can infer that
most of the death row inmates were telling their family or friends how much
they love them, talking about God and saying that they were sorry for their
actions. The word ‘innocent’ appears in a small font at the lower left corner of the
image, suggesting that some prisoners claimed to be innocent before
execution.
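The cloud only shows relative sizes; to check actual counts, the cleaned statements can be tabulated directly. A minimal base-R sketch (the `statements` vector below is invented stand-in data, not the real dataset):

```r
# Toy stand-in for the cleaned last statements (invented examples)
statements <- c("i love my family", "i am sorry", "god bless my family")
# Split into words and count occurrences
words <- unlist(strsplit(statements, " "))
freq <- sort(table(words), decreasing = TRUE)
freq  # 'family', 'i' and 'my' each appear twice in this toy sample
```

The same idea, applied to the real corpus, is what determines the font sizes in the cloud.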
Conclusion
You have seen a very quick word cloud tutorial in R. This is
the first of many word clouds I will generate for this blog, so expect to see more
in the near future. As always, questions and suggestions are highly welcome.
Just leave a comment and I will do my best to answer your question. Have a
wonderful Easter celebration. Cheers!
R program to generate word cloud
#This is a script to generate a word cloud from the Texas death row dataset
#Clear the workspace
rm(list=ls())
#Load the dataset, make sure you are in the correct directory
texas_data = read.csv("Texas Education Data.csv", stringsAsFactors = FALSE)
#Lets just select the necessary columns -- Last statement and maybe execution number(probably won't use it)
#Column 1 contains execution number, column 11 contains the last statement
last_statements <- texas_data[c(1, 11)]
#Some prisoners did not have a last statement; let's replace those with NA
last_statements$Last.Statement <- ifelse(last_statements$Last.Statement == "No last statement",
NA, last_statements$Last.Statement)
#Create a dataframe without NAs
last_statements_2 <- na.omit(last_statements)
#A few statements evade this rule, but it captures most of the complete statements
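```r
#To see the ifelse + na.omit pattern above in isolation, here is a toy example
#(the data frame below is invented; it is not the real dataset):
toy <- data.frame(no = 1:3,
                  Last.Statement = c("I love you all", "No last statement", "I am sorry"),
                  stringsAsFactors = FALSE)
toy$Last.Statement <- ifelse(toy$Last.Statement == "No last statement",
                             NA, toy$Last.Statement)
nrow(na.omit(toy)) #2 rows survive; the placeholder row is dropped
```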
#Preparing the data for analysis
library(tm) #requires package NLP
#Let's create a corpus of the last statements
#A corpus is collection of documents containing text
last_corpus <- Corpus(VectorSource(last_statements_2$Last.Statement))
#Next perform various transformation on the corpus using the tm_map function
#Convert last statements to lower case and remove numbers
#(tolower is wrapped in content_transformer so the result stays a valid corpus)
last_corpus_clean <- tm_map(last_corpus, content_transformer(tolower))
last_corpus_clean <- tm_map(last_corpus_clean, removeNumbers)
#Remove stopwords from the document
#Stopwords are words that occur very frequently in a language
#so there is no need including them. For example, I, you, we, it, him, her are stopwords in English
last_corpus_clean <- tm_map(last_corpus_clean, removeWords, stopwords())
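```r
#Toy illustration of what stopword removal does, using a tiny hand-made
#stopword list (not tm's full list):
toy_words <- c("i", "love", "my", "family")
toy_stops <- c("i", "my", "you")
toy_words[!toy_words %in% toy_stops] #leaves "love" "family"
```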
#Also, remove punctuations from the document
last_corpus_clean <- tm_map(last_corpus_clean, removePunctuation)
#strip white space
last_corpus_clean <- tm_map(last_corpus_clean, stripWhitespace)
#Convert the documents back to plain text so wordcloud can read them
last_corpus_clean <- tm_map(last_corpus_clean, PlainTextDocument)
#A Document Term Matrix could be built here, but it is not needed for the word cloud
#last_dtm <- DocumentTermMatrix(last_corpus_clean)
#Let's visualize with a word cloud
library(wordcloud) #requires package RColorBrewer
#min.freq specifies how many times a word has to appear before it is displayed in the word cloud
#random.order = FALSE arranges the cloud in non-random order with higher frequency words placed close to the center
wordcloud(last_corpus_clean, min.freq = 40, random.order = FALSE)