Friday, 25 March 2016

WORD CLOUD VISUALIZATION IN R

Hello, and welcome to my blog. In this post I want to demonstrate how to create a word cloud using the R programming language. A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
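
To make the idea of ‘size indicates frequency’ concrete, here is a minimal sketch in base R that counts the words in a made-up sentence. The sentence and the variable names are my own illustration and are not taken from the dataset used in this post.

```r
# Toy text, made up for illustration (not from the dataset)
toy_text <- "love is patient love is kind and love never fails"

# Split the text on whitespace and count how often each word occurs
toy_words <- strsplit(toy_text, "\\s+")[[1]]
word_counts <- sort(table(toy_words), decreasing = TRUE)

# The word with the highest count would get the largest font in a word cloud
print(word_counts)
```

A word cloud is essentially a picture of a table like this one, with the counts mapped to font sizes.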

For example, if a word cloud is generated from the lyrics of a love song, words like ‘love’, ‘think’ and ‘beautiful’ will be shown in a large font size. In this post I will use a word cloud to visualize the last words said by prisoners on death row before execution.

I like word clouds because they are a quick and intuitive way of getting the general idea or gist of a body of text.

For this post, I am using the last words of death row inmates in Texas, which are publicly available. This dataset contains detailed information (including last words) on death row inmates in Texas since 1982.

I used two R packages:
  • tm – For text mining. This package requires another package called NLP.
  • wordcloud – For creating the word cloud image. This package requires another package called RColorBrewer.
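
If you are not sure whether these packages are installed, something like the following sketch can check for them first. The variable names are my own, and the install step is guarded so that it only runs in an interactive session; adapt it to your own setup.

```r
# Packages used in this post; NLP and RColorBrewer are pulled in as dependencies
needed_pkgs <- c("tm", "wordcloud")

# requireNamespace() returns FALSE (instead of stopping) when a package is missing
is_installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed_pkgs[!is_installed]

# Install only what is missing, and only when running interactively
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```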

The code for creating the word cloud in R is shown at the end of this post (it has comments so you can ‘read’ it).

The word cloud is shown below:

Word cloud of last words for Texas death-row inmates

INTERPRETING THE WORD CLOUD
From the word cloud, some of the most frequent words are ‘love’, ‘know’, ‘will’, ‘god’, ‘family’ and ‘sorry’. Based on this, one can infer that many of the death row inmates were telling their family or friends how much they love them, talking about God and saying that they were sorry for their actions. The word ‘innocent’ appears in a small font at the lower left corner of the image, meaning some prisoners were probably claiming to be innocent before execution.
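
A word cloud only gives a rough impression, so it can help to look at the actual counts behind it. The sketch below shows one way this might be done with a term-document matrix from the tm package. The four statements are made up for illustration and are not taken from the dataset; with the real data you would pass in the cleaned corpus instead.

```r
library(tm)

# Hypothetical stand-in for the cleaned statements; not from the dataset
toy_statements <- c("i love my family", "i am sorry",
                    "love you all", "god bless my family")

# Build a small corpus and remove English stopwords
toy_corpus <- VCorpus(VectorSource(toy_statements))
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))

# Rows of the term-document matrix are terms, columns are documents,
# so the row sums give the total count of each term
toy_tdm <- TermDocumentMatrix(toy_corpus)
term_freq <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)

# Show the most frequent terms
head(term_freq, 10)
```

Sorting the counts like this makes it easy to check whether a word that looks large in the image really is among the most frequent.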

Conclusion
You have seen a very quick word cloud tutorial in R. This is the first of many word clouds I will generate for this blog, so expect to see more in the near future. As always, questions and suggestions are highly welcome. Just leave a comment and I will do my best to answer your question. Have a wonderful Easter celebration. Cheers!!!

R program to generate word cloud

#This is a script to generate a word cloud from the Texas death row dataset

#Clear the workspace
rm(list=ls())

#Load the dataset; make sure you are in the correct working directory
texas_data <- read.csv("Texas Education Data.csv", stringsAsFactors = FALSE)

#Let's just select the necessary columns -- last statement and maybe execution number (probably won't use it)
#Column 1 contains execution number, column 11 contains the last statement
last_statements <- texas_data[c(1, 11)]

#Some prisoners did not have a last statement; let's replace those entries with NA
last_statements$Last.Statement <- ifelse(last_statements$Last.Statement == "No last statement",
                                         NA, last_statements$Last.Statement)

#Create a dataframe without NAs
last_statements_2 <- na.omit(last_statements)
#Although there are some statements that dodge this rule, this captures most of the complete statements

#Preparing the data for analysis
library(tm) #requires package NLP

#Let's create a corpus of the last statements
#A corpus is a collection of documents containing text
last_corpus <- Corpus(VectorSource(last_statements_2$Last.Statement))

#Next, perform various transformations on the corpus using the tm_map function

#Convert last statements to lower case and remove numbers
#tolower is a base R function, so wrap it in content_transformer() to keep the documents as tm text documents
last_corpus_clean <- tm_map(last_corpus, content_transformer(tolower))
last_corpus_clean <- tm_map(last_corpus_clean, removeNumbers)

#Remove stopwords from the documents
#Stopwords are words that occur so frequently in a language
#that they carry little meaning on their own. For example, I, you, we, it, him, her are stopwords in English
last_corpus_clean <- tm_map(last_corpus_clean, removeWords, stopwords("english"))

#Also, remove punctuation from the documents
last_corpus_clean <- tm_map(last_corpus_clean, removePunctuation)

#Strip extra white space
last_corpus_clean <- tm_map(last_corpus_clean, stripWhitespace)

#With content_transformer() above, converting back to PlainTextDocument is no longer needed
#last_corpus_clean <- tm_map(last_corpus_clean, PlainTextDocument)
#A document-term matrix is also not needed for the word cloud
#last_dtm <- DocumentTermMatrix(last_corpus_clean)

#Let's visualize with a word cloud
library(wordcloud) #requires package RColorBrewer

#min.freq specifies how many times a word has to appear before it is displayed in the word cloud
#random.order = FALSE arranges the cloud in non-random order with higher frequency words placed close to the center
wordcloud(last_corpus_clean, min.freq = 40, random.order = FALSE)
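
As a possible variation, the wordcloud function also accepts a vector of words with their frequencies, plus a colour palette from RColorBrewer. The sketch below uses made-up counts (they are not results from the dataset) and writes the image to a temporary PNG file; the file name and the seed are arbitrary choices of mine.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical word counts, made up for illustration
toy_freq <- c(love = 12, family = 9, sorry = 7, god = 6, thank = 4, innocent = 2)

# The layout is partly random, so fix the seed to get the same image each time
set.seed(42)

# Write the cloud to a PNG file instead of the screen
out_file <- file.path(tempdir(), "toy_wordcloud.png")
png(out_file, width = 600, height = 600)
wordcloud(words = names(toy_freq), freq = toy_freq,
          min.freq = 1, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
dev.off()
```

Colouring the words by frequency can make the most common terms stand out a little more than font size alone.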