Welcome to Plain Data: WORD CLOUD VISUALIZATION IN R

Hello, welcome to my blog. In this post I want to demonstrate how to create a word cloud using the R programming language. For more information on the R programming language, click here. A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance.

For example, if a word cloud is generated from the lyrics of a love song, words like ‘love’, ‘think’ and ‘beautiful’ will be shown in a large font size. In this post I will use a word cloud to visualize the last words said by prisoners on death row before execution.
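To make the idea of "size follows frequency" concrete, here is a minimal base R sketch that counts the words in a made-up sentence. The sentence and the variable names are assumptions for illustration only; they are not part of the death row dataset or of the script at the end of this post.

```r
# Toy text, made up for illustration only (assumption)
toy_text <- "love is all you need love love is kind"

# Split the lower-cased text on whitespace to get individual words
toy_words <- strsplit(tolower(toy_text), "\\s+")[[1]]

# Count how often each word occurs, most frequent first
word_freq <- sort(table(toy_words), decreasing = TRUE)

print(word_freq)
```

Counts like the ones in word_freq are what a word cloud maps to font size, so ‘love’ would be drawn largest here.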

I like word clouds because they are a quick and intuitive way of getting the general idea or gist of a body of text.

For this post, I am using the last words of death row inmates in Texas, which are available here. This dataset contains detailed information (including last words) on death row inmates in Texas since 1982.

I used two R packages:

tm – For text mining. This package requires another package called NLP.
wordcloud – For creating the word cloud image. This package requires another package called RColorBrewer.

The code for creating the word cloud in R is shown at the end of this post (it has comments so you can ‘read’ it).

The word cloud is shown below:

Word cloud of last words for Texas death-row inmates

INTERPRETING THE WORD CLOUD

From the word cloud, some of the most frequent words are ‘love’, ‘know’, ‘will’, ‘god’, ‘family’ and ‘sorry’. Based on this, one can infer that many of the death row inmates were telling their family or friends how much they love them, talking about God and saying that they were sorry for their actions. The word ‘innocent’ appears in a small font at the lower left corner of the image, meaning some prisoners were probably claiming to be innocent before execution.
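A word cloud only gives a rough visual impression, so it can help to look at the counts behind it. The sketch below is one possible way to do that with tm's TermDocumentMatrix. It uses three made-up sentences rather than the real statements, and toy_docs, toy_corpus, toy_tdm and term_freq are hypothetical names that do not appear in my script.

```r
library(tm)

# Three made-up documents standing in for the real last statements (assumption)
toy_docs <- c("I love my family",
              "I am sorry and I love you",
              "God bless my family")

# Build a small corpus, lower-case it and drop English stopwords
toy_corpus <- VCorpus(VectorSource(toy_docs))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))

# Rows are terms, columns are documents
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Total count of each term across all documents, most frequent first
term_freq <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)

print(term_freq)
```

Applying the same TermDocumentMatrix and rowSums steps to last_corpus_clean from the script should list the words that dominate the cloud, which is a useful check on what the eye picks out.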

CONCLUSION

You have seen a very quick word cloud tutorial in R. This is the first of many word clouds I will generate for this blog. Expect to see more in the near future. As always, questions and suggestions are highly welcome. Just leave a comment and I will do my best to answer your question. Have a wonderful Easter celebration. Cheers!!!

R PROGRAM TO GENERATE WORD CLOUD

#This is a script to generate a word cloud from the Texas death row dataset

#Clear the workspace

rm(list=ls())

#Load the dataset, make sure you are in the correct directory

texas_data = read.csv("Texas Education Data.csv", stringsAsFactors = FALSE)

#Let's just select the necessary columns -- last statement and maybe execution number (probably won't use it)

#Column 1 contains execution number, column 11 contains the last statement

last_statements <- texas_data[c(1, 11)]

#Some prisoners did not have a last statement; let's replace those entries with NA

last_statements$Last.Statement <- ifelse(last_statements$Last.Statement == "No last statement",

NA, last_statements$Last.Statement)

#Create a dataframe without NAs

last_statements_2 <- na.omit(last_statements)

#Although a few statements dodge the rule above, this captures most of the complete statements

#Preparing the data for analysis

library(tm) #requires package NLP

#Let's create a corpus of the last statements

#A corpus is a collection of documents containing text

last_corpus <- Corpus(VectorSource(last_statements_2$Last.Statement))

#Next perform various transformations on the corpus using the tm_map function

#Convert last statements to lower case and remove numbers

#tolower is a base R function, so wrap it in content_transformer() to keep each document a tm text document

last_corpus_clean <- tm_map(last_corpus, content_transformer(tolower))

last_corpus_clean <- tm_map(last_corpus_clean, removeNumbers)

#Remove stopwords from the documents

#Stopwords are words that occur very frequently in a language and carry little meaning,

#so there is no need to include them. For example, I, you, we, it, him, her are stopwords in English

last_corpus_clean <- tm_map(last_corpus_clean, removeWords, stopwords())

#Also, remove punctuation from the documents

last_corpus_clean <- tm_map(last_corpus_clean, removePunctuation)

#Strip extra white space

last_corpus_clean <- tm_map(last_corpus_clean, stripWhitespace)

#Optionally create a Document Term Matrix; this step is not needed for the word cloud

#last_dtm <- DocumentTermMatrix(last_corpus_clean)

#Let's visualize with a word cloud

library(wordcloud) #requires package RColorBrewer

#min.freq specifies how many times a word has to appear before it is displayed in the word cloud

#random.order = FALSE arranges the cloud in non-random order with higher frequency words placed close to the center

wordcloud(last_corpus_clean, min.freq = 40, random.order = FALSE)
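The call above uses wordcloud's default color (black). As a hedged sketch of one variation, wordcloud() also accepts a vector of words with their frequencies and a colors argument, which can take a palette from RColorBrewer. The frequencies below are invented for illustration, and toy_freq and palette_dark are hypothetical names.

```r
library(wordcloud)
library(RColorBrewer)

# Invented word frequencies, for illustration only (assumption)
toy_freq <- c(love = 12, family = 9, sorry = 7, god = 6, know = 5, thank = 4)

# Fix the random seed so the word placement is the same on every run
set.seed(42)

# A palette of 8 colors from RColorBrewer
palette_dark <- brewer.pal(8, "Dark2")

# Draw the cloud; words are colored according to their frequency
wordcloud(words = names(toy_freq), freq = toy_freq,
          min.freq = 1, random.order = FALSE, colors = palette_dark)
```

Passing colors = palette_dark to the wordcloud() call in the script should color the death row cloud in the same way.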

Friday, 25 March 2016
