Arun Gopinath


Can you identify this legendary Malayalam novel? A ggwordcloud experiment in R

Arun Gopinath / 2021-06-17


A legendary novel

Yesterday, out of nowhere, the masterpiece novel that changed the course of Malayalam literature forever popped up in my social media feed. After I read a digital copy (.epub) of it, the idea for this article came to mind. All I have for you is “words”.

Create a word cloud using R

First of all, load the necessary R packages: {readtext} and {tidytext} to load and tokenise our novel, {here} to locate the file, {tidyverse} (the most important package for data wrangling), and {ggplot2} and {ggwordcloud} for visualisation.

library(tidytext)
library(readtext)
library(here)
library(tidyverse)
library(ggplot2)
library(ggwordcloud)

Load files

Next, load the file. (If you follow the clue, you are already heading in the right direction to find this masterpiece novel.) Here, the novel is in ‘.txt’ format. “unnest_tokens” will convert the entire file into a two-column table with one row per word.

k <- readtext(here("static/data","k.txt"))
token_k <- k %>% unnest_tokens(word,text)
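To see what this step produces without the novel itself, here is a minimal sketch on a made-up string. The `demo` data frame and its contents are assumptions for illustration; it only mimics the `doc_id` and `text` columns that readtext returns.

```r
library(tidytext)

# Hypothetical stand-in for the readtext() output: one row per document
demo <- data.frame(
  doc_id = "demo.txt",
  text = "The quick fox. The lazy dog!",
  stringsAsFactors = FALSE
)

# One row per word; by default words are lower-cased and punctuation is dropped
demo_tokens <- unnest_tokens(demo, word, text)
demo_tokens
```

The result keeps `doc_id` and replaces `text` with a `word` column, which is the two-column shape used in the rest of the article.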

A little bit of data wrangling

We are almost there. The only thing we want to visualise is the word count, and “count(word,sort = TRUE)” will get the job done for us. To keep the plot tidy, only words that occur more than 25 times are kept (who loves messy visuals?). In the final step, {ggwordcloud} is applied with a suitable text size and colour. The benefit of {ggwordcloud} is its integration with {ggplot2}. For a cooler look, each word is also given a random tilt, stored in a new column named “angle”.

wcloud <- token_k %>% 
  count(word,sort = TRUE) %>% 
  filter(n > 25) %>%  
  mutate(word = reorder(word,n)) %>% 
  mutate(angle = 45 * sample(-2:2, n(), replace = TRUE, prob = c(1, 1, 4, 1, 1)))
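The counting and tilting steps can be tried on a toy table first. In this sketch the `toy` words and the threshold of 1 are made up; only the `count()`, `filter()` and `sample()` calls mirror the code above. Calling `set.seed()` before `sample()` is an addition here, so that the random tilts come out the same on every run.

```r
library(dplyr)

# Made-up tokens standing in for token_k
toy <- data.frame(word = c("a", "b", "a", "c", "a", "b"))

toy_counts <- toy %>%
  count(word, sort = TRUE) %>%  # a = 3, b = 2, c = 1
  filter(n > 1)                 # keep words seen more than once

# Weights 1:1:4:1:1 over the angles -90, -45, 0, 45, 90
angles <- 45 * (-2:2)
weights <- c(1, 1, 4, 1, 1)
angle_prob <- weights / sum(weights)  # 0 degrees gets 4/8 = 0.5

set.seed(42)  # assumption: seed added so the tilts are reproducible
toy_counts$angle <- sample(angles, nrow(toy_counts), replace = TRUE, prob = weights)
toy_counts
```

With these weights, about half of the words stay horizontal and the rest are spread evenly over the four tilted directions.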

Final step - Visualisation

In the visualisation step, {ggplot2} is used along with the “geom_text_wordcloud_area” function from {ggwordcloud}. Colour is based on the frequency of words in the file: the most frequent words are shown in “red” and the less frequent ones in “blue”. Finally, a cool title and caption are added and styled.

set.seed(42)
kplot <- ggplot(wcloud, aes(label = word, size = n, angle = angle,
                            color = n)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 20) +

  # theme section
  theme_minimal() +
  scale_colour_gradient(low = "blue", high = "red", na.value = NA) +

  # title and caption
  labs(
    title = "Can you identify this legendary Malayalam novel?",
    caption = "Design: Arun Gopinath"
  ) +
  theme(
    plot.title = element_text(color = "red", size = 18, face = "bold", hjust = 0.5),
    plot.caption = element_text(color = "darkgreen", face = "italic")
  )
kplot
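For readers without the text file, the same plotting recipe can be tried on a small made-up table. The words and counts in `toy_cloud` are invented for illustration and have nothing to do with the novel; the geom and scales are the ones used above.

```r
library(ggplot2)
library(ggwordcloud)

# Invented words and counts, only to exercise the plotting code
toy_cloud <- data.frame(
  word = c("river", "house", "night", "road", "rain", "lamp"),
  n = c(60, 45, 40, 32, 28, 26)
)

set.seed(42)  # same seed as above; word placement in ggwordcloud is randomised
toy_plot <- ggplot(toy_cloud, aes(label = word, size = n, color = n)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 20) +
  scale_colour_gradient(low = "blue", high = "red") +
  theme_minimal()
toy_plot
```

Because the word cloud is an ordinary ggplot object, further layers such as `labs()` or `theme()` can be added to `toy_plot` in the usual way.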