Eight to Late

Sensemaking and Analytics for Organizations

A gentle introduction to topic modeling using R

with 27 comments


The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty much what Google and other search engines do routinely…and they do it well.  However, as useful as this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large collection of documents but have no idea what they are about. One of the first things you might want to do is to classify these documents into topics or themes. Among other things this would help you figure out if there’s anything interest while also directing you to the relevant subset(s) of the corpus. For small collections, one could do this by simply going through each document but this is clearly infeasible for corpuses containing thousands of documents.

Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of documents into themes

The article is organised as follows: I first provide some background on topic modelling. The algorithm that I use, Latent Dirichlet Allocation (LDA), involves some pretty heavy maths which I’ll avoid altogether. However, I will provide an intuitive explanation of how LDA works before moving on to a practical example which uses the topicmodels library in R. As in my previous articles in this series (see this post and this one), I will discuss the steps in detail along with explanations and provide accessible references for concepts that cannot be covered in the space of a blog post.

(Aside: Beware, LDA is also an abbreviation for Linear Discriminant Analysis a classification technique that I hope to cover later in my ongoing series on text and data analytics).

Latent Dirichlet Allocation – a math-free introduction

In essence, LDA is a technique that facilitates the automatic discovery of themes in a collection of documents.

The basic assumption behind LDA is that each of the documents in a collection consist of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics – the latter are part of the hidden (or latent) structure of documents. The aim is to infer the latent topic structure given the words and document.  LDA does this by recreating the documents in the corpus by adjusting the relative importance of topics in documents and words in topics iteratively.

Here’s a brief explanation of how the algorithm works, quoted directly from this answer by Edwin Chen on Quora:

  • Go through each document, and randomly assign each word in the document to one of the K topics. (Note: One of the shortcomings of LDA is that one has to specify the number of topics, denoted by K, upfront. More about this later.)
  • This assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
  • So to improve on them, for each document d…
  • ….Go through each word w in d…
  • ……..And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability).  (Note: p(a|b) is the conditional probability of a given that b has already occurred – see this post for more on conditional probabilities)
  • ……..In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
  • After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).

For another simple explanation of how LDA works in, check out  this article by Matthew Jockers. For a more technical exposition, take a look at this video by David Blei, one of the inventors of the algorithm.

The iterative process described in the last point above is implemented using a technique called Gibbs sampling.  I’ll say a bit more about Gibbs sampling later, but you may want to have a look at this paper by Philip Resnick and Eric Hardesty that explains the nitty-gritty of the algorithm (Warning: it involves a fair bit of math, but has some good intuitive explanations as  well).

As a general point, I should also emphasise that you do not need to understand the ins and outs of an algorithm to use it but it does help to understand, at least at a high level, what the algorithm is doing. One needs to develop a feel for algorithms even if one doesn’t understand the details. Indeed, most people working in analytics do not know the details of the algorithms they use, but that doesn’t stop them from using algorithms intelligently. Purists may disagree. I think they are wrong.

Finally – because you’re no doubt wondering  :-) – the term “Dirichlet” in LDA refers to the fact that topics and words are assumed to follow Dirichlet distributions. There is no “good” reason for this apart from convenience – Dirichlet distributions provide good approximations to word distributions in documents and, perhaps more important, are computationally convenient.


As in my previous articles on text mining, I will use a collection of 30 posts from this blog as an example corpus. The corpus can be downloaded here. I will assume that you have R and RStudio installed. Follow this link if you need help with that.

The preprocessing steps are much the same as described in my previous articles.  Nevertheless, I’ll risk boring you with a detailed listing so that you can reproduce my results yourself:


#load text mining library


#set working directory (modify path as needed)


#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(getwd(),pattern=”*.txt”)


#read files into a character vector
files <- lapply(filenames,readLines)


#create corpus from vector
docs <- Corpus(VectorSource(files))


#inspect a particular document in corpus


#start preprocessing
#Transform to lower case
docs <-tm_map(docs,content_transformer(tolower))


#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, ” “, x))})
docs <- tm_map(docs, toSpace, “-“)
docs <- tm_map(docs, toSpace, “’”)
docs <- tm_map(docs, toSpace, “‘”)
docs <- tm_map(docs, toSpace, “•”)
docs <- tm_map(docs, toSpace, “””)
docs <- tm_map(docs, toSpace, ““”)


#remove punctuation
docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords(“english”))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)
#Good practice to check every now and then
#Stem document
docs <- tm_map(docs,stemDocument)


#fix up 1) differences between us and aussie english 2) general errors
docs <- tm_map(docs, content_transformer(gsub),
pattern = “organiz”, replacement = “organ”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “organis”, replacement = “organ”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “andgovern”, replacement = “govern”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “inenterpris”, replacement = “enterpris”)
docs <- tm_map(docs, content_transformer(gsub),
pattern = “team-“, replacement = “team”)
#define and eliminate all custom stopwords
myStopwords <- c(“can”, “say”,”one”,”way”,”use”,
“post”,”look”,”right”,”now”,”think”,”‘ve “,
“‘re “,”anoth”,”put”,”set”,”new”,”good”,
docs <- tm_map(docs, removeWords, myStopwords)
#inspect a document as a check


#Create document-term matrix
dtm <- DocumentTermMatrix(docs)
#convert rownames to filenames
rownames(dtm) <- filenames
#collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))
#length should be total number of terms
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
#List all terms in decreasing order of freq and write to disk

Check out the  preprocessing section in either this article or this one for detailed explanations of the code. The document term matrix (DTM) produced by the above code will be the main input into the LDA algorithm of the next section.

Topic modelling using LDA

We are now ready to do some topic modelling. We’ll use the topicmodels package written by Bettina Gruen and Kurt Hornik. Specifically, we’ll use the LDA function with the Gibbs sampling option mentioned earlier, and I’ll say  more about it in a second. The LDA function has a fairly large number of parameters. I’ll describe these briefly below. For more, please check out this vignette by Gruen and Hornik.

For the most part, we’ll use the default parameter values supplied by the LDA function,custom setting only the parameters that are required by the Gibbs sampling algorithm.

Gibbs sampling works by performing a random walk in such a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of distribution). This is referred to as the burn-in period. We set the burn-in parameter to  4000. Following the burn-in period, we perform 2000 iterations, taking every 500th  iteration for further use.  The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart=5) – that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility),  so I have provided 5 random integers in my seed list. Finally I’ve set best to TRUE (actually a default setting), which instructs the algorithm to return results of the run with the highest posterior probability.

Some words of caution are in order here. It should be emphasised that the settings above do not guarantee  the convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically in specific practical problems such as the one we are dealing with here. The upshot of this is that it is best to do lots of runs with different settings of parameters to check the stability of your results. The bottom line is that our interest is purely practical so it is good enough if the results make sense. We’ll leave issues  of mathematical rigour to those better qualified to deal with them :-)

As mentioned earlier,  there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to this, but they often do not yield semantically meaningful choices of k (see this post on stackoverflow for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice based by inspecting the results. This is what we’ll do.

OK, so the first step is to set these parameters in R… and while we’re at it, let’s also load the topicmodels library (Note: you might need to install this package as it is not a part of the base R installation).

#load topic models library


#Set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE


#Number of topics
k <- 5

That done, we can now do the actual work – run the topic modelling algorithm on our corpus. Here is the code:

#Run LDA using Gibbs sampling
ldaOut <-LDA(dtm,k, method=”Gibbs”, control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))


#write out results
#docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))


#top 6 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,6))


#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)


#Find relative importance of top 2 topics
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)


#Find relative importance of second and third most important topics
topic2ToTopic3 <- lapply(1:nrow(dtm),function(x)


#write to file

The LDA algorithm returns an object that contains a lot of information. Of particular interest to us are the document to topic assignments, the top terms in each topic and the probabilities associated with each of those terms. These are printed out in the first three calls to write.csv above. There are a few important points to note here:

  1. Each document is considered to be a mixture of all topics (5 in this case). The assignments in the first file list the top topic – that is, the one with the highest probability (more about this in point 3 below).
  2. Each topic contains all terms (words) in the corpus, albeit with different probabilities. We list only the top  6 terms in the second file.
  3. The last file lists the probabilities with  which each topic is assigned to a document. This is therefore a 30 x 5 matrix – 30 docs and 5 topics. As one might expect, the highest probability in each row corresponds to the topic assigned to that document.  The “goodness” of the primary assignment (as discussed in point 1) can be assessed by taking the ratio of the highest to second-highest probability and the second-highest to the third-highest probability and so on. This is what I’ve done in the last nine lines of the code above.

Take some time to examine the output and confirm for yourself that that the primary topic assignments are best when the ratios of probabilities discussed in point 3 are highest. You should also experiment with different values of k to see if you can find better topic distributions. In the interests of space I will restrict myself to k = 5.

The table below lists the top 6 terms in topics 1 through 5.

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
1 work question chang system project
2 practic map organ data manag
3 mani time consult model approach
4 flexibl ibi manag design organ
5 differ issu work process decis
6 best plan problem busi problem

The table below lists the document to (primary) topic assignments:


Document Topic
BeyondEntitiesAndRelationships.txt 4
bigdata.txt 4
ConditionsOverCauses.txt 5
EmergentDesignInEnterpriseIT.txt 4
FromInformationToKnowledge.txt 2
FromTheCoalface.txt 1
HeraclitusAndParmenides.txt 3
IroniesOfEnterpriseIT.txt 3
MakingSenseOfOrganizationalChange.txt 5
MakingSenseOfSensemaking.txt 2
ObjectivityAndTheEthicalDimensionOfDecisionMaking.txt 5
OnTheInherentAmbiguitiesOfManagingProjects.txt 5
OrganisationalSurprise.txt 5
ProfessionalsOrPoliticians.txt 3
RitualsInInformationSystemDesign.txt 4
RoutinesAndReality.txt 4
ScapegoatsAndSystems.txt 5
SherlockHolmesFailedProjects.txt 3
sherlockHolmesMgmtFetis.txt 3
SixHeresiesForBI.txt 4
SixHeresiesForEnterpriseArchitecture.txt 3
TheArchitectAndTheApparition.txt 3
TheCloudAndTheGrass.txt 2
TheConsultantsDilemma.txt 3
TheDangerWithin.txt 5
TheDilemmasOfEnterpriseIT.txt 3
TheEssenceOfEntrepreneurship.txt 1
ThreeTypesOfUncertainty.txt 5
UnderstandingFlexibility.txt 1

From a quick perusal of the two tables it appears that the algorithm has done a pretty decent job. For example,topic 4 is about data and system design, and the documents assigned to it are on topic. However, it is far from perfect – for example, the interview I did with Neil Preston on organisational change (MakingSenseOfOrganizationalChange.txt) has been assigned to topic 5, which seems to be about project management. It ought to be associated with Topic 3, which is about change. Let’s see if we can resolve this by looking at probabilities associated with topics.

The table below lists the topic probabilities by document:

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
BeyondEn 0.071 0.064 0.024 0.741 0.1
bigdata. 0.182 0.221 0.182 0.26 0.156
Conditio 0.144 0.109 0.048 0.205 0.494
Emergent 0.121 0.226 0.204 0.236 0.213
FromInfo 0.096 0.643 0.026 0.169 0.066
FromTheC 0.636 0.082 0.058 0.086 0.138
Heraclit 0.137 0.091 0.503 0.162 0.107
IroniesO 0.101 0.088 0.388 0.26 0.162
MakingSe 0.13 0.206 0.262 0.089 0.313
MakingSe 0.09 0.715 0.055 0.067 0.074
Objectiv 0.216 0.078 0.086 0.242 0.378
OnTheInh 0.18 0.234 0.102 0.12 0.364
Organisa 0.089 0.095 0.07 0.092 0.655
Professi 0.155 0.064 0.509 0.128 0.144
RitualsI 0.103 0.064 0.044 0.676 0.112
Routines 0.108 0.042 0.033 0.69 0.127
Scapegoa 0.135 0.088 0.043 0.185 0.549
Sherlock 0.093 0.082 0.398 0.195 0.232
sherlock 0.108 0.136 0.453 0.123 0.18
SixHeres 0.159 0.11 0.078 0.516 0.138
SixHeres 0.104 0.111 0.366 0.212 0.207
TheArchi 0.111 0.221 0.522 0.088 0.058
TheCloud 0.185 0.333 0.198 0.136 0.148
TheConsu 0.105 0.184 0.518 0.096 0.096
TheDange 0.114 0.079 0.037 0.079 0.69
TheDilem 0.125 0.128 0.389 0.261 0.098
TheEssen 0.713 0.059 0.031 0.113 0.084
ThreeTyp 0.09 0.076 0.042 0.083 0.708
TOGAFOrN 0.158 0.232 0.352 0.151 0.107
Understa 0.658 0.065 0.072 0.101 0.105

In the table, the highest probability in each row is in bold. Also, in cases where the maximum and the second/third largest probabilities are close, I have highlighted the second (and third) highest probabilities in red.   It is clear that Neil’s interview (9th document in the above table) has 3  topics with comparable probabilities – topic 5 (project management), topic 3 (change) and topic 2 (issue mapping / ibis), in decreasing order of probabilities. In general, if a document has multiple topics with comparable probabilities, it simply means that the document speaks to all those topics in proportions indicated by the probabilities. A reading of Neil’s interview will convince you that our conversation did indeed range over all those topics.

That said, the algorithm is far from perfect. You might have already noticed a few poor assignments. Here is one – my post on Sherlock Holmes and the case of the failed project has been assigned to topic 3; I reckon it belongs in topic 5. There are a number of others, but I won’t belabor the point, except to reiterate that this precisely why you definitely want to experiment with different settings of the iteration parameters (to check for stability) and, more important, try a range of different values of k to find the optimal number of topics.

To conclude

Topic modelling provides a quick and convenient way to perform unsupervised classification of a corpus of documents.  As always, though, one needs to examine the results carefully to check that they make sense.

I’d like to end with a general observation. Classifying documents is an age-old concern that cuts across disciplines. So it is no surprise that topic modelling has got a look-in from diverse communities. Indeed, when I was reading up and learning about LDA, I found that some of the best introductory articles in the area have been written by academics working in English departments! This is one of the things I love about working in text analysis, there is a wealth of material on the web written from diverse perspectives. The term cross-disciplinary often tends to be a platitude , but in this case it is simply a statement of fact.

I hope that I have been able to convince you to explore this rapidly evolving field. Exciting times ahead, come join the fun.

Written by K

September 29, 2015 at 7:18 pm

27 Responses

Subscribe to comments with RSS.

  1. […] …and my introductory piece on topic modeling. […]


  2. […] If you liked this article, you might want to check out its sequel – an introduction to topic modeling. […]


  3. […] A gentle introduction to topic modeling using R […]


  4. […]  a process which I have dealt with at length in my  introductory pieces on text mining and topic modeling. In fact, the steps are actually identical to those detailed in the second piece. I will therefore […]

    Liked by 1 person

  5. hi
    i want to map topics to wordnet to form document topic representation for better clustering



    February 26, 2016 at 3:47 pm

  6. Thanks for your great tutorial. Just to mention that I am getting a nasty error. The reasons are a bit obscure to me. If I find out what is causing it, I will let you know what it is.


    Error in gsub(sprintf(“(*UCP)\\b(%s)\\b”, paste(sort(words, decreasing = TRUE), :
    input string 1 is invalid UTF-8



    April 5, 2016 at 4:35 am

    • after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error ay more



      July 23, 2016 at 9:39 pm

  7. Thanks so much for this article. Quick question: you wrote that “Each topic contains all terms (words) in the corpus, albeit with different probabilities.” I see the table where the terms for each topic are listed in order of their probabilities, but it is possible to see the probabilities themselves, so as to identify an ‘elbow in the curve’ i.e. where you transition from terms that have a reasonable probability of association with that topic to those that have virtually no probability?



    April 22, 2016 at 4:13 am

  8. thank you so much for this simple and useful post. It helped me alot.m



    May 23, 2016 at 6:08 pm

  9. Hi again. I am searching for an implementation of an Online topic modeling approach, one with the ability to detect new topics and accept new words as new documents arrive. do you know any packages or libraries with these features?



    June 3, 2016 at 7:29 pm

  10. Thank you for such an amazing tutorial!


    Rudraksh Tuwani

    June 7, 2016 at 6:12 pm

  11. Sorry I got this error “Error in is(x, “DocumentTermMatrix”) : object ‘dtm’ not found” I can not proceed what is dtm!!



    June 10, 2016 at 7:30 pm

  12. Sorry I got it, I did not start with pre processing!



    June 10, 2016 at 7:54 pm

  13. I got these errors
    >docs <-tm_map(docs,content_transformer(tolower))
    Warning message:
    In mclapply(content(x), FUN, …) :
    all scheduled cores encountered errors in user code

    -Then I tried this way
    docs writeLines(as.character(docs[[30]]))
    Error in UseMethod(“stripWhitespace”, x) :
    no applicable method for ‘stripWhitespace’ applied to an object of class “try-error”



    June 11, 2016 at 2:14 am

    • Sorry I forget to include some code
      -Then I tried this way
      > docs <- tm_map(docs, toSpace, "-",lazy=TRUE)
      it seemingly goes well, but subsequent operations give this error

      docs writeLines(as.character(docs[[30]]))
      Error in UseMethod(“stripWhitespace”, x) :
      no applicable method for ‘stripWhitespace’ applied to an object of class “try-error

      Liked by 1 person


      June 11, 2016 at 2:17 am

      • Hey Georgetip, I am having the exact same problem. Did you find a way around? Thanks!



        July 30, 2016 at 9:52 am

        • as mentioned above, after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error ay more



          July 31, 2016 at 2:54 am

  14. Hi K,

    Excellent article. I’ve tried it out myself and it works well.

    Quick question – how long did the LDA step take for you?

    Liked by 1 person

    Tom Roth

    July 11, 2016 at 2:49 pm

    • Hi Tom,

      Thanks for your comment! The duration of the LDA step depends on iter and nstart. For the parameter values shown in the code, I think it was a few minutes.





      July 12, 2016 at 6:21 am

      • Thanks Kailash!

        I’d tried it out on a larger dataset (~10000 documents) and the LDA step ran all night without finishing! The perils of machine learning without knowing exactly what you’re doing…

        Liked by 1 person

        Tom Roth

        July 12, 2016 at 9:54 am

        • Indeed…and such experimentation is part of the process (and fun!) of learning machine learning🙂



          July 12, 2016 at 10:08 am

        • Author has given addtional options such as
          burnin <- 4000
          iter <- 2000
          thin <- 500
          seed <-list(2003,5,63,100001,765)
          nstart <- 5
          best <- TRUE

          Instead use the default ones, try this – ldaOut <-LDA(dtm,k, method="Gibbs")

          or try decreasing the number of iterations



          October 4, 2016 at 4:50 pm

  15. Hi. Thanks for this. It’s been really helpful and I’ve managed to run this over my own corpus. I’m getting an error, however, on the final steps – finding the relative importance of topics . I wondered if anyone else has experienced this and can perhaps help?

    When I run the command:

    topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)

    I get an error:

    Error in `[.data.frame`(sort(topicProbabilities[x, ]), k) :
    undefined columns selected


    Craig Hamilton

    August 17, 2016 at 12:52 am

  16. Im having trouble with the docs <- tm_map(docs,stemDocument) command. Apparently it depends on a package called SnowballC which is not compatible with R 3.3.1. Has anyone else had this problem? I tried running an older version of R but then I had trouble loading the tm library. Any thoughts on how to solve this?



    September 23, 2016 at 12:59 am

  17. About topicmodels package

    i want to know parameters’ mean

    What does burnin mean?? why 4000?
    What does iter mean?? why 4000?
    What does thin mean?? why 500?
    What does seed mean?? why (2003,5,63,100001,765)?
    What does nstart mean?? why 5?
    What does best mean??

    > burnin iter thin seed nstart best <- TRUE




    September 25, 2016 at 10:54 pm

  18. Thank you for the code and the clear explanation, Kailash. Does any topic modeling app/tool prevent a term from belonging to more than one topic?


    Vivek Astvansh

    October 22, 2016 at 4:05 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: