Eight to Late

Sensemaking and Analytics for Organizations

A gentle introduction to topic modeling using R

Introduction

The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty much what Google and other search engines do routinely…and they do it well. However, useful as this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large collection of documents but have no idea what they are about. One of the first things you might want to do is classify these documents into topics or themes. Among other things, this would help you figure out whether there’s anything of interest in the collection while also directing you to the relevant subset(s) of the corpus. For small collections one could do this by simply going through each document, but this is clearly infeasible for corpora containing thousands of documents.

Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of documents into themes.

The article is organised as follows: I first provide some background on topic modelling. The algorithm that I use, Latent Dirichlet Allocation (LDA), involves some pretty heavy maths, which I’ll avoid altogether. However, I will provide an intuitive explanation of how LDA works before moving on to a practical example which uses the topicmodels library in R. As in my previous articles in this series (see this post and this one), I will discuss the steps in detail along with explanations and provide accessible references for concepts that cannot be covered in the space of a blog post.

(Aside: Beware, LDA is also an abbreviation for Linear Discriminant Analysis, a classification technique that I hope to cover later in my ongoing series on text and data analytics.)

Latent Dirichlet Allocation – a math-free introduction

In essence, LDA is a technique that facilitates the automatic discovery of themes in a collection of documents.

The basic assumption behind LDA is that each document in a collection consists of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics – the latter are part of the hidden (or latent) structure of documents. The aim is to infer the latent topic structure given the words and documents. LDA does this by recreating the documents in the corpus, iteratively adjusting the relative importance of topics in documents and of words in topics.

Here’s a brief explanation of how the algorithm works, quoted directly from this answer by Edwin Chen on Quora:

  • Go through each document, and randomly assign each word in the document to one of the K topics. (Note: One of the shortcomings of LDA is that one has to specify the number of topics, denoted by K, upfront. More about this later.)
  • This assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
  • So to improve on them, for each document d…
    • Go through each word w in d…
      • And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Note: p(a|b) is the conditional probability of a given that b has already occurred – see this post for more on conditional probabilities.)
      • In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
  • After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).

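To make the procedure concrete, here’s a toy collapsed Gibbs sampler for LDA written from scratch in R. To be clear, this is a minimal illustrative sketch, not the implementation used by the topicmodels package later in this post – the three-document corpus, the hyperparameters alpha and beta, and all variable names are invented for the example.

#A toy collapsed Gibbs sampler for LDA – illustrative only, NOT the
#topicmodels implementation. Corpus and hyperparameters are made up.
set.seed(42)
#toy corpus: each document is a vector of word ids (vocabulary of size V)
docs_toy <- list(c(1,2,3,1), c(2,3,4), c(4,5,5,1))
V <- 5       #vocabulary size
K <- 2       #number of topics
alpha <- 0.1 #Dirichlet prior on document-topic proportions
beta <- 0.01 #Dirichlet prior on topic-word proportions
#step 1: randomly assign each word in each document to one of the K topics
z <- lapply(docs_toy, function(d) sample(1:K, length(d), replace=TRUE))
#build the count matrices implied by the random assignment
nd <- matrix(0, length(docs_toy), K) #document x topic counts
nw <- matrix(0, K, V)                #topic x word counts
for (d in seq_along(docs_toy)) {
  for (i in seq_along(docs_toy[[d]])) {
    topic <- z[[d]][i]; w <- docs_toy[[d]][i]
    nd[d,topic] <- nd[d,topic] + 1
    nw[topic,w] <- nw[topic,w] + 1
  }
}
#step 2: repeatedly remove each word's current assignment and resample its
#topic with probability proportional to p(topic|document) * p(word|topic)
for (sweep in 1:200) {
  for (d in seq_along(docs_toy)) {
    for (i in seq_along(docs_toy[[d]])) {
      w <- docs_toy[[d]][i]; topic <- z[[d]][i]
      nd[d,topic] <- nd[d,topic] - 1
      nw[topic,w] <- nw[topic,w] - 1
      p <- (nd[d,] + alpha) * (nw[,w] + beta) / (rowSums(nw) + V*beta)
      topic <- sample(1:K, 1, prob=p)
      z[[d]][i] <- topic
      nd[d,topic] <- nd[d,topic] + 1
      nw[topic,w] <- nw[topic,w] + 1
    }
  }
}
#the topic mixture of each document, estimated from the final counts
print(nd/rowSums(nd))
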
For another simple explanation of how LDA works, check out this article by Matthew Jockers. For a more technical exposition, take a look at this video by David Blei, one of the inventors of the algorithm.

The iterative process described in the last point above is implemented using a technique called Gibbs sampling. I’ll say a bit more about Gibbs sampling later, but you may want to have a look at this paper by Philip Resnik and Eric Hardisty that explains the nitty-gritty of the algorithm (Warning: it involves a fair bit of math, but has some good intuitive explanations as well).

As a general point, I should also emphasise that you do not need to understand the ins and outs of an algorithm in order to use it, but it does help to understand, at least at a high level, what the algorithm is doing. One needs to develop a feel for algorithms even if one doesn’t understand the details. Indeed, most people working in analytics do not know the details of the algorithms they use, but that doesn’t stop them from using algorithms intelligently. Purists may disagree; I think they are wrong.

Finally – because you’re no doubt wondering 🙂 – the term “Dirichlet” in LDA refers to the fact that topics and words are assumed to follow Dirichlet distributions. There is no “good” reason for this apart from convenience: Dirichlet distributions provide reasonable approximations to word distributions in documents and, perhaps more importantly, are computationally convenient.
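
If you’d like to see what draws from a Dirichlet distribution actually look like, the snippet below may help. It is a side illustration rather than part of the main workflow, and it assumes the gtools package (which provides the rdirichlet function); run install.packages("gtools") first if needed.

#illustration only: draws from a Dirichlet distribution (assumes gtools)
library(gtools)
#five draws of a 3-dimensional topic distribution; small alpha values
#favour sparse mixtures (most of the mass on one topic)
theta <- rdirichlet(5, alpha=c(0.1, 0.1, 0.1))
theta          #each row is a probability vector over 3 topics
rowSums(theta) #each row sums to 1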

Preprocessing

As in my previous articles on text mining, I will use a collection of 30 posts from this blog as an example corpus. The corpus can be downloaded here. I will assume that you have R and RStudio installed. Follow this link if you need help with that.

The preprocessing steps are much the same as described in my previous articles.  Nevertheless, I’ll risk boring you with a detailed listing so that you can reproduce my results yourself:

 

#load text mining library
library(tm)

#set working directory (modify path as needed)
setwd("C:\\Users\\Kailash\\Documents\\TextMining")

#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(getwd(),pattern="*.txt")

#read files into a character vector
files <- lapply(filenames,readLines)

#create corpus from vector
docs <- Corpus(VectorSource(files))

#inspect a particular document in corpus
writeLines(as.character(docs[[30]]))

#start preprocessing
#Transform to lower case
docs <- tm_map(docs,content_transformer(tolower))

#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, "•")
docs <- tm_map(docs, toSpace, "”")
docs <- tm_map(docs, toSpace, "“")

#remove punctuation
docs <- tm_map(docs, removePunctuation)
#Strip digits
docs <- tm_map(docs, removeNumbers)
#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove whitespace
docs <- tm_map(docs, stripWhitespace)
#Good practice to check every now and then
writeLines(as.character(docs[[30]]))
#Stem document
docs <- tm_map(docs,stemDocument)

#fix up 1) differences between us and aussie english 2) general errors
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organiz", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organis", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "andgovern", replacement = "govern")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "inenterpris", replacement = "enterpris")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "team-", replacement = "team")
#define and eliminate all custom stopwords
myStopwords <- c("can", "say","one","way","use",
"also","howev","tell","will",
"much","need","take","tend","even",
"like","particular","rather","said",
"get","well","make","ask","come","end",
"first","two","help","often","may",
"might","see","someth","thing","point",
"post","look","right","now","think","'ve ",
"'re ","anoth","put","set","new","good",
"want","sure","kind","larg","yes","day","etc",
"quit","sinc","attempt","lack","seen","awar",
"littl","ever","moreov","though","found","abl",
"enough","far","earli","away","achiev","draw",
"last","never","brief","bit","entir",
"great","lot")
docs <- tm_map(docs, removeWords, myStopwords)
#inspect a document as a check
writeLines(as.character(docs[[30]]))

#Create document-term matrix
dtm <- DocumentTermMatrix(docs)
#convert rownames to filenames
rownames(dtm) <- filenames
#collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))
#length should be total number of terms
length(freq)
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
#List all terms in decreasing order of freq and write to disk
freq[ord]
write.csv(freq[ord],"word_freq.csv")

Check out the preprocessing section in either this article or this one for detailed explanations of the code. The document term matrix (DTM) produced by the above code will be the main input into the LDA algorithm of the next section.
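
Before moving on, it’s worth sanity-checking the DTM. The two lines below are a suggestion of mine rather than part of the original listing:

#optional sanity checks on the document-term matrix
dim(dtm)               #should be 30 (docs) x the number of distinct terms
inspect(dtm[1:2, 1:5]) #peek at a small corner of the matrix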

Topic modelling using LDA

We are now ready to do some topic modelling. We’ll use the topicmodels package written by Bettina Gruen and Kurt Hornik. Specifically, we’ll use the LDA function with the Gibbs sampling option mentioned earlier; I’ll say more about it in a second. The LDA function has a fairly large number of parameters, which I’ll describe briefly below. For more detail, please check out this vignette by Gruen and Hornik.

For the most part, we’ll use the default parameter values supplied by the LDA function, setting only those parameters that are required by the Gibbs sampling algorithm.

Gibbs sampling works by performing a random walk in such a way that it reflects the characteristics of the desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the burn-in period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, keeping every 500th iteration for further use. The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart=5) – that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so I have provided 5 random integers in my seed list. Finally, I’ve set best to TRUE (actually the default setting), which instructs the algorithm to return the results of the run with the highest posterior probability.

Some words of caution are in order here. It should be emphasised that the settings above do not guarantee convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically for specific practical problems such as the one we are dealing with here. The upshot is that it is best to do lots of runs with different parameter settings to check the stability of your results. The bottom line is that our interest is purely practical, so it is good enough if the results make sense. We’ll leave issues of mathematical rigour to those better qualified to deal with them 🙂

As mentioned earlier, there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to choosing k, but they often do not yield semantically meaningful choices (see this post on stackoverflow for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice based on an inspection of the results. This is what we’ll do.

OK, so the first step is to set these parameters in R… and while we’re at it, let’s also load the topicmodels library (Note: you might need to install this package, as it is not part of the base R installation).

#load topic models library
library(topicmodels)

#Set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

#Number of topics
k <- 5

That done, we can now do the actual work – run the topic modelling algorithm on our corpus. Here is the code:

#Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method="Gibbs", control=list(nstart=nstart, seed=seed, best=best, burnin=burnin, iter=iter, thin=thin))

#write out results
#docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))

#top 6 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,6))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))

#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))

#Find relative importance of top 2 topics
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)
sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])

#Find relative importance of second and third most important topics
topic2ToTopic3 <- lapply(1:nrow(dtm),function(x)
sort(topicProbabilities[x,])[k-1]/sort(topicProbabilities[x,])[k-2])

#write to file
write.csv(topic1ToTopic2,file=paste("LDAGibbs",k,"Topic1ToTopic2.csv"))
write.csv(topic2ToTopic3,file=paste("LDAGibbs",k,"Topic2ToTopic3.csv"))

The LDA algorithm returns an object that contains a lot of information. Of particular interest to us are the document to topic assignments, the top terms in each topic and the probabilities with which topics are assigned to documents. These are written out in the first three calls to write.csv above. There are a few important points to note here:

  1. Each document is considered to be a mixture of all topics (5 in this case). The assignments in the first file list the top topic – that is, the one with the highest probability (more about this in point 3 below).
  2. Each topic contains all terms (words) in the corpus, albeit with different probabilities. We list only the top 6 terms in the second file.
  3. The last file lists the probabilities with which each topic is assigned to a document. This is therefore a 30 x 5 matrix – 30 docs and 5 topics. As one might expect, the highest probability in each row corresponds to the topic assigned to that document. The “goodness” of the primary assignment (as discussed in point 1) can be assessed by taking the ratio of the highest to second-highest probability, the second-highest to the third-highest probability and so on. This is what I’ve done in the last few lines of the code above.
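
As an aside, you can also interrogate the returned object directly rather than going via the CSV files. A couple of suggestions along those lines (the gamma slot is the same document-topic probability matrix written out above):

#the fitted model is an S4 object; list its slots
slotNames(ldaOut)
#ldaOut@gamma holds the 30 x 5 document-topic probability matrix
dim(ldaOut@gamma)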

Take some time to examine the output and confirm for yourself that the primary topic assignments are best when the ratios of probabilities discussed in point 3 are highest. You should also experiment with different values of k to see if you can find better topic distributions. In the interests of space, I will restrict myself to k = 5.
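
If you do want to compare several values of k, a simple loop along the following lines is one way to go about it. This is a sketch of the experiment rather than a rigorous model-selection procedure: it refits the model for each candidate k (with shorter, and hence cheaper, sampling settings that I’ve picked arbitrarily) and prints the top terms so you can judge their semantic coherence, along with the log-likelihood as a rough numerical guide.

#sketch: refit the model for several candidate values of k
candidate_k <- c(3, 5, 8, 10)
for (kk in candidate_k) {
  fit <- LDA(dtm, kk, method="Gibbs",
             control=list(burnin=1000, iter=1000, seed=2003))
  cat("k =", kk, " log-likelihood =", as.numeric(logLik(fit)), "\n")
  print(terms(fit, 6)) #eyeball the topics for semantic coherence
}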

The table below lists the top 6 terms in topics 1 through 5.

Rank   Topic 1   Topic 2    Topic 3   Topic 4   Topic 5
1      work      question   chang     system    project
2      practic   map        organ     data      manag
3      mani      time       consult   model     approach
4      flexibl   ibi        manag     design    organ
5      differ    issu       work      process   decis
6      best      plan       problem   busi      problem

The table below lists the document to (primary) topic assignments:

 

Document Topic
BeyondEntitiesAndRelationships.txt 4
bigdata.txt 4
ConditionsOverCauses.txt 5
EmergentDesignInEnterpriseIT.txt 4
FromInformationToKnowledge.txt 2
FromTheCoalface.txt 1
HeraclitusAndParmenides.txt 3
IroniesOfEnterpriseIT.txt 3
MakingSenseOfOrganizationalChange.txt 5
MakingSenseOfSensemaking.txt 2
ObjectivityAndTheEthicalDimensionOfDecisionMaking.txt 5
OnTheInherentAmbiguitiesOfManagingProjects.txt 5
OrganisationalSurprise.txt 5
ProfessionalsOrPoliticians.txt 3
RitualsInInformationSystemDesign.txt 4
RoutinesAndReality.txt 4
ScapegoatsAndSystems.txt 5
SherlockHolmesFailedProjects.txt 3
sherlockHolmesMgmtFetis.txt 3
SixHeresiesForBI.txt 4
SixHeresiesForEnterpriseArchitecture.txt 3
TheArchitectAndTheApparition.txt 3
TheCloudAndTheGrass.txt 2
TheConsultantsDilemma.txt 3
TheDangerWithin.txt 5
TheDilemmasOfEnterpriseIT.txt 3
TheEssenceOfEntrepreneurship.txt 1
ThreeTypesOfUncertainty.txt 5
TOGAFOrNotTOGAF.txt 3
UnderstandingFlexibility.txt 1

From a quick perusal of the two tables, it appears that the algorithm has done a pretty decent job. For example, topic 4 is about data and system design, and the documents assigned to it are on topic. However, it is far from perfect – for example, the interview I did with Neil Preston on organisational change (MakingSenseOfOrganizationalChange.txt) has been assigned to topic 5, which seems to be about project management; it ought to be associated with topic 3, which is about change. Let’s see if we can resolve this by looking at the probabilities associated with topics.

The table below lists the topic probabilities by document:

Document Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
BeyondEn 0.071 0.064 0.024 0.741 0.1
bigdata. 0.182 0.221 0.182 0.26 0.156
Conditio 0.144 0.109 0.048 0.205 0.494
Emergent 0.121 0.226 0.204 0.236 0.213
FromInfo 0.096 0.643 0.026 0.169 0.066
FromTheC 0.636 0.082 0.058 0.086 0.138
Heraclit 0.137 0.091 0.503 0.162 0.107
IroniesO 0.101 0.088 0.388 0.26 0.162
MakingSe 0.13 0.206 0.262 0.089 0.313
MakingSe 0.09 0.715 0.055 0.067 0.074
Objectiv 0.216 0.078 0.086 0.242 0.378
OnTheInh 0.18 0.234 0.102 0.12 0.364
Organisa 0.089 0.095 0.07 0.092 0.655
Professi 0.155 0.064 0.509 0.128 0.144
RitualsI 0.103 0.064 0.044 0.676 0.112
Routines 0.108 0.042 0.033 0.69 0.127
Scapegoa 0.135 0.088 0.043 0.185 0.549
Sherlock 0.093 0.082 0.398 0.195 0.232
sherlock 0.108 0.136 0.453 0.123 0.18
SixHeres 0.159 0.11 0.078 0.516 0.138
SixHeres 0.104 0.111 0.366 0.212 0.207
TheArchi 0.111 0.221 0.522 0.088 0.058
TheCloud 0.185 0.333 0.198 0.136 0.148
TheConsu 0.105 0.184 0.518 0.096 0.096
TheDange 0.114 0.079 0.037 0.079 0.69
TheDilem 0.125 0.128 0.389 0.261 0.098
TheEssen 0.713 0.059 0.031 0.113 0.084
ThreeTyp 0.09 0.076 0.042 0.083 0.708
TOGAFOrN 0.158 0.232 0.352 0.151 0.107
Understa 0.658 0.065 0.072 0.101 0.105

In the table, the highest probability in each row corresponds to the primary topic assignment; the cases of interest are those in which the maximum and the second/third largest probabilities are close. It is clear that Neil’s interview (the 9th document in the above table) has 3 topics with comparable probabilities – topic 5 (project management), topic 3 (change) and topic 2 (issue mapping / ibis), in decreasing order of probability. In general, if a document has multiple topics with comparable probabilities, it simply means that the document speaks to all those topics in the proportions indicated by the probabilities. A reading of Neil’s interview will convince you that our conversation did indeed range over all those topics.
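
To see this directly in R, you can sort the probabilities for a single document. A one-line suggestion (document 9 is Neil’s interview in this corpus):

#topic probabilities for document 9, sorted in decreasing order
round(sort(unlist(topicProbabilities[9,]), decreasing=TRUE), 3)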

That said, the algorithm is far from perfect. You might have already noticed a few poor assignments. Here is one: my post on Sherlock Holmes and the case of the failed project has been assigned to topic 3; I reckon it belongs in topic 5. There are a number of others, but I won’t belabour the point, except to reiterate that this is precisely why you want to experiment with different settings of the iteration parameters (to check for stability) and, more importantly, try a range of different values of k to find the optimal number of topics.

To conclude

Topic modelling provides a quick and convenient way to perform unsupervised classification of a corpus of documents. As always, though, one needs to examine the results carefully to check that they make sense.

I’d like to end with a general observation. Classifying documents is an age-old concern that cuts across disciplines, so it is no surprise that topic modelling has got a look-in from diverse communities. Indeed, when I was reading up and learning about LDA, I found that some of the best introductory articles in the area have been written by academics working in English departments! This is one of the things I love about working in text analysis: there is a wealth of material on the web written from diverse perspectives. The term cross-disciplinary often tends to be a platitude, but in this case it is simply a statement of fact.

I hope that I have been able to convince you to explore this rapidly evolving field. Exciting times ahead – come join the fun.

Written by K

September 29, 2015 at 7:18 pm

Comments



  1. hi
    i want to map topics to wordnet to form document topic representation for better clustering


    shakeel

    February 26, 2016 at 3:47 pm

  2. Thanks for your great tutorial. Just to mention that I am getting a nasty error. The reasons are a bit obscure to me. If I find out what is causing it, I will let you know what it is.

    writeLines(as.character(docs[[30]]))

    Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
    input string 1 is invalid UTF-8


    Hendrik

    April 5, 2016 at 4:35 am

    • after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error any more


      Achim

      July 23, 2016 at 9:39 pm

  3. Thanks so much for this article. Quick question: you wrote that “Each topic contains all terms (words) in the corpus, albeit with different probabilities.” I see the table where the terms for each topic are listed in order of their probabilities, but is it possible to see the probabilities themselves, so as to identify an ‘elbow in the curve’, i.e. where you transition from terms that have a reasonable probability of association with that topic to those that have virtually no probability?


    Jake

    April 22, 2016 at 4:13 am

  4. Thank you so much for this simple and useful post. It helped me a lot.


    zahra

    May 23, 2016 at 6:08 pm

  5. Hi again. I am searching for an implementation of an online topic modeling approach, one with the ability to detect new topics and accept new words as new documents arrive. Do you know of any packages or libraries with these features?


    zahra

    June 3, 2016 at 7:29 pm

  6. Thank you for such an amazing tutorial!


    Rudraksh Tuwani

    June 7, 2016 at 6:12 pm

  7. Sorry, I got this error “Error in is(x, "DocumentTermMatrix") : object 'dtm' not found”. I cannot proceed – what is dtm?!


    Georgetigp

    June 10, 2016 at 7:30 pm

  8. Sorry, I got it – I did not start with the preprocessing!


    Georgetigp

    June 10, 2016 at 7:54 pm

  9. I got these errors
    >docs <-tm_map(docs,content_transformer(tolower))
    Warning message:
    In mclapply(content(x), FUN, …) :
    all scheduled cores encountered errors in user code

    -Then I tried this way
    docs writeLines(as.character(docs[[30]]))
    Error in UseMethod("stripWhitespace", x) :
    no applicable method for 'stripWhitespace' applied to an object of class "try-error"


    Georgetigp

    June 11, 2016 at 2:14 am

    • Sorry, I forgot to include some code.
      -Then I tried this way
      > docs <- tm_map(docs, toSpace, "-", lazy=TRUE)
      it seemingly goes well, but subsequent operations give this error

      docs writeLines(as.character(docs[[30]]))
      Error in UseMethod("stripWhitespace", x) :
      no applicable method for 'stripWhitespace' applied to an object of class "try-error"


      Georgetigp

      June 11, 2016 at 2:17 am

      • Hey Georgetigp, I am having the exact same problem. Did you find a way around it? Thanks!


        carlos

        July 30, 2016 at 9:52 am

        • as mentioned above, after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error any more


          achim

          July 31, 2016 at 2:54 am

  10. Hi K,

    Excellent article. I’ve tried it out myself and it works well.

    Quick question – how long did the LDA step take for you?


    Tom Roth

    July 11, 2016 at 2:49 pm

    • Hi Tom,

      Thanks for your comment! The duration of the LDA step depends on iter and nstart. For the parameter values shown in the code, I think it was a few minutes.

      Regards,

      Kailash.


      K

      July 12, 2016 at 6:21 am

      • Thanks Kailash!

        I’d tried it out on a larger dataset (~10000 documents) and the LDA step ran all night without finishing! The perils of machine learning without knowing exactly what you’re doing…


        Tom Roth

        July 12, 2016 at 9:54 am

        • Indeed…and such experimentation is part of the process (and fun!) of learning machine learning 🙂


          K

          July 12, 2016 at 10:08 am

        • The author has given additional options such as
          burnin <- 4000
          iter <- 2000
          thin <- 500
          seed <-list(2003,5,63,100001,765)
          nstart <- 5
          best <- TRUE

          Instead use the default ones, try this – ldaOut <-LDA(dtm,k, method="Gibbs")

          or try decreasing the number of iterations


          Saurabh

          October 4, 2016 at 4:50 pm

        • exactly… I encountered the same problem


          PAUL

          March 24, 2017 at 4:35 pm

  11. Hi. Thanks for this. It’s been really helpful and I’ve managed to run this over my own corpus. I’m getting an error, however, on the final steps – finding the relative importance of topics. I wondered if anyone else has experienced this and can perhaps help?

    When I run the command:

    topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)
    sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])

    I get an error:

    Error in `[.data.frame`(sort(topicProbabilities[x, ]), k) :
    undefined columns selected


    Craig Hamilton

    August 17, 2016 at 12:52 am

  12. I’m having trouble with the docs <- tm_map(docs,stemDocument) command. Apparently it depends on a package called SnowballC which is not compatible with R 3.3.1. Has anyone else had this problem? I tried running an older version of R but then I had trouble loading the tm library. Any thoughts on how to solve this?


    Jacob

    September 23, 2016 at 12:59 am

  13. About the topicmodels package

    I want to know what the parameters mean.

    What does burnin mean? Why 4000?
    What does iter mean? Why 2000?
    What does thin mean? Why 500?
    What does seed mean? Why (2003,5,63,100001,765)?
    What does nstart mean? Why 5?
    What does best mean?

    > burnin <- 4000
    > iter <- 2000
    > thin <- 500
    > seed <- list(2003,5,63,100001,765)
    > nstart <- 5
    > best <- TRUE

    SOMEBODY HELP ME!! PLEASE


    LIM

    September 25, 2016 at 10:54 pm

  14. Thank you for the code and the clear explanation, Kailash. Does any topic modeling app/tool prevent a term from belonging to more than one topic?


    Vivek Astvansh

    October 22, 2016 at 4:05 am

  15. Hi Kailash, this is an excellent article!! It really helps researchers like me. I have a question with respect to running LDA using Gibbs sampling. As I see it, LDA() allows for providing an optional parameter of seed words with weights. Is this conceptually the same as z-label LDA (http://pages.cs.wisc.edu/~andrzeje/research/zl_lda.html)? It would be great if you could provide an example or pointers on how to input the seed words with weights.

    Thanks, BSS


    SBS

    January 22, 2017 at 10:01 am

  16. Great write-up, thanks. I’m making great use of it.
    However, I have noticed some errors in your code snippets… methodologically speaking.

    The stemmer in the latest tm package (from SnowballC) requires plain text before stemming to get it to work properly:

    corpus <- tm_map(corpus, PlainTextDocument)
    corpus <- tm_map(corpus, stemDocument)

    http://stackoverflow.com/questions/36967573/stemming-words-using-tm-package-in-r-does-not-work-properly

    Also, the unicode punctuation removal can be simplified…

    #remove special unicode chars
    corpus <- tm_map(corpus, function(x) iconv(x,'UTF-8', 'ASCII', sub=' '))

    http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files/9935242#9935242


    clancy

    February 20, 2017 at 2:23 pm

    • Thanks a ton for catching the errors and taking the time to bring them to my attention. Much appreciated!

      Regards,

      Kailash.


      K

      February 20, 2017 at 2:27 pm


  17. Hi all,
    Thanks for this useful explanation of how LDA works.
    I have a question in regards to the files or “documents”. Rather than having separate Word or text files with the data, I have my data in an Excel file. My Excel file has only one column and as many rows as documents (each cell has a different text).
    Can I preprocess this Excel file? How can I run the LDA with this data structure?
    Thank you in advance.


    Marta

    March 4, 2017 at 3:08 am


  18. Thanks for the information about the LDA implementation. But I am not able to get the names of the files in the output csv file – instead it just gives a series of numbers. Is there something that can be done to rectify this issue?


    samuel Benadict

    March 28, 2017 at 12:55 pm

  19. Dear Kailash,

    Thank you so much for this excellent intro to topic modelling! It has helped me a lot.

    One (very small) thing I have noticed about the data cleaning steps is that you remove punctuation before you remove stopwords. However, the stopwords contain words like “isn’t” which won’t be found if you remove punctuation first. So I have switched the two around. Is there any reason you did it the other way around?

    Also, the very first thing I now do after I load my data into R is to convert it to ASCII so I won’t encounter any problems with any special characters, which helps immensely.

    files2 <- stringi::stri_trans_general(files, "latin-ascii")

    Best regards,
    Sarah


    Sarah

    March 29, 2017 at 7:46 pm


  20. Hi and thanks for this tutorial.

    Do you think that topic modeling could be used in the case of a dataset (NOT a text corpus) with a big set of dummy vars (150+) built from a couple of discrete attributes? Here, CA usually fails due to the dummies matrix being too sparse.

    best regards,
    gabriele


    Gabriele

    June 18, 2017 at 7:59 pm


  21. Interesting blog and easy-to-follow steps. I am trying a similar analysis with tweets and their users. I am curious about how you got the document-to-topic mapping and the table with topic probabilities by document. It would be nice if you could throw some light on that part as well.


    Divya Iyer

    October 11, 2017 at 6:33 pm


  22. Great, great, great tutorial, and we’re using it as an important guide for our study. While we can almost replicate everything, we found that the code rownames(dtm) <- filenames gives the error “Error in rownames(dtm <- filenames) : object 'filenames' not found”. We don’t know why, and even what our file names are exactly. Can anybody help? Thank you!


    Cathy Chen

    October 9, 2019 at 7:10 am

    • Hi Cathy,

      Thanks for reading and for your feedback. The filenames variable is created in the third line of code:

      filenames <- list.files(getwd(),pattern="*.txt")

      The line you're referring to changes rownames to match the filenames (for easier reference).

      Hope this helps.

      Regards,

      Kailash.


      K

      October 10, 2019 at 8:25 am


  23. “Each row of the input matrix needs to contain at least one non-zero entry”
    How can I solve this issue?


    marisa

    February 13, 2020 at 4:26 am


  24. Hi K,

    Thank you so much for the non-math introduction, that really helps me understand it. Would you happen to have an article that can explain the CTM side of this (in a non-math way!)? Please let me know if so, thank you!


    nana

    July 15, 2021 at 9:06 am

