A gentle introduction to topic modeling using R

Introduction

The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty much what Google and other search engines do routinely…and they do it well. However, as useful as this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large collection of documents but have no idea what they are about. One of the first things you might want to do is to classify these documents into topics or themes. Among other things this would help you figure out if there’s anything of interest while also directing you to the relevant subset(s) of the corpus. For small collections, one could do this by simply going through each document, but this is clearly infeasible for corpora containing thousands of documents.

Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of documents into themes.

The article is organised as follows: I first provide some background on topic modelling. The algorithm that I use, Latent Dirichlet Allocation (LDA), involves some pretty heavy maths which I’ll avoid altogether. However, I will provide an intuitive explanation of how LDA works before moving on to a practical example which uses the topicmodels library in R. As in my previous articles in this series (see this post and this one), I will discuss the steps in detail along with explanations and provide accessible references for concepts that cannot be covered in the space of a blog post.

(Aside: Beware, LDA is also an abbreviation for Linear Discriminant Analysis, a classification technique that I hope to cover later in my ongoing series on text and data analytics.)

Latent Dirichlet Allocation – a math-free introduction

In essence, LDA is a technique that facilitates the automatic discovery of themes in a collection of documents.

The basic assumption behind LDA is that each of the documents in a collection consists of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics – the latter are part of the hidden (or latent) structure of documents. The aim is to infer the latent topic structure given the words and documents. LDA does this by recreating the documents in the corpus, iteratively adjusting the relative importance of topics in documents and of words in topics.

Here’s a brief explanation of how the algorithm works, quoted directly from this answer by Edwin Chen on Quora:

Go through each document, and randomly assign each word in the document to one of the K topics. (Note: One of the shortcomings of LDA is that one has to specify the number of topics, denoted by K, upfront. More about this later.)
This assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
So to improve on them, for each document d…
….Go through each word w in d…
……..And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Note: p(a|b) is the conditional probability of a given that b has already occurred – see this post for more on conditional probabilities)
……..In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).

For another simple explanation of how LDA works, check out this article by Matthew Jockers. For a more technical exposition, take a look at this video by David Blei, one of the inventors of the algorithm.

The iterative process described in the last point above is implemented using a technique called Gibbs sampling. I’ll say a bit more about Gibbs sampling later, but you may want to have a look at this paper by Philip Resnik and Eric Hardisty that explains the nitty-gritty of the algorithm (Warning: it involves a fair bit of math, but has some good intuitive explanations as well).
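To make the quoted steps a little more concrete, here is a minimal sketch of the resampling loop in base R. It is a toy illustration only, not the implementation used by the topicmodels package. The three tiny documents, the variable names and the smoothing constants alpha and beta are my own assumptions (the smoothing simply stops a probability from becoming exactly zero when a count is empty).

```r
# Toy sketch of the word-by-word topic resampling described above.
# All data and names here are illustrative assumptions.
set.seed(42)

toy_docs <- list(
  c("data", "model", "system", "data", "design"),
  c("project", "manag", "risk", "project", "plan"),
  c("data", "system", "project", "manag", "model")
)
K     <- 2      # number of topics, fixed up front
alpha <- 0.1    # assumed smoothing for topics within documents
beta  <- 0.01   # assumed smoothing for words within topics
vocab <- unique(unlist(toy_docs))
V     <- length(vocab)

# Randomly assign each word in each document to one of the K topics
z <- lapply(toy_docs, function(d) sample.int(K, length(d), replace = TRUE))

# Count tables implied by the current assignments
doc_topic  <- matrix(0, length(toy_docs), K)
topic_word <- matrix(0, K, V, dimnames = list(NULL, vocab))
for (d in seq_along(toy_docs)) {
  for (i in seq_along(toy_docs[[d]])) {
    cur <- z[[d]][i]
    w   <- toy_docs[[d]][i]
    doc_topic[d, cur]  <- doc_topic[d, cur] + 1
    topic_word[cur, w] <- topic_word[cur, w] + 1
  }
}

# Repeatedly resample the topic of every word, holding all others fixed
for (sweep in 1:200) {
  for (d in seq_along(toy_docs)) {
    for (i in seq_along(toy_docs[[d]])) {
      cur <- z[[d]][i]
      w   <- toy_docs[[d]][i]
      # take the current word out of the counts
      doc_topic[d, cur]  <- doc_topic[d, cur] - 1
      topic_word[cur, w] <- topic_word[cur, w] - 1
      # p(topic t | document d) and p(word w | topic t), for every t
      p_topic_doc  <- (doc_topic[d, ] + alpha) / (sum(doc_topic[d, ]) + K * alpha)
      p_word_topic <- (topic_word[, w] + beta) / (rowSums(topic_word) + V * beta)
      # choose a new topic with probability proportional to the product
      new_topic <- sample.int(K, 1, prob = p_topic_doc * p_word_topic)
      z[[d]][i] <- new_topic
      doc_topic[d, new_topic]  <- doc_topic[d, new_topic] + 1
      topic_word[new_topic, w] <- topic_word[new_topic, w] + 1
    }
  }
}

# Estimated topic mixture of each document (rows sum to 1)
doc_mix <- doc_topic / rowSums(doc_topic)
print(round(doc_mix, 2))
```

With only three short documents the resulting mixtures are not meaningful; the point is simply to show the bookkeeping behind each reassignment.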

As a general point, I should also emphasise that you do not need to understand the ins and outs of an algorithm to use it but it does help to understand, at least at a high level, what the algorithm is doing. One needs to develop a feel for algorithms even if one doesn’t understand the details. Indeed, most people working in analytics do not know the details of the algorithms they use, but that doesn’t stop them from using algorithms intelligently. Purists may disagree. I think they are wrong.

Finally – because you’re no doubt wondering 🙂 – the term “Dirichlet” in LDA refers to the fact that the topic mixtures of documents and the word distributions of topics are assumed to follow Dirichlet distributions. There is no “good” reason for this apart from convenience – Dirichlet distributions provide good approximations to word distributions in documents and, perhaps more importantly, are computationally convenient.
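If you would like a feel for what a draw from a Dirichlet distribution looks like, the sketch below generates a few in base R by normalising independent gamma variates, which is a standard construction. The helper name rdirichlet_sketch and the parameter values are my own choices for illustration; they are not part of any package used in this post.

```r
# Draw n samples from a Dirichlet distribution with parameter vector alpha
# by normalising gamma variates (helper name is hypothetical).
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  # column j holds gamma draws with shape alpha[j]
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n)
  g / rowSums(g)
}

set.seed(1)
# Small parameters tend to give "peaky" mixtures dominated by a few topics
sparse_mix <- rdirichlet_sketch(3, rep(0.1, 5))
# Large parameters tend to give mixtures that are close to uniform
even_mix   <- rdirichlet_sketch(3, rep(50, 5))

print(round(sparse_mix, 3))
print(round(even_mix, 3))
```

Each row is a set of non-negative proportions that add up to one, which is exactly what is needed to describe how much of each topic a document contains.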

Preprocessing

As in my previous articles on text mining, I will use a collection of 30 posts from this blog as an example corpus. The corpus can be downloaded here. I will assume that you have R and RStudio installed. Follow this link if you need help with that.

The preprocessing steps are much the same as described in my previous articles. Nevertheless, I’ll risk boring you with a detailed listing so that you can reproduce my results yourself:

#load text mining library
library(tm)

#set working directory (modify path as needed)
setwd("C:\\Users\\Kailash\\Documents\\TextMining")

#load files into corpus
#get listing of .txt files in directory (pattern is a regular expression)
filenames <- list.files(getwd(), pattern = "\\.txt$")

#read each file into a character vector (one element per line)
files <- lapply(filenames, readLines)

#create corpus from the list of character vectors
docs <- Corpus(VectorSource(files))

#inspect a particular document in corpus
writeLines(as.character(docs[[30]]))

#start preprocessing
#Transform to lower case
docs <- tm_map(docs, content_transformer(tolower))

#remove potentially problematic symbols
#(hyphens, curly quotes and bullets are replaced by spaces)
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, "•")
docs <- tm_map(docs, toSpace, "”")
docs <- tm_map(docs, toSpace, "“")

#remove punctuation
docs <- tm_map(docs, removePunctuation)

#Strip digits
docs <- tm_map(docs, removeNumbers)

#remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

#remove whitespace
docs <- tm_map(docs, stripWhitespace)

#Good practice to check every now and then
writeLines(as.character(docs[[30]]))

#Stem document
docs <- tm_map(docs, stemDocument)

#fix up 1) differences between us and aussie english 2) general errors
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "organiz", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "organis", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "andgovern", replacement = "govern")
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "inenterpris", replacement = "enterpris")
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "team-", replacement = "team")

#define and eliminate all custom stopwords
myStopwords <- c("can", "say", "one", "way", "use",
                 "also", "howev", "tell", "will",
                 "much", "need", "take", "tend", "even",
                 "like", "particular", "rather", "said",
                 "get", "well", "make", "ask", "come", "end",
                 "first", "two", "help", "often", "may",
                 "might", "see", "someth", "thing", "point",
                 "post", "look", "right", "now", "think", "'ve ",
                 "'re ", "anoth", "put", "set", "new", "good",
                 "want", "sure", "kind", "larg", "yes,", "day", "etc",
                 "quit", "sinc", "attempt", "lack", "seen", "awar",
                 "littl", "ever", "moreov", "though", "found", "abl",
                 "enough", "far", "earli", "away", "achiev", "draw",
                 "last", "never", "brief", "bit", "entir",
                 "great", "lot")
docs <- tm_map(docs, removeWords, myStopwords)

#inspect a document as a check
writeLines(as.character(docs[[30]]))

#Create document-term matrix
dtm <- DocumentTermMatrix(docs)

#convert rownames to filenames
rownames(dtm) <- filenames

#collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))

#length should be total number of terms
length(freq)

#create sort order (descending)
ord <- order(freq, decreasing = TRUE)

#List all terms in decreasing order of freq and write to disk
freq[ord]
write.csv(freq[ord], "word_freq.csv")

Check out the preprocessing section in either this article or this one for detailed explanations of the code. The document term matrix (DTM) produced by the above code will be the main input into the LDA algorithm of the next section.
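If the idea of a document-term matrix is new to you, the following base R sketch builds one by hand for two made-up "documents". It does not use the tm package and is not how DocumentTermMatrix works internally; the names toy_docs and toy_dtm and the example text are assumptions made purely for illustration.

```r
# Build a tiny document-term matrix by hand (illustrative data only)
toy_docs <- c(doc1 = "data model system data",
              doc2 = "project manag project plan")

# split each document into words
tokens <- strsplit(toy_docs, " ", fixed = TRUE)

# the vocabulary is the set of distinct words across all documents
vocab <- sort(unique(unlist(tokens)))

# one row per document, one column per term, entries are word counts
toy_dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
print(toy_dtm)

# summing over columns gives corpus-wide term frequencies,
# just as colSums(as.matrix(dtm)) does in the listing above
toy_freq <- colSums(toy_dtm)
print(sort(toy_freq, decreasing = TRUE))
```

The real DTM has exactly this shape (30 rows, one per post, and one column per stemmed term), but is stored in a sparse format because most entries are zero.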

Topic modelling using LDA

We are now ready to do some topic modelling. We’ll use the topicmodels package written by Bettina Gruen and Kurt Hornik. Specifically, we’ll use the LDA function with the Gibbs sampling option mentioned earlier, and I’ll say more about it in a second. The LDA function has a fairly large number of parameters. I’ll describe these briefly below. For more, please check out this vignette by Gruen and Hornik.

For the most part, we’ll use the default parameter values supplied by the LDA function, explicitly setting only the parameters that are required by the Gibbs sampling algorithm.

Gibbs sampling works by performing a random walk in such a way that it reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the burn-in period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, taking every 500th iteration for further use. The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart=5) – that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so I have provided 5 random integers in my seed list. Finally, I’ve set best to TRUE (actually a default setting), which instructs the algorithm to return the results of the run with the highest posterior probability.
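A quick bit of arithmetic shows what these settings imply in terms of work done. The sketch below assumes that thinning is counted from the end of the burn-in period and that each of the five runs performs the full burn-in plus iterations; the variable names kept_iterations, sweeps_per_run and total_sweeps are mine, not the package's.

```r
# Settings described in the text
burnin <- 4000
iter   <- 2000
thin   <- 500
nstart <- 5

# iterations (after burn-in) that would be retained in each run,
# assuming every thin-th iteration is kept
kept_iterations <- seq(from = thin, to = iter, by = thin)
print(kept_iterations)

# total number of Gibbs sweeps per run and overall
sweeps_per_run <- burnin + iter
total_sweeps   <- nstart * sweeps_per_run
print(c(per_run = sweeps_per_run, total = total_sweeps))
```

This is worth keeping in mind for larger corpora: the running time grows with the total number of sweeps, so reducing burnin, iter or nstart is the first thing to try if a run takes too long.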

Some words of caution are in order here. It should be emphasised that the settings above do not guarantee the convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically in specific practical problems such as the one we are dealing with here. The upshot of this is that it is best to do lots of runs with different settings of parameters to check the stability of your results. The bottom line is that our interest is purely practical so it is good enough if the results make sense. We’ll leave issues of mathematical rigour to those better qualified to deal with them 🙂

As mentioned earlier, there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to this, but they often do not yield semantically meaningful choices of k (see this post on stackoverflow for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice by inspecting the results. This is what we’ll do.
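Here is a rough sketch of what "running the algorithm for different values of k" might look like. So that it is self-contained, it uses the first 30 documents of the AssociatedPress dataset that ships with topicmodels rather than my corpus, and deliberately small burn-in and iteration counts so that it runs quickly. The candidate values of k and the name loglik_by_k are assumptions for illustration; the log-likelihood is printed only as a rough guide and is no substitute for reading the terms.

```r
library(topicmodels)

# small built-in document-term matrix, used here only as stand-in data
data("AssociatedPress", package = "topicmodels")
small_dtm <- AssociatedPress[1:30, ]

candidate_k <- c(2, 4)   # illustrative values only

loglik_by_k <- sapply(candidate_k, function(k) {
  fit <- LDA(small_dtm, k = k, method = "Gibbs",
             control = list(seed = 2003, burnin = 50, iter = 100, thin = 100))
  # eyeball the top terms for each topic: do they make sense?
  print(terms(fit, 5))
  as.numeric(logLik(fit))
})
names(loglik_by_k) <- candidate_k
print(loglik_by_k)
```

In practice you would replace small_dtm with your own DTM, use the fuller sampling settings discussed above, and judge each k mainly by whether the topics are interpretable.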

OK, so the first step is to set these parameters in R… and while we’re at it, let’s also load the topicmodels library (Note: you might need to install this package as it is not a part of the base R installation).

#load topic models library
library(topicmodels)

#Set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

#Number of topics
k <- 5

That done, we can now do the actual work – run the topic modelling algorithm on our corpus. Here is the code:

#Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))

#write out results
#docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "DocsToTopics.csv"))

#top 6 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 6))
write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))

#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities, file = paste("LDAGibbs", k, "TopicProbabilities.csv"))

#Find relative importance of top 2 topics
#(each row is sorted as a plain numeric vector; sorting a data frame row
#can fail with "undefined columns selected")
topic1ToTopic2 <- lapply(1:nrow(dtm), function(x)
  sort(ldaOut@gamma[x, ])[k] / sort(ldaOut@gamma[x, ])[k-1])

#Find relative importance of second and third most important topics
topic2ToTopic3 <- lapply(1:nrow(dtm), function(x)
  sort(ldaOut@gamma[x, ])[k-1] / sort(ldaOut@gamma[x, ])[k-2])

#write to file
write.csv(topic1ToTopic2, file = paste("LDAGibbs", k, "Topic1ToTopic2.csv"))
write.csv(topic2ToTopic3, file = paste("LDAGibbs", k, "Topic2ToTopic3.csv"))

The LDA algorithm returns an object that contains a lot of information. Of particular interest to us are the document to topic assignments, the top terms in each topic and the probabilities with which topics are assigned to documents. These are written out in the first three calls to write.csv above. There are a few important points to note here:

1. Each document is considered to be a mixture of all topics (5 in this case). The assignments in the first file list the top topic – that is, the one with the highest probability (more about this in point 3 below).
2. Each topic contains all terms (words) in the corpus, albeit with different probabilities. We list only the top 6 terms in the second file.
3. The last file lists the probabilities with which each topic is assigned to a document. This is therefore a 30 x 5 matrix – 30 docs and 5 topics. As one might expect, the highest probability in each row corresponds to the topic assigned to that document. The “goodness” of the primary assignment (as discussed in point 1) can be assessed by taking the ratio of the highest to second-highest probability and the second-highest to the third-highest probability and so on. This is what I’ve done in the final part of the code above.

Take some time to examine the output and confirm for yourself that the primary topic assignments are best when the ratios of probabilities discussed in point 3 are highest. You should also experiment with different values of k to see if you can find better topic distributions. In the interests of space I will restrict myself to k = 5.
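As a small worked example of the ratio in point 3, the sketch below takes three rows from the table of topic probabilities shown further down in this post and computes the ratio of the highest to the second-highest probability for each. The object names topic_probs and top_two_ratio are mine.

```r
# Three rows copied from the topic probability table in this post
topic_probs <- rbind(
  BeyondEn = c(0.071, 0.064, 0.024, 0.741, 0.100),
  bigdata  = c(0.182, 0.221, 0.182, 0.260, 0.156),
  MakingSe = c(0.130, 0.206, 0.262, 0.089, 0.313)
)
colnames(topic_probs) <- paste("Topic", 1:5)

# ratio of the highest to the second-highest probability in each row
top_two_ratio <- apply(topic_probs, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  unname(s[1] / s[2])
})
print(round(top_two_ratio, 2))
```

A ratio well above 1 (as for the first row) indicates a clear-cut primary topic, whereas ratios close to 1 (as for the other two rows) flag documents whose primary assignment should be treated with caution.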

The table below lists the top 6 terms in topics 1 through 5.

	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5
1	work	question	chang	system	project
2	practic	map	organ	data	manag
3	mani	time	consult	model	approach
4	flexibl	ibi	manag	design	organ
5	differ	issu	work	process	decis
6	best	plan	problem	busi	problem

The table below lists the document to (primary) topic assignments:

Document	Topic
BeyondEntitiesAndRelationships.txt	4
bigdata.txt	4
ConditionsOverCauses.txt	5
EmergentDesignInEnterpriseIT.txt	4
FromInformationToKnowledge.txt	2
FromTheCoalface.txt	1
HeraclitusAndParmenides.txt	3
IroniesOfEnterpriseIT.txt	3
MakingSenseOfOrganizationalChange.txt	5
MakingSenseOfSensemaking.txt	2
ObjectivityAndTheEthicalDimensionOfDecisionMaking.txt	5
OnTheInherentAmbiguitiesOfManagingProjects.txt	5
OrganisationalSurprise.txt	5
ProfessionalsOrPoliticians.txt	3
RitualsInInformationSystemDesign.txt	4
RoutinesAndReality.txt	4
ScapegoatsAndSystems.txt	5
SherlockHolmesFailedProjects.txt	3
sherlockHolmesMgmtFetis.txt	3
SixHeresiesForBI.txt	4
SixHeresiesForEnterpriseArchitecture.txt	3
TheArchitectAndTheApparition.txt	3
TheCloudAndTheGrass.txt	2
TheConsultantsDilemma.txt	3
TheDangerWithin.txt	5
TheDilemmasOfEnterpriseIT.txt	3
TheEssenceOfEntrepreneurship.txt	1
ThreeTypesOfUncertainty.txt	5
TOGAFOrNotTOGAF.txt	3
UnderstandingFlexibility.txt	1

From a quick perusal of the two tables it appears that the algorithm has done a pretty decent job. For example, topic 4 is about data and system design, and the documents assigned to it are on topic. However, it is far from perfect – for example, the interview I did with Neil Preston on organisational change (MakingSenseOfOrganizationalChange.txt) has been assigned to topic 5, which seems to be about project management. It ought to be associated with topic 3, which is about change. Let’s see if we can resolve this by looking at probabilities associated with topics.

The table below lists the topic probabilities by document:

	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5
BeyondEn	0.071	0.064	0.024	0.741	0.1
bigdata.	0.182	0.221	0.182	0.26	0.156
Conditio	0.144	0.109	0.048	0.205	0.494
Emergent	0.121	0.226	0.204	0.236	0.213
FromInfo	0.096	0.643	0.026	0.169	0.066
FromTheC	0.636	0.082	0.058	0.086	0.138
Heraclit	0.137	0.091	0.503	0.162	0.107
IroniesO	0.101	0.088	0.388	0.26	0.162
MakingSe	0.13	0.206	0.262	0.089	0.313
MakingSe	0.09	0.715	0.055	0.067	0.074
Objectiv	0.216	0.078	0.086	0.242	0.378
OnTheInh	0.18	0.234	0.102	0.12	0.364
Organisa	0.089	0.095	0.07	0.092	0.655
Professi	0.155	0.064	0.509	0.128	0.144
RitualsI	0.103	0.064	0.044	0.676	0.112
Routines	0.108	0.042	0.033	0.69	0.127
Scapegoa	0.135	0.088	0.043	0.185	0.549
Sherlock	0.093	0.082	0.398	0.195	0.232
sherlock	0.108	0.136	0.453	0.123	0.18
SixHeres	0.159	0.11	0.078	0.516	0.138
SixHeres	0.104	0.111	0.366	0.212	0.207
TheArchi	0.111	0.221	0.522	0.088	0.058
TheCloud	0.185	0.333	0.198	0.136	0.148
TheConsu	0.105	0.184	0.518	0.096	0.096
TheDange	0.114	0.079	0.037	0.079	0.69
TheDilem	0.125	0.128	0.389	0.261	0.098
TheEssen	0.713	0.059	0.031	0.113	0.084
ThreeTyp	0.09	0.076	0.042	0.083	0.708
TOGAFOrN	0.158	0.232	0.352	0.151	0.107
Understa	0.658	0.065	0.072	0.101	0.105

In the table, the highest probability in each row is in bold. Also, in cases where the maximum and the second/third largest probabilities are close, I have highlighted the second (and third) highest probabilities in red. It is clear that Neil’s interview (9th document in the above table) has 3 topics with comparable probabilities – topic 5 (project management), topic 3 (change) and topic 2 (issue mapping / ibis), in decreasing order of probabilities. In general, if a document has multiple topics with comparable probabilities, it simply means that the document speaks to all those topics in proportions indicated by the probabilities. A reading of Neil’s interview will convince you that our conversation did indeed range over all those topics.

That said, the algorithm is far from perfect. You might have already noticed a few poor assignments. Here is one – my post on Sherlock Holmes and the case of the failed project has been assigned to topic 3; I reckon it belongs in topic 5. There are a number of others, but I won’t belabor the point, except to reiterate that this is precisely why you definitely want to experiment with different settings of the iteration parameters (to check for stability) and, more importantly, try a range of different values of k to find the optimal number of topics.

To conclude

Topic modelling provides a quick and convenient way to perform unsupervised classification of a corpus of documents. As always, though, one needs to examine the results carefully to check that they make sense.

I’d like to end with a general observation. Classifying documents is an age-old concern that cuts across disciplines. So it is no surprise that topic modelling has got a look-in from diverse communities. Indeed, when I was reading up and learning about LDA, I found that some of the best introductory articles in the area have been written by academics working in English departments! This is one of the things I love about working in text analysis: there is a wealth of material on the web written from diverse perspectives. The term cross-disciplinary often tends to be a platitude, but in this case it is simply a statement of fact.

I hope that I have been able to convince you to explore this rapidly evolving field. Exciting times ahead, come join the fun.

Written by K

September 29, 2015 at 7:18 pm

Posted in Business Intelligence, Data Analytics, Data Science, R, Statistics, Text Analytics, Text Mining

52 Responses

hi
i want to map topics to wordnet to form document topic representation for better clustering

shakeel

February 26, 2016 at 3:47 pm

Thanks for your great tutorial. Just to mention that I am getting a nasty error. The reasons are a bit obscure to me. If I find out what is causing it, I will let you know what it is.

writeLines(as.character(docs[[30]]))

Error in gsub(sprintf(“(*UCP)\\b(%s)\\b”, paste(sort(words, decreasing = TRUE), :
input string 1 is invalid UTF-8

Hendrik

April 5, 2016 at 4:35 am

- after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error ay more
  
  Achim
  
  July 23, 2016 at 9:39 pm

Thanks so much for this article. Quick question: you wrote that “Each topic contains all terms (words) in the corpus, albeit with different probabilities.” I see the table where the terms for each topic are listed in order of their probabilities, but it is possible to see the probabilities themselves, so as to identify an ‘elbow in the curve’ i.e. where you transition from terms that have a reasonable probability of association with that topic to those that have virtually no probability?

Jake

April 22, 2016 at 4:13 am

thank you so much for this simple and useful post. It helped me alot.m

zahra

May 23, 2016 at 6:08 pm

Hi again. I am searching for an implementation of an Online topic modeling approach, one with the ability to detect new topics and accept new words as new documents arrive. do you know any packages or libraries with these features?

zahra

June 3, 2016 at 7:29 pm

Thank you for such an amazing tutorial!

Rudraksh Tuwani

June 7, 2016 at 6:12 pm

Sorry I got this error “Error in is(x, “DocumentTermMatrix”) : object ‘dtm’ not found” I can not proceed what is dtm!!

Georgetigp

June 10, 2016 at 7:30 pm

Sorry I got it, I did not start with pre processing!

Georgetigp

June 10, 2016 at 7:54 pm

I got these errors
>docs <-tm_map(docs,content_transformer(tolower))
Warning message:
In mclapply(content(x), FUN, …) :
all scheduled cores encountered errors in user code

-Then I tried this way
docs writeLines(as.character(docs[[30]]))
Error in UseMethod(“stripWhitespace”, x) :
no applicable method for ‘stripWhitespace’ applied to an object of class “try-error”

Georgetigp

June 11, 2016 at 2:14 am

- Sorry I forget to include some code
  -Then I tried this way
  > docs <- tm_map(docs, toSpace, "-",lazy=TRUE)
  it seemingly goes well, but subsequent operations give this error
  
  docs writeLines(as.character(docs[[30]]))
  Error in UseMethod(“stripWhitespace”, x) :
  no applicable method for ‘stripWhitespace’ applied to an object of class “try-error
  
  Georgetigp
  
  June 11, 2016 at 2:17 am

  - Hey Georgetip, I am having the exact same problem. Did you find a way around? Thanks!
    
    carlos
    
    July 30, 2016 at 9:52 am

    - as mentioned above, after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error ay more
      
      achim
      
      July 31, 2016 at 2:54 am

Hi K,

Excellent article. I’ve tried it out myself and it works well.

Quick question – how long did the LDA step take for you?

Tom Roth

July 11, 2016 at 2:49 pm

- Hi Tom,
  
  Thanks for your comment! The duration of the LDA step depends on iter and nstart. For the parameter values shown in the code, I think it was a few minutes.
  
  Regards,
  
  Kailash.
  
  K
  
  July 12, 2016 at 6:21 am

  - Thanks Kailash!
    
    I’d tried it out on a larger dataset (~10000 documents) and the LDA step ran all night without finishing! The perils of machine learning without knowing exactly what you’re doing…
    
    Tom Roth
    
    July 12, 2016 at 9:54 am

    - Indeed…and such experimentation is part of the process (and fun!) of learning machine learning 🙂
      
      K
      
      July 12, 2016 at 10:08 am

    - Author has given addtional options such as
      burnin <- 4000
      iter <- 2000
      thin <- 500
      seed <-list(2003,5,63,100001,765)
      nstart <- 5
      best <- TRUE
      
      Instead use the default ones, try this – ldaOut <-LDA(dtm,k, method="Gibbs")
      
      or try decreasing the number of iterations
      
      Saurabh
      
      October 4, 2016 at 4:50 pm

    - exactly… I encountered the same oroblem
      
      PAUL
      
      March 24, 2017 at 4:35 pm

Hi. Thanks for this. It’s been really helpful and I’ve managed to run this over my own corpus. I’m getting an error, however, on the final steps – finding the relative importance of topics . I wondered if anyone else has experienced this and can perhaps help?

When I run the command:

topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)
sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])

I get an error:

Error in `[.data.frame`(sort(topicProbabilities[x, ]), k) :
undefined columns selected

Craig Hamilton

August 17, 2016 at 12:52 am

Im having trouble with the docs <- tm_map(docs,stemDocument) command. Apparently it depends on a package called SnowballC which is not compatible with R 3.3.1. Has anyone else had this problem? I tried running an older version of R but then I had trouble loading the tm library. Any thoughts on how to solve this?

Jacob

September 23, 2016 at 12:59 am

About topicmodels package

i want to know parameters’ mean

What does burnin mean?? why 4000?
What does iter mean?? why 4000?
What does thin mean?? why 500?
What does seed mean?? why (2003,5,63,100001,765)?
What does nstart mean?? why 5?
What does best mean??

> burnin iter thin seed nstart best <- TRUE

SOME BODY HELP ME!! PLEASE

LIM

September 25, 2016 at 10:54 pm

- Read this: https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf
  
  Saurabh
  
  October 4, 2016 at 4:45 pm

Thank you for the code and the clear explanation, Kailash. Does any topic modeling app/tool prevent a term from belonging to more than one topic?

Vivek Astvansh

October 22, 2016 at 4:05 am

Hi Kailash, This is an excellent article!! Really helps researchers like me. I have a question with respect to running LDA using Gibbs sampling. As I see, the LDA() allows for providing optional parameter of seed words with weights. Is this conceptually same as the z-label LDA (http://pages.cs.wisc.edu/~andrzeje/research/zl_lda.html)? It would be great if you could provide an example or pointers on how to input the seedwords with weights.

Thanks, BSS

SBS

January 22, 2017 at 10:01 am

Great write up thanks. I ‘m making great use of it.
However have noticed some errors in your code snippets…methodologically speaking

The stemmer in the latest tm package from snowballC requires plantext before stemming to get it to work properly

corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, stemDocument)

http://stackoverflow.com/questions/36967573/stemming-words-using-tm-package-in-r-does-not-work-properly

Also an improvement on the unicode punctuation removal can be simplified…

#remove special unicode chars
corpus <- tm_map(corpus, function(x) iconv(x,'UTF-8', 'ASCII', sub=' '))

http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files/9935242#9935242

clancy

February 20, 2017 at 2:23 pm

- Thanks a ton for catching the errors and taking the time to bring them to my attention. Much appreciated!
  
  Regards,
  
  Kailash.
  
  K
  
  February 20, 2017 at 2:27 pm

Hi all,
Thanks for this useful explanation of how LDA works.
I have a question in regards to the files or “documents”. Different from having separate Word or text files with the data, I have my data on a Excel file. My Excel only has one column and as many rows as documents (each cell has a different text).
Can I preprocess this Excel file? How can I run the LDA with this data structure?
Thank you in advance.

Marta

March 4, 2017 at 3:08 am

Thanks for the information about LDA implementation. But I am not able to get the names of the files in the output csv file and instead it is just giving out a serial of numbers. Is there something that can be done to rectify this issue?

samuel Benadict

March 28, 2017 at 12:55 pm

Dear Kailash,

Thank you so much for this excellent intro to topic modelling! It has helped me a lot.

One (very small) thing I have noticed about the data cleaning steps is that you remove punctuation before you remove stopwords. However, the stopwords contain words like “isn’t” which won’t be found if you remove punctuation first. So I have switched the two around. Is there any reason your did it the other way around?

Also, the very first thing I now do after I load my data into R is to convert it to ASCII so I won’t encounter any problems with any special characters, which helps immensely.

files2 <- stringi::stri_trans_general(files, "latin-ascii")

Best regards,
Sarah

Sarah

March 29, 2017 at 7:46 pm

[…] This piece on topic modeling is based on: topic modeling using R. […]

Classifying documents into topics using LDA – Holla

June 12, 2017 at 6:55 pm

Hi, and thanks for this tutorial.

Do you think that topic modeling could be used on a dataset (NOT a text corpus) with a large set of dummy variables (150+) built from a couple of discrete attributes? Here, CA usually fails because the dummy matrix is too sparse.

best regards,
gabriele

Gabriele

June 18, 2017 at 7:59 pm

[…] contains data points (topics/words). For a more detailed overview of using Topic Modelling, see Kailash Awati’s excellent post from which my own script is derived, or read David Blei’s overview of the […]

Text Analysis - Harkive Stories - Harkive.org

July 28, 2017 at 2:28 am

[…] following this example, I would like to automatically group (cluster) a series of documents based on their […]

Topic modeling Franco Battiato’s lyrics – Openite

July 30, 2017 at 11:13 pm

Interesting blog and easy-to-follow steps. I am trying a similar analysis with tweets and users. I am curious about how you got the document-to-topic mapping and the table of topic probabilities by document. It would be nice if you could throw some light on that part as well.

Divya Iyer

October 11, 2017 at 6:33 pm

[…] method to extract information from the text by assessing the proximity of words to each other. The topic modelling package provides functions to perform this analysis. I am not an expert in this field and simply […]

Qualitative Data Science: Using RQDA to analyse interview transcripts

May 3, 2018 at 9:59 am

[…] method to extract information from the text by assessing the proximity of words to each other. The topic modelling package provides functions to perform this analysis. I am not an expert in this field and simply […]

Qualitative Data Science: Using RQDA to analyse interviews

June 28, 2018 at 3:35 pm

Great, great, great tutorial, and we’re using it as an important guide for our study. While we can replicate almost everything, the code "rownames(dtm) <- filenames" fails with the error "Error in rownames(dtm <- filenames) : object 'filenames' not found". We don't know why, or even what our file names are exactly. Can anybody help? Thank you!

Cathy Chen

October 9, 2019 at 7:10 am

- Hi Cathy,
  
  Thanks for reading and for your feedback. The filenames variable is created in the third line of code:
  
  filenames <- list.files(getwd(), pattern="*.txt")
  
  The line you're referring to changes rownames to match the filenames (for easier reference).
  
  Hope this helps.
  
  Regards,
  
  Kailash.
  
  K
  
  October 10, 2019 at 8:25 am
  
[…] to find the topics or themes underlying a set of documents. There is an easy to follow explanation here, which goes into more detail about how the particular algorithm I’m using, Latent Dirichlet […]

#RugbyWorldCup on Twitter - Part 2 - Degrees of Belief

October 23, 2019 at 9:58 am

“Each row of the input matrix needs to contain at least one non-zero entry”
How can I solve this issue?

marisa

February 13, 2020 at 4:26 am

[…] code LDA in R. The answer was yes, there are code examples for both Python and R. “Why do you prefer that I code in R?” I asked. He replied: […]

R e Python no local de trabalho – Data Science e Machine Learning

July 23, 2020 at 5:05 am

Hi K,

Thank you so much for the non-math introduction; it really helps me understand LDA. Would you happen to have an article that explains the CTM side of this (in a non-math way!)? Please let me know if so. Thank you!

nana

July 15, 2021 at 9:06 am
