A gentle introduction to topic modeling using R
Introduction
The standard way to search for documents on the internet is via keywords or keyphrases. This is pretty much what Google and other search engines do routinely…and they do it well. However, as useful as this is, it has its limitations. Consider, for example, a situation in which you are confronted with a large collection of documents but have no idea what they are about. One of the first things you might want to do is to classify these documents into topics or themes. Among other things, this would help you figure out if there’s anything of interest while also directing you to the relevant subset(s) of the corpus. For small collections, one could do this by simply going through each document, but this is clearly infeasible for corpora containing thousands of documents.
Topic modeling – the theme of this post – deals with the problem of automatically classifying sets of documents into themes.
The article is organised as follows: I first provide some background on topic modelling. The algorithm that I use, Latent Dirichlet Allocation (LDA), involves some pretty heavy maths which I’ll avoid altogether. However, I will provide an intuitive explanation of how LDA works before moving on to a practical example which uses the topicmodels library in R. As in my previous articles in this series (see this post and this one), I will discuss the steps in detail along with explanations and provide accessible references for concepts that cannot be covered in the space of a blog post.
(Aside: Beware, LDA is also an abbreviation for Linear Discriminant Analysis, a classification technique that I hope to cover later in my ongoing series on text and data analytics.)
Latent Dirichlet Allocation – a math-free introduction
In essence, LDA is a technique that facilitates the automatic discovery of themes in a collection of documents.
The basic assumption behind LDA is that each of the documents in a collection consists of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics – the latter are part of the hidden (or latent) structure of documents. The aim is to infer the latent topic structure given the words and documents. LDA does this by iteratively adjusting the relative importance of topics in documents and of words in topics until these can recreate the documents in the corpus.
Here’s a brief explanation of how the algorithm works, quoted directly from this answer by Edwin Chen on Quora:
- Go through each document, and randomly assign each word in the document to one of the K topics. (Note: One of the shortcomings of LDA is that one has to specify the number of topics, denoted by K, upfront. More about this later.)
- This assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
- So to improve on them, for each document d…
- ….Go through each word w in d…
- ……..And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Note: p(a|b) is the conditional probability of a given that b has already occurred – see this post for more on conditional probabilities)
- ……..In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
- After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
For another simple explanation of how LDA works, check out this article by Matthew Jockers. For a more technical exposition, take a look at this video by David Blei, one of the inventors of the algorithm.
The iterative process described in the last point above is implemented using a technique called Gibbs sampling. I’ll say a bit more about Gibbs sampling later, but you may want to have a look at this paper by Philip Resnik and Eric Hardisty that explains the nitty-gritty of the algorithm (Warning: it involves a fair bit of math, but has some good intuitive explanations as well).
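To make the steps above a little more concrete, here is a minimal sketch of the resampling loop in base R. It is an illustration only: the toy corpus, the names (toy_docs, doc_topic, topic_word) and the smoothing constants alpha and beta are assumptions made for this example, and the code is not how the topicmodels package implements Gibbs sampling internally.

```r
# Toy corpus: each document is a vector of already-preprocessed words.
# All names and values here are illustrative assumptions.
set.seed(42)
toy_docs <- list(
  c("data", "model", "system", "data", "design"),
  c("project", "manag", "risk", "project", "plan"),
  c("data", "system", "model", "design", "system"),
  c("manag", "project", "plan", "risk", "manag")
)
K <- 2                                # number of topics, chosen upfront
vocab <- sort(unique(unlist(toy_docs)))
alpha <- 0.1; beta <- 0.01            # smoothing so no probability is exactly zero

# Step 1: randomly assign every word to one of the K topics
z <- lapply(toy_docs, function(d) sample(K, length(d), replace = TRUE))

# Count tables implied by the current assignments
doc_topic <- matrix(0, length(toy_docs), K)
topic_word <- matrix(0, K, length(vocab), dimnames = list(NULL, vocab))
for (d in seq_along(toy_docs)) for (i in seq_along(toy_docs[[d]])) {
  t_cur <- z[[d]][i]; w <- toy_docs[[d]][i]
  doc_topic[d, t_cur] <- doc_topic[d, t_cur] + 1
  topic_word[t_cur, w] <- topic_word[t_cur, w] + 1
}

# Repeatedly resample the topic of each word, holding all other assignments fixed
for (sweep in 1:200) {
  for (d in seq_along(toy_docs)) for (i in seq_along(toy_docs[[d]])) {
    w <- toy_docs[[d]][i]; t_old <- z[[d]][i]
    # take the current word out of the counts
    doc_topic[d, t_old] <- doc_topic[d, t_old] - 1
    topic_word[t_old, w] <- topic_word[t_old, w] - 1
    # p(topic t | document d) * p(word w | topic t), up to a constant factor
    p <- (doc_topic[d, ] + alpha) *
      (topic_word[, w] + beta) / (rowSums(topic_word) + beta * length(vocab))
    t_new <- sample(K, 1, prob = p)
    z[[d]][i] <- t_new
    doc_topic[d, t_new] <- doc_topic[d, t_new] + 1
    topic_word[t_new, w] <- topic_word[t_new, w] + 1
  }
}

# Topic mixture of each document: proportion of its words assigned to each topic
doc_mix <- doc_topic / rowSums(doc_topic)
round(doc_mix, 2)
```

The two count tables play the roles of the two proportions in the quoted explanation: doc_topic gives p(topic t | document d) and topic_word gives p(word w | topic t). The smoothing constants are there only so that a topic with no words in a document can still be chosen.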
As a general point, I should also emphasise that you do not need to understand the ins and outs of an algorithm to use it but it does help to understand, at least at a high level, what the algorithm is doing. One needs to develop a feel for algorithms even if one doesn’t understand the details. Indeed, most people working in analytics do not know the details of the algorithms they use, but that doesn’t stop them from using algorithms intelligently. Purists may disagree. I think they are wrong.
Finally – because you’re no doubt wondering 🙂 – the term “Dirichlet” in LDA refers to the fact that the topic mixtures of documents and the word distributions of topics are assumed to follow Dirichlet distributions. There is no “good” reason for this apart from convenience – Dirichlet distributions provide good approximations to word distributions in documents and, perhaps more important, are computationally convenient.
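If you would like a feel for what a Dirichlet distribution looks like, the sketch below draws a few topic mixtures in base R. Base R has no built-in Dirichlet sampler, so the helper rdirichlet_sketch is a hypothetical function written for this example; it uses the standard trick of normalising independent gamma draws. The concentration values 0.1 and 10 are arbitrary illustrative choices.

```r
# Draw n samples from Dirichlet(alpha) by normalising independent gamma draws.
# rdirichlet_sketch is a hypothetical helper, not part of any package.
rdirichlet_sketch <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

set.seed(1)
# Small concentration: each "document" tends to be dominated by one or two of 5 topics
sparse_mix <- rdirichlet_sketch(3, rep(0.1, 5))
# Large concentration: topics tend to be spread much more evenly
even_mix <- rdirichlet_sketch(3, rep(10, 5))

round(sparse_mix, 3)
round(even_mix, 3)
```

Each row is one possible topic mixture for a document, so every row sums to 1.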
Preprocessing
As in my previous articles on text mining, I will use a collection of 30 posts from this blog as an example corpus. The corpus can be downloaded here. I will assume that you have R and RStudio installed. Follow this link if you need help with that.
The preprocessing steps are much the same as described in my previous articles. Nevertheless, I’ll risk boring you with a detailed listing so that you can reproduce my results yourself:
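# load the text mining library (setup lines reconstructed from the steps described in the text and comments)
library(tm)
# read all .txt files in the working directory into a corpus, one string per file
filenames <- list.files(getwd(),pattern="*.txt")
files <- sapply(filenames, function(f) paste(readLines(f), collapse = " "))
docs <- Corpus(VectorSource(files))
# transform to lower case
docs <- tm_map(docs, content_transformer(tolower))
# replace potentially problematic symbols with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, "•")
docs <- tm_map(docs, toSpace, "”")
docs <- tm_map(docs, toSpace, "“")
# remove punctuation, digits, standard stopwords and extra whitespace, then stem
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
# reconcile spelling variants and fix up stemming errors
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organiz", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "organis", replacement = "organ")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "andgovern", replacement = "govern")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "inenterpris", replacement = "enterpris")
docs <- tm_map(docs, content_transformer(gsub),
pattern = "team-", replacement = "team")
# define and remove custom stopwords (only the entries that survive in this listing)
myStopwords <- c(
"also","howev","tell","will",
"much","need","take","tend","even",
"like","particular","rather","said",
"get","well","make","ask","come","end",
"first","two","help","often","may",
"might","see","someth","thing","point",
"post","look","right","now","think","‘ve ",
"‘re ","anoth","put","set","new","good",
"want","sure","kind","larg","yes,","day","etc",
"quit","sinc","attempt","lack","seen","awar",
"littl","ever","moreov","though","found","abl",
"enough","far","earli","away","achiev","draw",
"last","never","brief","bit","entir","brief",
"great","lot")
docs <- tm_map(docs, removeWords, myStopwords)
# create the document term matrix and label its rows with the file names
dtm <- DocumentTermMatrix(docs)
rownames(dtm) <- filenames
# term frequencies in decreasing order, written to disk
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing = TRUE)
write.csv(freq[ord],"word_freq.csv")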
Check out the preprocessing section in either this article or this one for detailed explanations of the code. The document term matrix (DTM) produced by the above code will be the main input into the LDA algorithm of the next section.
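In case the idea of a document term matrix is new to you, here is a tiny base R sketch of one. The three toy documents and the names toy_docs and toy_dtm are assumptions made for illustration; the real DTM in this post is produced by the tm package, not by this code.

```r
# Three tiny "documents", already lower-cased and stripped of punctuation
toy_docs <- c(doc1 = "data model data system",
              doc2 = "project risk project plan",
              doc3 = "data project model")

# Split each document into words and collect the vocabulary
tokens <- strsplit(toy_docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, entries are word counts
toy_dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
toy_dtm

# Summing over columns gives corpus-wide term frequencies
colSums(toy_dtm)
```

The column sums here are the toy equivalent of the term frequencies written to word_freq.csv.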
Topic modelling using LDA
We are now ready to do some topic modelling. We’ll use the topicmodels package written by Bettina Gruen and Kurt Hornik. Specifically, we’ll use the LDA function with the Gibbs sampling option mentioned earlier, and I’ll say more about it in a second. The LDA function has a fairly large number of parameters. I’ll describe these briefly below. For more, please check out this vignette by Gruen and Hornik.
For the most part, we’ll use the default parameter values supplied by the LDA function, setting only the parameters that are required by the Gibbs sampling algorithm.
Gibbs sampling works by performing a random walk in such a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of the distribution). This is referred to as the burn-in period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, taking every 500th iteration for further use. The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart=5) – that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility), so I have provided 5 random integers in my seed list. Finally, I’ve set best to TRUE (actually a default setting), which instructs the algorithm to return results of the run with the highest posterior probability.
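As a quick sanity check on these numbers, the arithmetic below works out which steps of the walk are kept under the description just given. This is a sketch of the bookkeeping as described in the paragraph above, counting steps from the start of the walk; the package may index iterations differently internally.

```r
burnin <- 4000   # steps discarded at the start of the walk
iter   <- 2000   # steps performed after the burn-in
thin   <- 500    # keep every 500th step after the burn-in

# Steps retained for further use, counted from the start of the walk
kept <- burnin + seq(thin, iter, by = thin)
kept
length(kept)
```

So each of the five runs performs 6000 steps in total and retains four of them.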
Some words of caution are in order here. It should be emphasised that the settings above do not guarantee the convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically in specific practical problems such as the one we are dealing with here. The upshot of this is that it is best to do lots of runs with different settings of parameters to check the stability of your results. The bottom line is that our interest is purely practical so it is good enough if the results make sense. We’ll leave issues of mathematical rigour to those better qualified to deal with them 🙂
As mentioned earlier, there is an important parameter that must be specified upfront: k, the number of topics that the algorithm should use to classify documents. There are mathematical approaches to this, but they often do not yield semantically meaningful choices of k (see this post on stackoverflow for an example). From a practical point of view, one can simply run the algorithm for different values of k and make a choice by inspecting the results. This is what we’ll do.
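The "try several values of k and inspect" approach can be sketched as a simple loop. The example below assumes the topicmodels package is installed and uses a small made-up count matrix (toy_counts) in place of the real DTM so that it runs in a second or two; the burn-in and iteration values are deliberately small and are not the settings used in this post.

```r
library(topicmodels)

set.seed(123)
# Hypothetical toy data: 20 documents over 12 terms, in two blocks of vocabulary
terms_a <- paste0("a", 1:6)
terms_b <- paste0("b", 1:6)
toy_counts <- matrix(0L, nrow = 20, ncol = 12,
                     dimnames = list(paste0("doc", 1:20), c(terms_a, terms_b)))
toy_counts[1:10, terms_a]  <- rpois(60, 5)
toy_counts[11:20, terms_b] <- rpois(60, 5)

# Fit one model per candidate number of topics and look at the top terms
fits <- list()
for (k_try in 2:4) {
  fits[[as.character(k_try)]] <- LDA(toy_counts, k_try, method = "Gibbs",
                                     control = list(seed = 1, burnin = 100, iter = 200))
  print(terms(fits[[as.character(k_try)]], 3))
}
```

With a real corpus you would replace toy_counts with dtm and judge each set of top terms by whether the topics make sense to you.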
OK, so the first step is to set these parameters in R… and while we’re at it, let’s also load the topicmodels library (Note: you might need to install this package as it is not a part of the base R installation).
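library(topicmodels)
# parameters for Gibbs sampling (burnin as stated in the text)
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
# number of topics
k <- 5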
That done, we can now do the actual work – run the topic modelling algorithm on our corpus. Here is the code:
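ldaOut <- LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))  # run LDA using Gibbs sampling (call reconstructed from the parameters set above)
ldaOut.topics <- as.matrix(topics(ldaOut))  # primary topic of each document
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
ldaOut.terms <- as.matrix(terms(ldaOut,6))  # top 6 terms in each topic
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))
topicProbabilities <- as.data.frame(ldaOut@gamma)  # probability of each topic in each document
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)  # ratio of highest to second-highest probability
sort(as.numeric(topicProbabilities[x,]))[k]/sort(as.numeric(topicProbabilities[x,]))[k-1])
topic2ToTopic3 <- lapply(1:nrow(dtm),function(x)  # ratio of second- to third-highest probability
sort(as.numeric(topicProbabilities[x,]))[k-1]/sort(as.numeric(topicProbabilities[x,]))[k-2])
write.csv(topic2ToTopic3,file=paste("LDAGibbs",k,"Topic2ToTopic3.csv"))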
The LDA algorithm returns an object that contains a lot of information. Of particular interest to us are the document to topic assignments, the top terms in each topic and the probabilities associated with each of those terms. These are printed out in the first three calls to write.csv above. There are a few important points to note here:
- Each document is considered to be a mixture of all topics (5 in this case). The assignments in the first file list the top topic – that is, the one with the highest probability (more about this in point 3 below).
- Each topic contains all terms (words) in the corpus, albeit with different probabilities. We list only the top 6 terms in the second file.
- The last file lists the probabilities with which each topic is assigned to a document. This is therefore a 30 x 5 matrix – 30 docs and 5 topics. As one might expect, the highest probability in each row corresponds to the topic assigned to that document. The “goodness” of the primary assignment (as discussed in point 1) can be assessed by taking the ratio of the highest to second-highest probability and the second-highest to the third-highest probability and so on. This is what I’ve done in the final lines of the code above.
Take some time to examine the output and confirm for yourself that the primary topic assignments are best when the ratios of probabilities discussed in point 3 are highest. You should also experiment with different values of k to see if you can find better topic distributions. In the interests of space I will restrict myself to k = 5.
The table below lists the top 6 terms in topics 1 through 5.
Rank | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
1 | work | question | chang | system | project |
2 | practic | map | organ | data | manag |
3 | mani | time | consult | model | approach |
4 | flexibl | ibi | manag | design | organ |
5 | differ | issu | work | process | decis |
6 | best | plan | problem | busi | problem |
The table below lists the document to (primary) topic assignments:
Document | Topic |
BeyondEntitiesAndRelationships.txt | 4 |
bigdata.txt | 4 |
ConditionsOverCauses.txt | 5 |
EmergentDesignInEnterpriseIT.txt | 4 |
FromInformationToKnowledge.txt | 2 |
FromTheCoalface.txt | 1 |
HeraclitusAndParmenides.txt | 3 |
IroniesOfEnterpriseIT.txt | 3 |
MakingSenseOfOrganizationalChange.txt | 5 |
MakingSenseOfSensemaking.txt | 2 |
ObjectivityAndTheEthicalDimensionOfDecisionMaking.txt | 5 |
OnTheInherentAmbiguitiesOfManagingProjects.txt | 5 |
OrganisationalSurprise.txt | 5 |
ProfessionalsOrPoliticians.txt | 3 |
RitualsInInformationSystemDesign.txt | 4 |
RoutinesAndReality.txt | 4 |
ScapegoatsAndSystems.txt | 5 |
SherlockHolmesFailedProjects.txt | 3 |
sherlockHolmesMgmtFetis.txt | 3 |
SixHeresiesForBI.txt | 4 |
SixHeresiesForEnterpriseArchitecture.txt | 3 |
TheArchitectAndTheApparition.txt | 3 |
TheCloudAndTheGrass.txt | 2 |
TheConsultantsDilemma.txt | 3 |
TheDangerWithin.txt | 5 |
TheDilemmasOfEnterpriseIT.txt | 3 |
TheEssenceOfEntrepreneurship.txt | 1 |
ThreeTypesOfUncertainty.txt | 5 |
TOGAFOrNotTOGAF.txt | 3 |
UnderstandingFlexibility.txt | 1 |
From a quick perusal of the two tables it appears that the algorithm has done a pretty decent job. For example, topic 4 is about data and system design, and the documents assigned to it are on topic. However, it is far from perfect – for example, the interview I did with Neil Preston on organisational change (MakingSenseOfOrganizationalChange.txt) has been assigned to topic 5, which seems to be about project management. It ought to be associated with Topic 3, which is about change. Let’s see if we can resolve this by looking at probabilities associated with topics.
The table below lists the topic probabilities by document:
Document | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
BeyondEn | 0.071 | 0.064 | 0.024 | 0.741 | 0.1 |
bigdata. | 0.182 | 0.221 | 0.182 | 0.26 | 0.156 |
Conditio | 0.144 | 0.109 | 0.048 | 0.205 | 0.494 |
Emergent | 0.121 | 0.226 | 0.204 | 0.236 | 0.213 |
FromInfo | 0.096 | 0.643 | 0.026 | 0.169 | 0.066 |
FromTheC | 0.636 | 0.082 | 0.058 | 0.086 | 0.138 |
Heraclit | 0.137 | 0.091 | 0.503 | 0.162 | 0.107 |
IroniesO | 0.101 | 0.088 | 0.388 | 0.26 | 0.162 |
MakingSe | 0.13 | 0.206 | 0.262 | 0.089 | 0.313 |
MakingSe | 0.09 | 0.715 | 0.055 | 0.067 | 0.074 |
Objectiv | 0.216 | 0.078 | 0.086 | 0.242 | 0.378 |
OnTheInh | 0.18 | 0.234 | 0.102 | 0.12 | 0.364 |
Organisa | 0.089 | 0.095 | 0.07 | 0.092 | 0.655 |
Professi | 0.155 | 0.064 | 0.509 | 0.128 | 0.144 |
RitualsI | 0.103 | 0.064 | 0.044 | 0.676 | 0.112 |
Routines | 0.108 | 0.042 | 0.033 | 0.69 | 0.127 |
Scapegoa | 0.135 | 0.088 | 0.043 | 0.185 | 0.549 |
Sherlock | 0.093 | 0.082 | 0.398 | 0.195 | 0.232 |
sherlock | 0.108 | 0.136 | 0.453 | 0.123 | 0.18 |
SixHeres | 0.159 | 0.11 | 0.078 | 0.516 | 0.138 |
SixHeres | 0.104 | 0.111 | 0.366 | 0.212 | 0.207 |
TheArchi | 0.111 | 0.221 | 0.522 | 0.088 | 0.058 |
TheCloud | 0.185 | 0.333 | 0.198 | 0.136 | 0.148 |
TheConsu | 0.105 | 0.184 | 0.518 | 0.096 | 0.096 |
TheDange | 0.114 | 0.079 | 0.037 | 0.079 | 0.69 |
TheDilem | 0.125 | 0.128 | 0.389 | 0.261 | 0.098 |
TheEssen | 0.713 | 0.059 | 0.031 | 0.113 | 0.084 |
ThreeTyp | 0.09 | 0.076 | 0.042 | 0.083 | 0.708 |
TOGAFOrN | 0.158 | 0.232 | 0.352 | 0.151 | 0.107 |
Understa | 0.658 | 0.065 | 0.072 | 0.101 | 0.105 |
In the table, the highest probability in each row is in bold. Also, in cases where the maximum and the second/third largest probabilities are close, I have highlighted the second (and third) highest probabilities in red. It is clear that Neil’s interview (9th document in the above table) has 3 topics with comparable probabilities – topic 5 (project management), topic 3 (change) and topic 2 (issue mapping / ibis), in decreasing order of probabilities. In general, if a document has multiple topics with comparable probabilities, it simply means that the document speaks to all those topics in proportions indicated by the probabilities. A reading of Neil’s interview will convince you that our conversation did indeed range over all those topics.
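Here is a small base R sketch of how one might flag such "mixed" documents automatically. The three rows are copied from the table above; the helper summarise_row and the cut-off of 1.5 are assumptions made for illustration, not a recommended threshold.

```r
# Topic probabilities for three documents, copied from the table above
probs <- rbind(
  BeyondEntitiesAndRelationships    = c(0.071, 0.064, 0.024, 0.741, 0.100),
  MakingSenseOfOrganizationalChange = c(0.130, 0.206, 0.262, 0.089, 0.313),
  TheCloudAndTheGrass               = c(0.185, 0.333, 0.198, 0.136, 0.148)
)
colnames(probs) <- paste("Topic", 1:5)

# For one document: the primary topic and the ratios between successive top probabilities
summarise_row <- function(p) {
  s <- sort(p, decreasing = TRUE)
  c(top_topic    = unname(which.max(p)),
    ratio_1_to_2 = unname(s[1] / s[2]),
    ratio_2_to_3 = unname(s[2] / s[3]))
}
result <- t(apply(probs, 1, summarise_row))
round(result, 2)

# Documents whose top two topics are close (1.5 is an arbitrary illustrative cut-off)
mixed <- rownames(result)[result[, "ratio_1_to_2"] < 1.5]
mixed
```

A ratio close to 1 means the primary assignment is only marginally ahead of the runner-up, which is exactly the situation with Neil’s interview.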
That said, the algorithm is far from perfect. You might have already noticed a few poor assignments. Here is one – my post on Sherlock Holmes and the case of the failed project has been assigned to topic 3; I reckon it belongs in topic 5. There are a number of others, but I won’t belabor the point, except to reiterate that this is precisely why you definitely want to experiment with different settings of the iteration parameters (to check for stability) and, more important, try a range of different values of k to find the optimal number of topics.
To conclude
Topic modelling provides a quick and convenient way to perform unsupervised classification of a corpus of documents. As always, though, one needs to examine the results carefully to check that they make sense.
I’d like to end with a general observation. Classifying documents is an age-old concern that cuts across disciplines. So it is no surprise that topic modelling has got a look-in from diverse communities. Indeed, when I was reading up and learning about LDA, I found that some of the best introductory articles in the area have been written by academics working in English departments! This is one of the things I love about working in text analysis: there is a wealth of material on the web written from diverse perspectives. The term cross-disciplinary often tends to be a platitude, but in this case it is simply a statement of fact.
I hope that I have been able to convince you to explore this rapidly evolving field. Exciting times ahead, come join the fun.
hi
i want to map topics to wordnet to form document topic representation for better clustering
shakeel
February 26, 2016 at 3:47 pm
Thanks for your great tutorial. Just to mention that I am getting a nasty error. The reasons are a bit obscure to me. If I find out what is causing it, I will let you know what it is.
writeLines(as.character(docs[[30]]))
Error in gsub(sprintf(“(*UCP)\\b(%s)\\b”, paste(sort(words, decreasing = TRUE), :
input string 1 is invalid UTF-8
Hendrik
April 5, 2016 at 4:35 am
after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error any more
Achim
July 23, 2016 at 9:39 pm
Thanks so much for this article. Quick question: you wrote that “Each topic contains all terms (words) in the corpus, albeit with different probabilities.” I see the table where the terms for each topic are listed in order of their probabilities, but is it possible to see the probabilities themselves, so as to identify an ‘elbow in the curve’ i.e. where you transition from terms that have a reasonable probability of association with that topic to those that have virtually no probability?
Jake
April 22, 2016 at 4:13 am
thank you so much for this simple and useful post. It helped me a lot.
zahra
May 23, 2016 at 6:08 pm
Hi again. I am searching for an implementation of an Online topic modeling approach, one with the ability to detect new topics and accept new words as new documents arrive. do you know any packages or libraries with these features?
zahra
June 3, 2016 at 7:29 pm
Thank you for such an amazing tutorial!
Rudraksh Tuwani
June 7, 2016 at 6:12 pm
Sorry I got this error “Error in is(x, “DocumentTermMatrix”) : object ‘dtm’ not found” I can not proceed what is dtm!!
Georgetigp
June 10, 2016 at 7:30 pm
Sorry I got it, I did not start with pre processing!
Georgetigp
June 10, 2016 at 7:54 pm
I got these errors
>docs <-tm_map(docs,content_transformer(tolower))
Warning message:
In mclapply(content(x), FUN, …) :
all scheduled cores encountered errors in user code
-Then I tried this way
docs writeLines(as.character(docs[[30]]))
Error in UseMethod(“stripWhitespace”, x) :
no applicable method for ‘stripWhitespace’ applied to an object of class “try-error”
Georgetigp
June 11, 2016 at 2:14 am
Sorry I forget to include some code
-Then I tried this way
> docs <- tm_map(docs, toSpace, "-",lazy=TRUE)
it seemingly goes well, but subsequent operations give this error
docs writeLines(as.character(docs[[30]]))
Error in UseMethod(“stripWhitespace”, x) :
no applicable method for ‘stripWhitespace’ applied to an object of class “try-error
Georgetigp
June 11, 2016 at 2:17 am
Hey Georgetip, I am having the exact same problem. Did you find a way around? Thanks!
carlos
July 30, 2016 at 9:52 am
as mentioned above, after reading in the files with this command: docs <- Corpus(DirSource()), I didn’t get this error any more
achim
July 31, 2016 at 2:54 am
Hi K,
Excellent article. I’ve tried it out myself and it works well.
Quick question – how long did the LDA step take for you?
Tom Roth
July 11, 2016 at 2:49 pm
Hi Tom,
Thanks for your comment! The duration of the LDA step depends on iter and nstart. For the parameter values shown in the code, I think it was a few minutes.
Regards,
Kailash.
K
July 12, 2016 at 6:21 am
Thanks Kailash!
I’d tried it out on a larger dataset (~10000 documents) and the LDA step ran all night without finishing! The perils of machine learning without knowing exactly what you’re doing…
Tom Roth
July 12, 2016 at 9:54 am
Indeed…and such experimentation is part of the process (and fun!) of learning machine learning 🙂
K
July 12, 2016 at 10:08 am
The author has given additional options such as
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
Instead use the default ones, try this – ldaOut <-LDA(dtm,k, method="Gibbs")
or try decreasing the number of iterations
Saurabh
October 4, 2016 at 4:50 pm
exactly… I encountered the same problem
PAUL
March 24, 2017 at 4:35 pm
Hi. Thanks for this. It’s been really helpful and I’ve managed to run this over my own corpus. I’m getting an error, however, on the final steps – finding the relative importance of topics . I wondered if anyone else has experienced this and can perhaps help?
When I run the command:
topic1ToTopic2 <- lapply(1:nrow(dtm),function(x)
sort(topicProbabilities[x,])[k]/sort(topicProbabilities[x,])[k-1])
I get an error:
Error in `[.data.frame`(sort(topicProbabilities[x, ]), k) :
undefined columns selected
Craig Hamilton
August 17, 2016 at 12:52 am
Im having trouble with the docs <- tm_map(docs,stemDocument) command. Apparently it depends on a package called SnowballC which is not compatible with R 3.3.1. Has anyone else had this problem? I tried running an older version of R but then I had trouble loading the tm library. Any thoughts on how to solve this?
Jacob
September 23, 2016 at 12:59 am
About topicmodels package
i want to know parameters’ mean
What does burnin mean?? why 4000?
What does iter mean?? why 4000?
What does thin mean?? why 500?
What does seed mean?? why (2003,5,63,100001,765)?
What does nstart mean?? why 5?
What does best mean??
> burnin iter thin seed nstart best <- TRUE
SOME BODY HELP ME!! PLEASE
LIM
September 25, 2016 at 10:54 pm
Read this: https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf
Saurabh
October 4, 2016 at 4:45 pm
Thank you for the code and the clear explanation, Kailash. Does any topic modeling app/tool prevent a term from belonging to more than one topic?
Vivek Astvansh
October 22, 2016 at 4:05 am
Hi Kailash, This is an excellent article!! Really helps researchers like me. I have a question with respect to running LDA using Gibbs sampling. As I see, the LDA() allows for providing optional parameter of seed words with weights. Is this conceptually same as the z-label LDA (http://pages.cs.wisc.edu/~andrzeje/research/zl_lda.html)? It would be great if you could provide an example or pointers on how to input the seedwords with weights.
Thanks, BSS
SBS
January 22, 2017 at 10:01 am
Great write up thanks. I ‘m making great use of it.
However have noticed some errors in your code snippets…methodologically speaking
The stemmer in the latest tm package from snowballC requires plantext before stemming to get it to work properly
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, stemDocument)
http://stackoverflow.com/questions/36967573/stemming-words-using-tm-package-in-r-does-not-work-properly
Also an improvement on the unicode punctuation removal can be simplified…
#remove special unicode chars
corpus <- tm_map(corpus, function(x) iconv(x,'UTF-8', 'ASCII', sub=' '))
http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files/9935242#9935242
clancy
February 20, 2017 at 2:23 pm
Thanks a ton for catching the errors and taking the time to bring them to my attention. Much appreciated!
Regards,
Kailash.
K
February 20, 2017 at 2:27 pm
Hi all,
Thanks for this useful explanation of how LDA works.
I have a question in regards to the files or “documents”. Different from having separate Word or text files with the data, I have my data on a Excel file. My Excel only has one column and as many rows as documents (each cell has a different text).
Can I preprocess this Excel file? How can I run the LDA with this data structure?
Thank you in advance.
Marta
March 4, 2017 at 3:08 am
[…] Awati, Kailash. “A gentle introduction to topic modeling using R.” Eight to Late. Accessed March 16, 2017. https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/. […]
Holy Indexes – Small Words Big Numbers
March 17, 2017 at 1:09 pm
Thanks for the information about LDA implementation. But I am not able to get the names of the files in the output csv file and instead it is just giving out a serial of numbers. Is there something that can be done to rectify this issue?
samuel Benadict
March 28, 2017 at 12:55 pm
Dear Kailash,
Thank you so much for this excellent intro to topic modelling! It has helped me a lot.
One (very small) thing I have noticed about the data cleaning steps is that you remove punctuation before you remove stopwords. However, the stopwords contain words like “isn’t” which won’t be found if you remove punctuation first. So I have switched the two around. Is there any reason you did it the other way around?
Also, the very first thing I now do after I load my data into R is to convert it to ASCII so I won’t encounter any problems with any special characters, which helps immensely.
files2 <- stringi::stri_trans_general(files, "latin-ascii")
Best regards,
Sarah
Sarah
March 29, 2017 at 7:46 pm
Hi and tnx for this tutorial.
Do you think that topic modeling could be used in case of a dataset (NOT a text corpus) with big set of dummy vars (150+) built from a couple of discrete attributes. Here, CA usually fails due to the dummies matrix being too sparse?
best regards,
gabriele
Gabriele
June 18, 2017 at 7:59 pm
Interesting blog and easy-to-follow steps. I am trying a similar analysis with tweets and the users. I am curious about how you got the document and topic mapping and the table with topic probabilities by document. Would be nice if you could throw some light on that part as well.
Divya Iyer
October 11, 2017 at 6:33 pm
Great great great tutorial and we’re using it as an important guide for our study. While we can replicate almost everything, we found that the code "rownames(dtm) <- filenames" gives the error "Error in rownames(dtm <- filenames) : object 'filenames' not found". We don't know why, or even what our file names are exactly. Can anybody help? Thank you!
Cathy Chen
October 9, 2019 at 7:10 am
Hi Cathy,
Thanks for reading and for your feedback. The filenames variable is created in the third line of code:
filenames <- list.files(getwd(),pattern="*.txt")
The line you're referring to changes rownames to match the filenames (for easier reference).
Hope this helps.
Regards,
Kailash.
K
October 10, 2019 at 8:25 am
“Each row of the input matrix needs to contain at least one non-zero entry”
How can I solve this issue?
marisa
February 13, 2020 at 4:26 am
Hi K,
Thank you so much for the non-math introduction, that really helps me understand it. Would you happen to have an article that can explain the CTM side of this (in a non-math way!)? Please let me know if so, thank you!
nana
July 15, 2021 at 9:06 am