This is the third in a series of posts where I explore the application of structural topic models and their R implementation. The other posts are here and here, while here is the GitHub repository with the Jupyter notebooks, where the contents of the posts are better integrated with code snippets and visualizations. The notebook with the contents of this and the next post is better rendered in Jupyter nbviewer.
Introduction
In the previous posts we focused only on data preparation, without running any analysis or touching the stm package; that is what we will do here. In these paragraphs we will first look at some diagnostic techniques for identifying an adequate number of topics, which is a key step and a question to which there is no “right” answer: rather, the answer is very context- and needs-specific (see the stm package vignette or the technical article by Roberts et al. 2016). In the second part, which will take a separate post, we will perform some basic analysis on the resulting model, in order to understand more about it and to explore the potential of the stm package.
Model selection
In the manual of the stm R package the authors give some very general rules of thumb with regard to the number of topics:
For short corpora focused on very specific subject matter (such as survey experiments) 3-10 topics is a useful starting range. For small corpora (a few hundred to a few thousand) 5-50 topics is a good place to start. Beyond these rough guidelines it is application specific. Previous applications in political science with medium sized corpora (10k to 100k documents) have found 60-100 topics to work well. For larger corpora 100 topics is a useful default size. Of course, your mileage may vary.
Ours is a small and short corpus on a relatively specific subject matter, so we will focus, at least initially, on the 5-50 range. We will now explore some basic diagnostics to identify a properly fitted model.
For the sake of the present exercise, we will focus on a model with only the factor by which the salary is computed (rateby) as the topic prevalence covariate. This was chosen for reasons of simplicity, as the other covariates need some further massaging; we will work with them at a later stage.
The “normal” dataframe needs to be converted into a matrix of word indices and counts before being fed to the function. stm accepts different formats, including quanteda dfm objects and sparse matrices, but here we will just use the native stm format prepared with the package functions textProcessor and prepDocuments:
suppressWarnings(library(stm))
suppressWarnings(library(stringr))
DF<-read.table("totaljobsCoordRates.txt")
DF<-DF[!is.na(DF$Description),] # stm doesn't work with missing data
DF<-DF[!is.na(DF$rateby),]
DF$Description = str_replace_all(DF$Description, "/", " ") # substitute "/" with a space
DF$Description = str_replace_all(DF$Description, "&#x27;|&quot;|&#x2F;", "'") ## html-encoded quotes and slashes (adapted from https://juliasilge.com/blog/evaluating-stm/)
DF$Description = str_replace_all(DF$Description, "<a(.*?)>", "") ## links
DF$Description = str_replace_all(DF$Description, "&gt;|&lt;|&amp;", "") ## html escapes
DF$Description = str_replace_all(DF$Description, "&#[:digit:]+;", "") ## numeric html entities
DF$Description = str_remove_all(DF$Description, "<[^>]*>") ## any remaining html tags
## textProcessor prepares the corpus, representing the documents as lists containing word indices and associated word counts, the vocab character vector associated with the word indices, and a metadata matrix containing the covariates. Since custom stopwords can also be removed at this stage, here we opt to remove "work" and "will" from the corpus
processed<-textProcessor(DF$Description, metadata = DF, customstopwords=c("work", "will"), verbose=FALSE)
## prepDocuments is a helpful function to perform manipulations like removing words based on frequency thresholds without compromising the indices
out<-prepDocuments(processed$documents, processed$vocab, processed$meta, verbose=FALSE)
docs<-out$documents
vocab<-out$vocab
meta<-out$meta
As prepDocuments removes words based on thresholds, it might be helpful to have a cursory look at what gets dropped. I also found the procedure below helpful for identifying the document(s) in which a specific word has been used.
#identify documents where the word "heartland" has been used
wntword<-which(processed$vocab=="heartland")
filterdocs<-lapply(processed$documents, function(ch) which(ch[1,]==wntword)) # row 1 of each document matrix holds the word indices
indexList<-filterdocs[which(sapply(filterdocs, length)>0)]
DF[as.numeric(names(indexList)),]
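Relatedly, before settling on a threshold, stm's plotRemoved function gives a quick overview of how many documents, words and tokens would be dropped at different lower thresholds. A minimal sketch (the threshold sequence below is just an illustrative assumption):
plotRemoved(processed$documents, lower.thresh = seq(1, 50, by = 10)) # documents/words/tokens removed at each candidate threshold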
As mentioned above, with regard to the number of topics we should focus, at least initially, on the 5-50 range. stm allows us to run some preliminary diagnostics to assess candidate models, in particular with the function searchK, which performs a number of tests such as held-out likelihood, residual analysis, average exclusivity, and semantic coherence. The results can then be easily plotted by calling plot on the resulting searchK object. Those preferring to work with quanteda or sparse matrices can have a look at this post by Julia Silge.
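For completeness, here is a hedged sketch of the quanteda route mentioned above (not used in this post, and object names are purely illustrative): a dfm built from the same descriptions can be converted into the stm format with quanteda's convert function.
suppressWarnings(library(quanteda))
corp<-corpus(DF, text_field = "Description") # build a quanteda corpus from the dataframe, keeping the other columns as docvars
dfmjobs<-dfm(tokens(corp, remove_punct = TRUE)) # tokenize and build the document-feature matrix
dfmjobs<-dfm_remove(dfmjobs, pattern = stopwords("en")) # drop English stopwords
outq<-convert(dfmjobs, to = "stm") # a list with documents, vocab and meta, analogous to the prepDocuments output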
Here we will run the diagnostics for 5, 10, 15, 20 and 50 topics (note that the held-out sample is drawn randomly with each run, so even with the same seed the results might differ).
set.seed(9999)
storage1<-searchK(docs, vocab, K = c(5,10,15,20,50),
                  prevalence =~ rateby, data = meta, verbose = FALSE)
print(storage1$results)
options(repr.plot.width=6, repr.plot.height=6)
plot(storage1)
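Beyond the default plot, the individual diagnostics live in storage1$results and can be inspected directly. A minimal sketch, not part of the original analysis (depending on the stm version the columns may be stored as list-columns, hence the unlist calls):
res<-data.frame(K = unlist(storage1$results$K),
                heldout = unlist(storage1$results$heldout),
                residual = unlist(storage1$results$residual))
plot(res$K, res$heldout, type = "b", xlab = "Number of topics (K)", ylab = "Held-out likelihood") # higher is better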
Within certain boundaries, it seems that the choice of model is a matter of trade-offs. In our case, the best results seem to lie in the 10-20 range. It can then be helpful to compare semantic coherence to exclusivity, as models with fewer topics tend to have higher semantic coherence but lower exclusivity. To check this, however, we first have to fit the models, which is what we do next. We will set the initialization method to the default “Spectral”, as advised by the authors of the package, although alternatives are available (the vignette offers further information on the different initialization methods, and a sketch of one alternative follows the fitting calls below). Also in this case, the post by Julia Silge mentioned above presents an alternative procedure.
model10Prrateby<-stm(documents=out$documents,
vocab=out$vocab, prevalence =~ rateby, K=10, data=out$meta, init.type = "Spectral", verbose=FALSE)
model15Prrateby<-stm(documents=out$documents,
vocab=out$vocab, prevalence =~ rateby, K=15, data=out$meta, init.type = "Spectral", verbose=FALSE)
model20Prrateby<-stm(documents=out$documents,
vocab=out$vocab, prevalence =~ rateby, K=20, data=out$meta, init.type = "Spectral", verbose=FALSE)
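For reference, and as anticipated above, an alternative initialization can be requested simply by changing init.type; a hedged one-liner (not run here) for a 15-topic model initialized with a short run of collapsed Gibbs sampling for LDA might look like:
model15LDA<-stm(documents=out$documents, vocab=out$vocab, prevalence =~ rateby, K=15, data=out$meta, init.type = "LDA", seed = 9999, verbose=FALSE) # hypothetical alternative, not used below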
As mentioned above, we first check exclusivity against semantic coherence per topic for each model. stm includes the function topicQuality to plot them for a given model, but the visual result tends to be confusing and it cannot plot multiple models together, so on this occasion we will proceed with ggplot:
suppressWarnings(library(ggplot2))
suppressWarnings(library(plotly))
M10ExSem<-as.data.frame(cbind(c(1:10),exclusivity(model10Prrateby), semanticCoherence(model=model10Prrateby, docs), "Mod10"))
M15ExSem<-as.data.frame(cbind(c(1:15),exclusivity(model15Prrateby), semanticCoherence(model=model15Prrateby, docs), "Mod15"))
M20ExSem<-as.data.frame(cbind(c(1:20),exclusivity(model20Prrateby), semanticCoherence(model=model20Prrateby, docs), "Mod20"))
ModsExSem<-rbind(M10ExSem, M15ExSem, M20ExSem)
colnames(ModsExSem)<-c("K","Exclusivity", "SemanticCoherence", "Model")
ModsExSem$Exclusivity<-as.numeric(as.character(ModsExSem$Exclusivity))
ModsExSem$SemanticCoherence<-as.numeric(as.character(ModsExSem$SemanticCoherence))
options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100)
plotexcoer<-ggplot(ModsExSem, aes(SemanticCoherence, Exclusivity, color = Model))+geom_point(size = 2, alpha = 0.7) +
geom_text(aes(label=K), nudge_x=.05, nudge_y=.05)+
labs(x = "Semantic coherence",
y = "Exclusivity",
title = "Comparing exclusivity and semantic coherence")
plotexcoer
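As a side note, plotly was loaded above but is not needed for the static plot; if an interactive version is preferred, the ggplot object can be passed to ggplotly:
ggplotly(plotexcoer) # optional interactive version of the same scatterplot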
The three models appear fairly similar in this regard, with the 20-topic model having two outliers with relatively lower semantic coherence and the 10-topic model tending to have lower exclusivity. For the sake of the present exercise we will therefore proceed with the 15-topic model, which appears to be a good compromise. The other two models can nonetheless be considered good enough, and in another situation they might be preferable; as mentioned above, the choice of model can be very context-specific. In the next entry, which I hope to post shortly, we will explore this 15-topic model in order to understand something more about it.