This is the fourth of a series of posts where I explore the application of structural topic models and its R implementation. The other posts are here, here and here; while this is the github repository with the jupyter notebooks where the contents of the posts are better integrated with code snippets and visualizations. The notebook with the contents of this and the previous post is better rendered in Jupyter nbviewer.
At the end of the previous post, we had decided to go ahead with the model with 15 topics. Next, we want to know a bit better our model. stm stores the document-topic proportions and the topic-word distributions in two matrices, θ (which is also refferred to, somewhat confusingly, as γ) and β.
We shall start having a look at θ, which can be called directly from the model, or more conveniently with the built-in function make.dt, which allows to load also the metadata (which in our case is helpful considering we used rateby as prevalence co-variate).
topicprop15<-make.dt(model15Prrateby, meta) topicprop15[1:3,c(1:18, 27)]#visualize proportions, job title and rateby
|1||0.011672||0.023733||0.000349||8.56E-02||0.010568||0.007209||0.002097||0.13695||0.713769||2.39E-05||0.002362||1.93E-05||0.003276||0.000651||0.001745||1||Dementia Friendly Communities Officer||year|
Consulting the table might be quite cumbersome unless you want to have a look at the topic proportions of a specific document. stm offers the plot.STM function with “hist” as argument in order to visualize the estimates of document-topic proportions (Julia Silge in another post offers a different procedure using ggplot to otbain a similar plot, I have included this in the appendix):
This plot basically tells us which topics are coming from which documents. As expected, each topic has no relation or very little relation with several documents. I personally found also helpful to plot the topic distribution per document, using tidytext and ggplot:
suppressWarnings(library(tidytext))# the package sometimes is not loaded correctly. If this happens, you might have to re-start the kernel td_theta <- tidytext::tidy(model15Prrateby, matrix = "theta") selectiontdthteta<-td_theta[td_theta$document%in%c(1:15),]#select the first 30 documents. be careful to select a sensible interval, as attempting to load a very huge corpus might crash the kernel thetaplot1<-ggplot(selectiontdthteta, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) + geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) + facet_wrap(~ document, ncol = 3) + labs(title = "Theta values per document", y = expression(theta), x = "Topic") thetaplot1
As an example, we see here that document 9 is strongly associated with topic 15 and document 15 with topic 8, whereas documents 10 or 13 are more mixed. Next, we want to understand more about each topic – what are they really about. If we go back to the β matrix, we can have a more analytical look at the word frequencies per topic. The matrix stores the log of the word probabilities for each topic, and plotting it can give us a good overall understanding of the distribution of words per topic (the procedure below is mutuated from the second post of Julia Silge ):
suppressWarnings(library(dplyr)) suppressWarnings(library(drlib))#drlib is available on github and needs devtools to be installed td_beta <- tidytext::tidy(model15Prrateby) options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100) td_beta %>% group_by(topic) %>% top_n(10, beta) %>% ungroup() %>% mutate(topic = paste0("Topic ", topic), term = reorder_within(term, beta, topic)) %>% ggplot(aes(term, beta, fill = as.factor(topic))) + geom_col(alpha = 0.8, show.legend = FALSE) + facet_wrap(~ topic, scales = "free_y") + coord_flip() + scale_x_reordered() + labs(x = NULL, y = expression(beta), title = "Highest word probabilities for each topic", subtitle = "Different words are associated with different topics")
In this case I found pretty helpful to go and have a more detailed look at the word distribution within each topic (the plot above focus only on the top 10 words for each topic):
betaT1<-td_beta %>% mutate(topic = paste0("Topic ", topic), term = reorder_within(term, beta, topic)) %>%filter(topic=="Topic 1")#beta values for topic 1 betaplotT1<-ggplot(betaT1[betaT1$beta>0.003,], aes(term, beta, fill = as.factor(topic))) + geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta), title = "Word probabilities for Topic 1")#plot word probabilities higher than 0.003 for topic 1 betaplotT1
Also for topic exploration the stm package offers some nice built-in functions that make analysis easier. In order to identify the content of each topic, we can use again plot.STM, this time with argument “summary” to visualize the topic distribution (which topics are overall more common), with most common words for each topic (as we will see later there are several options for the words to be visualized, but here we will leave the default).
plot.STM(model15Prrateby, "summary", n=5)# distribution and top 5 words per topic
labelTopics (or sageLabels) gives us more detailed insights on the popular words in each topic. As mentioned above, other than the highest probability, we can visualize the FREX words (FREX weights words by frequency and exclusivity to the topic), lift words (frequency divided by frequency in other topics), and score (similar to lift, but with log frequencies). The vignette and the manual of stm offer more details on this aspect. plot.STM can again be used with “labels” as argument to plot the words in a more visually appealing way (although I found the final result not always satisfactory).
This type of analyis can be helpful in particular to make comparisons between topics and understand more which differences there are between them. As an example, topics 5,6,9 seem to have a degree of overlap, as they are all focusing on customer care, so we might want to check them in more detail:
labelTopics(model15Prrateby, topics=c(5,6,9), n=10)# complete list of top 10 words per topics 5-6-9 plot.STM(model15Prrateby, "labels", topics=c(5,6,9), label="frex", n=10, width=55)#top 10 FREX words per topics 5-6-9
Highest Prob: custom, team, store, servic, sale, work, product, hour, role, retail
FREX: store, retail, sale, love, kitchen, brand, product, great, discount, price
Lift: ‘can, afford, ambassador, here, high-street, lowest, merchandis, pub, sister, tv’s
Score: store, sale, saver, custom, superdrug, retail, bike, fashion, kitchen, love
Topic 6 Top Words:
Highest Prob: custom, role, skill, excel, client, administr, offic, manag, team, experi
FREX: order, invoic, purchas, supplier, administr, detail, client, manufactur, tax, verbal
Lift: bill, bullet, coleman, courier, expedit, fork, forklift, inventori, law, louis
Score: sale, purchas, invoic, custom, manufactur, adecco, stock, ledger, angel, global
Topic 9 Top Words:
Highest Prob: custom, ’ll, servic, make, need, role, help, can, ’re, work
FREX: ’ll, ’re, just, everyon, thing, don’t, realli, want, legal, get
Lift: bag, colour, discrimin, forget, moment, offici, optic, perhap, perk, prejudic
Score: ’ll, ’re, custom, realli, don’t, rta, everyon, thing, showroom, addleman
We can also have a glimpse at highly representative documents per each topic with findThoughts, and plot them with plotQuote (although this might give best results with shorter documents):
thoughts5 <- findThoughts(model15Prrateby,texts=DF$Description, topics=5, n=3)$docs[]# take 3 representative documents per topic 5 thoughts6 <- findThoughts(model15Prrateby,texts=DF$Description, topics=6, n=3)$docs[]# take 3 representative documents per topic 6 thoughts9 <- findThoughts(model15Prrateby,texts=DF$Description, topics=9, n=3)$docs[]# take 3 representative documents per topic 9 plotQuote(thoughts5, width=30, maxwidth=500, text.cex=1.25, main="Topic 5") plotQuote(thoughts6, width=30, maxwidth=500, text.cex=1.25, main="Topic 6") plotQuote(thoughts9, width=30, maxwidth=500, text.cex=1.25, main="Topic 9")
We might assume that topic 5 is more related to retail, with positions like store assistant, topic 6 closer to back office and sales positions and topic 9 somewhere in between. It is worth noting that several words in topic 9 seem to have weak semantic content (like “‘ll”, “‘re”, “‘just”), and we might consider to re-run the model taking them out as stopwords. If we go back to the distribution of estimates, very few documents seem to have a strong association with topic 9, and if we recover the original exclusivity-semantic coherence plot, we note that topic 9 has the lowest exclusivity in the model we are using:
Another very helpful function to use for topic analysis is again plot.STM with “perspectives” as argument, wich allows us to have a graphical display of topical contrasts, where words are plotted with size proportional to their use within the topic and oriented along the X-axis based on how much they favor one topic against the other (the vertical configuration is random):
Our supposition about the differences between topic 6 and 5 seems confirmed, while the topic 9 is confirmed to have a weak semantic content. We will probably have to re-run the model taking out some additional stopwords. However, we can first conclude this overview of the main functions of stm having a look at the difference made by the inclusion of “rateby” as a topic prevalence co-variate. As mentioned in a previous post, the Structural Topic Model allows the analysis of relationships with document metadata, in form of co-variates, either in terms of the degree of association of a document to a topic, either of the association of a word to a topic. The function to extract the relationship and associated uncertaintuy on all the topics of the model is estimateEffect, which basically estimates a regression with documents as units, the outcome is the proportion of each document about a topic in an STM model and the covariates are document-meta data:
out$meta$rateby<-as.factor(out$meta$rateby) prep<-estimateEffect(1:15~ rateby, model15Prrateby, metadata=out$meta, uncertainty="Global")#nsim is defaulted to 25, but on a small model a higher number lead to more consistent results summary(prep, topics=c(1:3), nsim=1000)# summary of regression on topic 1-3
estimateEffect(formula = 1:15 ~ rateby, stmobj = model15Prrateby,
metadata = out$meta, uncertainty = “Global”)Topic 1:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03983 0.01380 2.885 0.00411 **
ratebyday 0.03308 0.04554 0.726 0.46801
ratebyhour 0.06448 0.02136 3.019 0.00269 **
ratebymonth -0.02540 0.18667 -0.136 0.89184
ratebyweek -0.04007 0.05885 -0.681 0.49628
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1Topic 2:Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.038904 0.014162 2.747 0.00627 **
ratebyday -0.026867 0.040058 -0.671 0.50278
ratebyhour 0.034647 0.019766 1.753 0.08036 .
ratebymonth 0.008482 0.210999 0.040 0.96795
ratebyweek -0.038638 0.053918 -0.717 0.47401
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1Topic 3:Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.021089 0.011812 1.785 0.074928 .
ratebyday 0.453879 0.049058 9.252 < 2e-16 ***
ratebyhour 0.007749 0.015948 0.486 0.627310
ratebymonth -0.020844 0.147313 -0.141 0.887546
ratebyweek 0.220659 0.065674 3.360 0.000851 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
For reasons of space here we printed only the first three topics, but overall it seems that rateby doesn’t have significant effectes, with the only exception of topic 3 (educational jobs) with the covariates week and day. The results of estimateEffect can also be plotted with a variety of methods, pointestimate being probably the best for factor variables as in our case:
options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100) plot.estimateEffect(prep, model=model15Prrateby, covariate="rateby", topics=3, method="pointestimate", xlim=c(-.5,1))
We can basically infer that for offers related to topic 3 the rate tend to be computed by day rather than by other factors. As mentioned at the beginning, other covariates might offer more insightful results, and rateby was selected just for simplicity reasons.
In this document we have seen some diagnostic techniques to select a number of topics for a stm model, and gave overview of the main functions offered by the stm R package. There are a number of additional add-on packages and functions that allow further analysis, but we will explore them in a latter stage. Some of them are mentioned in the paragraph below.
Based on the results of this work, we identified a model with 15 topics as a good fit, but we also concluded that we should re-run the model on a corpus from where more stopwords are excluded. We will proceed in this way in the next entry of this series, where we will give some more depth to our analysis. In particular, we will include further variables to the analysis, explore the use of topical content co-variates, and focus more on topic correlations.
References and helpful resources
A good idea is to read in detail the package vignette, where some of the functions used here are presented in more detail. The theoretical aspects of the model are explained in more detail in Roberts et al., “A model of text for experimentation in the social sciences”, 2016. The website of the authors of the model and the package offers an extensive repository of resources, including supporting packages for analysis and visualization, as well as a list of scientific articles adopting stm.
Julia Silge has authored two interesting posts (one and two) where stm is used adopting tidy data principles.
The procedure below has been used by Julia Silge to obtain the distribution of document probabilities per each topic, in fashion similar to plot.STM with “hist” as argument. It might be helpful if using a tidy data approach, or wants to have a higher degree of control on the visualization.
ggplot(td_theta, aes(gamma, fill = as.factor(topic))) + geom_histogram(alpha = 0.8, show.legend = FALSE) + facet_wrap(~ topic, ncol = 3) + labs(title = "Distribution of document probabilities for each topic", y = "Number of documents", x = expression(theta))