This is the sixth (and, I think, for a while the last) entry in a series of posts exploring the structural topic model and its R implementation. The other posts are here, and this is the GitHub repository with the Jupyter notebooks, where the contents of the posts are better integrated with code snippets and visualizations.
In the previous post we saw the effect of a discrete variable on topic distributions. Here we will see a different example, this time based on a spatial dummy variable (I drew inspiration for this from a study by Sonya Sachdeva et al. on tweets as an element to model wildfire smoke dispersion). Spatial modelling is not really possible within stm as of now; however, some rudimentary solutions can be devised with dummy variables. As an example, we check whether being located in Newcastle upon Tyne has any influence on the topics of a job offer. We create a dummy variable and then train a new model with 20 topics:
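# Dummy variable: 1 if the offer is located in Newcastle upon Tyne, 0 otherwise
DF$isnewcastle <- as.numeric(DF$Location == "Newcastle upon Tyne")
# Preprocess the descriptions, carrying the metadata (including the new dummy) along
processed2 <- textProcessor(DF$Description, metadata = DF, customstopwords = c("work", "will", "ll", "re", "just"), verbose = FALSE)
out2 <- prepDocuments(processed2$documents, processed2$vocab, processed2$meta, verbose = FALSE)
# Fit a 20-topic model with the dummy as an additional prevalence covariate
model20b <- stm(documents = out2$documents, vocab = out2$vocab, K = 20,
                prevalence = ~ salexp + is.part + isnewcastle,
                data = out2$meta, init.type = "Spectral", verbose = FALSE)
plot.STM(model20b, type = "summary", n = 5) # distribution and top 5 words per topic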
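The dummy variable above relies on an exact string comparison, so it is worth checking how it behaves on messy location values. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the job-offer data); it shows that the comparison is case-sensitive and that missing locations stay `NA`, and it offers a more forgiving variant as one possible alternative.

```r
# Toy data frame standing in for the job-offer data (values are made up)
toy <- data.frame(
  Location = c("Newcastle upon Tyne", "Durham", "newcastle upon tyne", NA),
  stringsAsFactors = FALSE
)

# Exact comparison, as in the post: case-sensitive, and NA locations stay NA
toy$isnewcastle <- as.numeric(toy$Location == "Newcastle upon Tyne")

# A more forgiving variant (an assumption about what one might want):
# ignore case and surrounding whitespace, and code missing locations as 0
clean <- tolower(trimws(toy$Location))
toy$isnewcastle_loose <- as.numeric(!is.na(clean) & clean == "newcastle upon tyne")

print(toy)
# Count the values of the dummy, including any NA
print(table(toy$isnewcastle, useNA = "ifany"))
```

Whether missing locations should be coded as 0 or dropped depends on the data; in either case, a quick `table()` of the dummy before fitting the model helps to spot surprises.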
We then run a new regression and plot the results. As this time the variable is binary, we can use the “difference” method of plot.estimateEffect, which estimates the mean difference in topic proportions for two different values of the covariate:
# Regress the topic proportions on the covariates
prep2 <- estimateEffect(1:20 ~ isnewcastle + salexp + is.part, model20b, metadata = out2$meta, uncertainty = "Global", nsims = 200)
# Difference in expected topic proportions: isnewcastle = 1 minus isnewcastle = 0
plot.estimateEffect(prep2, covariate = "isnewcastle", model = model20b, topics = 1:20, method = "difference",
cov.value1 = 1, cov.value2 = 0, nsims = 100, ci.level = .99,
xlab = "Outside Newcastle vs within Newcastle", labeltype = "custom", custom.labels = 1:20)
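To make the idea behind the "difference" method more concrete, here is a simplified sketch in base R on simulated data. It is not what stm does internally: stm also accounts for the uncertainty in the estimated topic proportions, and the regression in this post includes further covariates. The variable names and the simulated numbers below are assumptions made purely for illustration. With a single binary covariate, the regression coefficient is the difference between the two group means, and its confidence interval tells us whether that difference can be distinguished from zero.

```r
set.seed(42)

# Simulated data: 500 documents, about 30% of them "in Newcastle"
n <- 500
isnewcastle <- rbinom(n, 1, 0.3)

# Simulated proportion of one topic per document, clipped to [0, 1];
# by construction the topic is 0.05 more prevalent when isnewcastle == 1
theta <- pmin(pmax(0.10 + 0.05 * isnewcastle + rnorm(n, sd = 0.05), 0), 1)

# Linear regression of the topic proportion on the dummy variable
fit <- lm(theta ~ isnewcastle)

# The coefficient of the dummy is the estimated mean difference
diff_est <- unname(coef(fit)["isnewcastle"])

# 99% confidence interval for the difference
diff_ci <- confint(fit, "isnewcastle", level = 0.99)

# Same quantity computed directly from the two group means
raw_diff <- mean(theta[isnewcastle == 1]) - mean(theta[isnewcastle == 0])

print(diff_est)
print(raw_diff)
print(diff_ci)
```

If the interval does not contain zero, the difference is significant at the chosen level; this is the same reading that applies to each topic's interval in the plot.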
At the 99% confidence level, location appears to have a significant effect on topics 8, 5 and 11. Topic 5 appears related to sales and customer care, while topic 11 seems to relate to administrative jobs; explaining the effect of location on these two topics would require some additional investigation. Topic 8, however, appears related to university and research positions, so it makes sense that this topic is more likely to appear in job offers in Newcastle, which is home to two universities.
labelTopics(model20b, n=10, topics=c(5,8,11))
Topic 5 Top Words:
Highest Prob: custom, servic, busi, role, team, look, build, career, new, day
FREX: kitchen, test, youll, career, want, softwar, survey, autom, retail, know
Lift: promis, dilapid, fair, laser, massiv, mric, perk, tech, tester, -star
Score: survey, laser, kitchen, surveyor, promis, autom, water, tester, custom, youll
Topic 8 Top Words:
Highest Prob: support, univers, research, includ, work, applic, gender, contribut, well, equal
FREX: univers, gender, research, intellig, sex, sexual, regardless, orient, marit, race
Lift: librari, metal, wwwnclacuk, archaeolog, athena, berri, euraxess, everybodi, intellig, kingfish
Score: gender, univers, marit, research, intellig, everybodi, sexual, ethnic, pregnanc, archaeolog
Topic 11 Top Words:
Highest Prob: pleas, post, support, servic, administr, date, team, email, inform, job
FREX: counti, email, council, short-list, advert, administr, invit, grade, date, note
Lift: answerphon, chronolog, cyp, genogram, headquart, hrresourcesdurhamgovuk, nmw, real-, res, retriev
Score: short-list, cvs, invit, counti, notifi, advert, junk, council, notif, email
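# Use the model fitted above (model20b) and the texts that survived preprocessing,
# so that the texts stay aligned with the documents in the model
thoughts5b <- findThoughts(model20b, texts = as.character(out2$meta$Description), topics = 5, n = 3)$docs[[1]]
thoughts8b <- findThoughts(model20b, texts = as.character(out2$meta$Description), topics = 8, n = 3)$docs[[1]]
thoughts11b <- findThoughts(model20b, texts = as.character(out2$meta$Description), topics = 11, n = 3)$docs[[1]]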
options(repr.plot.width=10, repr.plot.height=6, repr.plot.res=100)
par(mfrow=c(1,3), mar=c(0,0,0,0))
plotQuote(thoughts5b,width=60, maxwidth=400, text.cex=1)
plotQuote(thoughts8b,width=60, maxwidth=400, text.cex=1)
plotQuote(thoughts11b,width=60, maxwidth=400, text.cex=1)
Conclusions
In this series, we have seen some possible applications of the structural topic model to the analysis of relationships between document metadata and the topics in the documents.
As one example, we have seen how offers with above-average salaries tend to have a higher representation of a topic related to qualified health positions, while those with lower salaries have a higher representation of topics related to manual jobs.
In another example, we have seen how offers in Newcastle can be expected to have a higher representation of the topic related to university and research.
As mentioned, the results presented here don’t have statistical value, and the examples are relatively trivial: they are to be considered only as illustrations of the use of the model and its R implementation, and of the type of investigations that can be conducted with it. The stm package offers a flexible platform for the investigation of metadata and topics. As shown by the list of studies adopting it, available on the page of the developers of the model, its main use so far has been in the social and political sciences, but other applications are possible and promising.