Messing around with STM, part IV – Analysis of metadata

This is the fifth entry of a series where I explore the application of the structural topic model and its R implementation. The other posts are here; while this is the github repository with the jupyter notebooks where the contents of the posts are better integrated with code snippets and visualizations.


In this and the next post, we will train two new models, investigating first the relation between salary expectation and topic a job offer, then to see if a specific location of a job offer can have an effect on topic. Again, as mentioned in the past, the results here have no statistical validity whatsoever and are meant to be only an example of an application of stm.

Data preparation

A first challenge concerns the type of data available, as the salary data we managed to obtain from are computed by different temporal factors (year/month/week/day/hour), and for some are given only a single value while for others a range. We then try here to identify on which percentile each salary falls (taking the average value for the salary ranges), based on the data of the UK Office for National Statistics . For the sake of simplicity, we then group the results in just three groups – below average (up to the 40th percentile), average (from 40th to 60th) and above average. Unfortunately the data are available only for year, hour and week, so we have to exclude the salaries with day and month factor.

salaries<-read.csv2("salaries.csv", sep=";")# data from
salaries[,3:13]<-as.numeric(as.character(unlist(salaries[,3:13])))#be sure that data are seen as numeric

DF<-DFa[!$Description),]#get rid of entries with no description...
DF<-DF[!$rateby),]#...or temporal rate 
DF$salaryavg<-(DF$minrate+DF$maxrate)/2#consider average values for rates
DF$is.part<-as.numeric(grepl("Part-time", DF$Type))#dummy variable is offer is part time

ratefactors<-as.character(unique(DF$rateby))#identify rate factors

lbls = c(1,2,2.5,3,4,5,6,7,7.5,8,9,10)#percentiles included in the ONS dataset

#next we identify the breaks according to temporal factor and if part or full time
breaksPartYear<-c(0, as.numeric(sort(salaries[6,3:13])), Inf)
breaksFullYear<-c(0, as.numeric(sort(salaries[5,3:13])), Inf)
breaksPartHour<-c(0, as.numeric(sort(salaries[4,3:13])), Inf)
breaksFullHour<-c(0, as.numeric(sort(salaries[3,3:13])), Inf)
breaksPartWeek<-c(0, as.numeric(sort(salaries[2,3:13])), Inf)
breaksFullWeek<-c(0, as.numeric(sort(salaries[1,3:13])), Inf)

#then we identify on which percentile the salary falls
DF[which (DF$is.part==TRUE  & DF$rateby==ratefactors[1]),]$perc<-
  as.numeric(as.vector(cut(DF[which (DF$is.part==TRUE  & DF$rateby==ratefactors[1]),]$salaryavg, breaks=breaksPartYear, right=FALSE, labels = lbls)))
DF[which (DF$is.part==FALSE  & DF$rateby==ratefactors[1]),]$perc<-
  as.numeric(as.vector(cut(DF[which (DF$is.part==FALSE  & DF$rateby==ratefactors[1]),]$salaryavg, breaks=breaksFullYear, right=FALSE, labels = lbls)))

DF[which (DF$is.part==TRUE  & DF$rateby==ratefactors[2]),]$perc<-
  as.numeric(as.vector(cut(DF[which (DF$is.part==TRUE  & DF$rateby==ratefactors[2]),]$salaryavg, breaks=breaksPartHour, right=FALSE, labels = lbls)))
DF[which (DF$is.part==FALSE  & DF$rateby==ratefactors[2]),]$perc<-
  as.numeric(as.vector(cut(DF[which (DF$is.part==FALSE  & DF$rateby==ratefactors[2]),]$salaryavg, breaks=breaksFullHour, right=FALSE, labels = lbls)))

DF[which (DF$is.part==TRUE  & DF$rateby==ratefactors[3]),]$perc<-
  as.numeric(as.vector(cut(DF[which (DF$is.part==TRUE  & DF$rateby==ratefactors[3]),]$salaryavg, breaks=breaksPartWeek, right=FALSE, labels = lbls)))
DF[which (DF$is.part==FALSE  & DF$rateby==ratefactors[3]),]$perc<-
  as.numeric(as.vector(cut(DF[which (DF$is.part==FALSE  & DF$rateby==ratefactors[3]),]$salaryavg, breaks=breaksFullWeek, right=FALSE, labels = lbls)))

DF<-DF[!$perc),]#get rid of no entries missing data

#simplify the percentiles to just three levels
simlbls<-c("below", "avg", "above")
DF$salexp<-cut(DF$perc, breaks=c(0,4,6,10), labels=simlbls)

DF$is.part<-as.factor(DF$is.part)#dummy for part time is seen as factor

DF$percF<-as.factor(DF$perc)#create a column with percentile as factor just in case

We can see that the distribution of the salaries in our (again, not statistically representative) sample doesn’t follow the national one, which can be explained, among other things, considering that tends to have less postings for qualified job offers:


Model preparation and analysis

We can next move to the model preparation. We have seen in detail a workflow for this purpose in a previous post, so this time we will just focus on a model with 20 topics. Compared to our last attempt, here we will exclude some recurring stopwords with little semantic content.

DF$Description = str_replace_all(DF$Description, "/"," ")# replace "/" with a space  
DF$Description = str_replace_all(DF$Description, "'|"|/", "'") ## links and other eventual html encoding (adapted from
DF$Description = str_replace_all(DF$Description, "<a>", " ")             ## 
DF$Description = str_replace_all(DF$Description, ">|<|&", " ")      ## DF$Description = str_replace_all(DF$Description, "&#[:digit:]+;", " ")        ## DF$Description = str_remove_all(DF$Description, "]*>")

processed<-textProcessor(DF$Description, metadata = DF, customstopwords=c("work", "will", "ll", "re", "just" ),verbose=FALSE)

out<-prepDocuments(processed$documents, processed$vocab, processed$meta, verbose=FALSE)

                          vocab=out$vocab, prevalence=~ salexp+is.part, K=20, data=out$meta, init.type = "Spectral", verbose=FALSE)

We familiarize a bit with the model, exploring the distribution of θ and β values and the topics content:

suppressWarnings(library(drlib))#drlib is available on github and needs devtools to be installed

td_theta20 <- tidytext::tidy(model20, matrix = "theta") td_beta20 %   group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for each topic"

ggplot(td_theta20, aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Distribution of document probabilities for each topic",
       y = "Number of documents", x = expression(theta))


plot.STM(model20, "summary", n=5)# distribution and top 5 words per topic


We move next to explore the relation between the salary expectation and the topicality of a job offer. The most immediate way to answer the question is to estimate a regression with the topic proportions as outcome variable and the percentile as a predictor (controlling for other variables), running the built-in estimateEffect function. We see here as an example the results of the regression on topic 17 controlling for part time:

prep<-estimateEffect(1:20~ salexp+is.part, model20, meta=out$meta, uncertainty="Global", nsims=200)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01688136 0.009674837 1.744873 8.177491e-02
salexpavg 0.05622030 0.024783473 2.268459 2.383349e-02
salexpabove 0.14116401 0.028954446 4.875383 1.569247e-06
is.part1 -0.02864489 0.021204874 -1.350864 1.775027e-01

The results can then be plotted with plot.estimateEffect. This allows to visually understand the effect of a covariate (in this case, the salary level on the job offer) on the expected topic proportion. Here we see for example topics 17 and 12:

plot.estimateEffect(prep, model=model20, covariate="salexp", topics=c(12,17), nsims = 100, ci.level=.99)#

We see significant effects for topic 17 from salaries above average, while in offers with salaries below average we can expect a representation higher than average of topic 12 (offers with average salaries have also significant effects on both topics). Topic 17 appears to be related to offers for qualified health professionals, and 12 to be related to manual jobs like cleaning, driving and warehousing, hence the results make empirical sense:

labelTopics(model20, n=10, topics=c(12,17))

Topic 12 Top Words:
Highest Prob: hour, requir, type, job, prefer, experi, salari, year, clean, must
FREX: clean, prefer, licens, licenc, drive, driver, temporari, type, hour, full-tim
Lift: avenu, tachograph, cpc, hoover, moor, temp, hartlepool, blaydon, iron, empti
Score: clean, driver, prefer, cleaner, temporari, preferredlic, licens, part-tim, licenc, saturday
Topic 17 Top Words:
Highest Prob: applic, post, pleas, servic, provid, clinic, within, opportun, trust, commit
FREX: trust, clinic, nhs, colleg, hospit, post, psycholog, cancer, foundat, medic
Lift: derwentsid, gynaecolog, irrespect, outpati, profess, trustwid, -repres, “employ, “healthcar, burn
Score: clinic, nhs, psycholog, trust, colleg, derwentsid, cancer, maxxima, teach, hospit
thoughts12 <- findThoughts(model20,texts=DF$Description, topics=12, n=3)$docs[[1]]
thoughts17 <- findThoughts(model20,texts=DF$Description, topics=17, n=3)$docs[[1]]

plotQuote(thoughts12,width=50, maxwidth=400, text.cex=.85)
plotQuote(thoughts17,width=40, maxwidth=400, text.cex=.85)

In the next post, we will explore the effect of a different variable, exploring if the location of a job offer can have an effect on its topicality.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s