A function to help the selection of structural topic models

It has been quite some time since my last post here (and even that one came long after the previous). Like everyone, I have found the last few months quite challenging, but most importantly I have been busy finalizing my Master's dissertation, which evolved in an unexpected direction and required more work than expected. Once it is wrapped up, I may write something here about it and the whole Master's experience.

In the meantime, I wanted to share a simple wrapper function I put together during that work, which I hope can be handy for anyone working with the structural topic model. For those who want to know more about topic modelling and the structural topic model, I recommend checking the resources listed below, or my old posts about them.

In topic modelling, there is no “right” answer for the number of topics appropriate to a given corpus; there are, however, a number of metrics that can guide the choice. The stm R package offers the searchK function to quickly look at some of these metrics, namely held-out likelihood, residual analysis, average exclusivity and semantic coherence. Roughly speaking, held-out likelihood and residual analysis give a good understanding of model fit, whereas exclusivity and semantic coherence focus on the quality of the topics. Triangulating all of them can be quite challenging, especially in the case of semantic coherence and exclusivity, which tend to be anti-correlated.
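As a quick sketch of how these diagnostics can be obtained, the following uses searchK on the gadarian dataset bundled with stm; the range of K values and the prevalence formula are illustrative only, taken from the package's standard examples.

```r
library(stm)

# Standard stm preprocessing of the gadarian open-ended responses
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit models for a few candidate K and collect held-out likelihood,
# residuals, average exclusivity and semantic coherence
kResult <- searchK(out$documents, out$vocab, K = c(4, 6, 8),
                   prevalence = ~ treatment + s(pid_rep), data = out$meta)

plot(kResult)  # one panel per diagnostic, across the candidate K values
```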

Semantic coherence is a metric related to pointwise mutual information; the core idea is that in a semantically coherent model the words most probable under a topic should co-occur within the same documents. While this metric correlates with human judgement, models with fewer topics tend to score higher on it. As a counterpoint, Roberts et al. suggested also using the exclusivity measure, which is calculated for each topic within a model by summing the FREX (frequency-exclusivity) scores of the most frequent words within the topic itself.
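Both metrics are exposed directly by stm, returning one value per topic. A minimal sketch, again on the gadarian data, with K chosen arbitrarily:

```r
library(stm)

# Preprocess the bundled gadarian data and fit a small model (K is arbitrary)
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
mod <- stm(out$documents, out$vocab, K = 5, data = out$meta, verbose = FALSE)

# One value per topic; higher is better for both metrics
semanticCoherence(model = mod, documents = out$documents)
exclusivity(mod)
```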

As semantic coherence and exclusivity tend to be anti-correlated, choosing a model based on them is a matter of trade-off. It is however possible to identify a Pareto-optimal set of models maximizing both metrics; to this end I found the rPref package particularly helpful.
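To illustrate the idea, here is a self-contained sketch using rPref's psel on made-up model averages (the numbers are purely hypothetical):

```r
library(rPref)

# Hypothetical per-model averages, for illustration only
models <- data.frame(K = 4:8,
                     SemanticCoherence = c(-55, -60, -63, -68, -72),
                     Exclusivity = c(9.0, 9.3, 9.2, 9.6, 9.7))

# Maximize both metrics; with top = nrow(), psel labels every model
# with its Pareto level (.level == 1 is the non-dominated front)
front <- psel(models, high(SemanticCoherence) * high(Exclusivity),
              top = nrow(models))

# Models on the dominating frontier (here K = 6 is dominated by K = 5,
# which scores higher on both metrics)
subset(front, .level == 1)
```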

There are a number of options to further narrow the choice to a single model within the identified Pareto-optimal set. Whereas the authors of the stm package, in a similar setting, opted for a simple random choice within the Pareto set, a possible alternative is to pick the model whose topics are the least dispersed across the two (scaled) dimensions, semantic coherence and exclusivity, so as to ensure that all the topics within the model are of uniform quality.

My usual workflow is then to check the held-out likelihood and the residuals in order to exclude underperforming models and identify a bracket of viable models, and then, within this bracket, run the function below, which identifies the Pareto set as described above (I suggest checking it on GitHub). The function returns a list whose first element is either the whole Pareto set of models or, if the researcher prefers, only the least dispersed model, the one with the fewest topics, or a random one. The other elements of the list are the skyline plot and the average semantic coherence/exclusivity values for all the models taken into consideration. I give here a basic example of the workflow, using the gadarian dataset included in the stm package.

Resources
On topic models in general, see the general introductions by Megan R. Brett, “Topic Modeling: A Basic Introduction”, Journal of Digital Humanities 2:1, 2012, and Ted Underwood, “Topic Modeling Made Just Simple Enough”. A bibliography about topic modelling is available on David Mimno’s blog.
For the Structural Topic Model, the best place to start is the authors’ website, where all the method papers can be found, alongside a selection of works using the model. If you are interested in the R package, apart from the reference manual, the vignette presents a comprehensive overview.
Finally, Julia Silge has given a good introduction to the STM and alternative workflows here and here.

# Arguments
# low = lowest number of topics to train a model on
# up = highest number of topics to train a model on
# out = output from prepDocuments
# formula = formula for topic prevalence covariates (optional)
# formulaContent = formula for topic content covariates (optional)
# init = method of initialization, spectral by default
# selection = method of selection for the final model: "All" (returns all the
#   models in the Pareto set, the default), "Dispersion" (the model with the
#   least dispersion; requires K > 2), "Random" (a random choice), "Min" (the
#   model with the fewest topics)
# seed = seed for the "Random" selection (optional)
GetSTModels <- function(low, up, out, formula = NULL,
                        formulaContent = NULL,
                        init = c("Spectral", "LDA", "Random", "Custom"),
                        selection = c("All", "Dispersion", "Random", "Min"),
                        seed = NA) {
  library(stm)
  library(tidyverse)
  library(furrr)
  library(rPref)
  library(aspace)
  init <- match.arg(init)
  selection <- match.arg(selection)
  
  # Train one model for each K in low:up (future_map allows parallel runs)
  ModelsTibble <- tibble(K = low:up) %>%
    mutate(topic_model = future_map(K, ~stm(documents = out$documents,
                                            vocab = out$vocab, prevalence = formula,
                                            content = formulaContent,
                                            K = ., data = out$meta,
                                            init.type = init, verbose = FALSE)))
  
  
  # Per-topic semantic coherence and exclusivity for every model
  AllModelsExclSemCohTibble <- ModelsTibble %>%
    mutate(SemanticCoherence = future_map(ModelsTibble$topic_model,
                                          ~semanticCoherence(model = ., out$documents)),
           Exclusivity = future_map(ModelsTibble$topic_model,
                                    ~exclusivity(model = .)))
  
  # Model-level averages (centroids) of the two metrics
  centroidsT <- AllModelsExclSemCohTibble %>%
    mutate(SemanticCoherence = future_map(AllModelsExclSemCohTibble$SemanticCoherence,
                                          ~mean(.)),
           Exclusivity = future_map(AllModelsExclSemCohTibble$Exclusivity,
                                    ~mean(.)))
  
  # Compute the Pareto frontiers of the model averages
  # (see https://journal.r-project.org/archive/2016-2/roocks.pdf)
  PrefSemcohExclusLevsT <- psel(centroidsT, high(unlist(SemanticCoherence)) * high(unlist(Exclusivity)), top = nrow(centroidsT))
  # Dominating (level-1) Pareto frontier
  PrefSemcohExclusTopKT <- PrefSemcohExclusLevsT[PrefSemcohExclusLevsT$.level == 1, ]
  
  PrefSemcohExclusLevsT$SemanticCoherence<-unlist(PrefSemcohExclusLevsT$SemanticCoherence)
  PrefSemcohExclusLevsT$Exclusivity<-unlist(PrefSemcohExclusLevsT$Exclusivity)
  
  # Keep the per-topic metrics only for the models on the dominating frontier
  AllModelsExclSemCohSelT <- AllModelsExclSemCohTibble[AllModelsExclSemCohTibble$K %in% PrefSemcohExclusTopKT$K, ]
  
  
  # Per-topic metrics, scaled, for the models in the Pareto set
  ScaledCentroidsT <- data.frame(SemanticCoherenceSc = scale(unlist(AllModelsExclSemCohSelT$SemanticCoherence)),
                                 ExclusivitySc = scale(unlist(AllModelsExclSemCohSelT$Exclusivity)),
                                 K = unlist(future_map(AllModelsExclSemCohSelT$K, ~rep(., .))))
  
  if (selection == "Random") {
    if (!is.na(seed)) set.seed(seed)
    selected <- sample(ModelsTibble$K, 1)
    SelectedModel <- ModelsTibble[ModelsTibble$K == selected, ][[2]][[1]]
  }else if (selection=="Min") {
    SelectedModel<-ModelsTibble[ModelsTibble$K==min(ModelsTibble$K),][[2]][[1]]  
  } else if (selection == "Dispersion") {
    # Standard distance deviation (aspace::calc_sdd) of each model's topics
    # in the scaled coherence-exclusivity space
    DispersionT <- future_map(unique(ScaledCentroidsT$K),
                              ~calc_sdd(id = .,
                                        filename = "disp.txt",
                                        points = ScaledCentroidsT[, c(1, 2)][ScaledCentroidsT$K == ., ]))
    
    DispersionT<-bind_rows(DispersionT)
    
    lessDispT<-DispersionT[DispersionT$SDD.area == min(DispersionT$SDD.area),]
    
    SelectedModel<-ModelsTibble[ModelsTibble$K==lessDispT$id,][[2]][[1]]
  } else {
    # "All": return every model in the Pareto set
    SelectedModel <- ModelsTibble[ModelsTibble$K %in% PrefSemcohExclusTopKT$K, ][[2]]
  }
  
  ParetoPlotT<-ggplot(PrefSemcohExclusLevsT, aes(SemanticCoherence, Exclusivity))+
    geom_point(size = 2.5, alpha = 0.7, aes(colour=as.factor(.level)),show.legend = FALSE) + 
    geom_text(aes(label=K,colour=as.factor(.level)), nudge_x=.001, nudge_y=.005,show.legend = FALSE) +
    geom_step(direction = "vh",aes(colour=as.factor(.level)),show.legend = FALSE)+ 
    labs(x = "Semantic coherence",
         y = "Exclusivity",
         title = "Exclusivity and semantic coherence (model average), with Pareto front line for each level")+
    theme_bw()
  
  AllResultsT<-list(SelectedModel,ParetoPlotT,PrefSemcohExclusLevsT)
  return(AllResultsT)
  
  
}
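A minimal sketch of the workflow on the gadarian dataset; the bracket 4-8 and the prevalence formula are illustrative, and in practice the bracket would come from the held-out likelihood and residual checks described above.

```r
library(stm)

# Standard stm preprocessing of the bundled gadarian data
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Train models for K in 4:8 and keep the least-dispersed model
# on the Pareto front
res <- GetSTModels(low = 4, up = 8, out = out,
                   formula = ~ treatment + s(pid_rep),
                   selection = "Dispersion")

res[[1]]  # the selected stm model
res[[2]]  # the Pareto (skyline) plot
res[[3]]  # average coherence/exclusivity per model, with Pareto levels
```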

