Topic Modeling and Digital Humanities: Overview (1)
In this post:
- What is a topic model?
- UX considerations
- Existing techniques
This is the first post in a series as we begin a new project exploring topic modeling for the digital humanities, following the previous work of TOME.
A topic model is a model of how often words occur together in a group of texts. Many other posts define “topic model” in detail and describe the various algorithms.
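To make "words occurring together" concrete, here is a minimal sketch in Python that only counts how many texts contain both words of a pair. This is not a topic model in itself (algorithms such as LDA fit a probabilistic model on top of statistics like these), and the three tiny texts and the helper `cooccurrence_counts` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration only.
texts = [
    "whale ship sea captain",
    "ship sea storm sail",
    "garden flower rain spring",
]

def cooccurrence_counts(docs):
    """For each unordered word pair, count the texts that contain both words."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))  # unique words, in a stable order
        counts.update(combinations(words, 2))
    return counts

counts = cooccurrence_counts(texts)
print(counts[("sea", "ship")])    # 2
print(counts[("flower", "sea")])  # 0
```

Words that keep showing up in the same texts ("sea" and "ship" here) are the kind of pattern a topic model groups into a topic.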
Here, I am going to highlight some challenges of viewing, exploring, and learning from topic modeling results.
How much information to show at once? Topic models typically have a lot of topics, which leads to information overload. How can you browse the model in a way that lets you learn something without being overwhelmed?
How can you understand different views of the model? The results change based on the number of topics specified by the user. A small number means separate topics will merge; a larger number means combined topics will split. Both are correct, but each occludes information.
Can you design for both a user familiar with the contents of the texts and a user who is unfamiliar? These are likely very different use cases. One will have questions in mind and one will be attempting to gain an initial understanding.
How do you design for trust? The model may be misleading in some respects and not in others. It’s not inherently bad if it is misleading, so long as the user recognizes that it is, and design can aid with this understanding.
What capabilities can metadata add? Many topic models disregard metadata and just use content. But if we use metadata, which might include things like author’s gender, race, regional location, and year, how else might one be able to explore the model?
How do you make the topic model a sustainable addition to existing workflows? Topic models would presumably be more useful if they were integrated into the ways people already work. This especially applies to people who may be less familiar with technical fields like computer science.
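Returning to the metadata question above, one possibility is to average each topic's weight over the documents that share a metadata value. The sketch below assumes the model gives a weight per topic for every document; the weights, the `year` field, and the helper `mean_topic_weights_by` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-document output of a topic model: one weight per topic.
documents = [
    {"year": 1850, "topics": [0.75, 0.25]},
    {"year": 1850, "topics": [0.25, 0.75]},
    {"year": 1900, "topics": [0.125, 0.875]},
]

def mean_topic_weights_by(docs, field):
    """Average the topic weights of all documents sharing the same metadata value."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[field]].append(doc["topics"])
    return {
        value: [sum(column) / len(column) for column in zip(*rows)]
        for value, rows in grouped.items()
    }

by_year = mean_topic_weights_by(documents, "year")
print(by_year[1850])  # [0.5, 0.5]
print(by_year[1900])  # [0.125, 0.875]
```

The same grouping would work for any other metadata field, which is one way an interface could let a user compare topics across authors, regions, or years.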
Existing techniques (some of them)
Dendrograms: A type of tree diagram emphasizing hierarchical clustering
More on the definition: https://en.wikipedia.org/wiki/Dendrogram
Pro: It maintains more of the model’s complexity
Con: Its strictly branching structure restricts links between topics
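As a rough illustration of what a dendrogram encodes, the sketch below repeatedly merges the two closest topics and records the distance at which each merge happens, which is the height a dendrogram would draw. It assumes topics can be compared as vectors of word weights; the three topics, the Manhattan distance, and the single-linkage rule are choices made for this example, not something prescribed by any particular tool.

```python
from itertools import combinations

# Hypothetical topics as weights over a shared four-word vocabulary.
topics = {
    "whaling": [0.5, 0.5, 0.0, 0.0],
    "sailing": [0.25, 0.75, 0.0, 0.0],
    "gardens": [0.0, 0.0, 0.5, 0.5],
}

def distance(a, b):
    """Manhattan distance between two topic vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def agglomerate(named_vectors):
    """Merge the two closest clusters until one remains; return the merges in order."""
    clusters = {(name,): [vec] for name, vec in named_vectors.items()}
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            # Single linkage: distance between the closest members of two clusters.
            return min(distance(a, b) for a in clusters[pair[0]] for b in clusters[pair[1]])
        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
    return merges

merges = agglomerate(topics)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

Each topic ends up on exactly one branch, which is the restriction noted in the con above: a topic that relates to two different groups can only be shown with one of them.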
Network diagrams
Pro: Shows some connections; nodes can be sized based on number of occurrences
Con: Topic models aren’t actually networks
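A network view of this kind can be sketched by treating topics as nodes, sizing each node by an occurrence count, and drawing an edge wherever two topics are similar enough. Everything here is an assumption made for illustration: the topic weights, the occurrence counts, cosine similarity as the measure, and the 0.5 threshold.

```python
from itertools import combinations
from math import sqrt

# Hypothetical topics: word weights plus a made-up occurrence count.
topics = {
    "whaling": {"weights": [0.5, 0.5, 0.0, 0.0], "occurrences": 40},
    "sailing": {"weights": [0.25, 0.75, 0.0, 0.0], "occurrences": 25},
    "gardens": {"weights": [0.0, 0.0, 0.5, 0.5], "occurrences": 10},
}

def cosine(a, b):
    """Cosine similarity between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def build_network(topic_data, threshold=0.5):
    """Return node sizes and the edges whose similarity reaches the threshold."""
    nodes = {name: info["occurrences"] for name, info in topic_data.items()}
    edges = [
        (a, b)
        for a, b in combinations(topic_data, 2)
        if cosine(topic_data[a]["weights"], topic_data[b]["weights"]) >= threshold
    ]
    return nodes, edges

nodes, edges = build_network(topics)
print(nodes)  # {'whaling': 40, 'sailing': 25, 'gardens': 10}
print(edges)  # [('whaling', 'sailing')]
```

The edges exist only because of the chosen threshold. A different cutoff would draw a different graph from the same model, which is one way to read the con above.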
PCA (Principal Component Analysis)
Projects the model into two dimensions
Pro: Avoids implying a network structure that the model doesn’t have
Con: Word labels overlap in the plot
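For readers curious what the projection involves, here is a small PCA sketch in plain Python. It assumes each topic is a vector of word weights, finds the two directions of greatest variance by power iteration, and returns 2D coordinates. The topic vectors and the function names are hypothetical, and a real project would more likely rely on a numerical library than on this hand-rolled version.

```python
from math import sqrt

# Hypothetical topic vectors over a four-word vocabulary.
topics = {
    "whaling": [0.5, 0.5, 0.0, 0.0],
    "sailing": [0.25, 0.75, 0.0, 0.0],
    "gardens": [0.0, 0.0, 0.5, 0.5],
    "weather": [0.0, 0.25, 0.25, 0.5],
}

def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def top_component(cov, iterations=200):
    """Power iteration: apply cov and renormalize to approach the leading eigenvector."""
    vec = [1.0 / (i + 1) for i in range(len(cov))]  # arbitrary non-zero start
    for _ in range(iterations):
        vec = matvec(cov, vec)
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        vec = top_component(cov)
        eigenvalue = sum(v * w for v, w in zip(vec, matvec(cov, vec)))
        components.append(vec)
        # Deflate: remove the found direction so the next pass finds the second one.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]

coords = dict(zip(topics, pca_2d(list(topics.values()))))
for name, (x, y) in coords.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Topics with similar word weights land near each other in the plane, so their labels are exactly the ones that collide when plotted.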
There are many other interfaces, particularly ones specific to a given dataset. These will be the subjects of the next few blog posts.