TOME – DH Lab
https://dhlab.lmc.gatech.edu
The Digital Humanities Lab at Georgia Tech

Vectors of Freedom
https://dhlab.lmc.gatech.edu/tome/vectors-of-freedom/
Wed, 01 Aug 2018

For the past several years, the DH Lab has been working on a project, TOME, aimed at visualizing the themes in a corpus of nineteenth-century newspapers. In designing this tool, our central motivation was to be able to more clearly trace the various and often conflicting conversations about slavery and its abolition that were taking place in these papers, which spanned multiple audiences and communities. (More info here.)

Around the same time, a team at the University of Delaware launched the Colored Conventions Project, aimed at recovering the advocacy work performed at the Colored Conventions–organizing meetings in which Black Americans, fugitive and free, strategized about how to achieve legal, labor, and educational justice. Among the key interventions of the CCP project is to emphasize how, in the nineteenth century, organizing work took place in person as much as on the page; and how this work was performed by collectives as much as individuals.

Taking this scholarship into account, we realized that the story told in the corpus of newspapers that we’d assembled for the TOME project was, in all likelihood, a very different one from the story told through the Colored Conventions. We thought we could learn more about both conversations by looking at them together. We could ask questions like: “How did themes travel from the conventions into print, or the other way around?” “Were there people or groups who played prominent roles in one venue, or the other, or both?” “What are the key differences between the conversations that took place in person vs. those that took place in print?” And, crucially: “Who are the people or groups who have not yet been recognized for their contributions, but should be?”

In Summer 2018, Arshiya Singh (BS CS ’18), advised by Dr. Klein, began to lay the groundwork for some of the models that will help to answer these questions. What follows are a series of blog posts that document our progress.

NB: Our work employs the CCP Corpus in addition to our own. In making use of that corpus, we honor the CCP’s commitment to a use of data that humanizes and acknowledges the Black people whose collective organizational histories are assembled there. Although the subjects of datasets are often reduced to abstract data points, we affirm and adhere to the CCP’s commitment to contextualizing and narrating the conditions of the people who appear as “data” and to name them when possible.
TOME Discussion
https://dhlab.lmc.gatech.edu/tome/513/
Thu, 26 Jan 2017

Following the previous two blog posts, where I looked at a few topic modeling interfaces, I’ll return to the lab here and write about TOME. These observations are based on image documentation of the project – not actual interaction with the program.

[Screenshot: the TOME interface]

Strengths:

Multiple views: One of the first things I noticed was that it combines different views on one page. This resolves the tension between the previous two interfaces I discussed – one had too many disparate views, while the other had a single view but was limited in its representations as a result. Here, the main visualization sits at the top and is clearly the most important, since it is the largest. Underneath – cut off in the image – are other visualizations, including prevalence, related topics, and geographic distribution. This might require scrolling, but it is certainly better than a completely separate page.

Sorting and filtering: There appear to be many different ways to sort and visualize the display – including by year, relevance, popularity, and within a certain group of documents. For exploratory purposes having more options is a good thing – so long as it doesn’t get overwhelming.

Limitations:

Scrolling: It always depends on how many topics there are, but it would be ideal if they could all be shown without requiring scrolling.

Searching: Less a limitation than a note that users’ needs will vary based on how much they already know about the topic and/or set of documents. The standard search bar is good if you already know what you’re looking for, but otherwise it is not very helpful and might cause frustration.

InPhO Topic Explorer
https://dhlab.lmc.gatech.edu/tome/inpho-topic-explorer/
Thu, 26 Jan 2017

This interface works well for visualizing a topic model and understanding context. However, less effort has gone into the visual and interaction design than in the last interface discussed, which makes for a steeper learning curve.

Starting Interface

[Screenshot: starting interface]

The interface begins as a drop-down menu within the homepage, which also includes documentation on the code and how to download it. I already know that in our design, we’re focusing attention on the interface and on exploration of the topic model results. We will probably want the documentation to be separate from the interface, as in the interface from the previous blog post.

Strengths and Limitations

[Screenshot: corpus and document selection]

Two inputs: The simplicity of the starting interface is helpful. It’s clear that I need to choose a corpus of text – so I chose “Letters of Thomas Jefferson.” At first glance, the “Type to match document titles…” bar appears to require some prior knowledge of the documents. What if I don’t know any? Clicking on the “random” button – which uses the same icon as the “shuffle” button in music apps like Spotify – appears to do nothing at first. It actually works; it just takes a little too long to load. It selects the letter “To Mr. Dumas, July 13, 1790.” I can also just click the Visualize button without any document in the bar, although the interface doesn’t make this clear.

[Screenshot: the Visualize button and topic-count options]

Visualize #: The Visualize button allows selection of 20, 40, 60, or 80 topics. I have no idea how this will affect anything, so I choose 20. Having this variation might be good, but users unfamiliar with topic models may need more indication of how it changes the results.

Main Interface

[Screenshot: main interface]

Overall, it’s nice that the primary interface is more or less on one page, so there is less need to move between separate views that can feel disjointed. However, there are fewer ways to view the topic model. The main view is a horizontal bar chart, with each bar representing a document and each section of the bar representing a topic.

Strengths and Limitations

[Screenshots: topic colors in the document bars and in the topic key]

Color: I’m not sure if this was just unlucky color assignment, but the top two topics that appear the most in the focal document were assigned the same color, making it nearly impossible to distinguish between the two. Typically, when visualizing categorical data (here, the categories are the topics), people can only distinguish about 8 colors – after that, it becomes much more difficult. Here, for the 20 topics, 9 colors are used, and more than one color is assigned to multiple topics, which defeats the purpose of distinguishing by color.
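
To make the constraint concrete, here is a quick sketch – in Python with matplotlib, purely for illustration – of why 20 topics cannot each receive a distinguishable color from a standard categorical palette:

```python
# A standard categorical palette such as matplotlib's tab10 offers ten
# distinguishable colors, so coloring 20 topics forces reuse unless
# interaction or ordering helps disambiguate.
import matplotlib.pyplot as plt

n_topics = 20
cmap = plt.get_cmap("tab10")                           # 10-color palette
colors = [cmap(i % cmap.N) for i in range(n_topics)]   # colors repeat past 10

print(f"{n_topics - cmap.N} topics must share a color with another topic")
```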

Connections: The interaction of hovering over the topics in the document and showing the name of the topic in the key is helpful.

Scale: Someone familiar with topic modeling might understand the “Similarity to” scale at the top, but those who are not might want a quick note on what it means.

Checkbox features: “Normalize topic bars” draws each segment of a document’s bar in proportion to the collection as a whole, rather than to the individual document. This is definitely a useful feature for context, and using a checkbox makes it easy. Similarly, the “Alphabetical sort” option is a useful, and simple, feature.

Topic model #: Changing the number of topics via the small dropdown menu next to the focal-document dropdown is helpful. The loading time is quick, a status bar shows that the model is working, and an animation keeps the transition from being jarring. However, there is also a bar on the far left where you can click the same numbers (20, 40, 60, 80) and change the topic model, but it then transitions to a blank slate. The bar on the left likely acts as a “home” or “reset,” which is why this happens, but I’m not sure what it adds or what the use cases would be.

Reordering: Clicking on a segment sorts the documents by “Top Documents for Topic #.” This is useful for exploring context. However, the focal document then becomes lost in the reordered list: there is no highlight or visual call to attention on it, even though it is what we started with. Highlighting it would probably be a useful feature, in order to trace a document throughout explorations of various topics.

Randomizing: Randomizing the document brings up new titles, but still requires pressing Enter to display the new data. A random button allows for playful discovery, so it’s nice to have.

Undo: The browser’s back button doesn’t always take you to the exact last place in the model viewing, so an “undo” button of sorts would handle user mistakes and provide additional navigation.

Topics in PMLA interface
https://dhlab.lmc.gatech.edu/tome/topics-in-plma-interface/
Wed, 25 Jan 2017

To start thinking about how we might design an interface to interact with topic modeling results, it’s worth looking at existing interfaces and what they do well or poorly.

This interface by Andrew Goldstone allows for browsing topic models of articles from PMLA, the journal of the Modern Language Association of America. The model and code can be used for other sets of text – this is just an example.

The interface has multiple views and ways to explore the model – something can be learned from each of these. The site as a whole, and the navigation between the views, is slightly confusing and conceptually basic, but since it is described as an alpha version, I won’t dwell on those issues. The bulk of the content is in the “Overview” section.

Overview: Grid

[Screenshot: Overview – Grid]

A simple display of each topic as a circle, with each word sized according to its weight within the topic. The topic bubbles are arranged in order of topic number. Clicking a bubble leads to the corresponding topic page, which is not part of the “Overview” section.

Strengths:

  • It is easily apparent that each bubble represents a topic
  • Movement from overview of topics to one specific topic is clear

Limitations:

  • Perhaps a technical flaw: the thickness of the border on each bubble varies, but doesn’t seem to actually represent anything.
  • Each bubble shows six words; however, there are many more “top words” within the topic model. This may or may not be a “problem,” but it is something to be aware of.
  • Besides word size within each bubble, there are few other visual cues to guide the experience…which is why there are more views…

Overview: Scaled

[Screenshot: Overview – Scaled]

The topics are spatially arranged by similarity. This uses principal coordinates analysis, akin to the PCA approach I mentioned in the previous blog post.

Strengths:

  • Provides additional information – the similarity
  • Provides interaction to discern overlapping topic bubbles
  • Useful for discovery

Limitations:

  • How it is arranged is not immediately clear – at least to a novice user, one less familiar with topic modeling
  • Requires additional interaction to zoom in to areas with overlapping topic bubbles
  • Difficult to locate topic within scale

Overview: List

[Screenshot: Overview – List]

A table format allows for comparison of more information – adding a small “over time” illustration and the proportion of words in the corpus assigned to each topic.

Strengths:

  • More information viewable at once

Limitations:

  • The author mentions in the documentation that the bar for the proportion of words can be misleading because “the highest-proportion topics are often the least interesting parts of the model — agglomerations of very common words without a clear thematic content.”
  • The y-axes of the mini bar charts “over time” are not all the same scale
  • Requires scrolling, so it can’t all be viewed at once

Overview: Stacked

[Screenshot: Overview – Stacked]

This view evokes D3 the most: each topic, distinguished by color, is stacked on top of the others, showing how its presence increases or decreases over time.

Strengths:

  • Shows trends of each topic over time, all at once

Limitations:

  • Topics that appear less often are harder to distinguish without interaction

Topic

[Screenshot: topic view]

Each topic has its own view, including top words and their weight, proportion over time, and top documents.

Strengths:

  • Clicking on a bar in the timeline limits the documents to that period – a great use of drilling down. Although it would be nice to be able to select multiple bars at a time.

Limitations:

  • Clicking on a document or word opens the document or word interface, leaving the context of the topic itself behind. Expanding details within the topic view would be an alternative.

Document

[Screenshot: document view]

This view shows the title and the topics by proportion. To reach a document, it must be clicked on from the bibliography page or from a topic page. If topics are the priority, this makes sense – otherwise it might be interesting to also have a dedicated document view (beyond the standard bibliography).

Word

[Screenshot: word view]

This view shows the topics in which the word occurs. As with documents, it must be reached from a list of all words (for documents, the bibliography) or from a topic page. If the word appears in only one topic, it’s surprising when the view changes to show a different word among many topics. The animation helps with these transitions.

General considerations

  • Time to load – with a large dataset, lag time can be frustrating for the user and even make interaction impossible
  • Difficult to follow a topic throughout the different views – As my screenshots show, I looked at Topic 18 throughout my exploration. When on a topic page, there is no way to see the topic simultaneously in any of the Overview views. For example, I would like to return to the Overview and see Topic 18 highlighted in the Scaled or List view. However, navigating between words and documents, and discovering other topics in a document, is likely helpful to a humanities scholar.

UX Questions 

From exploring this interface, I’ve come up with a list of questions to consider in future designs. Many of them are classic information visualization questions, with no one answer.

  1. How many words of a topic should be displayed at once to best indicate the topic? Generally, how much information should be shown at once?
  2. How do we design for exploration and discovery of information, rather than for search in the manner of a database query?
  3. When is it appropriate to change the presentation of data, e.g. axis variables?
  4. When drilling down to a detail, how much context should be shown?
  5. How much visualization beyond the topic itself is necessary, e.g. do we also want to visualize all of the documents where each document is a variable?
  6. What should be sacrificed for quicker load times?

There is a lot to be learned from these different views. The author provides great documentation of his thought process and explanations of his design decisions here: http://agoldst.github.io/dfr-browser/

Topic Modeling and Digital Humanities: Overview (1)
https://dhlab.lmc.gatech.edu/tome/topic-modeling-and-digital-humanities-overview-1/
Fri, 20 Jan 2017

In this post:

  • What is a topic model?
  • UX considerations
  • Existing techniques

This will be the first post in a series as we begin a new project exploring topic modeling for the digital humanities, following the previous work of TOME.

A topic model is a model of how often words occur together in a group of texts. Many other posts define “topic model” in detail and describe the various algorithms:

http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/

http://journalofdigitalhumanities.org/2-1/what-can-topic-models-of-pmla-teach-us-by-ted-underwood-and-andrew-goldstone/

http://programminghistorian.org/lessons/topic-modeling-and-mallet

https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/
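
To see what producing a model actually looks like in code, here is a minimal sketch on a toy corpus – using gensim, which is an assumption on my part (the tutorials above mostly work with MALLET):

```python
# A minimal sketch of training a topic model with gensim's LdaModel.
from gensim import corpora, models

documents = [
    "the convention met to discuss rights and education",
    "the editor published a letter on abolition and rights",
    "recipes and household hints for the family circle",
]

# Tokenize, then build the vocabulary and the bag-of-words corpus.
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small model; num_topics is the knob discussed under UX below.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])
```

Rerunning the sketch with a different num_topics illustrates the merging and splitting behavior discussed below.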

Here, I am going to highlight some challenges of viewing, exploring, and learning from topic modeling results.

UX considerations

How much information should be shown at once? Topic models typically have many topics – which leads to information overload. How can you browse them in a manner that lets you learn something without being overwhelmed?

How can you understand different views of the model? The results change based on the number of topics specified by the user: a small number means separate topics will merge; a larger number means combined topics will split. Both are correct, but each occludes information.

Can you design for both a user familiar with the contents of the texts and a user who is unfamiliar? These are likely very different use cases. One will have questions in mind and one will be attempting to gain an initial understanding.

How do you design for trust? The model may or may not be misleading. A misleading model isn’t inherently bad – so long as the user recognizes that it is misleading – and design can aid this understanding.

What capabilities can metadata add? Many topic models disregard metadata and just use content. But if we use metadata, which might include things like author’s gender, race, regional location, and year, how else might one be able to explore the model?

How do you make the topic model a sustainable addition to existing workflows? Topic models would presumably be more useful if they were integrated into the ways people already work. This especially applies to people who may be less familiar with technical fields like computer science.

Existing techniques (some of them)

Dendrograms: A type of tree diagram emphasizing hierarchical clustering

More on the definition: https://en.wikipedia.org/wiki/Dendrogram

Use: http://blog.rolffredheim.com/2013/11/visualising-structure-in-topic-models.html

Pro: It maintains more complexity

Con: Its linear structure restricts links between topics
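
As a sketch of how such a dendrogram can be produced – assuming you already have the model’s topic-word matrix; the data here is a random stand-in:

```python
# Hierarchically cluster topics by the similarity of their word
# distributions, then draw a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
topic_word = rng.dirichlet(np.ones(50), size=10)  # 10 toy topics over 50 words

Z = linkage(topic_word, method="average", metric="cosine")
dendrogram(Z, labels=[f"topic {i}" for i in range(10)])
plt.tight_layout()
plt.show()
```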

Network visualization

Use: https://tedunderwood.com/2012/11/11/visualizing-topic-models/

Pro: Shows some connections, nodes can be sized based on number of occurrences

Con: Topic models aren’t actually networks

PCA (Principal Component Analysis)

Projects the model into two dimensions

e.g. https://tedunderwood.files.wordpress.com/2012/11/prettierpca.jpg

Pro: Solves issue of false network diagrams

Con: Words overlap
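
A sketch of the PCA approach, again on random stand-in data: project each word’s distribution over topics into two dimensions and label the points – the overlapping labels are exactly the con noted above.

```python
# Project word vectors (distributions over topics) to 2D with PCA and
# plot the words; labels collide in dense regions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
vocab = [f"word{i}" for i in range(50)]            # stand-in vocabulary
word_topic = rng.dirichlet(np.ones(10), size=50)   # 50 words over 10 topics

coords = PCA(n_components=2).fit_transform(word_topic)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for word, (x, y) in zip(vocab, coords):
    plt.annotate(word, (x, y), fontsize=8)
plt.show()
```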

There are many other interfaces, particularly ones specific to a given dataset. These will be the subjects of the next few blog posts.

Talk at Digital Humanities 2014
https://dhlab.lmc.gatech.edu/talks/talk-at-digital-humanities-2014/
Thu, 24 Jul 2014

What follows is the text of a talk about the TOME project delivered at the Digital Humanities 2014 conference in Lausanne, Switzerland. We’re in the process of writing up a longer version with more technical details, but in the interim, feel free to email me with any questions.

NB: For display purposes, I’ve removed several of the less-essential slides, but you can view the complete slidedeck here.

Just over a hundred years ago, in 1898, Henry Gannett published the second of what would become three illustrated Statistical Atlases of the United States. Based on the results of the Census of 1890– and I note, if only to make myself feel a little better about the slow pace of academic publishing today, eight years after the census was first compiled– Gannett, working with what he openly acknowledged as a team of “many men and many minds,” developed an array of new visual forms to convey the results of the eleventh census to the US public.

[Slide 4] The first Statistical Atlas, published a decade prior, was conceived in large part to mark the centennial anniversary of the nation’s founding. That volume was designed to show the nation’s territorial expansion, its economic development, its cultural advancement, and social progress. But Gannett, with the centennial receding from view, understood the goal of the second atlas in more disciplinary terms: to “fulfill its mission in popularizing and extending the study of statistics.”

It’s not too much of a stretch, I think, to say that we’re at a similar place in the field of DH today. We’ve moved through the first phase of the field’s development – the shift from humanities computing to digital humanities – and we’ve addressed a number of public challenges about its function and position in the academy. We also now routinely encounter deep and nuanced DH scholarship concerned with digital methods and tools.

And yet, for various reasons, these tools and methods are rarely used by non-digitally-inclined scholars. The project I’m presenting today, on behalf of a project team that also includes Jacob Eisenstein and Iris Sun, was conceived in large part to address this gap in the research pipeline. We wanted to help humanities scholars with sophisticated, field-specific research questions employ equally sophisticated digital tools in their research. Just as we can now use search engines like Google or Apache Solr without needing to know anything about how search works, our team wondered if we could develop a tool to allow non-technical scholars to employ another digital method – topic modeling – without needing to know how it worked. (And I should note here that we’re not the first to make this observation about search; Ben Schmidt and Ted Underwood, as early as 2010, also published remarks to this end.)

[Slide 5] Given this methodological objective, we also wanted to identify a set of humanities research questions that would inform our tool’s development. To this end, we chose a set of nineteenth-century antislavery newspapers, significant not only because they provide the primary record of slavery’s abolition, but also because they were one of the first places in the United States where men and women, African Americans and whites, were published together, on the same page. We wanted to discover whether, and if so how, these groups of people framed similar ideas in different ways.

For instance, William Lloyd Garrison, probably the most famous newspaper editor of that time (he who began the first issue of The Liberator, in 1831, with the lines, “I will not equivocate — I will not excuse — I will not retreat a single inch — AND I WILL BE HEARD”) decided to hire a woman, Lydia Maria Child, to edit the National Anti-Slavery Standard, the official newspaper of the American Anti-Slavery Society. Child was a fairly famous novelist by that point, but she also wrote stories for children, and published a cookbook, so Garrison thought she could “impart useful hints to the government as well as to the family circle.” But did she? And if so, how effective– or how widely adopted– was this change in topic or tone?

[Slide 7] The promise of topic modeling for the humanities is that it might help us answer questions like these. (I don’t have time to give a background on topic modeling today, but if you have questions, you can ask later.) The salient feature, for our project, is that these models are able to identify sets of words (or “topics”) that tend to appear in the same documents, as well as the extent to which each topic is present in each document. When you run a topic model, as we did using MALLET, the output typically takes the form of lists of words and percentages, which may suggest some deep insight – grouping, for example, woman, rights, and husband – but rarely offer a clear sense of where to go next. Recently, Andrew Goldstone released an interface for browsing a topic model. But if topic modeling is to be taken up by non-technical scholars, interfaces such as this must be able to do more than facilitate browsing; they must enable scholars to recombine such preliminary analysis to test theories and develop arguments.
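
To make “lists of words and percentages” concrete: MALLET’s --output-topic-keys option writes one topic per line – the topic number, a weight, and the top words – and a few lines of code can already turn that file into a crude keyword search over topics. A sketch, with a hypothetical filename:

```python
# Read MALLET's topic-keys file. In the versions I've used, each line is:
# topic number <tab> weight <tab> space-separated top words.
def read_topic_keys(path):
    topics = {}
    with open(path) as f:
        for line in f:
            topic_id, _weight, words = line.rstrip("\n").split("\t")
            topics[int(topic_id)] = words.split()
    return topics

# Find every topic whose top words include a query term.
topics = read_topic_keys("topic-keys.txt")  # hypothetical filename
for topic_id, words in topics.items():
    if "rights" in words:
        print(topic_id, " ".join(words[:10]))
```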

[Slide 8] In fact, the goal of integrating preliminary analytics with interactive research is not new; exploratory data analysis (or EDA, as it’s commonly known) has played a fundamental role in quantitative research since at least the 1970s, when it was described by John Tukey. In comparison to formal hypothesis testing, EDA is more, well, exploratory; it’s meant to help the researcher develop a general sense of the properties of his or her dataset before embarking on more specific inquiries. Typically, EDA combines visualizations such as scatterplots and histograms with lightweight quantitative analysis, serving to check basic assumptions, reveal errors in the data-processing pipeline, identify relationships between variables, and suggest preliminary models. This idea has since been adapted for use in DH – for instance, the WordSeer project, out of Berkeley, frames its work in terms of exploratory text analysis. In keeping with the current thinking about EDA, WordSeer interweaves exploratory text analysis with more formal statistical modeling, facilitating an iterative process of discovery driven by scholarly insight.

[Slide 10] EDA tends to focus on the visual representation of data, since it’s generally thought that visualizations enhance, or otherwise amplify, cognition. In truth, the most successful visual forms are perceived pre-cognitively; their ability to guide users through the underlying information is experienced intuitively; and the assumptions made by the designers are so aligned with the features of their particular dataset, and the questions that dataset might begin to address, that they become invisible to the end-user.

[Slide 11] So in the remainder of my time today, I want to talk through the design decisions that have influenced the development of our tool as we sought to adapt ideas about visualization and EDA for use with topic modeling scholarly archives. In doing so, my goal is also to take up the call, as recently voiced by Johanna Drucker, to resist the “intellectual Trojan horse” of humanities-oriented visualizations, which “conceal their epistemological biases under a guise of familiarity.” What I’ll talk through today should, I hope, seem at once familiar and new. For our visual design decisions involved serious thinking about time and space, concepts central to the humanities, as well as about the process of conducting humanities research generally conceived. So in the remainder of my talk, I’ll present two prototype interface designs, and explain the technical and theoretical ideas that underlie each, before sketching the path of our future work.

[Slide 12] Understanding the evolution of ideas – about abolition, or ideology more generally – requires attending to change over time. Our starting point was a sense that whatever visualization we created needed to highlight, for the end-user, how specific topics – such as those describing civil rights and the Mexican-American War, to name two that Lydia Maria Child wrote about – might become more or less prominent at various points in time. For some topics, such as the Mexican-American War, history tells us that there should be a clear starting point. But for other topics, such as the one that seems to describe civil rights, their prevalence may wax and wane over time. Did Child employ the language of the home to advocate for equal rights, as Garrison hoped she would? Or did she merely adopt the more direct line of argument that other (male) editors employed?

To begin to answer these questions, our interface needed to support nuanced scholarly inquiry. More specifically, we wanted the user to be able to identify significant topics over time for a selected subset of documents – not just in the entire dataset. This subset of documents, we thought, might be chosen by specific metadata, such as newspaper title; this would allow you to see how Child’s writing about civil rights compared to other editors’ work on the subject. Alternately, you might, through a keyword search, choose to see all the documents that dealt with issues of rights. In this way, you could compare the conversation around civil rights with the one that framed the discussion about women’s rights. (It’s believed that the debates about the two issues developed in parallel, although often with different ideological underpinnings.)

At this point, it’s probably also important to note that in contrast to earlier, clustering-based techniques for identifying themes in documents, topic modeling can identify multiple topics in a single document. This is especially useful when dealing with historical newspaper data, which tends to be segmented by page and not article. So you could ask: Did Child begin by writing about civil rights overtly, with minimal reference to domestic issues? Or did Child always frame the issue of civil rights in the context of the home?

[Slide 14] Our first design was based on exploring these changes in topical composition. In this design, we built on the concept of a dust-and-magnets visualization. Think of that toy where you could use a little magnetized wand to draw a mustache on a man; this model treats each topic as a magnet, which exerts force on multiple specks of dust (the individual documents). (At left is an image from an actual dust-and-magnets paper.)
In our adaptation of this model, we represented each newspaper as a trail of dust, with each speck– or point– corresponding to a single issue of the newspaper. The position of each point, on an x/y axis, is determined by its topical composition, with respect to each topic displayed in the field. That is to say– the force exerted on each newspaper issue by a particular topic corresponds to the strength of that topic in the issue. In the slide below, you can see highlighted the dust trail of the Anti-Slavery Bugle as it relates to five topics, including the civil rights and women’s rights topics previously mentioned. (They have different numbers here). I also should note that for the dust trails to be spatially coherent, we had to apply some smoothing. We also used color to convey additional metadata. Here, for instance, each color in a newspaper trail corresponds to a different editor. So by comparing multiple dust-trails, and by looking at individual trails, you can see the thematic differences between (or within) publications.
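
One simple way to realize the layout just described – a sketch faithful to the description above, not necessarily what our prototype does internally – is to place each issue at the average of the magnet positions, weighted by its topic proportions, and then smooth consecutive issues into a trail:

```python
import numpy as np

def dust_positions(doc_topic, magnet_xy):
    """doc_topic: (n_docs, n_topics) proportions; magnet_xy: (n_topics, 2)."""
    weights = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    return weights @ magnet_xy  # each doc: weighted average of magnet positions

def smooth_trail(points, window=5):
    """Moving average over consecutive issues, so the trail stays coherent."""
    kernel = np.ones(window) / window
    return np.column_stack(
        [np.convolve(points[:, dim], kernel, mode="valid") for dim in (0, 1)]
    )

rng = np.random.default_rng(2)
magnets = rng.uniform(-1, 1, size=(5, 2))     # 5 topic magnets on the plane
issues = rng.dirichlet(np.ones(5), size=100)  # 100 issues of one newspaper
trail = smooth_trail(dust_positions(issues, magnets))
```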

[Slide 15] Another issue addressed by this design is the fact that documents are almost always composed of more than two topics. In other words, for the topics’ force to be represented most accurately, they must be arranged in an n-dimensional space. We can’t do that in the real world, obviously, where we perceive things in three dimensions; let alone on a screen, where we perceive things in two. But while multidimensional information is lost, it’s possible to expose some of this information through interaction. So in this prototype, by adjusting the position of each topic, you can move through a variety of spatializations. Taken together, these alternate views allow the user to develop an understanding of the overall topical distribution.

This mode also nicely lends itself to our goal of helping users to “drill down” to a key subset of topics and documents: if the user determines a particular topic to be irrelevant to the question at hand, she can simply remove its magnet from the visualization, and the dust-trails will adjust.

This visualization also has some substantial disadvantages, as we came to see after exploring additional usage scenarios. For one, the topical distributions computed for each newspaper are not guaranteed to vary with any consistency. For instance, some topics appear and disappear; others increase and decrease repeatedly. In these cases, the resultant “trails” are not spatially coherent unless smoothing is applied after the fact. This diminishes the accuracy of the representation, and raises the question of how much smoothing is enough.

Another disadvantage is that while the visualization facilitates the comparison of the overall thematic trajectories of two newspapers, it is not easy to align these trajectories– for instance, to determine the thematic composition of two newspapers at the same point in time. We considered interactive solutions to this problem, like adding a clickable timeline that would highlight the relevant point on each dust trail. However, these interactive solutions moved us further from a visualization that was immediately intuitive.

At this point, we took a step back, returning to the initial goal of our project: facilitating humanities research through technically-sophisticated means. This required more complex thinking about the research process. There is a difference, we came to realize, between a scholar who is new to a dataset, and therefore primarily interested in understanding the overall landscape of ideas; and someone who already has a general sense of the data, and instead, has a specific research question in mind. This is a difference between the kind of exploration theorized by Tukey, and a different process we might call investigation. More specifically, while exploration is guided by popularity—what topics are most prominent at any given time—investigation is guided by relevance: what topics are most germane to a particular interest. We wanted to facilitate both forms of research in a single interface.

[Slide 16] With this design, at left, it’s time that provides the structure for the interface, anchoring each research mode – exploration and investigation – in a single view. Here, you see the topics represented in “timeline” form. (The timeline-based visualization also includes smooth zooming and panning, using D3’s built-in zoom functionality.) The user begins by entering a search term, as in a traditional keyword search. So here you see the results for a search on “rights,” with each topic that contains the word “rights” listed in order of relevance. This is like the output of a standard search engine, like Google, so each topic is clickable – like a link.

Rather than take you to a web page, however, clicking on a topic gets you more information about that topic: its keywords, its overall distribution in the dataset, its geographical distribution, and, eventually, the documents in the dataset that best encapsulate its use. (There will also be a standalone keyword-in-context view).

Another feature under development, in view of our interest in balancing exploration and investigation, is that the height – or thickness – of any individual block will indicate its overall popularity. (We actually have this implemented, although it hasn’t yet been integrated into the interface you see.) For example, given the query “rights,” topic 59, centered on women’s rights and represented in blue at the top right, may be most relevant – with “rights” as the most statistically significant keyword. But it is also relatively rare in the entire dataset. Topic 40, on the other hand, which deals with more general civil and political issues, has “rights” as a much less meaningful keyword, yet is extremely common in the dataset. Each of these topics holds significance for the scholar, but in different ways. Our aim is to showcase both.
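
The two measures can be sketched directly from a model’s output matrices – toy data here, and the particular definitions (a word’s weight within a topic for relevance, a topic’s average share of the corpus for popularity) are my stand-ins, not necessarily the statistics our system uses:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["rights", "woman", "husband", "war", "mexico"]  # tiny stand-in vocabulary
topic_word = rng.dirichlet(np.ones(len(vocab)), size=4)  # 4 topics over the vocab
doc_topic = rng.dirichlet(np.ones(4), size=200)          # 200 documents

query = vocab.index("rights")
relevance = topic_word[:, query]     # how strongly each topic uses the query term
popularity = doc_topic.mean(axis=0)  # each topic's share of the whole corpus

# A topic can rank high on one measure and low on the other - the
# topic-59 vs. topic-40 contrast described above.
for t in np.argsort(relevance)[::-1]:
    print(f"topic {t}: relevance={relevance[t]:.3f}, popularity={popularity[t]:.3f}")
```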

[Slide 17] Another feature to demonstrate is a spatial layout of topic keywords. In the course of the project’s development, we came to realize that while the range of connotations of individual words in a topic presents one kind of interpretive challenge, the topics themselves can at times present another – more specifically, when a topic includes words associated with seemingly divergent themes. So for instance, in T56, the scholar might observe a (seemingly) obvious connection, for the nineteenth century, between words that describe Native Americans and those that describe nature. However, unlike the words “antelope” or “hawk,” the words “tiger” and “hyena,” also included in the topic, do not describe animals that are native to North America. Just looking at the word list, it’s impossible to tell whether the explanation lies in a new figurative vocabulary for describing Native Americans, or whether this set of words is merely an accident of statistical analysis.

[Slide 18] So here, on the left, you see a spatial visualization of the topic’s keywords using multidimensional scaling, in which each keyword is positioned according to its contextual similarity. Here, the terms “indian”, “indians”, and “tribes” are located apart from “hyena”, “tiger”, and “tigers”, which are themselves closely associated. The spatial layout suggests a relatively weak connection between these groups of terms. For comparison, at right is a spatial visualization for a topic relating to the Mexican-American War, in which terms related to the conduct of the war are spatially distinguished from those related to its outcome.
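
A sketch of that layout using scikit-learn’s MDS on a precomputed distance matrix. The talk doesn’t specify the similarity measure, so cosine distance between stand-in context vectors is an assumption here:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(4)
keywords = ["indian", "indians", "tribes", "tiger", "tigers", "hyena"]
context_vectors = rng.random((len(keywords), 30))  # stand-in co-occurrence counts

# Position each keyword in 2D so that pairwise distances approximate
# the contextual distances between words.
distances = cosine_distances(context_vectors)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distances)
for word, (x, y) in zip(keywords, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```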

[Slide 19] But returning, for a minute, to the overall view, I’ll just note that there are limitations to this interface as well, owing to the fact of translating textual and temporal data into a spatial view. Through our design process, though, we came to realize that the goal should not be to produce an accurate spatial representation of what is, after all, fundamentally non-spatial data. Rather, our challenge was to create a spatial transformation, one that conveyed a high density of information while allowing the scholar to quickly and easily reverse course, moving from space back to the original, textual representation.

Our project is far from concluded, and we have several specific steps we plan to accomplish. In addition to implementing the information about specific topics, our most pressing concern, given our interest in moving from text to space and back to text again, is to implement the KWIC view. We also plan to write up our findings about the newspapers themselves, since we believe this tool can yield new insights into the story of slavery’s abolition.

But I want to end with a more theoretical question that I think our visualization can help to address– in fact, one that our interface has helped to illuminate without our even trying.

[Slide 20] I began this presentation by showing you some images from Henry Gannett’s Statistical Atlas of the United States. You’ll notice that one of these images bears a striking similarity to the interface we designed. Believe it or not, this was unintentional! We passed through several intermediary designs before arriving at the one you see, and several of its visual features – the hexagon shape of each block, and the grey lines that connect them – were the result of working within the constraints of D3. But the similarities between these two designs can also tell us something, if we think harder about the shared context in which both were made.

[Slide 21] So, what do we have in common with Henry Gannett, the nineteenth-century government statistician? Well, we’re both coming at our data from a methodological perspective. Gannett, if you recall, wanted to elevate statistics in the public view. By integrating EDA into our topic model exploration scheme, our team also aims to promote a statistical mode of encountering data. But that I refer to our abolitionist newspaper data as “data” is, I think, quite significant, because it helps to expose our relation to it. For antislavery advocates at the time – and even more so for the individuals whose liberty was discussed in their pages – this was not data, it was life. So when we are called upon, not just as visualization designers but as digital humanities visualization designers, to “expose the constructedness of data” – that’s Johanna Drucker again, whom I mentioned at the outset – or, to put it slightly differently, to illuminate the subjective position of the viewer with respect to the data’s display, we might think of these different sets of data and their similar representations – which owe as much to technical issues as to theoretical concerns – and ask what about the data is exposed, and what remains obscured from view. That is to say, what questions and what stories still remain for computer scientists, and for humanities scholars, working together, to begin to tell?