NLP for
News Narrative Analysis

An explanatory sequential research design

Introduction

This project faces two challenges at once: it is meant to be a digital experience in public scholarship, and it brings to news discourse analysis a computational method that is rarely used in the discipline.

NLP (natural language processing) is a computational method that is often used in computational linguistics, social media analysis, and other big data research disciplines. Journalism research, however, often conducts analysis via hand coding. There are few guides for conducting narrative analysis in media (Fulton, 2005; Graef, 2019), and few journalism researchers are trained in computational methods, even when conducting quantitative analyses like content analyses. When looking at news narratives, hand coding makes analysis of large data trends extremely labor intensive.

StoryAtlas aims to use computational analysis of news articles as a "way in" to deeper discourse analysis of news narratives that arise after a breaking news event. In the first iteration of StoryAtlas, the project examined differences between a national newspaper (The New York Times) and a local newspaper (the Star Tribune) in a year's worth of coverage of the murder of George Floyd on May 25, 2020. Some of the guiding research questions of this project asked: What are the narratives that these papers are presenting to their audiences about the same event? How do they change over time? What meaning are they trying to make and imprint in our collective memory?

The sample size of this first iteration was 1,214 articles. In order to analyze this large dataset, it was important to the public scholarship ethos of this project to conduct and publish the analysis using as many open source tools as possible.

Open Source Tools

Choosing the right tools for this process was not just an ethical imperative, but also a project directive. NLP applications differ from one another: some are open source and some are not, and each takes a different approach to language processing. The NLP tools chosen for StoryAtlas were the Natural Language Toolkit (NLTK) and Gensim. NLTK is a Python library that draws on multiple corpora and lexical resources to analyze text data by parsing, tagging, and word frequency counting, among other functions. Gensim is a Python library for topic modelling, semantic analysis, and other more intricate textual analysis; it includes an implementation of Word2Vec, a word embedding algorithm originally developed at Google. The aim of using these free, open source tools is to identify the words that were most frequently used in coverage of George Floyd's murder, as well as to conduct embedding analysis of the words most associated with George Floyd's name.

In coordination with this project's partner organizations, one of the guidelines established early in the project was to use open source tools to build the digital product in addition to the data analysis. Open source and free tools that were used to build the digital experience for StoryAtlas included GitHub, HTML5UP, Unsplash, The Library of Congress, Visual Studio Code, and Mapbox.

It's also important to note that ChatGPT was used for coding assistance on this project. Even though it is not an open source tool, it is an extremely helpful resource for novice coders.

Method

An explanatory sequential research design uses quantitative data and analysis methods as an initial step to inform qualitative data collection and analysis. This project uses this design as a "way in" for narrative analysis of news.

Step 1. Using ProQuest Newsstream, I collected all articles from The New York Times and Minneapolis Star Tribune published from May 25, 2020–June 1, 2021, using the search phrase “George Floyd” (N = 1,214).

Step 2. The data was collated by publication and publishing date and formatted into a .txt file to prepare for NLP analysis.

Step 3. The 10 most common words for the articles published within a given month were assessed using NLTK. NLTK comes with a list of extremely common English words, or "stop words," such as articles, pronouns, and prepositions. Additional stop words, such as “said” and “including,” were incorporated into the analysis process. This resulted in 47 most commonly used words associated with George Floyd, referred to in this project as embeddings, for the New York Times sample and 35 embeddings for the Star Tribune sample. (See collected data).

Step 4. These embeddings will be used as a “way in” to analyze the data qualitatively. For example, “police” was consistently the most-used word in relation to stories about George Floyd, regardless of publication or publication date.
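The monthly word count described in Step 3 can be sketched roughly as follows. This is a minimal illustration rather than the project's actual script: the function names, the (date, text) input format, the regular-expression tokenizer, and the small fallback stop-word list are all assumptions made for the example. NLTK's English stop-word list is used when it is installed and its corpus has been downloaded.

```python
import re
from collections import Counter, defaultdict

# Hypothetical fallback list, used only when NLTK or its stop-word corpus
# is not available on the machine running the sketch.
FALLBACK_STOP_WORDS = {
    "the", "a", "an", "of", "in", "and", "to", "was", "is", "on",
    "for", "that", "with", "as", "at", "by", "it", "he", "his",
}


def load_stop_words(extra=("said", "including")):
    """Return NLTK's English stop words plus project-specific additions."""
    try:
        from nltk.corpus import stopwords
        base = set(stopwords.words("english"))
    except (ImportError, LookupError):
        # NLTK missing, or the "stopwords" corpus has not been downloaded.
        base = set(FALLBACK_STOP_WORDS)
    return base | set(extra)


def top_words_by_month(articles, stop_words, n=10):
    """Count the n most common non-stop words per month.

    articles: iterable of (iso_date, text) pairs, e.g. ("2020-05-26", "...").
    Returns a dict mapping "YYYY-MM" to a list of (word, count) tuples.
    """
    counts = defaultdict(Counter)
    for iso_date, text in articles:
        month = iso_date[:7]  # "YYYY-MM"
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[month].update(
            t for t in tokens if len(t) > 1 and t not in stop_words
        )
    return {month: counter.most_common(n) for month, counter in counts.items()}
```

Running this once per publication, with the articles from Step 2 loaded as (date, text) pairs, would give one top-10 list per month that can then be compared across papers.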

Initial Findings

One initial takeaway from the data at this point in the analysis is that The New York Times frequently used the words "Trump," "president," and "law" in stories about George Floyd. In contrast, the Star Tribune used words such as "Chauvin," "trial," and "justice." This seems to indicate that, as a national paper, The New York Times was more concerned with public sentiment around the murder of George Floyd and its effects on the upcoming 2020 election and its candidates, as well as potential implications for federal law. The Star Tribune, on the other hand, was more concerned about justice for the community by way of Chauvin's trial.

Another important initial finding is the difference between use of the word "death" versus "murder" in relation to George Floyd. While "death" was within the top three words for both publications, only the Star Tribune began using "murder" among its most frequent embeddings in articles about Floyd. This has important implications for journalism norms. In professional journalism, "murder" is typically used only once someone has been convicted. This is an important shielding practice that helps avoid potential litigation for libel. However, video evidence and public outcry for justice from the Minneapolis and Black communities seem to have influenced the Star Tribune to begin using "murder" nearly as frequently as "death" even before Chauvin was convicted in April 2021.
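One way to check how the use of "death" and "murder" shifts over time is to count both terms month by month. The sketch below is illustrative only: the function name and the (date, text) input format are assumptions, and it matches exact word forms, so "murdered" or "deaths" would need to be added as separate terms.

```python
import re
from collections import Counter, defaultdict


def term_counts_by_month(articles, terms=("death", "murder")):
    """Count exact occurrences of each term per month.

    articles: iterable of (iso_date, text) pairs.
    Returns a dict mapping "YYYY-MM" to {term: count}, sorted by month.
    """
    wanted = set(terms)
    counts = defaultdict(Counter)
    for iso_date, text in articles:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Accessing counts[month] registers the month even if no term appears.
        counts[iso_date[:7]].update(t for t in tokens if t in wanted)
    return {
        month: {term: counter[term] for term in terms}
        for month, counter in sorted(counts.items())
    }
```

Comparing the two counts for the months before and after April 2021 would show whether "murder" approaches "death" in frequency ahead of the conviction.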

The next step will be to more deeply examine the qualitative data to understand how these words play into the larger narrative conveyed about the “news event” we refer to as the murder of George Floyd.

Future Research

The next step in this process would be to examine the word embeddings (or words most associated with Floyd's name) using Gensim. Because it takes some time to train the Word2Vec algorithm, this step will be completed at a later date.
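As a rough sketch of what this Gensim step might look like, the example below trains a Word2Vec model on tokenized sentences and asks for the words closest to "floyd". It assumes Gensim 4.x (where the dimension parameter is named vector_size); the helper names, the simple sentence splitter, and the parameter values are illustrative assumptions, not settings used by the project.

```python
import re


def to_sentences(text):
    """Split raw text into lowercase, tokenized sentences."""
    sentences = []
    for chunk in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", chunk.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


def train_embeddings(sentences, min_count=5):
    """Train a Word2Vec model; min_count drops words seen fewer times."""
    from gensim.models import Word2Vec  # imported here so the helpers above work without Gensim

    return Word2Vec(
        sentences=sentences,
        vector_size=100,  # dimensionality of each word vector
        window=5,         # context words considered on each side
        min_count=min_count,
        workers=1,        # single worker, for more repeatable runs
        seed=42,
    )


def nearest_words(model, word, topn=10):
    """Return (word, cosine similarity) pairs closest to the given word."""
    return model.wv.most_similar(word, topn=topn)
```

With the full sample loaded, calling nearest_words(model, "floyd") on a model trained per publication would give the words each paper most closely associates with Floyd's name.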

With the quantitative portion of the data now collected, further research can be conducted with the qualitative narratives collected in the sample. Some points of interest based on the collected data include the use of "death" versus "murder" and the Star Tribune's focus on Chauvin's trial versus The New York Times's focus on the implications for the 2020 presidential election. Additionally, larger samples can be collected in order to examine larger national trends. Finally, future research will aim to look at other news events such as natural disasters, mass shootings, or pandemics.

References

Fulton, H. (2005). Narrative and Media. Cambridge University Press.

Graef, J. (2019). Using narrative analysis to explore print news media stories of violent crime. In Sage Research Methods Cases Part 2. SAGE Publications, Ltd. https://doi.org/10.4135/9781526473233

Souto-Manning, M. (2014). Critical narrative analysis: The interplay of critical discourse and narrative analysis. International Journal of Qualitative Studies in Education, 27(2), 159–180. https://doi.org/10.1080/09518398.2012.737046

Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Anchor Books.