Text as Data: Applications of Automated Text Analysis in the Middle East and North Africa

Alexandra Blackman, New York University – Abu Dhabi

This is part of the MENA Politics Newsletter, Volume 2, Issue 2, Fall 2019.

How do political parties in Tunisia present their economic platforms? How do Saudi political activists and their online followers change their social media behavior after arrest? How do Syrian state-owned media promote the political agenda of the state? These are just some of the questions that researchers are answering today using new text-as-data approaches.[i] Text-as-data applications have grown markedly over the last decade, as the digitization of documents and the Internet have made large corpora of texts more accessible and as greater computing power has made processing such texts more feasible. The study of Middle East politics is no exception.

We organized this symposium to highlight the new work being done using Arabic language and text-as-data methods, to address some of the risks and rewards of adopting these methods, and to familiarize the Arabic-language research community with what remains a relatively new methodological approach in comparative political science.[ii]

The contributions included in this newsletter illustrate the wide variety of texts that can be analyzed using new computational methods: (1) religious texts,[iii] (2) party platforms,[iv] (3) social media,[v] and (4) traditional news media.[vi] These articles highlight how bodies of Arabic text can be analyzed to uncover new puzzles, measure key patterns of language usage, or make inferences about the political behavior of key actors in the region. For example, Richard Nielsen discusses how male and female Salafi preachers appeal to different types of authority in their religious discourse.[vii]

In addition to exploring various kinds of corpora for text analysis, each contribution details its methodological approach and the challenges the authors faced. Broadly, these challenges include decisions about preprocessing the Arabic text data (e.g., distinguishing proper nouns from other words in the absence of capitalization and deciding which stopwords to remove[viii]) and decisions about how to analyze the data (e.g., using supervised or unsupervised methods).[ix] Nathan Grubman’s project on party ideology is an example of an unsupervised approach to ideological scaling, while Alexandra Siegel and Jennifer Pan’s article on social media in Saudi Arabia uses supervised classification methods.[x]
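
As a concrete, if simplified, illustration of the preprocessing decisions described above, the sketch below (in Python) strips diacritics, normalizes common orthographic variants, and removes a small, purely illustrative stopword list before tokenizing. It is not drawn from any contributor’s pipeline; a real project would substitute a curated stopword resource and a proper tokenizer.

```python
# A minimal, hypothetical Arabic preprocessing sketch -- not any author's pipeline.
import re

# Tashkeel (short-vowel diacritics), superscript alef, and tatweel (elongation).
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(text: str) -> str:
    """Strip diacritics and collapse common orthographic variants."""
    text = ARABIC_DIACRITICS.sub("", text)
    text = re.sub("[إأآ]", "ا", text)  # unify alef variants
    text = re.sub("ى", "ي", text)      # alef maqsura -> yaa
    text = re.sub("ة", "ه", text)      # taa marbuta -> haa
    return text

# Tiny illustrative stopword list (normalized); real projects use a curated list.
STOPWORDS = {normalize(w) for w in ["في", "من", "على", "عن", "إلى", "أن", "ثم"]}

def preprocess(text: str) -> list[str]:
    """Normalize, tokenize on runs of Arabic letters, and drop stopwords."""
    tokens = re.findall(r"[\u0621-\u064A]+", normalize(text))
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("ذهبَ الوزيرُ إلى تونس"))  # -> ['ذهب', 'الوزير', 'تونس']
```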

Importantly, the authors each illustrate how a deep understanding of the region and the texts is pivotal to using text-as-data methods. Text analysis underscores the importance of language learning, of thinking deeply about the meaning of words, and of recognizing the limits of automated methods. Once a researcher has developed her own understanding of the purposeful and delicate ways language is used, she can make informed decisions about how to process the text, what type of approach to use, and how to work with research assistants to classify texts. The authors included in this symposium reflect on engaging in that process in their own research.

Close knowledge of the region also helps researchers think carefully about the ethical issues associated with using text-as-data methods. Ala’ Alrababa’h discusses intellectual property issues and the ethical concern of ensuring that online access to newspapers is not disrupted for other readers while researchers scrape their websites. And Alex Siegel addresses the debates around highlighting an individual’s social media activity in repressive settings. Looking ahead, these ethical issues should remain at the forefront of researchers’ discussions.

Looking Ahead

If the recent uptick in social science research using text-as-data methods with Arabic-language texts is any indication, these approaches will continue to develop and grow in political science. Work using computational text analysis methods in other languages offers some ideas about possible new avenues for research using these tools with Arabic, including: (1) literature, political theory texts, and textbooks;[xi] (2) candidate platforms and manifestos;[xii] (3) open-ended survey questions, interview transcripts, or personal narratives;[xiii] (4) political speeches and press releases;[xiv] and (5) diplomatic records, judicial decisions, and other government documents.[xv]

Furthermore, from a methodological perspective, it would be interesting to see more work exploring the technical aspects of Arabic text analysis. For instance, should analyses of Arabic traditional and social media incorporate French-language texts, particularly in former French colonies, and how consequential is that decision?[xvi] Do the results of automated text analysis of Arabic differ after stemming versus lemmatization? Because Arabic relies on a strong root system, lemmatization could be a particularly powerful way to analyze the language.
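
As one starting point for that comparison, the hedged sketch below applies the light, root-oriented Arabic stemmer that ships with NLTK (the ISRI stemmer) to a few invented example words; the lemmatization side is left to a comment because it requires a morphological analyzer (such as the tools from the CAMeL Lab mentioned below) whose output depends on its lexicon.

```python
# Illustrative only: root-based stemming with NLTK's ISRI stemmer.
# Lemmatization, by contrast, would map each word to a dictionary form and
# requires a morphological analyzer; it is not shown here.
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()

# Example words chosen for illustration: "libraries", "they write", "writer".
for word in ["المكتبات", "يكتبون", "كاتب"]:
    print(word, "->", stemmer.stem(word))

# The stemmer strips affixes and reduces words toward a triliteral root,
# so morphologically distinct items can collapse into similar stems --
# exactly the trade-off against lemmatization that the question above raises.
```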

Despite advances in Google Translate, automated translation from Arabic performs more poorly than translation from other major languages (and does not translate colloquial Arabic), and Arabic remains underrepresented in artificial intelligence applications more broadly.[xvii] For these reasons, and based on my own experience working in Arabic, automated translation from Arabic often fails to capture the meaning of a phrase or the correct translation of specific words within a given phrase. Thus, in analyzing computer-translated Arabic texts, the bag-of-words assumption could be violated.[xviii] The works highlighted in this newsletter do not use automated translation, and, in the short to medium term, working directly with the original Arabic is likely to remain the gold standard for automated Arabic text analysis.
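
To make the bag-of-words point concrete, the toy sketch below builds a document-term matrix directly from untranslated Arabic strings using scikit-learn; the two “documents” are invented for illustration and do not come from any corpus discussed in this symposium.

```python
# A minimal bag-of-words sketch on untranslated Arabic text (toy documents).
# CountVectorizer is language-agnostic: it only counts tokens, so working in
# the original Arabic avoids stacking translation error on top of the
# bag-of-words simplification discussed above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "الحكومة تعلن برنامج الإصلاح الاقتصادي",    # toy headline 1
    "المعارضة تنتقد برنامج الحكومة الاقتصادي",  # toy headline 2
]

vectorizer = CountVectorizer(token_pattern=r"[\u0621-\u064A]+")
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())  # word order is discarded; only per-document counts remain
```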

Some Arabic Text Analysis Resources

Rich Nielsen, one of the contributors to this newsletter, has developed stemmers for both Arabic (Arabic Stemmer) and Persian (Persian Stemmer). Additionally, there are many resources that have been developed outside of political science with applications to Arabic text-as-data analysis. For instance, the Computational Approaches to Modeling Language (CAMeL) Lab is a research lab at New York University Abu Dhabi focused on Arabic language analysis, and the Stanford University Natural Language Processing Group has developed software that can also process Arabic language texts (Stanford CoreNLP).

As this collection of essays on recent research employing Arabic text-as-data methods highlights, there are many research questions – both new and old – to which these methods can contribute. We hope that by featuring this work, we provoke further discussion around the promise and pitfalls of these methods, particularly as they relate to Arabic-language texts, and encourage scholars to familiarize themselves with these methods, if not add these tools to their repertoire.

Notes:

[i] Ala’ Alrababa’h and Lisa Blaydes. “Authoritarian Media and Diversionary Threats: Lessons from Thirty Years of Syrian State Discourse.” Working Paper (2019); Nathan Grubman, “Ideological Scaling in a Neoliberal, Post-Islamist Age,” APSA Middle East Politics Newsletter (2019); Jennifer Pan and Alexandra Siegel, “How Saudi Crackdowns Fail to Silence Online Dissent,” Working Paper (2019).

[ii] Christopher Lucas, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer, and Dustin Tingley, “Computer-Assisted Text Analysis for Comparative Politics,” Political Analysis 23, no. 2 (2015): 254-277.

[iii] Richard Nielsen, Deadly Clerics: Blocked Ambition and the Paths to Jihad (Cambridge: Cambridge University Press, 2017).

[iv] Grubman 2019.

[v] Pan and Siegel 2019; Alexandra Siegel, “Using Social Media Data to Study Arab Politics,” APSA Middle East Politics Newsletter (2019); Amaney A. Jamal, Robert O. Keohane, David Romney, and Dustin Tingley, “Anti-Americanism and Anti-Interventionism in Arabic Twitter Discourses,” Perspectives on Politics 13, no. 1 (2015): 55-73.

[vi] Ala’ Alrababa’h, “Quantitative text analysis of Arabic news media,” APSA Middle East Politics Newsletter (2019); Alrababa’h and Blaydes 2019. For a relevant application using U.S. news media, see: Rochelle Terman, “Islamophobia and Media Portrayals of Muslim Women: A Computational Text Analysis of US News Coverage,” International Studies Quarterly 61, no. 3 (2017): 489-502.

[vii] Richard Nielsen, “What Counting Words Can Teach Us About Middle East Politics,” APSA Middle East Politics Newsletter (2019); Richard Nielsen, “Women’s Authority in Patriarchal Social Movements: The Case of Female Salafi Preachers,” American Journal of Political Science (2019).

[viii] For more details on preprocessing steps, see: Matthew J. Denny and Arthur Spirling, “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It,” Political Analysis 26, no. 2 (2018): 168-89. Arabic texts can present a challenge because the same sequence of characters can be segmented into different words and parts of speech, yielding entirely different meanings depending on how the words are connected. For example, the sequence وجهته can mean “and his/its side” or “I/She sent him/it [in the direction of],” and each reading segments into entirely different parts of speech.

[ix] For an overview of text-as-data methods, see: Justin Grimmer and Brandon M. Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts,” Political Analysis 21, no. 3 (2013): 267-97. Supervised methods rely on human-coded examples and thus require the researcher to know ex ante what to code for in the data. Unsupervised methods generate topics from the data but require researcher decisions about issues such as the number of topics to generate.

[x] Siegel and Pan’s research is discussed in Alexandra Siegel’s contribution to this newsletter.

[xi] Lisa Blaydes, Justin Grimmer, and Alison McQueen. “Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds,” Journal of Politics 80, no. 4 (2018): 1150-1167; Jennifer A. London, “Re-imagining the Cambridge School in the Age of Digital Humanities,” Annual Review of Political Science 19, no. 1 (2016): 351-373; Tamar Mitts, “Terrorism and the Rise of Right-Wing Content in Israeli Books,” International Organization 73, no. 1 (2019): 203-24. 

[xii] Amy Catalinac, “From Pork to Policy: The Rise of Programmatic Campaigning in Japanese Elections,” The Journal of Politics 78, no. 1 (2016): 1-18.

[xiii] Mathilde Emeriau, “Learning to be Unbiased: Evidence from the French Asylum Office,” Working Paper (2019); Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand, “Structural Topic Models for Open-Ended Survey Responses,” American Journal of Political Science 58, no. 4 (2014): 1064-1082.

[xiv] Justin Grimmer, Representational Style in Congress: What Legislators Say and Why It Matters (Cambridge: Cambridge University Press, 2013).

[xv] Azusa Katagiri and Eric Min, “The Credibility of Public and Private Signals: A Document-Based Approach,” American Political Science Review 113, no. 1 (2019): 156-72; Benjamin Liebman, Margaret Roberts, Rachel Stern, and Alice Wang, “Mass Digitization of Chinese Court Decisions: How to Use Text as Data in the Field of Chinese Law,” 21st Century China Center Research Paper no. 2017-01 (2017).

[xvi] Another interesting question is how the choice of language on social media corresponds to social class and other individual characteristics. Fred Schaffer’s ethnographic study of conceptions of democracy in Senegal suggests that these language choices are closely related to class and have important implications for how people understand and engage with important political concepts like democracy. See: Frederic C. Schaffer, Democracy in Translation: Understanding Politics in an Unfamiliar Culture (Ithaca: Cornell University Press, 1998).

[xvii] Freya Pratty, “Arabic and AI: Why voice-activated tech struggles in the Middle East,” Middle East Eye, September 10, 2019. Examining Danish, German, Spanish, French and Polish, de Vries, Schoonvelde, and Schumacher make the case for Google Translate. However, I have not found a comparable analysis of the performance of Google Translate for Arabic. See: Erik de Vries, Martijn Schoonvelde, and Gijs Schumacher, “No Longer Lost in Translation: Evidence That Google Translate Works for Comparative Bag-of-Words Text Applications,” Political Analysis 26, no. 4 (2018): 417-30.

[xviii] It would be interesting to see applications and evaluations of recent advances in language models, such as word embeddings, on Arabic texts. For an application in political science, see: Yaoyao Dai, “Measuring Populism in Context: A Supervised Approach with Word Embedding Models,” Working Paper (2019).
