Project Name: The Spanish-English Bilingual Youth Texting Corpus and Part-of-Speech Tagger: Spanish/English BYT Corpus
Grantees: Joanna Birnbaum and Michelle Johnson McSweeney
Funding Cycle: 2015-2016
Project Status: Cycle Complete
White Paper: BYT_PDIG White Paper – J. Birnbaum & M. McSweeney
This project seeks to create a publicly available, part-of-speech (PoS) tagged corpus of bilingual Spanish-English text messages. This corpus would be freely available for download to the broader academic community. While this may initially seem like a very small contribution, it is actually significant: a PoS-tagged corpus is the first step toward language-detection software, toward a bilingual spell check, and toward computational linguistic research in general. The availability of such datasets is the fundamental reason that computational research, language detection, and spell check are so good for widely used languages like English and virtually non-existent for languages like Lusoga. It is also the fundamental reason that computational language analysis is so difficult when languages are combined. In the case of Lusoga, the gap may attract little attention because there are comparatively few speakers. But there are many speakers of both Spanish and English: in NYC alone, 25% of the population is bilingual in those two languages.
The primary goal of this project is to make a corpus of bilingual (Spanish/English) text messages freely available to Digital Communication researchers. A secondary goal, which follows from achieving the first, is to develop an automated Part-of-Speech (PoS) tagger for bilingual corpora.
Digital communication technologies are increasingly integrated into everything that people do: the way we socialize, the way we work, and the way we learn. With the exception of programming, these interactions are mediated through natural human language. Human beings use language in complicated ways, and often mix languages together. It is therefore incomplete and inaccurate to study languages in isolation from the contact they have with other languages. This project is a first step in working with languages that are messy, complicated, and in contact.
One major way that people communicate with each other on digital devices is through text messaging. As of today, the corpora available for researching text messaging are all monolingual. Furthermore, available PoS taggers are designed for monolingual corpora. This project would change that by producing an anonymized, tagged, freely available corpus of bilingual text messages and a publicly available PoS tagger for use with bilingual corpora. The development and public availability of this corpus will contribute to the identity of the Digital-GC by making the Graduate Center a central resource for access to a naturalistic computer-mediated communication data set, thereby situating the Graduate Center at the forefront of research on language change in response to digital communication affordances.
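To make concrete what a PoS-tagged, code-switched text message might look like, here is a minimal Python sketch. It is illustrative only: the toy lexicons, the coarse tag names, the `TaggedToken` type, the `tag_message` function, and the example message are all hypothetical. None of them are drawn from the BYT corpus or from the project's tagger, whose design this description does not specify.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One token with a language label and a coarse part-of-speech tag."""
    text: str
    lang: str  # "es", "en", or "unk" when the word is in neither lexicon
    pos: str   # coarse tag such as "VERB" or "NOUN"; "X" when unknown


# Hypothetical toy lexicons (assumption). A real tagger would be trained
# on annotated data rather than relying on hand-written word lists.
LEXICON = {
    "es": {"voy": "VERB", "a": "ADP", "la": "DET", "tienda": "NOUN"},
    "en": {"see": "VERB", "you": "PRON", "later": "ADV"},
}


def tag_message(message: str) -> List[TaggedToken]:
    """Label each whitespace-separated word with a language and PoS tag."""
    tagged = []
    for word in message.lower().split():
        for lang, lexicon in LEXICON.items():
            if word in lexicon:
                tagged.append(TaggedToken(word, lang, lexicon[word]))
                break
        else:
            # Word found in neither lexicon: mark it as unknown.
            tagged.append(TaggedToken(word, "unk", "X"))
    return tagged


# Invented example message that switches from Spanish to English.
example = tag_message("Voy a la tienda see you later")
for token in example:
    print(f"{token.text}\t{token.lang}\t{token.pos}")
```

A dictionary lookup like this breaks down on words that exist in both languages (for example "a", a preposition in Spanish and a determiner in English) and on the nonstandard spellings common in texting, which is part of why monolingual tools transfer poorly to bilingual data.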