Saturday 10:50 a.m.–11:10 a.m. in Terrace

NLTK vs Twitter: A Voyage into Linguistic Frontiers

Max Thayer

Audience level:


Language is complicated. We'll show you how to use statistics and geography to do linguistic research without the hassle of semantics, using Flask for data collection, the NLTK for data parsing, and d3.js for pretty graphs.


Given a linguistic frontier like Twitter, it is tempting to dive down the rabbit hole of semantic analysis. But online communities evolve faster than traditional corpora can be developed, so we need to look at every tool we have to uncover their secrets.

We'll demonstrate how to answer questions like "How do idioms grow and spread?", "Which regions are evolving fastest linguistically?", and "Which regions' tendencies are spreading or shrinking?" using a Flask app to collect data from Twitter, the NLTK to slice it before storage and dice it upon retrieval, and d3.js to generate shareable graphs of your research.

We'll focus on analyzing n-grams, but we'll also discuss many techniques, both statistical and semantic, for studying language and teaching machines to parse it, along with what the NLTK provides to accomplish all that with no assembly required.
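
To give a flavor of the n-gram analysis mentioned above, here is a minimal sketch that counts n-grams per region using only the Python standard library. It is an illustration, not the talk's actual code: the helper names (tokenize, ngrams, ngram_counts_by_region), the region labels, and the sample tweets are all hypothetical. In practice the NLTK's nltk.util.ngrams could stand in for the hand-rolled ngrams helper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens.

    Hashtags and @mentions are kept as single tokens. This is a
    simplified, assumed tokenizer, not the NLTK's.
    """
    return re.findall(r"[#@]?\w+(?:'\w+)?", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts_by_region(tweets, n=2):
    """Map each region to a Counter of the n-grams seen in its tweets.

    `tweets` is an iterable of (region, text) pairs; how a region is
    assigned to a tweet is outside the scope of this sketch.
    """
    counts = {}
    for region, text in tweets:
        counts.setdefault(region, Counter()).update(ngrams(tokenize(text), n))
    return counts


# Hypothetical sample data, invented purely for illustration.
tweets = [
    ("west", "hella good coffee this morning"),
    ("west", "hella good weather today"),
    ("east", "wicked good coffee this morning"),
]

counts = ngram_counts_by_region(tweets, n=2)
for region, counter in sorted(counts.items()):
    print(region, counter.most_common(2))
```

Comparing the per-region counters over time is one simple way to approach questions about where a phrase first appears and how it spreads, without any semantic analysis.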

