12.4. Natural language processing

Turning human language into data that computers can “read” has long been a subject of fascination in data science and in the many fields from which it emerged and continues to evolve today. The history is a long one and the field today is vast, so we’ll keep our summary of both brief, and then jump into building some intuition, both conceptual and programmatic, for how to start turning language into data.

Overview

Natural language processing (NLP) has been around in some form since at least the early 1900s, though not explicitly as NLP. Two big areas of work within NLP are language “learning” and text-as-data.

  1. Language “learning”: We’ve long been exploring ways to turn language into data that we can then train computers on (much like with our ML algorithms from a few chapters ago) so that they can replicate and generate language, perhaps to respond to human queries or have “conversations” with us. “Learning” is in quotation marks because it’s an open, and deeply philosophical, debate as to what the computers “know” about the language they’re replicating or generating, and whether they “understand” what humans are saying to them. A big part of the answer generally hinges on how you define “learn” and “understand”, which is not at all obvious, but we’ll leave the debate here for now.

If you’ve ever texted with a chatbot, encountered a bot on social media, or spoken to Siri, Alexa, Google Home, or any other “smart” device, you’ve interacted directly with NLP algorithms. As of this writing, the newest development in interactive applications of NLP is OpenAI’s ChatGPT, which is currently dazzling (and possibly terrifying) everyone with its ability to reply to queries, including unusual ones, in ways that sound impressively (and even alarmingly) like a human. (Alas, in this chapter we won’t quite get as far as building our own ChatGPT, but we trust you can do that on your own time, as it’s likely exceedingly simple. We will, however, sketch a toy text generator just after this list.)

  2. Text-as-data: Another reason we might want to turn human language into data that computers can work with is to understand things about language itself, as well as about other things language might represent or predict. For example, before “data science” was a household(ish) term, linguists conducted “content analysis”: studying the use of certain words, grammatical structures, or other linguistic elements across time periods, locations, dialects, or other groups. More recently, social scientists have been turning text into data to research, for example, how social media influences news content (and vice versa), to understand the prevalence of, say, mental illness among certain populations on social media, or even to predict real-world events, such as protests or stock market performance, based on online conversations.
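To build a little intuition for the language-“learning” idea in item 1, here is a toy sketch (plain Python, with an invented one-line corpus; all names are ours, not from any library) of the simplest possible generative model of text: a bigram model that first learns which words tend to follow which, and then samples new text. Real systems like ChatGPT learn vastly richer patterns from billions of words, but the learn-then-generate shape is the same.

```python
import random
from collections import defaultdict

# A tiny invented "corpus"; real models train on billions of words.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# "Learn": record which words follow which (a bigram model).
followers = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word].append(next_word)

# "Generate": starting from a seed word, repeatedly sample a
# plausible next word from what the model has seen.
random.seed(42)
word = "the"
output = [word]
for _ in range(8):
    options = followers.get(word)
    if not options:  # dead end: the corpus's final word has no followers
        break
    word = random.choice(options)
    output.append(word)

print(" ".join(output))
```

The output is grammatical-looking nonsense, which is exactly the point: the model “knows” only co-occurrence statistics, not meaning, which is why the scare quotes around “learning” are doing real work.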

Much work sits at the intersection of these two areas, and developments in one often spur developments in the other. Generally, one way to think about the two big areas of research is that in language learning we want to teach a computer to “speak” a human language, whereas in text-as-data we tend to want to use text (spoken or written) as input or output variables, perhaps in a dataset alongside other variables, such as the number of people who showed up at a protest in a particular location, or the performance of a particular stock over some time period.
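To make the text-as-data idea concrete, here is a minimal sketch of one common first step: turning raw text into a document-term matrix, where each row is a document and each column counts a word. This assumes scikit-learn is available; the three tiny “documents” are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented "documents" standing in for, say, social media posts.
docs = [
    "the protest drew a large crowd downtown",
    "stocks fell as the market reacted to the news",
    "the crowd at the protest chanted about the market",
]

# Build the document-term matrix: one row per document,
# one column per distinct word, entries counting occurrences.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the columns of our new dataset
print(X.toarray())                         # word counts, one row per document
```

Each row of X is now an ordinary numeric record that could sit in a dataset next to other variables, such as protest turnout or stock returns, and be fed into the ML methods from a few chapters ago.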

Many companies do both. For example, Amazon is certainly working on language learning when it develops Alexa, and it likely uses text-as-data techniques when it conducts, say, sentiment analysis on product reviews or tweets about (or directed at) the company, in order to quickly understand what customers like or dislike, or to predict the performance of various products.
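To make “sentiment analysis” concrete, here is a minimal sketch using NLTK’s VADER analyzer, a lexicon-based scorer designed for short, informal text like reviews and tweets. This assumes NLTK is installed; the two product reviews are invented.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# VADER ships as a downloadable lexicon rather than with NLTK itself.
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()

# Invented reviews standing in for real product feedback.
reviews = [
    "Absolutely love this speaker, the sound is amazing!",
    "Broke after two days. Terrible build quality.",
]

for review in reviews:
    scores = sia.polarity_scores(review)  # neg/neu/pos plus a compound score in [-1, 1]
    print(f"{scores['compound']:+.2f}  {review}")
```

A positive compound score suggests a happy customer and a negative one a complaint; aggregated over thousands of reviews, scores like these are one quick way to track what customers like or dislike.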

Overall, NLP has applications across all kinds of exciting areas, including the following, along with a few example questions that might be asked in each:

  • Artificial intelligence: Can computers learn (reproduce, generate) language? Can they understand it?

  • Machine learning: How can we apply ML, deep learning, and neural nets that have been developed for non-text data to text data, and vice versa?

  • Linguistics: How has the structure of language changed over time? What is the relationship between language and the brain? (You could imagine some interesting crossover between NLP and RL here!)

  • Social science: Language can be thought of as an observable representation of a variable we might be interested in but cannot observe directly, such as beliefs, values, or culture; it could also help us predict and explain events.

  • Humanities: How have language, culture, and symbolism changed over time?

Our hero Alan Turing is credited as one of the pioneers in thinking about computers and artificial intelligence in the middle of the 20th century. You may have come across his famous Turing test, in which he posits that a computer that can convince a human that it, too, is human can be deemed “intelligent”. Depending on how convincing you require this to be, computers may well have long since exceeded this threshold. For example, I’ve definitely texted with customer service accounts for companies and not been sure whether I was interacting with a computer or a human (though, let’s be honest, sometimes it’s really clear it’s not a human). ChatGPT has also set a new bar for convincingness, and people are now actively working to figure out just how hard its output is to distinguish from human writing.

That said, there are limitations: the fact that we still have to complete exercises all over the internet to prove we are humans, not robots, can at least give us some comfort (or disappointment, depending on our views) that there are still things humans can do that computers cannot. With language specifically, though, the threshold of quality above which we can be confident we’re dealing with a human rather than a computer does seem to creep higher all the time.