Language was a recurrent theme of my childhood. I was born in Senegal, where French was the primary language. My family was Dutch, so much of the next few years was spent in the Netherlands. When I was four we moved to America, and my parents encouraged us to speak English as often as possible.
I had a gaming console that came with a special speech synthesiser. I would type in a string of characters and the synthesiser would try to sound them out. It was my very first coding project and I was so excited I asked my parents to sign me up for programming classes. It all seemed very high-tech at the time.
I had an epiphany while studying linguistics and computing in my college days: the way we think, and how the brain works, corresponds strangely to computing. If we could teach a computer to process verbs and nouns, what new insights could we gain into human language?
Today my research has evolved to bridge the gap between human and machine. I give computers the tools to understand the human world and design processes to help people gain the greatest benefit from their machines. I am inspired by the potential of text mining, a technique that gathers new information from separate written sources, because I believe this branch of computing holds the key to fighting disease.
By 2025, experts suggest, we will have sequenced between 100 million and two billion human genomes. Genes are made of DNA and act as tiny instruction manuals, providing a blueprint for life. Although humans around the world share 99.9 per cent of the same DNA, small differences occur between each person. Sometimes these differences are inconsequential; at other times they expose the root of disease.
Data gathered from gene sequencing will hold the key to disease diagnosis, prevention and treatment. It will help us connect illnesses with certain genes and tailor therapies to individual patients. To some extent, this new type of medicine is already here. Researchers have uncovered the genetic script behind pancreatic tumours, and gene sequencing quickly identified the cause of a young boy's life-threatening disease.
Other forms of big data hold similar promise. Sifting through social media conversations could identify disease outbreaks such as Ebola or pinpoint the site of natural disasters. But as valuable as this data is, it can be nearly impossible to analyse in a meaningful way.
Accessing data involves more than giving computers grammar and a dictionary. If you read an entire Dutch-English dictionary, it is unlikely you would become a fluent speaker of Dutch. This is because language is complex, ambiguous and highly dependent on context. We learn language by immersion, not by reading a dictionary.
Our lives are embedded in context, and as infants we are hardwired to learn how to communicate with others. And herein lies the problem: computers have no context. They might be capable of storing big data and hosting conversations on Twitter, but they are hopelessly literal objects.
To help computers learn, we summarise the human world for them. We provide a set of patterns that computers can make predictions from. Computers aren’t learning language in the same way a human does, but they are learning to apply rules from the human world to novel situations. This helps computers extract the precise data we need – the data that makes connections between genes and disease.
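The patterns described above can be sketched in miniature. The example below is a hypothetical, heavily simplified illustration, not the author's actual method: it uses two invented, hand-picked term lists and a simple rule (co-mention within a sentence) to pull candidate gene–disease connections out of raw text. A real text-mining system would learn far richer patterns from large annotated corpora.

```python
import re

# Hypothetical term lists for illustration only; a real system would
# draw on curated vocabularies and learned patterns, not hard-coding.
GENES = {"BRCA1", "KRAS", "TP53"}
DISEASES = {"pancreatic cancer", "breast cancer"}

def find_gene_disease_pairs(text):
    """Return (gene, disease) pairs co-mentioned in the same sentence."""
    pairs = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = [g for g in GENES if g in sentence]
        diseases = [d for d in DISEASES if d.lower() in sentence.lower()]
        pairs.extend((g, d) for g in genes for d in diseases)
    return pairs

text = ("Mutations in KRAS are found in most pancreatic cancer tumours. "
        "BRCA1 variants raise the risk of breast cancer.")
print(find_gene_disease_pairs(text))
# → [('KRAS', 'pancreatic cancer'), ('BRCA1', 'breast cancer')]
```

The co-mention rule is deliberately crude: it is exactly the kind of literal pattern a computer can apply to novel text, and exactly why richer context is needed to avoid spurious matches.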
My research has given me a lot of respect for the human brain and how we do things. It is truly amazing that a child hears the word “pea” and comes to associate it with little green things their mother insists on feeding them.
Even a computer scientist has the potential to change the course of medicine. I will never have a patient sit in my office and I won’t administer the drug that saves their life. But I can still help. It’s inspiring to think that we could use human language processing to understand something as big as biology.
- As told to Kristen Goodgame
Banner image: Andrew Lenards, via Flickr.