Kaggle founder talks Big Data
The world’s most valuable resource is no longer oil; it’s data
G'day, I'm Glyn Davis and welcome to The Policy Shop, a place where we think about policy choices.
All these devices and machines and everything we're building these days, whether it's phones or computers or cars or refrigerators, are throwing off data.
Information is being extracted out of tollbooths, out of parking spaces, out of Internet searches, out of Facebook, out of your phone.
Big data and algorithms are going to challenge white collar professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labour in the 20th century.
I hope you follow multiplication and quantity and data as a [inaudible] multiplication in the number of patterns that we can see in that data. This just in the last ten or 11 years. This, I would submit, is a sea-change, a profound change, in the economics of the world that we live in.
The great news for us as well is that it's the way that we can transform the American criminal justice system. It's how we can make our streets safer, we can reduce our prison costs and we can make our system much fairer and more just. Some people call it data science - I call it moneyballing criminal justice.
Today, the world's most valuable resource is no longer oil but data. Amazon, Apple, Facebook, Microsoft and Alphabet, Google's parent company, are the five most valuable listed firms in the world, collectively taking in over US$25 billion in the first quarter of 2017 alone. Whether you're going for a run, watching television, browsing online, or even just sitting in traffic, virtually every activity creates a digital trace, and you are being watched. Our connected, always-on world has generated huge volumes of data, but also heightened concerns.
At the cost of our privacy, this data economy brings new opportunities, particularly in public policy. Big data is now being used to fight obesity, to predict crime hotspots, to help NASA map dark matter.
In today's episode, we explore the potential and the challenges of big data for public policy. Joining us is an alumnus of the University of Melbourne who is working on the frontier of data science. He was twice placed on the Forbes top 30 under-30 list and his data analytics company has been backed by some of the biggest investors in Silicon Valley. Anthony Goldbloom, founder and CEO of Kaggle joins us on the line from San Francisco. Anthony, welcome to this episode.
Thanks for having me.
Anthony, you had the foresight some years ago, back in 2010, famously to leave a comfortable job at the Treasury and instead code Kaggle from a small apartment in Sydney. What does Kaggle do and why are you excited by its potential?
Sure. I'll tell you how Kaggle started out and we've opened our aperture from where we started, but we started out running machine learning competitions and so a company or a government would put their data up on our website and we had a community, or have a community, of data scientists, statisticians, machine learners who compete to build the best algorithm on a given problem.
We actually, our community just, I think last week, ticked over a million. So, there are a million people who spend part of their spare time browsing our website, taking our data sets and possibly competing in our competitions.
Since then, we've expanded, so we not only run machine learning competitions but if you're a data scientist and you're looking for somewhere to run your algorithm, we provide a tool called Kaggle Kernels which is a cloud-based workbench for doing your data science, so instead of doing it on your local laptop, like Google Docs but for data science. You have your analysis wherever you go and we also have an open data platform where researchers, governments and companies can share data sets, so it's like a vibrant library of open data sets that anyone can share and collaborate on.
So, we've expanded from that initial starting point of running machine learning competitions.
What was your experience setting up a company in Australia?
I would say I was extremely naïve when Kaggle started. My first job out of college was working at the Australian Treasury. Then after that, the Reserve Bank of Australia. So, I really knew very little. I think one thing that was an issue for me at the time, there certainly wasn’t that much of a start-up scene or a tech scene in Australia, so there weren't a lot of places to get great mentoring.
So, I think one of the big advantages I had in moving to Silicon Valley was that you sit in a café in the city and you just listen to a conversation next to you and it's about customer acquisition costs and growth strategies and Haskell versus Rust, different programming languages. It’s just like this city pulses in technology and start-ups.
So, as a first-time entrepreneur, I found it much easier as somebody who didn’t really have much of a background in running a company, easier to get mentoring. I think that a couple of things have changed since. I think the start-up scene is much more developed now than it was. It has had a huge success in Atlassian, which is a world-class company, so a lot of people in Australia who grew up with Atlassian now have the experience of working in a company that's ultimately turned out to be very successful. Google has a very big office in Sydney.
So, you have a lot more people trained, and having had the experience of going on a wild tech adventure, we have more capital in Australia, there's Square Peg Capital…
…among others that are funding companies. Also, Kaggle was acquired by Google three months ago, so I now work at Google. I'm very excited to see Kaggle at Google.
If at some point in the future I do another company, I actually think there are advantages in doing it from Australia. One advantage is the war for talent is probably not quite as acute, the cost of talent is probably not quite as high and so I think coming back to Australia as an experienced entrepreneur potentially down the track, I actually think there are some advantages to operating out of Australia.
Indeed. So, given your emphasis on how important the ecosystem is to have experienced entrepreneurs like you back in the system is going to be very important for the future. How far were you in your journey when you made the decision to move to the United States?
I started working on Kaggle initially actually nights while I was working at the Reserve Bank. That was in, let's say, early 2009. I think I calculated that at the rate I was going, I’d probably finish Kaggle in about 2040, which was taking too long. Initially, all I was doing was coding, so I was building the website. I left the job at the RBA in July. I initially did a really nice roots trip with my grandparents after that. There was a month of not really focusing on Kaggle.
I came back and I proposed to my wife just so I could lock in an income source. Then, coded up Kaggle, got it launched in April of 2010, actually just after our wedding. Then started spending time in America, probably a year after launching Kaggle. So Kaggle was going well enough that there were some signs of life, it had some potential. I started by going to the US as a tourist to speak at conferences, quote-unquote, tourist, close business deals. We recruited somebody in America and then moved to the US full-time about January 2012.
I would also say that moving to the US was the time when Kaggle switched from being a side-project that I thought could be a nice lifestyle business to something that I really thought had the chance to change the way that data science machine learning gets done.
You were there on the cusp of an important developing industry. How big is business analytics and predictive modelling today as an industry?
The market for predictive analytics, business modelling and machine learning is a $41 billion market. But what's hidden inside that number is - and it's growing at I think around eight per cent per year. It's a big market, it's growing fast. But the really exciting thing that's happening in the market at the moment is there are some major shifts which leave open a big business opportunity.
So, it used to be that compute was mostly done on premise and you used companies like Oracle and Teradata and companies like that to do a lot of your business analytics. There's a big shift towards the cloud. It used to be that you used predictive modelling statistical techniques for a lot of this work, so things like logistic regression, linear regression, even just basic pivot tables and things like that.
Now, with a set of new techniques, machine learning techniques, particularly gradient boosting machines and more recently deep neural networks, there's a fabulous set of things you can do now that you couldn’t do five, 10 years ago.
So, the power of these techniques is much greater than it was. So, there's also a business opportunity because a lot of the legacy players that I mentioned earlier are not really the world's best at these new techniques. Companies like Kaggle and our parent company Google and others like Amazon and Microsoft are far, far, far, far stronger in these new techniques. So, we're aggressively grabbing market share. Not just that the market is growing but the major players in this market are also changing.
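The growth figure quoted above compounds quickly. As a rough sketch only, assuming the roughly eight per cent annual rate held steady (the `project_market` helper and the three-year horizon are illustrative assumptions, not figures from the interview):

```python
def project_market(size_billion, annual_growth, years):
    """Compound a market size forward at a constant annual growth rate."""
    return size_billion * (1 + annual_growth) ** years

# $41 billion growing at about 8% per year, as quoted in the interview.
# The three-year horizon is an arbitrary choice for illustration.
projected = project_market(41.0, 0.08, 3)
print(f"After 3 years: about ${projected:.1f} billion")
```

A constant rate is a simplification; the point is only that steady single-digit growth adds billions of dollars to a market of this size within a few years.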
You certainly have some remarkable talent. You mentioned the data competitions. One of the famous ones for NASA, your data scientists solved a problem in a week and a half that NASA had been looking at for a decade. Does this give us some sense of the power of open source data?
Our competitions are a very efficient way to get people who are very strong with machine learning and data in contact with problems that they might not have otherwise known about but they have the ability to solve. NASA was taking a dataset of galactical images and they were trying to measure the ellipticity of galaxies very precisely. If you can measure the ellipticity of galaxies very precisely, you can use that to basically infer the dark matter distribution of the universe. It was quite a challenging problem, a problem they cared a lot about.
Actually, the first person to make a major breakthrough on that problem was of all things, his profession was a glaciologist which I'd never heard of a glaciologist, but he was doing a lot of image recognition which was the branch of machine learning that was required for the NASA problem in his glaciology research. Because what he was doing was taking satellite images and looking for the edges of glaciers algorithmically. It just so happens that the techniques that he was using to find the edges of glaciers algorithmically were a good fit for measuring the ellipticity of a galaxy.
I like that story because I always say he was looking from space down at Earth, from satellite images down at glaciers and the competition required him to look from the ground up. So, he just had to turn the problem on its head. It’s a nice story that one.
Can we talk for a minute about big data and government? I know you’re coming at it from a business perspective but one of the key users of big data is government and there's a lot of excitement about the opportunities to think again about some public policy interventions. Can you give us some sense of how your company and others have been working with government?
We do a bit with government, actually working with NOAA at the moment. NOAA, the acronym is National Oceanic and Atmospheric Administration. This is one of the big government departments in the US. They're really responsible for two things, the national weather service, which is the US's equivalent of the Bureau of Meteorology and they're also responsible for policing fisheries.
One of the key tasks they have to do in policing fisheries is keeping track of how many of different types of - they care about conservation - so how many different types of species exist? That can be a very tedious manual count. We're working with them at the moment to take aerial photographs - I think they were taken in Alaska, I forget exactly where - and building algorithms that automatically count the number of sealions based on those aerial photographs. So, just taking something that some people used to do manually and automating it.
We recently did a fascinating project with the Nature Conservancy. Fishing is not well monitored at all. It's extremely expensive to send out somebody on a fishing boat to make sure that there isn't illegal fishing happening. Actually, it's quite a dangerous job because fishing boats are often not the - they're quite rough places. There are stories of people getting kidnapped or bribed.
So, the idea was with the Nature Conservancy we tackled this problem where you put a camera on a fishing boat and you automatically count and classify the fish that are being caught. Are they going above their quota, or are they catching by-catch of species that they shouldn’t be catching? I guess I'm thinking about oceans at the moment which is why I came up with these two examples. But they’re examples of things that public policy had a hard time dealing with historically that these new machine learning techniques are able to help us address.
A lot of policy work by government relies on sampling techniques such as household surveys to gain insights into the population. You're describing quite a different and automated approach to that. Are governments perhaps using old tools in a new setting?
Yeah, I wish I could remember where this research was. But there was research done out of one of the US universities replacing the census with other data sources, as you can imagine, doing things like taking satellite images and counting people or cars or houses.
So, there are a lot of amazing data sources that didn’t exist in 1783 or whenever the US Constitution said that we have to have a census every 10 years. So, there are definitely ways to reboot the census for the big data world. Just more generally, I think it's a great opportunity for the Australian Bureau of Statistics, it's a great opportunity for the various statistical bureaus in the US to look at having a less people-intensive approach to gathering statistics and having those statistics be released on more of a real-time basis.
There is a huge opportunity that is probably not really being taken at the moment to reboot the way that national statistics are done.
People are talking very excitedly about perhaps addressing wicked problems using big data, so some of the most challenging issues around the planet, obesity and climate change for example. Is there a role for data in working through these sorts of choices and helping inform policy solutions?
I would say in some of these cases, not so much climate change, but perhaps health related issues like obesity. Some of the more classical techniques may be more useful so randomised control trials are an incredibly effective tool for public policy studies. I'm not sure that machine learning techniques have much to add above and beyond randomised control trials.
So, there may be ways that machine learning could help at the edges, but I think that making good use of randomised control trials is going to be more powerful for a lot of public health issues.
Moving from policy to politics, data sourced from voters is clearly becoming an important player in campaigns. Sometimes, in interesting policy ways, the Pirate Party in Iceland for example, has been a leader in this and Lab Hacker and E-Democracy are two programs that encourage people to make proposals to representatives, work through them to improve bills and policies. But there is also an interesting engagement by data in election campaigns. What's been the discussion between the big data industry and politics?
Yeah, it's interesting. I think the Obama campaign in particular did a good job of using data science machine learning techniques to do things like micro-targeting voters. You have a certain number of people who are willing to knock on doors for you and you don’t want to send them to doors where they're definitely not going to vote for your party or they're definitely going to vote for your party. So, in the Obama case, they didn’t want to knock on the doors of diehard Republicans or diehard Democrats. What they want to do is they want to find the people who are swing voters and spend the energy knocking on those doors.
Actually, the second Obama campaign in particular had a huge number of people from Silicon Valley volunteer and actually quite a few companies came out of that campaign that have gone on to be very successful. So, some of the technology they developed for that campaign they've gone on to make companies out of. There's a company called Optimizely that does A/B testing, there's a company called Civis Analytics, the founder was the Chief Analytics Officer for the Obama campaign.
Actually, interestingly, I would say this current campaign, turn that on its head, a lot of the same people who contributed to the Obama campaign were also contributing to the Clinton campaign. But it's interesting, because I think the Trump campaign was quite a lot less sophisticated. As I understand it, they did have some efforts in this regard but it was not nearly as sophisticated an effort. I think Trump ran a very different campaign and won notwithstanding all the sophistication of the Clinton statistical machine. The Trump campaign showed that there are other approaches that can be effective in winning.
Certainly, winning the electoral college vote. It's very interesting you say that because of course on the Trump side, there are companies involved, people involved who are very proud of, as they perceive it, the role of data in helping Donald Trump become President. Cambridge Analytica is the one mentioned quite often and we've got a short clip from Alexander Nix who is the CEO of Cambridge Analytica speaking at a presentation in September 2016, so just weeks before the Presidential election.
Communication is fundamentally changing. Back in the days of Mad Men, communication was essentially top down, that is it's creative-led. Brilliant minds get together and come up with slogans like Beanz Means Heinz and Coca Cola Is It. They push these messages onto the audience in the hope that they resonate.
Today, we don’t need to guess at what creative solution may or may not work. We can use hundreds or thousands of individual data points on our targeted audiences to understand exactly which messages are going to appeal to which audiences way before the creative process starts.
Can you say something about psychographic profiling which is where Cambridge Analytica made its name?
The ultimate goal for a lot of these efforts is, as I said, micro-targeting. What they will often do is they will buy data sets and they'll buy data sets that really range from really whatever they can get hold of. So, they might be able to buy data on your magazine subscriptions, they may be able to buy data on what kind of car you drive, they might be able to buy data on all sorts of things. So, once they have assembled a large data set they can find some interesting relationships that might help them determine your political leaning.
So, for instance, if you have a big heavy four-wheel drive SUV as it's called in America, that might point in the direction of you being a Republican voter. But then if you also have four kids, that might steer you a little bit towards you being a Democrat voter, and the more of these what you call signals that can be assembled, the better the microtargeting is at pinning down your likely political persuasions.
As I said, the ultimate is to pick people who are on the fence. A well landed message can swing them in one direction or the other.
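The signal-combining idea described above can be sketched in a few lines. This is a toy illustration only, not any campaign's actual method: the signal names, the weights and the `on_the_fence` margin are all invented assumptions, and a real system would estimate weights from data rather than hard-code them.

```python
import math

# Hypothetical signal weights: positive leans Republican, negative leans Democrat.
# The numbers are invented for illustration, not estimated from any data set.
SIGNAL_WEIGHTS = {
    "drives_suv": 0.8,
    "has_four_kids": -0.5,
}

def lean_probability(signals):
    """Combine the signals that are present into a logistic score between 0 and 1."""
    score = sum(
        SIGNAL_WEIGHTS.get(name, 0.0)
        for name, present in signals.items()
        if present
    )
    return 1.0 / (1.0 + math.exp(-score))

def on_the_fence(voters, margin=0.1):
    """Keep voters whose estimated probability is within `margin` of 0.5."""
    return [
        name
        for name, signals in voters.items()
        if abs(lean_probability(signals) - 0.5) <= margin
    ]

# Hypothetical voters, described only by the two example signals.
voters = {
    "voter_a": {"drives_suv": True, "has_four_kids": False},
    "voter_b": {"drives_suv": True, "has_four_kids": True},
    "voter_c": {"drives_suv": False, "has_four_kids": False},
}
targets = on_the_fence(voters)
print(targets)
```

With these made-up weights, a voter whose signals pull in opposite directions lands near the middle and is kept as a door-knocking target, while a voter with a single strong signal is filtered out.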
So, data modelling helps you to connect people and then move them to action. Donald Trump's chief strategist, Steve Bannon, sat on the board of Cambridge Analytica and one of its key backers Robert Mercer was a major donor to the US President. Obviously, this is a very sophisticated use of data, but is there something sinister we should fear in this development?
Yeah, I would say the area where big data gets a little bit murky is when data is being sold without your knowledge. So, companies like Google I think are very transparent. You can see what data they have about you and have some visibility into how it's used. There are a lot of data brokers out there who will - are collecting data about you and will sell it without your knowledge. So, data sets about you can end up in the hands of political campaigns or other places where you might be surprised how much they know about you. To the extent that there are sinister elements of the, quote-unquote, big data revolution, it's some of what is happening with your data that you're not aware of.
Indeed, the interaction between data and security services is very patent. One of the revelations from Edward Snowden is just the level of information that was held by security services. So, isn't privacy - you've just touched on the fundamental issue, when you consider how connected the world is and how we as citizens can't know the full range of data that's being collected on us.
I think some companies handle it better than others. As I said, I think Google is really an exemplar here. Yes, they collect a lot of data, but you have the ability to interrogate the data that is collected on you. So, I think the areas where there really is a breach of trust or breach of privacy is when things are being done with your data that you have no knowledge of.
How good is government at understanding the implications of data and protecting its citizens when technology is moving so extraordinarily quickly?
It's actually a big challenge. So, the way I've heard it described which I quite like is tech companies, Silicon Valley, iterates on a three-month cycle, traditional businesses iterate on a five-year cycle and government iterates on a 100-year cycle. It's very difficult. Government is trying to regulate in a world where they're not able to attract the best talent, who really understand how data science, machine learning and a lot of these new technologies work.
So, in a lot of cases, you're finding regulation is slow to update. When it does update, it ends up updating in ways that are not necessarily in anybody's best interests. We're in a world that is probably changing faster, arguably changing faster than it's ever changed before. A lot of the most talented people are not finding government a particularly attractive place to go and it makes it hard for governments to keep up.
I'd just like to go to a slight variation on that but an interesting question around privacy as well, big data is being used by police forces to anticipate trends. In California, where you are, predictive policing has seen robberies decline by a quarter following use of the PredPol policing software.
Anthony, this certainly shows the power of big data, but is there a risk of targeting certain socio-economic groups?
Yeah, one big hot topic in data science, especially in machine learning circles at the moment, is bias in algorithms and how do you build algorithms that are not biased? This actually also happens to be a hot area of academic research. It's a really tough one. If you feed biased data into an algorithm, the algorithm will continue to have a bias. So, the only real solution to this is having balanced unbiased data sets. Data science machine learning is not magic. It is only as good as the data you feed it.
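The point that biased data yields a biased algorithm can be illustrated with a deliberately trivial model. This is a sketch under stated assumptions, not a description of PredPol or any real system: the `fit_rates` helper, the district names and the record counts are all hypothetical.

```python
from collections import Counter

def fit_rates(records):
    """'Train' a trivial model: each district's share of recorded incidents."""
    counts = Counter(records)
    total = sum(counts.values())
    return {district: n / total for district, n in counts.items()}

# Hypothetical scenario: both districts have the same underlying incident rate,
# but "north" was patrolled more heavily, so more of its incidents were recorded.
biased_records = ["north"] * 30 + ["south"] * 10

# The same two districts with equally thorough recording in each.
balanced_records = ["north"] * 10 + ["south"] * 10

biased_model = fit_rates(biased_records)
balanced_model = fit_rates(balanced_records)

print("biased:", biased_model)
print("balanced:", balanced_model)
```

The model fitted to the skewed records rates "north" as far riskier even though, by construction, the districts are identical; only the balanced records recover the even split.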
Revenues in big data business analytics are expected to grow from the $41 billion you've indicated to perhaps $200 billion or more by 2020. I know it's hard to predict the future. Where do you see the big data industry heading?
Kaggle has a million users today. I'd like to see us be 10 times that, 100 times that in the coming years. So, actually universities like the University of Melbourne have a huge role to play here. There is a massive shortage of data scientists and machine learners and it's a very rewarding profession, at the moment a very lucrative profession. So, I think for data science machine learning to reach its potential to drive a large number of the decisions that each of us make every day or that companies make every day, we need to solve this talent shortage.
If we do solve this talent shortage, we will start to see algorithms starting to help make business decisions more frequently and more precisely than we can do as humans, which does obviously open up a scary question. If algorithms are capable of doing things like analysing us as we go through security, counting sealions from an aerial survey, there's a question around jobs and what would people do when a lot of the jobs that we currently do are taken over by algorithms.
That is genuinely something that I think a lot about and worry a lot about and I don't know, given the pace of innovation, historically when we face big waves of innovation, humans have been able to retrain and we're very flexible and do things that technology can't do.
The one potential difference here is that the change is coming so quickly, do we have enough time to retrain? This is not like the industrial revolution which took decades and decades and decades to unfold but rather it's happening. The progress, you can measure in years, not in decades. There is certainly a scary dimension to the rise in machine learning and artificial intelligence.
Anthony, as we're talking about industry, one of the arguments that's beginning to emerge in the academic and other literature is around monopolies, around whether we're going to see a handful of tech companies dominating an industry, and others finding it very difficult to come in, and that has raised questions about whether the anti-trust responses of the early twentieth century will need to be thought about in terms of technology. How do you respond to that emerging discussion?
Yeah, I think that tech industry is quite different to the oil and gas industry for instance. Technology tends to create natural monopolies but those natural monopolies don't tend to last for very long. The last generation of great tech companies were the Oracles and the SAPs. This generation it's Google, Amazon, Facebook, Microsoft. The next generation may be a company that hasn't even been founded yet.
So, there is so much change in the technology industry that you're a king for a while but you’re only a king until the next big wave of technology comes.
So, I think that policy and antitrust has much less concern today than in the Standard Oil days because it's just the pace of change in the industry will mean that the current giants end up being disrupted without antitrust having to intervene.
You mentioned the challenges for the public service and public sector in understanding what data makes possible. What sort of training and upskilling is going to be necessary in order to make sure both sides of the equation know what they're doing?
One problem that the public sector has is it really operates as an environment that is not super-conducive to the kinds of people who - the kinds of people who do well in the tech industry are not necessarily the same kinds of people who are going to be attracted to a role in the public service. People who tend to do well in the tech industry are restless and they want to be constantly changing things. Government, as I said, tends to change quite slowly.
There are some bright spots. There's the US Digital Service which is a very fast moving part of the US government that helps implement the latest in technology in the areas of the US government. So, looking at things like the US Digital Service it's a nice example for other governments to follow.
Finally, you mentioned the possibility, very welcome, that you and others might use Australia in the future as a base for start-ups. But what would need to change here to make Australia a very attractive place for companies such as yours?
Yeah, it's a good question. I think a lot has changed in Australia anyway. For somebody like me, it's attractive to come back to Australia because we have great universities, we turn out great computer science students and they don’t have the job opportunities necessarily that they would have if they were at a Stanford or a Berkeley.
For me personally, it's attractive to work on a company out of Australia because you have access to that talent without having to compete with the Googles and the Facebooks. I think for somebody who was like me six or seven years ago and just getting started, really the ability to work in tech companies and get a bit of a - get the mentoring you need, it's really important. The only way to get that is to have other successful companies grow up in Australia and I think that is starting to happen. You have the Atlassians of the world, as I said. So, that ecosystem is already naturally building.
Anthony, it's been great talking with an Australian graduate who's providing leadership in such a fascinating and demanding industry, and thank you for encouraging today's students to think about maths and computer science and machine learning as really important areas.
Anthony Goldbloom, founder and CEO of Kaggle, thank you for taking the time to talk to us.
Nice to make the connection. I've got many emails signed by you and maybe my degree is signed by you.
The series producer of The Policy Shop is Eoin Hahessy with co-producers Ruby Schwartz and Paul Gray. Audio engineering is by Gavin Nebauer. This podcast is licensed under Creative Commons. Copyright, the University of Melbourne 2017. If you want to find out more about this subject, check out the documentary The Human Face of Big Data.
Amazon, Apple, Facebook, Microsoft and Alphabet - Google’s parent company - are the five most valuable listed firms in the world, collectively taking in over US$25 billion in the first quarter of 2017.
Whether you are going for a run, watching TV, browsing online or even just sitting in traffic, virtually every activity creates a digital trace. Our connected, always ‘on’ world has generated huge volumes of data but also heightened concerns.
At the cost of our privacy this data economy brings opportunities, particularly in the area of public policy. Big Data is now being used to fight obesity, predict crime hot spots and to even help NASA map Dark Matter.
In this episode of The Policy Shop, alumnus of the University of Melbourne and Founder and CEO of the data analytics company, Kaggle, Anthony Goldbloom joins Professor Glyn Davis, Vice-Chancellor of the University of Melbourne, to discuss the potential and challenges of big data for public policy.
Episode recorded: 15 June 2017
The Policy Shop producer: Eoin Hahessy
Audio engineer: Gavin Nebauer