Small Data Is the Next Big Thing in AI; It Will Make AI Smarter
Big data worries me. And the idea that Artificial Intelligence (AI) cannot do without big data worries me most. Just this week, a quick and by no means scientific mini-survey during the World Summit AI in Amsterdam revealed that around 20% of the participants expect more and bigger data to be the key to the further development of AI.
Big data demands a lot of effort, is very expensive, and brings with it many problems. One needs to collect the data, maintain it, manage it, and develop all kinds of governance structures to deal with it. It raises problems of privacy and security, and it is vulnerable to attacks, misuse and misinterpretation. Data is context- and time-dependent, which means that if we want to keep it up to date we need to continuously collect more data, about more situations, in more contexts. Thus big data leads to more data, which leads to even bigger data. Of course, there are many reasons and many domains where one needs big data. But I claim that AI is not one of those domains.
In AI, big data is mostly used for machine learning (ML). Current ML techniques are mostly applications of probabilistic theories, in different flavours. Basically, ML is a search for correlations within data: the more data, the more certain one can be that the correlations are correct. Hence the need for big, bigger, biggest data. Typically, an image recognition algorithm needs to be trained on several million images of elephants before it can correctly identify an elephant. Once it does, it will do so faster and more accurately than people, but it will remain brittle: even a slight change to a few pixels can cause unexpected results. The same holds for speech recognition algorithms or automatic translation, for example.
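To make that brittleness concrete, here is a minimal, hypothetical sketch in plain NumPy. No real vision model is involved; the "classifier" is just a linear score over an invented feature vector. It shows the idea behind adversarial examples: a tiny, bounded nudge to every "pixel", chosen in the direction that most changes the score, is enough to flip the prediction.

```python
import numpy as np

# Hypothetical toy "elephant vs. not-elephant" classifier.
# Weights and input are invented stand-ins for a trained model and an image.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)   # learned weights
x = rng.normal(size=1000)   # a "picture", flattened to a feature vector

score = w @ x               # positive score -> "elephant"
print("original prediction:", "elephant" if score > 0 else "not elephant")

# Move each "pixel" by at most epsilon in the direction that most reduces
# the score's magnitude (the sign-of-gradient trick behind adversarial
# examples). epsilon is chosen just large enough to flip the sign.
epsilon = 2 * abs(score) / np.abs(w).sum()
x_adv = x - epsilon * np.sign(w) * np.sign(score)

print("max per-pixel change:", epsilon)  # tiny relative to pixel values ~1
print("perturbed prediction:", "elephant" if w @ x_adv > 0 else "not elephant")
```

For this toy setup the per-pixel change comes out at a few percent of a typical pixel value, yet the label flips; real deep networks exhibit the same failure mode at perturbations invisible to the human eye.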
In fact, the current achievements in machine learning owe more to the increasing availability of computational power to store and process data than to any real breakthrough in AI theory.
No child in the world, however, needs to look at a million elephant pictures before it can recognise one! As Ralf Herbrich (Amazon’s Director of Machine Learning Science) very accurately put it, “Machine Learning needs better accuracy per calorie.” The effort-to-accuracy ratio is huge. Ralf’s plea at World Summit AI concerned the energy consumption of AI systems. I would like to add a plea for intelligent AI.
Why are people able to recognise elephants after seeing just a few images? Among other things, because people do not use correlation as their only reasoning mechanism. We use causality, we use abstraction. Computational theories of causation and abstraction have existed for many years, and are part of what used to be called ‘symbolic AI’. Symbolic approaches may be closer to the way people reason and do things. Just consider that the highly complex means humans have developed to handle symbols (language) is what brought us here as a species. These techniques require only a fraction of the data that correlation approaches require, but need much more effort to design and set up; the sketch below illustrates the trade-off.
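As a toy illustration of that contrast, consider this hypothetical sketch: a hand-written symbolic rule that recognises an elephant from a handful of abstract features. The features and the rule are invented purely for illustration; the point is that the knowledge is designed once, by a person, rather than estimated from millions of examples.

```python
# Hypothetical sketch of a symbolic approach: instead of estimating
# correlations from millions of images, we encode abstract knowledge
# ("an elephant is a large grey animal with a trunk and big ears") once.

def is_elephant(features: set) -> bool:
    """A hand-designed rule over symbolic features extracted from an image."""
    required = {"trunk", "large ears", "grey", "four legs"}
    return required <= features  # subset test: all required features present

# One or two examples suffice to check the rule; no training set is needed.
print(is_elephant({"trunk", "large ears", "grey", "four legs", "tusks"}))  # True
print(is_elephant({"mane", "tawny", "four legs"}))                         # False
```

The design effort shifts from gathering and labelling data to writing and validating the rule, which is exactly the trade-off symbolic AI makes.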
Without demeaning the huge and important developments that have brought ML to where it is now, I believe that further progress cannot come from merely using ever more data and ever more computing power. Even if Moore’s law were to hold forever, such an approach is not sustainable, at the very least from an environmental perspective, given the power it requires.
The same survey I mentioned above indicated that over 40% of the participants see societal responsibility as the main concern for the further development of AI. Responsibility in AI is not just about ethics, bias, and trolley problems. It is mostly about the role and position of AI in its societal context. It is about ensuring the best tools for the job while minimizing side effects (the maintenance of huge amounts of data, energy usage). We need to be responsible about this dependency on big data. Like junk food, big data is making us dependent on consuming more and more data. Let’s go on a diet! Rethinking the use of correlation techniques, readdressing causality and abstraction theories, and combining the two leads not only to sustainable solutions, but to better results overall.
Small data is not just easier to use; it is the responsible way forward for AI!