Machine learning companies love data, and most of today’s data scientists, unfortunately, still work on the outdated premise that more data is always better. Unsurprisingly, this creates an environment where the solution to any machine learning problem is to throw more data at it. That comes at the expense of more compute, storage, and processing. While this can be a viable approach for global organizations with big budgets, it’s not always the best one. Gathering, cleaning, and annotating data is time-consuming and expensive. Sometimes, you simply won’t have enough data.
The key point is that there’s no guarantee that analyzing massive amounts of data will translate into meaningful insights. The result of “more data is always better”? Underwhelming implementations and catastrophic failures that waste millions of dollars on data preparation and on the hours spent figuring out whether the data is even useful. Instead, companies should make conscious choices about what data, and how much of it, is needed.
Big data? Big whoop
Most machine learning models today are trained with mountains of data to ensure the highest probability of success. That can work, but the practice stems from the misconception that the more data you have, the easier a task becomes. Of course, no one wants to start building a model only to realize there isn’t enough data. If you can’t accurately estimate how much data you need, gathering more looks justified, since more information seems to offer a higher chance of success.
In reality, the process of collecting data can be extensive, and in the end, data scientists are left with a substantial amount of data that they know nothing about. With most machine learning tools, you go in blind after putting in your data. There are still no answers about what needs to be measured or which attributes of the data points matter. Data engineers are tasked with analyzing the data to determine, based on their intuition, whether certain features are more relevant than others. Then, they build a model and hope that it can answer the question at hand. The result is an ad hoc approach where we just hope that the person looking at the data comes up with an intelligent way to make sense of the numbers in front of them.
The overarching pain point is that today’s data scientists operate under vast amounts of uncertainty. There isn’t a clear answer as to which model is the right model, how much data it will take, or how long it will take, so there isn’t a definite budget. Under this mentality, the more data available to experiment with, the more models can be created to determine the best solution.
Debunking the myth
The more-data approach is driven by the fear of not having enough data to answer the question at hand. However, we can disprove this theory with a straightforward example. Let’s say you’re given the sequence [2, 4, 6, 8] and asked to guess the next number; what would your guess be? Most people would say 10, because the rule, “+2”, is immediately apparent.
Interestingly, adding more instances to this data set does not make it more learnable. If you were given [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, etc.], you would still have figured out the rule after analyzing just the first four numbers, so there is no value or benefit in going all the way to 100. Machine learning works in the same way. The end goal is to infer a model, so once you have a rule that you can derive from your data, there isn’t a need to continue adding more information to the dataset. Continuing past that point only consumes more time and compute, which means more money spent.
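The sequence example can be sketched in a few lines of Python. This is only an illustration, not a description of any particular tool: it assumes the "rule" is a straight line fitted by ordinary least squares, and the helper names `fit_line` and `predict_next` are made up for this sketch.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * i + intercept over positions i = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_next(ys):
    """Apply the fitted rule to the position right after the last observed one."""
    slope, intercept = fit_line(ys)
    return slope * len(ys) + intercept


short = [2, 4, 6, 8]
long = list(range(2, 101, 2))  # 2, 4, 6, ..., 100

# Compare the rule learned from four numbers with the rule learned from fifty.
print("rule from 4 numbers: ", fit_line(short))
print("rule from 50 numbers:", fit_line(long))
print("next after the short sequence:", predict_next(short))
```

Both calls to `fit_line` recover the same slope and intercept, so the forty-six extra numbers cost more to process without changing what was learned.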
Often, there’s an argument about noise, but the real problem is that there’s a tradeoff between generalization and accuracy. Ideally, a model would generalize past the noise and concentrate on the actual content, but irreducible error cannot be removed, no matter how good the model is. There is a ceiling on the accuracy that can be achieved. Still, the lack of awareness in the industry today pushes data scientists to solve accuracy issues with more data, compute, and parameters. What’s worse is that this process creates a significant problem for machine learning surrounding verification, validation, and trust.
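A small simulation can make the point about irreducible error concrete. This is a minimal sketch under stated assumptions: the data follow a made-up rule, y = 2x plus Gaussian noise, and `NOISE_STD`, `make_data`, `fit_line`, and `mse` are hypothetical names chosen for illustration. The noise variance plays the role of the irreducible error.

```python
import random

NOISE_STD = 1.0  # assumed label noise; its variance (1.0) is the irreducible error


def make_data(n, rng):
    """Draw n points from the assumed rule y = 2x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, NOISE_STD) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on held-out data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
test_xs, test_ys = make_data(5000, rng)

results = {}
for n in (20, 200, 2000, 20000):
    train_xs, train_ys = make_data(n, rng)
    results[n] = mse(fit_line(train_xs, train_ys), test_xs, test_ys)
    print(f"training points: {n:>6}  held-out MSE: {results[n]:.3f}")
```

In this setup the held-out error should settle near the noise variance rather than heading toward zero, however many training points are added. Past that point, extra data buys compute cost, not accuracy.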
The solution? It’s time to take the guesswork out of machine learning by sizing machine learning models and measuring the learnability of any labeled data set against the model types. These measurements would help prevent overfitting and underfitting, allowing data scientists to better understand their data with accurate and unique models that they can trust.
The industry practice of machine learning has been a relatively unbridled rush to use all available computational resources to build the biggest, most data-rich models possible. Why? Because this is what everyone has believed leads to improved accuracy. Unlike professional engineers and scientists in most disciplines, who say, “measure twice, cut once,” machine learning experts today say, “more data is better data.” However, measuring models before they are built transforms the curation of your dataset into a science, optimizing the size and diversity of the dataset for the best possible model results.