
Preparing Your Dataset for Machine Learning: 10 Steps

There’s a good story about bad data from Columbia University. A healthcare project aimed to cut costs in the treatment of patients with pneumonia. It employed machine learning (ML) to automatically sort through patient records and decide who has the lowest death risk and can take antibiotics at home, and who is at high risk of death from pneumonia and should be in the hospital. The team used historic data from clinics, and the algorithm was accurate.

But there was an important exception. One of the most dangerous conditions that may accompany pneumonia is asthma, and doctors always send asthmatics to intensive care, resulting in minimal death rates for these patients. So, the absence of asthmatic death cases in the data made the algorithm assume that asthma isn’t that dangerous during pneumonia, and in all cases the machine recommended sending asthmatics home, even though they had the highest risk of pneumonia complications.

ML depends heavily on data. It’s the most crucial aspect that makes algorithm training possible and explains why machine learning became so popular in recent years. But regardless of your actual terabytes of information and data science expertise, if you can’t make sense of data records, a machine will be nearly useless or perhaps even harmful.

The thing is, all datasets are flawed. That’s why data preparation is such an important step in the machine learning process. In a nutshell, data preparation is a set of procedures that helps make your dataset more suitable for machine learning. In broader terms, data prep also includes establishing the right data collection mechanism. And these procedures consume most of the time spent on machine learning. Sometimes it takes months before the first algorithm is built! Dataset preparation can also be called data wrangling or data munging, so check our article on that as well.


Dataset preparation is sometimes a DIY project

If you were to consider a spherical machine-learning cow, all data preparation should be done by a dedicated data scientist. And that’s about right. If you don’t have a data scientist on board to do all the cleaning, well… you don’t have machine learning. But as we discussed in our story on data science team structures, life is hard for companies that can’t afford data science talent and try to transition existing IT engineers into the field. Besides, dataset preparation isn’t narrowed down to a data scientist’s competencies only. Problems with machine learning datasets can stem from the way an organization is built, workflows that are established, and whether instructions are adhered to or not among those charged with recordkeeping.


Yes, you can rely completely on a data scientist in dataset preparation, but by knowing some techniques in advance there’s a way to meaningfully lighten the load of the person who’s going to face this Herculean task.

So, let’s have a look at the most common dataset problems and the ways to solve them.

0. How to collect data for machine learning if you don’t have any

The line dividing those who can play with ML and those who can’t is drawn by years of collecting information. Some organizations have been hoarding records for decades with such great success that now they need trucks to move it to the cloud as conventional broadband is just not broad enough.

For those who’ve just come on the scene, lack of data is expected, but fortunately, there are ways to turn that minus into a plus.

First, rely on open source datasets to initiate ML execution. There are mountains of data for machine learning around and some companies (like Google) are ready to give it away. We’ll talk about public dataset opportunities a bit later. While those opportunities exist, usually the real value comes from internally collected golden data nuggets mined from the business decisions and activities of your own company.

Second – and not surprisingly – now you have a chance to do data collection the right way. The companies that started data collection with paper ledgers and ended with .xlsx and .csv files will likely have a harder time with data preparation than those who have a small but proud ML-friendly dataset. If you know the tasks that machine learning should solve, you can tailor a data-gathering mechanism in advance.

What about big data? It’s so hyped that it seems like the thing everyone should be doing. Aiming at big data from the start is a good mindset, but big data isn’t about petabytes. It’s about the ability to process them the right way. The larger your dataset, the harder it gets to make the right use of it and yield insights. Having tons of lumber doesn’t necessarily mean you can convert it to a warehouse full of chairs and tables. So, the general recommendation for beginners is to start small and reduce the complexity of their data.

1. Articulate the problem early

Knowing what you want to predict will help you decide which data may be more valuable to collect. When formulating the problem, conduct data exploration and try to think in the categories of classification, clustering, regression, and ranking that we talked about in our whitepaper on business application of machine learning. In layman’s terms, these tasks are differentiated in the following way:

Classification. You want an algorithm to answer binary yes-or-no questions (cats or dogs, good or bad, sheep or goats, you get the idea) or you want to make a multiclass classification (grass, trees, or bushes; cats, dogs, or birds, etc.). You also need the right answers labeled, so an algorithm can learn from them. Check our guide on how to tackle data labeling in an organization.

Clustering. You want an algorithm to find the rules of classification and the number of classes. The main difference from classification tasks is that you don’t actually know what the groups and the principles of their division are. For instance, this usually happens when you need to segment your customers and tailor a specific approach to each segment depending on its qualities.

Regression. You want an algorithm to yield some numeric value. For example, if you spend too much time coming up with the right price for your product since it depends on many factors, regression algorithms can aid in estimating this value.

Ranking. Some machine learning algorithms just rank objects by a number of features. Ranking is actively used to recommend movies in video streaming services or show the products that a customer might purchase with a high probability based on his or her previous search and purchase activities.

It’s likely that your business problem can be solved within this simple segmentation and you may start adapting a dataset accordingly. The rule of thumb on this stage is to avoid over-complicated problems.

2. Establish data collection mechanisms

Creating a data-driven culture in an organization is perhaps the hardest part of the entire initiative. We briefly covered this point in our story on machine learning strategy. If you aim to use ML for predictive analytics, the first thing to do is combat data fragmentation.

For instance, if you look at travel tech – one of AltexSoft’s key areas of expertise – data fragmentation is one of the top analytics problems here. In hotel businesses, the departments that are in charge of physical property get into pretty intimate details about their guests. Hotels know guests’ credit card numbers, types of amenities they choose, sometimes home addresses, room service use, and even drinks and meals ordered during a stay. The website where people book these rooms, however, may treat them as complete strangers.

This data gets siloed in different departments and even different tracking points within a department. Marketers may have access to a CRM but the customers there aren’t associated with web analytics. It’s not always possible to converge all data streams into a centralized storage if you have many channels of engagement, acquisition, and retention, but in most cases it’s manageable.

Usually, collecting data is the work of a data engineer, a specialist responsible for creating data infrastructures. But in the early stages, you can engage a software engineer who has some database experience.


There are two major types of data collection mechanisms.

Data Warehouses and ETL

The first one is depositing data in warehouses. These storages are usually created for structured (or SQL) records, meaning they fit into standard table formats. It’s safe to say that all your sales records, payrolls, and CRM data fall into this category. Another traditional attribute of dealing with warehouses is transforming data before loading it there. We’ll talk more about data transformation techniques in this article. But generally it means that you know which data you need and how it must look, so you do all the processing before storing. This approach is called Extract, Transform, and Load (ETL).

The problem with this approach is that you don’t always know in advance which data will be useful and which won’t. So, warehouses are normally used to access data via business intelligence interfaces to visualize the metrics we know we need to track. And there’s another way.

Data Lakes and ELT

Data lakes are storages capable of keeping both structured and unstructured data, including images, videos, sound recordings, PDF files... you get the idea. But even if data is structured, it’s not transformed before storing. You load data there as is and decide how to use and process it later, on demand. This approach is called Extract, Load, and Transform (ELT): the transformation happens only when you need it.

You can find more on the difference between ETL and ELT in our article. So, what should you choose? Generally, both. Data lakes are considered a better fit for machine learning. But if you’re confident in at least some data, it’s worth keeping it prepared as you can use it for analytics before you even start any data science initiative.

And keep in mind that modern cloud data warehouse providers support both approaches.

Handling the human factor

Another point here is the human factor. Data collection may be a tedious task that burdens your employees and overwhelms them with instructions. If people must constantly and manually make records, chances are they will consider these tasks yet another bureaucratic whim and let the job slide. For instance, Salesforce provides a decent toolset to track and analyze salespeople’s activities, but manual data entry and activity logging alienate salespeople.

This can be solved using robotic process automation systems. RPA algorithms are simple, rule-based bots that can do tedious and repetitive tasks.

Check our dedicated article on data collection to learn more.

3. Check your data quality

The first question you should ask: do you trust your data? Even the most sophisticated machine learning algorithms can’t work with poor data. We’ve talked in detail about data quality in a separate article, but generally you should look at several key things.

How tangible is human error? If your data is collected or labeled by humans, check a subset of data and estimate how often mistakes happen.

Were there any technical problems when transferring data? For instance, the same records can be duplicated because of server error, or you had a storage crash, or maybe you experienced a cyberattack. Evaluate how these events impacted your data.

How many omitted values does your data have? While there are ways to handle omitted records, which we discuss below, estimate whether their number is critical.

Is your data adequate to your task? If you’ve been selling home appliances in the US and now plan on branching into Europe, can you use the same data to predict stock and demand?

Is your data imbalanced? Imagine that you’re trying to mitigate supply chain risks and filter out those suppliers that you consider unreliable and you use a number of metadata attributes (e.g., location, size, rating, etc.). If your labeled dataset has 1,500 entries labeled as reliable and only 30 that you consider unreliable, the model won’t have enough samples to learn about the unreliable ones.
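Some of the checks above (duplicated records, omitted values, class imbalance) are easy to script. Below is a minimal sketch using only the Python standard library. The record layout, the field order, and the `quality_report` helper are assumptions made for illustration, not part of any particular tool.

```python
from collections import Counter

# Hypothetical labeled records: (supplier_id, location, rating, label)
records = [
    ("s1", "US", 4.5, "reliable"),
    ("s2", "DE", None, "reliable"),
    ("s2", "DE", None, "reliable"),   # exact duplicate, e.g. from a server error
    ("s3", None, 2.0, "unreliable"),
    ("s4", "FR", 3.9, "reliable"),
]

def quality_report(rows, label_index=-1):
    """Return duplicate count, share of rows with missing values, and label counts."""
    duplicates = len(rows) - len(set(rows))
    with_missing = sum(1 for row in rows if any(value is None for value in row))
    labels = Counter(row[label_index] for row in rows)
    return {
        "duplicates": duplicates,
        "missing_share": with_missing / len(rows),
        "label_counts": dict(labels),
    }

report = quality_report(records)
print(report)
```

Applied to the supplier example, the label counts would show 30 unreliable entries out of 1,530, or roughly 2 percent, which is the kind of imbalance worth spotting before training.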

4. Format data to make it consistent

Data formatting sometimes refers to the file format you’re using. Converting a dataset into the file format that fits your machine learning system best isn’t much of a problem.

Here, we’re talking about the format consistency of the records themselves. If you’re aggregating data from different sources or your dataset has been manually updated by different people, it’s worth making sure that all values within a given attribute are consistently written. These may be date formats, sums of money (4.03 or $4.03, or even 4 dollars 3 cents), addresses, etc. The input format should be the same across the entire dataset.

And there are other aspects of data consistency. For instance, if you have a set numeric range in an attribute from 0.0 to 5.0, ensure that there are no 5.5s in your set.
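As an illustration, here is a small sketch of how such inconsistencies might be normalized in Python. The helper names, the list of accepted date formats, and the month-first reading of dotted dates are all assumptions; your own data will dictate the actual rules.

```python
import re
from datetime import datetime

def parse_money(text):
    """Convert '4.03', '$4.03', or '4 dollars 3 cents' to a float amount."""
    text = text.strip().lower()
    worded = re.fullmatch(r"(\d+)\s*dollars?(?:\s+(\d+)\s*cents?)?", text)
    if worded:
        dollars = int(worded.group(1))
        cents = int(worded.group(2) or 0)
        return round(dollars + cents / 100, 2)
    return round(float(text.lstrip("$").replace(",", "")), 2)

# Assumed input formats; dotted and slashed dates are read month-first.
DATE_FORMATS = ("%Y-%m-%d", "%m.%d.%Y", "%m/%d/%Y")

def parse_date(text):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

def out_of_range(values, low=0.0, high=5.0):
    """Return the values that fall outside the allowed range."""
    return [v for v in values if not low <= v <= high]

print(parse_money("4 dollars 3 cents"), parse_date("06.19.2017"))
print(out_of_range([0.0, 4.2, 5.5]))
```

Raising an error on an unknown date format, rather than guessing, keeps silent misreadings out of the dataset.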

5. Reduce data

It’s tempting to include as much data as possible, because of… well, big data! That’s wrong-headed. Yes, you definitely want to collect all data possible. But if you’re preparing a dataset with particular tasks in mind, it’s better to reduce data.

Since you know what the target attribute (what value you want to predict) is, common sense will guide you further. You can assume which values are critical and which are going to add more dimensions and complexity to your dataset without any forecasting contribution.

This approach is called attribute sampling.

For example, you want to predict which customers are prone to make large purchases in your online store. The age of your customers, their location, and gender can be better predictors than their credit card numbers. But this also works another way. Consider which other values you may need to collect to uncover more dependencies. For instance, adding bounce rates may increase accuracy in predicting conversion.

That’s the point where domain expertise plays a big role. Returning to our beginning story, not all data scientists know that asthma can cause pneumonia complications. The same works with reducing large datasets. If you haven’t employed a unicorn who has one foot in healthcare basics and the other in data science, it’s likely that a data scientist might have a hard time understanding which values are of real significance to a dataset.

Another approach is called record sampling. This implies that you simply remove records (objects) with missing, erroneous, or less representative values to make prediction more accurate. The technique can also be used in the later stages when you need a model prototype to understand whether a chosen machine learning method yields expected results and estimate ROI of your ML initiative.

You can also reduce data by aggregating it into broader records: divide the attribute data into groups and compute a single figure for each group. Instead of exploring the most purchased products of a given day through five years of online store existence, aggregate them to weekly or monthly scores. This will help reduce data size and computing time without tangible prediction losses.
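Aggregation of this kind can be sketched in a few lines. The example below rolls hypothetical daily sales up into monthly totals; the data and the `aggregate_monthly` helper are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (day, units_sold)
daily_sales = [
    (date(2017, 6, 18), 12),
    (date(2017, 6, 19), 7),
    (date(2017, 7, 1), 20),
    (date(2017, 7, 2), 5),
]

def aggregate_monthly(rows):
    """Sum daily values into one record per (year, month)."""
    totals = defaultdict(int)
    for day, units in rows:
        totals[(day.year, day.month)] += units
    return dict(sorted(totals.items()))

monthly_sales = aggregate_monthly(daily_sales)
print(monthly_sales)
```

Four daily records become two monthly ones here; over five years of history, the same step would shrink roughly 1,800 daily rows per product to 60.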

6. Complete data cleaning

Since missing values can tangibly reduce prediction accuracy, make this issue a priority. In terms of machine learning, assumed or approximated values are “more right” for an algorithm than just missing ones. Even if you don’t know the exact value, methods exist to better “assume” which value is missing or bypass the issue. How do you clean data? The right approach depends heavily on your data and your domain:

Substitute missing values with dummy values, e.g., n/a for categorical or 0 for numerical values
Substitute the missing numerical values with mean figures
For categorical values, you can also use the most frequent items to fill in.
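The three substitution strategies listed above might look like this in plain Python. This is a sketch that assumes missing values are represented as `None`; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def fill_with_dummy(values, dummy):
    """Replace None with a fixed placeholder, e.g. 'n/a' or 0."""
    return [dummy if v is None else v for v in values]

def fill_with_mean(values):
    """Replace None in a numeric column with the mean of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]

def fill_with_mode(values):
    """Replace None in a categorical column with the most frequent known value."""
    known = [v for v in values if v is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

print(fill_with_dummy([None, "a"], "n/a"))
print(fill_with_mean([10, None, 20]))
print(fill_with_mode(["red", None, "blue", "red"]))
```

Note that `fill_with_mean` and `fill_with_mode` fail on a column with no known values at all, which is arguably the right behavior: such a column has nothing to impute from.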

If you use some ML as a service platform, data cleaning can be automated. For instance, Azure Machine Learning allows you to choose among available techniques, while Amazon ML will do it without your involvement at all. Have a look at our MLaaS systems comparison to get a better idea about systems available on the market.

7. Create new features out of existing ones

Some values in your dataset can be complex, and decomposing them into multiple parts will help in capturing more specific relationships. This process is actually the opposite of reducing data, as you have to add new attributes based on the existing ones.

For example, if your sales performance varies depending on the day of the week, separating the day from the date as its own categorical value (Mon; 06.19.2017) may provide the algorithm with more relevant information.
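Here is a minimal sketch of that decomposition. It assumes records are dictionaries and that dates are written month-first, as in the example above; the field names and the `add_day_of_week` helper are hypothetical.

```python
from datetime import datetime

# Fixed names avoid locale-dependent output from strftime("%a").
DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

def add_day_of_week(record, date_key="date", date_format="%m.%d.%Y"):
    """Return a copy of the record with a categorical 'day_of_week' feature."""
    parsed = datetime.strptime(record[date_key], date_format)
    enriched = dict(record)
    enriched["day_of_week"] = DAY_NAMES[parsed.weekday()]
    return enriched

sale = {"date": "06.19.2017", "amount": 120.0}
print(add_day_of_week(sale))
```

The original record is left untouched, so the raw date remains available if you later want to derive other features from it, such as the month or a holiday flag.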

8. Join transactional and attribute data

Transactional data consists of events that snapshot specific moments, e.g. what was the price of the boots and the time when a user with this IP clicked on the Buy now button?

Attribute data is more static, like user demographics or age and doesn’t directly relate to specific events.

You may have several data sources or logs where these types of data reside. Both types can enhance each other to achieve greater predictive power. For instance, if you’re tracking machinery sensor readings to enable predictive maintenance, most likely you’re generating logs of transactional data, but you can add such qualities as the equipment model, the batch, or its location to look for dependencies between equipment behavior and its attributes.

You can also aggregate transactional data into attributes. Say you gather website session logs to assign different attributes to different users, e.g., researcher (visits 30 pages on average, rarely buys something), reviews reader (explores the reviews page from top to bottom), instant buyer, etc. You can then use this data to, for example, optimize your retargeting campaigns or predict customer lifetime value.
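Both directions, enriching events with attributes and condensing events into attributes, can be sketched with the predictive maintenance example. The equipment records, sensor readings, and helper names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical attribute data keyed by equipment ID
equipment = {
    "m1": {"model": "X100", "location": "Plant A"},
    "m2": {"model": "X200", "location": "Plant B"},
}

# Hypothetical transactional log of sensor readings
readings = [
    {"equipment_id": "m1", "temperature": 71.5},
    {"equipment_id": "m2", "temperature": 90.0},
    {"equipment_id": "m1", "temperature": 74.5},
]

def join_attributes(events, attributes, key="equipment_id"):
    """Attach static attributes to each event (a left join on the key)."""
    return [{**event, **attributes.get(event[key], {})} for event in events]

def average_by_key(events, key, value):
    """Aggregate transactional values into one attribute per key."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event[key]].append(event[value])
    return {k: sum(v) / len(v) for k, v in grouped.items()}

joined = join_attributes(readings, equipment)
averages = average_by_key(readings, "equipment_id", "temperature")
print(joined[0])
print(averages)
```

Events whose ID has no matching attribute record are kept as they are, so an incomplete equipment registry does not silently drop sensor readings.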

9. Rescale data

Data rescaling belongs to a group of data normalization procedures that aim to improve the quality of a dataset by bringing attributes onto comparable scales, so that values with large magnitudes don’t outweigh the others. What does this mean?

Imagine that you run a chain of car dealerships and most of the attributes in your dataset are either categorical to depict models and body styles (sedan, hatchback, van, etc.) or have 1-2 digit numbers, for instance, for years of use. But the prices are 4-5 digit numbers ($10000 or $8000) and you want to predict the average time for the car to be sold based on its characteristics (model, years of previous use, body style, price, condition, etc.). While the price is an important criterion, you don’t want it to outweigh the other ones just because its numbers are larger.

In this case, min-max normalization can be used. It entails transforming numerical values to ranges, e.g., from 0.0 to 1.0 where 0.0 represents the minimal and 1.0 the maximum values to even out the weight of the price attribute with other attributes in a dataset.

A somewhat simpler approach is decimal scaling. It entails scaling data by moving the decimal point, in the same direction and by the same number of places for every value of an attribute, for the same purpose.
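Both techniques fit in a few lines of Python. Min-max normalization maps each value x to (x - min) / (max - min). The decimal scaling variant shown here divides by the smallest power of 10 that brings every absolute value below 1, which is one common convention and an assumption on our part. The function names and sample prices are illustrative.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - low) / (high - low) for v in values]

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    power = 0
    while largest / (10 ** power) >= 1:
        power += 1
    return [v / (10 ** power) for v in values]

prices = [8000, 10000, 12000]
print(min_max_scale(prices))
print(decimal_scale(prices))
```

Keep in mind that min-max scaling is sensitive to outliers: a single mispriced car stretches the range and squeezes every other price toward one end of it.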

10. Discretize data

Sometimes you can be more effective in your predictions if you turn numerical values into categorical values. This can be achieved, for example, by dividing the entire range of values into a number of groups.

If you track customer age figures, there isn’t a big difference between the age of 13 and 14 or 26 and 27. So these can be converted into relevant age groups. By making the values categorical, you simplify the work for an algorithm and essentially make prediction more relevant.
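A simple way to sketch this binning is with sorted bin edges and a lookup. The edges and labels below are assumptions chosen for illustration; pick ones that make sense for your customers.

```python
from bisect import bisect_right

# Assumed bin edges and labels; each edge is the lower bound of the next group.
AGE_EDGES = (18, 25, 35, 50, 65)
AGE_LABELS = ("under 18", "18-24", "25-34", "35-49", "50-64", "65+")

def age_group(age):
    """Map a numeric age to a categorical age group."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

ages = [13, 14, 26, 27, 65]
print([age_group(a) for a in ages])
```

With these bins, 13 and 14 land in the same group, as do 26 and 27, which is exactly the simplification described above.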

Public datasets

Your private datasets capture the specifics of your unique business and potentially have all relevant attributes that you might need for predictions. But when can you use public datasets?

Public datasets come from organizations and businesses that are open enough to share. The sets usually contain information about general processes in a wide range of life areas like healthcare records, historical weather records, transportation measurements, text and translation collections, records of hardware use, etc. Though these won’t help capture data dependencies in your own business, they can yield great insight into your industry and its niche, and, sometimes, your customer segments.

To learn more about open data sources, consider checking our article about the best public datasets and resources that store this data.

Another use case for public datasets comes from startups and businesses that use machine learning techniques to ship ML-based products to their customers. If you recommend city attractions and restaurants based on user-generated content, you don’t have to label thousands of pictures to train an image recognition algorithm that will sort through photos sent by users. There’s an Open Images dataset from Google. Similar datasets exist for speech and text recognition. You can also find a public datasets compilation on GitHub. Some of the public datasets are commercial and will cost you money.

So, even if you haven’t been collecting data for years, go ahead and search. There may be sets that you can use right away.

Final word: you still need a data scientist

The dataset preparation measures described here are basic and straightforward. So, you still must find data scientists and data engineers if you need to automate data collection mechanisms, set the infrastructure, and scale for complex machine learning tasks.

But the point is, deep domain and problem understanding will help you structure the values in your data in a relevant way. If you are only at the data collection stage, it may be reasonable to reconsider existing approaches to sourcing and formatting your records.

Oleksandr is a content strategist and editor. He leads (when possible) the team of independent-thinking writers and tech journalists at AltexSoft. With over 10 years of writing and editing tech-related pieces and scripts, he currently focuses on travel tech, data science, and AI. Outside of work, Oleksandr enjoys escapism in video games and game development.

