Machine Learning and Tabular Data

Using Neural Networks to work on structured data can be difficult, but why?

Introduction

Machine learning can be quite simple and quite complicated at the same time. Neural networks are a fundamental part of machine learning. They are a series of algorithms that work together to identify underlying patterns in data. They loosely mimic the way the neurons of the human brain work, which is where the name comes from.

Data comes in various forms: structured and unstructured. Structured data can be found in spreadsheets and similar sources, while unstructured data exists in log files, images, audio and so on. Tabular data is a form of structured data. It is organized into rows and columns, where each row contains information about one thing and each column holds one attribute of it.

The Problems

Using neural networks on tabular data has not always been ideal. On tabular data, models built with neural networks typically perform worse than traditional machine learning models. Tabular data typically does not have the highly non-linear relationships found in image recognition and NLP datasets, and there often isn't enough information in a tabular dataset for a neural network to capitalize on and improve its performance.

Data quality is another major concern with tabular data. There are often outliers and missing values. It is also difficult to find spatial correlations between the variables in a tabular dataset, which means that methods built around spatial structure, like Convolutional Neural Networks, are a poor fit for it. Another important problem is the conversion of categorical attributes. This is usually done with one-hot encoding, which turns one column into as many columns as there are categories and so makes the dimensionality problem worse. Data augmentation is a very important part of machine learning because it helps models become more accurate, but it is very challenging to apply to tabular data. All of these combine to show the complexity of using neural networks with tabular data.
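The dimensionality cost of one-hot encoding can be seen in a small sketch. This is only an illustration in plain Python: the one_hot helper and the toy city column are made up for the example, and real projects would normally use a library encoder instead.

```python
# Hypothetical toy column; the values are invented for illustration.
cities = ["Lagos", "Abuja", "Lagos", "Kano", "Ibadan", "Abuja"]


def one_hot(values):
    """Return (categories, rows): each row is a 0/1 list with one slot per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the slot that matches this value
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(cities)
print(categories)       # ['Abuja', 'Ibadan', 'Kano', 'Lagos']
print(encoded[0])       # [0, 0, 0, 1]  (the first row is "Lagos")
print(len(encoded[0]))  # 4: one original column became four columns
```

A column with thousands of distinct values would become thousands of mostly-zero columns in the same way, which is the dimensionality problem described above.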

Models that perform very well on tabular data, such as gradient boosted trees and random forests, are good at mapping “shallow” non-linear relationships, and they do that mapping in a simple and efficient way. So neural networks are not inherently bad for tabular data. Rather, the amount of data a neural network needs to perform well is not typically found in tabular datasets, and that explains the underperformance.

The time and resources needed to tune neural networks and deep learning models for tabular data are also hard to justify, given how well gradient boosting algorithms work on the same type of data.

What ML algorithms work instead?

As alluded to earlier, gradient boosting algorithms have been shown to work very well on tabular data problems. The best bets for accurate modelling of these problems are LightGBM, XGBoost and CatBoost. These three can be considered the holy grail of tabular data and should be the first port of call for tabular data problems.
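To make the idea behind gradient boosting more concrete, here is a minimal pure-Python sketch for a single numeric feature and squared error. It is not how LightGBM, XGBoost or CatBoost are implemented; those libraries use full trees and many refinements. The function names, the toy data, the number of rounds and the learning rate are all assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on xs that best fits the residuals."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, x):
    t, left_value, right_value = stump
    return left_value if x <= t else right_value


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data with a step-like ("shallow" non-linear) shape.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.0, 6.2]

model = fit_boosting(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each round fits a very simple model to whatever error is left over, which is why an ensemble of shallow trees can capture this kind of step-shaped relationship without much tuning.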

If there is still a need for a deep learning model on tabular data, there is TabNet. TabNet is a deep neural network designed for working with structured, tabular data. Its authors report that it outperformed the previously mentioned decision tree-based models on multiple benchmark datasets, and it can be used in practice. A simple guide for implementation in solving a problem can be found here.

Understanding what your problem needs and knowing what to prioritize will help you choose the right machine learning method. Hopefully, this article helps you understand and explain the options available to you. Thank you.

Some of the content used to gain an understanding of this issue includes: