Data imbalance is an important concept in artificial intelligence, big data, smart data and automation. It means that some groups or categories occur much more frequently in a data set than others. For example, when analysing customer satisfaction, satisfied customers can significantly outnumber dissatisfied ones. As a result, AI systems and other data-driven models may draw incorrect conclusions or overlook the under-represented groups.
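An illustrative example: imagine you want to sort emails automatically. Out of 1,000 emails, 950 are labelled as "normal" and only 50 as "spam". A system trained on this data sees very few spam examples and therefore tends to categorise too many spam emails as normal. This is because the rare cases in the data set (here: spam) carry too little weight during training. The problem is also easy to miss when the system is evaluated: a system that labelled every email as "normal" would still be right for 950 of the 1,000 emails, or 95%, without catching a single spam email.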
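The email example can be reproduced in a few lines of Python. This is only a minimal sketch: the label lists are made up to match the 950/50 split described above, and the "system" is a deliberately naive baseline that always answers "normal", not a real spam filter.

```python
# Hypothetical data matching the example: 950 "normal" and 50 "spam" emails.
labels = ["normal"] * 950 + ["spam"] * 50

# A naive baseline that ignores the email content and always answers "normal".
predictions = ["normal"] * len(labels)

# Accuracy: share of all emails that were labelled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Spam recall: share of the actual spam emails that were recognised as spam.
spam_total = sum(y == "spam" for y in labels)
spam_caught = sum(p == "spam" and y == "spam" for p, y in zip(predictions, labels))
spam_recall = spam_caught / spam_total

print(f"accuracy:    {accuracy:.0%}")     # 95%
print(f"spam recall: {spam_recall:.0%}")  # 0%
```

The overall accuracy looks good, yet not one spam email is found. With imbalanced data it therefore helps to look at a figure for the rare category on its own, rather than at overall accuracy alone.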
Data imbalance is particularly important when developing automation and AI solutions. A model generally works more reliably and fairly when the rare groups are adequately represented in the data. It is therefore important to check for data imbalance when collecting and analysing data and to compensate for it, for example by collecting more examples of the rare groups or by giving them more weight.
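One common way to compensate is to give each category a weight that is inversely proportional to how often it occurs. The sketch below shows this idea on the email example; the function name `balanced_class_weights` is hypothetical, and the formula (number of examples divided by number of categories times the category count) is one widely used heuristic, not the only option.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per category, inversely proportional to its frequency.

    Hypothetical helper for illustration:
    weight = n_samples / (n_classes * count_of_that_class)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Same made-up data as in the email example: 950 "normal", 50 "spam".
labels = ["normal"] * 950 + ["spam"] * 50
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: weight {weight:.3f}")
```

With these weights a single spam email counts as much as 19 normal emails, so both categories contribute the same total weight. How such weights are passed to a model depends on the tool being used.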