Distribution

A distribution describes how data values are spread across a range, showing which values occur most frequently, how concentrated or spread out the data is, and whether it skews in any direction.

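Distribution matters in machine learning because many algorithms make assumptions about the shape of their training data. Linear regression, for example, assumes that its errors are roughly normally distributed, and Gaussian naive Bayes assumes that each feature is normally distributed within a class.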

Common types of distribution include the following.

  • Normal (Gaussian): Most values cluster around the average, with extreme values becoming increasingly rare.
  • Skewed: Values are unevenly spread, with most falling on one side and a few outliers pulling in the other direction.
  • Uniform: All values appear with roughly the same frequency.
  • Binomial: Counts how many times one of two possible outcomes, such as yes or no, pass or fail, occurs across a fixed number of independent trials.
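
The shapes listed above can be compared by drawing samples and summarizing them. The following is a minimal sketch using only Python's standard library. The sample size, the seed, the parameters, the choice of an exponential distribution as the skewed example, and the `skewness` helper are all assumptions made for illustration.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (near 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed, chosen arbitrarily so runs are repeatable
n = 10_000              # illustrative sample size

samples = {
    # Normal: values cluster around the mean (0 here), extremes are rare.
    "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
    # Skewed: an exponential distribution, used here as one right-skewed example.
    "skewed": [rng.expovariate(1.0) for _ in range(n)],
    # Uniform: every value between 0 and 1 is equally likely.
    "uniform": [rng.uniform(0.0, 1.0) for _ in range(n)],
    # Binomial: number of successes in 10 yes/no trials, each with p = 0.5.
    "binomial": [sum(rng.random() < 0.5 for _ in range(10)) for _ in range(n)],
}

for name, values in samples.items():
    print(
        f"{name:9s} mean={statistics.fmean(values):6.3f} "
        f"median={statistics.median(values):6.3f} "
        f"skew={skewness(values):6.3f}"
    )
```

In a run of this sketch, the skewed sample should show a mean noticeably above its median and a clearly positive skewness, while the normal and uniform samples should have skewness close to zero.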

When data doesn't follow the distribution an algorithm expects, data scientists either transform the data to better fit those assumptions or choose algorithms that make fewer distributional assumptions, such as decision trees and neural networks.
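
One common transformation is taking the logarithm of a right-skewed feature. The sketch below assumes a hypothetical, strictly positive feature drawn from a log-normal distribution; the variable names, parameters, and the `skewness` helper are illustrative and not part of any particular library. A log transform is only one option and only applies to positive values.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (near 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed, chosen arbitrarily so runs are repeatable

# Hypothetical right-skewed feature: log-normal values (all strictly positive).
raw = [rng.lognormvariate(10.0, 1.0) for _ in range(10_000)]

# Log transform: compresses the long right tail. Requires positive inputs.
transformed = [math.log(v) for v in raw]

print(f"skewness before: {skewness(raw):.3f}")
print(f"skewness after:  {skewness(transformed):.3f}")
```

Because the logarithm of a log-normal variable is normally distributed, the transformed skewness in this sketch should land close to zero. Real data will rarely be this clean, so it is worth checking the result rather than assuming the transform worked.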
