Data

Having the right data for algorithm training comes down to two aspects: data quantity and data quality.

Data quantity

Data scientists commonly argue that their models perform poorly because they were not given enough data.

In some cases, the quantity of data available for a given problem is genuinely constrained.

In most cases, however, more data alone might not help, because quality is also an important factor. For example, a model can learn the characteristics of each class better when every class is represented by enough samples, so adding more samples of a class that is already well covered contributes little. The focus should be on the diversity of the data samples rather than on their sheer number.
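One rough way to look past the raw sample count is to check how the samples are spread across classes. The following is a minimal sketch, assuming a classification task with one label per sample; the function names and the example labels are hypothetical and only serve as illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of samples per class as a dict: label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels, for illustration only: 100 samples, three classes.
labels = ["cat"] * 90 + ["dog"] * 8 + ["bird"] * 2

shares = class_distribution(labels)
ratio = imbalance_ratio(labels)

for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
print(f"imbalance ratio: {ratio:.1f}")
```

A high imbalance ratio suggests that collecting more samples of the dominant class is unlikely to help much, while the under-represented classes are where extra data would add diversity. Class balance is only one proxy for diversity; it says nothing about variation within a class.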

Data quality

The more comprehensive and the higher in quality the data is, the better the ML model or application will work. The steps before training are therefore important: cleaning, augmenting, and scaling the data. Important dimensions of data quality to consider include consistency, correctness, and completeness.

  • Data consistency refers to the correspondence and coherence of the data samples throughout the dataset.

  • Data correctness is the degree of accuracy, that is, the degree to which you can rely on the data to truly reflect real events. It depends on how the data was collected.

  • Data completeness is reflected by the sparsity of the data for each characteristic: the fewer values are missing for a characteristic, the more complete the data.
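The three dimensions above can be approximated with simple checks before training. The following is a minimal sketch, assuming the dataset is a list of records (dictionaries) in which `None` marks a missing value; the helper names, the example records, and the plausible age range of 0 to 120 are all hypothetical choices made for illustration.

```python
def completeness(records, fields):
    """Completeness: fraction of non-missing (not None) values per field."""
    n = len(records)
    return {
        field: sum(r.get(field) is not None for r in records) / n
        for field in fields
    }


def value_types(records, field):
    """Consistency: the set of value types seen for a field.

    More than one type suggests the field is encoded inconsistently.
    """
    return {
        type(r[field]).__name__
        for r in records
        if r.get(field) is not None
    }


def out_of_range(records, field, low, high):
    """Correctness (crude): numeric values outside a plausible range."""
    return [
        r for r in records
        if isinstance(r.get(field), (int, float))
        and not (low <= r[field] <= high)
    ]


# Hypothetical records, for illustration only.
records = [
    {"age": 34, "country": "DE"},
    {"age": "41", "country": "FR"},   # age stored as text: inconsistent
    {"age": None, "country": "FR"},   # missing age: incomplete
    {"age": 230, "country": None},    # implausible age: likely incorrect
]

filled = completeness(records, ["age", "country"])
age_types = value_types(records, "age")
suspicious = out_of_range(records, "age", 0, 120)

print(filled)
print(sorted(age_types))
print(suspicious)
```

Checks like these only flag candidates for review. Whether a flagged value is actually wrong depends on how the data was collected, which the code cannot know.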