Outliers in numerical data#

Detecting and removing outliers is a time-consuming and critical step in any data wrangling pipeline. It requires deep domain knowledge, expertise in descriptive statistics, mastery of the programming language (and its useful libraries), and a lot of caution. We recommend being very careful when performing this operation on a dataset.

A z-score measures how far a data point lies from the mean of the dataset, expressed in units of the standard deviation: subtract the mean from the value and divide by the standard deviation. The z-score can be used to numerically detect outliers in a set of data. A common rule of thumb treats any data point with a z-score greater than +3 or less than -3 as an outlier.
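
As a minimal sketch of this definition, the z-score of each value can be computed with the standard library alone. The helper name `z_scores` and the sample values are illustrative assumptions. The population standard deviation is used here, which is also the default (`ddof=0`) of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Illustrative sample with mean 5 and standard deviation 2
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(sample)
print(scores)
```

A value equal to the mean gets a z-score of 0, and the sign shows on which side of the mean a point lies.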

Levenshtein distance is an advanced concept. We can think of it as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another. When two strings are identical, the distance between them is 0, and the more the strings differ, the higher the number. We can choose a distance threshold below which two strings are treated as the same.
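
One common way to compute this distance is a dynamic-programming table filled row by row. The function below is an illustrative sketch under that assumption, not the implementation used by the `Levenshtein` package. The name `levenshtein` is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("Sea Princess", "Sea Pincess"))
```

Only two rows of the table are kept at any time, so memory use grows with the length of the second string rather than with the product of both lengths.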

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd
from math import cos, pi
from scipy import stats
from Levenshtein import distance

# Plotting
import matplotlib.pyplot as plt

%matplotlib inline

Detecting outliers in numerical data#

ys = [cos(i * (pi / 4)) for i in range(50)]
plt.plot(ys)

[Figure: line plot of ys, the cosine series sampled at 50 points]

# Introducing outliers
ys[4] = ys[4] + 5.0
ys[20] = ys[20] + 8.0
plt.plot(ys)

[Figure: line plot of ys after the values at indices 4 and 20 were raised]

# Boxplotting for visual cues on outliers
plt.boxplot(ys)

[Figure: box plot of ys]
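
By default, matplotlib draws a point as a flier when it lies beyond whiskers that reach at most 1.5 times the interquartile range (IQR) past the first and third quartiles. The sketch below applies that rule by hand to the same series. It assumes quartiles computed with linear interpolation (`method="inclusive"` in the standard library), so small differences from matplotlib's own quartile computation are possible.

```python
from math import cos, pi
from statistics import quantiles

# Rebuild the series used above, including the two modified values
ys = [cos(i * (pi / 4)) for i in range(50)]
ys[4] = ys[4] + 5.0
ys[20] = ys[20] + 8.0

# First and third quartiles; the middle cut point is the median
q1, _median, q3 = quantiles(ys, n=4, method="inclusive")
iqr = q3 - q1

# Points outside these fences are the ones a box plot marks as fliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = [i for i, y in enumerate(ys) if y < lower_fence or y > upper_fence]
print(flagged)
```

On this series the rule flags both modified positions, indices 4 and 20.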

The Z-Score value#

df_original = pd.DataFrame(ys)
cos_arr_z_score = stats.zscore(ys)
# Keep rows whose absolute z-score is below 3, so extreme values on either side are dropped
cos_arr_without_outliers = df_original[abs(cos_arr_z_score) < 3]

print(cos_arr_without_outliers.shape)
print(df_original.shape)

(49, 1)
(50, 1)
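
The filter removes only one row even though two values were modified. A quick look at the z-scores of the two modified positions suggests why. This sketch recomputes them with the standard library, using the population standard deviation to match the `scipy.stats.zscore` default, rather than calling scipy.

```python
from math import cos, pi
from statistics import mean, pstdev

# Rebuild the series used above, including the two modified values
ys = [cos(i * (pi / 4)) for i in range(50)]
ys[4] = ys[4] + 5.0
ys[20] = ys[20] + 8.0

mu = mean(ys)
sigma = pstdev(ys)  # population standard deviation (ddof=0)

# z-scores of the two modified positions
z_at_4 = (ys[4] - mu) / sigma
z_at_20 = (ys[20] - mu) / sigma
print(round(z_at_4, 2), round(z_at_20, 2))
```

The z-score at index 4 comes out at roughly 2.86, just under the threshold of 3, while index 20 is at roughly 5.17, so only index 20 is removed. The outliers themselves inflate the mean and standard deviation, which is one reason to treat a fixed threshold with caution.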

Fuzzy string matching#

# A problem that may look like an outlier, but is not.
# Creating the load data of a ship on three different dates:
ship_data = {
    "Sea Princess": {"date": "12/08/20", "load": 40000},
    "Sea Pincess": {"date": "10/06/20", "load": 30000},
    "Sea Princes": {"date": "12/04/20", "load": 30000},
}

# Passing two strings to the distance function to calculate
# distance between them
name_of_ship = "Sea Princess"
for k, v in ship_data.items():
    print("{} {} {}".format(k, name_of_ship, distance(name_of_ship, k)))

Sea Princess Sea Princess 0
Sea Pincess Sea Princess 1
Sea Princes Sea Princess 1
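
The distances can then drive a simple threshold rule: records whose name is within a chosen distance of the reference name are treated as the same ship. The sketch below assumes a threshold of 1 (`max_distance`, an illustrative choice) and sums the loads of the matching records. If the `Levenshtein` package is not installed, it falls back to a small stand-in `distance` function written for this example.

```python
try:
    from Levenshtein import distance
except ImportError:
    def distance(a, b):
        """Stand-in edit distance used only when the package is unavailable."""
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, start=1):
            current = [i]
            for j, char_b in enumerate(b, start=1):
                cost = 0 if char_a == char_b else 1
                current.append(
                    min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
                )
            previous = current
        return previous[-1]

ship_data = {
    "Sea Princess": {"date": "12/08/20", "load": 40000},
    "Sea Pincess": {"date": "10/06/20", "load": 30000},
    "Sea Princes": {"date": "12/04/20", "load": 30000},
}

name_of_ship = "Sea Princess"
max_distance = 1  # assumed threshold; tune it for the data at hand

matched_names = []
total_load = 0
for name, record in ship_data.items():
    # Names within the threshold are treated as spellings of the same ship
    if distance(name_of_ship, name) <= max_distance:
        matched_names.append(name)
        total_load += record["load"]

print(matched_names, total_load)
```

A threshold that is too generous can merge names that really are different ships, so the value is worth checking against known examples before relying on it.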
