Preprocessing#

Load the data and perform some initial exploration to get a basic understanding of it, in particular how the various features are distributed.

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd

# Warnings
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

Set paths#

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset#

# load data
dataset = pd.read_csv(f"{data_path}/Absenteeism_at_work.csv", sep=";")
dataset.head()

	ID	Reason for absence	Month of absence	Day of the week	Seasons	Transportation expense	Distance from Residence to Work	Service time	Age	Work load Average/day	...	Disciplinary failure	Education	Son	Social drinker	Social smoker	Pet	Weight	Height	Body mass index	Absenteeism time in hours
0	11	26	7	3	1	289	36	13	33	239.554	...	0	1	2	1	0	1	90	172	30	4
1	36	0	7	3	1	118	13	18	50	239.554	...	1	1	1	1	0	0	98	178	31	0
2	3	23	7	4	1	179	51	18	38	239.554	...	0	1	0	1	0	0	89	170	31	2
3	7	7	7	5	1	279	5	14	39	239.554	...	0	1	2	1	1	0	68	168	24	4
4	11	23	7	5	1	289	36	13	33	239.554	...	0	1	2	1	0	1	90	172	30	2

5 rows × 21 columns

Exploring dataset#

# Printing dimensionality of the data, columns, types and missing values
print(f"Data dimension: {dataset.shape}")
for col in dataset.columns:
    print(
        f"Column: {col:35} | "
        f"type: {str(dataset[col].dtype):7} | "
        f"missing values: {dataset[col].isna().sum():3d}"
    )

Data dimension: (740, 21)
Column: ID                                  | type: int64   | missing values:   0
Column: Reason for absence                  | type: int64   | missing values:   0
Column: Month of absence                    | type: int64   | missing values:   0
Column: Day of the week                     | type: int64   | missing values:   0
Column: Seasons                             | type: int64   | missing values:   0
Column: Transportation expense              | type: int64   | missing values:   0
Column: Distance from Residence to Work     | type: int64   | missing values:   0
Column: Service time                        | type: int64   | missing values:   0
Column: Age                                 | type: int64   | missing values:   0
Column: Work load Average/day               | type: float64 | missing values:   0
Column: Hit target                          | type: int64   | missing values:   0
Column: Disciplinary failure                | type: int64   | missing values:   0
Column: Education                           | type: int64   | missing values:   0
Column: Son                                 | type: int64   | missing values:   0
Column: Social drinker                      | type: int64   | missing values:   0
Column: Social smoker                       | type: int64   | missing values:   0
Column: Pet                                 | type: int64   | missing values:   0
Column: Weight                              | type: int64   | missing values:   0
Column: Height                              | type: int64   | missing values:   0
Column: Body mass index                     | type: int64   | missing values:   0
Column: Absenteeism time in hours           | type: int64   | missing values:   0

# Computing statistics on numerical features
dataset.describe().T

	count	mean	std	min	25%	50%	75%	max
ID	740.0	18.017568	11.021247	1.000	9.000	18.000	28.000	36.000
Reason for absence	740.0	19.216216	8.433406	0.000	13.000	23.000	26.000	28.000
Month of absence	740.0	6.324324	3.436287	0.000	3.000	6.000	9.000	12.000
Day of the week	740.0	3.914865	1.421675	2.000	3.000	4.000	5.000	6.000
Seasons	740.0	2.544595	1.111831	1.000	2.000	3.000	4.000	4.000
Transportation expense	740.0	221.329730	66.952223	118.000	179.000	225.000	260.000	388.000
Distance from Residence to Work	740.0	29.631081	14.836788	5.000	16.000	26.000	50.000	52.000
Service time	740.0	12.554054	4.384873	1.000	9.000	13.000	16.000	29.000
Age	740.0	36.450000	6.478772	27.000	31.000	37.000	40.000	58.000
Work load Average/day	740.0	271.490235	39.058116	205.917	244.387	264.249	294.217	378.884
Hit target	740.0	94.587838	3.779313	81.000	93.000	95.000	97.000	100.000
Disciplinary failure	740.0	0.054054	0.226277	0.000	0.000	0.000	0.000	1.000
Education	740.0	1.291892	0.673238	1.000	1.000	1.000	1.000	4.000
Son	740.0	1.018919	1.098489	0.000	0.000	1.000	2.000	4.000
Social drinker	740.0	0.567568	0.495749	0.000	0.000	1.000	1.000	1.000
Social smoker	740.0	0.072973	0.260268	0.000	0.000	0.000	0.000	1.000
Pet	740.0	0.745946	1.318258	0.000	0.000	0.000	1.000	8.000
Weight	740.0	79.035135	12.883211	56.000	69.000	83.000	89.000	108.000
Height	740.0	172.114865	6.034995	163.000	169.000	170.000	172.000	196.000
Body mass index	740.0	26.677027	4.285452	19.000	24.000	25.000	31.000	38.000
Absenteeism time in hours	740.0	6.924324	13.330998	0.000	2.000	3.000	8.000	120.000
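Several of the columns summarized above (for example ID, Reason for absence, Seasons and Education) hold integer codes rather than measurements, so their mean and standard deviation carry little meaning. The following is a minimal sketch of one way to treat the two kinds of columns separately. It runs on a small made-up frame rather than the real CSV, and the `coded_columns` split is an assumption made for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for the real dataset (values are illustrative only)
sample = pd.DataFrame(
    {
        "Seasons": [1, 1, 4, 2, 3, 1],
        "Education": [1, 1, 2, 1, 3, 1],
        "Age": [33, 50, 38, 39, 33, 28],
        "Absenteeism time in hours": [4, 0, 2, 4, 2, 8],
    }
)

# Hypothetical split: columns that hold category codes rather than measurements
coded_columns = ["Seasons", "Education"]
measured_columns = [c for c in sample.columns if c not in coded_columns]

# Summary statistics only where they are meaningful
numeric_summary = sample[measured_columns].describe().T

# Frequency of each code for the category-like columns
code_counts = {
    col: sample[col].value_counts().sort_index() for col in coded_columns
}

print(numeric_summary)
for col, counts in code_counts.items():
    print(f"\n{col}:\n{counts}")
```

In the notebook, `sample` would be replaced by `dataset` and `coded_columns` extended to whichever columns are treated as categorical.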

Individual identification (ID)
Reason for absence (ICD). Absences attested by the International Code of Diseases (ICD), stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services

And 7 categories without an ICD code: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
Month of absence
Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
Seasons (summer (1), autumn (2), winter (3), spring (4))
Transportation expense
Distance from Residence to Work (kilometers)
Service time
Age
Work load Average/day
Hit target
Disciplinary failure (yes=1; no=0)
Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
Son (number of children)
Social drinker (yes=1; no=0)
Social smoker (yes=1; no=0)
Pet (number of pets)
Weight
Height
Body mass index
Absenteeism time in hours (target)
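The summary statistics show a minimum of 0 for Reason for absence, a value that does not appear among the codes listed above. A quick check against the documented code sets can surface such values before any encoding is applied. The sketch below is one possible way to do this. The helper `undocumented_values` and the made-up `sample` rows are hypothetical, and the code sets are copied from the attribute list.

```python
import pandas as pd

# Documented code sets, taken from the attribute list above
documented_codes = {
    "Reason for absence": set(range(1, 29)),
    "Day of the week": {2, 3, 4, 5, 6},
    "Seasons": {1, 2, 3, 4},
    "Education": {1, 2, 3, 4},
}


def undocumented_values(frame: pd.DataFrame, codes: dict) -> dict:
    """Return, per column, the sorted values not covered by the documented codes."""
    found = {}
    for column, allowed in codes.items():
        extra = set(frame[column].unique()) - allowed
        if extra:
            found[column] = sorted(int(v) for v in extra)
    return found


# Made-up rows for illustration; use `dataset` instead of `sample` in the notebook
sample = pd.DataFrame(
    {
        "Reason for absence": [26, 0, 23, 7],
        "Day of the week": [3, 3, 4, 5],
        "Seasons": [1, 1, 1, 1],
        "Education": [1, 1, 1, 1],
    }
)
print(undocumented_values(sample, documented_codes))
```

Any column reported here needs an explicit decision, such as the "Unknown" label that the month encoding in the next section assigns to 0.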

Preprocessing#

# Encoding dictionaries
month_encoding = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December",
    0: "Unknown",
}
dow_encoding = {
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
}
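# NOTE: this mapping differs from the attribute description above, which lists
# summer (1), autumn (2), winter (3), spring (4); verify against the data source.
season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}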
education_encoding = {
    1: "high_school",
    2: "graduate",
    3: "postgraduate",
    4: "master_phd",
}
yes_no_encoding = {0: "No", 1: "Yes"}

# Creating a copy of the original data
preprocessed_data = dataset.copy()

# Transforming numerical variables to categorical
preprocessed_data["Month of absence"] = preprocessed_data[
    "Month of absence"
].apply(lambda x: month_encoding[x])
preprocessed_data["Day of the week"] = preprocessed_data[
    "Day of the week"
].apply(lambda x: dow_encoding[x])
preprocessed_data["Seasons"] = preprocessed_data["Seasons"].apply(
    lambda x: season_encoding[x]
)
preprocessed_data["Education"] = preprocessed_data["Education"].apply(
    lambda x: education_encoding[x]
)
preprocessed_data["Disciplinary failure"] = preprocessed_data[
    "Disciplinary failure"
].apply(lambda x: yes_no_encoding[x])
preprocessed_data["Social drinker"] = preprocessed_data[
    "Social drinker"
].apply(lambda x: yes_no_encoding[x])
preprocessed_data["Social smoker"] = preprocessed_data["Social smoker"].apply(
    lambda x: yes_no_encoding[x]
)

preprocessed_data.head().T

	0	1	2	3	4
ID	11	36	3	7	11
Reason for absence	26	0	23	7	23
Month of absence	July	July	July	July	July
Day of the week	Tuesday	Tuesday	Wednesday	Thursday	Thursday
Seasons	Spring	Spring	Spring	Spring	Spring
Transportation expense	289	118	179	279	289
Distance from Residence to Work	36	13	51	5	36
Service time	13	18	18	14	13
Age	33	50	38	39	33
Work load Average/day	239.554	239.554	239.554	239.554	239.554
Hit target	97	97	97	97	97
Disciplinary failure	No	Yes	No	No	No
Education	high_school	high_school	high_school	high_school	high_school
Son	2	1	0	2	2
Social drinker	Yes	Yes	Yes	Yes	Yes
Social smoker	No	No	No	Yes	No
Pet	1	0	0	0	1
Weight	90	98	89	68	90
Height	172	178	170	168	172
Body mass index	30	31	31	24	30
Absenteeism time in hours	4	0	2	4	2
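The `apply` calls above look each value up in its encoding dictionary and raise a `KeyError` when a code has no label. An alternative is `Series.map`, which accepts the dictionary directly but silently yields `NaN` for unmapped codes, so it is worth pairing with an explicit check. The following is a minimal sketch under that assumption. The helper name `encode_column` is hypothetical and not part of the original notebook.

```python
import pandas as pd

yes_no_encoding = {0: "No", 1: "Yes"}


def encode_column(series: pd.Series, encoding: dict) -> pd.Series:
    """Map codes to labels, raising if any non-missing value has no label."""
    encoded = series.map(encoding)
    # Values that were present in the input but came back as NaN are unmapped
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(
            f"{series.name}: no label for values {sorted(unmapped.tolist())}"
        )
    return encoded


# Made-up rows for illustration
sample = pd.DataFrame({"Social drinker": [1, 1, 0, 1]})
sample["Social drinker"] = encode_column(sample["Social drinker"], yes_no_encoding)
print(sample["Social drinker"].tolist())
```

The same helper could be called once per column with the matching dictionary, which keeps the error message tied to the column that caused it.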

preprocessed_data.to_csv(
    f"{data_path}/preprocessed_absenteism.csv", index=False
)
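Note that `to_csv` writes with its default comma separator, whereas the source file was read with `sep=";"`, so the preprocessed file can be read back without a `sep` argument. A round-trip check is a cheap way to confirm that the text labels survive saving and loading. The sketch below uses an in-memory buffer and made-up rows instead of the real file, purely for illustration.

```python
import io

import pandas as pd

# Made-up rows resembling the preprocessed data
sample = pd.DataFrame(
    {
        "Month of absence": ["July", "Unknown"],
        "Social drinker": ["Yes", "No"],
        "Age": [33, 50],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Read back with the default comma separator
restored = pd.read_csv(buffer)

# Compare values only; exact dtypes may vary between pandas versions
pd.testing.assert_frame_equal(sample, restored, check_dtype=False)
```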
