Preprocessing#

In this section we load the data and perform some initial exploration to get a basic understanding of its structure and of how the various features are distributed.

Importing libraries and packages#

# Mathematical operations and data manipulation
import pandas as pd

# Warnings
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

Set paths#

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset#

# Load data (the file is semicolon-separated)
dataset = pd.read_csv(f"{data_path}/Absenteeism_at_work.csv", sep=";")
dataset.head()
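|   | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | ... | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |

5 rows × 21 columns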
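The `sep=";"` argument matters because the file uses semicolons rather than commas as delimiters. The following minimal sketch illustrates the difference on a small in-memory sample; the sample contains only four of the columns and is a stand-in for the real file, not a copy of it.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file (three rows, four columns only)
sample_csv = io.StringIO(
    "ID;Reason for absence;Month of absence;Absenteeism time in hours\n"
    "11;26;7;4\n"
    "36;0;7;0\n"
    "3;23;7;2\n"
)

# With the default comma separator each line is parsed as a single column
wrong = pd.read_csv(sample_csv)

# Rewind the buffer and parse again with the correct separator
sample_csv.seek(0)
right = pd.read_csv(sample_csv, sep=";")

print(f"default separator: {wrong.shape}")
print(f'sep=";":           {right.shape}')
```

A frame with a single column after loading is usually a sign that the wrong separator was used.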

Exploring dataset#

# Printing dimensionality of the data, columns, types and missing values
print(f"Data dimension: {dataset.shape}")
for col in dataset.columns:
    print(
        f"Column: {col:35} | "
        f"type: {str(dataset[col].dtype):7} | "
        f"missing values: {dataset[col].isna().sum():3d}"
    )
Data dimension: (740, 21)
Column: ID                                  | type: int64   | missing values:   0
Column: Reason for absence                  | type: int64   | missing values:   0
Column: Month of absence                    | type: int64   | missing values:   0
Column: Day of the week                     | type: int64   | missing values:   0
Column: Seasons                             | type: int64   | missing values:   0
Column: Transportation expense              | type: int64   | missing values:   0
Column: Distance from Residence to Work     | type: int64   | missing values:   0
Column: Service time                        | type: int64   | missing values:   0
Column: Age                                 | type: int64   | missing values:   0
Column: Work load Average/day               | type: float64 | missing values:   0
Column: Hit target                          | type: int64   | missing values:   0
Column: Disciplinary failure                | type: int64   | missing values:   0
Column: Education                           | type: int64   | missing values:   0
Column: Son                                 | type: int64   | missing values:   0
Column: Social drinker                      | type: int64   | missing values:   0
Column: Social smoker                       | type: int64   | missing values:   0
Column: Pet                                 | type: int64   | missing values:   0
Column: Weight                              | type: int64   | missing values:   0
Column: Height                              | type: int64   | missing values:   0
Column: Body mass index                     | type: int64   | missing values:   0
Column: Absenteeism time in hours           | type: int64   | missing values:   0
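The same per-column information can also be collected into a single summary frame, which is easier to sort and filter than printed text. This is a sketch on a hypothetical toy frame (`toy` stands in for `dataset`, and one missing value is added on purpose); the extra `unique` column is an addition that helps spot numerically encoded categorical features.

```python
import pandas as pd

# Hypothetical toy frame standing in for `dataset`
toy = pd.DataFrame(
    {
        "ID": [11, 36, 3],
        "Work load Average/day": [239.554, 239.554, None],
        "Absenteeism time in hours": [4, 0, 2],
    }
)

# One row per column: type, number of missing values, number of distinct values
summary = pd.DataFrame(
    {
        "dtype": toy.dtypes.astype(str),
        "missing": toy.isna().sum(),
        "unique": toy.nunique(),
    }
)
print(summary)
```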
# Computing statistics on numerical features
dataset.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 740.0 | 18.017568 | 11.021247 | 1.000 | 9.000 | 18.000 | 28.000 | 36.000 |
| Reason for absence | 740.0 | 19.216216 | 8.433406 | 0.000 | 13.000 | 23.000 | 26.000 | 28.000 |
| Month of absence | 740.0 | 6.324324 | 3.436287 | 0.000 | 3.000 | 6.000 | 9.000 | 12.000 |
| Day of the week | 740.0 | 3.914865 | 1.421675 | 2.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Seasons | 740.0 | 2.544595 | 1.111831 | 1.000 | 2.000 | 3.000 | 4.000 | 4.000 |
| Transportation expense | 740.0 | 221.329730 | 66.952223 | 118.000 | 179.000 | 225.000 | 260.000 | 388.000 |
| Distance from Residence to Work | 740.0 | 29.631081 | 14.836788 | 5.000 | 16.000 | 26.000 | 50.000 | 52.000 |
| Service time | 740.0 | 12.554054 | 4.384873 | 1.000 | 9.000 | 13.000 | 16.000 | 29.000 |
| Age | 740.0 | 36.450000 | 6.478772 | 27.000 | 31.000 | 37.000 | 40.000 | 58.000 |
| Work load Average/day | 740.0 | 271.490235 | 39.058116 | 205.917 | 244.387 | 264.249 | 294.217 | 378.884 |
| Hit target | 740.0 | 94.587838 | 3.779313 | 81.000 | 93.000 | 95.000 | 97.000 | 100.000 |
| Disciplinary failure | 740.0 | 0.054054 | 0.226277 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Education | 740.0 | 1.291892 | 0.673238 | 1.000 | 1.000 | 1.000 | 1.000 | 4.000 |
| Son | 740.0 | 1.018919 | 1.098489 | 0.000 | 0.000 | 1.000 | 2.000 | 4.000 |
| Social drinker | 740.0 | 0.567568 | 0.495749 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| Social smoker | 740.0 | 0.072973 | 0.260268 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Pet | 740.0 | 0.745946 | 1.318258 | 0.000 | 0.000 | 0.000 | 1.000 | 8.000 |
| Weight | 740.0 | 79.035135 | 12.883211 | 56.000 | 69.000 | 83.000 | 89.000 | 108.000 |
| Height | 740.0 | 172.114865 | 6.034995 | 163.000 | 169.000 | 170.000 | 172.000 | 196.000 |
| Body mass index | 740.0 | 26.677027 | 4.285452 | 19.000 | 24.000 | 25.000 | 31.000 | 38.000 |
| Absenteeism time in hours | 740.0 | 6.924324 | 13.330998 | 0.000 | 2.000 | 3.000 | 8.000 | 120.000 |
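The statistics suggest that the target is right-skewed: the mean absenteeism time (about 6.9 hours) is more than twice the median (3 hours), and the maximum is 120 hours. One common way to flag such extreme values is Tukey's rule, which marks everything above Q3 + 1.5 × IQR. This rule is not part of the original analysis; the sketch below applies it to a small hypothetical sample (`hours`) rather than to the real column.

```python
import pandas as pd

# Hypothetical sample standing in for dataset["Absenteeism time in hours"]
hours = pd.Series([4, 0, 2, 4, 2, 8, 3, 120], name="Absenteeism time in hours")

# First and third quartiles, then the upper Tukey fence
q1, q3 = hours.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are candidates for closer inspection
outliers = hours[hours > upper_fence]

print(f"mean: {hours.mean():.2f}, median: {hours.median():.2f}")
print(f"upper fence: {upper_fence}")
print(f"flagged values: {outliers.tolist()}")
```

Flagged values are not necessarily errors; long absences may be genuine and informative, so whether to keep them is a modeling decision.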
The dataset has 21 attributes:

  1. Individual identification (ID)

  2. Reason for absence (ICD). Absences attested by the International Classification of Diseases (ICD), stratified into 21 categories (I to XXI) as follows:

    I Certain infectious and parasitic diseases
    II Neoplasms
    III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
    IV Endocrine, nutritional and metabolic diseases
    V Mental and behavioural disorders
    VI Diseases of the nervous system
    VII Diseases of the eye and adnexa
    VIII Diseases of the ear and mastoid process
    IX Diseases of the circulatory system
    X Diseases of the respiratory system
    XI Diseases of the digestive system
    XII Diseases of the skin and subcutaneous tissue
    XIII Diseases of the musculoskeletal system and connective tissue
    XIV Diseases of the genitourinary system
    XV Pregnancy, childbirth and the puerperium
    XVI Certain conditions originating in the perinatal period
    XVII Congenital malformations, deformations and chromosomal abnormalities
    XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
    XIX Injury, poisoning and certain other consequences of external causes
    XX External causes of morbidity and mortality
    XXI Factors influencing health status and contact with health services

    And 7 categories without an ICD code: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

  3. Month of absence

  4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))

  5. Seasons (summer (1), autumn (2), winter (3), spring (4))

  6. Transportation expense

  7. Distance from Residence to Work (kilometers)

  8. Service time

  9. Age

  10. Work load Average/day

  11. Hit target

  12. Disciplinary failure (yes=1; no=0)

  13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))

  14. Son (number of children)

  15. Social drinker (yes=1; no=0)

  16. Social smoker (yes=1; no=0)

  17. Pet (number of pets)

  18. Weight

  19. Height

  20. Body mass index

  21. Absenteeism time in hours (target)
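The split of absence reasons into ICD categories (1 to 21) and non-ICD categories (22 to 28) can be captured in a small helper. This is only a sketch: the function name `reason_group` and the label strings are hypothetical, and the code 0, which occurs in the data but is not described in the list above, is labeled as undocumented rather than guessed at.

```python
def reason_group(code: int) -> str:
    """Classify a 'Reason for absence' code following the attribute list."""
    if 1 <= code <= 21:
        # Categories I to XXI: absences attested by an ICD code
        return "ICD"
    if 22 <= code <= 28:
        # Follow-up, consultations, blood donation, unjustified absence, ...
        return "non-ICD"
    # Anything else (the data contains 0) is not covered by the description
    return "undocumented"


# Reason codes of the first five rows shown earlier
for code in [26, 0, 23, 7, 23]:
    print(code, reason_group(code))
```

With pandas, the helper could be applied to the whole column, for example via `dataset["Reason for absence"].map(reason_group)`.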

Preprocessing#

# Encoding dictionaries
month_encoding = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December",
    0: "Unknown",
}
dow_encoding = {
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
}
# Note: this mapping differs from the attribute description above, which
# lists summer (1), autumn (2), winter (3), spring (4)
season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
education_encoding = {
    1: "high_school",
    2: "graduate",
    3: "postgraduate",
    4: "master_phd",
}
yes_no_encoding = {0: "No", 1: "Yes"}

# Creating a copy of the original data
preprocessed_data = dataset.copy()

# Transforming numerical variables to categorical
preprocessed_data["Month of absence"] = preprocessed_data[
    "Month of absence"
].apply(lambda x: month_encoding[x])
preprocessed_data["Day of the week"] = preprocessed_data[
    "Day of the week"
].apply(lambda x: dow_encoding[x])
preprocessed_data["Seasons"] = preprocessed_data["Seasons"].apply(
    lambda x: season_encoding[x]
)
preprocessed_data["Education"] = preprocessed_data["Education"].apply(
    lambda x: education_encoding[x]
)
preprocessed_data["Disciplinary failure"] = preprocessed_data[
    "Disciplinary failure"
].apply(lambda x: yes_no_encoding[x])
preprocessed_data["Social drinker"] = preprocessed_data[
    "Social drinker"
].apply(lambda x: yes_no_encoding[x])
preprocessed_data["Social smoker"] = preprocessed_data["Social smoker"].apply(
    lambda x: yes_no_encoding[x]
)

preprocessed_data.head().T
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| ID | 11 | 36 | 3 | 7 | 11 |
| Reason for absence | 26 | 0 | 23 | 7 | 23 |
| Month of absence | July | July | July | July | July |
| Day of the week | Tuesday | Tuesday | Wednesday | Thursday | Thursday |
| Seasons | Spring | Spring | Spring | Spring | Spring |
| Transportation expense | 289 | 118 | 179 | 279 | 289 |
| Distance from Residence to Work | 36 | 13 | 51 | 5 | 36 |
| Service time | 13 | 18 | 18 | 14 | 13 |
| Age | 33 | 50 | 38 | 39 | 33 |
| Work load Average/day | 239.554 | 239.554 | 239.554 | 239.554 | 239.554 |
| Hit target | 97 | 97 | 97 | 97 | 97 |
| Disciplinary failure | No | Yes | No | No | No |
| Education | high_school | high_school | high_school | high_school | high_school |
| Son | 2 | 1 | 0 | 2 | 2 |
| Social drinker | Yes | Yes | Yes | Yes | Yes |
| Social smoker | No | No | No | Yes | No |
| Pet | 1 | 0 | 0 | 0 | 1 |
| Weight | 90 | 98 | 89 | 68 | 90 |
| Height | 172 | 178 | 170 | 168 | 172 |
| Body mass index | 30 | 31 | 31 | 24 | 30 |
| Absenteeism time in hours | 4 | 0 | 2 | 4 | 2 |
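An alternative way to apply such encodings is `Series.map`, which accepts a dictionary directly. Unlike the dictionary lookup inside `apply`, `map` silently produces `NaN` for codes that are missing from the dictionary, so it can be worth checking for unmapped codes first. The helper below is a hypothetical sketch of that idea (`encode_column` is not part of the original notebook), shown on the day-of-week codes of the first five rows.

```python
import pandas as pd


def encode_column(series: pd.Series, encoding: dict) -> pd.Series:
    """Map codes to labels, failing loudly on codes missing from `encoding`."""
    unmapped = set(series.unique().tolist()) - set(encoding)
    if unmapped:
        raise ValueError(
            f"Unmapped codes in {series.name!r}: {sorted(unmapped)}"
        )
    return series.map(encoding)


dow_encoding = {
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
}

# Day-of-week codes of the first five rows
days = pd.Series([3, 3, 4, 5, 5], name="Day of the week")
encoded = encode_column(days, dow_encoding)
print(encoded.tolist())
```

Raising an error keeps an unexpected code (for example a weekend day) from turning into a missing value unnoticed.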
# Saving the preprocessed data
preprocessed_data.to_csv(
    f"{data_path}/preprocessed_absenteism.csv", index=False
)
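Passing `index=False` keeps the row index out of the saved file. The following sketch, which uses a hypothetical two-row frame and a temporary directory instead of the real data and paths, shows what happens when the file is read back with and without that argument.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for `preprocessed_data`
frame = pd.DataFrame({"ID": [11, 36], "Seasons": ["Spring", "Spring"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    # Default behaviour: the row index is written as an unnamed first column
    frame.to_csv(with_index)
    # index=False: only the actual data columns are written
    frame.to_csv(without_index, index=False)

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

When the file is written with the index, reading it back yields an extra `Unnamed: 0` column that would have to be dropped again.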