Impact of age on reason for absence

Hypothesis: older employees might need more frequent medical treatment, so the reason for absence may depend on age.

Importing libraries and packages

# Mathematical operations and data manipulation
import pandas as pd

# Statistics
from scipy.stats import ttest_ind
from scipy.stats import ks_2samp
from scipy.stats import pearsonr

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Warnings
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

Set paths

# Path to datasets directory
data_path = "./datasets"
# Path to assets directory (for saving results to)
assets_path = "./assets"

Loading dataset

# Load data
dataset = pd.read_csv(f"{data_path}/preprocessed_absenteism.csv")
dataset.head()
| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | July | Tuesday | Spring | 289 | 36 | 13 | 33 | 239.554 | ... | No | high_school | 2 | Yes | No | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | July | Tuesday | Spring | 118 | 13 | 18 | 50 | 239.554 | ... | Yes | high_school | 1 | Yes | No | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | July | Wednesday | Spring | 179 | 51 | 18 | 38 | 239.554 | ... | No | high_school | 0 | Yes | No | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | July | Thursday | Spring | 279 | 5 | 14 | 39 | 239.554 | ... | No | high_school | 2 | Yes | Yes | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | July | Thursday | Spring | 289 | 36 | 13 | 33 | 239.554 | ... | No | high_school | 2 | Yes | No | 1 | 90 | 172 | 30 | 2 |

5 rows × 21 columns

Exploring dataset

# Printing dimensionality of the data, columns, types and missing values
print(f"Data dimension: {dataset.shape}")
for col in dataset.columns:
    print(
        f"Column: {col:35} | "
        f"type: {str(dataset[col].dtype):7} | "
        f"missing values: {dataset[col].isna().sum():3d}"
    )
Data dimension: (740, 21)
Column: ID                                  | type: int64   | missing values:   0
Column: Reason for absence                  | type: int64   | missing values:   0
Column: Month of absence                    | type: object  | missing values:   0
Column: Day of the week                     | type: object  | missing values:   0
Column: Seasons                             | type: object  | missing values:   0
Column: Transportation expense              | type: int64   | missing values:   0
Column: Distance from Residence to Work     | type: int64   | missing values:   0
Column: Service time                        | type: int64   | missing values:   0
Column: Age                                 | type: int64   | missing values:   0
Column: Work load Average/day               | type: float64 | missing values:   0
Column: Hit target                          | type: int64   | missing values:   0
Column: Disciplinary failure                | type: object  | missing values:   0
Column: Education                           | type: object  | missing values:   0
Column: Son                                 | type: int64   | missing values:   0
Column: Social drinker                      | type: object  | missing values:   0
Column: Social smoker                       | type: object  | missing values:   0
Column: Pet                                 | type: int64   | missing values:   0
Column: Weight                              | type: int64   | missing values:   0
Column: Height                              | type: int64   | missing values:   0
Column: Body mass index                     | type: int64   | missing values:   0
Column: Absenteeism time in hours           | type: int64   | missing values:   0
# Computing statistics on numerical features
dataset.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 740.0 | 18.017568 | 11.021247 | 1.000 | 9.000 | 18.000 | 28.000 | 36.000 |
| Reason for absence | 740.0 | 19.216216 | 8.433406 | 0.000 | 13.000 | 23.000 | 26.000 | 28.000 |
| Transportation expense | 740.0 | 221.329730 | 66.952223 | 118.000 | 179.000 | 225.000 | 260.000 | 388.000 |
| Distance from Residence to Work | 740.0 | 29.631081 | 14.836788 | 5.000 | 16.000 | 26.000 | 50.000 | 52.000 |
| Service time | 740.0 | 12.554054 | 4.384873 | 1.000 | 9.000 | 13.000 | 16.000 | 29.000 |
| Age | 740.0 | 36.450000 | 6.478772 | 27.000 | 31.000 | 37.000 | 40.000 | 58.000 |
| Work load Average/day | 740.0 | 271.490235 | 39.058116 | 205.917 | 244.387 | 264.249 | 294.217 | 378.884 |
| Hit target | 740.0 | 94.587838 | 3.779313 | 81.000 | 93.000 | 95.000 | 97.000 | 100.000 |
| Son | 740.0 | 1.018919 | 1.098489 | 0.000 | 0.000 | 1.000 | 2.000 | 4.000 |
| Pet | 740.0 | 0.745946 | 1.318258 | 0.000 | 0.000 | 0.000 | 1.000 | 8.000 |
| Weight | 740.0 | 79.035135 | 12.883211 | 56.000 | 69.000 | 83.000 | 89.000 | 108.000 |
| Height | 740.0 | 172.114865 | 6.034995 | 163.000 | 169.000 | 170.000 | 172.000 | 196.000 |
| Body mass index | 740.0 | 26.677027 | 4.285452 | 19.000 | 24.000 | 25.000 | 31.000 | 38.000 |
| Absenteeism time in hours | 740.0 | 6.924324 | 13.330998 | 0.000 | 2.000 | 3.000 | 8.000 | 120.000 |
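Attribute information (the numeric codes in parentheses describe the original encoding; in the preprocessed file loaded above, categorical columns such as month, day, season and education are stored as text labels):

  1. Individual identification (ID)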

  2. Reason for absence (ICD). Absences attested by the International Classification of Diseases (ICD), stratified into 21 categories (I to XXI) as follows:

    I Certain infectious and parasitic diseases
    II Neoplasms
    III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
    IV Endocrine, nutritional and metabolic diseases
    V Mental and behavioural disorders
    VI Diseases of the nervous system
    VII Diseases of the eye and adnexa
    VIII Diseases of the ear and mastoid process
    IX Diseases of the circulatory system
    X Diseases of the respiratory system
    XI Diseases of the digestive system
    XII Diseases of the skin and subcutaneous tissue
    XIII Diseases of the musculoskeletal system and connective tissue
    XIV Diseases of the genitourinary system
    XV Pregnancy, childbirth and the puerperium
    XVI Certain conditions originating in the perinatal period
    XVII Congenital malformations, deformations and chromosomal abnormalities
    XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
    XIX Injury, poisoning and certain other consequences of external causes
    XX External causes of morbidity and mortality
    XXI Factors influencing health status and contact with health services

    And 7 categories without an ICD code: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

  3. Month of absence

  4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))

  5. Seasons (summer (1), autumn (2), winter (3), spring (4))

  6. Transportation expense

  7. Distance from Residence to Work (kilometers)

  8. Service time

  9. Age

  10. Work load Average/day

  11. Hit target

  12. Disciplinary failure (yes=1; no=0)

  13. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))

  14. Son (number of children)

  15. Social drinker (yes=1; no=0)

  16. Social smoker (yes=1; no=0)

  17. Pet (number of pets)

  18. Weight

  19. Height

  20. Body mass index

  21. Absenteeism time in hours (target)
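
The coding scheme for "Reason for absence" described above can be captured in a small lookup. This is a minimal sketch, not part of the original notebook: `NON_ICD_REASONS` and `describe_reason` are hypothetical names, and code 0, which occurs in the data but is not documented in the list, is simply reported as undocumented.

```python
# Hypothetical lookup for the non-ICD reason codes listed above (22 to 28)
NON_ICD_REASONS = {
    22: "patient follow-up",
    23: "medical consultation",
    24: "blood donation",
    25: "laboratory examination",
    26: "unjustified absence",
    27: "physiotherapy",
    28: "dental consultation",
}


def describe_reason(code):
    """Return a short label for a 'Reason for absence' code."""
    if 1 <= code <= 21:
        # Codes 1 to 21 correspond to ICD categories I to XXI
        return f"ICD category {code}"
    # Anything else that is not in the lookup (such as 0) is undocumented
    return NON_ICD_REASONS.get(code, "undocumented code")


for code in (0, 7, 23, 26):
    print(code, "->", describe_reason(code))
```

Only codes 1 to 21 count as a disease in the ICD sense, which is the distinction the `Disease` column relies on.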

# Function to check whether the given integer code falls within
# the ICD categories (1 to 21)
def in_icd(val):
    r = range(1, 22)
    return "Yes" if val in r else "No"


# Adding Disease column
dataset["Disease"] = dataset["Reason for absence"].apply(in_icd)

Age plots

# Computing Pearson's correlation coefficient and p-value
pearson_test = pearsonr(dataset["Age"], dataset["Absenteeism time in hours"])

# Regression plot
plt.figure(figsize=(10, 6))
ax = sns.regplot(
    x="Age",
    y="Absenteeism time in hours",
    data=dataset,
    scatter_kws={"alpha": 0.5},
)
ax.set_title(
    f"Correlation={pearson_test[0]:.03f} | p-value={pearson_test[1]:.03f}"
)
plt.savefig(f"{assets_path}/correlation_age_hours.png", format="png", dpi=300)
[Figure: regression plot of Age against Absenteeism time in hours, titled with the Pearson correlation and p-value]
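
To make the coefficient in the plot title concrete, here is a minimal pure-Python sketch of Pearson's correlation coefficient: the sum of cross-deviations divided by the square root of the product of the summed squared deviations. The helper name `pearson_r` and the toy values are illustrative assumptions. The notebook itself relies on `scipy.stats.pearsonr`, which also returns the p-value.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations (numerator)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy values, made up for illustration (not taken from the dataset)
ages = [28, 33, 38, 41, 50]
hours = [2, 4, 2, 8, 3]
print(f"r = {pearson_r(ages, hours):.3f}")
```

A value near 0 indicates little linear association, while values near 1 or -1 indicate a strong positive or negative linear relationship.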
# Creating violin plot between the Age and Disease columns
plt.figure(figsize=(8, 6))
sns.violinplot(x="Disease", y="Age", data=dataset)
plt.savefig(f"{assets_path}/violin_age_disease.png", format="png", dpi=300)
[Figure: violin plot of Age grouped by Disease (Yes/No)]
# Age entries for employees with Disease == Yes and Disease == No
disease_mask = dataset["Disease"] == "Yes"
disease_ages = dataset["Age"][disease_mask]
no_disease_ages = dataset["Age"][~disease_mask]

# Hypothesis test for equality of means
test_res = ttest_ind(disease_ages, no_disease_ages)
print(
    f"Test for equality of means: statistic={test_res[0]:0.3f}, "
    f"pvalue={test_res[1]:0.3f}"
)

# Testing equality of distributions via Kolmogorov-Smirnov test
ks_res = ks_2samp(disease_ages, no_disease_ages)
print(
    f"KS test for equality of distributions: statistic={ks_res[0]:0.3f}, "
    f"pvalue={ks_res[1]:0.3f}"
)
Test for equality of means: statistic=0.630, pvalue=0.529
KS test for equality of distributions: statistic=0.057, pvalue=0.619
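
One way to read these results is to compare each p-value with a significance level. The sketch below assumes the conventional level of 0.05, which the notebook does not state, and reuses the p-values printed above. `ALPHA` and `decide` are hypothetical names introduced for illustration.

```python
ALPHA = 0.05  # assumed significance level (not specified in the notebook)


def decide(p_value, alpha=ALPHA):
    """Return the test decision for a given p-value."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# p-values copied from the output above
results = {
    "t-test (equal means)": 0.529,
    "KS test (equal distributions)": 0.619,
}
for name, p_value in results.items():
    print(f"{name}: p={p_value:.3f} -> {decide(p_value)}")
```

Both p-values are well above 0.05, so under this assumption neither test provides evidence that the ages differ between absences with and without an ICD-coded disease.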
# Violin plots of reason for absence vs age
plt.figure(figsize=(20, 8))
sns.violinplot(x="Reason for absence", y="Age", data=dataset)
plt.savefig(f"{assets_path}/violin_age_reason.png", format="png")
[Figure: violin plots of Age for each Reason for absence code]