
COMPAS

Running in Google Colab

You can run this experiment in Google Colab by clicking the button below:

Open in Colab

Dataset

COMPAS [1] is a dataset containing the criminal records of 6,172 individuals arrested in Florida. The task is to predict whether an individual will commit a crime again within 2 years; the probability predicted by the system is used as a risk score. As described in [2], 13 attributes are used for prediction. The risk score should be monotonically increasing w.r.t. four of them: number of prior adult convictions, number of juvenile felonies, number of juvenile misdemeanors, and number of other convictions. The monotonicity_indicator corresponding to these features is set to 1.

References:

  1. J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica, 2016.

  2. Xingchao Liu, Xing Han, Na Zhang, and Qiang Liu. Certified monotonic neural networks. Advances in Neural Information Processing Systems, 33:15427–15438, 2020.

monotonicity_indicator = {
    "priors_count": 1,
    "juv_fel_count": 1,
    "juv_misd_count": 1,
    "juv_other_count": 1,
    "age": 0,
    "race_0": 0,
    "race_1": 0,
    "race_2": 0,
    "race_3": 0,
    "race_4": 0,
    "race_5": 0,
    "sex_0": 0,
    "sex_1": 0,
}

These are a few examples of the dataset:

  0 1 2 3 4
priors_count 0.368421 0.000000 0.026316 0.394737 0.052632
juv_fel_count 0.000000 0.000000 0.000000 0.000000 0.000000
juv_misd_count 0.000000 0.000000 0.000000 0.000000 0.000000
juv_other_count 0.000000 0.000000 0.000000 0.000000 0.000000
age 0.230769 0.051282 0.179487 0.230769 0.102564
race_0 1.000000 1.000000 0.000000 1.000000 1.000000
race_1 0.000000 0.000000 1.000000 0.000000 0.000000
race_2 0.000000 0.000000 0.000000 0.000000 0.000000
race_3 0.000000 0.000000 0.000000 0.000000 0.000000
race_4 0.000000 0.000000 0.000000 0.000000 0.000000
race_5 0.000000 0.000000 0.000000 0.000000 0.000000
sex_0 1.000000 1.000000 1.000000 1.000000 1.000000
sex_1 0.000000 0.000000 0.000000 0.000000 0.000000
ground_truth 1.000000 0.000000 0.000000 0.000000 1.000000
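
The example rows suggest that the categorical attributes race and sex were one-hot encoded and the numeric attributes min-max scaled to [0, 1]. The snippet below is a minimal preprocessing sketch along those lines; the raw DataFrame df_raw and its column names are assumptions for illustration, not the loader used in this experiment.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ["priors_count", "juv_fel_count", "juv_misd_count", "juv_other_count", "age"]

# One-hot encode the categorical columns as race_0..race_5 and sex_0/sex_1
for col in ["race", "sex"]:
    codes = df_raw[col].astype("category").cat.codes
    dummies = pd.get_dummies(codes, prefix=col).astype(float)
    df_raw = pd.concat([df_raw.drop(columns=col), dummies], axis=1)

# Min-max scale the numeric columns to [0, 1], matching the example rows above
df_raw[numeric_cols] = MinMaxScaler().fit_transform(df_raw[numeric_cols])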

The choice of the batch size and the maximum number of epochs depends on the dataset size. For this dataset, we use the following values:

batch_size = 8
max_epochs = 50
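
The sketch below shows one way these values could feed the training loop; X_train, y_train, X_val and y_val are hypothetical arrays obtained from a train/validation split of the preprocessed features, and max_epochs is later passed as epochs=max_epochs when training.

import tensorflow as tf

train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .shuffle(buffer_size=len(X_train))
    .batch(batch_size)
)
val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(batch_size)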

We use the Type-2 architecture built using the MonoDense layer with the following hyperparameter ranges:

def hp_params_f(hp):
    return dict(
        units=hp.Int("units", min_value=16, max_value=32, step=1),
        n_layers=hp.Int("n_layers", min_value=2, max_value=2),
        activation=hp.Choice("activation", values=["elu"]),
        learning_rate=hp.Float(
            "learning_rate", min_value=1e-4, max_value=1e-2, sampling="log"
        ),
        weight_decay=hp.Float(
            "weight_decay", min_value=3e-2, max_value=0.3, sampling="log"
        ),
        dropout=hp.Float("dropout", min_value=0.0, max_value=0.5, sampling="linear"),
        decay_rate=hp.Float(
            "decay_rate", min_value=0.8, max_value=1.0, sampling="reverse_log"
        ),
    )
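
These ranges are consumed by a model-building function handed to the tuner. The sketch below is an illustrative build_model(hp), not the library's own Type-2 builder: it assumes MonoDense accepts units, activation and monotonicity_indicator arguments, uses a single MonoDense stack rather than the Type-2 split of monotone and auxiliary inputs, approximates weight_decay and decay_rate with an AdamW optimizer and an exponential learning-rate schedule, and hard-codes the classification head described in the next section.

import tensorflow as tf
from airt.keras.layers import MonoDense  # import path assumed

def build_model(hp):
    params = hp_params_f(hp)

    inputs = tf.keras.Input(shape=(len(monotonicity_indicator),))
    # The first layer receives the per-feature monotonicity indicator; deeper
    # layers see already-monotone hidden units, so their indicator is simply 1.
    x = MonoDense(
        units=params["units"],
        activation=params["activation"],
        monotonicity_indicator=list(monotonicity_indicator.values()),
    )(inputs)
    x = tf.keras.layers.Dropout(params["dropout"])(x)
    for _ in range(params["n_layers"] - 1):
        x = MonoDense(
            units=params["units"],
            activation=params["activation"],
            monotonicity_indicator=1,
        )(x)
        x = tf.keras.layers.Dropout(params["dropout"])(x)
    x = MonoDense(1, monotonicity_indicator=1)(x)
    outputs = tf.keras.layers.Activation("sigmoid")(x)

    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        params["learning_rate"], decay_steps=1000, decay_rate=params["decay_rate"]
    )
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(  # AdamW requires TF >= 2.11
            learning_rate=lr_schedule, weight_decay=params["weight_decay"]
        ),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model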

The following fixed parameters are used to build the Type-2 architecture for this dataset:

  • final_activation is used to build the final layer: None for a regression problem or "sigmoid" for a classification problem,

  • loss is the loss used for training: "mse" for a regression or "binary_crossentropy" for a classification problem, and

  • metrics denotes the metric used to compare with previously published results: "accuracy" for classification and "mse" or "rmse" for regression.

The parameters objective and direction are used by the tuner such that objective=f"val_{metrics}" and direction is either "min" or "max".

The parameter max_trials denotes the number of trials performed by the tuner, and patience is the number of epochs a trial is allowed to perform worse than its best epoch before it is stopped early. The parameter executions_per_trial denotes the number of training runs averaged to produce the result of a trial; it should be set to a value greater than 1 for small datasets with high variance in results. A sketch of how these values are passed to the tuner follows the parameter listing below.

final_activation = "sigmoid"
loss = "binary_crossentropy"
metrics = "accuracy"
objective = "val_accuracy"
direction = "max"
max_trials = 50
executions_per_trial = 1
patience = 5
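
A hedged sketch of how these values might be wired into keras_tuner is shown below; the tuner class (RandomSearch here), the directory and project names, and the build_model function are illustrative, and train_ds/val_ds are the hypothetical datasets from the earlier sketch.

import keras_tuner as kt
import tensorflow as tf

tuner = kt.RandomSearch(
    hypermodel=build_model,
    objective=kt.Objective(objective, direction=direction),  # "val_accuracy", "max"
    max_trials=max_trials,
    executions_per_trial=executions_per_trial,
    overwrite=True,
    directory="tuner",
    project_name="compas",
)

# Stop a trial once it has gone `patience` epochs without improving val_accuracy
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor=objective, mode=direction, patience=patience, restore_best_weights=True
)
tuner.search(
    train_ds,
    validation_data=val_ds,
    epochs=max_epochs,
    callbacks=[early_stopping],
)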

The following table describes the best models and their hyperparameters found by the tuner:

  0 1 2 3 4
units 27 28 26 31 25
n_layers 2 3 2 3 2
activation elu elu elu elu elu
learning_rate 0.084685 0.105227 0.086301 0.018339 0.069011
weight_decay 0.137518 0.120702 0.147297 0.105921 0.153525
dropout 0.175917 0.160270 0.162063 0.480390 0.180772
decay_rate 0.899399 0.872222 0.927282 0.964135 0.874505
val_accuracy_mean 0.694413 0.693603 0.692955 0.692308 0.692146
val_accuracy_std 0.003464 0.000923 0.002710 0.002217 0.002649
val_accuracy_min 0.689879 0.692308 0.689069 0.689069 0.689879
val_accuracy_max 0.698785 0.694737 0.695547 0.694737 0.696356
params 2317 3599 2237 4058 2157

The optimal model

These are the best hyperparameters found by previous runs of the tuner:

def final_hp_params_f(hp):
    return dict(
        units=hp.Fixed("units", value=27),
        n_layers=hp.Fixed("n_layers", 2),
        activation=hp.Fixed("activation", value="elu"),
        learning_rate=hp.Fixed("learning_rate", value=0.084685),
        weight_decay=hp.Fixed("weight_decay", value=0.137518),
        dropout=hp.Fixed("dropout", value=0.175917),
        decay_rate=hp.Fixed("decay_rate", value=0.899399),
    )
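
One way to re-train and evaluate this fixed configuration is to reuse the same tuner machinery with the search space pinned to these values. The snippet below is a sketch under the same assumptions as the earlier sketches (build_model, train_ds and val_ds are hypothetical), and the number of repeated executions is illustrative.

import keras_tuner as kt
import tensorflow as tf

hp = kt.HyperParameters()
final_hp_params_f(hp)  # registers every hyperparameter as a Fixed value

final_tuner = kt.RandomSearch(
    hypermodel=build_model,
    objective=kt.Objective(objective, direction=direction),
    hyperparameters=hp,
    tune_new_entries=False,  # keep the search space pinned to the fixed values
    max_trials=1,
    executions_per_trial=4,  # repeated runs to estimate mean/std (count is illustrative)
    overwrite=True,
    directory="final",
    project_name="compas",
)
final_tuner.search(
    train_ds,
    validation_data=val_ds,
    epochs=max_epochs,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor=objective, mode=direction, patience=patience)
    ],
)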

The final evaluation of the optimal model:

  0
units 27
n_layers 2
activation elu
learning_rate 0.084685
weight_decay 0.137518
dropout 0.175917
decay_rate 0.899399
val_accuracy_mean 0.691660
val_accuracy_std 0.001056
val_accuracy_min 0.690688
val_accuracy_max 0.693117
params 2317