COMPAS¤
Running in Google Colab¤
You can run this experiment in Google Colab.
Dataset¤
COMPAS [1] is a dataset containing the criminal records of 6,172 individuals arrested in Florida. The task is to predict whether an individual will commit a crime again within 2 years, and the probability predicted by the system is used as a risk score. As mentioned in [2], 13 attributes are used for prediction. The risk score should be monotonically increasing w.r.t. four attributes: number of prior adult convictions, number of juvenile felonies, number of juvenile misdemeanors, and number of other convictions. The monotonicity_indicator values corresponding to these features are set to 1.
References:
1. J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 2016.
2. Xingchao Liu, Xing Han, Na Zhang, and Qiang Liu. Certified monotonic neural networks. Advances in Neural Information Processing Systems, 33:15427–15438, 2020.
monotonicity_indicator = {
"priors_count": 1,
"juv_fel_count": 1,
"juv_misd_count": 1,
"juv_other_count": 1,
"age": 0,
"race_0": 0,
"race_1": 0,
"race_2": 0,
"race_3": 0,
"race_4": 0,
"race_5": 0,
"sex_0": 0,
"sex_1": 0,
}
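As an illustration of how this dictionary could be consumed, the sketch below builds a small monotonic classifier whose first hidden layer receives one indicator per input feature. This is a minimal sketch, not the library's verified API: the import path and the assumption that MonoDense accepts a per-feature monotonicity_indicator list are taken on trust from this document's description.

# Minimal sketch (assumed API): pass the per-feature indicators to the first
# MonoDense layer so the risk score is non-decreasing in the four count features.
import tensorflow as tf
from airt.keras.layers import MonoDense  # assumed import path

feature_names = list(monotonicity_indicator.keys())
indicator = [monotonicity_indicator[name] for name in feature_names]

inputs = tf.keras.Input(shape=(len(feature_names),))
x = MonoDense(32, activation="elu", monotonicity_indicator=indicator)(inputs)
x = MonoDense(32, activation="elu")(x)  # later layers kept monotone in all inputs
outputs = MonoDense(1, activation="sigmoid")(x)  # risk score in [0, 1]
model = tf.keras.Model(inputs=inputs, outputs=outputs)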
These are a few examples of the dataset:
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
priors_count | 0.368421 | 0.000000 | 0.026316 | 0.394737 | 0.052632 |
juv_fel_count | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
juv_misd_count | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
juv_other_count | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
age | 0.230769 | 0.051282 | 0.179487 | 0.230769 | 0.102564 |
race_0 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 |
race_1 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
race_2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
race_3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
race_4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
race_5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
sex_0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
sex_1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
ground_truth | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
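The fractional values above suggest that the numeric attributes are min-max scaled to [0, 1] and that race and sex are one-hot encoded. Below is a rough preprocessing sketch under that assumption; the raw column names "race" and "sex" are hypothetical, not taken from this document.

# Hypothetical preprocessing producing values like those in the table:
# min-max scale numeric columns, one-hot encode categorical ones.
import pandas as pd

numeric_cols = ["priors_count", "juv_fel_count", "juv_misd_count", "juv_other_count", "age"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in numeric_cols:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    for col in ["race", "sex"]:  # becomes race_0..race_5 and sex_0..sex_1
        codes = out[col].astype("category").cat.codes
        dummies = pd.get_dummies(codes, prefix=col, dtype=float)
        out = pd.concat([out.drop(columns=col), dummies], axis=1)
    return out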
Hyperparameter search¤
The choice of the batch size and the maximum number of epochs depends on the dataset size. For this dataset, we use the following values:
batch_size = 8
max_epochs = 50
We use the Type-2 architecture built using the MonoDense layer with the following hyperparameter ranges:
def hp_params_f(hp):
    # Hyperparameter search space for the Type-2 architecture.
    return dict(
        # width of each hidden layer
        units=hp.Int("units", min_value=16, max_value=32, step=1),
        # number of hidden layers
        n_layers=hp.Int("n_layers", min_value=2, max_value=2),
        activation=hp.Choice("activation", values=["elu"]),
        # optimizer settings
        learning_rate=hp.Float(
            "learning_rate", min_value=1e-4, max_value=1e-2, sampling="log"
        ),
        weight_decay=hp.Float(
            "weight_decay", min_value=3e-2, max_value=0.3, sampling="log"
        ),
        # regularization and learning-rate decay
        dropout=hp.Float("dropout", min_value=0.0, max_value=0.5, sampling="linear"),
        decay_rate=hp.Float(
            "decay_rate", min_value=0.8, max_value=1.0, sampling="reverse_log"
        ),
    )
The following fixed parameters are used to build the Type-2 architecture for this dataset:

- final_activation is used to build the final layer for a regression problem (set to None) or a classification problem (set to "sigmoid"),
- loss is used for training a regression ("mse") or classification ("binary_crossentropy") problem, and
- metrics denotes the metric used to compare with previously published results: "accuracy" for classification and "mse" or "rmse" for regression.

The parameters objective and direction are used by the tuner such that objective=f"val_{metrics}" and direction is either "min" or "max".
The parameter max_trials denotes the number of trials performed by the tuner, and patience is the number of epochs a trial is allowed to perform worse than its best epoch before it is stopped early. The parameter executions_per_trial denotes the number of runs averaged to produce the result of a trial; it should be set to a value greater than 1 for small datasets with high variance in results.
final_activation = "sigmoid"
loss = "binary_crossentropy"
metrics = "accuracy"
objective = "val_accuracy"
direction = "max"
max_trials = 50
executions_per_trial = 1
patience = 5
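The sketch below shows how these settings could drive a keras-tuner random search. The build_model function is a stand-in that uses ordinary Dense layers instead of MonoDense, purely to show where hp_params_f and the fixed parameters plug in; it is not the Type-2 architecture itself, and decay_steps is an arbitrary placeholder.

# Hypothetical wiring of the parameters above into a keras-tuner search.
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # Stand-in for the Type-2 model builder; consumes the search space above.
    params = hp_params_f(hp)
    model = keras.Sequential([keras.Input(shape=(13,))])
    for _ in range(params["n_layers"]):
        model.add(keras.layers.Dense(params["units"], activation=params["activation"]))
        model.add(keras.layers.Dropout(params["dropout"]))
    model.add(keras.layers.Dense(1, activation=final_activation))
    lr = keras.optimizers.schedules.ExponentialDecay(
        params["learning_rate"], decay_steps=1000, decay_rate=params["decay_rate"]
    )  # decay_steps=1000 is a placeholder value
    # AdamW is available in recent Keras/TF versions
    optimizer = keras.optimizers.AdamW(learning_rate=lr, weight_decay=params["weight_decay"])
    model.compile(optimizer=optimizer, loss=loss, metrics=[metrics])
    return model

tuner = kt.RandomSearch(
    build_model,
    objective=kt.Objective(objective, direction=direction),
    max_trials=max_trials,
    executions_per_trial=executions_per_trial,
    overwrite=True,
)
early_stopping = keras.callbacks.EarlyStopping(monitor=objective, patience=patience)
# tuner.search(X_train, y_train, batch_size=batch_size, epochs=max_epochs,
#              validation_data=(X_val, y_val), callbacks=[early_stopping])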
The following table describes the best models and their hyperparameters found by the tuner:
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
units | 27 | 28 | 26 | 31 | 25 |
n_layers | 2 | 3 | 2 | 3 | 2 |
activation | elu | elu | elu | elu | elu |
learning_rate | 0.084685 | 0.105227 | 0.086301 | 0.018339 | 0.069011 |
weight_decay | 0.137518 | 0.120702 | 0.147297 | 0.105921 | 0.153525 |
dropout | 0.175917 | 0.160270 | 0.162063 | 0.480390 | 0.180772 |
decay_rate | 0.899399 | 0.872222 | 0.927282 | 0.964135 | 0.874505 |
val_accuracy_mean | 0.694413 | 0.693603 | 0.692955 | 0.692308 | 0.692146 |
val_accuracy_std | 0.003464 | 0.000923 | 0.002710 | 0.002217 | 0.002649 |
val_accuracy_min | 0.689879 | 0.692308 | 0.689069 | 0.689069 | 0.689879 |
val_accuracy_max | 0.698785 | 0.694737 | 0.695547 | 0.694737 | 0.696356 |
params | 2317 | 3599 | 2237 | 4058 | 2157 |
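Given a finished search, the hyperparameter columns of the table above can be read back from the tuner, for example:

# Retrieve the five best hyperparameter sets found by the search above.
best_hps = tuner.get_best_hyperparameters(num_trials=5)
for hp in best_hps:
    print(hp.values)  # e.g. {"units": 27, "n_layers": 2, "activation": "elu", ...}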
The optimal model¤
These are the best hyperparameters found by previous runs of the tuner:
def final_hp_params_f(hp):
return dict(
units=hp.Fixed("units", value=27),
n_layers=hp.Fixed("n_layers", 2),
activation=hp.Fixed("activation", value="elu"),
learning_rate=hp.Fixed("learning_rate", value=0.084685),
weight_decay=hp.Fixed("weight_decay", value=0.137518),
dropout=hp.Fixed("dropout", value=0.175917),
decay_rate=hp.Fixed("decay_rate", value=0.899399),
)
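With the hyperparameters fixed, the same stand-in build function sketched earlier can be reused to train and evaluate the final model. This assumes keras-tuner resolves already-registered Fixed values when build_model queries them; the data variables are placeholders.

# Build and train the optimal model from the fixed hyperparameters.
import keras_tuner as kt
from tensorflow import keras

hp = kt.HyperParameters()
final_hp_params_f(hp)    # registers the Fixed values on `hp`
model = build_model(hp)  # hp.Int/hp.Float/hp.Choice should now return the fixed values
# model.fit(X_train, y_train, batch_size=batch_size, epochs=max_epochs,
#           validation_data=(X_val, y_val),
#           callbacks=[keras.callbacks.EarlyStopping(monitor=objective, patience=patience)])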
The final evaluation of the optimal model:
 | 0 |
---|---|
units | 27 |
n_layers | 2 |
activation | elu |
learning_rate | 0.084685 |
weight_decay | 0.137518 |
dropout | 0.175917 |
decay_rate | 0.899399 |
val_accuracy_mean | 0.691660 |
val_accuracy_std | 0.001056 |
val_accuracy_min | 0.690688 |
val_accuracy_max | 0.693117 |
params | 2317 |