LivingDataLab - Patient Selection for Diabetes Drug Testing

1 Introduction

EHR data is becoming a key source of real-world evidence (RWE) for the pharmaceutical industry and regulators to make decisions on clinical trials.

For this project, we have a groundbreaking diabetes drug that is ready for clinical trial testing. It is a sensitive drug that must be administered over at least 5-7 days in the hospital, with frequent monitoring/testing and patient medication adherence training through a mobile application. We have been provided a patient dataset from a client partner and are tasked with building a predictive model that identifies which types of patients the company should focus on when testing this drug. Target patients are those who are likely to be in the hospital for this length of time anyway, so that administering and monitoring the drug does not incur significant additional costs.

In order to achieve our goal we must build a regression model that can predict the estimated hospitalization time for a patient and use this to select/filter patients for this study.

2 Approach

Using a synthetic dataset built from the UCI Diabetes readmission dataset (augmented and denormalized to the line level), we will build a regression model that predicts the expected days of hospitalization and then convert this to a binary prediction of whether to include or exclude that patient from the clinical trial.

This project will demonstrate the importance of building the right data representation at the encounter level, with appropriate filtering and preprocessing/feature engineering of key medical code sets. We will also analyze and interpret the model for biases across key demographic groups.

3 Dataset

Due to healthcare PHI regulations (HIPAA, HITECH), there are a limited number of publicly available datasets, and some of them require training and approval. For the purpose of this study, we are using a modified dataset from UC Irvine.

4 Dataset Loading and Schema Review

import pandas as pd

dataset_path = "./data/final_project_dataset.csv"
df = pd.read_csv(dataset_path)

# Show first few rows
df.head()

	encounter_id	patient_nbr	race	gender	age	weight	admission_type_id	discharge_disposition_id	admission_source_id	time_in_hospital	payer_code	medical_specialty	primary_diagnosis_code	other_diagnosis_codes	number_outpatient	number_inpatient	num_lab_procedures	number_diagnoses	num_medications	num_procedures	ndc_code	max_glu_serum	A1Cresult	change	readmitted
0	2278392	8222157	Caucasian	Female	[0-10)	?	6	25	1	1	?	Pediatrics-Endocrinology	250.83	?|?	0	0	41	1	1	0	NaN	None	None	No	NO
1	149190	55629189	Caucasian	Female	[10-20)	?	1	1	7	3	?	?	276	250.01|255	0	0	59	9	18	0	68071-1701	None	None	Ch	>30
2	64410	86047875	AfricanAmerican	Female	[20-30)	?	1	1	7	2	?	?	648	250|V27	2	1	11	6	13	5	0378-1110	None	None	No	NO
3	500364	82442376	Caucasian	Male	[30-40)	?	1	1	7	2	?	?	8	250.43|403	0	0	44	7	16	1	68071-1701	None	None	Ch	NO
4	16680	42519267	Caucasian	Male	[40-50)	?	1	1	7	1	?	?	197	157|250	0	0	51	5	8	0	0049-4110	None	None	Ch	NO

4.1 Determine Level of Dataset (Line or Encounter)

Given that there are only 101766 unique encounter_id values yet 143424 non-null rows, the dataset appears to be at the line level.

We would also want to aggregate on primary_diagnosis_code, as there is only one of these per encounter. By aggregating on encounter_id, patient_nbr, and primary_diagnosis_code, we can create an encounter-level dataset.
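
The check described above can be sketched by comparing the row count with the number of unique encounter_id values. This is a minimal illustration on a small made-up frame, not the project data; the helper name dataset_level is hypothetical.

```python
import pandas as pd

def dataset_level(df, encounter_col="encounter_id"):
    """Return 'line' if encounters repeat across rows, otherwise 'encounter'."""
    n_rows = len(df)
    n_encounters = df[encounter_col].nunique()
    return "line" if n_rows > n_encounters else "encounter"

# Toy data (illustrative only): encounter 1 spans two rows, one per drug line
toy_df = pd.DataFrame({
    "encounter_id": [1, 1, 2],
    "patient_nbr": [10, 10, 11],
    "ndc_code": ["0087-6060", "0049-4110", "0087-6060"],
})

print(dataset_level(toy_df))
```

On the real dataframe the same comparison is len(df) against df['encounter_id'].nunique().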

5 Analyze Dataset

# Look at range of values & key stats for numerical columns
numerical_feature_list = ['time_in_hospital',  'number_outpatient', 'number_inpatient', 'number_emergency', 'num_lab_procedures', 'number_diagnoses', 'num_medications', 'num_procedures' ]
df[numerical_feature_list].describe()

	time_in_hospital	number_outpatient	number_inpatient	number_emergency	num_lab_procedures	number_diagnoses	num_medications	num_procedures
count	143424.000000	143424.000000	143424.000000	143424.000000	143424.000000	143424.000000	143424.000000	143424.000000
mean	4.490190	0.362429	0.600855	0.195086	43.255745	7.424434	16.776035	1.349021
std	2.999667	1.249295	1.207934	0.920410	19.657319	1.924872	8.397130	1.719104
min	1.000000	0.000000	0.000000	0.000000	1.000000	1.000000	1.000000	0.000000
25%	2.000000	0.000000	0.000000	0.000000	32.000000	6.000000	11.000000	0.000000
50%	4.000000	0.000000	0.000000	0.000000	44.000000	8.000000	15.000000	1.000000
75%	6.000000	0.000000	1.000000	0.000000	57.000000	9.000000	21.000000	2.000000
max	14.000000	42.000000	21.000000	76.000000	132.000000	16.000000	81.000000	6.000000

# Define utility functions
import numpy as np

def create_cardinality_feature(df):
    # Simulate a high-cardinality code column with random values in [100, 1000)
    num_rows = len(df)
    random_code_list = np.arange(100, 1000, 1)
    return np.random.choice(random_code_list, num_rows)

def count_unique_values(df, cat_col_list):
    # Copy so the synthetic column is not written into a view of df
    cat_df = df[cat_col_list].copy()
    # Add a synthetic feature with high cardinality
    cat_df['principal_diagnosis_code'] = create_cardinality_feature(cat_df)
    val_df = pd.DataFrame({'columns': cat_df.columns,
                       'cardinality': cat_df.nunique() } )
    return val_df

categorical_feature_list = [ 'race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty', 'primary_diagnosis_code', 'other_diagnosis_codes','ndc_code', 'max_glu_serum', 'A1Cresult', 'change', 'readmitted']

categorical_df = count_unique_values(df, categorical_feature_list)
categorical_df

	columns	cardinality
race	race	6
gender	gender	3
age	age	10
weight	weight	10
payer_code	payer_code	18
medical_specialty	medical_specialty	73
primary_diagnosis_code	primary_diagnosis_code	717
other_diagnosis_codes	other_diagnosis_codes	19374
ndc_code	ndc_code	251
max_glu_serum	max_glu_serum	4
A1Cresult	A1Cresult	4
change	change	2
readmitted	readmitted	3
principal_diagnosis_code	principal_diagnosis_code	900

5.1 Analysis key findings

The ndc_code field has a high number of missing values (23460).
num_lab_procedures and num_medications appear to be roughly normally distributed.
The fields with high cardinality are medical_specialty, primary_diagnosis_code, other_diagnosis_codes, ndc_code, and principal_diagnosis_code (the last of these is the synthetic column added by count_unique_values). This is because there are many thousands of these codes, corresponding to the many disease and diagnosis sub-classes that exist in the medical field.
The distribution of the age field is approximately normal, which we would expect. The distribution of the gender field is roughly uniform, discounting the very small number of Unknown/Invalid cases. This is not surprising: the distribution of genders in the general population is also roughly equal, so this appears to be a representative sample of the general population.

6 Reduce Dimensionality of the NDC Code Feature

NDC codes are a common format to represent the wide variety of drugs that are prescribed for patient care in the United States. The challenge is that there are many codes that map to the same or similar drug. We are provided with the ndc drug lookup file https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ndc_lookup_table.csv derived from the National Drug Codes List site(https://ndclist.com/).

We can use this file to come up with a way to reduce the dimensionality of this field and create a new field in the dataset called “generic_drug_name” in the output dataframe.

#NDC code lookup file
ndc_code_path = "./medication_lookup_tables/final_ndc_lookup_table"
ndc_code_df = pd.read_csv(ndc_code_path)

# Check first few rows
ndc_code_df.head()

	NDC_Code	Proprietary Name	Non-proprietary Name	Dosage Form	Route Name	Company Name	Product Type
0	0087-6060	Glucophage	Metformin Hydrochloride	Tablet, Film Coated	Oral	Bristol-myers Squibb Company	Human Prescription Drug
1	0087-6063	Glucophage XR	Metformin Hydrochloride	Tablet, Extended Release	Oral	Bristol-myers Squibb Company	Human Prescription Drug
2	0087-6064	Glucophage XR	Metformin Hydrochloride	Tablet, Extended Release	Oral	Bristol-myers Squibb Company	Human Prescription Drug
3	0087-6070	Glucophage	Metformin Hydrochloride	Tablet, Film Coated	Oral	Bristol-myers Squibb Company	Human Prescription Drug
4	0087-6071	Glucophage	Metformin Hydrochloride	Tablet, Film Coated	Oral	Bristol-myers Squibb Company	Human Prescription Drug

# Check for duplicate NDC_Code's
ndc_code_df[ndc_code_df.duplicated(subset=['NDC_Code'])]

	NDC_Code	Proprietary Name	Non-proprietary Name	Dosage Form	Route Name	Company Name	Product Type
263	0781-5634	Pioglitazone Hydrochloride And Glimepiride	Pioglitazone Hydrochloride And Glimepiride	Tablet	Oral	Sandoz Inc	Human Prescription Drug
264	0781-5635	Pioglitazone Hydrochloride And Glimepiride	Pioglitazone Hydrochloride And Glimepiride	Tablet	Oral	Sandoz Inc	Human Prescription Drug

# Remove duplicates, keeping the first occurrence of each NDC_Code
ndc_code_df = ndc_code_df.drop_duplicates(subset=['NDC_Code'], keep='first')
ndc_code_df[ndc_code_df.duplicated(subset=['NDC_Code'])]

	NDC_Code	Proprietary Name	Non-proprietary Name	Dosage Form	Route Name	Company Name	Product Type
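
The mapping step that produces “generic_drug_name” is not shown in this write-up. A minimal sketch, assuming the Non-proprietary Name column of the lookup table serves as the generic name, is a left join on the NDC code. The function name reduce_dimension_ndc and the toy rows are assumptions for illustration (the code 0000-0000 is made up to show an unmatched value).

```python
import pandas as pd

def reduce_dimension_ndc(df, ndc_df):
    """Attach a generic_drug_name column by joining ndc_code to the lookup table."""
    # Assumption: 'Non-proprietary Name' is the generic drug name
    lookup = ndc_df[["NDC_Code", "Non-proprietary Name"]].drop_duplicates("NDC_Code")
    merged = df.merge(lookup, how="left", left_on="ndc_code", right_on="NDC_Code")
    merged = merged.rename(columns={"Non-proprietary Name": "generic_drug_name"})
    return merged.drop(columns=["NDC_Code"])

# Two lookup rows taken from the table above
toy_lookup = pd.DataFrame({
    "NDC_Code": ["0087-6060", "0087-6063"],
    "Non-proprietary Name": ["Metformin Hydrochloride", "Metformin Hydrochloride"],
})
# Toy line-level rows; '0000-0000' has no match in the lookup
toy_lines = pd.DataFrame({
    "encounter_id": [1, 2, 3],
    "ndc_code": ["0087-6060", "0087-6063", "0000-0000"],
})

toy_reduced = reduce_dimension_ndc(toy_lines, toy_lookup)
print(toy_reduced)
```

Both matched codes collapse to a single generic name, which is the intended reduction in dimensionality; unmatched codes are left as missing.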

7 Select First Encounter for each Patient

In order to simplify the aggregation of data for the model, we will only select the first encounter for each patient in the dataset. This is to reduce the risk of data leakage of future patient encounters and to reduce complexity of the data transformation and modeling steps. We will assume that sorting in numerical order on the encounter_id provides the time horizon for determining which encounters come before and after another.

from student_utils import select_first_encounter
first_encounter_df = select_first_encounter(reduce_dim_df)

# unique patients in transformed dataset
unique_patients = first_encounter_df['patient_nbr'].nunique()
print("Number of unique patients:{}".format(unique_patients))

# unique encounters in transformed dataset
unique_encounters = first_encounter_df['encounter_id'].nunique()
print("Number of unique encounters:{}".format(unique_encounters))

original_unique_patient_number = reduce_dim_df['patient_nbr'].nunique()
# number of unique patients should be equal to the number of unique encounters and patients in the final dataset
assert original_unique_patient_number == unique_patients
assert original_unique_patient_number == unique_encounters

Number of unique patients:71518
Number of unique encounters:71518
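
select_first_encounter lives in student_utils and is not shown here. One possible implementation, under the stated assumption that the smallest encounter_id per patient is the earliest encounter, is sketched below on toy data.

```python
import pandas as pd

def select_first_encounter(df):
    """Keep all line-level rows belonging to each patient's earliest encounter."""
    # Assumption from the text: lower encounter_id means earlier in time
    first_ids = df.groupby("patient_nbr")["encounter_id"].min()
    mask = df["encounter_id"].isin(first_ids.values)
    return df[mask].reset_index(drop=True)

# Toy data: patient 10 has encounters 5 (two drug lines) and 9; patient 11 has encounter 7
toy_df = pd.DataFrame({
    "encounter_id": [5, 5, 9, 7],
    "patient_nbr": [10, 10, 10, 11],
    "generic_drug_name": ["Metformin Hydrochloride", "Glipizide", "Glipizide", "Glipizide"],
})

toy_first = select_first_encounter(toy_df)
print(toy_first)
```

Note that the result is still at the line level: a first encounter with several drug lines keeps all of them until the aggregation step.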

8 Aggregate Dataset to Right Level for Modelling

To keep things simple, we create dummy columns for each unique generic drug name and add those as input features to the model.

exclusion_list = ['generic_drug_name']
grouping_field_list = [c for c in first_encounter_df.columns if c not in exclusion_list]
agg_drug_df, ndc_col_list = aggregate_dataset(first_encounter_df, grouping_field_list, 'generic_drug_name')

assert len(agg_drug_df) == agg_drug_df['patient_nbr'].nunique() == agg_drug_df['encounter_id'].nunique()
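
aggregate_dataset is a project helper whose body is not shown. A minimal sketch of the idea, assuming one 0/1 indicator column per generic drug name and one output row per combination of grouping fields, might look like this; the implementation details are an assumption, not the project's actual code.

```python
import pandas as pd

def aggregate_dataset(df, grouping_field_list, array_field):
    """Collapse line-level rows into one row per group with 0/1 drug indicators."""
    # One indicator column per distinct value of array_field
    dummies = pd.get_dummies(df[array_field]).astype(int)
    dummy_cols = [str(c).replace(" ", "_") for c in dummies.columns]
    dummies.columns = dummy_cols
    combined = pd.concat([df[grouping_field_list], dummies], axis=1)
    # max() marks a drug as present if any line of the encounter had it;
    # dropna=False keeps groups whose grouping fields contain missing values
    agg_df = combined.groupby(grouping_field_list, as_index=False, dropna=False)[dummy_cols].max()
    return agg_df, dummy_cols

# Toy data: encounter 5 has two drug lines, encounter 7 has one
toy_df = pd.DataFrame({
    "encounter_id": [5, 5, 7],
    "patient_nbr": [10, 10, 11],
    "generic_drug_name": ["Metformin Hydrochloride", "Glipizide", "Glipizide"],
})

toy_agg, toy_cols = aggregate_dataset(toy_df, ["encounter_id", "patient_nbr"], "generic_drug_name")
print(toy_agg)
```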

9 Prepare Fields and Cast Dataset

9.1 Feature Selection

# Look at counts for payer_code categories
ax = sns.countplot(x="payer_code", data=agg_drug_df)

# Look at counts for weight categories
ax = sns.countplot(x="weight", data=agg_drug_df)

From the category counts above, we can see that although payer_code has many unknown values (i.e. ‘?’), there are still many records with other payer codes, so it may prove a useful predictor for our target variable. For weight, almost all values are unknown (‘?’), so this feature is unlikely to be helpful for predicting our target variable.

# Selected features
required_demo_col_list = ['race', 'gender', 'age']
student_categorical_col_list = [ "change", "readmitted", "payer_code", "medical_specialty", "primary_diagnosis_code", "other_diagnosis_codes", "max_glu_serum", "A1Cresult",  "admission_type_id", "discharge_disposition_id", "admission_source_id"] + required_demo_col_list + ndc_col_list
student_numerical_col_list = ["number_outpatient", "number_inpatient", "number_emergency", "num_lab_procedures", "number_diagnoses", "num_medications", "num_procedures"]
PREDICTOR_FIELD = 'time_in_hospital'

def select_model_features(df, categorical_col_list, numerical_col_list, PREDICTOR_FIELD, grouping_key='patient_nbr'):
    selected_col_list = [grouping_key] + [PREDICTOR_FIELD] + categorical_col_list + numerical_col_list
    # Use the dataframe passed in, not the global agg_drug_df
    return df[selected_col_list]

selected_features_df = select_model_features(agg_drug_df, student_categorical_col_list, student_numerical_col_list,
                                            PREDICTOR_FIELD)

9.2 Preprocess Dataset - Casting and Imputing

We cast and impute the dataset before splitting so that we do not have to repeat these steps for each partition. There could be deeper analysis into which features to impute and how, but for the sake of time we take a general strategy: numerical features are imputed with zero, and missing categorical values are replaced with the string 'nan'.

processed_df = preprocess_df(selected_features_df, student_categorical_col_list,
        student_numerical_col_list, PREDICTOR_FIELD, categorical_impute_value='nan', numerical_impute_value=0)
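
preprocess_df is also a helper that is not shown. The sketch below illustrates the casting and imputing strategy described above on toy data; the body is an assumption about how such a helper could be written, with the same argument names as the call.

```python
import pandas as pd

def preprocess_df(df, categorical_col_list, numerical_col_list, predictor,
                  categorical_impute_value="nan", numerical_impute_value=0):
    """Cast the label to float, categoricals to str, and impute missing values."""
    out = df.copy()
    out[predictor] = out[predictor].astype(float)
    for c in categorical_col_list:
        # Replace missing entries, then cast everything to string
        filled = out[c].astype(object).where(out[c].notna(), categorical_impute_value)
        out[c] = filled.astype(str)
    for c in numerical_col_list:
        out[c] = out[c].fillna(numerical_impute_value).astype(float)
    return out

# Toy data with one missing categorical and one missing numerical value
toy_df = pd.DataFrame({
    "payer_code": [None, "MC"],
    "num_procedures": [1.0, float("nan")],
    "time_in_hospital": [3, 5],
})

toy_processed = preprocess_df(toy_df, ["payer_code"], ["num_procedures"], "time_in_hospital")
print(toy_processed)
```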

10 Split Dataset into Train, Validation, and Test Partitions

To prepare the data for training and evaluating a deep learning model, we will split the dataset into three partitions, with the validation partition used for optimizing the model hyperparameters during training. A key requirement is that data does not accidentally leak across partitions.

We will split the input dataset into three partitions (train, validation, test) with the following requirements:

Approximately 60%/20%/20% train/validation/test split
Randomly sample different patients into each data partition
A patient’s data must not appear in more than one partition, so that we avoid possible data leakage
The total number of unique patients across the splits must equal the total number of unique patients in the original dataset
Total number of rows in original dataset = sum of rows across all three dataset partitions

from student_utils import patient_dataset_splitter
d_train, d_val, d_test = patient_dataset_splitter(processed_df, 'patient_nbr')

Total number of unique patients in train = 32563
Total number of unique patients in validation = 10854
Total number of unique patients in test = 10854
Training partition has a shape = (32563, 43)
Validation partition has a shape = (10854, 43)
Test partition has a shape = (10854, 43)
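
The splitting requirements listed above can be sketched as follows: shuffle the unique patient IDs, cut them 60/20/20, and select rows by patient. This is one possible implementation of patient_dataset_splitter, not necessarily the one in student_utils; the seed parameter is an addition for reproducibility.

```python
import numpy as np
import pandas as pd

def patient_dataset_splitter(df, patient_key="patient_nbr", seed=42):
    """Split rows into train/validation/test so that no patient spans partitions."""
    patients = df[patient_key].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(patients)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    train_ids = shuffled[:n_train]
    val_ids = shuffled[n_train:n_train + n_val]
    test_ids = shuffled[n_train + n_val:]  # remainder, so every patient is assigned
    train = df[df[patient_key].isin(train_ids)].reset_index(drop=True)
    val = df[df[patient_key].isin(val_ids)].reset_index(drop=True)
    test = df[df[patient_key].isin(test_ids)].reset_index(drop=True)
    return train, val, test

# Toy data: 10 patients, one row each
toy_df = pd.DataFrame({
    "patient_nbr": list(range(100, 110)),
    "time_in_hospital": [1, 3, 2, 2, 1, 5, 7, 4, 6, 9],
})

toy_train, toy_val, toy_test = patient_dataset_splitter(toy_df)
print(len(toy_train), len(toy_val), len(toy_test))
```

Splitting on patient IDs rather than on rows is what guarantees the no-leakage requirement even when a patient has several rows.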

11 Demographic Representation Analysis of Split

After the split, we should check the distribution of key features/groups and make sure that there are representative samples across the partitions.

11.1 Label Distribution Across Partitions

Are the histogram distribution shapes similar across partitions?

show_group_stats_viz(processed_df, PREDICTOR_FIELD)

show_group_stats_viz(d_train, PREDICTOR_FIELD)

show_group_stats_viz(d_test, PREDICTOR_FIELD)

11.2 Demographic Group Analysis

We should check that our partitions/splits of the dataset are similar in terms of their demographic profiles.

# Full dataset before splitting
patient_demo_features = ['race', 'gender', 'age', 'patient_nbr']
patient_group_analysis_df = processed_df[patient_demo_features].groupby('patient_nbr').head(1).reset_index(drop=True)
show_group_stats_viz(patient_group_analysis_df, 'gender')

# Training partition
show_group_stats_viz(d_train, 'gender')

# Test partition
show_group_stats_viz(d_test, 'gender')

12 Convert Dataset Splits to TF Dataset

# Convert dataset from Pandas dataframes to TF dataset
batch_size = 128
diabetes_train_ds = df_to_dataset(d_train, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_val_ds = df_to_dataset(d_val, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_test_ds = df_to_dataset(d_test, PREDICTOR_FIELD, batch_size=batch_size)

# We use this sample of the dataset to show transformations later
diabetes_batch = next(iter(diabetes_train_ds))[0]

def demo(feature_column, example_batch):
    # Apply a single feature column to a batch and print the dense result
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch))

13 Create Features

13.1 Create Categorical Features with TF Feature Columns

Before we can create the TF categorical features, we must first create the vocab files with the unique values for a given field that are from the training dataset.

# Build Vocabulary for Categorical Features
vocab_file_list = build_vocab_files(d_train, student_categorical_col_list)

13.2 Create Categorical Features with Tensorflow Feature Column API

from student_utils import create_tf_categorical_feature_cols
tf_cat_col_list = create_tf_categorical_feature_cols(student_categorical_col_list)

test_cat_var1 = tf_cat_col_list[0]
print("Example categorical field:\n{}".format(test_cat_var1))
demo(test_cat_var1, diabetes_batch)

13.3 Create Numerical Features with TF Feature Columns

from student_utils import create_tf_numeric_feature

def calculate_stats_from_train_data(df, col):
    # Statistics are taken from the training partition only
    mean = df[col].mean()
    std = df[col].std()
    return mean, std

def create_tf_numerical_feature_cols(numerical_col_list, train_df):
    tf_numeric_col_list = []
    for c in numerical_col_list:
        mean, std = calculate_stats_from_train_data(train_df, c)
        tf_numeric_feature = create_tf_numeric_feature(c, mean, std)
        tf_numeric_col_list.append(tf_numeric_feature)
    return tf_numeric_col_list

tf_cont_col_list = create_tf_numerical_feature_cols(student_numerical_col_list, d_train)

test_cont_var1 = tf_cont_col_list[0]
print("Example continuous field:\n{}\n".format(test_cont_var1))
demo(test_cont_var1, diabetes_batch)
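
create_tf_numeric_feature comes from student_utils and is not shown. The training mean and standard deviation computed above are presumably used to standardize each numeric column. The sketch below shows that z-score transform on its own with NumPy, without the TF feature-column wiring; the function name normalize_numeric_with_zscore and the sample values are assumptions.

```python
import functools
import numpy as np

def normalize_numeric_with_zscore(col, mean, std):
    """Standardize values using statistics computed on the training partition."""
    return (np.asarray(col, dtype=float) - mean) / std

# Toy training values (illustrative only)
train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mean = train_values.mean()
std = train_values.std(ddof=1)  # sample std, the pandas default

# Bind the training statistics so the same transform can be reused on any partition
normalizer = functools.partial(normalize_numeric_with_zscore, mean=mean, std=std)
z = normalizer(train_values)
print(z)
```

In TensorFlow, a bound function of this kind is the sort of callable that can be passed as normalizer_fn to tf.feature_column.numeric_column.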

14 Build Deep Learning Regression Model with Sequential API and TF Probability Layers

14.1 Use DenseFeatures to combine features for model

Now that we have prepared categorical and numerical features using Tensorflow’s Feature Column API, we can combine them into a dense vector representation for the model. Below we will create this new input layer, which we will call ‘claim_feature_layer’.

claim_feature_columns = tf_cat_col_list + tf_cont_col_list
claim_feature_layer = tf.keras.layers.DenseFeatures(claim_feature_columns)

14.2 Build Sequential API Model from DenseFeatures and TF Probability Layers

def build_sequential_model(feature_layer):
    model = tf.keras.Sequential([
        feature_layer,
        tf.keras.layers.Dense(150, activation='relu'),
        tf.keras.layers.Dense(200, activation='relu'),# New
        tf.keras.layers.Dense(75, activation='relu'),
        tfp.layers.DenseVariational(1+1, posterior_mean_field, prior_trainable),
        tfp.layers.DistributionLambda(
            lambda t:tfp.distributions.Normal(loc=t[..., :1],
                                             scale=1e-3 + tf.math.softplus(0.01 * t[...,1:])
                                             )
        ),
    ])
    return model

def build_diabetes_model(train_ds, val_ds,  feature_layer,  epochs=5, loss_metric='mse'):
    model = build_sequential_model(feature_layer)
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    model.compile(optimizer=opt, loss=loss_metric, metrics=[loss_metric])
    #model.compile(optimizer='rmsprop', loss=loss_metric, metrics=[loss_metric])
    #early_stop = tf.keras.callbacks.EarlyStopping(monitor=loss_metric, patience=3)     
    history = model.fit(train_ds, validation_data=val_ds,
                        #callbacks=[early_stop],
                        epochs=epochs)
    return model, history

diabetes_model, history = build_diabetes_model(diabetes_train_ds, diabetes_val_ds,  claim_feature_layer,  epochs=10)

14.3 Show Model Uncertainty Range with TF Probability

Now that we have trained a model with TF Probability layers, we can extract the mean and standard deviation for each prediction.

feature_list = student_categorical_col_list + student_numerical_col_list
diabetes_x_tst = dict(d_test[feature_list])
diabetes_yhat = diabetes_model(diabetes_x_tst)
preds = diabetes_model.predict(diabetes_test_ds)

from student_utils import get_mean_std_from_preds
m, s = get_mean_std_from_preds(diabetes_yhat)

14.4 Show Prediction Output

prob_outputs = {
    "pred": preds.flatten(),
    "actual_value": d_test['time_in_hospital'].values,
    "pred_mean": m.numpy().flatten(),
    "pred_std": s.numpy().flatten()
}
prob_output_df = pd.DataFrame(prob_outputs)

prob_output_df.head()

	pred	actual_value	pred_mean	pred_std
0	3.587955	3.0	4.673843	0.693749
1	5.007016	2.0	4.673843	0.693749
2	4.809363	9.0	4.673843	0.693749
3	5.003417	2.0	4.673843	0.693749
4	5.346958	8.0	4.673843	0.693749

prob_output_df.describe()

	pred	actual_value	pred_mean	pred_std
count	10854.000000	10854.000000	10854.000000	10854.000000
mean	4.376980	4.429888	4.673843	0.693749
std	0.908507	3.002044	0.000000	0.000000
min	0.976290	1.000000	4.673843	0.693749
25%	3.755292	2.000000	4.673843	0.693749
50%	4.382993	4.000000	4.673843	0.693749
75%	5.002859	6.000000	4.673843	0.693749
max	7.529900	14.000000	4.673843	0.693749

14.5 Convert Regression Output to Classification Output for Patient Selection

from student_utils import get_student_binary_prediction
student_binary_prediction = get_student_binary_prediction(prob_output_df, 'pred')

student_binary_prediction.value_counts()

0:8137
1:2717
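
get_student_binary_prediction is not shown. A minimal sketch, assuming a cutoff of 5 predicted days (the same 5-day cutoff used for the actual label in the next section), is below. The sample values are the first five pred values from the table in 14.4.

```python
import pandas as pd

def get_student_binary_prediction(df, col, threshold=5):
    """Return 1 where the predicted stay is at least `threshold` days, else 0."""
    # Assumption: the cutoff mirrors the 5-day label threshold
    return (df[col] >= threshold).astype(int)

# First five predictions from prob_output_df.head()
sample_df = pd.DataFrame({"pred": [3.587955, 5.007016, 4.809363, 5.003417, 5.346958]})

sample_binary = get_student_binary_prediction(sample_df, "pred")
print(sample_binary.tolist())
```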

14.6 Add Binary Prediction to Test Dataframe

We can add the student_binary_prediction output, which holds the binary labels, to a dataframe to better visualize the results and to prepare the data for the Aequitas toolkit. Aequitas requires a binary prediction (the ‘score’ field) and a binary actual value (the ‘label_value’ field).

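def add_pred_to_test(test_df, pred_np, demo_col_list):
    # Work on a copy so the test partition itself is not modified
    test_df = test_df.copy()
    for c in demo_col_list:
        test_df[c] = test_df[c].astype(str)
    test_df['score'] = pred_np
    # Actual label: 1 if the patient stayed at least 5 days
    test_df['label_value'] = test_df['time_in_hospital'].apply(lambda x: 1 if x >=5 else 0)
    return test_df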

pred_test_df = add_pred_to_test(d_test, student_binary_prediction, ['race', 'gender'])

pred_test_df[['patient_nbr', 'gender', 'race', 'time_in_hospital', 'score', 'label_value']].head()

	patient_nbr	gender	race	time_in_hospital	score	label_value
0	122896787	Male	Caucasian	3.0	0	0
1	102598929	Male	Caucasian	2.0	1	0
2	80367957	Male	Caucasian	9.0	0	1
3	6721533	Male	Caucasian	2.0	1	0
4	104346288	Female	Caucasian	8.0	1	1

15 Model Evaluation Metrics

Now it is time to use the newly created binary labels in the ‘pred_test_df’ dataframe to evaluate the model with some common classification metrics. We will create a report summary of the performance of the model and give the ROC AUC, F1 score (weighted), and class precision and recall scores.

# Accuracy and ROC AUC on the binary labels
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = pred_test_df['label_value'].values
y_pred = pred_test_df['score'].values

accuracy_score(y_true, y_pred)

0.5627418463239359

roc_auc_score(y_true, y_pred)

0.5032089104088319
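
Precision, recall and F1 are not printed above, but they can be derived from the confusion counts. The sketch below sums the Female and Male rows of the Aequitas crosstab in section 16.1 (which together cover the whole test set) and computes the metrics by hand. For hard 0/1 predictions, ROC AUC reduces to the average of the true positive rate and the true negative rate, which is why it can be reproduced from these counts.

```python
# Confusion counts summed over the gender rows of the crosstab in section 16.1
tp = 593 + 457     # selected and stayed at least 5 days
fp = 820 + 847     # selected but stayed less than 5 days
fn = 1675 + 1404   # not selected but stayed at least 5 days
tn = 2631 + 2427   # not selected and stayed less than 5 days

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)
auc_hard_labels = (recall + specificity) / 2

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} auc={auc_hard_labels:.4f}")
```

These are the positive-class values; a weighted F1 would also average in the negative class.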

Precision-recall tradeoff - The aim is to identify patients who are suitable for the trial with as few mistakes as possible (precision), while also identifying as many of them as possible (recall).

Areas of improvement - We could engineer new features that might help us better predict our target patients.

16 Evaluating Potential Model Biases with Aequitas Toolkit

16.1 Prepare Data For Aequitas Bias Toolkit

Using the gender and race fields, we will prepare the data for the Aequitas Toolkit.

# Aequitas
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.plotting import Plot
from aequitas.bias import Bias
from aequitas.fairness import Fairness

ae_subset_df = pred_test_df[['race', 'gender', 'score', 'label_value']]
ae_df, _ = preprocess_input_df(ae_subset_df)
g = Group()
xtab, _ = g.get_crosstabs(ae_df)
absolute_metrics = g.list_absolute_metrics(xtab)
clean_xtab = xtab.fillna(-1)
aqp = Plot()
b = Bias()

model_id, score_thresholds 1 {‘rank_abs’: [2717]}

absolute_metrics = g.list_absolute_metrics(xtab)
xtab[[col for col in xtab.columns if col not in absolute_metrics]]

	model_id	score_threshold	k	attribute_name	attribute_value	pp	pn	fp	fn	tn	tp	group_label_pos	group_label_neg	group_size	total_entities
0	1	binary 0/1	2717	race	?	86	240	56	85	155	30	115	211	326	10854
1	1	binary 0/1	2717	race	AfricanAmerican	491	1530	291	592	938	200	792	1229	2021	10854
2	1	binary 0/1	2717	race	Asian	15	60	10	16	44	5	21	54	75	10854
3	1	binary 0/1	2717	race	Caucasian	2030	6038	1249	2298	3740	781	3079	4989	8068	10854
4	1	binary 0/1	2717	race	Hispanic	52	141	35	48	93	17	65	128	193	10854
5	1	binary 0/1	2717	race	Other	43	128	26	40	88	17	57	114	171	10854
6	1	binary 0/1	2717	gender	Female	1413	4306	820	1675	2631	593	2268	3451	5719	10854
7	1	binary 0/1	2717	gender	Male	1304	3831	847	1404	2427	457	1861	3274	5135	10854

16.2 Reference Group Selection

# Test reference group with Caucasian Male
bdf = b.get_disparity_predefined_groups(clean_xtab,
                    original_df=ae_df,
                    ref_groups_dict={'race':'Caucasian', 'gender':'Male'
                                     },
                    alpha=0.05,
                    check_significance=False)


f = Fairness()
fdf = f.get_group_value_fairness(bdf)

16.3 Race and Gender Bias Analysis for Patient Selection

# Plot two metrics
# Is there significant bias in your model for either race or gender?
fpr_disparity1 = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='race')

We notice that for most races there is no significant indication of bias. However, there is an indication that Asian patients are less likely to be identified by the model, based on the 0.4 disparity relative to the Caucasian reference group.

fpr_disparity2 = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='gender')

With gender, there does not seem to be any significant indication of bias.

16.4 Fairness Analysis Example - Relative to a Reference Group

# Reference group fairness plot
fpr_fairness = aqp.plot_fairness_group(fdf, group_metric='fpr', title=True)

Here again we can see that there appears to be a significant disparity, with the Asian group under-represented at a magnitude of 0.19.
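
For reference, the group metric plotted here is the false positive rate, fp / (fp + tn). The sketch below recomputes it per race group from the fp and tn columns of the crosstab in section 16.1; the 0.19 value is consistent with the Asian group's rate.

```python
# fp and tn counts per race group, copied from the crosstab in section 16.1
race_counts = {
    "?": {"fp": 56, "tn": 155},
    "AfricanAmerican": {"fp": 291, "tn": 938},
    "Asian": {"fp": 10, "tn": 44},
    "Caucasian": {"fp": 1249, "tn": 3740},
    "Hispanic": {"fp": 35, "tn": 93},
    "Other": {"fp": 26, "tn": 88},
}

# False positive rate: share of actual negatives that the model selected
fpr = {group: c["fp"] / (c["fp"] + c["tn"]) for group, c in race_counts.items()}

for group, value in sorted(fpr.items(), key=lambda item: item[1]):
    print(f"{group}: {value:.2f}")
```

With only 54 actual negatives in the Asian group, this rate rests on a small sample and should be read with that in mind.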
