1 Introduction
EHR data is becoming a key source of real-world evidence (RWE) for the pharmaceutical industry and regulators to make decisions on clinical trials.
For this project, we have a groundbreaking diabetes drug that is ready for clinical trial testing. It is a sensitive drug that must be administered over at least 5-7 days in the hospital, with frequent monitoring and testing, and with patient medication adherence training delivered through a mobile application. We have been provided a patient dataset from a client partner and are tasked with building a predictive model that can identify which types of patients the company should focus on when testing this drug. Target patients are people who are likely to be in the hospital for this length of time anyway, so that administering the drug and monitoring the patient will not incur significant additional costs.
To achieve this goal, we must build a regression model that predicts the expected hospitalization time for a patient, and then use this prediction to select or filter patients for the study.
2 Approach
Utilizing a synthetic dataset (denormalized at the line level augmentation) built off of the UCI Diabetes readmission dataset, we will build a regression model that predicts the expected days of hospitalization time and then convert this to a binary prediction of whether to include or exclude that patient from the clinical trial.
This project will demonstrate the importance of building the right data representation at the encounter level, with appropriate filtering and preprocessing/feature engineering of key medical code sets. We will also analyze and interpret the model for biases across key demographic groups.
3 Dataset
Due to healthcare PHI regulations (HIPAA, HITECH), there is a limited number of publicly available datasets, and some of them require training and approval before use. For the purposes of this study, we are using a modified version of a dataset from UC Irvine.
4 Dataset Loading and Schema Review
import numpy as np
import pandas as pd
dataset_path = "./data/final_project_dataset.csv"
df = pd.read_csv(dataset_path)
# Show first few rows
df.head()
encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | primary_diagnosis_code | other_diagnosis_codes | number_outpatient | number_inpatient | number_emergency | num_lab_procedures | number_diagnoses | num_medications | num_procedures | ndc_code | max_glu_serum | A1Cresult | change | readmitted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ? | Pediatrics-Endocrinology | 250.83 | ?|? | 0 | 0 | 0 | 41 | 1 | 1 | 0 | NaN | None | None | No | NO |
1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ? | ? | 276 | 250.01|255 | 0 | 0 | 0 | 59 | 9 | 18 | 0 | 68071-1701 | None | None | Ch | >30 |
2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ? | ? | 648 | 250|V27 | 2 | 1 | 0 | 11 | 6 | 13 | 5 | 0378-1110 | None | None | No | NO |
3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ? | ? | 8 | 250.43|403 | 0 | 0 | 0 | 44 | 7 | 16 | 1 | 68071-1701 | None | None | Ch | NO |
4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ? | ? | 197 | 157|250 | 0 | 0 | 0 | 51 | 5 | 8 | 0 | 0049-4110 | None | None | Ch | NO |
4.1 Determine Level of Dataset (Line or Encounter)
There are only 101766 unique encounter_id values, yet there are 143424 non-null rows, so the dataset appears to be at the line level.
We would also want to aggregate on the primary_diagnosis_code as there is also only one of these per encounter. By aggregating on these 3 columns, we can create an encounter level dataset.
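The row-count comparison above can be sketched on a toy dataframe. This is a minimal illustration only: `toy_df` and the helper `dataset_level` are hypothetical names that do not appear in the project code.

```python
import pandas as pd

# Toy line-level data: encounter 1 appears twice (two medication lines).
toy_df = pd.DataFrame({
    "encounter_id": [1, 1, 2, 3],
    "patient_nbr": [10, 10, 11, 12],
    "ndc_code": ["0087-6060", "0087-6063", "0087-6060", None],
})

def dataset_level(df, encounter_col="encounter_id"):
    """Return 'encounter' if every row is a distinct encounter, else 'line'."""
    n_rows = len(df)
    n_encounters = df[encounter_col].nunique()
    return "encounter" if n_rows == n_encounters else "line"

level = dataset_level(toy_df)
print(level)
```

More rows than unique encounters means at least one encounter is spread over several lines, which is the situation described above.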
5 Analyze Dataset
# Look at range of values & key stats for numerical columns
numerical_feature_list = ['time_in_hospital', 'number_outpatient', 'number_inpatient', 'number_emergency', 'num_lab_procedures', 'number_diagnoses', 'num_medications', 'num_procedures' ]
df[numerical_feature_list].describe()
time_in_hospital | number_outpatient | number_inpatient | number_emergency | num_lab_procedures | number_diagnoses | num_medications | num_procedures | |
---|---|---|---|---|---|---|---|---|
count | 143424.000000 | 143424.000000 | 143424.000000 | 143424.000000 | 143424.000000 | 143424.000000 | 143424.000000 | 143424.000000 |
mean | 4.490190 | 0.362429 | 0.600855 | 0.195086 | 43.255745 | 7.424434 | 16.776035 | 1.349021 |
std | 2.999667 | 1.249295 | 1.207934 | 0.920410 | 19.657319 | 1.924872 | 8.397130 | 1.719104 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 32.000000 | 6.000000 | 11.000000 | 0.000000 |
50% | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 | 8.000000 | 15.000000 | 1.000000 |
75% | 6.000000 | 0.000000 | 1.000000 | 0.000000 | 57.000000 | 9.000000 | 21.000000 | 2.000000 |
max | 14.000000 | 42.000000 | 21.000000 | 76.000000 | 132.000000 | 16.000000 | 81.000000 | 6.000000 |
# Define utility functions
def create_cardinality_feature(df):
    num_rows = len(df)
    random_code_list = np.arange(100, 1000, 1)
    return np.random.choice(random_code_list, num_rows)
def count_unique_values(df, cat_col_list):
    cat_df = df[cat_col_list]
    cat_df['principal_diagnosis_code'] = create_cardinality_feature(cat_df)
    #add feature with high cardinality
    val_df = pd.DataFrame({'columns': cat_df.columns,
                           'cardinality': cat_df.nunique() } )
    return val_df
categorical_feature_list = [ 'race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty', 'primary_diagnosis_code', 'other_diagnosis_codes','ndc_code', 'max_glu_serum', 'A1Cresult', 'change', 'readmitted']
categorical_df = count_unique_values(df, categorical_feature_list)
categorical_df
columns | cardinality | |
---|---|---|
race | race | 6 |
gender | gender | 3 |
age | age | 10 |
weight | weight | 10 |
payer_code | payer_code | 18 |
medical_specialty | medical_specialty | 73 |
primary_diagnosis_code | primary_diagnosis_code | 717 |
other_diagnosis_codes | other_diagnosis_codes | 19374 |
ndc_code | ndc_code | 251 |
max_glu_serum | max_glu_serum | 4 |
A1Cresult | A1Cresult | 4 |
change | change | 2 |
readmitted | readmitted | 3 |
principal_diagnosis_code | principal_diagnosis_code | 900 |
5.1 Analysis key findings
- The ndc_code field has a large number of missing values (23460).
- num_lab_procedures and num_medications seem to have a roughly normal distribution
- Fields with high cardinality are medical_specialty, primary_diagnosis_code, other_diagnosis_codes, ndc_code, and principal_diagnosis_code. This is because there are many thousands of these codes, corresponding to the many disease and diagnosis sub-classes that exist in the medical field.
- The distribution for the age field is approximately normal, which we would expect. The distribution for the gender field is roughly uniform, if we discount the very small number of Unknown/Invalid cases. Again this is not surprising, as the distribution of genders in the general population is also roughly equal, so this seems to be a representative sample of the general population.
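Missing-value counts like the one quoted for ndc_code can be obtained as sketched below. The sketch assumes that the '?' placeholder seen in the schema preview should be counted as missing alongside NaN; `toy_df` is made-up data, not the project dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows mixing the '?' placeholder with real NaN values.
toy_df = pd.DataFrame({
    "weight": ["?", "?", "[75-100)", "?"],
    "payer_code": ["MC", "?", "HM", "MC"],
    "ndc_code": ["0087-6060", np.nan, np.nan, "0087-6063"],
})

# Treat '?' as missing so it is counted together with NaN.
missing_counts = toy_df.replace("?", np.nan).isnull().sum()
print(missing_counts)
```

Counting '?' and NaN together avoids under-reporting fields such as weight, where the placeholder rather than NaN marks the gaps.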
6 Reduce Dimensionality of the NDC Code Feature
NDC codes are a common format to represent the wide variety of drugs that are prescribed for patient care in the United States. The challenge is that there are many codes that map to the same or similar drug. We are provided with the NDC drug lookup file https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ndc_lookup_table.csv derived from the National Drug Codes List site (https://ndclist.com/).
We can use this file to come up with a way to reduce the dimensionality of this field and create a new field in the dataset called “generic_drug_name” in the output dataframe.
#NDC code lookup file
ndc_code_path = "./medication_lookup_tables/final_ndc_lookup_table"
ndc_code_df = pd.read_csv(ndc_code_path)
# Check first few rows
ndc_code_df.head()
NDC_Code | Proprietary Name | Non-proprietary Name | Dosage Form | Route Name | Company Name | Product Type | |
---|---|---|---|---|---|---|---|
0 | 0087-6060 | Glucophage | Metformin Hydrochloride | Tablet, Film Coated | Oral | Bristol-myers Squibb Company | Human Prescription Drug |
1 | 0087-6063 | Glucophage XR | Metformin Hydrochloride | Tablet, Extended Release | Oral | Bristol-myers Squibb Company | Human Prescription Drug |
2 | 0087-6064 | Glucophage XR | Metformin Hydrochloride | Tablet, Extended Release | Oral | Bristol-myers Squibb Company | Human Prescription Drug |
3 | 0087-6070 | Glucophage | Metformin Hydrochloride | Tablet, Film Coated | Oral | Bristol-myers Squibb Company | Human Prescription Drug |
4 | 0087-6071 | Glucophage | Metformin Hydrochloride | Tablet, Film Coated | Oral | Bristol-myers Squibb Company | Human Prescription Drug |
# Check for duplicate NDC_Code's
ndc_code_df[ndc_code_df.duplicated(subset=['NDC_Code'])]
NDC_Code | Proprietary Name | Non-proprietary Name | Dosage Form | Route Name | Company Name | Product Type | |
---|---|---|---|---|---|---|---|
263 | 0781-5634 | Pioglitazone Hydrochloride And Glimepiride | Pioglitazone Hydrochloride And Glimepiride | Tablet | Oral | Sandoz Inc | Human Prescription Drug |
264 | 0781-5635 | Pioglitazone Hydrochloride And Glimepiride | Pioglitazone Hydrochloride And Glimepiride | Tablet | Oral | Sandoz Inc | Human Prescription Drug |
# Remove duplicates
ndc_code_df = ndc_code_df.drop(ndc_code_df.index[[263,264]])
ndc_code_df[ndc_code_df.duplicated(subset=['NDC_Code'])]
NDC_Code | Proprietary Name | Non-proprietary Name | Dosage Form | Route Name | Company Name | Product Type |
---|
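With the lookup table deduplicated, the mapping from ndc_code to a “generic_drug_name” field can be sketched as follows. The step that builds the reduce_dim_df dataframe used in the next section is not shown in this write-up, so `reduce_dimension_ndc` below is a hypothetical helper, and the third lookup row is an invented code used only to make the example non-trivial.

```python
import pandas as pd

def reduce_dimension_ndc(df, ndc_df):
    """Add a generic_drug_name column by looking up each ndc_code."""
    lookup = ndc_df.drop_duplicates(subset=["NDC_Code"])
    mapping = dict(zip(lookup["NDC_Code"], lookup["Non-proprietary Name"]))
    out = df.copy()
    # Codes absent from the lookup (or missing) become NaN.
    out["generic_drug_name"] = out["ndc_code"].map(mapping)
    return out

toy_ndc_df = pd.DataFrame({
    "NDC_Code": ["0087-6060", "0087-6063", "0000-0001"],  # last code is invented
    "Non-proprietary Name": ["Metformin Hydrochloride",
                             "Metformin Hydrochloride",
                             "Toy Drug"],
})
toy_df = pd.DataFrame({"ndc_code": ["0087-6060", "0087-6063", "0000-0001", None]})

toy_reduced_df = reduce_dimension_ndc(toy_df, toy_ndc_df)
print(toy_reduced_df["ndc_code"].nunique(), "->",
      toy_reduced_df["generic_drug_name"].nunique())
```

Several NDC codes collapse onto one non-proprietary name, which is how the cardinality of the drug feature is reduced.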
7 Select First Encounter for each Patient
In order to simplify the aggregation of data for the model, we will only select the first encounter for each patient in the dataset. This is to reduce the risk of data leakage of future patient encounters and to reduce complexity of the data transformation and modeling steps. We will assume that sorting in numerical order on the encounter_id provides the time horizon for determining which encounters come before and after another.
from student_utils import select_first_encounter
first_encounter_df = select_first_encounter(reduce_dim_df)
# unique patients in transformed dataset
unique_patients = first_encounter_df['patient_nbr'].nunique()
print("Number of unique patients:{}".format(unique_patients))
# unique encounters in transformed dataset
unique_encounters = first_encounter_df['encounter_id'].nunique()
print("Number of unique encounters:{}".format(unique_encounters))
original_unique_patient_number = reduce_dim_df['patient_nbr'].nunique()
# number of unique patients should be equal to the number of unique encounters and patients in the final dataset
assert original_unique_patient_number == unique_patients
assert original_unique_patient_number == unique_encounters
- Number of unique patients:71518
- Number of unique encounters:71518
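The body of select_first_encounter lives in student_utils and is not shown here. One possible implementation, under the stated assumption that a lower encounter_id means an earlier encounter, is sketched below; the function name and toy data are hypothetical.

```python
import pandas as pd

def select_first_encounter_sketch(df):
    """Keep every line that belongs to each patient's lowest encounter_id."""
    first_ids = df.groupby("patient_nbr")["encounter_id"].min()
    return df[df["encounter_id"].isin(first_ids)].reset_index(drop=True)

# Patient 10 has two encounters (1 and 5); encounter 1 spans two lines.
toy_df = pd.DataFrame({
    "encounter_id": [1, 1, 5, 2],
    "patient_nbr": [10, 10, 10, 11],
    "generic_drug_name": ["Metformin Hydrochloride", "Toy Drug",
                          "Toy Drug", "Metformin Hydrochloride"],
})

toy_first_df = select_first_encounter_sketch(toy_df)
print(toy_first_df)
```

Filtering on encounter_id rather than taking the first row per patient keeps all medication lines of the first encounter, which matters because the data is still at the line level at this point.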
8 Aggregate Dataset to Right Level for Modelling
To keep things simple, we create a dummy column for each unique generic drug name and add those as input features to the model.
exclusion_list = ['generic_drug_name']
grouping_field_list = [c for c in first_encounter_df.columns if c not in exclusion_list]
agg_drug_df, ndc_col_list = aggregate_dataset(first_encounter_df, grouping_field_list, 'generic_drug_name')
assert len(agg_drug_df) == agg_drug_df['patient_nbr'].nunique() == agg_drug_df['encounter_id'].nunique()
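The aggregate_dataset helper is provided with the project and its body is not shown. The sketch below is one way to get the same shape of result (one row per encounter plus a list of dummy column names); it is an assumption about the behaviour, not the project's implementation.

```python
import pandas as pd

def aggregate_dataset_sketch(df, grouping_field_list, array_field):
    """Collapse line-level rows to one row per group, one 0/1 column per drug."""
    dummies = pd.get_dummies(df[array_field], dtype=int)
    dummies.columns = [c.replace(" ", "_") for c in dummies.columns]
    agg_df = (
        pd.concat([df[grouping_field_list], dummies], axis=1)
        .groupby(grouping_field_list, as_index=False)
        .max()  # 1 if the drug appears on any line of the encounter
    )
    return agg_df, list(dummies.columns)

toy_df = pd.DataFrame({
    "encounter_id": [1, 1, 2],
    "patient_nbr": [10, 10, 11],
    "generic_drug_name": ["Metformin Hydrochloride", "Glipizide",
                          "Metformin Hydrochloride"],
})

toy_agg_df, toy_ndc_col_list = aggregate_dataset_sketch(
    toy_df, ["encounter_id", "patient_nbr"], "generic_drug_name")
print(toy_agg_df)
```

Note that a default pandas groupby drops rows whose grouping keys are missing, so on real data the grouping fields would need to be imputed or filtered first.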
9 Prepare Fields and Cast Dataset
9.1 Feature Selection
# Look at counts for payer_code categories
ax = sns.countplot(x="payer_code", data=agg_drug_df)
# Look at counts for weight categories
ax = sns.countplot(x="weight", data=agg_drug_df)
From the category counts above, we can see that for payer_code, while there are many unknown values (i.e. ‘?’), there are still many values for the other payer codes, so these may prove useful predictors for our target variable. For weight, there are so few values other than the unknown ‘?’ code that this feature is unlikely to be very helpful for predicting our target variable.
# Selected features
required_demo_col_list = ['race', 'gender', 'age']
student_categorical_col_list = [ "change", "readmitted", "payer_code", "medical_specialty", "primary_diagnosis_code", "other_diagnosis_codes", "max_glu_serum", "A1Cresult", "admission_type_id", "discharge_disposition_id", "admission_source_id"] + required_demo_col_list + ndc_col_list
student_numerical_col_list = ["number_outpatient", "number_inpatient", "number_emergency", "num_lab_procedures", "number_diagnoses", "num_medications", "num_procedures"]
PREDICTOR_FIELD = 'time_in_hospital'
def select_model_features(df, categorical_col_list, numerical_col_list, PREDICTOR_FIELD, grouping_key='patient_nbr'):
    selected_col_list = [grouping_key] + [PREDICTOR_FIELD] + categorical_col_list + numerical_col_list
    # Use the dataframe passed in rather than the global agg_drug_df
    return df[selected_col_list]
selected_features_df = select_model_features(agg_drug_df, student_categorical_col_list, student_numerical_col_list,
                                             PREDICTOR_FIELD)
9.2 Preprocess Dataset - Casting and Imputing
We will cast and impute the dataset before splitting so that we do not have to repeat these steps across the splits in the next step. For imputing, there can be deeper analysis into which features to impute and how to impute but for the sake of time, we are taking a general strategy of imputing zero for only numerical features.
processed_df = preprocess_df(selected_features_df, student_categorical_col_list,
                             student_numerical_col_list, PREDICTOR_FIELD, categorical_impute_value='nan', numerical_impute_value=0)
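The preprocess_df helper ships with the project utilities and is not reproduced here. A minimal sketch of the casting and imputing strategy described above, assuming categorical fields are cast to strings and numerical fields to floats, might look like this (the function name and toy data are hypothetical):

```python
import numpy as np
import pandas as pd

def preprocess_df_sketch(df, categorical_col_list, numerical_col_list, predictor,
                         categorical_impute_value="nan", numerical_impute_value=0):
    """Cast columns to model-friendly types and fill missing values."""
    out = df.copy()
    out[predictor] = out[predictor].astype(float)
    for c in categorical_col_list:
        out[c] = out[c].fillna(categorical_impute_value).astype(str)
    for c in numerical_col_list:
        out[c] = out[c].fillna(numerical_impute_value).astype(float)
    return out

toy_df = pd.DataFrame({
    "time_in_hospital": [3, 7],
    "payer_code": ["MC", None],
    "num_procedures": [1, np.nan],
})

toy_processed_df = preprocess_df_sketch(
    toy_df, ["payer_code"], ["num_procedures"], "time_in_hospital")
print(toy_processed_df)
```

Imputing zero for a count such as num_procedures is a simplification; a missing count is not necessarily a zero count.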
10 Split Dataset into Train, Validation, and Test Partitions
To prepare the data for training and evaluating a deep learning model, we will split the dataset into three partitions, with the validation partition used for optimizing the model hyperparameters during training. A key requirement is that data does not accidentally leak across partitions.
We will split the input dataset into three partitions (train, validation, test) with the following requirements:
- Approximately 60%/20%/20% train/validation/test split
- Randomly sample different patients into each data partition
- Make sure that a patient’s data is not in more than one partition, to avoid possible data leakage.
- Make sure that the total number of unique patients across the splits equals the total number of unique patients in the original dataset
- Total number of rows in original dataset = sum of rows across all three dataset partitions
from student_utils import patient_dataset_splitter
d_train, d_val, d_test = patient_dataset_splitter(processed_df, 'patient_nbr')
- Total number of unique patients in train = 32563
- Total number of unique patients in validation = 10854
- Total number of unique patients in test = 10854
- Training partition has a shape = (32563, 43)
- Validation partition has a shape = (10854, 43)
- Test partition has a shape = (10854, 43)
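The splitter itself is in student_utils and not shown. The requirements listed above can be met by shuffling unique patient identifiers and slicing them 60/20/20, as in the sketch below; the function name, the seed parameter and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def patient_dataset_splitter_sketch(df, patient_key="patient_nbr", seed=42):
    """Split rows roughly 60/20/20 by patient so no patient spans partitions."""
    patients = df[patient_key].unique()
    shuffled = np.random.default_rng(seed).permutation(patients)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    id_groups = (shuffled[:n_train],
                 shuffled[n_train:n_train + n_val],
                 shuffled[n_train + n_val:])
    # Select rows by patient membership, never by row position.
    return tuple(df[df[patient_key].isin(ids)].reset_index(drop=True)
                 for ids in id_groups)

# 10 patients; the first five have two rows each.
toy_df = pd.DataFrame({
    "patient_nbr": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10],
    "time_in_hospital": range(15),
})

toy_train, toy_val, toy_test = patient_dataset_splitter_sketch(toy_df)
print(len(toy_train), len(toy_val), len(toy_test))
```

Splitting on patient identifiers rather than on rows is what guarantees that a patient cannot appear in two partitions, even if the dataset has several rows per patient.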
11 Demographic Representation Analysis of Split
After the split, we should check the distribution of key features/groups and make sure that there are representative samples across the partitions.
11.1 Label Distribution Across Partitions
Are the histogram distribution shapes similar across partitions?
show_group_stats_viz(processed_df, PREDICTOR_FIELD)
show_group_stats_viz(d_train, PREDICTOR_FIELD)
show_group_stats_viz(d_test, PREDICTOR_FIELD)
11.2 Demographic Group Analysis
We should check that our partitions/splits of the dataset are similar in terms of their demographic profiles.
# Full dataset before splitting
patient_demo_features = ['race', 'gender', 'age', 'patient_nbr']
patient_group_analysis_df = processed_df[patient_demo_features].groupby('patient_nbr').head(1).reset_index(drop=True)
show_group_stats_viz(patient_group_analysis_df, 'gender')
# Training partition
show_group_stats_viz(d_train, 'gender')
# Test partition
show_group_stats_viz(d_test, 'gender')
12 Convert Dataset Splits to TF Dataset
# Convert dataset from Pandas dataframes to TF dataset
batch_size = 128
diabetes_train_ds = df_to_dataset(d_train, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_val_ds = df_to_dataset(d_val, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_test_ds = df_to_dataset(d_test, PREDICTOR_FIELD, batch_size=batch_size)
# We use this sample of the dataset to show transformations later
diabetes_batch = next(iter(diabetes_train_ds))[0]
def demo(feature_column, example_batch):
    feature_layer = layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch))
13 Create Features
13.1 Create Categorical Features with TF Feature Columns
Before we can create the TF categorical features, we must first create the vocab files with the unique values for a given field that are from the training dataset.
# Build Vocabulary for Categorical Features
vocab_file_list = build_vocab_files(d_train, student_categorical_col_list)
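The build_vocab_files helper is part of the project utilities. As a rough sketch of what such a step involves, the code below writes one text file per categorical column containing the unique values seen in the training split. The file naming, directory handling and function name are assumptions, not the project's implementation.

```python
import os
import tempfile
import pandas as pd

def build_vocab_files_sketch(train_df, categorical_col_list, vocab_dir):
    """Write one vocabulary text file per categorical column; return the paths."""
    os.makedirs(vocab_dir, exist_ok=True)
    paths = []
    for c in categorical_col_list:
        path = os.path.join(vocab_dir, c + "_vocab.txt")
        values = sorted(train_df[c].astype(str).unique())
        with open(path, "w") as fh:
            fh.write("\n".join(values) + "\n")
        paths.append(path)
    return paths

toy_train_df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "change": ["Ch", "No", "No"],
})

toy_vocab_dir = tempfile.mkdtemp()
toy_vocab_files = build_vocab_files_sketch(
    toy_train_df, ["gender", "change"], toy_vocab_dir)
print(toy_vocab_files)
```

Building the vocabulary from the training split only means that values seen only in validation or test data are treated as out-of-vocabulary, which mirrors what happens with unseen values in production.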
13.2 Create Categorical Features with Tensorflow Feature Column API
from student_utils import create_tf_categorical_feature_cols
tf_cat_col_list = create_tf_categorical_feature_cols(student_categorical_col_list)
test_cat_var1 = tf_cat_col_list[0]
print("Example categorical field:\n{}".format(test_cat_var1))
demo(test_cat_var1, diabetes_batch)
13.3 Create Numerical Features with TF Feature Columns
from student_utils import create_tf_numeric_feature
def calculate_stats_from_train_data(df, col):
    mean = df[col].describe()['mean']
    std = df[col].describe()['std']
    return mean, std
def create_tf_numerical_feature_cols(numerical_col_list, train_df):
    tf_numeric_col_list = []
    for c in numerical_col_list:
        mean, std = calculate_stats_from_train_data(train_df, c)
        tf_numeric_feature = create_tf_numeric_feature(c, mean, std)
        tf_numeric_col_list.append(tf_numeric_feature)
    return tf_numeric_col_list
tf_cont_col_list = create_tf_numerical_feature_cols(student_numerical_col_list, d_train)
test_cont_var1 = tf_cont_col_list[0]
print("Example continuous field:\n{}\n".format(test_cont_var1))
demo(test_cont_var1, diabetes_batch)
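The mean and standard deviation computed above are passed to create_tf_numeric_feature, whose body is not shown. A common use for these statistics is z-score normalization, for example by binding them into a function that could be supplied as the `normalizer_fn` argument of `tf.feature_column.numeric_column`. The sketch below shows only the normalizer, in plain Python, and assumes that this is how the statistics are used.

```python
import functools

def normalize_numeric_with_zscore(col, mean, std):
    """Standardise a value using statistics computed on the training split."""
    return (col - mean) / std

# Illustrative numbers: the mean and std reported earlier for
# num_lab_procedures on the full dataset (not the training split).
toy_mean, toy_std = 43.255745, 19.657319
normalizer = functools.partial(
    normalize_numeric_with_zscore, mean=toy_mean, std=toy_std)

z = normalizer(44.0)
print(round(z, 4))
```

Because the statistics are bound once from the training split, the same transformation is applied to validation and test data without peeking at their distributions.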
14 Build Deep Learning Regression Model with Sequential API and TF Probability Layers
14.1 Use DenseFeatures to combine features for model
Now that we have prepared categorical and numerical features using TensorFlow’s Feature Column API, we can combine them into a dense vector representation for the model. Below we create this new input layer, which we call ‘claim_feature_layer’.
claim_feature_columns = tf_cat_col_list + tf_cont_col_list
claim_feature_layer = tf.keras.layers.DenseFeatures(claim_feature_columns)
14.2 Build Sequential API Model from DenseFeatures and TF Probability Layers
def build_sequential_model(feature_layer):
    model = tf.keras.Sequential([
        feature_layer,
        tf.keras.layers.Dense(150, activation='relu'),
        tf.keras.layers.Dense(200, activation='relu'),# New
        tf.keras.layers.Dense(75, activation='relu'),
        tfp.layers.DenseVariational(1+1, posterior_mean_field, prior_trainable),
        tfp.layers.DistributionLambda(
            lambda t:tfp.distributions.Normal(loc=t[..., :1],
                                             scale=1e-3 + tf.math.softplus(0.01 * t[...,1:])
                                             )
        ),
    ])
    return model
def build_diabetes_model(train_ds, val_ds, feature_layer, epochs=5, loss_metric='mse'):
    model = build_sequential_model(feature_layer)
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    model.compile(optimizer=opt, loss=loss_metric, metrics=[loss_metric])
    #model.compile(optimizer='rmsprop', loss=loss_metric, metrics=[loss_metric])
    #early_stop = tf.keras.callbacks.EarlyStopping(monitor=loss_metric, patience=3)
    history = model.fit(train_ds, validation_data=val_ds,
                        #callbacks=[early_stop],
                        epochs=epochs)
    return model, history
diabetes_model, history = build_diabetes_model(diabetes_train_ds, diabetes_val_ds, claim_feature_layer, epochs=10)
14.3 Show Model Uncertainty Range with TF Probability
Now that we have trained a model with TF Probability layers, we can extract the mean and standard deviation for each prediction.
feature_list = student_categorical_col_list + student_numerical_col_list
diabetes_x_tst = dict(d_test[feature_list])
diabetes_yhat = diabetes_model(diabetes_x_tst)
preds = diabetes_model.predict(diabetes_test_ds)
from student_utils import get_mean_std_from_preds
m, s = get_mean_std_from_preds(diabetes_yhat)
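The get_mean_std_from_preds helper is not shown; with a TensorFlow Probability distribution as model output it would typically read the distribution's mean and standard deviation. To make clear where those two numbers come from, the NumPy sketch below reproduces the parameterization used in the DistributionLambda layer of the model: the first output column is the mean and the second is passed through a softplus to give a strictly positive scale. The function names and raw values are illustrative only.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def normal_params_from_raw(t):
    """Mirror the DistributionLambda: loc from column 0, scale from column 1."""
    t = np.asarray(t, dtype=float)
    loc = t[..., :1]
    scale = 1e-3 + softplus(0.01 * t[..., 1:])
    return loc, scale

# Two made-up rows of raw two-unit output from the variational layer.
raw = np.array([[4.5, 0.0], [6.0, 50.0]])
loc, scale = normal_params_from_raw(raw)
print(loc.ravel(), scale.ravel())
```

The small constant added to the softplus keeps the scale away from zero, and the 0.01 factor makes the scale change slowly with the raw output.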
14.4 Show Prediction Output
prob_outputs = {
    "pred": preds.flatten(),
    "actual_value": d_test['time_in_hospital'].values,
    "pred_mean": m.numpy().flatten(),
    "pred_std": s.numpy().flatten()
}
prob_output_df = pd.DataFrame(prob_outputs)
prob_output_df.head()
pred | actual_value | pred_mean | pred_std | |
---|---|---|---|---|
0 | 3.587955 | 3.0 | 4.673843 | 0.693749 |
1 | 5.007016 | 2.0 | 4.673843 | 0.693749 |
2 | 4.809363 | 9.0 | 4.673843 | 0.693749 |
3 | 5.003417 | 2.0 | 4.673843 | 0.693749 |
4 | 5.346958 | 8.0 | 4.673843 | 0.693749 |
prob_output_df.describe()
pred | actual_value | pred_mean | pred_std | |
---|---|---|---|---|
count | 10854.000000 | 10854.000000 | 10854.000000 | 10854.000000 |
mean | 4.376980 | 4.429888 | 4.673843 | 0.693749 |
std | 0.908507 | 3.002044 | 0.000000 | 0.000000 |
min | 0.976290 | 1.000000 | 4.673843 | 0.693749 |
25% | 3.755292 | 2.000000 | 4.673843 | 0.693749 |
50% | 4.382993 | 4.000000 | 4.673843 | 0.693749 |
75% | 5.002859 | 6.000000 | 4.673843 | 0.693749 |
max | 7.529900 | 14.000000 | 4.673843 | 0.693749 |
14.5 Convert Regression Output to Classification Output for Patient Selection
from student_utils import get_student_binary_prediction
student_binary_prediction = get_student_binary_prediction(prob_output_df, 'pred')
student_binary_prediction.value_counts()
- 0:8137
- 1:2717
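The conversion function lives in student_utils and is not shown. A minimal sketch, assuming the same 5-day cut-off that is used for label_value in the next section, is given below; the function name is hypothetical and the input values are the first five predictions from the table in 14.4.

```python
import pandas as pd

def get_binary_prediction_sketch(df, col, threshold=5):
    """Return 1 where the predicted stay is at least `threshold` days, else 0."""
    return (df[col] >= threshold).astype(int)

toy_output_df = pd.DataFrame(
    {"pred": [3.587955, 5.007016, 4.809363, 5.003417, 5.346958]})

toy_binary_prediction = get_binary_prediction_sketch(toy_output_df, "pred")
print(toy_binary_prediction.tolist())
```

Using the same threshold for the prediction and for the ground-truth label keeps the two binary fields directly comparable.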
14.6 Add Binary Prediction to Test Dataframe
The student_binary_prediction output is a numpy array of binary labels. We add it to a dataframe to make it easier to inspect and to prepare the data for the Aequitas toolkit, which requires the prediction to be mapped to a binary label (the ‘score’ field) alongside a binary actual value (the ‘label_value’ field).
def add_pred_to_test(test_df, pred_np, demo_col_list):
    for c in demo_col_list:
        test_df[c] = test_df[c].astype(str)
    test_df['score'] = pred_np
    test_df['label_value'] = test_df['time_in_hospital'].apply(lambda x: 1 if x >=5 else 0)
    return test_df
pred_test_df = add_pred_to_test(d_test, student_binary_prediction, ['race', 'gender'])
pred_test_df[['patient_nbr', 'gender', 'race', 'time_in_hospital', 'score', 'label_value']].head()
patient_nbr | gender | race | time_in_hospital | score | label_value | |
---|---|---|---|---|---|---|
0 | 122896787 | Male | Caucasian | 3.0 | 0 | 0 |
1 | 102598929 | Male | Caucasian | 2.0 | 1 | 0 |
2 | 80367957 | Male | Caucasian | 9.0 | 0 | 1 |
3 | 6721533 | Male | Caucasian | 2.0 | 1 | 0 |
4 | 104346288 | Female | Caucasian | 8.0 | 1 | 1 |
15 Model Evaluation Metrics
Now it is time to use the newly created binary labels in the ‘pred_test_df’ dataframe to evaluate the model with some common classification metrics. We will create a report summary of the performance of the model and give the ROC AUC, F1 score(weighted), class precision and recall scores.
# AUC, F1, precision and recall
# Summary
from sklearn.metrics import accuracy_score, roc_auc_score
y_true = pred_test_df['label_value'].values
y_pred = pred_test_df['score'].values
accuracy_score(y_true, y_pred)
- 0.5627418463239359
roc_auc_score(y_true, y_pred)
- 0.5032089104088319
Precision-recall tradeoff - The aim is to select patients who are right for the trial with as few mistakes as possible (precision), while also identifying as many of them as possible (recall). With an accuracy of about 0.56 and a ROC AUC of about 0.50, however, the current model separates the two classes barely better than chance.
Areas of improvement - We could engineer new features that might help us better predict our target patients.
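Precision, recall and F1 are named in this section but not computed. They can be derived by hand from confusion counts, as sketched below. The counts are taken from the Aequitas crosstab in section 16.1 by summing the Female and Male rows; the variable names are illustrative.

```python
# Confusion counts on the test partition (Female + Male rows of the crosstab).
tp = 593 + 457
fp = 820 + 847
fn = 1675 + 1404
tn = 2631 + 2427

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# For hard 0/1 predictions the ROC curve has a single operating point,
# so its area equals the mean of recall and specificity.
roc_auc_binary = (recall + specificity) / 2

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1),
                    ("roc_auc", roc_auc_binary)]:
    print("{}: {:.4f}".format(name, value))
```

The accuracy and ROC AUC obtained this way agree with the values reported above, which is a useful cross-check that the crosstab and the metric calls describe the same predictions.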
16 Evaluating Potential Model Biases with Aequitas Toolkit
16.1 Prepare Data For Aequitas Bias Toolkit
Using the gender and race fields, we will prepare the data for the Aequitas Toolkit.
# Aequitas
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.plotting import Plot
from aequitas.bias import Bias
from aequitas.fairness import Fairness
ae_subset_df = pred_test_df[['race', 'gender', 'score', 'label_value']]
ae_df, _ = preprocess_input_df(ae_subset_df)
g = Group()
xtab, _ = g.get_crosstabs(ae_df)
absolute_metrics = g.list_absolute_metrics(xtab)
clean_xtab = xtab.fillna(-1)
aqp = Plot()
b = Bias()
- model_id, score_thresholds 1 {‘rank_abs’: [2717]}
absolute_metrics = g.list_absolute_metrics(xtab)
xtab[[col for col in xtab.columns if col not in absolute_metrics]]
model_id | score_threshold | k | attribute_name | attribute_value | pp | pn | fp | fn | tn | tp | group_label_pos | group_label_neg | group_size | total_entities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | binary 0/1 | 2717 | race | ? | 86 | 240 | 56 | 85 | 155 | 30 | 115 | 211 | 326 | 10854 |
1 | 1 | binary 0/1 | 2717 | race | AfricanAmerican | 491 | 1530 | 291 | 592 | 938 | 200 | 792 | 1229 | 2021 | 10854 |
2 | 1 | binary 0/1 | 2717 | race | Asian | 15 | 60 | 10 | 16 | 44 | 5 | 21 | 54 | 75 | 10854 |
3 | 1 | binary 0/1 | 2717 | race | Caucasian | 2030 | 6038 | 1249 | 2298 | 3740 | 781 | 3079 | 4989 | 8068 | 10854 |
4 | 1 | binary 0/1 | 2717 | race | Hispanic | 52 | 141 | 35 | 48 | 93 | 17 | 65 | 128 | 193 | 10854 |
5 | 1 | binary 0/1 | 2717 | race | Other | 43 | 128 | 26 | 40 | 88 | 17 | 57 | 114 | 171 | 10854 |
6 | 1 | binary 0/1 | 2717 | gender | Female | 1413 | 4306 | 820 | 1675 | 2631 | 593 | 2268 | 3451 | 5719 | 10854 |
7 | 1 | binary 0/1 | 2717 | gender | Male | 1304 | 3831 | 847 | 1404 | 2427 | 457 | 1861 | 3274 | 5135 | 10854 |
16.2 Reference Group Selection
# Test reference group with Caucasian Male
bdf = b.get_disparity_predefined_groups(clean_xtab,
                    original_df=ae_df,
                    ref_groups_dict={'race':'Caucasian', 'gender':'Male'
                                     },
                    alpha=0.05,
                    check_significance=False)
f = Fairness()
fdf = f.get_group_value_fairness(bdf)
16.3 Race and Gender Bias Analysis for Patient Selection
# Plot two metrics
# Is there significant bias in your model for either race or gender?
fpr_disparity1 = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='race')
We notice that for most races there is no significant indication of bias. However, there is an indication that Asians are less likely to be identified by the model, based on the 0.4 disparity in relation to the Caucasian reference group.
fpr_disparity2 = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='gender')
With gender, there does not seem to be any significant indication of bias.
16.4 Fairness Analysis Example - Relative to a Reference Group
# Reference group fairness plot
fpr_fairness = aqp.plot_fairness_group(fdf, group_metric='fpr', title=True)
Here again we can see that there appears to be significant disparity, with the Asian race being under-represented with a magnitude of 0.19.
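The group-level false positive rates behind these plots can be recomputed by hand from the crosstab in section 16.1, using FPR = fp / group_label_neg for each group and dividing by the reference group's FPR to get a disparity ratio. The sketch below copies the race rows of that table; the dictionary layout and variable names are illustrative.

```python
# fp and group_label_neg per race, copied from the crosstab in section 16.1.
race_counts = {
    "?": {"fp": 56, "group_label_neg": 211},
    "AfricanAmerican": {"fp": 291, "group_label_neg": 1229},
    "Asian": {"fp": 10, "group_label_neg": 54},
    "Caucasian": {"fp": 1249, "group_label_neg": 4989},
    "Hispanic": {"fp": 35, "group_label_neg": 128},
    "Other": {"fp": 26, "group_label_neg": 114},
}
reference_group = "Caucasian"

# False positive rate: share of true negatives that the model selected anyway.
fpr = {group: c["fp"] / c["group_label_neg"]
       for group, c in race_counts.items()}

# Disparity: each group's FPR relative to the reference group's FPR.
fpr_disparity = {group: rate / fpr[reference_group]
                 for group, rate in fpr.items()}

for group in race_counts:
    print("{:<16} fpr={:.2f}".format(group, fpr[group]))
```

The Asian group has only 54 negative-label patients in the test partition, so its rate rests on very few cases and should be read with caution.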