Of Genomes and Genetics

Portfolio of Alejandro O. Valdés, a mathematician specialized in data science.


In this HackerEarth challenge, you are given a dataset containing medical information about children who have genetic disorders.

The task is to predict the following:

  • Disorder subclass.
  • Genetic disorder.

The evaluation metrics are as follows:

Genetic Disorder


score1 = max(0, 100*metrics.f1_score(actual["Genetic Disorder"], predicted["Genetic Disorder"], average="macro"))

Disorder Subclass


score2 = max(0, 100*metrics.f1_score(actual["Disorder Subclass"], predicted["Disorder Subclass"], average="macro"))

Final score


score = (score1/2)+(score2/2)
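The three formulas above can be wrapped into a single function. This is a minimal sketch, assuming `actual` and `predicted` are pandas DataFrames with the two target columns; the function name `challenge_score` and the toy labels are hypothetical and not taken from the challenge.

```python
import pandas as pd
from sklearn import metrics


def challenge_score(actual: pd.DataFrame, predicted: pd.DataFrame) -> float:
    """Average of the two macro-F1 scores, each scaled to the 0-100 range."""
    score1 = max(0, 100 * metrics.f1_score(
        actual["Genetic Disorder"], predicted["Genetic Disorder"], average="macro"))
    score2 = max(0, 100 * metrics.f1_score(
        actual["Disorder Subclass"], predicted["Disorder Subclass"], average="macro"))
    return (score1 / 2) + (score2 / 2)


# Toy labels (hypothetical, for illustration only).
actual = pd.DataFrame({
    "Genetic Disorder": ["A", "A", "B", "B"],
    "Disorder Subclass": ["x", "y", "x", "y"],
})
predicted = pd.DataFrame({
    "Genetic Disorder": ["A", "A", "B", "A"],   # one mistake in this column
    "Disorder Subclass": ["x", "y", "x", "y"],  # all correct
})

score = challenge_score(actual, predicted)
print(round(score, 4))
```

Because the F1 score is macro-averaged, every class counts equally regardless of how many samples it has, which is why class imbalance matters so much for this metric.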

I used two main notebooks

1_GeneticDisorder_preprocessing.ipynb

Although a much more extensive EDA was performed in the notebook named GeneticDisorderEDAv1, the most important transformations of the data for modelling took place in this one. Of the 44 columns, 13 that would not provide relevant information were discarded at the outset. It is clear from the start that the data contains many missing values. To address this, MICE (Multiple Imputation by Chained Equations) was applied, an approach also generally referred to as fully conditional specification (FCS).

This methodology is attractive if the multivariate distribution is a reasonable description of the data. FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10–20) is often sufficient.

— mice: Multivariate Imputation by Chained Equations in R, 2009.
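As an illustration of chained-equation imputation, here is a minimal sketch using scikit-learn's `IterativeImputer`, which models each incomplete feature as a function of the others in a round-robin fashion. This is an assumption made for the example: the notebook may have used a different MICE implementation, and the toy data below is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: column 2 is strongly correlated with column 0,
# so the other columns carry information about its missing entries.
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=50)

# Knock out roughly 20% of the entries at random.
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# Iterate over the per-feature conditional models for up to 10 rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_imputed).sum())  # number of missing values left
```

With default settings this produces a single imputed dataset. Multiple imputations in the MICE sense would require repeating the procedure with `sample_posterior=True` and different random seeds.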

The data also contains non-numeric values that must be transformed before modelling. The variables were split into categorical and non-categorical groups, and the categorical ones were then encoded with ordinal encoding. The choice of ordinal encoding came from experimentation. These operations were performed on the combined training and test data.
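The encoding step might look like the sketch below, which assumes scikit-learn's `OrdinalEncoder`; the column names and values are hypothetical. Fitting on the joined data ensures that a category appearing only in the test set still receives a code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; "ambiguous" appears only in the test set.
train = pd.DataFrame({"gender": ["male", "female", "female"], "age": [2.0, 5.0, 7.0]})
test = pd.DataFrame({"gender": ["ambiguous", "male"], "age": [3.0, 9.0]})

# Join train and test, tagging each row with its origin so they can be split again.
combined = pd.concat([train, test], keys=["train", "test"])

categorical_cols = ["gender"]  # assumed list of categorical columns

# Fit one encoder on the joined data and replace each categorical column by its codes.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(combined[categorical_cols])
for i, col in enumerate(categorical_cols):
    combined[col] = encoded[:, i]

# Split back into the original partitions.
train_enc = combined.loc["train"]
test_enc = combined.loc["test"]
print(train_enc)
print(test_enc)
```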

2_GeneticDisorder_finalmodelling.ipynb

The main objective of this notebook is to run and compare different models. The data is transformed further here, starting with the target variables, Genetic Disorder and Disorder Subclass. Both are categorical, and their characteristics, together with the data itself, allow them to be merged into a single target variable without the risk of overlapping. Analysis of the new target variable shows that the data is imbalanced. During experimentation, results improved when the target variable was balanced, so the data was oversampled using SMOTE (Synthetic Minority Oversampling Technique).
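One way to build such a single target is to join the two labels with a separator and split them again after prediction. The sketch below is only an assumption about how this could be done; the separator and the toy labels are hypothetical.

```python
import pandas as pd

SEP = "::"  # hypothetical separator, assumed not to occur inside any label

# Hypothetical toy labels standing in for the two target columns.
targets = pd.DataFrame({
    "Genetic Disorder": ["class_A", "class_A", "class_B"],
    "Disorder Subclass": ["sub_1", "sub_2", "sub_3"],
})

# Merge both targets into one label per row.
combined_target = targets["Genetic Disorder"] + SEP + targets["Disorder Subclass"]

# After predicting the combined label, split it back into the two targets.
restored = pd.DataFrame(
    [label.split(SEP, 1) for label in combined_target],
    columns=["Genetic Disorder", "Disorder Subclass"],
)

print(combined_target.value_counts())  # class frequencies reveal any imbalance
```

A single multiclass model can then be trained on `combined_target`, and both challenge columns are recovered from each prediction.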

This technique was described by Nitesh Chawla et al. in their 2002 paper “SMOTE: Synthetic Minority Over-sampling Technique.”

SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
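The interpolation idea can be shown in a few lines of NumPy. This is a simplified illustration of the core step only, not the implementation used in the notebook; the function name `smote_like_sample` is hypothetical. In practice a library such as imbalanced-learn, with its `SMOTE` class, would typically be used instead.

```python
import numpy as np


def smote_like_sample(minority: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic point between a random minority example
    and one of its k nearest minority-class neighbours."""
    i = rng.integers(len(minority))
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()  # position along the line segment, in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])


rng = np.random.default_rng(42)

# Four minority-class points at the corners of the unit square (toy data).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(20)])
print(synthetic[:3])
```

Every synthetic point lies on a segment joining two existing minority examples, so the new samples stay inside the region already occupied by that class.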

The models were then run. Their results are shown in the following table:

Model                        Score
GradientBoostingClassifier   33.60688
XGBoostClassifier            33.52182
LGBMClassifier               32.99259
CatboostClassifier           35.20506
Best Score                   35.20506

The best-ranked result was the CatboostClassifier model, which placed 18th out of 2138 participants.