Of Genomes And Genetics
In this Hackerearth challenge [link], you are given a dataset containing medical information about children with genetic disorders.
You are asked to predict the following:
- Disorder subclass.
- Genetic disorder.
with the following evaluation metrics:

Genetic Disorder

```python
score1 = max(0, 100 * metrics.f1_score(actual["Genetic Disorder"], predicted["Genetic Disorder"], average="macro"))
```

Disorder Subclass

```python
score2 = max(0, 100 * metrics.f1_score(actual["Disorder Subclass"], predicted["Disorder Subclass"], average="macro"))
```

Final score

```python
score = (score1 / 2) + (score2 / 2)
```
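Put together, the scoring can be reproduced end to end. The following is a sketch with toy labels standing in for the real columns, not the organiser's exact harness:

```python
from sklearn import metrics

# Toy labels standing in for the real target columns (illustrative only).
actual_gd = ["A", "B", "A", "C"]
pred_gd = ["A", "B", "B", "C"]
actual_ds = ["x", "y", "x", "y"]
pred_ds = ["x", "y", "y", "y"]

# Macro F1 per target, clipped at 0 and scaled to 0-100.
score1 = max(0, 100 * metrics.f1_score(actual_gd, pred_gd, average="macro"))
score2 = max(0, 100 * metrics.f1_score(actual_ds, pred_ds, average="macro"))

# Final score is the average of the two.
score = (score1 / 2) + (score2 / 2)
```

Macro averaging weights every class equally, so rare classes matter as much as common ones, which is why class imbalance becomes important later.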
I used two main notebooks:

1_GeneticDisorder_preprocessing.ipynb

Although a much more extensive EDA was performed in the notebook named GeneticDisorderEDAv1, the most important transformations of the data for modelling occurred in this one. Of the 44 columns, 13 that would not provide relevant information were discarded at the outset. From a first look at the data it can be observed that many values are missing; to address this, MICE (Multiple Imputation by Chained Equations) imputation was applied. This approach is generally referred to as fully conditional specification (FCS) or multivariate imputation by chained equations (MICE).
> This methodology is attractive if the multivariate distribution is a reasonable description of the data. FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10–20) is often sufficient.
>
> — mice: Multivariate Imputation by Chained Equations in R, 2009.
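In Python, scikit-learn's `IterativeImputer` implements this FCS/MICE idea: each incomplete column is modelled conditionally on the others, cycling for a fixed number of rounds. A minimal sketch on a toy matrix (the real dataset has ~31 retained columns after dropping the 13 uninformative ones):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric matrix with missing entries (illustrative data).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [np.nan, 4.0, 5.0],
    [7.0, 8.0, 9.0],
])

# Each incomplete column is regressed on the others; max_iter controls
# how many times the imputer cycles over the conditional models.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```

After fitting, `X_filled` contains no missing values; the 10–20 iterations the mice paper suggests map directly onto `max_iter`.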
Non-numeric values can be observed in the data, and transforming them is important for our modelling. For this, the variables were separated into categorical and non-categorical, and ordinal encoding was then applied to the categorical variables. The choice of ordinal encoding came from experimentation, and these operations were naturally performed on the joined test and training data so that both splits share the same encoding.
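A minimal sketch of that joint encoding step, assuming hypothetical column names, using scikit-learn's `OrdinalEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames; the column names are hypothetical stand-ins.
train = pd.DataFrame({"Gender": ["Male", "Female", "Female"],
                      "Status": ["Alive", "Deceased", "Alive"]})
test = pd.DataFrame({"Gender": ["Female", "Male"],
                     "Status": ["Deceased", "Alive"]})

# Fit on the concatenation of both splits so every category seen in
# either split gets a code, then transform each split separately.
full = pd.concat([train, test], ignore_index=True)
enc = OrdinalEncoder()
enc.fit(full)

train_enc = enc.transform(train)
test_enc = enc.transform(test)
```

Fitting on the combined frame avoids unseen-category errors at transform time, at the cost of a mild form of leakage that is commonplace in competition settings.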
2_GeneticDisorder_finalmodelling.ipynb

In this notebook the main objective is to run the different models. For this, the data is further transformed, starting with our target variables, Genetic Disorder and Disorder Subclass: these are categorical variables whose characteristics, and the data itself, allow us to combine them into a single variable without risk of overlap. Analysis of the new target variable then shows that the data is imbalanced. During experimentation, an improvement in results was observed when the target variable was balanced, so the data was oversampled with SMOTE (Synthetic Minority Over-sampling Technique).
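The combined-target step can be sketched as follows. Since each subclass belongs to exactly one disorder, concatenating the two labels yields a single target with no overlapping classes (the label values below are illustrative):

```python
import pandas as pd

# Toy targets; labels are illustrative stand-ins for the real classes.
df = pd.DataFrame({
    "Genetic Disorder": ["Mitochondrial", "Single-gene", "Mitochondrial"],
    "Disorder Subclass": ["Leigh syndrome", "Cystic fibrosis", "Leigh syndrome"],
})

# One combined label per row; the separator lets us split a predicted
# combined label back into its two components for submission.
df["target"] = df["Genetic Disorder"] + " | " + df["Disorder Subclass"]
codes, classes = pd.factorize(df["target"])
```

A predicted code is mapped back through `classes` and split on `" | "` to recover both original targets from one classifier.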
This technique was described by Nitesh Chawla et al. in their 2002 paper, “SMOTE: Synthetic Minority Over-sampling Technique.”
SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
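The interpolation described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the production implementation (in practice `imblearn.over_sampling.SMOTE` was the sensible choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_minority, n_new, k=2):
    """Draw synthetic minority points along segments between near neighbours.

    Minimal sketch of SMOTE: pick a minority example, pick one of its k
    nearest minority neighbours, and sample a point on the line between them.
    """
    X = np.asarray(X_minority, dtype=float)
    # Pairwise distances within the minority class; mask self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Each point's k nearest minority neighbours.
    nn = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X))          # a random minority example
        j = nn[i, rng.integers(k)]        # one of its near neighbours
        lam = rng.random()                # position along the segment
        new.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_sample(X_min, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority class's local region of feature space.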
Then we proceed to run the models, whose results can be seen in the following table:
| Model | Score |
|---|---|
| GradientBoostingClassifier | 33.60688 |
| XGBoostClassifier | 33.52182 |
| LGBMClassifier | 32.99259 |
| CatboostClassifier | 35.20506 |

Best score: 35.20506
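The comparison loop behind that table can be sketched as below, scored with the same macro-F1 metric. Synthetic data stands in for the preprocessed features, and only the sklearn model is shown; XGBoost, LightGBM and CatBoost classifiers expose the same `fit`/`predict` interface and drop into the same loop:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic multiclass data standing in for the preprocessed features.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Candidates share the fit/predict interface, so XGBClassifier,
# LGBMClassifier and CatBoostClassifier slot in beside this one.
models = {"GradientBoostingClassifier": GradientBoostingClassifier(random_state=0)}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = 100 * f1_score(y_te, model.predict(X_te), average="macro")
```

Ranking `results` by value reproduces a table like the one above; the actual scores depend on the real features and a held-out evaluation set.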
The best result, obtained with the CatboostClassifier model, placed 18th out of 2138 participants.