Oversampling based on generative adversarial networks to overcome imbalanced data in predicting fraudulent insurance claims



  • Ranu A. Nugraha Faculty of Information Technology, Graduate School of Computer Science, Nusa Mandiri University, Jakarta, Indonesia
  • Hilman F. Pardede Faculty of Information Technology, Graduate School of Computer Science, Nusa Mandiri University, Jakarta, Indonesia https://orcid.org/0000-0001-8078-7592
  • Agus Subekti National Research and Innovation Agency, Indonesia https://orcid.org/0000-0002-4525-4747




Fraud in health insurance not only causes cost overruns but also degrades the quality of health services in the long term. The use of machine learning to predict health insurance fraud is increasingly popular. However, one of the remaining problems in predicting health insurance fraud is data imbalance. Data imbalance affects machine learning models, which tend to be biased towards the majority class. Recently, many efforts have employed deep learning for data augmentation. One of them is the Generative Adversarial Network (GAN). Studies show that GANs are capable of generating artificial data very similar to real data. Unlike other deep learning architectures, a GAN trains two networks, called the generator and the discriminator, through adversarial training. By doing so, the generator never sees the distribution of the real data, making it possible to learn a better generative model for producing artificial data. In this paper, we propose to use a GAN as an oversampling method to generate data for the minority class. Since the data for detecting health insurance fraud are tabular, we adopt the Conditional Tabular GAN (CTGAN) architecture, in which the generator is conditioned to handle tabular input and receives additional information so that it produces samples according to the specified class conditions. The new balanced data are used to train 17 classification algorithms. Our experiments show that our proposed method achieves better performance on several evaluation metrics (accuracy, precision, F1-score, and ROC) than other referenced methods for dealing with imbalanced data: random oversampling (ROS), random undersampling (RUS), the Synthetic Minority Oversampling Technique (SMOTE), Borderline SMOTE (B-SMO), and adaptive synthetic sampling (ADASYN).
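To make the oversampling idea concrete, the sketch below implements the simplest baseline mentioned in the abstract, random oversampling (ROS), in plain NumPy: minority-class rows are resampled with replacement until every class matches the majority count. This is an illustrative toy, not the paper's CTGAN pipeline; the data, function name, and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced tabular data: 95 majority-class rows (0), 5 minority rows (1)
X = rng.normal(size=(100, 4))
y = np.array([0] * 95 + [1] * 5)

def random_oversample(X, y, rng):
    """Resample each class with replacement up to the majority count (ROS)."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts_X, parts_y = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        chosen = rng.choice(idx, size=n_max, replace=True)
        parts_X.append(X[chosen])
        parts_y.append(np.full(n_max, c))
    return np.vstack(parts_X), np.concatenate(parts_y)

X_bal, y_bal = random_oversample(X, y, rng)
print(np.bincount(y_bal))  # both classes now have 95 rows
```

A GAN-based oversampler slots into the same place in the pipeline: instead of duplicating minority rows, a generator trained on the minority class (conditioned on the class label, as in CTGAN) synthesizes new rows until the classes are balanced, and the balanced set is then fed to the classifiers.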





Special Issue on Machine Learning (CS)