Journal Article


John Dunnion
Michael L Doherty
Luke O'Grady
Caroline Fenlon



Keywords: support vector machines; decision trees; logistic regression; simulation modelling; decision support; sensitivity and specificity; data mining; predictive models

A discussion of calibration techniques for evaluating binary and categorical predictive models. (2017)

Abstract Modelling of binary and categorical events is a commonly used tool to simulate epidemiological processes in veterinary research. Logistic and multinomial regression, naïve Bayes, decision trees and support vector machines are popular data mining techniques used to predict the probabilities of events with two or more outcomes. Thorough evaluation of a predictive model is important to validate its suitability for use in decision support or broader simulation modelling. Measures of discrimination, such as sensitivity, specificity and receiver operating characteristic (ROC) curves, are commonly used to evaluate how well the model can distinguish between the possible outcomes. However, these discrimination tests cannot confirm that the predicted probabilities are accurate and free of bias. This paper describes a range of calibration tests, which typically measure the accuracy of predicted probabilities by comparing them to mean event occurrence rates within groups of similar test records. These include overall goodness-of-fit statistics in the form of the Hosmer-Lemeshow and Brier tests. Visual assessment of prediction accuracy is carried out using plots of calibration and deviance (the difference between the outcome and its predicted probability). The slope and intercept of the calibration plot are compared to the perfect diagonal using the unreliability test. Mean absolute calibration error provides an estimate of the level of predictive error. This paper uses sample predictions from a binary logistic regression model to illustrate the use of calibration techniques. Code is provided to perform the tests in the R statistical programming language. The benefits and disadvantages of each test are described. Discrimination tests are useful for establishing a model's diagnostic abilities, but may not suitably assess the model's usefulness for other predictive applications, such as stochastic simulation.
Calibration tests may be more informative than discrimination tests for evaluating models with a narrow range of predicted probabilities or overall prevalence close to 50%, which are common in epidemiological applications. Using a suite of calibration tests alongside discrimination tests allows model builders to thoroughly measure their model's predictive capabilities.
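The paper's accompanying code is in R; as a rough, language-neutral illustration of the group-based tests the abstract describes (Brier score, Hosmer-Lemeshow statistic and mean absolute calibration error), the sketch below applies them to a small set of hypothetical predictions and outcomes. The data, the two-group split and all variable names are invented for illustration; the calibration plot and unreliability test are not reproduced here.

```python
# Illustrative calibration checks on toy data (pure Python; the paper's
# own code is in R -- this is a sketch, not a reproduction of it).

# Hypothetical predicted probabilities and observed binary outcomes.
preds = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.85, 0.95]
obs   = [0,    0,    0,    1,    0,    1,    0,    1,    1,    1]

# Brier score: mean squared difference between prediction and outcome.
brier = sum((p - y) ** 2 for p, y in zip(preds, obs)) / len(preds)

# Group records by predicted probability (here: two groups of five;
# real applications typically use deciles of risk).
pairs = sorted(zip(preds, obs))
groups = [pairs[:5], pairs[5:]]

# Hosmer-Lemeshow statistic: squared observed-minus-expected counts,
# scaled by expected counts, summed over events and non-events.
hl = 0.0
for g in groups:
    n = len(g)
    o1 = sum(y for _, y in g)   # observed events in group
    e1 = sum(p for p, _ in g)   # expected events in group
    hl += (o1 - e1) ** 2 / e1 + ((n - o1) - (n - e1)) ** 2 / (n - e1)
# (hl is compared against a chi-squared distribution with g - 2 df.)

# Mean absolute calibration error: average gap between each group's
# observed event rate and its mean predicted probability.
mace = sum(abs(sum(y for _, y in g) / len(g) - sum(p for p, _ in g) / len(g))
           for g in groups) / len(groups)

print(round(brier, 3), round(hl, 3), round(mace, 3))
```

With only two groups the Hosmer-Lemeshow statistic is shown purely mechanically; in practice more groups (and a chi-squared p-value) are needed for a meaningful goodness-of-fit test.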

