Prediction / Machine Learning
Our group has a genuine and explicit interest in prediction, prediction competitions, and prediction approaches. Some of our approaches are model based, while others focus on complex model stacking (a.k.a. super learning), boosting, random forests, sampling and melding, support vector machines, etc. In our group the clash of statistical cultures described by Leo Breiman in his seminal paper does not exist. Instead, we combine modeling and prediction approaches to answer important scientific problems. In fact, a SMART team won the ADHD 200 competition, making it the first team of statisticians to win a major brain imaging prediction competition. We were also involved in the Heritage Health Prize competition, where our team ranked as high as 16th out of more than 1,500 teams from around the world.
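To make the model-stacking idea concrete, here is a minimal sketch using scikit-learn's `StackingClassifier` on synthetic data. The dataset and the particular base learners are illustrative assumptions, not our actual competition pipeline:

```python
# Minimal model-stacking (super-learning) sketch: base learners' out-of-fold
# predictions are combined by a meta-learner. Data and models are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (assumption, for self-containment).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner is fit on out-of-fold predictions, not in-sample ones
)
stack.fit(X_tr, y_tr)
print(f"stacked test accuracy: {stack.score(X_te, y_te):.3f}")
```

The `cv=5` argument is the important detail: the meta-learner never sees in-sample predictions from the base learners, which is one concrete layer of protection against over-fitting.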
Prediction methods and approaches tend to be highly dependent on the particular problem to be solved. For example, the tools used for predicting the length of hospital stay for 300 million people will differ from those used for predicting which locations in the brain are affected by a particular disease (multiple sclerosis, cancer, Alzheimer's) and from those used to predict who will develop breast cancer in the next 10 years. After evaluating multiple approaches to prediction, the following lessons stand out.
- 1. Simple models, such as linear or logistic regression, often perform extremely well and are hard to beat in practice.
- 2. Fancy names used for various machine learning methods are no substitute for careful thinking about the problem.
- 3. Interpretability of results and transparency of methods remain extremely important to the scientific community, to the public, and to industry. We believe that this is a good thing.
- 4. Model parsimony combined with multiple layers of protection against over-fitting is crucial.
- 5. The most interesting scientific problems do not typically revolve around which algorithm provides the best prediction, as they all perform about the same. Instead, understanding the feature (predictor) space, carefully building strong predictors, normalizing data, dealing with uncertainty and batch effects, and avoiding black-box over-fitting are the keys to making progress.
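Lessons 1 and 4 can be illustrated with a short, hedged experiment: compare a plain logistic regression against a more complex learner under cross-validation. The synthetic dataset and the choice of gradient boosting as the "fancy" competitor are assumptions made only to keep the script self-contained; on real problems the gap is often similarly small:

```python
# Simple baseline vs. a fancier learner, both scored with 5-fold
# cross-validation (one layer of protection against over-fitting).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a handful of informative predictors (assumption).
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           random_state=1)

simple = LogisticRegression(max_iter=1000)
fancy = GradientBoostingClassifier(random_state=1)

simple_scores = cross_val_score(simple, X, y, cv=5)
fancy_scores = cross_val_score(fancy, X, y, cv=5)

print(f"logistic regression: mean CV accuracy {simple_scores.mean():.3f}")
print(f"gradient boosting:   mean CV accuracy {fancy_scores.mean():.3f}")
```

Reporting cross-validated rather than in-sample accuracy is the point: without it, the more flexible model would look deceptively better.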
Finally, to provide a counterpoint to Leo Breiman's discussion, we would like to note the excellent paper by David J. Hand in Statistical Science, "Classifier Technology and the Illusion of Progress." We believe it is a good read, and that in practice one should be very careful about over-buying into the hype (some would say drinking the Kool-Aid) of off-the-shelf machine learning approaches. In our experience, irrespective of one's favorite classifier technology or philosophy, in most applications the basic formula "garbage in, garbage out" remains more valid than ever.