Week 11
I am currently working through different kinds of models and statistical techniques in order to find one that is satisfactory. So far, the best model gives a 52% winning rate on ALL games from 2000-2007. At that rate a bettor would roughly break even, which is unsatisfactory. We also looked for circumstances in which the model is more accurate, and we did find situations where it predicts with more than 65% accuracy, but these situations were hardly the norm.
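To make the break-even remark concrete, here is a small sketch of the arithmetic. It assumes the common spread-betting price of risking 110 to win 100; that price is my assumption for illustration and is not something stated above. The function names are hypothetical.

```python
def break_even_win_rate(risk: float = 110.0, win: float = 100.0) -> float:
    """Win rate at which expected profit is zero when risking `risk` to win `win`.

    The 110/100 default is an assumed price, not a figure from the post.
    """
    # Solve p * win - (1 - p) * risk = 0 for p.
    return risk / (risk + win)


def expected_profit_per_bet(win_rate: float, risk: float = 110.0, win: float = 100.0) -> float:
    """Average profit per bet (in the same units as `risk` and `win`)."""
    return win_rate * win - (1.0 - win_rate) * risk


if __name__ == "__main__":
    print(f"break-even win rate: {break_even_win_rate():.4f}")
    for p in (0.52, 0.65):
        print(f"win rate {p:.2f}: expected profit per bet = {expected_profit_per_bet(p):+.2f}")
```

Under that assumed price, a 52% win rate sits right around the break-even point, while the 65% situations would be clearly profitable if they could be identified in advance.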
I have come to the conclusion that a simple linear regression model will not work with the data I have. Other similar techniques I have tried include robust linear regression, where outliers are down-weighted in order to obtain better estimates, and logistic regression, which takes a binary response and so can predict the probability (and hence the odds) that a team will cover the spread.
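As a rough illustration of the logistic regression idea, the sketch below fits a one-feature model by gradient descent and turns it into a probability of covering the spread. It is not the model described above: the data is synthetic, the single standardized feature is hypothetical, and the "true" slope of 0.8 is made up purely so the example has something to recover.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, epochs=1000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. the linear term
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def make_synthetic_games(n=500, seed=0):
    """Synthetic data only: x is a hypothetical standardized feature, y = 1 if the team covered."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p_cover = sigmoid(0.8 * x)  # made-up relationship, for illustration
        xs.append(x)
        ys.append(1 if rng.random() < p_cover else 0)
    return xs, ys


xs, ys = make_synthetic_games()
b0, b1 = fit_logistic(xs, ys)

if __name__ == "__main__":
    p = sigmoid(b0 + b1 * 1.0)
    print(f"intercept={b0:.3f}, slope={b1:.3f}")
    print(f"P(cover | x=1) = {p:.3f}, odds = {p / (1.0 - p):.3f}")
```

The fitted probability p and the odds p / (1 - p) are two views of the same prediction, which is why logistic regression is a natural fit for a yes/no outcome like covering the spread.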
A few of the problems I see are that 1) the data is not linear, 2) the observations are not independent, and 3) there are other significant factors that are not contained in the data. All hope is not lost; there are statistical techniques for getting around these problems. Non-linear techniques, as well as Bayesian methods, might be more accurate and take the severe randomness into account. Repeated measures models, also called mixed models, are more commonly used in drug trials and account for the correlation among observations taken repeatedly on a single entity.
Finally, there is a vast number of data mining techniques that might be a good fit for the NFL data. These include neural networks, which can capture complex behavior in a system of interconnected nodes; regression and decision trees, which support predictive modeling and classification (in our case, will a team cover the spread?); and clustering procedures, which could group games by their predictability, Vegas spread, and other factors, and so help in building estimates for each cluster type.
Stay tuned for next week when I deploy my next best model. I will give you the NFL picks and their level of accuracy from 2000-2007.