This is Part III of Project One, focusing on the application of
cross-validation methods in predictive modeling.
Cross-validation for Predictive Modeling
The idea is to use data-driven approaches to data splitting and then
apply cross-validation methods to select the final model from a pool of
candidate models based on a predictive performance metric, such as MSE
for linear regression models and accuracy, sensitivity, or specificity
for logistic regression models.
Suggested Components in the Predictive Analysis
Random splitting - use random splitting for all data partitions.
Two-way data splitting - split the data into 75% for training and validation and 25% for testing.
5-fold cross-validation - run a 5-fold cross-validation algorithm on the training data (the 75% portion).
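The three components above can be sketched in a few lines. This is only an illustrative sketch in Python using the standard library; the function names `train_test_split` and `k_fold_indices`, the seed, and the sample size of 200 rows are assumptions made for the example, not part of the assignment.

```python
import random

def train_test_split(n_rows, test_fraction=0.25, seed=123):
    """Randomly split row indices into a training/validation part and a testing part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # random splitting
    n_test = round(n_rows * test_fraction)    # 25% held out for testing
    return indices[n_test:], indices[:n_test]

def k_fold_indices(train_indices, k=5, seed=123):
    """Randomly assign the training indices to k folds of near-equal size."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Hypothetical data set with 200 rows.
train_idx, test_idx = train_test_split(200)
folds = k_fold_indices(train_idx, k=5)

for i, validation in enumerate(folds):
    # Each fold serves once as the validation set; the other folds form the training set.
    training = [j for f in folds if f is not validation for j in f]
    print(f"fold {i + 1}: {len(training)} training rows, {len(validation)} validation rows")
```

The testing indices are set aside before any folds are formed, so the 25% testing data never influence model selection.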
Prediction Linear Regression
The primary predictive performance metric for linear regression
modeling is the mean squared error (MSE): the average squared
difference between the predicted and the observed values of the response
variable on its original scale.
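As a small illustration of computing the MSE on the original scale, the sketch below assumes a hypothetical candidate model that was fitted to the log of the response, so its predictions are back-transformed with `exp` before the error is computed. The data values and the helper name `mse` are made up for the example.

```python
import math

def mse(observed, predicted):
    """Mean squared error: average squared difference between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)

# Made-up observed responses on the original scale.
observed = [10.0, 20.0, 40.0]

# Made-up predictions from a model fitted to log(y).
log_predictions = [math.log(12.0), math.log(18.0), math.log(40.0)]

# Back-transform to the original scale before computing the MSE,
# so that models with and without a transformed response are comparable.
predictions = [math.exp(v) for v in log_predictions]

print(round(mse(observed, predictions), 4))
```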
Other predictive performance metrics that can also be used are \(R^2\) or \(R^2_{adj}\).
Likelihood-based metrics such as AIC and SBC can be used if the
likelihood functions of all candidate models are on the same scale.
These measures are not as intuitive as the MSE, since the MSE is a squared
‘distance’ in Euclidean space.
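For reference, the standard definitions of the two likelihood-based metrics mentioned above are given below, where \(\hat{L}\) is the maximized likelihood, \(k\) is the number of estimated parameters, and \(n\) is the sample size (this notation is introduced here for illustration):

\[
\mathrm{AIC} = -2\ln\hat{L} + 2k, \qquad \mathrm{SBC} = -2\ln\hat{L} + k\ln n
\]

Smaller values are preferred. Both depend on \(\hat{L}\), which changes when the response variable is transformed, which is why the values are only comparable across models whose likelihoods are on the same scale.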
If the response variables in all candidate
models are on the same scale, the MSE is expected to be used in the
cross-validation for model selection.
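A minimal sketch of selecting among candidate models by cross-validated MSE follows. Everything in it is an assumption for illustration: the data are synthetic, the two candidates (an intercept-only model and a simple linear regression) are stand-ins for whatever candidate models you build, and the helper names are hypothetical.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: intercept-only model that always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: simple linear regression fitted by least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def cv_mse(fit, xs, ys, k=5, seed=123):
    """Average validation MSE of one candidate over k random folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in fold]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k

# Synthetic training data for illustration only: y = 3 + 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(150)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]

# The same seed gives every candidate the same folds, so the comparison is fair.
candidates = {"intercept only": fit_mean, "simple linear": fit_line}
scores = {name: cv_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "selected:", best)
```

The candidate with the smallest average validation MSE is selected; the testing data are not touched at this stage.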
Logistic Predictive Modeling
The primary tool for assessing the global predictive performance of
logistic models is ROC curve analysis, which includes the area under the
ROC curve (AUC). ROC curve analysis is suggested for this assignment.
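To make the ROC curve and the AUC concrete, here is a small sketch that computes both from class labels and predicted probabilities. The labels, the scores, and the function names `roc_points` and `auc` are made up for illustration. The AUC is computed through its pairwise interpretation: the proportion of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per distinct cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = event) and predicted probabilities from a logistic model.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(roc_points(labels, scores))
print(round(auc(labels, scores), 4))
```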
Other predictive performance measures that can be considered are
accuracy, sensitivity, and specificity.
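These three measures all come from the confusion matrix at a chosen probability cutoff. The sketch below assumes a cutoff of 0.5 and uses made-up labels and probabilities; the function name `classification_metrics` is hypothetical, and the code assumes both classes are present in the data.

```python
def classification_metrics(labels, probabilities, cutoff=0.5):
    """Accuracy, sensitivity, and specificity at a given probability cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 0)
    return {
        "accuracy": (tp + tn) / len(labels),   # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),         # share of actual positives detected
        "specificity": tn / (tn + fp),         # share of actual negatives detected
    }

# Made-up labels (1 = event) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

metrics = classification_metrics(labels, probabilities)
print(metrics)
```

Unlike the AUC, these values change with the cutoff, so the cutoff used should be reported alongside them.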
Reporting ROC and AUC is required when comparing candidate models.
After the final model is identified, you need to use the 25% testing
data set to report the actual performance of the
corresponding model. Because the testing data play no part in model fitting
or selection, this performance measure approximates the actual
performance when the model is applied to new, real data.