This is Part III of Project One, which focuses on applying cross-validation methods in predictive modeling.



1 Cross-validation for Predictive Modeling

The idea is to use data-driven approaches to data splitting and then apply cross-validation methods to select the final model from a pool of candidate models based on a predictive performance metric, such as MSE for linear regression models and accuracy, sensitivity, or specificity for logistic regression models.

Suggested Components in the Predictive Analysis

  • Random splitting - use random splitting for all data partitions.

  • Two-way data splitting - split the data into 75% for training and validation and 25% for testing.

  • 5-fold cross-validation - apply a 5-fold cross-validation algorithm to the training data.
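The three components above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not a prescribed implementation: the function names, the seed, and the sample size of 200 are assumptions made for the example.

```python
import random

def train_test_split_indices(n, test_fraction=0.25, seed=123):
    """Randomly split row indices 0..n-1 into a training/validation set and a testing set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(train_indices, k=5, seed=123):
    """Randomly partition the training indices into k folds of near-equal size."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Hypothetical data set with 200 rows: 75% for training/validation, 25% for testing.
train_idx, test_idx = train_test_split_indices(n=200)
folds = k_fold_indices(train_idx, k=5)

for i, fold in enumerate(folds, start=1):
    # Each fold serves once as the validation set; the other four folds train the model.
    print(f"fold {i}: {len(fold)} validation rows, {len(train_idx) - len(fold)} training rows")
```

The testing indices are set aside before any folds are formed, so the 25% testing data play no role in model selection.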


2 Prediction with Linear Regression

The primary predictive performance metric for linear regression modeling is the mean squared error (MSE): the average squared difference between the predicted and observed values of the response variable on its original scale.
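For reference, with \(n\) observations in the validation or testing set, observed responses \(y_i\), and predictions \(\hat{y}_i\) (notation introduced here for illustration), the MSE can be written as

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 .
\]

If a candidate model is fitted to a transformed response, the predictions should be transformed back to the original scale before this quantity is computed.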

Other predictive performance metrics that can also be used are \(R^2\) or \(R^2_{adj}\).

Likelihood-based metrics such as AIC and SBC can be used if the likelihood functions of all candidate models are on the same scale. These measures are not as intuitive as the MSE, which is a squared ‘distance’ in Euclidean space.

If the response variables in all candidate models are on the same scale, the MSE is expected to be used in the cross-validation for model selection.
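One way to carry out this comparison is sketched below in Python, using only the standard library. Everything in it is an assumption made for illustration: the simulated data, the two candidate models (intercept-only and a simple least-squares line), and the function names. The candidate with the smaller average validation MSE across the five folds would be selected.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1 (hypothetical): intercept-only model that predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_simple_linear(xs, ys):
    """Candidate 2 (hypothetical): least-squares line y = b0 + b1 * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return lambda x: b0 + b1 * x

def mse(predict, xs, ys):
    """Average squared difference between observed and predicted responses."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

def cv_mse(fit, xs, ys, k=5, seed=123):
    """Average validation MSE over k randomly formed folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mse(predict, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(fold_errors) / k

# Simulated training data (an assumption for the example): y = 2 + 3x + noise.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(150)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]

cv_results = {
    "intercept only": cv_mse(fit_mean, x, y),
    "simple linear": cv_mse(fit_simple_linear, x, y),
}
best = min(cv_results, key=cv_results.get)
print(cv_results)
print("selected model:", best)
```

Both candidates are evaluated on the same folds because `cv_mse` uses the same seed, which keeps the comparison fair.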

3 Logistic Predictive Modeling

The primary tool for assessing the global predictive performance of logistic models is ROC curve analysis, which includes the area under the ROC curve (AUC). ROC curve analysis is suggested for this assignment.

Other predictive performance measures that can be considered are accuracy, sensitivity, and specificity.
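These three measures depend on the cutoff probability used to turn predicted probabilities into predicted classes. A minimal Python sketch, assuming a cutoff of 0.5 and a small made-up set of observed labels and predicted probabilities (both hypothetical), is:

```python
def classification_metrics(observed, predicted_prob, cutoff=0.5):
    """Accuracy, sensitivity, and specificity at a given cutoff probability."""
    predicted = [1 if p >= cutoff else 0 for p in predicted_prob]
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true positives
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true negatives
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false positives
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(observed),
        "sensitivity": tp / (tp + fn),   # proportion of actual positives detected
        "specificity": tn / (tn + fp),   # proportion of actual negatives detected
    }

# Made-up validation data for illustration only.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1]

metrics = classification_metrics(observed, predicted_prob)
print(metrics)
```

Changing `cutoff` trades sensitivity against specificity, which is why these are local measures while the ROC curve summarizes performance over all cutoffs.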

Reporting ROC and AUC is required when comparing candidate models.
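As an illustration of what is being reported, the sketch below builds the ROC curve points and computes the AUC with the trapezoidal rule. It is written in Python with only the standard library; the function names and the small data set are assumptions for the example, and in practice an established package would normally be used.

```python
def roc_points(observed, predicted_prob):
    """Return (false positive rate, true positive rate) pairs across all cutoffs."""
    n_pos = sum(observed)
    n_neg = len(observed) - n_pos
    points = [(0.0, 0.0)]
    # Lower the cutoff step by step so the curve moves from (0, 0) to (1, 1).
    for cutoff in sorted(set(predicted_prob), reverse=True):
        tp = sum(1 for o, p in zip(observed, predicted_prob) if o == 1 and p >= cutoff)
        fp = sum(1 for o, p in zip(observed, predicted_prob) if o == 0 and p >= cutoff)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up validation data for illustration only.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1]

curve = roc_points(observed, predicted_prob)
auc_value = auc(curve)
print(curve)
print("AUC:", round(auc_value, 4))
```

When candidate models are compared, the curve and AUC of each model would be computed on the same validation data, with a larger AUC indicating better global discrimination.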

After the final model is identified, you need to use the 25% testing data set to report the actual performance of the final model. This performance measure approximates the performance to expect when the model is applied to new real-world data.
