Data Set
Choose a data set that has at least four categorical variables and
four numerical variables. The sample size should be at least 200. You
can find a data set either in my teaching data repository or from other
data sources. The data set should be cross-sectional (i.e., all of the
data points must be observed/collected/generated at the same point in time).
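The size and variable-type requirements above can be checked with a few lines of code before you commit to a data set. The following is a minimal Python sketch, not a required implementation. It assumes the data have been loaded as a list of dictionaries (one per observation); the names `classify_columns`, `meets_requirements`, and the `demo` data are hypothetical. It uses only the standard library; pandas would be a common alternative. Note that a categorical variable coded with numbers (e.g., 0/1) would be counted as numerical by this simple rule, so check such variables by hand.

```python
from numbers import Number


def classify_columns(rows):
    """Split column names into (numerical, categorical) lists.

    A column counts as numerical when every non-missing value is an int or
    float (booleans excluded); every other column is treated as categorical.
    """
    numerical, categorical = [], []
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        is_num = bool(values) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        (numerical if is_num else categorical).append(name)
    return numerical, categorical


def meets_requirements(rows, min_rows=200, min_categorical=4, min_numerical=4):
    """Check the sample size and the number of variables of each type."""
    numerical, categorical = classify_columns(rows)
    return (
        len(rows) >= min_rows
        and len(categorical) >= min_categorical
        and len(numerical) >= min_numerical
    )


# Hypothetical data: 250 observations, 4 numerical and 4 categorical variables.
demo = [
    {
        "age": 20 + i % 50,
        "income": 30000.0 + 100 * i,
        "height": 150 + i % 40,
        "score": i % 10,
        "sex": "F" if i % 2 else "M",
        "region": ["N", "S", "E", "W"][i % 4],
        "smoker": "yes" if i % 3 == 0 else "no",
        "plan": ["basic", "plus"][i % 2],
    }
    for i in range(250)
]

print(meets_requirements(demo))
```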
Description of Data
The report should provide the following information about the
data:
A brief description of the data source.
How the data set is generated or collected.
The number of variables, their types (categorical or numerical), and
the size of the data set.
The variable names and their descriptions/definitions.
Exploratory Data Analysis and Feature Engineering
Perform the standard EDA, such as the distributions of the categorical
and numerical variables, the relationship between two variables
(combinations of categorical and numerical variables), and the pairwise
relationships. Keep in mind that the pairwise scatter plot is only
meaningful for numerical variables.
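As an illustration of the numerical side of these EDA steps, here is a minimal Python sketch that computes a frequency table for a categorical variable, summary statistics for a numerical variable, group means (categorical by numerical), and a Pearson correlation (numerical by numerical). The function names and the tiny `rows` data set are hypothetical, and the data are assumed to be a list of dictionaries with no missing values. Plots (bar charts, histograms, box plots, pairwise scatter plots) would accompany these summaries in the report.

```python
from collections import Counter, defaultdict
from statistics import mean, median, stdev


def frequency_table(rows, column):
    """Relative frequency of each level of a categorical column."""
    counts = Counter(r[column] for r in rows)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def numeric_summary(rows, column):
    """Basic location and spread statistics of a numerical column."""
    values = [r[column] for r in rows]
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
        "sd": stdev(values),
    }


def group_means(rows, by, column):
    """Mean of a numerical column within each level of a categorical column."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[by]].append(r[column])
    return {level: mean(vals) for level, vals in groups.items()}


def pearson(rows, x, y):
    """Pearson correlation coefficient between two numerical columns."""
    xs = [r[x] for r in rows]
    ys = [r[y] for r in rows]
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy data, only to show how the functions are called.
rows = [
    {"group": "a", "x": 1.0, "y": 2.0},
    {"group": "a", "x": 2.0, "y": 4.0},
    {"group": "b", "x": 3.0, "y": 6.0},
    {"group": "b", "x": 4.0, "y": 8.0},
]

print(frequency_table(rows, "group"))
print(numeric_summary(rows, "x"))
print(group_means(rows, "group", "x"))
print(pearson(rows, "x", "y"))
```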
For each EDA and associated representation, you should:
Interpret what you observed and its implications for potential
feature engineering.
Perform feature engineering based on the EDA by writing an R/Python
function.
Write a main function to wrap the individual feature engineering
functions.
Test the main function with different patterns in the components
and make sure it produces the expected result.
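The structure described in the steps above (individual feature engineering functions, a main wrapper, and tests) could look like the following Python sketch. The two transformations shown, a log transform for a right-skewed variable and collapsing rare categories, are only examples of what EDA might suggest; the function names, parameters, and `sample` data are hypothetical. Your own functions should follow from what you observed in your data.

```python
import math
from collections import Counter


def add_log_feature(rows, column):
    """Return new rows with log(1 + x) added for a non-negative column."""
    return [{**r, f"log_{column}": math.log1p(r[column])} for r in rows]


def collapse_rare_levels(rows, column, min_count=2, other="other"):
    """Return new rows where levels seen fewer than min_count times
    are replaced by the label given in `other`."""
    counts = Counter(r[column] for r in rows)
    return [
        {**r, column: r[column] if counts[r[column]] >= min_count else other}
        for r in rows
    ]


def engineer_features(rows, log_columns=(), rare_columns=(), min_count=2):
    """Main wrapper: apply each feature engineering step in a fixed order."""
    for col in log_columns:
        rows = add_log_feature(rows, col)
    for col in rare_columns:
        rows = collapse_rare_levels(rows, col, min_count=min_count)
    return rows


# Hypothetical test data covering different patterns: a zero value,
# a repeated level ("N"), and a level that occurs only once ("S").
sample = [
    {"income": 0.0, "region": "N"},
    {"income": 9.0, "region": "N"},
    {"income": 99.0, "region": "S"},
]

out = engineer_features(sample, log_columns=["income"], rare_columns=["region"])

assert out[0]["log_income"] == 0.0       # log(1 + 0) is 0
assert out[0]["region"] == "N"           # frequent level is kept
assert out[2]["region"] == "other"       # rare level is collapsed
assert sample[2]["region"] == "S"        # the input rows are not modified
assert engineer_features(sample) == sample  # no steps requested, no change
```

Each step returns new rows instead of modifying its input, which makes it easy to compare the data before and after a transformation when testing.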
Reporting and Format
The format of the report should be similar to that of the report of the
case study. You can use the report template that is used for the case
study.