1 Data Set

Choose a data set that has at least four categorical variables and four numerical variables. The sample size should be at least 200. You can find a data set either in my teaching data repository or from other data sources. The data set should be cross-sectional (i.e., all data points must be observed, collected, or generated at the same point in time).

2 Description of Data

The following information about the data should be provided in the report:

  • A brief description of the data source.

  • How the data set is generated or collected.

  • The number of variables, their types (categorical or numerical), and the size of the data set.

  • A list of the variable names and their descriptions/definitions.
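The variable counts, types, and sample size requested above can be tabulated with a short helper, which also lets you confirm that the data set meets the thresholds stated in the Data Set section. The following is only a minimal sketch using the Python standard library: the function names `summarize_variables` and `meets_requirements` are hypothetical, and it assumes the data have already been loaded as a list of dictionaries (one per observation) with missing values stored as `None`.

```python
from numbers import Number


def summarize_variables(rows):
    """Classify each variable as numerical or categorical and count rows.

    `rows` is a list of dicts, one dict per observation (assumed layout).
    A variable is treated as numerical only if every non-missing value is
    a number; everything else, including booleans, is treated as categorical.
    """
    types = {}
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        is_numeric = bool(values) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        types[name] = "numerical" if is_numeric else "categorical"
    return {"n_rows": len(rows), "types": types}


def meets_requirements(summary, min_rows=200, min_each=4):
    """Check the thresholds: at least 200 rows and 4 variables of each type."""
    kinds = list(summary["types"].values())
    return (
        summary["n_rows"] >= min_rows
        and kinds.count("numerical") >= min_each
        and kinds.count("categorical") >= min_each
    )
```

Note that a categorical variable coded with integers (for example, 1/2/3 for education level) would be classified as numerical by this rule, so the automatic summary should still be checked against the variable definitions.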

3 Exploratory Data Analysis and Feature Engineering

Perform the standard EDA, such as the distributions of the categorical and numerical variables, the relationship between two variables (for each combination of categorical and numerical variables), and pairwise relationships. Keep in mind that the pairwise scatter plot is only meaningful for numerical variables.

For each EDA and its associated representation, you should

  • interpret what you observed and its implications for potential feature engineering.

  • perform feature engineering based on the EDA by writing an R/Python function.

  • write a main function to wrap the individual feature engineering functions.

  • test the main function with different patterns in the components and make sure it produces the expected result.
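The structure described in the list above (individual feature engineering functions, a main wrapper, and tests on different input patterns) might be organized as in the following sketch. It is illustrative only: the two transformations shown (a log transform for a right-skewed variable and lumping of rare categories) are examples of what EDA could suggest, not required steps, and the names `log_transform`, `lump_rare_levels`, and `engineer_features`, as well as the list-of-dictionaries data layout, are assumptions made for this example.

```python
import math
from collections import Counter


def log_transform(rows, column):
    """Add log(1 + x) of a right-skewed, non-negative numerical column."""
    return [{**r, f"{column}_log": math.log1p(r[column])} for r in rows]


def lump_rare_levels(rows, column, min_count=2, other="Other"):
    """Merge categories seen fewer than `min_count` times into one level."""
    counts = Counter(r[column] for r in rows)
    return [
        {**r, column: r[column] if counts[r[column]] >= min_count else other}
        for r in rows
    ]


def engineer_features(rows, log_columns=(), lump_columns=(), min_count=2):
    """Main wrapper: apply each feature engineering step in a fixed order."""
    out = [dict(r) for r in rows]  # copy so the input is left untouched
    for column in log_columns:
        out = log_transform(out, column)
    for column in lump_columns:
        out = lump_rare_levels(out, column, min_count=min_count)
    return out


# Hypothetical toy records covering a few patterns: a zero value,
# a large value, and a category that occurs only once.
sample = [
    {"income": 0.0, "region": "north"},
    {"income": 99.0, "region": "north"},
    {"income": 9.0, "region": "west"},
]
result = engineer_features(sample, log_columns=["income"], lump_columns=["region"])
```

Testing the wrapper then amounts to checking `result` against values worked out by hand for each pattern, and confirming that calling `engineer_features(sample)` with no steps selected returns the records unchanged.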

4 Reporting and Format

The format of the report should be similar to that of the case study report. You can use the report template that was used for the case study.
