Topic 4 Effective Data Preparation

In this note, we use a health registry data set as an example to illustrate the use of some of the base R commands in creating an analytic data set analysis and modeling.

4.1 Data Set Description

The Current Population Survey (CPS, http://www.bls.census.gov/cps/overmain.htm) is a monthly survey of about 50,000 households conducted by the Bureau of the Census for the Bureau of Labor Statistics. The survey has been conducted for more than 50 years. The CPS is the primary source of information on the labor force characteristics of the U.S. population. The sample is scientifically selected to represent the civilian noninstitutional population. Respondents are interviewed to obtain information about the employment status of each member of the household 15 years of age and older. However, published data focus on those ages 16 and over. The sample provides estimates for the nation as a whole and serves as part of model-based estimates for individual states and other geographic areas.

Estimates obtained from the CPS include

  • employment,
  • unemployment,
  • earnings,
  • hours of work, and
  • other indicators.

They are available by a variety of demographic characteristics including * age, * sex, * race, * marital status, and * educational attainment.

They are also available by

  • occupation,
  • industry, and
  • class of worker.

Supplemental questions to produce estimates on a variety of topics including

  • School enrollment,
  • income,
  • previous work experience,
  • health,
  • employee benefits, and
  • work schedules

are also often added to the regular CPS questionnaire.

CPS data are used by government policymakers and legislators as important indicators of our nation’s economic situation and for planning and evaluating many government programs. They are also used by the press, students, academics, and the general public.

In this note, we use a very small portion of the sample (https://raw.githubusercontent.com/pengdsci/sta321/main/ww04/cps_00003.csv) for illustrative purposes. The definitions of some of the variables can be found at (https://www.bls.gov/cps/definitions.htm). We will not use this data to perform any meaningful analysis.

The first few columns are what could be called administrative. They’re unique identifiers for the different observations, the timing of the survey they’ve taken, and a bit of other information. So for now we don’t need to pay much attention to MONTH, HWTFINL, CPSID, PERNUM, WTFINL, or CPSIDP. We will drop these variables.

The next few columns are concerned with the different geographies we have for the observations. This data is for individuals, but we also know the individual region (REGION), state (STATEFIP and STATECENSUS), and metropolitan area (METRO and METAREA). We can define a separate data set to store this geoinformation.

4.2 Base R Commands for Data Management

This note introduces several most commonly used R functions in data management.

library(knitr)
dat = read.csv("https://raw.githubusercontent.com/pengdsci/sta321/main/ww04/cps_00003.csv")
kable(head(dat))
YEAR SERIAL MONTH HWTFINL CPSID REGION STATEFIP METRO METAREA STATECENSUS FAMINC PERNUM WTFINL CPSIDP AGE SEX RACE EMPSTAT LABFORCE EDUC VOTED VOREG
2018 1 11 1703.832 20170800000000 32 1 2 3440 63 830 1 1703.832 20170800000000 26 2 100 10 2 111 98 98
2018 1 11 1703.832 20170800000000 32 1 2 3440 63 830 2 1845.094 20170800000000 26 1 100 10 2 123 98 98
2018 3 11 1957.313 20180900000000 32 1 2 5240 63 100 1 1957.313 20180900000000 48 2 200 21 2 73 2 99
2018 4 11 1687.784 20171000000000 32 1 2 5240 63 820 1 1687.784 20171000000000 53 2 200 10 2 81 2 99
2018 4 11 1687.784 20171000000000 32 1 2 5240 63 820 2 2780.421 20171000000000 16 1 200 10 2 50 99 99
2018 4 11 1687.784 20171000000000 32 1 2 5240 63 820 3 2780.421 20171000000000 16 1 200 10 2 50 99 99

The unique identifiers CPSID and CPSIDP are in the form of scientific notation, we need to convert them to a normal string version of the ID.

4.3 Working with Scientific Notations

Two global options we can use to print out the actual ID. See the self-explained options in the following code.

options(digits = 15, scipen=999)
dat = read.csv("https://raw.githubusercontent.com/pengdsci/sta321/main/ww04/cps_00003.csv")
kable(head(dat))
YEAR SERIAL MONTH HWTFINL CPSID REGION STATEFIP METRO METAREA STATECENSUS FAMINC PERNUM WTFINL CPSIDP AGE SEX RACE EMPSTAT LABFORCE EDUC VOTED VOREG
2018 1 11 1703.8321 20170800000000 32 1 2 3440 63 830 1 1703.8321 20170800000000 26 2 100 10 2 111 98 98
2018 1 11 1703.8321 20170800000000 32 1 2 3440 63 830 2 1845.0939 20170800000000 26 1 100 10 2 123 98 98
2018 3 11 1957.3134 20180900000000 32 1 2 5240 63 100 1 1957.3134 20180900000000 48 2 200 21 2 73 2 99
2018 4 11 1687.7836 20171000000000 32 1 2 5240 63 820 1 1687.7836 20171000000000 53 2 200 10 2 81 2 99
2018 4 11 1687.7836 20171000000000 32 1 2 5240 63 820 2 2780.4215 20171000000000 16 1 200 10 2 50 99 99
2018 4 11 1687.7836 20171000000000 32 1 2 5240 63 820 3 2780.4215 20171000000000 16 1 200 10 2 50 99 99

A new R function pander() in the pander{} library was used in the above code to produce an R markdown table. We add more features to the output table (check the help document for more information and examples).

If the ID variable was truncated before saving to CSV format, then the truncated digits will not be recovered.

options(digits = 7)  # change the default number digits


4.4 The ifelse() Function

ifelse statement is also called a vectorized conditional statement. It is commonly used in defining new variables.

For example, we can define a categorical variable, denoted by groupAge, based on the AGE variable in the original data frame. If AGE > 50, then groupAge = "(50, 150)" otherwise groupAge = "[16, 50]" The following code defines this new variable.

dat$groupAge = ifelse(dat$AGE > 50, "(50, 150)", "[16, 50]")
kable(head(dat[, c("AGE", "groupAge")]))
AGE groupAge
26 [16, 50]
26 [16, 50]
48 [16, 50]
53 (50, 150)
16 [16, 50]
16 [16, 50]

If we define another groupAge with more than two categories, we can still call ifelse multiple times. For example, we define groupAge02 as: is AGE > 50, then groupAge02 = "(50, 150)", if 30 <= AGE < 50, groupAge02 = [30, 50),otherewise, groupAge02 = "[16, 30)"

dat$groupAge02 = ifelse(dat$AGE > 50, "(50, 150)", ifelse(dat$AGE < 30, "[16, 30)", "[30, 50)"))
kable(head(dat[, c("AGE", "groupAge", "groupAge02")]))
AGE groupAge groupAge02
26 [16, 50] [16, 30)
26 [16, 50] [16, 30)
48 [16, 50] [30, 50)
53 (50, 150) (50, 150)
16 [16, 50] [16, 30)
16 [16, 50] [16, 30)

Remark: ifelse() is particularly useful when you want to combine categories of existing categorical variables.


4.5 The cut() Function

The cut() function is more flexible than ifelse().

Syntax
cut(num_vector,              # Numeric input vector
    breaks,                  # Number or vector of breaks
    labels = NULL,           # Labels for each group
    include.lowest = FALSE,  # Whether to include the lowest 'break' or not
    right = TRUE,            # Whether the right interval is closed (and the left open) or vice versa
    dig.lab = 3,             # Number of digits of the groups if labels = NULL
    ordered_result = FALSE,  # Whether to order the factor result or not
    …)                       # Additional arguments

We still use the above example to discretize the age variable using cut() function.

dat$cutAge01 = cut(dat$AGE, breaks =c(16, 30, 50, 150), labels=c( "[16, 30)", "[30, 50)", "(50, 150)"), include.lowest = TRUE)
kable(head(dat[, c("AGE", "groupAge", "groupAge02", "cutAge01")]))
AGE groupAge groupAge02 cutAge01
26 [16, 50] [16, 30) [16, 30)
26 [16, 50] [16, 30) [16, 30)
48 [16, 50] [30, 50) [30, 50)
53 (50, 150) (50, 150) (50, 150)
16 [16, 50] [16, 30) [16, 30)
16 [16, 50] [16, 30) [16, 30)
dat$cutAge02 = cut(dat$AGE, breaks =c(16, 30, 50, 150), include.lowest = TRUE)
kable(head(dat[, c("AGE", "groupAge", "groupAge02", "cutAge01", "cutAge02")]))
AGE groupAge groupAge02 cutAge01 cutAge02
26 [16, 50] [16, 30) [16, 30) [16,30]
26 [16, 50] [16, 30) [16, 30) [16,30]
48 [16, 50] [30, 50) [30, 50) (30,50]
53 (50, 150) (50, 150) (50, 150) (50,150]
16 [16, 50] [16, 30) [16, 30) [16,30]
16 [16, 50] [16, 30) [16, 30) [16,30]

4.6 with() and within() Functions

with() and within() are two closely related yet different base R functions that are useful in data management.

4.6.1 The with() Function

with() function enables us to define a new variable based on the variables in a data frame using basic R expressions that include mathematical and logical operations. We can add the newly defined variables to the existing data frame as usual.

with() Syntax

with(data-frame, R-expression)

Example 1

Num <- c(1400,1200,1100,1700,1500)
Cost <- c(1200,1300,1400,1500,1600)
##
dataA <- data.frame(Num,Cost,stringsAsFactors = FALSE)
##
product <- with(dataA, Num*Cost)
quotient <- with(dataA, Cost/Num)
logical <- with(dataA, Num > Cost)
kable(cbind(product = product, quotient = quotient, logical = logical))
product quotient logical
1680000 0.8571429 1
1560000 1.0833333 0
1540000 1.2727273 0
2550000 0.8823529 1
2400000 1.0666667 0
## add the new variables to data frame dataA
dataA$product = product
dataA$quotient = quotient
dataA$logical = logical
##
kable(dataA)
Num Cost product quotient logical
1400 1200 1680000 0.8571429 TRUE
1200 1300 1560000 1.0833333 FALSE
1100 1400 1540000 1.2727273 FALSE
1700 1500 2550000 0.8823529 TRUE
1500 1600 2400000 1.0666667 FALSE


### The within() Function

within() function allows us to create a copy of the data frame and add a column that would eventually store the result of the R expression.

Num <- c(1400,1200,1100,1700,1500)
Cost <- c(1200,1300,1400,1500,1600)
##
dataA <- data.frame(Num,Cost,stringsAsFactors = FALSE)
##
dataB <- within(dataA, Product <- Num*Cost)   # defined Product and added to dataA simultaneously
dataC <- within(dataB, Quotient <- Cost/Num)
dataD <- within(dataC, Logical <- Num > Cost)
kable(dataD)
Num Cost Product Quotient Logical
1400 1200 1680000 0.8571429 TRUE
1200 1300 1560000 1.0833333 FALSE
1100 1400 1540000 1.2727273 FALSE
1700 1500 2550000 0.8823529 TRUE
1500 1600 2400000 1.0666667 FALSE

4.7 The merge() Function - Table Joins

The R merge() function allows merging two data frames by row names (common key). This function allows us to perform different database (SQL) joins, like left join, inner join, right join, or full join, among others. In this note, we only introduce four different ways of merging datasets in base R with examples. We will introduce the SQL clause in R later.


4.7.1 Inner Join

The following figure illustrates how A left joins B and the resulting merged data set.

The following code implements the above left-join.

A = data.frame(Date = c("1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020"),
               CountryID = c(1,1,3,2),
               Units = c(40, 25, 30, 35))
B = data.frame(ID=c(3,4),
               Country=c( "Panama", "Spain"))
AinnerB = merge(A, B, by.x = "CountryID", by.y = "ID")
kable(AinnerB)
CountryID Date Units Country
3 1/3/2020 30 Panama


4.7.2 Left Join

The following figure illustrates how A left joins B and the resulting merged data set.

The following code implements the above left-join.

A = data.frame(Date = c("1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020"),
               CountryID = c(1,1,3,4),
               Units = c(40, 25, 30, 35))
B = data.frame(ID=c(1,2,3),
               Country=c("USA", "Canada", "Panama"))
AleftB = merge(A, B, by.x = "CountryID", by.y = "ID", all.x = TRUE)
kable(AleftB)
CountryID Date Units Country
1 1/1/2020 40 USA
1 1/2/2020 25 USA
3 1/3/2020 30 Panama
4 1/4/2020 35 NA

Note that, left-join produces missing values of the record in A and does not have any information in B.


4.7.3 Right Join

The following figure illustrates how A right joins B and the resulting merged data set.

The following code implements the above left-join.

A = data.frame(Date = c("1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020"),
               CountryID = c(1,1,3,4),
               Units = c(40, 25, 30, 35))
B = data.frame(ID=c(3),
               Country=c("Panama"))
ArightB = merge(A, B, by.x = "CountryID", by.y = "ID", all.y = TRUE)
kable(ArightB)
CountryID Date Units Country
3 1/3/2020 30 Panama

Note also that right-join could also produce missing values.


4.7.4 Full (outer) Join

The following figure illustrates how A Full outer joins B and the resulting merged data set.

The following code implements the above left-join.

A = data.frame(Date = c("1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020"),
               CountryID = c(1,1,3,2),
               Units = c(40, 25, 30, 35))
B = data.frame(ID=c(1,2,3,4),
               Country=c("USA", "Canada", "Panama", "Spain"))
AfullB = merge(A, B, by.x = "CountryID", by.y = "ID", all = TRUE)
kable(AfullB)
CountryID Date Units Country
1 1/1/2020 40 USA
1 1/2/2020 25 USA
2 1/4/2020 35 Canada
3 1/3/2020 30 Panama
4 NA NA Spain


4.8 Subsetting Data Frame

There are two different ways for subsetting a data frame: subsetting by rows and by columns.

We first define the following working data set.

working.data <- data.frame(
  id = c(10,11,12,13,14,15,16,17),
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
  gender = c('M','M',NA,'F','M','M','M','F'),
  dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
                  '1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
  state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
  row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
kable(working.data)
id name gender dob state
r1 10 sai M 1990-10-02 CA
r2 11 ram M 1981-03-24 NY
r3 12 deepika NA 1987-06-14 NA
r4 13 sahithi F 1985-08-16 NA
r5 14 kumar M 1995-03-02 DC
r6 15 scott M 1991-06-21 DW
r7 16 Don M 1986-03-24 AZ
r8 17 Lin F 1990-08-26 PH

4.8.1 Subsetting by Columns

This is a relatively easy job - we can simply select or drop variables to make a subset. The following is just an example.

# only keep id, name, dob
subset01 = working.data[, c("id", "name", "dob")]
subset01
##    id    name        dob
## r1 10     sai 1990-10-02
## r2 11     ram 1981-03-24
## r3 12 deepika 1987-06-14
## r4 13 sahithi 1985-08-16
## r5 14   kumar 1995-03-02
## r6 15   scott 1991-06-21
## r7 16     Don 1986-03-24
## r8 17     Lin 1990-08-26

We could also create the above subset by dropping gender and state.

# drop gender, state
subset02 = working.data[, -c(3,5)]
subset02
##    id    name        dob
## r1 10     sai 1990-10-02
## r2 11     ram 1981-03-24
## r3 12 deepika 1987-06-14
## r4 13 sahithi 1985-08-16
## r5 14   kumar 1995-03-02
## r6 15   scott 1991-06-21
## r7 16     Don 1986-03-24
## r8 17     Lin 1990-08-26

4.8.2 Subsetting by Rows

# subset by row name
kable(subset(working.data, subset=rownames(df) == 'r1'))
id name gender dob state
# subset row by the vector of row names
kable(subset(working.data, rownames(df) %in% c('r1','r2','r3')))
id name gender dob state
# subset by condition
kable(subset(working.data, gender == 'M'))
id name gender dob state
r1 10 sai M 1990-10-02 CA
r2 11 ram M 1981-03-24 NY
r5 14 kumar M 1995-03-02 DC
r6 15 scott M 1991-06-21 DW
r7 16 Don M 1986-03-24 AZ
# subset by condition with %in%
kable(subset(working.data, state %in% c('CA','DC')))
id name gender dob state
r1 10 sai M 1990-10-02 CA
r5 14 kumar M 1995-03-02 DC
# subset by multiple conditions using |
kable(subset(working.data, gender == 'M' | state == 'PH'))
id name gender dob state
r1 10 sai M 1990-10-02 CA
r2 11 ram M 1981-03-24 NY
r5 14 kumar M 1995-03-02 DC
r6 15 scott M 1991-06-21 DW
r7 16 Don M 1986-03-24 AZ
r8 17 Lin F 1990-08-26 PH