1 Introduction

In this brief report, we will summarize the working data set collected from banks that received loan applications from small businesses with a partial warranty from the Small Business Association (SBA). The data is in a public domain and can be found at https://pengdsci.github.io/datasets/#sba-loan. We use the free open-source R (http://www.r-project.com) to perform descriptive analysis.

loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)

You can use the information in the data description (available on the web page of the data site).

var.names = names(loan)
my.var = var.names[c(1,6,21,26)]
my.new.data = loan[1:15, my.var]
dim(my.new.data)
## [1] 15  4

Based on the information given in the data set description, we can create a basic markdown table.

Variable Name Type Description
LoanNr_ChkDgt Text Identifier – Primary Key
Name Text Borrower Name
City Text Borrower City
State Text Borrower State
Zip Text Borrower Zip Code
Bank Text Bank Name
BankState Text Bank State
NAICS Text North American Industry Classification System code
ApprovalDate Date Date SBA Commitment Issued
ApprovalFY Time Fiscal Year of Commitment
Term Numeric Loan term in months
NoEmp Numeric Number of Business Employees
NewExist Text 1 = Existing Business, 2 = New Business
CreateJob Numeric Number of jobs created
RetainedJob Numeric Number of jobs retained
FranchiseCode Text Franchise Code 00000 or 00001 = No Franchise
UrbanRural Text 1= Urban, 2= Rural, 0 = Undefined
RevLineCr Text Revolving Line of Credit : Y = Yes
LowDoc Text LowDoc Loan Program: Y = Yes, N = No
ChgOffDate Date The date when a loan is declared to be in default
DisbursementDate Date Disbursement Date
DisbursementGross Numeric Amount Disbursed
BalanceGross Numeric Gross amount outstanding
MIS_Status Text Loan Status
ChgOffPrinGr Numeric Charged-off Amount
GrAppv Numeric Gross Amount of Loan Approved by Bank
SBA_Appv Numeric SBA’s Guaranteed Amount of Approved Loan
New Text =1 if NewExist=2 (New Business), =0 if NewExist=1 (Existing Business)
Portion Numeric Proportion of Gross Amount Guaranteed by SBA
RealEstate Text =1 if loan is backed by real estate, =0 otherwise
Recession Text =1 if loan is active during Great Recession, =0 otherwise
Selected Text =1 if the data are selected as training data to build model for assignment, =0 if the data are selected as testing data to validate model
Default Text =1 if MIS_Status=CHGOFF, =0 if MIS_Status=PIF (pay in full)

One nice feature of this method is that you have parameters you can control to make a better good looking graphic. For more information on embedding images in RMD, please visit https://yihui.org/knitr/options/.

In the next sections, we will perform frequency analysis for character variables and descriptive numeric measures for numerical variables.

2 Variable Inspection

I use an informal term variable inspection in this note to have a glance at the distribution of the individual variables of interest to be used in the analysis. We only want to use this data for sampling purposes for now. A categorical variable is needed to do stratified sampling.

2.1 Frequency Analysis for Category Variables

There are AAA categorical variables in the data set, variables XXX, YYY, and ZZZ are related to the geographic regions, the type of industry, and the size of the business. These variables can be used to define independent sub-populations.

Some of the categories may have only a small number of loans, we should combine small sub-populations in a meaningful way so that we can use the “balanced” variable for stratification.

For example, we can look at the variable NAICS. Since R is case sensitive, before you write code to select a variable for analysis, you may want to see the exact form of the variable name in your data set to avoid unnecessary frustration.

name.vec=names(loan)   # this is a character vector
name.mtx = matrix(name.vec, ncol=3)  # define a matrix 
colnames(name.mtx)=c("variable name", "variable name", "variable name" ) # assign column names: This is required!
kable(name.mtx)   # make a nice looking table: 
variable name variable name variable name
LoanNr_ChkDgt ApprovalFY LowDoc
Name Term ChgOffDate
City NoEmp DisbursementDate
State NewExist DisbursementGross
Zip CreateJob BalanceGross
Bank RetainedJob MIS_Status
BankState FranchiseCode ChgOffPrinGr
NAICS UrbanRural GrAppv
ApprovalDate RevLineCr SBA_Appv
                  # kable() is a fun function in knitr()

Next, I use the North America Industry Classification System (NAICS) as an example to demonstrate how to obtain a frequency table from R and what information you need to know in designing a stratified sampling. The following code chunk produces a frequency distribution (table).

naics =as.character(loan$NAICS)  # make a character vector
N=length(naics)                  # find the size of the data. 
f.table = -sort(-table(naics))   # sort the vector in descending order
n = length(f.table)              # find the number of distinct industries
n.0 = sum(f.table < 900)         # industry with less than 0.1% of the population size
# A note of length of R variable: the latest version of R has upped 
# the maximum length of variable names from 256 characters to a whopping 10,000. 
# We should try our best to give meaningful names to R variables.
kable(cbind(Population.size = N, Number.of.Industries=n, Sub.Pop.less.1000 = n.0))
Population.size Number.of.Industries Sub.Pop.less.1000
899164 1312 1140

In the above frequency distribution table, we observe the following

  • About \(87\%\) of NAICS classified industries have less than 900 applications. If the sample size is not large enough, \(87\%\) of the industries in the population will be unlikely represented in an SRS sample.

  • It makes sense to combine these small sub-populations with a larger similar population.

  • It also makes sense to define a study population by including a set of sub-populations with a certain size.

2.2 Summary of Numerical Variables

In general, descriptive analyses of the numerical variable mainly focus on the distribution of the variable. For example, for a multi-modal or an extremely skewed variable, we may want to discretize it and to make a discrete variable for analysis.

We can also discretize a numerical variable based on the distribution to make a stratification variable.

3 Initial Data Management

We will do some basic data management to create a data set that will be used as a study population. We can add a variable to the data set and used a sampling list. Some R functions are convenient to use for setting data based on observations (not variable!).

3.1 Add New Variable to a Dataframe

Here is an example. I would add an ID variable to the existing data frame so that you can use it as the sampling frame in the SRS sampling scheme.

pop.N = length(loan$LoanNr_ChkDgt)        # population size
loan$obs.ID = 1:pop.N                     # adding an observation ID to the dataframe
# Next, we choose first 10 row of a subset of variables to make a table for inspection.
kable(as.matrix(loan[1:10, c("obs.ID", "LoanNr_ChkDgt", "Name", "City", "State", "Bank")]))
obs.ID LoanNr_ChkDgt Name City State Bank
1 1000014003 ABC HOBBYCRAFT EVANSVILLE IN FIFTH THIRD BANK
2 1000024006 LANDMARK BAR & GRILLE (THE) NEW PARIS IN 1ST SOURCE BANK
3 1000034009 WHITLOCK DDS, TODD M. BLOOMINGTON IN GRANT COUNTY STATE BANK
4 1000044001 BIG BUCKS PAWN & JEWELRY, LLC BROKEN ARROW OK 1ST NATL BK & TR CO OF BROKEN
5 1000054004 ANASTASIA CONFECTIONS, INC. ORLANDO FL FLORIDA BUS. DEVEL CORP
6 1000084002 B&T SCREW MACHINE COMPANY, INC PLAINVILLE CT TD BANK, NATIONAL ASSOCIATION
7 1000093009 MIDDLE ATLANTIC SPORTS CO INC UNION NJ WELLS FARGO BANK NATL ASSOC
8 1000094005 WEAVER PRODUCTS SUMMERFIELD FL REGIONS BANK
9 1000104006 TURTLE BEACH INN PORT SAINT JOE FL CENTENNIAL BANK
10 1000124001 INTEXT BUILDING SYS LLC GLASTONBURY CT WEBSTER BANK NATL ASSOC

3.2 Accessing Observations Meeting Certain Conditions

Setting a data set based on variables is straightforward. you need the following command to define a subset with only the listed 6 variables.

loan[ , c("obs.ID", "LoanNr_ChkDgt", "Name", "City", "State", "Bank")]

If we define a subset with conditions in one or more variables, we can use a powerful function \(which()\). The following example defines a subset

subset.id = which(loan$Bank=="FIFTH THIRD BANK" | loan$Bank=="CENTENNIAL BANK") # "|" means "or".
centennial.5th3rd = loan[subset.id, ]
sizes = dim(centennial.5th3rd)
mtx=cbind(obs.n = sizes[1], var.n = sizes[2])
kable(mtx)
obs.n var.n
5628 28

3.3 Random Sampling

As you have already seen that SRS is the foundation of probabilistic sampling plans. There are two types of SRS: SRS with and without replacement. The built-in R function \(sample()\) can do both with and without replacement plans. It can also do weighted sampling.

sample(x, size, replace = FALSE, prob = NULL)
# x = vector
# size = sample size 
# replace = FALSE, without replacement
# replace = TRUE, with replacement
# prob =  a vector of probabilities (weights) 
          of corresponding observations 
          to do weighted sampling.

4 Some Useful Tips

4.1 Embed External Images Into RMD

For example, you can make an image of the variable list and embed it into this document. To do this, create an image and save it in a folder where you save this R markdown document. You then can use the following R code chunk to embed the image into a file in the format of html, word, or pdf when you knit this RMD document.

List of all variables and the description of each variable

List of all variables and the description of each variable