Topic 4 Descriptive Statistics

This note outlines the basic descriptive statistics under the following topics.

Data Types
Tabular and graphic summary of data
Numerical summary of data

4.1 Data Types

Data types can be classified in several ways. We use the following simple classification.

Categorical Variables - The values of these variables have no numerical meaning, in the sense that one cannot perform arithmetic operations on them.
- Ordinal categorical variables - the values have a natural order. For example, course letter grades: A, B, C, D, and F.
- Nominal categorical variables - the values do not have a natural order. For example, majors in a college: Mathematics, finance, music, biology, etc.
Numerical Variables - As indicated in the name, the values of numerical variables are numbers.
- Discrete variables - one can find two values of such a variable such that no meaningful value falls between them. For example, the number of children in a household: 1, 2, 3, 4, …. No household has 2.5 children.
- Continuous variables - for any two distinct values of such a variable, any value between the two is meaningful. For example, consider two arbitrarily selected human body temperatures (in Fahrenheit), 97.4 and 97.5. Any number between 97.4 and 97.5 could be the temperature of someone in the population (although that person may not be part of the sample).
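The four data types can be represented directly in R. The following is a minimal sketch using toy values (all object names and values are made up for illustration and are not taken from the course data sets).

```r
# Toy values (hypothetical, not taken from the course data sets)
grades  = factor(c("B", "A", "C", "A", "F"),
                 levels = c("F", "D", "C", "B", "A"),
                 ordered = TRUE)                        # ordinal: levels are ordered
majors  = factor(c("Mathematics", "Music", "Biology"))  # nominal: no natural order
n.child = c(0L, 2L, 1L, 3L)                             # discrete: integer counts
temp.f  = c(97.4, 98.6, 97.5)                           # continuous: any value in a range

is.ordered(grades)       # TRUE
is.ordered(majors)       # FALSE
grades[1] < grades[2]    # TRUE: "B" is below "A" in the declared order
max(grades)              # A, the highest grade observed
mean(n.child)            # 1.5: arithmetic is meaningful for numerical variables
```

Comparisons such as < or max() work for the ordered factor but produce a warning or an error for the unordered one, which mirrors the distinction between ordinal and nominal variables.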

4.2 Tabular and Graphic Summary

Both tabular and graphic summaries are powerful and effective tools to visualize the (shape of the) distribution of the data.

4.2.1 Categorical Data

Example 1: [Status of Endangered Species]: The data were extracted from the U.S. Fish & Wildlife Service ECOS (Environmental Conservation Online System) and can be found at https://raw.githubusercontent.com/pengdsci/STA501/main/Data/EndangeredSpecies.csv

We want to summarize the species groups of the listed species (the speciesGroup column in the data set).

4.2.1.1 Frequency Table

Since the values of categorical data sets are labels, constructing frequency tables of categorical data is straightforward.

# library(knitr)        # provides kable(); placed at the beginning of the document
speURL = "https://raw.githubusercontent.com/pengdsci/STA501/main/Data/EndangeredSpecies.csv"
Species = read.csv(speURL, header = TRUE)   # read the csv data from the URL
kable(t(Species[1:4, ]))          # list the first 4 rows (transposed) of the data

	1	2	3	4
scientificName	Acanthorutilus handlirschi	Accipiter fasciatus natalis	Accipiter francesii pusillus	Accipiter gentilis laingi
commonName	Cicek (minnow)	Christmas Island goshawk	Anjouan Island sparrowhawk	Queen Charlotte goshawk
criticalHabitat	N/A	N/A	N/A	N/A
speciesGroup	Fishes	Birds	Birds	Birds
Status	Endangered	Endangered	Endangered	Threatened
specialRules	N/A	N/A	N/A	N/A
whereListed	Wherever found	Wherever found	Wherever found	British Columbia Canada

Next, we create a frequency table that includes all four types of frequencies: frequency, relative frequency, cumulative frequency, and cumulative relative frequency.

speciesGroup = Species$speciesGroup   # extract the species-group column
freq = table(speciesGroup)            # frequency count
rel.freq = freq/sum(freq)             # relative frequency
cum.freq = cumsum(freq)               # cumulative frequency
cum.rel.freq = cum.freq/sum(freq)     # cumulative relative frequency
freq.table = cbind(freq =freq, 
                   rel.freq = rel.freq,
                   cum.freq = cum.freq,
                   cum.rel.freq = cum.rel.freq)    
kable(freq.table)     # kable() makes a nice-looking table

	freq	rel.freq	cum.freq	cum.rel.freq
Amphibians	45	0.0307377	45	0.0307377
Arachnids	17	0.0116120	62	0.0423497
Birds	342	0.2336066	404	0.2759563
Clams	124	0.0846995	528	0.3606557
Corals	24	0.0163934	552	0.3770492
Crustaceans	28	0.0191257	580	0.3961749
Fishes	208	0.1420765	788	0.5382514
Insects	94	0.0642077	882	0.6024590
Mammals	381	0.2602459	1263	0.8627049
Reptiles	146	0.0997268	1409	0.9624317
Snails	55	0.0375683	1464	1.0000000
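The same four frequencies can be reproduced on a small example that runs without downloading any data. This is only a sketch: the labels below are toy values, and prop.table() is used as a base R shortcut for freq/sum(freq).

```r
# Toy labels (hypothetical), standing in for a categorical column
group = c("Birds", "Fishes", "Birds", "Mammals", "Birds", "Fishes")

freq         = table(group)        # frequency count
rel.freq     = prop.table(freq)    # same as freq / sum(freq)
cum.freq     = cumsum(freq)        # cumulative frequency
cum.rel.freq = cumsum(rel.freq)    # cumulative relative frequency

toy.table = cbind(freq, rel.freq, cum.freq, cum.rel.freq)
print(round(toy.table, 3))
```

Two quick sanity checks apply to any such table: the relative frequencies add up to 1, and the last cumulative frequency equals the number of observations.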

4.2.1.2 Bar Chart and Pie Chart

We now use R to create both charts from the frequency table constructed in the previous subsection.

We first draw a simple pie chart. You can add different colors and additional information to the chart. You can visit https://www.statmethods.net/graphs/pie.html for more examples.

freq = table(speciesGroup)
group = names(freq)
pie(freq, labels = group, main="Pie Chart of Species Group")

The pie chart has too many slices, so it is hard to read and there is no room to add the frequencies to it. It is not a good visualization of these data. Next, we create a bar chart to represent the distribution of the same data set.

freq = table(speciesGroup)
group = names(freq)         # categories
barplot(freq,               # frequency table
        names.arg=group,    # tick marks
        las=3,              # draw the axis labels vertically
        main="Distribution of Species Group" )
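A bar chart is often easier to read when the bars are ordered by frequency. The following sketch shows one way to do this in base R with sort(); the labels are toy values used only for illustration.

```r
# Toy labels (hypothetical), standing in for a categorical column
group = c("Fishes", "Birds", "Mammals", "Birds", "Fishes", "Birds")

freq = sort(table(group), decreasing = TRUE)   # largest category first
barplot(freq,
        las = 3,                               # vertical axis labels
        main = "Toy Distribution, Sorted by Frequency")
```

Sorting is a presentation choice for nominal data only. For ordinal data, the natural order of the categories should usually be kept.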

The above bar plot was created with the base R function barplot(). Functions in other R packages can produce bar charts that contain additional information. For example, the function BarChart() in the package {lessR} builds the chart directly from the original data values, whereas barplot() takes a frequency table.

# library(lessR)        # placed at the beginning of the document
BarChart(speciesGroup,  rotate_x=45)

More examples of BarChart() can be found in the lessR vignette at https://cran.r-project.org/web/packages/lessR/vignettes/BarChart.html.

4.2.2 Numerical Data

To summarize numerical data sets, we use frequency tables and histograms to visualize the underlying distributions.

We will use the following data set to illustrate the steps to construct frequency tables and histograms using R. The data set https://raw.githubusercontent.com/pengdsci/STA501/main/Data/diet.csv was used to study the effect of three different diets on weight loss.

4.2.2.1 Frequency Tables

Unlike categorical data, whose values are category labels, numerical data values must first be grouped. The frequency table is then constructed from the resulting data groups.

Think of the interval between the minimum and maximum data values as a data window, and cut this window into several small data windows of equal width. The data values that fall in each small window form a data group. As an illustration, assume a data set with a minimum value of 21 and a maximum value of 74, and split the data window [21, 74] into five small windows of equal width. Each window then has width (74 - 21)/5 = 10.6, and the cut-off points are [21.0, 31.6, 42.2, 52.8, 63.4, 74.0] (including the minimum and maximum values).

Given the minimum, the maximum, and the number of small windows, the R function seq(min, max, length = number-of-windows + 1) returns the cut-off points. For example, the following code yields the cut-offs for the illustration above.

# round off the cut-offs to 1 decimal point
cutoff = round(seq(21, 74, length =5+1),1)
# kable() produces a nice looking table in PDF 
kable(data.frame(cutoff), align = 'l')

cutoff
21.0
31.6
42.2
52.8
63.4
74.0
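The cut-offs returned by seq() can be checked by hand: the common width is (max - min)/(number of windows), and each cut-off is the minimum plus a multiple of that width. A minimal sketch (the object names lo, hi, and k are introduced here for illustration):

```r
lo = 21; hi = 74; k = 5             # minimum, maximum, number of windows
width  = (hi - lo) / k              # (74 - 21) / 5 = 10.6
manual = lo + width * (0:k)         # 21.0 31.6 42.2 52.8 63.4 74.0
by.seq = seq(lo, hi, length.out = k + 1)

all.equal(manual, by.seq)           # TRUE: both approaches agree
round(manual, 1)
```

Note that k windows always require k + 1 cut-off points.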

Example 2: [Effectiveness of Diets Data] We are interested in creating a histogram of the weights of all participants in the study before starting the three diets. We first create a frequency table with 6 rows, that is, we create 6 small data windows to define 6 groups. The R function cut(x = data-set, breaks = cutoff-points) assigns each data value to its data window.

dietURL = "https://raw.githubusercontent.com/pengdsci/STA501/main/Data/diet.csv"
diet = read.csv(dietURL, header=TRUE)
pre.weight=diet$initial.weight   # extract pre.weight from the data set.
### calculate the cut-offs that yield 6 small data windows with equal widths
cutoff.pt = seq(min(pre.weight), max(pre.weight), length = 6+1) 
cutoff.pt = round(cutoff.pt, 1)       # rounding off to keep 1 decimal place
### use R function **cut()** to split the data window into 6 small data windows
data.group = cut(x = pre.weight, breaks=cutoff.pt, include.lowest = TRUE)
## use R function **table** to get the frequency table
freq.count=table(data.group)    # regular frequency counts
kable(freq.count, align = 'l')

data.group	Freq
[58,63]	12
(63,68]	14
(68,73]	17
(73,78]	15
(78,83]	11
(83,88]	7
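The argument include.lowest = TRUE matters because cut() builds intervals that are open on the left by default, so the minimum value would otherwise fall outside every interval. The sketch below illustrates this with toy weights (hypothetical values, chosen to match the break points shown above).

```r
x   = c(58, 60, 63, 64, 70, 73, 88)       # toy weights (hypothetical)
brk = c(58, 63, 68, 73, 78, 83, 88)       # cut-off points

g.default = cut(x, breaks = brk)                         # first interval is (58,63]
g.closed  = cut(x, breaks = brk, include.lowest = TRUE)  # first interval is [58,63]

sum(is.na(g.default))    # 1: the minimum 58 was not assigned to any interval
sum(is.na(g.closed))     # 0: every value is assigned
table(g.closed)          # counts 3, 1, 2, 0, 0, 1
```

A related caution: rounding the cut-offs can move the first cut-off slightly above the minimum (or the last slightly below the maximum), which would also produce NA groups, so it is worth checking that the frequencies add up to the sample size.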

We can also use the same steps to find relative and cumulative frequencies.

freq = table(data.group)              # frequency count
rel.freq = freq/sum(freq)             # relative frequency
cum.freq = cumsum(freq)               # cumulative frequency
cum.rel.freq = cum.freq/sum(freq)     # cumulative relative frequency
freq.table = cbind(freq =freq, 
                   rel.freq = round(rel.freq,3),   # keep 3 decimal places
                   cum.freq = cum.freq,
                   cum.rel.freq = round(cum.rel.freq ,3)) # keep 3 decimal places
kable(freq.table, align = 'l')

	freq	rel.freq	cum.freq	cum.rel.freq
[58,63]	12	0.158	12	0.158
(63,68]	14	0.184	26	0.342
(68,73]	17	0.224	43	0.566
(73,78]	15	0.197	58	0.763
(78,83]	11	0.145	69	0.908
(83,88]	7	0.092	76	1.000

4.2.2.2 Graphic Summary - Histogram

As mentioned earlier, we use a histogram to visualize the distribution of a numerical data set. The R function hist(x = data-set, breaks = cutoff) draws the histogram. We use the same pre.weight and the same cut-offs obtained in the previous subsection.

hist(pre.weight, breaks = cutoff.pt,
     main = "Histogram of Pre-Weight")

We can see that the distribution of pre-weights is skewed to the right since the above histogram has a long right tail.
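A histogram is the graphic version of the frequency table: the bar heights are the frequency counts. The sketch below checks this on toy weights (hypothetical values); hist() is called with plot = FALSE so that it only returns the counts.

```r
x   = c(58, 60, 63, 64, 70, 73, 88)       # toy weights (hypothetical)
brk = seq(58, 88, by = 5)                 # 58 63 68 73 78 83 88

h   = hist(x, breaks = brk, plot = FALSE) # compute the bins without drawing
tab = table(cut(x, breaks = brk, include.lowest = TRUE))

h$counts                                  # 3 1 2 0 0 1
all(h$counts == as.vector(tab))           # TRUE: same counts as the frequency table
```

The two agree because hist() also uses right-closed intervals and includes the lowest value by default.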

4.3 Numerical Summary of Numerical Data

Three families of measures are outlined in this section: central tendency, variation, and location. We will still use pre-weight as an example to show how to use basic R functions to calculate these numerical measures.

4.3.1 Central Tendency

We will not list all relevant measures of center. The two R functions mean() and median() are used to calculate the mean and median of a given data set.

Mean - the average of the values in the data set.

avg.pre.weight = mean(pre.weight)
kable(data.frame(avg.pre.weight), align = 'l')

avg.pre.weight
72.28947

Median - a cut-off value that splits the sorted data values into two parts such that at least 50% of the data values are less than or equal to the cut-off and at least 50% are greater than or equal to it.

middle.numer = median(pre.weight)
kable(data.frame(middle.numer), align = 'l')

middle.numer
72
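To make the definition concrete, the median can be computed by hand from the sorted data: the middle value when the sample size is odd, and the average of the two middle values when it is even. The helper median.by.hand() below is a hypothetical function written for illustration, applied to toy data.

```r
# Hypothetical helper: median from the sorted data
median.by.hand = function(x) {
  s = sort(x)
  n = length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

odd  = c(7, 1, 5)          # sorted: 1 5 7
even = c(7, 1, 5, 3)       # sorted: 1 3 5 7

median.by.hand(odd)        # 5
median.by.hand(even)       # 4
```

Both results agree with the built-in median().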

The more general function quantile() can also be used to find the median. In fact, quantile() can compute any percentile, and the 50th percentile is the median.

quantile.mid.num = quantile(pre.weight, # data set name
                            0.5,        # percentile, 0.5 = 50%
                            type=2      # there are different interpolations. 
                                        # We use type 2.
                            )
fifty.percentile= data.frame(quantile.mid.num)
kable(fifty.percentile, align = 'l')

	quantile.mid.num
50%	72

4.3.2 Variations

We use R functions and variable pre-weight to calculate variance, standard deviation, and inter-quartile range (IQR).

Variance - measures the spread of the data. The R function var() calculates the sample variance.

sample.var = var(pre.weight)
kable(data.frame(sample.var), align = 'l')

sample.var
63.59509

Standard Deviation - measures the spread of the data and is equal to the square root of the variance.

stdev = sd(pre.weight)
kable(data.frame(stdev), align = 'l')

stdev
7.974653
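Both measures can be verified from their definitions: the sample variance divides the sum of squared deviations from the mean by n - 1, and the standard deviation is its square root. A minimal sketch on toy data:

```r
x = c(1, 3, 5, 7, 9)                            # toy data
n = length(x)

var.by.hand = sum((x - mean(x))^2) / (n - 1)    # (16 + 4 + 0 + 4 + 16) / 4 = 10
sd.by.hand  = sqrt(var.by.hand)                 # square root of the variance

all.equal(var.by.hand, var(x))                  # TRUE
all.equal(sd.by.hand, sd(x))                    # TRUE
```

The divisor n - 1 (rather than n) is what makes var() a sample variance.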

Inter-quartile Range (IQR) - the range of the middle 50% of the data values. That is, we throw out the bottom 25% and the top 25% of the data values and use the difference between the maximum and the minimum of the remaining values to define the IQR. The idea is illustrated in the following figure [the code for the figure is excluded from the output file; it can be found in the RMD document].

where Q1 and Q3 are the first and third quartiles, which can be found using quantile(). The inter-quartile range is defined to be IQR = Q3 - Q1. We still use pre.weight to illustrate how to find the IQR with the following code.

IQR = quantile(pre.weight, 0.75, type = 2) - quantile(pre.weight, 0.25, type = 2)
IQR = as.vector(IQR)
kable(data.frame(IQR), align = 'l')

IQR
12
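Base R also provides a built-in function for this calculation, stats::IQR(), which accepts the same type argument as quantile(). The sketch below compares it with the manual calculation on toy data; note that its default is type 7, so the type has to be given explicitly to match the type 2 result.

```r
x = c(1, 3, 5, 7, 9, 11, 13, 15)              # toy data

q = quantile(x, c(0.25, 0.75), type = 2)      # Q1 = 4, Q3 = 12
iqr.by.hand = as.vector(q[2] - q[1])          # 12 - 4 = 8
iqr.builtin = stats::IQR(x, type = 2)         # 8

iqr.by.hand
iqr.builtin
stats::IQR(x)                                 # 7 with the default type 7
```

Writing stats::IQR() with the package prefix avoids any confusion with a variable that is also named IQR.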

4.3.3 Location

4.3.3.1 Z-score Transformation

The z-score transformation converts any given numerical data set to a new standardized data set such that the new data set has zero mean and unit standard deviation. Let’s denote the original data set by \(X= \{x_1, x_2, \cdots, x_n\}\) and let \(Z =\{z_1, z_2, \cdots, z_n \}\) be the standardized data set. The formula that transforms X to Z is given by

\[ z_i = \frac{x_i-\bar{x}}{s} \]

where \(\bar{x}\) is the sample mean and \(s\) is the standard deviation of \(X\).

I use the toy data \(X = \{1,3,5,7,9 \}\) as an example to perform the z-score transformation.

X= c(1, 3, 5, 7, 9)     # type in data values
xbar = mean(X)          # sample mean
s = sd(X)               # sample standard deviation
Z=(X-xbar)/s            # z-score transformation
kable(data.frame(Z), align = 'l', format = "pipe")   # make a nice-looking table

Z
-1.2649111
-0.6324555
0.0000000
0.6324555
1.2649111
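We can confirm that the standardized values have zero mean and unit standard deviation, and that the base R function scale() produces the same z-scores. A minimal sketch on the same toy data:

```r
X = c(1, 3, 5, 7, 9)                  # toy data
Z = (X - mean(X)) / sd(X)             # z-score transformation

mean(Z)                               # 0 (up to rounding error)
sd(Z)                                 # 1

Z.scale = as.vector(scale(X))         # scale() centers and scales by default
all.equal(Z, Z.scale)                 # TRUE
```

scale() returns a one-column matrix with attributes, which is why as.vector() is applied before the comparison.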

4.3.3.2 Quantile

A k-th quantile of a data set (also called the sample k-th quantile) is defined as a cut-off value that splits the data into two parts such that at least \(100k\%\) of the data values are less than or equal to the cut-off and at least \(100(1-k)\%\) of the data values are greater than or equal to the cut-off, where \(0 < k < 1\). Special quantiles are the quartiles (quarters) and the percentiles (hundredths).

Please keep in mind that the calculation of a quantile is based on the sorted data and involves interpolation. R implements several interpolation rules, selected with the type argument of quantile(), and the differences between them are minor. A simple and commonly used rule is type 2, which we request explicitly as in quantile(dataset, p, type = 2), where p is the proportion (for example, 0.25 for the 25th percentile). If type is omitted, R uses its default, type 7.
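To see how the choice of type can change the answer, the sketch below computes the first quartile of a small toy data set with types 1, 2, and 7.

```r
x = c(1, 2, 3, 4)                     # toy data

q25 = sapply(c(1, 2, 7), function(ty) {
  quantile(x, 0.25, type = ty, names = FALSE)
})
names(q25) = c("type1", "type2", "type7")
q25                                   # 1.00 1.50 1.75
```

Type 1 returns an observed value, type 2 averages the two neighbouring values when the position falls exactly between them, and type 7 interpolates linearly. With larger data sets the differences are usually much smaller.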

Example [Pre-weight data] - We want to find the 25th and 68th percentiles of pre-weights.

q.25 = quantile(pre.weight, 0.25, type = 2)
kable(data.frame(q.25), align = 'l')

	q.25
25%	66

q.68 = quantile(pre.weight, 0.68, type = 2)
kable(data.frame(q.68), align = 'l')

	q.68
68%	77

We can call quantile() to find the two quantiles simultaneously.

q.25.68 = quantile(pre.weight, c(0.25, 0.68), type = 2)
kable(data.frame(q.25.68), align = 'l')

	q.25.68
25%	66
68%	77

4.3.3.3 Five-number Summary and Box-plot

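The five-number summary consists of 5 numbers: minimum (0%), 1st quartile (25%), 2nd quartile (50%, median), 3rd quartile (75%), and maximum (100%). The R function fivenum() is dedicated to finding the five-number summary. It computes the quartiles as Tukey's hinges, which can differ slightly from the quartiles returned by quantile().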

fivenum(pre.weight)

## [1] 58 66 72 78 88

We can also use quantile() to find the five-number summary, as follows.

quantile(pre.weight, c(0, 0.25, 0.5, 0.75, 1), type = 2)

##   0%  25%  50%  75% 100% 
##   58   66   72   78   88
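The two approaches agree for the pre-weight data, but they are not guaranteed to agree in general, because fivenum() computes the quartiles as Tukey's hinges. The sketch below shows a small toy data set on which the two results differ.

```r
x = 1:7                                               # toy data

fivenum(x)                                            # 1.0 2.5 4.0 5.5 7.0
quantile(x, c(0, 0.25, 0.5, 0.75, 1), type = 2,
         names = FALSE)                               # 1 2 4 6 7
```

The minimum, median, and maximum always agree; only the two quartiles may differ, and the difference shrinks as the sample size grows.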

A box plot is the graphic representation of the five-number summary.

R function boxplot() will make the box-plot. We only present a simple box plot in the following.

boxplot(pre.weight, horizontal = TRUE)
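The numbers behind a box plot can be retrieved with the base R function boxplot.stats(). The sketch below uses toy data with one extreme value to show that points beyond 1.5 times the box length from the box are reported separately as outliers, and that the whiskers stop at the most extreme non-outlying values.

```r
x = c(1, 2, 3, 4, 5, 6, 7, 100)       # toy data with one extreme value

bs = boxplot.stats(x)                 # default coef = 1.5
bs$stats                              # 1.0 2.5 4.5 6.5 7.0 (whiskers, hinges, median)
bs$out                                # 100 is flagged as an outlier

boxplot(x, horizontal = TRUE)         # the outlier is drawn as a separate point
```

For data without outliers, bs$stats coincides with the output of fivenum().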

4.4 Assignment - Descriptive Statistics

The Diabetes data set to be used in this assignment is taken from Vanderbilt’s Biostatistics Datasets.

The following is the description from the web page:

These data are courtesy of Dr. John Schorling, Department of Medicine, University of Virginia School of Medicine. The data consists of 19 variables on 403 subjects from 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. According to Dr. John Hong, Diabetes Mellitus Type II (adult-onset diabetes) is associated most strongly with obesity. The waist/hip ratio may be a predictor of diabetes and heart disease. DM II is also associated with hypertension - they may both be part of “Syndrome X”. The 403 subjects were the ones who were actually screened for diabetes. Glycosolated hemoglobin > 7.0 is usually taken as a positive diagnosis of diabetes. For more information about this study see

Willems JP, Saunders JT, DE Hunt, JB Schorling: Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90:814-820; 1997

Schorling JB, Roach J, Siegel M, Baturka N, Hunt DE, Guterbock TM, Stewart HL: A trial of church-based smoking cessation interventions for rural African Americans. Preventive Medicine 26:92-101; 1997.

diaURL = "https://raw.githubusercontent.com/pengdsci/STA501/main/Data/diabetes.csv"
diabetes = read.csv(diaURL, header = TRUE)
kable(t(head(diabetes)))

	1	2	3	4	5	6
id	1000	1001	1002	1003	1005	1008
chol	203	165	228	78	249	248
stab.glu	82	97	92	93	90	94
hdl	56	24	37	12	28	69
ratio	3.6	6.9	6.2	6.5	8.9	3.6
glyhb	4.31	4.44	4.64	4.63	7.72	4.81
location	Buckingham	Buckingham	Buckingham	Buckingham	Buckingham	Buckingham
age	46	29	58	67	64	34
gender	female	female	female	male	male	male
height	62	64	61	67	68	71
weight	121	218	256	119	183	190
frame	medium	large	large	large	medium	large
bp.1s	118	112	190	110	138	132
bp.1d	59	68	92	50	80	86
bp.2s	NA	NA	185	NA	NA	NA
bp.2d	NA	NA	92	NA	NA	NA
waist	29	46	49	33	44	36
hip	38	48	57	38	41	42
time.ppn	720	360	180	480	300	195

We can see from the first 6 observations that, apart from the identifier id, there are 15 numerical variables and 3 categorical variables (location, gender, and frame). Variables bp.2s and bp.2d have missing values. To complete this week’s assignment, you need to choose one numerical variable and one categorical variable with NO missing values.

The following code shows how to extract variables from the data frame. I will use the two variables with missing values as an example. You can modify the code to extract your variables for the assignment.

bp.2s <- diabetes$bp.2s
bp.2d <- diabetes$bp.2d
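Before choosing variables, it helps to count the missing values in every column. The sketch below does this on a small toy data frame (three columns typed in from the first four rows shown above; the object name toy is made up for illustration). The same two lines can be applied to the full diabetes data frame.

```r
# Toy data frame built from a few of the values displayed above
toy = data.frame(
  chol   = c(203, 165, 228, 78),
  gender = c("female", "female", "female", "male"),
  bp.2s  = c(NA, NA, 185, NA)
)

na.count = colSums(is.na(toy))        # number of missing values per column
na.count                              # chol 0, gender 0, bp.2s 3
names(na.count)[na.count == 0]        # columns with no missing values
sapply(toy, class)                    # helps tell numerical from categorical columns
```

Only the columns listed by the second-to-last line are candidates for the assignment.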

4.4.1 Summarizing Categorical Data

Use the categorical variable you selected to perform the following analysis:
- Construct a relative frequency table. Write a few sentences to describe the distribution of the variable. Note that you are encouraged to construct a frequency table with all four types of frequencies as I did in the class note.
- Construct a pie-chart to represent the distribution of the categorical variable.

Use the numerical variable you chose from the diabetes data to complete the following tasks:
- Construct a relative frequency table of the numerical variable with 10 categories. In other words, the frequency table should have 10 rows. You are encouraged to include all 4 frequencies in the table. Please provide a brief description of the relative frequencies.
- Construct a histogram of the numerical variable with 10 vertical bars. In other words, the histogram is a geometric representation of the frequency table. Explain the distribution of the variable. Is it skewed to the left or the right?
- Construct a box-plot and explain it. That is, can you tell whether the distribution is skewed to the right or the left?