Topic 3 Running SQL in SAS and R

Although R and Python has different data wrangling libraries we can use to connect to databases for information extraction. Structural Query Language (SQL) is native to different DBMS. Some degree of proficiency in SQL crucial for every data professional.

This note use SAS PROX SQL and several R packages to run SQL in SAS and R.

3.1 Running SQL in SAS

The PROC SQL in SAS is powerful. We can run authentic SQL script with in SAS SQL. One can use free SAS Studio via SAS on-Demand. We will use three relational data tables in the following SAS code (URLs are also in the SAS code).

In order to make the code workable in SAS environment, we simply copy and paste the code to include it in this note. NOte that, SAS converted imported relational tables to SAS data sets. The resulting tables in PROC SAS are still SAS tables stored in the designated library. If a library reference is not given, the resulting SAS data set will be saved in the temporary library.

3.1.1 Loading Data

LIBNAME sql "/home/u50445699/STA551";

filename dat01 url 'https://pengdsci.github.io/datasets/AnimalSurvey/plots.csv';
filename dat02 url 'https://pengdsci.github.io/datasets/AnimalSurvey/species.csv';
filename dat03 url 'https://pengdsci.github.io/datasets/AnimalSurvey/surveys.csv';

*write a macro to load multiple data files;
%MACRO IMPORTCSV(in_csv, out_sas);
PROC IMPORT DATAFILE=&in_csv 
   DBMS=csv 
   OUT=sql.&out_sas 
   REPLACE;
RUN;
%MEND;

*call macros to load data files;
%IMPORTCSV(dat01, plots);
%IMPORTCSV(dat02, species);
%IMPORTCSV(dat03, survey);

3.1.2 Basic SQL Syntax and Clauses

####. First query with clause SELECT

(a). Create a table view

(b). Create a new table (SAS data set)

*Table view;   
PROC SQL;
/** The actual SQL code starts here **/
SELECT year
FROM sql.survey;
/***/
QUIT;


* create a new table;
PROC SQL;

CREATE TABLE sql.YMD AS
SELECT year, 
       month, 
       day
FROM sql.survey;

QUIT;

* create a new table: select all variables;
PROC SQL;

CREATE TABLE sql.survey_all AS
SELECT *
FROM sql.survey;

QUIT;

CAUTION: SQL uses “CREATE VIEW name_of_table AS” to create a view table that is saved in the database and can be used to look at, filter, and even update information.

3.1.2.1 Select unique values

* one variable;
PROC SQL;
SELECT DISTINCT year
FROM sql.survey;
QUIT;

* two variable;
PROC SQL;
SELECT DISTINCT year, 
                species_id
FROM sql.survey;
QUIT;

3.1.2.2 Calculated values

* Create a view;
PROC SQL;
SELECT year, 
       month, 
       day, 
       INPUT(weight, best.)/1000.0 AS wgt
FROM sql.survey;
QUIT;

* Create a table and define a new variable
  based on the calculated values;
PROC SQL;
CREATE TABLE sql.Add_new_var AS
SELECT year, 
       month, 
       day, 
       INPUT(weight, best.)/1000.0 AS wt_kilo
FROM sql.survey;
QUIT;


proc contents data = sql.surveys; run;

                         
PROC SQL;
SELECT plot_id, 
       species_id, 
       sex, 
       weight, 
       ROUND(INPUT(weight, best.)/ 1000.0, 0.01) /* ROUND() is a SAS function! */
FROM sql.survey;
QUIT;

3.1.2.3 Filtering

* subsetting by filtering - WHERE statement;
* Single condition;
PROC SQL;
SELECT *
FROM sql.survey
WHERE species_id='DM';
QUIT;


* Multiple conditions: AND;
PROC SQL;
SELECT *
FROM sql.survey
WHERE (year >= 2000) AND (species_id = 'DM');
QUIT;

* Multiple conditions: OR;
PROC SQL;
SELECT *
FROM sql.survey
WHERE (species_id = 'DM') OR (species_id = 'DO') OR (species_id = 'DS');
QUIT;

3.1.2.4 Special Keywords simplify WHERE statement

* use of keyword IN;
PROC SQL;
SELECT *
FROM sql.survey
WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS'));
QUIT;

3.1.2.5 Sorting Variables

* Ascending ordering - default;
PROC SQL;
SELECT *
FROM sql.species
ORDER BY taxa ASC;
QUIT;

* descending ordering;
PROC SQL;
SELECT *
FROM sql.species
ORDER BY taxa DESC;
QUIT;

* sorting multiple variables- nest sorting;
PROC SQL;
SELECT *
FROM sql.species
ORDER BY genus ASC, species ASC;
QUIT;

3.1.2.6 Order of Execution

*Clauses are written in a fixed order: SELECT, FROM, WHERE, then ORDER BY. ;
PROC SQL;
SELECT genus, species
FROM sql.species
WHERE taxa = 'Bird'
ORDER BY species_id ASC;
QUIT;

3.1.2.7 Summary Statistics with groups

*no group - Using the wildcard simply counts the number of records (rows);
PROC SQL;
SELECT COUNT(*)
FROM sql.survey;
QUIT;

* calculate the sum of a numerical variable;
PROC SQL;
SELECT COUNT(*) ,
       SUM(INPUT(weight, best.))
FROM sql.survey;
QUIT;

* calculate the sum of a numerical variable - adding names of summary statistics;
PROC SQL;
SELECT COUNT(*) AS sample_size,
       SUM(weight) AS total_weight
FROM sql.survey;
QUIT;

*summary statistics within subgroups - GROUP BY clause;
* This is equalent to a univeriable frequency table;
PROC SQL;
SELECT species_id, 
       COUNT(*)
FROM sql.survey
GROUP BY species_id;
QUIT;

3.1.2.8 HAVING clause based on aggregated variables

* conditioning on species size;
PROC SQL;
SELECT species_id, 
       COUNT(species_id)
FROM sql.survey
GROUP BY species_id
HAVING COUNT(species_id) > 10;
QUIT;

* conditioning on species size - a better code;
PROC SQL;
SELECT species_id, 
       COUNT(species_id) AS species_size
FROM sql.survey
GROUP BY species_id
HAVING species_size > 10;
QUIT;

3.1.2.9 Ordering aggregated results.

* sorting aggregated variables -- This DOES NOT work!
* Summary functions are restricted to the SELECT and HAVING clauses only;
PROC SQL;
SELECT species_id, 
       COUNT(*) 
FROM sql.survey
GROUP BY species_id
ORDER BY COUNT(species_id);
QUIT;

* sorting aggregated variables;
* use the new name in the ORDER BY clause;
PROC SQL;
SELECT species_id AS subtotal, 
       COUNT(*) 
FROM sql.survey
GROUP BY species_id
ORDER BY subtotal;
QUIT;

3.1.2.10 NULL - Working with Missing values

* keyword IS;
PROC SQL;
SELECT *
FROM sql.survey
WHERE species_id IS NULL;
QUIT;

* keyword IS NOT;
PROC SQL;
SELECT *
FROM sql.survey
WHERE species_id IS NOT NULL;
QUIT;

* for non-missing values, we use IS/IS NOT or =/!=;
* CAUTION: "=" and "==" both work in SQL, However, "==" does NOT in SAS.;
PROC SQL;
SELECT SUM(INPUT(weight, best.)), 
       COUNT(*), 
       SUM(INPUT(weight, best.))/COUNT(*)
FROM sql.survey
WHERE species_id = 'PE';
QUIT;

* for non-missing values, we use IS/IS NOT or =/!=;
* CAUTION: "=" and "==" both work in SQL, However, "==" does NOT in SAS.;
* != works in SQL, but not in SAS. ^= works in SAS;
PROC SQL;
SELECT SUM(INPUT(weight, best.)), 
       COUNT(*), 
       SUM(INPUT(weight, best.))/COUNT(*)
FROM sql.survey
WHERE species_id ^= 'PE';
QUIT;

3.1.2.11 Working with multiple tables: JOIN

*INNER JOIN;
*Need to use ALIAS to rename/name the data set since
 the files are stored in the SAS permanent libaray;
PROC SQL;
SELECT *
FROM sql.survey AS surveys
JOIN sql.species AS species
ON surveys.species_id = species.species_id;
QUIT;


* we can simply rename the tables as A and B using alias;
PROC SQL;
CREATE TABLE sql.INNERJOIN AS
SELECT *
FROM sql.survey AS A
JOIN sql.species AS B
ON A.species_id = B.species_id;
QUIT;

*left join;
PROC SQL;
CREATE TABLE sql.LEFTJOIN AS
SELECT *
FROM sql.survey AS A
LEFT JOIN sql.species AS B
ON A.species_id = B.species_id;
QUIT;

*right join;
PROC SQL;
CREATE TABLE sql.RIGHTJOIN AS
SELECT *
FROM sql.survey AS A
RIGHT JOIN sql.species AS B
ON A.species_id = B.species_id;
QUIT;

*full join;
PROC SQL;
CREATE TABLE sql.FULLTJOIN AS
SELECT *
FROM sql.survey AS A
FULL JOIN sql.species AS B
ON A.species_id = B.species_id;
QUIT;

* We can select some variables from individual tables
* then join the two sub tables.;
* CAUTION: We will NOT select any variables in species table 
*  to include in the new table;
PROC SQL;
SELECT A.species_id, 
       A.sex, 
       AVG(INPUT(a.hindfoot_length, best.)) as mean_foot_length  
FROM sql.survey  AS A
JOIN sql.species AS B
ON A.species_id=B.species_id 
WHERE taxa = 'Rodent' AND A.sex IS NOT NULL 
GROUP BY A.species_id, A.sex;
QUIT;

3.1.2.12 Creatng New Variables

The following code defines new variables using string functions in SQL. This method is sometimes used to define composite keys

PROC SQL;
SELECT *, 
       species_id||'-'||sex AS newKey
FROM sql.surveys;
QUIT;

3.1.2.13 Nest Queries

* nest queries;
* The SELECT ... FROM in the denominator is self-contained
* It is NOT affected by GROUP BY statement. The COUNT() function
* returns the total size of the data;
PROC SQL;
SELECT B.taxa, 
       100.0*COUNT(*)/(SELECT COUNT(*) FROM sql.survey)  AS Percentage
FROM sql.survey AS A
JOIN sql.species AS B
ON A.species_id = B.species_id 
GROUP BY taxa;
QUIT;

3.2 Running SAS in R

To run SQL clauses in R, we need to use several R libraries (installed and loaded in the above R setup code chunk). There are different ways to run SQL query in R. We only introduce one methods that is close to the authentic SQL code that can be run a DBMS.

3.2.1 Connect R to Existing Database

If there is an existing database, the following code connects R to the database.

con <- DBI::dbConnect(drv = odbc::odbc(),
                      Driver = "driver_name",
                      Server = "server_url",
                      Database = "database_name",
                      user = "user", #optional
                      password = "password") #optional

This section shows the three basic steps to run SQL in R using R Markdown starting with a set of relational tables.

  1. Load relational data tables as usual to R.

  2. Create a SQLite (relational) database that contain these relational table.

  3. Create R code chunk and connect to the created database using Chunk options.

3.2.2 Create SQLite Database with R

If modeling requires a data set that contains information from multiple relational data tables, we need to perform data management to aggregate the required information from different data tables. We can load the different data sets in different formats using appropriate R functions.

As an example, We use three ecological survey data sets to create a database.

#Load the sample data
plots <- read.csv("https://pengdsci.github.io/datasets/AnimalSurvey/plots.csv")
species <- read.csv("https://pengdsci.github.io/datasets/AnimalSurvey/species.csv")
surveys <- read.csv("https://pengdsci.github.io/datasets/AnimalSurvey/surveys.csv")

Next, we create a SQLit database using several R libraries.

#Create database
con <- dbConnect(drv = SQLite(),
                 dbname = ":memory:")

#store sample data in database
dbWriteTable(conn = con, 
             name = "plots",
             value = plots)

dbWriteTable(conn = con, 
             name = "species",
             value = species)

dbWriteTable(conn = con, 
             name = "surveys",
             value = surveys)
 
#remove the local data from the environment
rm(plots, species, surveys)

We can use table view function tbl() to explore the information of relational data tables in the database. Note that, we

tbl(src = con, #the source if the database connection profile
    c("surveys")) #the name of the table to preview
## # Source:   table<surveys> [?? x 10]
## # Database: sqlite 3.41.2 [:memory:]
##        X record_id month   day  year plot_id species_id sex   hindfoot_length weight
##    <int>     <int> <int> <int> <int>   <int> <chr>      <chr>           <int>  <int>
##  1     1         1     7    16  1977       2 NL         M                  32     NA
##  2     2         2     7    16  1977       3 NL         M                  33     NA
##  3     3         3     7    16  1977       2 DM         F                  37     NA
##  4     4         4     7    16  1977       7 DM         M                  36     NA
##  5     5         5     7    16  1977       3 DM         M                  35     NA
##  6     6         6     7    16  1977       1 PF         M                  14     NA
##  7     7         7     7    16  1977       2 PE         F                  NA     NA
##  8     8         8     7    16  1977       1 DM         M                  37     NA
##  9     9         9     7    16  1977       1 DM         F                  34     NA
## 10    10        10     7    16  1977       6 PF         F                  20     NA
## # ℹ more rows

3.2.3 Running SQL Queries in R Code chunks

To use SQL in RMarkdown, we need the following chunk options:

  1. sql
  2. connection = “database-name”
  3. output.var = “output-dataset-name”

If we create a data view only, we simply ignore option output.var =

Following are few examples of SQL queries based on the animal survey data tales in the database.

3.2.3.1 Subsetting and Duplicating Data

  1. Extract year, month and day from survey table
SELECT 
  surveys.year, surveys.month, surveys.Day
FROM 
 surveys /* pointer is not needed since it is in the database */
WHERE
  surveys.species_id IN ('NL', 'DM') AND
  surveys.sex = 'M'
  1. Duplicate a data and rename it
SELECT 
  surveys.*
FROM 
 surveys
  1. Create a table view (i.e., no data set will be created and saved)
SELECT 
  surveys.year, surveys.month, surveys.Day
FROM 
 surveys
WHERE
  surveys.species_id = 'NL' AND
  surveys.sex = 'M'

3.2.3.2 Define A New Variable

  1. Define a new variable with simple arithmetic operations
SELECT 
    surveys.plot_id, 
    surveys.species_id, 
    surveys.sex, 
    surveys.weight, 
    surveys.weight/100 AS wt_kilo  /*should not the pointer in front of the name of the new variable*/ 
FROM 
   surveys
  1. Define new variables using string functions in SQL
SELECT surveys.*, 
       surveys.species_id||'-'||surveys.sex AS newKey
FROM surveys
  1. Define new variables with aggregated information
SELECT surveys.species_id, 
       COUNT(surveys.species_id) AS species_ctr
FROM surveys
GROUP BY surveys.species_id
HAVING species_ctr > 10

3.2.3.3 Sorting Variables

  1. Sort data based on the summarized statistics of a variable

Summary functions are restricted to the SELECT and HAVING clauses only;

SELECT surveys.species_id
FROM surveys
GROUP BY surveys.species_id
ORDER BY COUNT(surveys.species_id);
  1. Sort data based on a new variable defined using summarized statistics of a variable.
/* create a table view*/
SELECT surveys.species_id AS subtotal, 
       COUNT(*) 
FROM surveys
GROUP BY surveys.species_id
ORDER BY subtotal;

3.2.3.4 Join Tables

This section introduce commonly used join operation to merge tables using the common key(s).

  1. Inner Join
SELECT *
FROM surveys AS A
JOIN species AS B
ON A.species_id = B.species_id;
  1. Left Join
SELECT *
FROM surveys AS A
LEFT JOIN species AS B
ON A.species_id = B.species_id;
  1. Right Join
SELECT *
FROM surveys AS A
RIGHT JOIN species AS B
ON A.species_id = B.species_id;
  1. Full Join
SELECT *
FROM surveys AS A
FULL JOIN species AS B
ON A.species_id = B.species_id;
  1. Join sub-tables
SELECT A.species_id, 
       A.sex, 
       AVG(A.weight) as mean_wgt  
FROM surveys AS A
JOIN species AS B
ON A.species_id=B.species_id 
WHERE taxa = 'Rodent' AND A.sex IS NOT NULL 
GROUP BY A.species_id, A.sex; 

3.2.3.5 Subqueries

  1. Sample size
SELECT COUNT(*) 
FROM surveys
  1. Relative Frequency with sub-query
SELECT B.taxa, 
       100.0*COUNT(*)/(SELECT COUNT(*) FROM surveys)  AS Percentage
FROM surveys AS A
JOIN species AS B
ON A.species_id = B.species_id 
GROUP BY taxa;