🏠 Home
Author: Bhanoji Duppada | 2025-07-26 13:32:16

Dplyr: The R Equivalent of DATA Step + PROC SQL

Introduction

If you're transitioning from Clinical SAS to R, you’ve likely spent countless hours mastering the DATA step and PROC SQL for data manipulation. The good news? R’s dplyr package provides an intuitive, efficient, and powerful alternative to these SAS staples.

dplyr is part of the tidyverse, a collection of R packages designed for data science. It simplifies data wrangling with a consistent syntax that mirrors SQL and DATA step operations—filtering, sorting, joining, summarizing, and transforming data.

In this chapter, you’ll learn:
- How dplyr verbs replace traditional SAS procedures.
- Key functions to manipulate data frames like SAS datasets.
- Practical comparisons between SAS and R for common data tasks.

By the end, you’ll see why dplyr is a game-changer for SAS programmers embracing R.


Key dplyr Verbs and Their SAS Equivalents

dplyr organizes data manipulation around verbs—functions named for their actions. Below is a mapping between dplyr and SAS:

R (dplyr) SAS Equivalent Purpose
filter() WHERE (DATA Step/PROC SQL) Subset rows based on conditions.
select() KEEP/DROP (DATA Step) Select or remove columns.
mutate() DATA Step calculations Create or modify columns.
arrange() PROC SORT Sort data.
summarize() + group_by() PROC MEANS/PROC SUMMARY Aggregate data by groups.
left_join() PROC SQL joins Merge datasets.

Practical Examples: From SAS to R

1. Filtering Rows (filter() vs. WHERE)

SAS:

DATA subset;  
    SET original_data;  
    WHERE age > 30 AND sex = 'M';  
RUN;  

R (dplyr):

library(dplyr)  
subset <- original_data %>%  
    filter(age > 30, sex == "M")  

Key Notes:
- %>% (pipe operator) chains operations, similar to SAS RUN steps.
- Conditions in filter() use R’s logical operators (==, >, &).


2. Selecting Columns (select() vs. KEEP/DROP)

SAS:

DATA selected_cols;  
    SET original_data;  
    KEEP subject_id age treatment;  
RUN;  

R (dplyr):

selected_cols <- original_data %>%  
    select(subject_id, age, treatment)  

Pro Tip: Use - to exclude columns (like DROP in SAS):

dropped_cols <- original_data %>%  
    select(-address, -screening_date)  

3. Creating/Modifying Variables (mutate() vs. DATA Step)

SAS:

DATA modified;  
    SET original_data;  
    bmi = (weight / (height**2)) * 703;  
    IF bmi >= 30 THEN obese_flag = 1; ELSE obese_flag = 0;  
RUN;  

R (dplyr):

modified <- original_data %>%  
    mutate(
        bmi = (weight / (height^2)) * 703,  
        obese_flag = ifelse(bmi >= 30, 1, 0)  
    )  

Bonus: Use case_when() for multi-condition logic (like SAS SELECT-WHEN):

modified <- original_data %>%  
    mutate(
        bmi_category = case_when(
            bmi < 18.5 ~ "Underweight",  
            bmi < 25 ~ "Normal",  
            bmi < 30 ~ "Overweight",  
            TRUE ~ "Obese"  
        )  
    )  

4. Sorting Data (arrange() vs. PROC SORT)

SAS:

PROC SORT DATA=original_data;  
    BY descending age treatment;  
RUN;  

R (dplyr):

sorted_data <- original_data %>%  
    arrange(desc(age), treatment)  

5. Aggregating Data (summarize() + group_by() vs. PROC MEANS)

SAS:

PROC MEANS DATA=original_data N MEAN STD;  
    CLASS treatment;  
    VAR age weight;  
RUN;  

R (dplyr):

summary_stats <- original_data %>%  
    group_by(treatment) %>%  
    summarize(
        n = n(),  
        mean_age = mean(age, na.rm = TRUE),  
        sd_weight = sd(weight, na.rm = TRUE)  
    )  

Note: na.rm = TRUE excludes missing values (like SAS NMISS handling).


6. Merging Datasets (*_join() vs. PROC SQL)

SAS (Inner Join):

PROC SQL;  
    CREATE TABLE merged_data AS  
    SELECT a.*, b.lab_value  
    FROM demog_data a  
    INNER JOIN lab_data b ON a.subject_id = b.subject_id;  
QUIT;  

R (dplyr):

merged_data <- demog_data %>%  
    inner_join(lab_data, by = "subject_id")  

Join Types in dplyr:
- left_join(): Keep all rows from the left table (like SAS LEFT JOIN).
- right_join(): Keep all rows from the right table.
- full_join(): Keep all rows from both tables (FULL JOIN in SAS).


Advanced dplyr Features

1. Pivoting Data (tidyr Integration)

Reshape data like SAS PROC TRANSPOSE:

Long to Wide (SAS PROC TRANSPOSE):

wide_data <- long_data %>%  
    pivot_wider(names_from = visit, values_from = result)  

Wide to Long:

long_data <- wide_data %>%  
    pivot_longer(cols = starts_with("visit"), names_to = "visit", values_to = "result")  

2. Row-wise Operations (rowwise())

Avoid SAS macros with:

row_means <- data %>%  
    rowwise() %>%  
    mutate(avg_score = mean(c(score1, score2, score3), na.rm = TRUE))  

3. Efficient Workflow with %>%

Chain operations instead of intermediate datasets:

final_output <- raw_data %>%  
    filter(visit == "Baseline") %>%  
    select(-screening_date) %>%  
    mutate(bmi = weight / (height^2)) %>%  
    group_by(treatment) %>%  
    summarize(mean_bmi = mean(bmi))  

Key Takeaways

  1. dplyr mirrors SAS functionality but with cleaner, more readable syntax.
  2. Key verbs replace DATA steps and PROCs:
  3. filter() → WHERE
  4. mutate() → DATA step calculations
  5. summarize() + group_by() → PROC MEANS
  6. Joins are simpler with *_join() vs. PROC SQL.
  7. Pipes (%>%) streamline workflows, reducing temporary datasets.

Next Steps

Practice converting your existing SAS scripts to R using dplyr. Start with simple filters and merges, then experiment with grouped summaries. The more you use dplyr, the more you’ll appreciate its efficiency—and wonder how you ever lived without it!

Ready to dive deeper? Chapter 6 explores ggplot2, R’s answer to SAS PROC SGPLOT.


This chapter equips you with the foundational dplyr skills to transition seamlessly from SAS to R. Happy coding! 🚀