Dplyr_The_R_equivalent_of_DATA_Step_+_PROC

If you're transitioning from Clinical SAS to R, you’ve likely spent countless hours mastering the DATA step and PROC SQL for data manipulation. The good news? R’s dplyr package provides an intuitive, efficient, and powerful alternative to these SAS staples.

dplyr is part of the tidyverse, a collection of R packages designed for data science. It simplifies data wrangling with a consistent syntax that mirrors SQL and DATA step operations—filtering, sorting, joining, summarizing, and transforming data.

In this chapter, you’ll learn:
- How dplyr verbs replace traditional SAS procedures.
- Key functions to manipulate data frames like SAS datasets.
- Practical comparisons between SAS and R for common data tasks.

By the end, you’ll see why dplyr is a game-changer for SAS programmers embracing R.

Key dplyr Verbs and Their SAS Equivalents

dplyr organizes data manipulation around verbs—functions named for their actions. Below is a mapping between dplyr and SAS:

Practical Examples: From SAS to R

R (`dplyr`)	SAS Equivalent	Purpose
`filter()`	`WHERE` (DATA Step/PROC SQL)	Subset rows based on conditions.
`select()`	`KEEP`/`DROP` (DATA Step)	Select or remove columns.
`mutate()`	`DATA Step` calculations	Create or modify columns.
`arrange()`	`PROC SORT`	Sort data.
`summarize()` + `group_by()`	`PROC MEANS`/`PROC SUMMARY`	Aggregate data by groups.
`left_join()`	`PROC SQL` joins	Merge datasets.

1. Filtering Rows (filter() vs. WHERE)

DATA subset;  
    SET original_data;  
    WHERE age > 30 AND sex = 'M';  
RUN;

library(dplyr)  
subset <- original_data %>%  
    filter(age > 30, sex == "M")

Key Notes:
- %>% (pipe operator) chains operations, similar to SAS RUN steps.
- Conditions in filter() use R’s logical operators (==, >, &).

2. Selecting Columns (select() vs. KEEP/DROP)

DATA selected_cols;  
    SET original_data;  
    KEEP subject_id age treatment;  
RUN;

selected_cols <- original_data %>%  
    select(subject_id, age, treatment)

dropped_cols <- original_data %>%  
    select(-address, -screening_date)

3. Creating/Modifying Variables (mutate() vs. DATA Step)

DATA modified;  
    SET original_data;  
    bmi = (weight / (height**2)) * 703;  
    IF bmi >= 30 THEN obese_flag = 1; ELSE obese_flag = 0;  
RUN;

modified <- original_data %>%  
    mutate(
        bmi = (weight / (height^2)) * 703,  
        obese_flag = ifelse(bmi >= 30, 1, 0)  
    )

Bonus: Use case_when() for multi-condition logic (like SAS SELECT-WHEN):

modified <- original_data %>%  
    mutate(
        bmi_category = case_when(
            bmi < 18.5 ~ "Underweight",  
            bmi < 25 ~ "Normal",  
            bmi < 30 ~ "Overweight",  
            TRUE ~ "Obese"  
        )  
    )

4. Sorting Data (arrange() vs. PROC SORT)

PROC SORT DATA=original_data;  
    BY descending age treatment;  
RUN;

sorted_data <- original_data %>%  
    arrange(desc(age), treatment)

5. Aggregating Data (summarize() + group_by() vs. PROC MEANS)

PROC MEANS DATA=original_data N MEAN STD;  
    CLASS treatment;  
    VAR age weight;  
RUN;

summary_stats <- original_data %>%  
    group_by(treatment) %>%  
    summarize(
        n = n(),  
        mean_age = mean(age, na.rm = TRUE),  
        sd_weight = sd(weight, na.rm = TRUE)  
    )

6. Merging Datasets (*_join() vs. PROC SQL)

PROC SQL;  
    CREATE TABLE merged_data AS  
    SELECT a.*, b.lab_value  
    FROM demog_data a  
    INNER JOIN lab_data b ON a.subject_id = b.subject_id;  
QUIT;

merged_data <- demog_data %>%  
    inner_join(lab_data, by = "subject_id")

Join Types in dplyr:
- left_join(): Keep all rows from the left table (like SAS LEFT JOIN).
- right_join(): Keep all rows from the right table.
- full_join(): Keep all rows from both tables (FULL JOIN in SAS).

Advanced dplyr Features

1. Pivoting Data (tidyr Integration)

wide_data <- long_data %>%  
    pivot_wider(names_from = visit, values_from = result)

long_data <- wide_data %>%  
    pivot_longer(cols = starts_with("visit"), names_to = "visit", values_to = "result")

2. Row-wise Operations (rowwise())

row_means <- data %>%  
    rowwise() %>%  
    mutate(avg_score = mean(c(score1, score2, score3), na.rm = TRUE))

3. Efficient Workflow with %>%

final_output <- raw_data %>%  
    filter(visit == "Baseline") %>%  
    select(-screening_date) %>%  
    mutate(bmi = weight / (height^2)) %>%  
    group_by(treatment) %>%  
    summarize(mean_bmi = mean(bmi))

Key Takeaways

Next Steps

Practice converting your existing SAS scripts to R using dplyr. Start with simple filters and merges, then experiment with grouped summaries. The more you use dplyr, the more you’ll appreciate its efficiency—and wonder how you ever lived without it!

Ready to dive deeper? Chapter 6 explores ggplot2, R’s answer to SAS PROC SGPLOT.

This chapter equips you with the foundational dplyr skills to transition seamlessly from SAS to R. Happy coding! 🚀

Dplyr: The R Equivalent of DATA Step + PROC SQL

Introduction

Key `dplyr` Verbs and Their SAS Equivalents

Practical Examples: From SAS to R

1. Filtering Rows (`filter()` vs. `WHERE`)

2. Selecting Columns (`select()` vs. `KEEP/DROP`)

3. Creating/Modifying Variables (`mutate()` vs. DATA Step)

4. Sorting Data (`arrange()` vs. `PROC SORT`)

5. Aggregating Data (`summarize()` + `group_by()` vs. `PROC MEANS`)

6. Merging Datasets (`*_join()` vs. `PROC SQL`)

Advanced `dplyr` Features

1. Pivoting Data (`tidyr` Integration)

2. Row-wise Operations (`rowwise()`)

3. Efficient Workflow with `%>%`

Key Takeaways

Next Steps

Dplyr: The R Equivalent of DATA Step + PROC SQL

Introduction

Key dplyr Verbs and Their SAS Equivalents

Practical Examples: From SAS to R

1. Filtering Rows (filter() vs. WHERE)

2. Selecting Columns (select() vs. KEEP/DROP)

3. Creating/Modifying Variables (mutate() vs. DATA Step)

4. Sorting Data (arrange() vs. PROC SORT)

5. Aggregating Data (summarize() + group_by() vs. PROC MEANS)

6. Merging Datasets (*_join() vs. PROC SQL)

Advanced dplyr Features

1. Pivoting Data (tidyr Integration)

2. Row-wise Operations (rowwise())

3. Efficient Workflow with %>%

Key Takeaways

Next Steps

Key `dplyr` Verbs and Their SAS Equivalents

1. Filtering Rows (`filter()` vs. `WHERE`)

2. Selecting Columns (`select()` vs. `KEEP/DROP`)

3. Creating/Modifying Variables (`mutate()` vs. DATA Step)

4. Sorting Data (`arrange()` vs. `PROC SORT`)

5. Aggregating Data (`summarize()` + `group_by()` vs. `PROC MEANS`)

6. Merging Datasets (`*_join()` vs. `PROC SQL`)

Advanced `dplyr` Features

1. Pivoting Data (`tidyr` Integration)

2. Row-wise Operations (`rowwise()`)

3. Efficient Workflow with `%>%`