---
title: "Data Manipulation Functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Manipulation Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Transpose

There are several functions in base R to pivot data.  These include the `t()`
and `reshape()` functions.  There are also well-designed functions in the
**tidyverse** for transposing data.  Nevertheless, some users will prefer
the syntax of `proc_transpose()`.  This function provides control over output
column naming, and an intuitive set of parameters.

To explore `proc_transpose`, let's first create some sample data:
```{r eval=FALSE, echo=TRUE} 
# Create input data
dat <- read.table(header = TRUE, text = '
                  Name    Subject   Semester1  Semester2
                  Samma	  Maths	    96             94
                  Sandy	  English	  76             51
                  Devesh	German	  76             95
                  Rakesh	Maths	    50             63
                  Priya	  English	  62             80
                  Kranti	Maths	    92             92
                  William	German	  87             75')

# View data
dat
#      Name Subject Semester1 Semester2
# 1   Samma   Maths        96        94
# 2   Sandy English        76        51
# 3  Devesh  German        76        95
# 4  Rakesh   Maths        50        63
# 5   Priya English        62        80
# 6  Kranti   Maths        92        92
# 7 William  German        87        75
```
### Default Transpose

The `proc_tranpose()` function may be executed without any parameters.  The 
default usage will tranpose all numeric variables and construct generic 
column names for the new columns:
```{r eval=FALSE, echo=TRUE} 
# No parameters
res <- proc_transpose(dat)

# View result
res
#        NAME COL1 COL2 COL3 COL4 COL5 COL6 COL7
# 1 Semester1   96   76   76   50   62   92   87
# 2 Semester2   94   51   95   63   80   92   75

```
If you didn't want all numeric variables transposed, you could specify which
variable(s) to transpose using the `var` parameter. Multiple variables can be
passed as a vector:
```{r eval=FALSE, echo=TRUE} 
# Var parameter
res <- proc_transpose(dat, var = "Semester1")

# View result
res
#        NAME COL1 COL2 COL3 COL4 COL5 COL6 COL7
# 1 Semester1   96   76   76   50   62   92   87

```

### Prefix and Suffix

You may control the output column names using the `prefix`, 
`suffix`, and `name` parameters.  The `prefix` and `suffix` will be used to
construct the new transposed column names.  The `name` parameter value will
be used as the name of the generic "NAME" column, which identify 
the `var` parameter values:
```{r eval=FALSE, echo=TRUE} 
# With prefix
res <- proc_transpose(dat, name = VarName, prefix = Student)

# View result
res
    VarName Student1 Student2 Student3 Student4 Student5 Student6 Student7
1 Semester1       96       76       76       50       62       92       87
2 Semester2       94       51       95       63       80       92       75
```
Here is the same function call with a `suffix`:
```{r eval=FALSE, echo=TRUE} 
# With suffix
res <- proc_transpose(dat, name = VarName, prefix = S, suffix = Score)

# View result
res
    VarName S1Score S2Score S3Score S4Score S5Score S6Score S7Score
1 Semester1      96      76      76      50      62      92      87
2 Semester2      94      51      95      63      80      92      75
```
Note that since a variable name in R cannot start with a number, some sort
of prefix is required.

### Name Columns From Data

If your data contains a column with appropriate labels for the transposed columns,
you can assign it to the `id` parameter.  The `id` values will then be used for
the new column names.
```{r eval=FALSE, echo=TRUE} 
# Assign column names from data
res <- proc_transpose(dat, name = VarName, id = Name)

# View result
res
    VarName Samma Sandy Devesh Rakesh Priya Kranti William
1 Semester1    96    76     76     50    62     92      87
2 Semester2    94    51     95     63    80     92      75
```

Using two `id` parameters tells the function that you want columns that are
combinations of the two variables:
```{r eval=FALSE, echo=TRUE}
# Two id variables
res <- proc_transpose(dat, id = v(Name, Subject))

res
#        NAME Samma.Maths Sandy.English Devesh.German Rakesh.Maths Priya.English Kranti.Maths William.German
# 1 Semester1          96            76            76           50            62           92             87
# 2 Semester2          94            51            95           63            80           92             75
```
The default delimiter shown above is a dot (".").  The delimiter may be changed
with the `delimiter` parameter:

```{r eval=FALSE, echo=TRUE}
# Underscore delimiter
res <- proc_transpose(dat, id = v(Name, Subject), delimiter = "_")

res
#        NAME Samma_Maths Sandy_English Devesh_German Rakesh_Maths Priya_English Kranti_Maths William_German
# 1 Semester1          96            76            76           50            62           92             87
# 2 Semester2          94            51            95           63            80           92             75
```

### By Groups

The `by` parameter tells the function to group by the `by` variable 
before transposing.  As you can see below, the `by` varible is then retained
on the output dataset so you can identify which rows belong to which group.
```{r eval=FALSE, echo=TRUE} 
# By variable
res <- proc_transpose(dat, by = Name, id = Subject, name = Semester)

# View result
res
#       Name  Semester German English Maths
# 1   Devesh Semester1     76      NA    NA
# 2   Devesh Semester2     95      NA    NA
# 3   Kranti Semester1     NA      NA    92
# 4   Kranti Semester2     NA      NA    92
# 5    Priya Semester1     NA      62    NA
# 6    Priya Semester2     NA      80    NA
# 7   Rakesh Semester1     NA      NA    50
# 8   Rakesh Semester2     NA      NA    63
# 9    Samma Semester1     NA      NA    96
# 10   Samma Semester2     NA      NA    94
# 11   Sandy Semester1     NA      76    NA
# 12   Sandy Semester2     NA      51    NA
# 13 William Semester1     87      NA    NA
# 14 William Semester2     75      NA    NA 
```
The `by` parameter is a valuable feature of `proc_transpose()`.  The by variable
can have a significant effect on the shape of the output data. Let's see
what happens when we use a different by variable.
```{r eval=FALSE, echo=TRUE} 
# By variable
res <- proc_transpose(dat, by = Subject, id = Name)

#   Subject      NAME Sandy Priya Devesh William Samma Rakesh Kranti
# 1 English Semester1    76    62     NA      NA    NA     NA     NA
# 2 English Semester2    51    80     NA      NA    NA     NA     NA
# 3  German Semester1    NA    NA     76      87    NA     NA     NA
# 4  German Semester2    NA    NA     95      75    NA     NA     NA
# 5   Maths Semester1    NA    NA     NA      NA    96     50     92
# 6   Maths Semester2    NA    NA     NA      NA    94     63     92

```

Now let's use two by variables:
```{r eval=FALSE, echo=TRUE} 
# Two by variables
res <- proc_transpose(dat, by = v(Name, Subject))

# View results
res
#       Name Subject      NAME COL1
# 1    Priya English Semester1   62
# 2    Priya English Semester2   80
# 3    Sandy English Semester1   76
# 4    Sandy English Semester2   51
# 5   Devesh  German Semester1   76
# 6   Devesh  German Semester2   95
# 7  William  German Semester1   87
# 8  William  German Semester2   75
# 9   Kranti   Maths Semester1   92
# 10  Kranti   Maths Semester2   92
# 11  Rakesh   Maths Semester1   50
# 12  Rakesh   Maths Semester2   63
# 13   Samma   Maths Semester1   96
# 14   Samma   Maths Semester2   94

```
By transposing one more time on the results of the previous example,
you can restore the original data frame.
```{r eval=FALSE, echo=TRUE} 
# Restore original data shape
res2 <- proc_transpose(res, by = v(Name, Subject), 
                       id = NAME, options = noname)

# View results
res2[ , c("Name", "Subject", "Semester1", "Semester2")]
#      Name Subject Semester1 Semester2
# 1   Priya English        62        80
# 2   Sandy English        76        51
# 3  Devesh  German        76        95
# 4 William  German        87        75
# 5  Kranti   Maths        92        92
# 6  Rakesh   Maths        50        63
# 7   Samma   Maths        96        94

```
## Sort

The `proc_sort()` function in the **procs** package attempts to recreate the most
basic functionality of PROC SORT.  The function accepts an input data frame
and returns the sorted result. Before examining the function, first let's create
some sample data:
```{r eval=FALSE, echo=TRUE} 
# Create sample data
dat <- read.table(header = TRUE, text = '
                    ID	Name	Score
                    1	David	 74
                    2	Sam	   45
                    3	Bane	 87
                    3	Mary	 92
                    4	Dane	 23
                    5	Jenny	 87
                    6	Simran 63
                    8	Priya	 72 
                    1	David	 45
                    2	Ram	   54
                    3	Bane	 87
                    5	Ken	   87')

# View data
dat
#    ID   Name Score
# 1   1  David    74
# 2   2    Sam    45
# 3   3   Bane    87
# 4   3   Mary    92
# 5   4   Dane    23
# 6   5  Jenny    87
# 7   6 Simran    63
# 8   8  Priya    72
# 9   1  David    45
# 10  2    Ram    54
# 11  3   Bane    87
# 12  5    Ken    87
```

Like `proc_transpose()`, the function has a default
usage, which is to sort by all variables:
```{r eval=FALSE, echo=TRUE} 
# Default sort
res <- proc_sort(dat)

# View results
res
#    ID   Name Score
# 9   1  David    45
# 1   1  David    74
# 10  2    Ram    54
# 2   2    Sam    45
# 3   3   Bane    87
# 11  3   Bane    87
# 4   3   Mary    92
# 5   4   Dane    23
# 6   5  Jenny    87
# 12  5    Ken    87
# 7   6 Simran    63
# 8   8  Priya    72

```
### Sort By
To sort by a specific variable, you can use the `by` parameter:
```{r eval=FALSE, echo=TRUE} 
# Sort By
res <- proc_sort(dat, by = Score)

# View results
res
#    ID   Name Score
# 5   4   Dane    23
# 2   2    Sam    45
# 9   1  David    45
# 10  2    Ram    54
# 7   6 Simran    63
# 8   8  Priya    72
# 1   1  David    74
# 3   3   Bane    87
# 6   5  Jenny    87
# 11  3   Bane    87
# 12  5    Ken    87
# 4   3   Mary    92
```
You can also sort by two variables:
```{r eval=FALSE, echo=TRUE} 
# Sort By
res <- proc_sort(dat, by = v(Score, Name))

# View results
res
#    ID   Name Score
# 5   4   Dane    23
# 9   1  David    45
# 2   2    Sam    45
# 10  2    Ram    54
# 7   6 Simran    63
# 8   8  Priya    72
# 1   1  David    74
# 3   3   Bane    87
# 11  3   Bane    87
# 6   5  Jenny    87
# 12  5    Ken    87
# 4   3   Mary    92
```
Notice that when the scores are tied, the data is now sorted alphabetically 
by name.

### Sort Order

To put the highest scores on top, you can sort descending using the `order`
parameter.
```{r eval=FALSE, echo=TRUE} 
# Sort By
res <- proc_sort(dat, by = Score, order = descending)

# View results
res
#    ID   Name Score
# 4   3   Mary    92
# 3   3   Bane    87
# 6   5  Jenny    87
# 11  3   Bane    87
# 12  5    Ken    87
# 1   1  David    74
# 8   8  Priya    72
# 7   6 Simran    63
# 10  2    Ram    54
# 2   2    Sam    45
# 9   1  David    45
# 5   4   Dane    23
```

The `order` instructions can be abbreviated "a" for ascending and "d" for 
descending.  These abbreviations are convenient when there is more than 
one `by` variable.
```{r eval=FALSE, echo=TRUE} 
# Sort By
res <- proc_sort(dat, by = v(Score, Name), order = v(d, a))

# View results
res
#    ID   Name Score
# 4   3   Mary    92
# 3   3   Bane    87
# 11  3   Bane    87
# 6   5  Jenny    87
# 12  5    Ken    87
# 1   1  David    74
# 8   8  Priya    72
# 7   6 Simran    63
# 10  2    Ram    54
# 9   1  David    45
# 2   2    Sam    45
# 5   4   Dane    23
```
### Keep 
The `keep` parameter allows you to select only some of the variables for the
output dataset:
```{r eval=FALSE, echo=TRUE} 
# Keep Name and Score only
res <- proc_sort(dat, by = Name, keep = v(Name, Score))

# View results
res
#      Name Score
# 3    Bane    87
# 11   Bane    87
# 5    Dane    23
# 1   David    74
# 9   David    45
# 6   Jenny    87
# 12    Ken    87
# 4    Mary    92
# 8   Priya    72
# 10    Ram    54
# 2     Sam    45
# 7  Simran    63

```

### Options
Another convenient feature of `proc_sort()` is the `nodupkey` option.  This 
option will eliminate duplicates based on the unique combination of values
of the `by` variables. For instance, to get a unique list of students, you
could use `keep` and `options = nodupkey`:
```{r eval=FALSE, echo=TRUE} 
# Keep and Nodupkey
res <- proc_sort(dat, by = Name, keep = Name, options = nodupkey)

# View results
res
#      Name
# 3    Bane
# 5    Dane
# 1   David
# 6   Jenny
# 12    Ken
# 4    Mary
# 8   Priya
# 10    Ram
# 2     Sam
# 7  Simran
```
There is also a `dupkey` option to retain only records with duplicate
key values. This feature
would allow you to easily find students who took the exam twice, and show
the scores for each attempt.
```{r eval=FALSE, echo=TRUE} 
# Keep and dupkey
res <- proc_sort(dat, by = Name, options = dupkey)

# View results
res
#   ID  Name Score
# 1  3  Bane    87
# 2  3  Bane    87
# 3  1 David    74
# 4  1 David    45
```

### Sorting Non-alphabetically
Most of the time, you will sort alphabetically or numerically.  But sometimes
you may have character data that you want to sort in a specific order that is
not alphabetic.  Let's return to the class data we used above:
```{r eval=FALSE, echo=TRUE} 
# Create input data
dat <- read.table(header = TRUE, text = '
                  Name    Subject   Semester1  Semester2
                  Samma	  Maths	    96             94
                  Sandy	  English	  76             51
                  Devesh	German	  76             95
                  Rakesh	Maths	    50             63
                  Priya	  English	  62             80
                  Kranti	Maths	    92             92
                  William	German	  87             75')

# View data
dat
#      Name Subject Semester1 Semester2
# 1   Samma   Maths        96        94
# 2   Sandy English        76        51
# 3  Devesh  German        76        95
# 4  Rakesh   Maths        50        63
# 5   Priya English        62        80
# 6  Kranti   Maths        92        92
# 7 William  German        87        75
```
What if we wanted to sort "Maths" to the top, followed by "English" and "German"?
How could that be accomplished?

To perform a sort in a specific, non-alphabetic order, first create a factor
on your desired sort column and order the levels by the intended sort.
For instance:
```{r eval=FALSE, echo=TRUE} 
# Create factor with ordered levels
dat$Subject <- factor(dat$Subject, levels = c("Maths", "English", "German"))

# View attributes
attributes(dat$Subject)
# $levels
# [1] "Maths"   "English" "German" 
# 
# $class
# [1] "factor"

```
Then use `proc_sort()` to sort by the factor variable:

```{r eval=FALSE, echo=TRUE} 
# Sort by factor variable
dat2 <- proc_sort(dat, by = Subject)

# View result
dat2
#      Name Subject Semester1 Semester2
# 1   Samma   Maths        96        94
# 4  Rakesh   Maths        50        63
# 6  Kranti   Maths        92        92
# 2   Sandy English        76        51
# 5   Priya English        62        80
# 3  Devesh  German        76        95
# 7 William  German        87        75

```

Note that after the sort operation, the "Subject" variable is still a factor:
```{r eval=FALSE, echo=TRUE} 
class(dat2$Subject)
# [1] "factor"

```

If desired, we can turn the variable back into a character using 
the "as.character" parameter.  Following the sort, this parameter will 
automatically cast any factors on the by parameter to character variables:
```{r eval=FALSE, echo=TRUE} 
# Sort by factor variable
dat2 <- proc_sort(dat, by = Subject, as.character = TRUE)

# Same results
dat2
#      Name Subject Semester1 Semester2
# 1   Samma   Maths        96        94
# 4  Rakesh   Maths        50        63
# 6  Kranti   Maths        92        92
# 2   Sandy English        76        51
# 5   Priya English        62        80
# 3  Devesh  German        76        95
# 7 William  German        87        75

# But now Subject is a character
class(dat2$Subject)
# [1] "character"
```

### Sorting Missing Values

Normally, R sorts missing values (NA) to the bottom or last. This behavior is
also the default with `proc_sort()`.  To illustrate, let's add an NA
value to the dataset created above, so that we can experiment.  Here is 
the modification:
```{r eval=FALSE, echo=TRUE} 
# Create input data with missing values
dat2 <- read.table(header = TRUE, text = '
                  Name    Subject   Semester1  Semester2
                  Samma	  Maths	    96             94
                  Sandy	  English	  76             NA
                  Devesh	German	  76             95
                  Rakesh	Maths	    50             63
                  Priya	  English	  62             80
                  Kranti	Maths	    92             92
                  William	German	  87             75')

# View dataset
dat2
#      Name Subject Semester1 Semester2
# 1   Samma   Maths        96        94
# 2   Sandy English        76        NA
# 3  Devesh  German        76        95
# 4  Rakesh   Maths        50        63
# 5   Priya English        NA        80
# 6  Kranti   Maths        92        92
# 7 William  German        87        75

```
Now let's try some different sorts. Here is the default sort:
```{r eval=FALSE, echo=TRUE} 
# Default sort on column with NA
proc_sort(dat2, by = Semester2)
#      Name Subject Semester1 Semester2
# 4  Rakesh   Maths        50        63
# 7 William  German        87        75
# 5   Priya English        NA        80
# 6  Kranti   Maths        92        92
# 1   Samma   Maths        96        94
# 3  Devesh  German        76        95
# 2   Sandy English        76        NA

```
As you can see, the function sorts the NA value last.  But what if you want to
sort it first?  

Fortunately, there is a parameter to control the NA sort behavior: `na.sort`.
This parameter allows you to sort the NAs first, last, or in a manner consistent
with SAS.  Here is how to sort the NAs first:
```{r eval=FALSE, echo=TRUE} 
# Force NA to top 
proc_sort(dat2, by = Semester2, na.sort = "first")
#      Name Subject Semester1 Semester2
# 2   Sandy English        76        NA
# 4  Rakesh   Maths        50        63
# 7 William  German        87        75
# 5   Priya English        NA        80
# 6  Kranti   Maths        92        92
# 1   Samma   Maths        96        94
# 3  Devesh  German        76        95

```
Note that "first" and "last" will always sort NAs in the requested direction,
no matter the direction of the sort overall.  Here is the same sort on 
"Semester2", but this time descending instead of ascending:
```{r eval=FALSE, echo=TRUE} 
# NA still forced to top 
proc_sort(dat2, by = Semester2, order = "d", na.sort = "first")
#      Name Subject Semester1 Semester2
# 2   Sandy English        76        NA
# 3  Devesh  German        76        95
# 1   Samma   Maths        96        94
# 6  Kranti   Maths        92        92
# 5   Priya English        NA        80
# 7 William  German        87        75
# 4  Rakesh   Maths        50        63

```
In SAS, however, the NAs can go to the top or bottom, depending on the direction 
of the sort. The `na.sort` parameter also has a "sas" option to mimic this 
behavior. Observe:
```{r eval=FALSE, echo=TRUE} 
# Ascending sort  
proc_sort(dat2, by = Semester2, order = "a", na.sort = "sas")
#      Name Subject Semester1 Semester2
# 2   Sandy English        76        NA
# 4  Rakesh   Maths        50        63
# 7 William  German        87        75
# 5   Priya English        NA        80
# 6  Kranti   Maths        92        92
# 1   Samma   Maths        96        94
# 3  Devesh  German        76        95

# Descending sort  
proc_sort(dat2, by = Semester2, order = "d", na.sort = "sas")
#      Name Subject Semester1 Semester2
# 3  Devesh  German        76        95
# 1   Samma   Maths        96        94
# 6  Kranti   Maths        92        92
# 5   Priya English        NA        80
# 7 William  German        87        75
# 4  Rakesh   Maths        50        63
# 2   Sandy English        76        NA

```
The "sas" option allows you to more easily match the sort produced by 
SAS PROC SORT.  You may set this option globally with 
`options("procs.na.sort" = "sas")`. See the `proc_sort()` documentation 
and `vignette("procs-globals")` for 
additional information.