---
title: <br><center>Re-encoding `people` in the `EDH` dataset</center>
date:  "August 2022" 
author: 
  - name: <center>Antonio Rivero Ostoic</center><div style="margin-bottom:-20px;"> </div>
    affiliation: <center>Aarhus University</center><div style="margin-bottom:-20px;"> </div>
    email: <center>jaro@cas.au.dk</center>
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Re-encoding `people` in the `EDH` dataset}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

<style type="text/css">
h1.title {
  text-align: center; font-size: 20pt; color: #400040; 
}
h2 {
  font-size: 18pt; color: #400040;
}
h3 {
  font-size: 14pt; color: #400080;
}
h4 {
  text-align: center; color: DarkRed;
}
</style>


```{r setup, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(echo=TRUE,error=TRUE)
knitr::opts_chunk$set(comment = "")
library("sdam")
```

```{r set-options, echo=FALSE, cache=FALSE}
options(width = 96)
```



<div style="margin-bottom:20px;"> </div>


### Preliminaries
Install and load a version of `"sdam"` package.

```{r, echo=TRUE, eval=FALSE}
install.packages("sdam") # from CRAN
devtools::install_github("sdam-au/sdam") # development version
devtools::install_github("mplex/cedhar", subdir="pkg/sdam") # legacy version R 3.6.x
```
<div style="margin-bottom:10px;"> </div>
```{r}
# load and check versions
library(sdam)
packageVersion("sdam")
```

<div style="margin-bottom:40px;"> </div>


<div style="margin-bottom:60px;"> </div>


## EDH people
* `EDH` is a dataset in `"sdam"` that contains the texts of Latin and Latin-Greek inscriptions of the Roman Empire, which have been retrieved from the [Epigraphic Database Heidelberg API repository](https://edh-www.adw.uni-heidelberg.de/data/api) through routines `get.edh()` and `get.edhw()`. 

Since the year 2022 and still today, the API repository does not support people variables, and the `EDH` dataset serves as an alternative for the analysis of people-related inscriptions.

One challenge with people variables in `EDH` is that some records contain characters in Greek and Latin extended that need re-encoding for a proper rendering and display.


<div style="margin-bottom:60px;"> </div>


### Re-encoding `people` in `EDH`
Ancient inscriptions in some Roman provinces have Greek characters written and, due to encoding and decoding steps in the process of extraction, loading, and transformation of the data (perhaps Treating UTF-8 Bytes as Windows-1252?), Greek and other Latin characters are not displayed properly with the actual version of the `EDH` dataset. Most of the encoding issues are in variables related to people, and some examples with inscriptions in Roman provinces are next.


<div style="margin-bottom:40px;"> </div>


#### Achaia

The Roman province of **Achaia** in the `EDH` dataset has inscriptions related to people.

```{r, echo=FALSE, eval=TRUE, out.width="25%", fig.align="center", fig.cap="Roman province of Achaia (ca 117 AD)."}
plot.map("Ach", cap=TRUE, name=FALSE)
```


<div style="margin-bottom:40px;"> </div>

Function `edhw()` is to obtain the available inscriptions per province in the EDH dataset, which is a list that is the input for the same function to extract `people` variables *cognomen* and *nomen*. In this case, the `'province'` argument is `Ach` that stands for `Achaia`. 


```{r, echo=TRUE, eval=FALSE}
# select two people variables from Achaia
Ach <- edhw(province="Ach") |> 
  edhw(vars="people", select=c("cognomen","nomen"))
```
```{r, echo=FALSE, eval=TRUE}
Ach <- edhw(province="Ach") |> 
  suppressWarnings() |> 
  edhw(vars="people", select=c("cognomen","nomen")) |> 
  suppressWarnings()
```



<div style="margin-bottom:20px;"> </div>

There are 1539 records with people in `Ach` that corresponds to the number of rows in this data frame. 


```{r}
# number of people entries in Achaia
nrow(Ach)
```

<div style="margin-bottom:20px;"> </div>

However, some records have either missing data or are inscriptions where *cognomen* and *nomen* are not available. 


```{r, echo=TRUE, eval=FALSE}
# also remove NAs
Ach <- edhw(province="Ach") |> 
  edhw(vars="people", select=c("cognomen","nomen"), na.rm=TRUE)

nrow(Ach)
```
```{r, echo=FALSE, eval=TRUE}
Ach <- edhw(province="Ach") |> 
  suppressWarnings() |> 
  edhw(vars="people", select=c("cognomen","nomen"), na.rm=TRUE) |> 
  suppressWarnings()

nrow(Ach)
```

<div style="margin-bottom:40px;"> </div>

### Clean function for re-encoding 

Treating with `people` attribute variables requires many times re-encoding that is one option in function `cln()`. 
For instance, values in *cognomen* in the first entries of `Ach` are likely in Greek. 

```{r}
# some people entries in Achaia
head(Ach)
```

Function `cln()` serves to re-encode Greek and Latin characters to render Greek, Greek extended, and Latin extended glyphs. 


```{r, echo=TRUE, eval=FALSE}
# re-encode in Ach cognomen
Ach$cognomen |> 
  head() |> 
  cln()
```
```
cognomen
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(head(Ach$cognomen,6))[1])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(head(Ach$cognomen,6))[2])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(head(Ach$cognomen,6))[3])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(head(Ach$cognomen,6))[4])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(head(Ach$cognomen,6))[5])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(head(Ach$cognomen,6))[6])
```

<div style="margin-bottom:30px;"> </div>



```{r, echo=FALSE, eval=FALSE}
detach("package:sdam", unload=TRUE)
sdam::cln(tail(Ach))
```

For *cognomen* in the last people entries in `Achaia`.

```{r}
# last entries
tail(Ach)
```

<div style="margin-bottom:60px;"> </div>


After re-encoding the last records in `Ach` with `cln()`, it is easier to see, for example, that some have identical *cognomen* where entries having `<NA>` in the input become `NA`. 

<div class="aegp">
<div class="row">
<div class="col-md-6">
```{r, echo=TRUE, eval=FALSE}
# clean last entries of cognomen
Ach$cognomen |> 
  tail() |> 
  cln()
```
```
cognomen
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$cognomen,6))[1])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$cognomen,6))[2])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$cognomen,6))[3])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$cognomen,6))[4])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$cognomen,6))[5])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$cognomen,6))[6])
```
</div>
<div class="col-md-6">
```{r, echo=TRUE, eval=FALSE}
# clean last entries of nomen
Ach$nomen |> 
  tail() |> 
  cln()
```
```
nomen
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$nomen,6))[1])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$nomen,6))[2])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$nomen,6))[3])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$nomen,6))[4])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$nomen,6))[5])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Ach$nomen,6))[6])
```
</div>
</div>
</div>





<div style="margin-bottom:60px;"> </div>




### Re-encode Greek and Latin within data frames

<div style="margin-bottom:20px;"> </div>


#### Aegyptus

In the case of the province of **Aegyptus**, three people variables have a mixing og Greek and Latin characters 
scripted that need *re-codification* as well. 


```{r, echo=FALSE, eval=TRUE, out.width="25%", fig.align="center", fig.cap="Roman province of Aegyptus (ca 117 AD)."}
plot.map("Aeg", cap=TRUE, name=FALSE)
```


```{r, echo=TRUE, eval=FALSE}
# Aegyptus people
Aeg <- edhw(province="Aeg") |> 
  edhw(vars="people")
```
```{r, echo=FALSE, eval=TRUE}
Aeg <- edhw(province="Aeg") |> 
  suppressWarnings() |> 
  edhw(vars="people") |>
  suppressWarnings()
```

```{r}
# three variables of the last eight records
Aeg[ , c(3,5:6)] |> 
  tail(8)
```

<div style="margin-bottom:30px;"> </div>

For people in `Aegyptus`, columns three, and five to six correspond to *cognomen*, *name*, and *nomen*, where 
the output from `cln()` in the console is a dataframe.


```{r, echo=TRUE, eval=FALSE}
# re-encode three variables from last entries
Aeg[ ,c(3,5:6)] |> 
  tail() |> 
  cln()
```


<div class="aegp">
<div class="row">
<div class="col-md-4">
```
cognomen
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[1])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[2])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[3])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[4])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[5])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[6])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[7])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$cognomen,8))[8])
```
<br>
</div>
<div class="col-md-4">
```
name
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[1])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[2])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[3])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[4])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[5])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[6])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[7])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$name,8))[8])
```
<br>
</div>
<div class="col-md-4">
```
nomen
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[1])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[2])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[3])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[4])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[5])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[6])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[7])
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(tail(Aeg$nomen,8))[8])
```
<br>
</div>
</div>
</div>


<div style="margin-bottom:30px;"> </div>

Some entries in `Aeg` have Greek extended characters, and one entry in Latin has a special character at the end 
(`Sulpicius*`), which can be omitted for further computations by raising the cleaning level to `2`. 



<div style="margin-bottom:50px;"> </div>



#### *nomen* in Aegyptus

Benefits from re-encoding and cleaning text from the `EDH` dataset are evident like when counting occurrences 
in the different attribute variables as with ` nomen ` in `Aeg`. 


```{r, echo=TRUE, eval=FALSE}
# default cleaning level 1
Aeg$nomen |> 
  cln() |> 
  table() |> 
  sort(decreasing=TRUE)
```


```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(sort(unique(Aeg$nomen)),level=1)[32])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(Aeg$nomen),decreasing=TRUE))[1]
```
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(sort(unique(Aeg$nomen)),level=1)[22])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(Aeg$nomen),decreasing=TRUE))[2]
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(sort(unique(Aeg$nomen)),level=1)[23])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(Aeg$nomen),decreasing=TRUE))[3]
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(cln(sort(unique(Aeg$nomen)),level=1)[1])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(Aeg$nomen),decreasing=TRUE))[4]
```
<br>
*etc.* 
```
...
```



<div style="margin-bottom:30px;"> </div>

By raising the cleaning level to `2`, all special characters are removed from the end, 
and it is possible to see that, in the Roman province of Aegyptus, `Sempronius`, `Sentius`, `Valerius` 
are the three most common *nomen* in inscriptions with four occurrences each.

```{r, echo=TRUE, eval=FALSE}
# raise cleaning level and remove NAs
Aeg$nomen |> 
  cln(level=2, na.rm=TRUE) |> 
  table() |> 
  sort(decreasing=TRUE) 
```


```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(names(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[1])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[1]
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(names(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[2])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[2]
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(names(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[3])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[3]
```
<br>
```{r, echo=FALSE, eval=TRUE}
knitr::asis_output(names(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[4])
```
```{r, echo=FALSE, eval=TRUE}
as.vector(sort(table(cln(x=Aeg$nomen, level=2, na.rm=TRUE)), decreasing=TRUE))[4]
```
<br>
*etc.* 
```
...
```



<div style="margin-bottom:60px;"> </div>


### Caveats

See `Warnings` section in manual.



<div style="margin-bottom:60px;"> </div>





<div style="margin-bottom:60px;"> </div>

### See also

#### Vignettes

* [Datasets in `"sdam"` package](../doc/Intro.html)

* [Dates and missing dating data](../doc/Dates.html)

* [Cartographical maps and networks](../doc/Maps.html)

<div style="margin-bottom:30px;"> </div>

#### Manuals

* [sdam: Digital Tools for the SDAM Project at Aarhus University](../html/sdam-package.html)
* [`"sdam"` manual](https://github.com/mplex/cedhar/blob/master/typesetting/reports/sdam.pdf)

<div style="margin-bottom:30px;"> </div>

#### Project

* [Release candidate version](https://github.com/sdam-au/sdam)
* [Code snippets using `"sdam"`](https://github.com/sdam-au/R_code)
* [Social Dynamics and complexity in the Ancient Mediterranean project](https://sdam-au.github.io/sdam-au/)

<div style="margin-bottom:60px;"> </div>

&nbsp;
