---
title: "Cluster Analysis Tutorial"
output:
  rmdformats::robobook: default
  html_document: default
  word_document: default
editor_options:
  chunk_output_type: console
---

# Overview

This tutorial provides R code for conducting cluster analysis. Cluster analysis is a data-driven analytic technique that groups together units whose contents are similar to each other. In this example, we will identify turn types in supportive conversations by grouping together turns that are composed of similar utterance types. The utterances are drawn from a subset of conversations between strangers in which one dyad member disclosed about a current problem. Each utterance in these conversations was coded using Stiles' (1992) verbal response mode categories (see Bodie et al., 2021 in the *Journal of Language and Social Psychology* for more details). We are interested in identifying a typology of turns that comprise supportive conversations.

In addition, the accompanying "ClusterAnalysis_Tutorial_2022August20.rmd" file contains all of the code presented in this tutorial and can be opened in RStudio (a somewhat more friendly user interface to R).

Finally, note that much of this tutorial consists of data management, so please be patient as we work our way to running the cluster analysis.

# Outline

In this tutorial, we'll cover...

* Reading in the data and loading needed packages.
* Data descriptives.
* Data management.
* Conducting the cluster analysis.
* Plotting clusters.

# Read in the data and load needed packages.

**Let's read the data into R.**

We are working with a data set that contains repeated measures ("StrangerConversationUtterances_N59"), specifically coded conversation utterances for each dyad. The data set is stored as a .csv file (comma-separated values file, which can be created by saving an Excel file as a csv document) on my computer's desktop.
```{r}
# Set working directory (i.e., where your data file is stored)
# This can be done by going to the top bar of RStudio and selecting
# "Session" --> "Set Working Directory" --> "Choose Directory" -->
# finding the location of your file
setwd("~/Desktop") # Note: You can skip this line if you have
# the data file and this .rmd file stored in the same directory

# Read in the repeated measures data
data <- read.csv(file = "StrangerConversationUtterances_N59.csv",
                 header = TRUE, sep = ",")

# View the first 10 rows of the repeated measures data
head(data, 10)
```

In the repeated measures data ("data"), we can see that each row contains information for one utterance and there are multiple rows (i.e., multiple utterances) for each dyad. In this data set, there are columns for:

* Dyad ID (`id`)
* Time variable - in this case, utterance/segment in the conversation (`seg`)
* Dyad member ID - in this case, role in the conversation (`role`; discloser = 1, listener = 2)
* Utterance form - in this case, based upon Stiles' (1992) verbal response mode category coding scheme (`form`; 1 = disclosure, 2 = edification, 3 = advisement, 4 = confirmation, 5 = question, 6 = acknowledgement, 7 = interpretation, 8 = reflection, 9 = uncodable)
* Utterance intent - in this case, based upon Stiles' (1992) verbal response mode category coding scheme (`intent`; 1 = disclosure, 2 = edification, 3 = advisement, 4 = confirmation, 5 = question, 6 = acknowledgement, 7 = interpretation, 8 = reflection, 9 = uncodable)

**Load the R packages we need.**

Packages in R are collections of functions (and their documentation/explanations) that enable us to conduct particular tasks, such as plotting or fitting a statistical model.
```{r, warning = FALSE, message = FALSE}
# install.packages("cluster") # Install package if you have never used it before
library(cluster) # For hierarchical cluster analysis

# install.packages("devtools") # Install package if you have never used it before
require(devtools) # For version control

# install.packages("dplyr") # Install package if you have never used it before
library(dplyr) # For data management

# install.packages("ggplot2") # Install package if you have never used it before
library(ggplot2) # For plotting

# install.packages("psych") # Install package if you have never used it before
library(psych) # For descriptive statistics

# install.packages("reshape") # Install package if you have never used it before
library(reshape) # For data management
```

# Data Descriptives.

Let's begin by getting a feel for our data. Specifically, let's examine: (1) how many dyads we have in the data set, (2) how many utterances there are for each dyad, and (3) the frequency of each utterance type across all dyads.

1. Number of dyads.

```{r, warning = FALSE}
# Length (i.e., number) of unique ID values
length(unique(data$id))
```

There are 59 dyads in the data set.

2. Number of utterances for each dyad.

```{r, message = FALSE}
num_utt <-
  # Select data
  data %>%
  # Select grouping variable, in this case, dyad ID (id)
  group_by(id) %>%
  # Count the number of utterances in each conversation
  summarise(count = n()) %>%
  # Save the data as a data.frame
  as.data.frame()

# Calculate descriptives on the number of utterances per conversation
describe(num_utt$count)
```

The average dyad in this subset of the data had approximately 171 utterances in their conversation (*M* = 170.80, *SD* = 27.06), with conversations ranging from 125 to 265 utterances.

Plot the distribution of the number of utterances per conversation.
```{r}
# Select data (num_utt) and
# value on the x-axis (number of utterances per conversation: "count")
ggplot(data = num_utt, aes(x = count)) +
  # Create a histogram with binwidth = 5 and white bars outlined in black
  geom_histogram(binwidth = 5, fill = "white", color = "black") +
  # Label x-axis
  labs(x = "Number of Utterances per Conversation") +
  # Change background aesthetics of plot
  theme_classic()
```

3. The total number of utterances and the proportion of each type (for both form and intent).

```{r}
# Create table that calculates the number of utterances for each form type
uttform_table <- table(data$form)

# Display the table and proportions
uttform_table
round(prop.table(table(data$form)), 3)

# Create table that calculates the number of utterances for each intent type
uttintent_table <- table(data$intent)

# Display the table and proportions
uttintent_table
round(prop.table(table(data$intent)), 3)
```

For form, we can see that participants overall used disclosure (1) utterances the most and confirmation (4) utterances the least. For intent, we can see that participants overall used edification (2) utterances the most and advisement (3) utterances the least.

# Data Management.

In this section, we will create our input variables for the cluster analysis. Specifically, we will calculate the proportion of each utterance type for each speaking turn across all of the conversations. The process includes several steps: (1) labeling speaking turns in the data set - i.e., all consecutive utterances from one member of the dyad, (2) calculating the proportion of each utterance type for each speaking turn, and (3) reformatting the data so that each speaking turn is its own row and the columns represent the proportion of each utterance type.

Before labeling the speaking turns in the data set, let's make sure our data are in the format we will need.

Check and change the structure of the data set.
We need to make the "id", "role", "form", and "intent" variables into factor variables, which ensures R interprets the variables as categories instead of integers.

```{r}
# Examine structure
str(data)

# Need to change "id", "role", "form", and "intent" to factor variables
data$id <- as.factor(data$id)
data$role <- as.factor(data$role)
data$form <- as.factor(data$form)
data$intent <- as.factor(data$intent)
```

Now that our data are in the correct format, we will label the speaking turns. Consecutive utterances from one member of the dyad are considered part of the same speaking turn. In the code below, we create a loop that goes through the data row-by-row and labels each row with a turn number based upon whether the dyad ID is the same as the prior row (if not, start the count over at 1) and whether the dyad member (i.e., role) is the same as the prior row (if it is, then use the same turn label; if not, add one to the turn label). Additional explanations about the loop are provided below.

```{r}
# Create new data set that orders the rows by dyad ID and utterance number
newdata <- data[order(data$id, data$seg), ]

# Create new variable turn, currently with missing values
newdata$turn <- NA

# Create a lastid variable that is not one of the dyad IDs
# (this helps start the counting for the first run through the loop)
lastid <- -1

# Create a lastrole variable that is
# not one of the dyad member role labels
# (this helps start the counting for the first run through the loop)
lastrole <- -1

# Set the value for lastturn at 1,
# which is the value where we want our speaking turn count to start
lastturn <- 1

# For each row 1 through N of newdata
for (i in 1:nrow(newdata)) {
  # If the dyad ID of the row is not equal to the value of lastid
  # (i.e., if we are trying to label a new conversation), then...
  if (newdata$id[i] != lastid) {
    # Label the turn 1
    newdata$turn[i] <- 1
    # Update the value of lastrole with the role value of the current row
    lastrole <- newdata$role[i]
    # Update the value of lastturn to 1
    lastturn <- 1
    # Update the value of lastid with the dyad ID of the current row
    lastid <- newdata$id[i]
  }
  # If the role of the row is equal to the value of lastrole
  # (i.e., if the same dyad member is speaking), then...
  else if (newdata$role[i] == lastrole) {
    # Label the value of turn with the value of lastturn
    newdata$turn[i] <- lastturn
  }
  # If the conversation is the same,
  # but the dyad member speaking changes, then...
  else {
    # Label the turn as the lastturn value plus 1
    newdata$turn[i] <- lastturn + 1
    # Update the lastturn value with the turn value just created above
    lastturn <- newdata$turn[i]
    # Update the lastrole value with the role of the current row
    lastrole <- newdata$role[i]
  }
}

# View the first 10 rows of the repeated measures data with turns
head(newdata, 10)
```

Looking at the first 10 rows of the data, we can see that the first two utterances (i.e., rows) are part of the discloser's speaking turn, the third utterance (row) is part of the listener's speaking turn, and so on.

Next, we calculate the contents of each speaking turn, specifically, the proportion of each utterance type. We calculate the proportion scores separately for listeners and disclosers since we will run our cluster analysis on listener and discloser turns separately.

First, we calculate the turn proportion scores for the listeners. Note that we calculate the proportion scores for form and intent separately and then bring these proportion scores together to create a final data set with 18 proportion scores (9 for form and 9 for intent).

Create new data set that only contains listener turns (role = 2).

```{r}
newdata_listener <- newdata[which(newdata$role == 2), ]
```

Create a separate data set to calculate proportions for listener form.
```{r}
newdata_listener_form <- newdata_listener[, c("id", "turn", "seg", "form")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.

```{r}
# Count (i.e., sum) each utterance form type for each dyad ID and turn
form_categories1 <- aggregate(form == 1 ~ id + turn, sum, data = newdata_listener_form)
form_categories2 <- aggregate(form == 2 ~ id + turn, sum, data = newdata_listener_form)
form_categories3 <- aggregate(form == 3 ~ id + turn, sum, data = newdata_listener_form)
form_categories4 <- aggregate(form == 4 ~ id + turn, sum, data = newdata_listener_form)
form_categories5 <- aggregate(form == 5 ~ id + turn, sum, data = newdata_listener_form)
form_categories6 <- aggregate(form == 6 ~ id + turn, sum, data = newdata_listener_form)
form_categories7 <- aggregate(form == 7 ~ id + turn, sum, data = newdata_listener_form)
form_categories8 <- aggregate(form == 8 ~ id + turn, sum, data = newdata_listener_form)
form_categories9 <- aggregate(form == 9 ~ id + turn, sum, data = newdata_listener_form)

# Merge the counts together by dyad ID and turn number
merged_listener_form <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(form_categories1, form_categories2, form_categories3,
       form_categories4, form_categories5, form_categories6,
       form_categories7, form_categories8, form_categories9))

# Calculate the total count by adding the counts of each type
merged_listener_form$totalform <-
  merged_listener_form$`form == 1` + merged_listener_form$`form == 2` +
  merged_listener_form$`form == 3` + merged_listener_form$`form == 4` +
  merged_listener_form$`form == 5` + merged_listener_form$`form == 6` +
  merged_listener_form$`form == 7` + merged_listener_form$`form == 8` +
  merged_listener_form$`form == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_listener_form$propform1 <- merged_listener_form$`form == 1`/merged_listener_form$totalform
merged_listener_form$propform2 <- merged_listener_form$`form == 2`/merged_listener_form$totalform
merged_listener_form$propform3 <- merged_listener_form$`form == 3`/merged_listener_form$totalform
merged_listener_form$propform4 <- merged_listener_form$`form == 4`/merged_listener_form$totalform
merged_listener_form$propform5 <- merged_listener_form$`form == 5`/merged_listener_form$totalform
merged_listener_form$propform6 <- merged_listener_form$`form == 6`/merged_listener_form$totalform
merged_listener_form$propform7 <- merged_listener_form$`form == 7`/merged_listener_form$totalform
merged_listener_form$propform8 <- merged_listener_form$`form == 8`/merged_listener_form$totalform
merged_listener_form$propform9 <- merged_listener_form$`form == 9`/merged_listener_form$totalform
```

Create a separate data set to calculate proportions for listener intent.

```{r}
newdata_listener_intent <- newdata_listener[, c("id", "turn", "seg", "intent")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.
```{r}
# Count (i.e., sum) each utterance intent type for each dyad ID and turn
intent_categories1 <- aggregate(intent == 1 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories2 <- aggregate(intent == 2 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories3 <- aggregate(intent == 3 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories4 <- aggregate(intent == 4 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories5 <- aggregate(intent == 5 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories6 <- aggregate(intent == 6 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories7 <- aggregate(intent == 7 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories8 <- aggregate(intent == 8 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories9 <- aggregate(intent == 9 ~ id + turn, sum, data = newdata_listener_intent)

# Merge the counts together by dyad ID and turn number
merged_listener_intent <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(intent_categories1, intent_categories2, intent_categories3,
       intent_categories4, intent_categories5, intent_categories6,
       intent_categories7, intent_categories8, intent_categories9))

# Calculate the total count by adding the counts of each type
merged_listener_intent$totalintent <-
  merged_listener_intent$`intent == 1` + merged_listener_intent$`intent == 2` +
  merged_listener_intent$`intent == 3` + merged_listener_intent$`intent == 4` +
  merged_listener_intent$`intent == 5` + merged_listener_intent$`intent == 6` +
  merged_listener_intent$`intent == 7` + merged_listener_intent$`intent == 8` +
  merged_listener_intent$`intent == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_listener_intent$propintent1 <- merged_listener_intent$`intent == 1`/merged_listener_intent$totalintent
merged_listener_intent$propintent2 <- merged_listener_intent$`intent == 2`/merged_listener_intent$totalintent
merged_listener_intent$propintent3 <- merged_listener_intent$`intent == 3`/merged_listener_intent$totalintent
merged_listener_intent$propintent4 <- merged_listener_intent$`intent == 4`/merged_listener_intent$totalintent
merged_listener_intent$propintent5 <- merged_listener_intent$`intent == 5`/merged_listener_intent$totalintent
merged_listener_intent$propintent6 <- merged_listener_intent$`intent == 6`/merged_listener_intent$totalintent
merged_listener_intent$propintent7 <- merged_listener_intent$`intent == 7`/merged_listener_intent$totalintent
merged_listener_intent$propintent8 <- merged_listener_intent$`intent == 8`/merged_listener_intent$totalintent
merged_listener_intent$propintent9 <- merged_listener_intent$`intent == 9`/merged_listener_intent$totalintent
```

Merge listener form and intent data sets.

```{r}
# Merge data
data_listener <- merge(merged_listener_form, merged_listener_intent,
                       by = c("id", "turn"))

# Examine column names
names(data_listener)

# Partition to variables for the cluster analysis
data_listener <- data_listener[, c("id", "turn",
                                   "propform1", "propform2", "propform3",
                                   "propform4", "propform5", "propform6",
                                   "propform7", "propform8", "propform9",
                                   "propintent1", "propintent2", "propintent3",
                                   "propintent4", "propintent5", "propintent6",
                                   "propintent7", "propintent8", "propintent9")]

# Re-order rows by dyad ID and turn number
data_listener <- data_listener[order(data_listener$id, data_listener$turn), ]

# View the first 10 rows of the listener turn proportion data
head(data_listener, 10)
```

Now, we will go through the same process to calculate the proportion of each utterance type for discloser form and intent. We calculate the proportion scores for form and intent separately and then bring these proportion scores together to create a final data set with 18 proportion scores (9 for form and 9 for intent).

Create new data set that only contains discloser turns (role = 1).
```{r}
newdata_discloser <- newdata[which(newdata$role == 1), ]
```

Create a separate data set to calculate proportions for discloser form.

```{r}
newdata_discloser_form <- newdata_discloser[, c("id", "turn", "seg", "form")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.

```{r}
# Count (i.e., sum) each utterance form type for each dyad ID and turn
form_categories1 <- aggregate(form == 1 ~ id + turn, sum, data = newdata_discloser_form)
form_categories2 <- aggregate(form == 2 ~ id + turn, sum, data = newdata_discloser_form)
form_categories3 <- aggregate(form == 3 ~ id + turn, sum, data = newdata_discloser_form)
form_categories4 <- aggregate(form == 4 ~ id + turn, sum, data = newdata_discloser_form)
form_categories5 <- aggregate(form == 5 ~ id + turn, sum, data = newdata_discloser_form)
form_categories6 <- aggregate(form == 6 ~ id + turn, sum, data = newdata_discloser_form)
form_categories7 <- aggregate(form == 7 ~ id + turn, sum, data = newdata_discloser_form)
form_categories8 <- aggregate(form == 8 ~ id + turn, sum, data = newdata_discloser_form)
form_categories9 <- aggregate(form == 9 ~ id + turn, sum, data = newdata_discloser_form)

# Merge the counts together by dyad ID and turn number
merged_discloser_form <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(form_categories1, form_categories2, form_categories3,
       form_categories4, form_categories5, form_categories6,
       form_categories7, form_categories8, form_categories9))

# Calculate the total count by adding the counts of each type
merged_discloser_form$totalform <-
  merged_discloser_form$`form == 1` + merged_discloser_form$`form == 2` +
  merged_discloser_form$`form == 3` + merged_discloser_form$`form == 4` +
  merged_discloser_form$`form == 5` + merged_discloser_form$`form == 6` +
  merged_discloser_form$`form == 7` + merged_discloser_form$`form == 8` +
  merged_discloser_form$`form == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_discloser_form$propform1 <- merged_discloser_form$`form == 1`/merged_discloser_form$totalform
merged_discloser_form$propform2 <- merged_discloser_form$`form == 2`/merged_discloser_form$totalform
merged_discloser_form$propform3 <- merged_discloser_form$`form == 3`/merged_discloser_form$totalform
merged_discloser_form$propform4 <- merged_discloser_form$`form == 4`/merged_discloser_form$totalform
merged_discloser_form$propform5 <- merged_discloser_form$`form == 5`/merged_discloser_form$totalform
merged_discloser_form$propform6 <- merged_discloser_form$`form == 6`/merged_discloser_form$totalform
merged_discloser_form$propform7 <- merged_discloser_form$`form == 7`/merged_discloser_form$totalform
merged_discloser_form$propform8 <- merged_discloser_form$`form == 8`/merged_discloser_form$totalform
merged_discloser_form$propform9 <- merged_discloser_form$`form == 9`/merged_discloser_form$totalform
```

Create a separate data set to calculate proportions for discloser intent.

```{r}
newdata_discloser_intent <- newdata_discloser[, c("id", "turn", "seg", "intent")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.
```{r}
# Count (i.e., sum) each utterance intent type for each dyad ID and turn
intent_categories1 <- aggregate(intent == 1 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories2 <- aggregate(intent == 2 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories3 <- aggregate(intent == 3 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories4 <- aggregate(intent == 4 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories5 <- aggregate(intent == 5 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories6 <- aggregate(intent == 6 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories7 <- aggregate(intent == 7 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories8 <- aggregate(intent == 8 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories9 <- aggregate(intent == 9 ~ id + turn, sum, data = newdata_discloser_intent)

# Merge the counts together by dyad ID and turn number
merged_discloser_intent <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(intent_categories1, intent_categories2, intent_categories3,
       intent_categories4, intent_categories5, intent_categories6,
       intent_categories7, intent_categories8, intent_categories9))

# Calculate the total count by adding the counts of each type
merged_discloser_intent$totalintent <-
  merged_discloser_intent$`intent == 1` + merged_discloser_intent$`intent == 2` +
  merged_discloser_intent$`intent == 3` + merged_discloser_intent$`intent == 4` +
  merged_discloser_intent$`intent == 5` + merged_discloser_intent$`intent == 6` +
  merged_discloser_intent$`intent == 7` + merged_discloser_intent$`intent == 8` +
  merged_discloser_intent$`intent == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_discloser_intent$propintent1 <- merged_discloser_intent$`intent == 1`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent2 <- merged_discloser_intent$`intent == 2`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent3 <- merged_discloser_intent$`intent == 3`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent4 <- merged_discloser_intent$`intent == 4`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent5 <- merged_discloser_intent$`intent == 5`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent6 <- merged_discloser_intent$`intent == 6`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent7 <- merged_discloser_intent$`intent == 7`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent8 <- merged_discloser_intent$`intent == 8`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent9 <- merged_discloser_intent$`intent == 9`/merged_discloser_intent$totalintent
```

Merge discloser form and intent data sets.

```{r}
# Merge data
data_discloser <- merge(merged_discloser_form, merged_discloser_intent,
                        by = c("id", "turn"))

# Examine column names
names(data_discloser)

# Partition to variables for the cluster analysis
data_discloser <- data_discloser[, c("id", "turn",
                                     "propform1", "propform2", "propform3",
                                     "propform4", "propform5", "propform6",
                                     "propform7", "propform8", "propform9",
                                     "propintent1", "propintent2", "propintent3",
                                     "propintent4", "propintent5", "propintent6",
                                     "propintent7", "propintent8", "propintent9")]

# Re-order rows by dyad ID and turn number
data_discloser <- data_discloser[order(data_discloser$id, data_discloser$turn), ]

# View the first 10 rows of the discloser turn proportion data
head(data_discloser, 10)
```

Finally, we remove turns that are 100% filler (i.e., uncodable utterances) because they lack theoretical meaning.

How many filler turns are there? That is, how many turns are composed only of uncodable utterances (propform9 = 1 and propintent9 = 1)?
```{r}
# Listener filler turns
length(which(data_listener$propform9 == 1 &
             data_listener$propintent9 == 1)) # 34 filler turns for Listeners

# Listener proportion of filler turns
length(which(data_listener$propform9 == 1 &
             data_listener$propintent9 == 1))/nrow(data_listener)
# 1.3% of Listener turns are filler turns

# Discloser filler turns
length(which(data_discloser$propform9 == 1 &
             data_discloser$propintent9 == 1)) # 39 filler turns for Disclosers

# Discloser proportion of filler turns
length(which(data_discloser$propform9 == 1 &
             data_discloser$propintent9 == 1))/nrow(data_discloser)
# 1.5% of Discloser turns are filler turns
```

Remove filler turns. If both propform9 and propintent9 equal 1, then remove the turn from the listener and discloser data sets.

```{r}
# Listener data set
data_listener <- data_listener %>%
  filter(!(propform9 == 1 & propintent9 == 1)) %>%
  as.data.frame()

# Discloser data set
data_discloser <- data_discloser %>%
  filter(!(propform9 == 1 & propintent9 == 1)) %>%
  as.data.frame()
```

Now our data are ready for a cluster analysis!

# Cluster analysis.

We conduct separate cluster analyses for the listener and discloser turns to explore whether the different conversational roles influence the potential types of conversational acts. The cluster analysis involves several steps, including (1) removing missing data, (2) scaling the data (i.e., rescaling all of the variables to have mean = 0 and standard deviation = 1), (3) running the cluster analysis, (4) examining the dendrogram to choose the appropriate number of clusters, and (5) saving the cluster assignments for each turn. We walk through these steps in more detail below.

We begin with the listener cluster analysis.

First, we remove rows with missing data from our data set since cluster analysis cannot handle missing data. If your data set has high levels of missingness, it might be worth considering imputation methods to handle the missing data.
```{r}
list_subset <- data_listener[complete.cases(data_listener), ]
```

Second, we scale all of the variables (i.e., the proportion scores) so the mean of each variable is equal to 0 and the standard deviation of each variable is equal to 1. This is typically done when variables in the cluster analysis are on very different scales (e.g., age and income have huge differences in the range of values available to measure each variable). Although differences in scale are not an issue here, we follow this common practice.

Also, make sure that you only scale the variables that will be included in the cluster analysis. In this case, we don't want to scale the dyad ID or turn number (the first two columns in our data set).

```{r}
list_scale <- data.frame(scale(list_subset[3:20]))
```

Add the dyad ID and turn number variables back into the data set and rearrange so dyad ID and turn number are the first two columns of the data set.

```{r}
# Add dyad ID and turn number variables
list_scale$id <- list_subset$id
list_scale$turn <- list_subset$turn

# Reorder columns
list_scale <- list_scale[, c(19, 20, 1:18)]

# View the first 10 rows of the scaled listener turn proportion data
head(list_scale, 10)
```

It's also good to double check that the variables were scaled properly. We do so by examining whether the mean of each scaled variable is 0 and the standard deviation of each scaled variable is 1.

```{r}
describe(list_scale)
```

Our data look ready for the cluster analysis!

Third, we conduct the cluster analysis.
```{r}
# Set a seed to make the analysis reproducible
set.seed(1234)

# Calculate the dissimilarity matrix
# between all turns in the data set using Euclidean distance
# Make sure to only include variables of interest
# (i.e., do not include dyad ID or the turn number variable)
dist_list <- daisy(list_scale[, 3:20], metric = "euclidean", stand = FALSE)

# Compute the agglomerative hierarchical cluster analysis
# using Ward's method
clusterward_listener <- agnes(dist_list, diss = TRUE, method = "ward")
```

Fourth, we examine the resulting dendrogram to determine an appropriate number of clusters for the data at hand. We examine the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of turns within each group (we don't want a group with too few turns).

```{r}
plot(clusterward_listener, which.plot = 2,
     main = "Ward Clustering of the Listener Data")
```

Finally, based on the dendrogram (and examining the contents of several cluster solutions), we chose a 6-cluster solution. Using this solution, each turn is assigned to one of the six clusters. We also include code to examine statistics about the chosen cluster solution (e.g., within-cluster heterogeneity), but do not go through those results here.
```{r}
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 6)
# Insert cluster analysis results object ("clusterward_listener")
# and the number of clusters (k)
wardcluster6_list <- cutree(clusterward_listener, k = 6)

# Cluster statistics (cluster.stats() is from the fpc package)
# cluster.stats(dist_list, clustering = wardcluster6_list,
#               silhouette = TRUE, sepindex = TRUE)

# Create cluster labels; in this case, we have six clusters
# and label them Type 1, ..., Type 6
cluster6_label_list <- factor(wardcluster6_list,
                              labels = c("Type 1", "Type 2", "Type 3",
                                         "Type 4", "Type 5", "Type 6"))

# Add cluster assignments to the listener data (list_subset)
list_subset$wardcluster6 <- wardcluster6_list

# Change structure of cluster assignments to a factor variable
list_subset$wardcluster6 <- as.factor(list_subset$wardcluster6)
```

Examine the number of each turn type. The interpretation of each type will be easier once we examine the contents of each turn type below.

```{r}
listener_freq <- table(list_subset$wardcluster6)
listener_freq
round(prop.table(listener_freq), 2)
```

Now, we will go through the same process for the discloser cluster analysis.

First, we remove rows with missing data from our data set since cluster analysis cannot handle missing data. If your data set has high levels of missingness, it might be worth considering imputation methods to handle the missing data.

```{r}
disc_subset <- data_discloser[complete.cases(data_discloser), 1:20]
```

Second, we scale all of the variables (i.e., the proportion scores) so the mean of each variable is equal to 0 and the standard deviation of each variable is equal to 1. This is typically done when variables in the cluster analysis are on very different scales (e.g., age and income have huge differences in the range of values available to measure each variable). Although differences in scale are not an issue here, we follow this common practice.
Also, make sure that you only scale the variables that will be included in the cluster analysis. In this case, we don't want to scale the dyad ID or turn number (the first two columns in our data set).

```{r}
disc_scale <- data.frame(scale(disc_subset[3:20]))
```

Add the dyad ID and turn number variables back into the data set and rearrange so dyad ID and turn number are the first two columns of the data set.

```{r}
# Add dyad ID and turn number variables
disc_scale$id <- disc_subset$id
disc_scale$turn <- disc_subset$turn

# Reorder columns
disc_scale <- disc_scale[, c(19, 20, 1:18)]

# View the first 10 rows of the scaled discloser turn proportion data
head(disc_scale, 10)
```

It's also good to double check that the variables were scaled properly. We do so by examining whether the mean of each scaled variable is 0 and the standard deviation of each scaled variable is 1.

```{r}
describe(disc_scale)
```

Our data look ready for the cluster analysis!

Third, we conduct the cluster analysis.

```{r}
# Set a seed to make the analysis reproducible
set.seed(1234)

# Calculate the dissimilarity matrix
# between all turns in the data set using Euclidean distance
# Make sure to only include variables of interest
# (i.e., do not include dyad ID or the turn number variable)
dist_disc <- daisy(disc_scale[, 3:20], metric = "euclidean", stand = FALSE)

# Compute the agglomerative hierarchical cluster analysis
# using Ward's method
clusterward_discloser <- agnes(dist_disc, diss = TRUE, method = "ward")
```

Fourth, we examine the resulting dendrogram to determine an appropriate number of clusters for the data at hand. We examine the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of turns within each group (we don't want a group with too few turns).
```{r}
plot(clusterward_discloser, which.plot = 2,
     main = "Ward Clustering of the Discloser Data")
```

Finally, based on the dendrogram (and examining the contents of several cluster solutions), we chose a 6-cluster solution. Using this solution, each turn is assigned to one of the six clusters. We also include code to examine statistics about the chosen cluster solution (e.g., within-cluster heterogeneity), but do not go through those results here.

```{r}
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 6)
# Insert cluster analysis results object ("clusterward_discloser")
# and the number of cut points
wardcluster6_disc <- cutree(clusterward_discloser, k = 6)

# Cluster statistics
# cluster.stats(dist_disc, clustering = wardcluster6_disc,
#               silhouette = TRUE, sepindex = TRUE)

# Create cluster labels; in this case, we have six clusters and label them Type 1, ..., Type 6
cluster6_label_disc <- factor(wardcluster6_disc,
                              labels = c("Type 1", "Type 2", "Type 3",
                                         "Type 4", "Type 5", "Type 6"))

# Add cluster categories to the scaled Discloser data (disc_subset)
disc_subset$wardcluster6 <- wardcluster6_disc

# Change structure of cluster categories to a factor variable
disc_subset$wardcluster6 <- as.factor(disc_subset$wardcluster6)
```

Examine the number of each turn type. The interpretation of each type will be easier once we examine the contents of each turn type below.

```{r}
disc_freq <- table(disc_subset$wardcluster6)
disc_freq

round(prop.table(disc_freq), 2)
```

# Plot Clusters.

Listener cluster results.

Merge cluster assignments data with original listener turn proportion data and partition the data to only include the proportion scores and the cluster assignments.
```{r}
# Merge data
list_plot <- merge(data_listener,
                   list_subset[, c("id", "turn", "wardcluster6")])

# Partition only proportion scores and cluster assignments
list_plot <- list_plot[, c("wardcluster6",
                           "propform1", "propintent1", "propform2", "propintent2",
                           "propform3", "propintent3", "propform4", "propintent4",
                           "propform5", "propintent5", "propform6", "propintent6",
                           "propform7", "propintent7", "propform8", "propintent8",
                           "propform9", "propintent9")]

# View the first 10 rows of the Listener turn proportion data with the cluster assignment
head(list_plot, 10)
```

The data need to be "melted" for the purposes of plotting. Melting the data refers to reshaping the data into a long format, such that the values of the proportion scores are in one column and the utterance type and cluster assignment associated with each proportion are in two other columns.

```{r}
# Melt data
list_melt <- melt(list_plot, id = "wardcluster6")

# View the first 10 rows of the melted Listener data
head(list_melt, 10)
```

Plot the average proportion of utterance contents for the 6 listener clusters. The naming and ordering of the clusters can be rearranged depending on how you would like your clusters ordered. In this case, we ordered the clusters alphabetically based upon the conceptual labels we gave the turn types.
```{r}
# Change the structure of the cluster variable to
# rearrange the order of the clusters in the plot
list_melt$wardcluster6_factor <- factor(list_melt$wardcluster6,
                                        levels = c('1', '5', '3', '4', '2', '6'))

# Create labels for the clusters to be used in the plot
cluster_labels_list <- c('1' = "Acknowledge", '5' = "Advice",
                         '3' = "Elaboration", '4' = "Hedged Disc",
                         '2' = "Question", '6' = "Reflection")

# Plot the contents of each cluster with the list_melt data,
# the utterance types ("variable") on the x-axis,
# the proportion of each utterance type ("value") on the y-axis,
# and the color of the bars differing based upon the utterance type ("variable")
ggplot(list_melt, aes(x = variable, y = value, fill = factor(variable))) +
  # Calculate mean proportion for each utterance type and display as a bar chart
  stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
  # Create a different panel for each cluster and
  # label each cluster with the labels we created above
  facet_grid(~wardcluster6_factor,
             labeller = labeller(wardcluster6_factor = cluster_labels_list)) +
  # Change background (applied before the theme() tweak below so it does not override it)
  theme_classic() +
  # Do not include utterance type labels on the x-axis
  theme(axis.text.x = element_blank()) +
  # Create legend: name the contents of the legend ("Utterance Type")
  # Order the contents (breaks) and labels (labels) in the same order
  scale_fill_discrete(name = "Utterance Type",
                      breaks = c("propform1", "propintent1", "propform2", "propintent2",
                                 "propform3", "propintent3", "propform4", "propintent4",
                                 "propform5", "propintent5", "propform6", "propintent6",
                                 "propform7", "propintent7", "propform8", "propintent8",
                                 "propform9", "propintent9"),
                      labels = c("Disclosure Form", "Disclosure Intent",
                                 "Edification Form", "Edification Intent",
                                 "Advisement Form", "Advisement Intent",
                                 "Confirmation Form", "Confirmation Intent",
                                 "Question Form", "Question Intent",
                                 "Acknowledgement Form", "Acknowledgement Intent",
                                 "Interpretation Form", "Interpretation Intent",
                                 "Reflection Form", "Reflection Intent",
                                 "Uncodable Form", "Uncodable Intent")) +
  # X-axis label
  xlab("Cluster") +
  # Y-axis label
  ylab("Proportion")
```

Discloser cluster results.

Merge cluster assignments data with original discloser turn proportion data and partition the data to only include the proportion scores and the cluster assignments.

```{r}
# Merge data
disc_plot <- merge(data_discloser,
                   disc_subset[, c("id", "turn", "wardcluster6")])

# Partition only proportion scores and cluster assignments
disc_plot <- disc_plot[, c("wardcluster6",
                           "propform1", "propintent1", "propform2", "propintent2",
                           "propform3", "propintent3", "propform4", "propintent4",
                           "propform5", "propintent5", "propform6", "propintent6",
                           "propform7", "propintent7", "propform8", "propintent8",
                           "propform9", "propintent9")]

# View the first 10 rows of the Discloser turn proportion data with the cluster assignment
head(disc_plot, 10)
```

The data need to be "melted" for the purposes of plotting. Melting the data refers to reshaping the data into a long format, such that the values of the proportion scores are in one column and the utterance type and cluster assignment associated with each proportion are in two other columns.

```{r}
# Melt data
disc_melt <- melt(disc_plot, id = "wardcluster6")

# View the first 10 rows of the melted Discloser data
head(disc_melt, 10)
```

Plot the average proportion of utterance contents for the 6 discloser clusters. The naming and ordering of the clusters can be rearranged depending on how you would like your clusters ordered. In this case, we ordered the clusters alphabetically based upon the conceptual labels we gave the turn types.
```{r}
# Change the structure of the cluster variable to
# rearrange the order of the clusters in the plot
disc_melt$wardcluster6_factor <- factor(disc_melt$wardcluster6,
                                        levels = c('3', '6', '1', '2', '4', '5'))

# Create labels for the clusters to be used in the plot
cluster_labels_disc <- c('3' = "Acknowledge", '6' = "Advice",
                         '1' = "Elaboration", '2' = "Hedged Disc",
                         '4' = "Question", '5' = "Reflection")

# Plot the contents of each cluster with the disc_melt data,
# the utterance types ("variable") on the x-axis,
# the proportion of each utterance type ("value") on the y-axis,
# and the color of the bars differing based upon the utterance type ("variable")
ggplot(disc_melt, aes(x = variable, y = value, fill = factor(variable))) +
  # Calculate mean proportion for each utterance type and display as a bar chart
  stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
  # Create a different panel for each cluster and
  # label each cluster with the labels we created above
  facet_grid(~wardcluster6_factor,
             labeller = labeller(wardcluster6_factor = cluster_labels_disc)) +
  # Change background (applied before the theme() tweak below so it does not override it)
  theme_classic() +
  # Do not include utterance type labels on the x-axis
  theme(axis.text.x = element_blank()) +
  # Create legend: name the contents of the legend ("Utterance Type")
  # Order the contents (breaks) and labels (labels) in the same order
  scale_fill_discrete(name = "Utterance Type",
                      breaks = c("propform1", "propintent1", "propform2", "propintent2",
                                 "propform3", "propintent3", "propform4", "propintent4",
                                 "propform5", "propintent5", "propform6", "propintent6",
                                 "propform7", "propintent7", "propform8", "propintent8",
                                 "propform9", "propintent9"),
                      labels = c("Disclosure Form", "Disclosure Intent",
                                 "Edification Form", "Edification Intent",
                                 "Advisement Form", "Advisement Intent",
                                 "Confirmation Form", "Confirmation Intent",
                                 "Question Form", "Question Intent",
                                 "Acknowledgement Form", "Acknowledgement Intent",
                                 "Interpretation Form", "Interpretation Intent",
                                 "Reflection Form", "Reflection Intent",
                                 "Uncodable Form", "Uncodable Intent")) +
  # X-axis label
  xlab("Cluster") +
  # Y-axis label
  ylab("Proportion")
```

Ta-da!

-----

### Additional Information

We created this tutorial with a system environment and versions of R and packages that might be different from yours. If R reports errors when you attempt to run this tutorial, running the code chunk below and comparing your output with the tutorial posted on the LHAMA website may be helpful.

```{r}
session_info(pkgs = c("attached"))
```
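If `session_info()` is not available in your setup (it comes from the **sessioninfo** package and is also re-exported by **devtools**), base R's `sessionInfo()` reports similar information:

```{r}
# Base R alternative for reporting your R version and attached packages
sessionInfo()
```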