---
title: "Cluster Analysis Tutorial"
output:
  rmdformats::robobook: default
  html_document: default
  word_document: default
editor_options:
  chunk_output_type: console
---

# Overview

This tutorial provides R code for conducting cluster analysis. Cluster analysis is a data-driven analytic technique that groups together units whose contents are similar to each other. In this example, we will identify turn types in supportive conversations by grouping together turns that are composed of similar utterance types. The utterances are drawn from a subset of conversations between strangers in which one dyad member disclosed about a current problem. Each utterance in these conversations was coded using Stiles' (1992) verbal response mode categories (see Bodie et al., 2021 in the *Journal of Language and Social Psychology* for more details). We are interested in identifying a typology of turns that comprise supportive conversations.

In addition, the accompanying "ClusterAnalysis_Tutorial_2022August20.rmd" file contains all of the code presented in this tutorial and can be opened in RStudio (a somewhat more friendly user interface to R).

Finally, note that much of this tutorial consists of data management, so please be patient as we work our way to running the cluster analysis.

# Outline

In this tutorial, we'll cover...

* Reading in the data and loading needed packages.
* Data descriptives.
* Data management.
* Conducting the cluster analysis.
* Plotting clusters.

# Read in the data and load needed packages.

**Let's read the data into R.**

We are working with a data set that contains repeated measures ("StrangerConversationUtterances_N59"), specifically coded conversation utterances for each dyad. The data set is stored as a .csv file (comma-separated values file, which can be created by saving an Excel file as a csv document) on my computer's desktop.
```{r}
# Set working directory (i.e., where your data file is stored)
# This can be done by going to the top bar of RStudio and selecting
# "Session" --> "Set Working Directory" --> "Choose Directory" -->
# finding the location of your file
setwd("~/Desktop") # Note: You can skip this line if you have
# the data file and this .rmd file stored in the same directory

# Read in the repeated measures data
data <- read.csv(file = "StrangerConversationUtterances_N59.csv",
                 header = TRUE, sep = ",")

# View the first 10 rows of the repeated measures data
head(data, 10)
```

In the repeated measures data ("data"), we can see that each row contains information for one utterance and there are multiple rows (i.e., multiple utterances) for each dyad. In this data set, there are columns for:

* Dyad ID (`id`)
* Time variable - in this case, utterance/segment in the conversation (`seg`)
* Dyad member ID - in this case, role in the conversation (`role`; discloser = 1, listener = 2)
* Utterance form - in this case, based upon Stiles' (1992) verbal response mode category coding scheme (`form`; 1 = disclosure, 2 = edification, 3 = advisement, 4 = confirmation, 5 = question, 6 = acknowledgement, 7 = interpretation, 8 = reflection, 9 = uncodable)
* Utterance intent - in this case, based upon Stiles' (1992) verbal response mode category coding scheme (`intent`; 1 = disclosure, 2 = edification, 3 = advisement, 4 = confirmation, 5 = question, 6 = acknowledgement, 7 = interpretation, 8 = reflection, 9 = uncodable)

**Load the R packages we need.**

Packages in R are collections of functions (and their documentation/explanations) that enable us to conduct particular tasks, such as plotting or fitting a statistical model.
```{r, warning = FALSE, message = FALSE}
# install.packages("cluster") # Install package if you have never used it before
library(cluster) # For hierarchical cluster analysis

# install.packages("devtools") # Install package if you have never used it before
require(devtools) # For version control

# install.packages("dplyr") # Install package if you have never used it before
library(dplyr) # For data management

# install.packages("ggplot2") # Install package if you have never used it before
library(ggplot2) # For plotting

# install.packages("psych") # Install package if you have never used it before
library(psych) # For descriptive statistics

# install.packages("reshape") # Install package if you have never used it before
library(reshape) # For data management
```

# Data Descriptives.

Let's begin by getting a feel for our data. Specifically, let's examine: (1) how many dyads we have in the data set, (2) how many utterances there are for each dyad, and (3) the frequency of each utterance type across all dyads.

1. Number of dyads.

```{r, warning = FALSE}
# Length (i.e., number) of unique ID values
length(unique(data$id))
```

There are 59 dyads in the data set.

2. Number of utterances for each dyad.

```{r, message = FALSE}
num_utt <-
  # Select data
  data %>%
  # Select grouping variable, in this case, dyad ID (id)
  group_by(id) %>%
  # Count the number of utterances in each conversation
  summarise(count = n()) %>%
  # Save the data as a data.frame
  as.data.frame()

# Calculate descriptives on the number of utterances per conversation
describe(num_utt$count)
```

The average dyad in this subset of the data had approximately 171 utterances in their conversation (*M* = 170.80, *SD* = 27.06), with conversations ranging from 125 to 265 utterances.

Plot the distribution of the number of utterances per conversation.
```{r}
# Select data (num_utt) and
# value on the x-axis (number of utterances per conversation: "count")
ggplot(data = num_utt, aes(x = count)) +
  # Create a histogram with binwidth = 5 and white bars outlined in black
  geom_histogram(binwidth = 5, fill = "white", color = "black") +
  # Label x-axis
  labs(x = "Number of Utterances per Conversation") +
  # Change background aesthetics of plot
  theme_classic()
```

3. The total number of utterances and the proportion of each type (for both form and intent).

```{r}
# Create table that calculates the number of utterances for each form type
uttform_table <- table(data$form)

# Display the table and proportions
uttform_table
round(prop.table(table(data$form)), 3)

# Create table that calculates the number of utterances for each intent type
uttintent_table <- table(data$intent)

# Display the table and proportions
uttintent_table
round(prop.table(table(data$intent)), 3)
```

For form, we can see that participants overall used disclosure (1) utterances the most and confirmation (4) utterances the least. For intent, we can see that participants overall used edification (2) utterances the most and advisement (3) utterances the least.

# Data Management.

In this section, we will create our input variables for the cluster analysis. Specifically, we will calculate the proportion of each utterance type for each speaking turn across all of the conversations. The process includes several steps: (1) labeling speaking turns in the data set - i.e., all consecutive utterances from one member of the dyad, (2) calculating the proportion of each utterance type for each speaking turn, and (3) reformatting the data so that each speaking turn is its own row and the columns represent the proportion of each utterance type.

Before labeling the speaking turns in the data set, let's make sure our data are in the format we will need.

Check and change the structure of the data set.
We need to make the "id", "role", "form", and "intent" variables into factor variables, which ensures R interprets the variables as categories instead of integers.

```{r}
# Examine structure
str(data)

# Need to change "id", "role", "form", and "intent" to factor variables
data$id <- as.factor(data$id)
data$role <- as.factor(data$role)
data$form <- as.factor(data$form)
data$intent <- as.factor(data$intent)
```

Now that our data are in the correct format, we will label the speaking turns. Consecutive utterances from one member of the dyad are considered part of the same speaking turn. In the code below, we create a loop that goes through the data row-by-row and labels each row with a turn number based upon whether the dyad ID is the same as the prior row (if not, start the count over at 1) and whether the dyad member (i.e., role) is the same as the prior row (if it is, then use the same turn label; if not, add one to the turn label). Additional explanations about the loop are provided below.

```{r}
# Create new data set that orders the rows by dyad ID and utterance number
newdata <- data[order(data$id, data$seg), ]

# Create new variable turn, currently with missing values
newdata$turn <- NA

# Create a lastid variable that is not one of the dyad IDs
# (this helps start the counting for the first run through the loop)
lastid <- -1

# Create a lastrole variable that is
# not one of the dyad member role labels
# (this helps start the counting for the first run through the loop)
lastrole <- -1

# Set the value for lastturn at 1,
# which is the value where we want our speaking turn count to start
lastturn <- 1

# For each row 1 through N of newdata
for (i in 1:nrow(newdata)) {
  # If the dyad ID of the row is not equal to the value of lastid
  # (i.e., if we are trying to label a new conversation), then...
  if (newdata$id[i] != lastid) {
    # Label the turn 1
    newdata$turn[i] <- 1
    # Update the value of lastrole with the role value of the current row
    lastrole <- newdata$role[i]
    # Update the value of lastturn to 1
    lastturn <- 1
    # Update the value of lastid with the dyad ID of the current row
    lastid <- newdata$id[i]
  }
  # If the role of the row is equal to the value of lastrole
  # (i.e., if the same dyad member is speaking), then...
  else if (newdata$role[i] == lastrole) {
    # Label the value of turn with the value of lastturn
    newdata$turn[i] <- lastturn
  }
  # If the conversation is the same,
  # but the dyad member speaking changes, then...
  else {
    # Label the turn as the lastturn value plus 1
    newdata$turn[i] <- lastturn + 1
    # Update the lastturn value with the turn value just created above
    lastturn <- newdata$turn[i]
    # Update the lastrole value with the role of the current row
    lastrole <- newdata$role[i]
  }
}

# View the first 10 rows of the repeated measures data with turns
head(newdata, 10)
```

Looking at the first 10 rows of the data, we can see that the first two utterances (i.e., rows) are part of the discloser's speaking turn, the third utterance (row) is part of the listener's speaking turn, and so on.

Next, we calculate the contents of each speaking turn, specifically, the proportion of each utterance type. We calculate the proportion scores separately for listeners and disclosers since we will run our cluster analysis on listener and discloser turns separately.

First, we calculate the turn proportion scores for the listeners. Note that we calculate the proportion scores for form and intent separately and then bring these proportion scores together to create a final data set with 18 proportion scores (9 for form and 9 for intent).

Create new data set that only contains listener turns (role = 2).

```{r}
newdata_listener <- newdata[which(newdata$role == 2), ]
```

Create a separate data set to calculate proportions for listener form.
```{r}
newdata_listener_form <- newdata_listener[, c("id", "turn", "seg", "form")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.

```{r}
# Count (i.e., sum) each utterance form type for each dyad ID and turn
form_categories1 <- aggregate(form == 1 ~ id + turn, sum, data = newdata_listener_form)
form_categories2 <- aggregate(form == 2 ~ id + turn, sum, data = newdata_listener_form)
form_categories3 <- aggregate(form == 3 ~ id + turn, sum, data = newdata_listener_form)
form_categories4 <- aggregate(form == 4 ~ id + turn, sum, data = newdata_listener_form)
form_categories5 <- aggregate(form == 5 ~ id + turn, sum, data = newdata_listener_form)
form_categories6 <- aggregate(form == 6 ~ id + turn, sum, data = newdata_listener_form)
form_categories7 <- aggregate(form == 7 ~ id + turn, sum, data = newdata_listener_form)
form_categories8 <- aggregate(form == 8 ~ id + turn, sum, data = newdata_listener_form)
form_categories9 <- aggregate(form == 9 ~ id + turn, sum, data = newdata_listener_form)

# Merge the counts together by dyad ID and turn number
merged_listener_form <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(form_categories1, form_categories2, form_categories3,
       form_categories4, form_categories5, form_categories6,
       form_categories7, form_categories8, form_categories9))

# Calculate the total count by adding the counts of each type
merged_listener_form$totalform <-
  merged_listener_form$`form == 1` + merged_listener_form$`form == 2` +
  merged_listener_form$`form == 3` + merged_listener_form$`form == 4` +
  merged_listener_form$`form == 5` + merged_listener_form$`form == 6` +
  merged_listener_form$`form == 7` + merged_listener_form$`form == 8` +
  merged_listener_form$`form == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_listener_form$propform1 <- merged_listener_form$`form == 1`/merged_listener_form$totalform
merged_listener_form$propform2 <- merged_listener_form$`form == 2`/merged_listener_form$totalform
merged_listener_form$propform3 <- merged_listener_form$`form == 3`/merged_listener_form$totalform
merged_listener_form$propform4 <- merged_listener_form$`form == 4`/merged_listener_form$totalform
merged_listener_form$propform5 <- merged_listener_form$`form == 5`/merged_listener_form$totalform
merged_listener_form$propform6 <- merged_listener_form$`form == 6`/merged_listener_form$totalform
merged_listener_form$propform7 <- merged_listener_form$`form == 7`/merged_listener_form$totalform
merged_listener_form$propform8 <- merged_listener_form$`form == 8`/merged_listener_form$totalform
merged_listener_form$propform9 <- merged_listener_form$`form == 9`/merged_listener_form$totalform
```

Create a separate data set to calculate proportions for listener intent.

```{r}
newdata_listener_intent <- newdata_listener[, c("id", "turn", "seg", "intent")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.
```{r}
# Count (i.e., sum) each utterance intent type for each dyad ID and turn
intent_categories1 <- aggregate(intent == 1 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories2 <- aggregate(intent == 2 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories3 <- aggregate(intent == 3 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories4 <- aggregate(intent == 4 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories5 <- aggregate(intent == 5 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories6 <- aggregate(intent == 6 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories7 <- aggregate(intent == 7 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories8 <- aggregate(intent == 8 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories9 <- aggregate(intent == 9 ~ id + turn, sum, data = newdata_listener_intent)

# Merge the counts together by dyad ID and turn number
merged_listener_intent <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(intent_categories1, intent_categories2, intent_categories3,
       intent_categories4, intent_categories5, intent_categories6,
       intent_categories7, intent_categories8, intent_categories9))

# Calculate the total count by adding the counts of each type
merged_listener_intent$totalintent <-
  merged_listener_intent$`intent == 1` + merged_listener_intent$`intent == 2` +
  merged_listener_intent$`intent == 3` + merged_listener_intent$`intent == 4` +
  merged_listener_intent$`intent == 5` + merged_listener_intent$`intent == 6` +
  merged_listener_intent$`intent == 7` + merged_listener_intent$`intent == 8` +
  merged_listener_intent$`intent == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_listener_intent$propintent1 <- merged_listener_intent$`intent == 1`/merged_listener_intent$totalintent
merged_listener_intent$propintent2 <- merged_listener_intent$`intent == 2`/merged_listener_intent$totalintent
merged_listener_intent$propintent3 <- merged_listener_intent$`intent == 3`/merged_listener_intent$totalintent
merged_listener_intent$propintent4 <- merged_listener_intent$`intent == 4`/merged_listener_intent$totalintent
merged_listener_intent$propintent5 <- merged_listener_intent$`intent == 5`/merged_listener_intent$totalintent
merged_listener_intent$propintent6 <- merged_listener_intent$`intent == 6`/merged_listener_intent$totalintent
merged_listener_intent$propintent7 <- merged_listener_intent$`intent == 7`/merged_listener_intent$totalintent
merged_listener_intent$propintent8 <- merged_listener_intent$`intent == 8`/merged_listener_intent$totalintent
merged_listener_intent$propintent9 <- merged_listener_intent$`intent == 9`/merged_listener_intent$totalintent
```

Merge listener form and intent data sets.

```{r}
# Merge data
data_listener <- merge(merged_listener_form, merged_listener_intent,
                       by = c("id", "turn"))

# Examine column names
names(data_listener)

# Partition to variables for the cluster analysis
data_listener <- data_listener[, c("id", "turn",
                                   "propform1", "propform2", "propform3",
                                   "propform4", "propform5", "propform6",
                                   "propform7", "propform8", "propform9",
                                   "propintent1", "propintent2", "propintent3",
                                   "propintent4", "propintent5", "propintent6",
                                   "propintent7", "propintent8", "propintent9")]

# Re-order rows by dyad ID and turn number
data_listener <- data_listener[order(data_listener$id, data_listener$turn), ]

# View the first 10 rows of the listener turn proportion data
head(data_listener, 10)
```

Now, we will go through the same process to calculate the proportion of each utterance type for discloser form and intent. We calculate the proportion scores for form and intent separately and then bring these proportion scores together to create a final data set with 18 proportion scores (9 for form and 9 for intent).

Create new data set that only contains discloser turns (role = 1).
```{r}
newdata_discloser <- newdata[which(newdata$role == 1), ]
```

Create a separate data set to calculate proportions for discloser form.

```{r}
newdata_discloser_form <- newdata_discloser[, c("id", "turn", "seg", "form")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.

```{r}
# Count (i.e., sum) each utterance form type for each dyad ID and turn
form_categories1 <- aggregate(form == 1 ~ id + turn, sum, data = newdata_discloser_form)
form_categories2 <- aggregate(form == 2 ~ id + turn, sum, data = newdata_discloser_form)
form_categories3 <- aggregate(form == 3 ~ id + turn, sum, data = newdata_discloser_form)
form_categories4 <- aggregate(form == 4 ~ id + turn, sum, data = newdata_discloser_form)
form_categories5 <- aggregate(form == 5 ~ id + turn, sum, data = newdata_discloser_form)
form_categories6 <- aggregate(form == 6 ~ id + turn, sum, data = newdata_discloser_form)
form_categories7 <- aggregate(form == 7 ~ id + turn, sum, data = newdata_discloser_form)
form_categories8 <- aggregate(form == 8 ~ id + turn, sum, data = newdata_discloser_form)
form_categories9 <- aggregate(form == 9 ~ id + turn, sum, data = newdata_discloser_form)

# Merge the counts together by dyad ID and turn number
merged_discloser_form <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(form_categories1, form_categories2, form_categories3,
       form_categories4, form_categories5, form_categories6,
       form_categories7, form_categories8, form_categories9))

# Calculate the total count by adding the counts of each type
merged_discloser_form$totalform <-
  merged_discloser_form$`form == 1` + merged_discloser_form$`form == 2` +
  merged_discloser_form$`form == 3` + merged_discloser_form$`form == 4` +
  merged_discloser_form$`form == 5` + merged_discloser_form$`form == 6` +
  merged_discloser_form$`form == 7` + merged_discloser_form$`form == 8` +
  merged_discloser_form$`form == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_discloser_form$propform1 <- merged_discloser_form$`form == 1`/merged_discloser_form$totalform
merged_discloser_form$propform2 <- merged_discloser_form$`form == 2`/merged_discloser_form$totalform
merged_discloser_form$propform3 <- merged_discloser_form$`form == 3`/merged_discloser_form$totalform
merged_discloser_form$propform4 <- merged_discloser_form$`form == 4`/merged_discloser_form$totalform
merged_discloser_form$propform5 <- merged_discloser_form$`form == 5`/merged_discloser_form$totalform
merged_discloser_form$propform6 <- merged_discloser_form$`form == 6`/merged_discloser_form$totalform
merged_discloser_form$propform7 <- merged_discloser_form$`form == 7`/merged_discloser_form$totalform
merged_discloser_form$propform8 <- merged_discloser_form$`form == 8`/merged_discloser_form$totalform
merged_discloser_form$propform9 <- merged_discloser_form$`form == 9`/merged_discloser_form$totalform
```

Create a separate data set to calculate proportions for discloser intent.

```{r}
newdata_discloser_intent <- newdata_discloser[, c("id", "turn", "seg", "intent")]
```

Calculate the proportion of each utterance type for each turn, then merge into one data set.
```{r}
# Count (i.e., sum) each utterance intent type for each dyad ID and turn
intent_categories1 <- aggregate(intent == 1 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories2 <- aggregate(intent == 2 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories3 <- aggregate(intent == 3 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories4 <- aggregate(intent == 4 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories5 <- aggregate(intent == 5 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories6 <- aggregate(intent == 6 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories7 <- aggregate(intent == 7 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories8 <- aggregate(intent == 8 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories9 <- aggregate(intent == 9 ~ id + turn, sum, data = newdata_discloser_intent)

# Merge the counts together by dyad ID and turn number
merged_discloser_intent <- Reduce(function(x, y)
  merge(x, y, by = c("id", "turn"), all = TRUE),
  list(intent_categories1, intent_categories2, intent_categories3,
       intent_categories4, intent_categories5, intent_categories6,
       intent_categories7, intent_categories8, intent_categories9))

# Calculate the total count by adding the counts of each type
merged_discloser_intent$totalintent <-
  merged_discloser_intent$`intent == 1` + merged_discloser_intent$`intent == 2` +
  merged_discloser_intent$`intent == 3` + merged_discloser_intent$`intent == 4` +
  merged_discloser_intent$`intent == 5` + merged_discloser_intent$`intent == 6` +
  merged_discloser_intent$`intent == 7` + merged_discloser_intent$`intent == 8` +
  merged_discloser_intent$`intent == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_discloser_intent$propintent1 <- merged_discloser_intent$`intent == 1`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent2 <- merged_discloser_intent$`intent == 2`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent3 <- merged_discloser_intent$`intent == 3`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent4 <- merged_discloser_intent$`intent == 4`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent5 <- merged_discloser_intent$`intent == 5`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent6 <- merged_discloser_intent$`intent == 6`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent7 <- merged_discloser_intent$`intent == 7`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent8 <- merged_discloser_intent$`intent == 8`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent9 <- merged_discloser_intent$`intent == 9`/merged_discloser_intent$totalintent
```

Merge discloser form and intent data sets.

```{r}
# Merge data
data_discloser <- merge(merged_discloser_form, merged_discloser_intent,
                        by = c("id", "turn"))

# Examine column names
names(data_discloser)

# Partition to variables for the cluster analysis
data_discloser <- data_discloser[, c("id", "turn",
                                     "propform1", "propform2", "propform3",
                                     "propform4", "propform5", "propform6",
                                     "propform7", "propform8", "propform9",
                                     "propintent1", "propintent2", "propintent3",
                                     "propintent4", "propintent5", "propintent6",
                                     "propintent7", "propintent8", "propintent9")]

# Re-order rows by dyad ID and turn number
data_discloser <- data_discloser[order(data_discloser$id, data_discloser$turn), ]

# View the first 10 rows of the discloser turn proportion data
head(data_discloser, 10)
```

Finally, we remove turns that are 100% filler (i.e., uncodable utterances) because they lack theoretical meaning.

How many filler turns are there? That is, how many turns are composed only of uncodable utterances (propform9 = 1 and propintent9 = 1)?
```{r}
# Listener filler turns
length(which(data_listener$propform9 == 1 &
             data_listener$propintent9 == 1)) # 34 filler turns for Listeners

# Listener proportion of filler turns
length(which(data_listener$propform9 == 1 &
             data_listener$propintent9 == 1))/nrow(data_listener)
# 1.3% of Listener turns are filler turns

# Discloser filler turns
length(which(data_discloser$propform9 == 1 &
             data_discloser$propintent9 == 1)) # 39 filler turns for Disclosers

# Discloser proportion of filler turns
length(which(data_discloser$propform9 == 1 &
             data_discloser$propintent9 == 1))/nrow(data_discloser)
# 1.5% of Discloser turns are filler turns
```

Remove filler turns. If both propform9 and propintent9 equal 1, then remove the turn from the listener and discloser data sets.

```{r}
# Listener data set
data_listener <- data_listener %>%
  filter(!(propform9 == 1 & propintent9 == 1)) %>%
  as.data.frame()

# Discloser data set
data_discloser <- data_discloser %>%
  filter(!(propform9 == 1 & propintent9 == 1)) %>%
  as.data.frame()
```

Now our data are ready for a cluster analysis!

# Cluster analysis.

We conduct separate cluster analyses for the listener and discloser turns to explore whether the different conversational roles influence the potential types of conversational acts. The cluster analysis involves several steps, including (1) removing missing data, (2) scaling the data (i.e., rescaling all of the variables to have mean = 0 and standard deviation = 1), (3) running the cluster analysis, (4) examining the dendrogram to choose the appropriate number of clusters, and (5) saving the cluster assignments for each turn. We walk through these steps in more detail below.

We begin with the listener cluster analysis.

First, we remove rows with missing data from our data set since cluster analysis cannot handle missing data. If your data set has high levels of missingness, it might be worth considering imputation methods to handle the missing data.
```{r}
list_subset <- data_listener[complete.cases(data_listener), ]
```

Second, we scale all of the variables (i.e., the proportion scores) so the mean of each variable is equal to 0 and the standard deviation of each variable is equal to 1. This is typically done when variables in the cluster analysis are on very different scales (e.g., age and income have huge differences in the range of values available to measure each variable). Although differences in scale are not an issue here, we follow this common practice.

Also, make sure that you only scale the variables that will be included in the cluster analysis. In this case, we don't want to scale the dyad ID or turn number (the first two columns in our data set).

```{r}
list_scale <- data.frame(scale(list_subset[3:20]))
```

Add the dyad ID and turn number variables back into the data set and rearrange so dyad ID and turn number are the first two columns of the data set.

```{r}
# Add dyad ID and turn number variables
list_scale$id <- list_subset$id
list_scale$turn <- list_subset$turn

# Reorder columns
list_scale <- list_scale[, c(19, 20, 1:18)]

# View the first 10 rows of the scaled listener turn proportion data
head(list_scale, 10)
```

It's also good to double check that the variables were scaled properly. We do so by examining whether the mean of each scaled variable is 0 and the standard deviation of each scaled variable is 1.

```{r}
describe(list_scale)
```

Our data look ready for the cluster analysis!

Third, we conduct the cluster analysis.
```{r}
# Set a seed to make the analysis reproducible
set.seed(1234)

# Calculate the dissimilarity matrix
# between all turns in the data set using Euclidean distance
# Make sure to only include variables of interest
# (i.e., do not include dyad ID or the turn number variable)
dist_list <- daisy(list_scale[, 3:20], metric = "euclidean", stand = FALSE)

# Compute the agglomerative hierarchical cluster analysis
# using Ward's method
clusterward_listener <- agnes(dist_list, diss = TRUE, method = "ward")
```

Fourth, we examine the resulting dendrogram to determine an appropriate number of clusters for the data at hand. We examine the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of turns within each group (we don't want a group with too few turns).

```{r}
plot(clusterward_listener, which.plot = 2,
     main = "Ward Clustering of the Listener Data")
```

Finally, based on the dendrogram (and examining the contents of several cluster solutions), we chose a 6-cluster solution. Using this solution, each turn is assigned to one of the six clusters. We also include code to examine statistics about the chosen cluster solution (e.g., within-cluster heterogeneity), but do not go through those results here.
```{r}
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 6)
# Insert cluster analysis results object ("clusterward_listener")
# and the number of clusters (k)
wardcluster6_list <- cutree(clusterward_listener, k = 6)

# Cluster statistics (cluster.stats() is from the fpc package)
# cluster.stats(dist_list, clustering = wardcluster6_list,
#               silhouette = TRUE, sepindex = TRUE)

# Create cluster labels; in this case, we have six clusters
# and label them Type 1, ..., Type 6
cluster6_label_list <- factor(wardcluster6_list,
                              labels = c("Type 1", "Type 2", "Type 3",
                                         "Type 4", "Type 5", "Type 6"))

# Add cluster assignments to the listener data (list_subset)
list_subset$wardcluster6 <- wardcluster6_list

# Change structure of cluster assignments to a factor variable
list_subset$wardcluster6 <- as.factor(list_subset$wardcluster6)
```

Examine the number of each turn type. The interpretation of each type will be easier once we examine the contents of each turn type below.

```{r}
listener_freq <- table(list_subset$wardcluster6)
listener_freq
round(prop.table(listener_freq), 2)
```

Now, we will go through the same process for the discloser cluster analysis.

First, we remove rows with missing data from our data set since cluster analysis cannot handle missing data. If your data set has high levels of missingness, it might be worth considering imputation methods to handle the missing data.

```{r}
disc_subset <- data_discloser[complete.cases(data_discloser), 1:20]
```

Second, we scale all of the variables (i.e., the proportion scores) so the mean of each variable is equal to 0 and the standard deviation of each variable is equal to 1. This is typically done when variables in the cluster analysis are on very different scales (e.g., age and income have huge differences in the range of values available to measure each variable). Although differences in scale are not an issue here, we follow this common practice.
Also, make sure that you only scale the variables that will be included in the cluster analysis. In this case, we don't want to scale the dyad ID or turn number (the first two columns in our data set).

```{r}
disc_scale <- data.frame(scale(disc_subset[3:20]))
```

Add the dyad ID and turn number variables back into the data set and rearrange so dyad ID and turn number are the first two columns of the data set.

```{r}
# Add dyad ID and turn number variables
disc_scale$id <- disc_subset$id
disc_scale$turn <- disc_subset$turn

# Reorder columns
disc_scale <- disc_scale[, c(19, 20, 1:18)]

# View the first 10 rows of the scaled discloser turn proportion data
head(disc_scale, 10)
```

It's also good to double check that the variables were scaled properly. We do so by examining whether the mean of each scaled variable is 0 and the standard deviation of each scaled variable is 1.

```{r}
describe(disc_scale)
```

Our data look ready for the cluster analysis!

Third, we conduct the cluster analysis.

```{r}
# Set a seed to make the analysis reproducible
set.seed(1234)

# Calculate the dissimilarity matrix
# between all turns in the data set using Euclidean distance
# Make sure to only include variables of interest
# (i.e., do not include dyad ID or the turn number variable)
dist_disc <- daisy(disc_scale[, 3:20], metric = "euclidean", stand = FALSE)

# Compute the agglomerative hierarchical cluster analysis
# using Ward's method
clusterward_discloser <- agnes(dist_disc, diss = TRUE, method = "ward")
```

Fourth, we examine the resulting dendrogram to determine an appropriate number of clusters for the data at hand. We examine the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of turns within each group (we don't want a group with too few turns).
```{r}
plot(clusterward_discloser, which.plot = 2,
     main = "Ward Clustering of the Discloser Data")
```

Finally, based on the dendrogram (and examining the contents of several cluster solutions), we chose a 6-cluster solution. Using this solution, each turn is assigned to one of the six clusters. We also include code to examine statistics about the chosen cluster solution (e.g., within-cluster heterogeneity), but do not go through those results here.

```{r}
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 6)
# Insert cluster analysis results object ("clusterward_discloser")
# and the number of cut points
wardcluster6_disc <- cutree(clusterward_discloser, k = 6)

# Cluster statistics
# cluster.stats(dist_disc, clustering = wardcluster6_disc,
#               silhouette = TRUE, sepindex = TRUE)

# Create cluster labels; in this case, we have six clusters and label them Type 1, ..., Type 6
cluster6_label_disc <- factor(wardcluster6_disc,
                              labels = c("Type 1", "Type 2", "Type 3",
                                         "Type 4", "Type 5", "Type 6"))

# Add cluster categories to the scaled Discloser data (disc_subset)
disc_subset$wardcluster6 <- wardcluster6_disc

# Change structure of cluster categories to a factor variable
disc_subset$wardcluster6 <- as.factor(disc_subset$wardcluster6)
```

Examine the number of each turn type. The interpretation of each type will be easier once we examine the contents of each turn type below.

```{r}
disc_freq <- table(disc_subset$wardcluster6)
disc_freq

round(prop.table(disc_freq), 2)
```

# Plot Clusters.

Listener cluster results.

Merge cluster assignments data with original listener turn proportion data and partition the data to only include the proportion scores and the cluster assignments.
```{r}
# Merge data
list_plot <- merge(data_listener,
                   list_subset[, c("id", "turn", "wardcluster6")])

# Partition only proportion scores and cluster assignments
list_plot <- list_plot[, c("wardcluster6",
                           "propform1", "propintent1", "propform2", "propintent2",
                           "propform3", "propintent3", "propform4", "propintent4",
                           "propform5", "propintent5", "propform6", "propintent6",
                           "propform7", "propintent7", "propform8", "propintent8",
                           "propform9", "propintent9")]

# View the first 10 rows of the Listener turn proportion data with the cluster assignment
head(list_plot, 10)
```

The data need to be "melted" for the purposes of plotting. Melting the data refers to reshaping the data into a long format, such that the values of the proportion scores are in one column and the utterance type and cluster assignment associated with each proportion are in two other columns.

```{r}
# Melt data
list_melt <- melt(list_plot, id = "wardcluster6")

# View the first 10 rows of the melted Listener data
head(list_melt, 10)
```

Plot the average proportion of utterance contents for the 6 listener clusters. The naming and ordering of the clusters can be rearranged depending on how you would like your clusters ordered. In this case, we ordered the clusters alphabetically based upon the conceptual labels we gave the turn types.
```{r}
# Change the structure of the cluster variable to
# rearrange the order of the clusters in the plot
list_melt$wardcluster6_factor <- factor(list_melt$wardcluster6,
                                        levels = c('1', '5', '3', '4', '2', '6'))

# Create labels for the clusters to be used in the plot
cluster_labels_list <- c('1' = "Acknowledge", '5' = "Advice",
                         '3' = "Elaboration", '4' = "Hedged Disc",
                         '2' = "Question", '6' = "Reflection")

# Plot the contents of each cluster with the list_melt data,
# the utterance types ("variable") on the x-axis,
# the proportion of each utterance type ("value") on the y-axis,
# and the color of the bars differing based upon the utterance type ("variable")
ggplot(list_melt, aes(x = variable, y = value, fill = factor(variable))) +
  # Calculate mean proportion for each utterance type and display as a bar chart
  stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
  # Create a different panel for each cluster and
  # label each cluster with the labels we created above
  facet_grid(~wardcluster6_factor,
             labeller = labeller(wardcluster6_factor = cluster_labels_list)) +
  # Change background (applied before the theme() tweak below so it does not override it)
  theme_classic() +
  # Do not include utterance type labels on the x-axis
  theme(axis.text.x = element_blank()) +
  # Create legend: name the contents of the legend ("Utterance Type")
  # Order the contents (breaks) and labels (labels) in the same order
  scale_fill_discrete(name = "Utterance Type",
                      breaks = c("propform1", "propintent1", "propform2", "propintent2",
                                 "propform3", "propintent3", "propform4", "propintent4",
                                 "propform5", "propintent5", "propform6", "propintent6",
                                 "propform7", "propintent7", "propform8", "propintent8",
                                 "propform9", "propintent9"),
                      labels = c("Disclosure Form", "Disclosure Intent",
                                 "Edification Form", "Edification Intent",
                                 "Advisement Form", "Advisement Intent",
                                 "Confirmation Form", "Confirmation Intent",
                                 "Question Form", "Question Intent",
                                 "Acknowledgement Form", "Acknowledgement Intent",
                                 "Interpretation Form", "Interpretation Intent",
                                 "Reflection Form", "Reflection Intent",
                                 "Uncodable Form", "Uncodable Intent")) +
  # X-axis label
  xlab("Cluster") +
  # Y-axis label
  ylab("Proportion")
```

Discloser cluster results.

Merge cluster assignments data with original discloser turn proportion data and partition the data to only include the proportion scores and the cluster assignments.

```{r}
# Merge data
disc_plot <- merge(data_discloser,
                   disc_subset[, c("id", "turn", "wardcluster6")])

# Partition only proportion scores and cluster assignments
disc_plot <- disc_plot[, c("wardcluster6",
                           "propform1", "propintent1", "propform2", "propintent2",
                           "propform3", "propintent3", "propform4", "propintent4",
                           "propform5", "propintent5", "propform6", "propintent6",
                           "propform7", "propintent7", "propform8", "propintent8",
                           "propform9", "propintent9")]

# View the first 10 rows of the Discloser turn proportion data with the cluster assignment
head(disc_plot, 10)
```

The data need to be "melted" for the purposes of plotting. Melting the data refers to reshaping the data into a long format, such that the values of the proportion scores are in one column and the utterance type and cluster assignment associated with each proportion are in two other columns.

```{r}
# Melt data
disc_melt <- melt(disc_plot, id = "wardcluster6")

# View the first 10 rows of the melted Discloser data
head(disc_melt, 10)
```

Plot the average proportion of utterance contents for the 6 discloser clusters. The naming and ordering of the clusters can be rearranged depending on how you would like your clusters ordered. In this case, we ordered the clusters alphabetically based upon the conceptual labels we gave the turn types.
```{r}
# Change the structure of the cluster variable to
# rearrange the order of the clusters in the plot
disc_melt$wardcluster6_factor <- factor(disc_melt$wardcluster6,
                                        levels = c('3', '6', '1', '2', '4', '5'))

# Create labels for the clusters to be used in the plot
cluster_labels_disc <- c('3' = "Acknowledge", '6' = "Advice",
                         '1' = "Elaboration", '2' = "Hedged Disc",
                         '4' = "Question", '5' = "Reflection")

# Plot the contents of each cluster with the disc_melt data,
# the utterance types ("variable") on the x-axis,
# the proportion of each utterance type ("value") on the y-axis,
# and the color of the bars differing based upon the utterance type ("variable")
ggplot(disc_melt, aes(x = variable, y = value, fill = factor(variable))) +
  # Calculate mean proportion for each utterance type and display as a bar chart
  stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
  # Create a different panel for each cluster and
  # label each cluster with the labels we created above
  facet_grid(~wardcluster6_factor,
             labeller = labeller(wardcluster6_factor = cluster_labels_disc)) +
  # Change background (applied before the theme() tweak below so it does not override it)
  theme_classic() +
  # Do not include utterance type labels on the x-axis
  theme(axis.text.x = element_blank()) +
  # Create legend: name the contents of the legend ("Utterance Type")
  # Order the contents (breaks) and labels (labels) in the same order
  scale_fill_discrete(name = "Utterance Type",
                      breaks = c("propform1", "propintent1", "propform2", "propintent2",
                                 "propform3", "propintent3", "propform4", "propintent4",
                                 "propform5", "propintent5", "propform6", "propintent6",
                                 "propform7", "propintent7", "propform8", "propintent8",
                                 "propform9", "propintent9"),
                      labels = c("Disclosure Form", "Disclosure Intent",
                                 "Edification Form", "Edification Intent",
                                 "Advisement Form", "Advisement Intent",
                                 "Confirmation Form", "Confirmation Intent",
                                 "Question Form", "Question Intent",
                                 "Acknowledgement Form", "Acknowledgement Intent",
                                 "Interpretation Form", "Interpretation Intent",
                                 "Reflection Form", "Reflection Intent",
                                 "Uncodable Form", "Uncodable Intent")) +
  # X-axis label
  xlab("Cluster") +
  # Y-axis label
  ylab("Proportion")
```

Ta-da!

-----

### Additional Information

We created this tutorial with a system environment and versions of R and packages that might be different from yours. If R reports errors when you attempt to run this tutorial, running the code chunk below and comparing your output with the tutorial posted on the LHAMA website may be helpful.

```{r}
session_info(pkgs = c("attached"))
```
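If `session_info()` is not available in your setup (it comes from the **sessioninfo** package and is also re-exported by **devtools**), base R's `sessionInfo()` reports similar information:

```{r}
# Base R alternative for reporting your R version and attached packages
sessionInfo()
```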