Cluster Analysis Tutorial
Overview
This tutorial provides R code for conducting cluster analysis. Cluster analysis is a data-driven analytic technique that groups together units whose contents are similar to each other.
In this example, we will identify turn types in supportive conversations by grouping together turns that are composed of similar utterance types. The utterances are drawn from a subset of conversations between strangers in which one dyad member disclosed about a current problem. Each utterance in these conversations was coded using Stiles’ (1992) verbal response mode categories (see Bodie et al., 2021 in the Journal of Language and Social Psychology for more details). We are interested in identifying a typology of turns that comprise supportive conversations.
In addition, the accompanying “ClusterAnalysis_Tutorial_2022August20.rmd” file contains all of the code presented in this tutorial and can be opened in RStudio (a somewhat more friendly user interface to R).
Finally, note that much of this tutorial will consist of data management. So, please be patient as we work our way to running the cluster analysis.
Outline
In this tutorial, we’ll cover…
- Reading in the data and loading needed packages.
- Data descriptives.
- Data management.
- Conducting the cluster analysis.
- Plotting clusters.
Read in the data and load needed packages.
Let’s read the data into R.
We are working with a data set that contains repeated measures (“StrangerConversationUtterances_N59”), specifically coded conversation utterances for each dyad.
The data set is stored as .csv file (comma-separated values file, which can be created by saving an Excel file as a csv document) on my computer’s desktop.
# Set working directory (i.e., where your data file is stored)
# This can be done by going to the top bar of RStudio and selecting
# "Session" --> "Set Working Directory" --> "Choose Directory" -->
# finding the location of your file
setwd("~/Desktop") # Note: You can skip this line if you have
#the data file and this .rmd file stored in the same directory
# Read in the repeated measures data
data <- read.csv(file = "StrangerConversationUtterances_N59.csv", header = TRUE, sep = ",")
# View the first 10 rows of the repeated measures data
head(data, 10)
## id seg role form intent
## 1 105 10 1 1 1
## 2 105 36 1 1 1
## 3 105 45 1 1 1
## 4 105 46 1 1 1
## 5 105 79 1 1 1
## 6 105 80 1 1 1
## 7 105 81 1 1 1
## 8 105 92 1 1 1
## 9 105 104 1 1 1
## 10 105 105 1 1 1
In the repeated measures data (“data”), we can see that each row contains information for one utterance and that there are multiple rows (i.e., multiple utterances) for each dyad. In this data set, there are columns for:
- Dyad ID (id)
- Time variable - in this case, utterance/segment in the conversation (seg)
- Dyad member ID - in this case, role in the conversation (role; discloser = 1, listener = 2)
- Utterance form - in this case, based upon Stiles’ (1992) verbal response mode category coding scheme (form; 1 = disclosure, 2 = edification, 3 = advisement, 4 = confirmation, 5 = question, 6 = acknowledgement, 7 = interpretation, 8 = reflection, 9 = uncodable)
- Utterance intent - in this case, based upon Stiles’ (1992) verbal response mode category coding scheme (intent; 1 = disclosure, 2 = edification, 3 = advisement, 4 = confirmation, 5 = question, 6 = acknowledgement, 7 = interpretation, 8 = reflection, 9 = uncodable)
Load the R packages we need.
Packages in R are a collection of functions (and their documentation/explanations) that enable us to conduct particular tasks, such as plotting or fitting a statistical model.
# install.packages("cluster") # Install package if you have never used it before
library(cluster) # For hierarchical cluster analysis
# install.packages("devtools") # Install package if you have never used it before
require(devtools) # For version control
# install.packages("dplyr") # Install package if you have never used it before
library(dplyr) # For data management
# install.packages("ggplot2") # Install package if you have never used it before
library(ggplot2) # For plotting
# install.packages("psych") # Install package if you have never used it before
library(psych) # For descriptive statistics
# install.packages("reshape") # Install package if you have never used it before
library(reshape) # For data management
Data Descriptives.
Let’s begin by getting a feel for our data. Specifically, let’s examine:
- how many dyads we have in the data set,
- how many utterances there are for each dyad, and
- the frequency of each utterance type across all dyads.
- Number of dyads.
# Length (i.e., number) of unique ID values
length(unique(data$id))
## [1] 59
There are 59 dyads in the data set.
- Number of utterances for each dyad.
# Select data
num_utt <- data %>%
  # Select grouping variable, in this case, dyad ID (id)
  group_by(id) %>%
  # Count the number of utterances in each conversation
  summarise(count = n()) %>%
  # Save the data as a data.frame
  as.data.frame()
# Calculate descriptives on the number of utterances per conversation
describe(num_utt$count)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 59 170.8 27.06 169 169.27 22.24 125 265 140 0.76 1 3.52
The average dyad in this subset of the data had approximately 171 utterances in their conversation (M = 170.80, SD = 27.06), with conversations ranging from 125 to 265 utterances.
Plot the distribution of the number of utterances per conversation.
# Select data (num_utt) and
# value on the x-axis (number of utterances per conversation: "count")
ggplot(data = num_utt, aes(x = count)) +
#Create a histogram with binwidth = 5 and white bars outlined in black
geom_histogram(binwidth = 5, fill = "white", color="black") +
# Label x-axis
labs(x = "Number of Utterances per Conversation") +
# Change background aesthetics of plot
theme_classic()
- The total number of utterances and the proportion of each type (for both form and intent).
# Create table that calculates the number of utterances for each form type
uttform_table <- table(data$form)
# Display the table and proportions
uttform_table
##
## 1 2 3 4 5 6 7 8 9
## 3014 2675 69 31 823 2599 78 163 625
round(prop.table(table(data$form)), 3)
##
## 1 2 3 4 5 6 7 8 9
## 0.299 0.265 0.007 0.003 0.082 0.258 0.008 0.016 0.062
# Create table that calculates the number of utterances for each intent type
uttintent_table <- table(data$intent)
# Display the table and proportions
uttintent_table
##
## 1 2 3 4 5 6 7 8 9
## 1557 3869 96 473 732 1899 514 288 649
round(prop.table(table(data$intent)), 3)
##
## 1 2 3 4 5 6 7 8 9
## 0.155 0.384 0.010 0.047 0.073 0.188 0.051 0.029 0.064
For form, we can see that participants overall used disclosure (1) utterances the most and confirmation (4) utterances the least.
For intent, we can see that participants overall used edification (2) utterances the most and advisement (3) utterances the least.
Data Management.
In this section, we will create our input variables for the cluster analysis. Specifically, we will calculate the proportion of each utterance type for each speaking turn across all of the conversations. The process will include several steps, including (1) labeling speaking turns in the data set - i.e., all consecutive utterances from one member of the dyad, (2) calculating the proportion of each utterance type for each speaking turn, and (3) reformatting the data so that each speaking turn is its own row and the columns represent the proportion of each utterance type.
Before labeling the speaking turns in the data set, let’s make sure our data are in the format we will need.
Check and change the structure of the data set. We need to make the “id”, “role”, “form”, and “intent” variables into factor variables, which makes sure R interprets the variables as categories instead of integers.
# Examine structure
str(data)
## 'data.frame': 10077 obs. of 5 variables:
## $ id : int 105 105 105 105 105 105 105 105 105 105 ...
## $ seg : int 10 36 45 46 79 80 81 92 104 105 ...
## $ role : int 1 1 1 1 1 1 1 1 1 1 ...
## $ form : int 1 1 1 1 1 1 1 1 1 1 ...
## $ intent: int 1 1 1 1 1 1 1 1 1 1 ...
# Need to change "id", "role", "form", and "intent" to factor variables
data$id <- as.factor(data$id)
data$role <- as.factor(data$role)
data$form <- as.factor(data$form)
data$intent <- as.factor(data$intent)
Now that our data are in the correct format, we will label the speaking turns. Consecutive utterances from one member of the dyad are considered to be part of the same speaking turn. In the code below, we create a loop that goes through the data row-by-row and labels each row with a turn number based upon whether the dyad ID is the same as the prior row (if not, start the count over at 1) and if the dyad member (i.e., role) is the same as the prior row (if it is, then use the same turn label; if not, add one to the turn label). Additional explanations about the loop are provided below.
# Create new data set that orders the rows by dyad ID and utterance number
newdata <- data[order(data$id, data$seg), ]

# Create new variable turn, currently with missing values
newdata$turn <- NA

# Create a lastid variable that is not one of the dyad IDs
# (this helps start the counting for the first run through the loop)
lastid <- -1

# Create a lastrole variable that is
# not one of the dyad member role labels
# (this helps start the counting for the first run through the loop)
lastrole <- -1

# Set the value for lastturn at 1,
# which is the value where we want our speaking turn count to start
lastturn <- 1

# For each row 1 through N of newdata
for (i in 1:nrow(newdata)) {
  # If the dyad ID of the row is not equal to the value of lastid
  # (i.e., if we are trying to label a new conversation), then...
  if (newdata$id[i] != lastid) {
    # Label the turn 1
    newdata$turn[i] <- 1

    # Update the value of lastrole with the role value of the current row
    lastrole <- newdata$role[i]

    # Update the value of lastturn to 1
    lastturn <- 1

    # Update the value of lastid with the dyad ID of the current row
    lastid <- newdata$id[i]
  }
  # If the role of the row is equal to the value of lastrole
  # (i.e., if the same dyad member is speaking), then...
  else if (newdata$role[i] == lastrole) {
    # Label the value of turn with the value of lastturn
    newdata$turn[i] <- lastturn
  }
  # If the conversation is the same,
  # but the dyad member speaking changes, then...
  else {
    # Label the turn as the lastturn value plus 1
    newdata$turn[i] <- lastturn + 1

    # Update the lastturn value with the turn value just created above
    lastturn <- newdata$turn[i]

    # Update the lastrole value with the role of the current row
    lastrole <- newdata$role[i]
  }
}
# View the first 10 rows of the repeated measures data with turns
head(newdata, 10)
## id seg role form intent turn
## 5114 3 1 1 6 6 1
## 3497 3 2 1 2 2 1
## 9230 3 3 2 6 6 2
## 1645 3 4 1 1 2 3
## 5684 3 5 1 9 9 3
## 1646 3 6 1 1 2 3
## 1647 3 7 1 1 2 3
## 1648 3 8 1 1 2 3
## 1649 3 9 1 1 2 3
## 1650 3 10 1 1 2 3
Looking at the first 10 rows of the data, we can see that the first two utterances (i.e., rows) are part of the discloser’s speaking turn, the third utterance (row) is part of the listener’s speaking turn, and so on.
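As an aside (not part of the original tutorial), the same turn labels can be computed without an explicit loop. Within one conversation, a new turn starts whenever the speaker changes, so cumulatively summing an indicator of role changes yields the turn number. A minimal sketch:

```r
# label_turns(): turn index for one conversation's sequence of speaker roles;
# a new turn starts whenever the role differs from the previous row
label_turns <- function(role) {
  cumsum(c(TRUE, role[-1] != role[-length(role)]))
}

label_turns(c(1, 1, 2, 1, 1))  # turns 1, 1, 2, 3, 3
```

Applied per conversation (e.g., with `ave(as.integer(newdata$role), newdata$id, FUN = label_turns)`, assuming rows are sorted by id and seg), this reproduces the loop's turn variable.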
Next, we calculate the contents of each speaking turn, specifically, the proportion of each utterance type. We calculate the proportion scores separately for listeners and disclosers since we will run our cluster analysis on listener and discloser turns separately.
First, we calculate the turn proportion scores for the listeners. Note, that we calculate the proportion scores for form and intent separately and then bring these proportion scores together to create a final data set with 18 proportion scores (9 for form and 9 for intent).
Create new data set that only contains listener turns (role = 2).
newdata_listener <- newdata[which(newdata$role == 2), ]
Create a separate data set to calculate proportions for listener form.
newdata_listener_form <- newdata_listener[, c("id", "turn", "seg", "form")]
Calculate the proportion of each utterance type for each turn, then merge into one data set.
# Count (i.e., sum) each utterance form type for each dyad ID and turn
form_categories1 <- aggregate(form == 1 ~ id + turn, sum, data = newdata_listener_form)
form_categories2 <- aggregate(form == 2 ~ id + turn, sum, data = newdata_listener_form)
form_categories3 <- aggregate(form == 3 ~ id + turn, sum, data = newdata_listener_form)
form_categories4 <- aggregate(form == 4 ~ id + turn, sum, data = newdata_listener_form)
form_categories5 <- aggregate(form == 5 ~ id + turn, sum, data = newdata_listener_form)
form_categories6 <- aggregate(form == 6 ~ id + turn, sum, data = newdata_listener_form)
form_categories7 <- aggregate(form == 7 ~ id + turn, sum, data = newdata_listener_form)
form_categories8 <- aggregate(form == 8 ~ id + turn, sum, data = newdata_listener_form)
form_categories9 <- aggregate(form == 9 ~ id + turn, sum, data = newdata_listener_form)

# Merge the counts together by dyad ID and turn number
merged_listener_form <- Reduce(function(x, y) merge(x, y, by = c("id", "turn"), all = TRUE),
                               list(form_categories1, form_categories2, form_categories3,
                                    form_categories4, form_categories5, form_categories6,
                                    form_categories7, form_categories8, form_categories9))

# Calculate the total count by adding the counts of each type
merged_listener_form$totalform <- merged_listener_form$`form == 1` +
  merged_listener_form$`form == 2` +
  merged_listener_form$`form == 3` +
  merged_listener_form$`form == 4` +
  merged_listener_form$`form == 5` +
  merged_listener_form$`form == 6` +
  merged_listener_form$`form == 7` +
  merged_listener_form$`form == 8` +
  merged_listener_form$`form == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_listener_form$propform1 <- merged_listener_form$`form == 1`/merged_listener_form$totalform
merged_listener_form$propform2 <- merged_listener_form$`form == 2`/merged_listener_form$totalform
merged_listener_form$propform3 <- merged_listener_form$`form == 3`/merged_listener_form$totalform
merged_listener_form$propform4 <- merged_listener_form$`form == 4`/merged_listener_form$totalform
merged_listener_form$propform5 <- merged_listener_form$`form == 5`/merged_listener_form$totalform
merged_listener_form$propform6 <- merged_listener_form$`form == 6`/merged_listener_form$totalform
merged_listener_form$propform7 <- merged_listener_form$`form == 7`/merged_listener_form$totalform
merged_listener_form$propform8 <- merged_listener_form$`form == 8`/merged_listener_form$totalform
merged_listener_form$propform9 <- merged_listener_form$`form == 9`/merged_listener_form$totalform
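For reference, the count-and-divide steps above can also be condensed with table() and prop.table(). This is a sketch on toy data (an aside, not the tutorial's approach): table() cross-tabulates turns by category, and prop.table() with margin = 1 turns each row of counts into proportions.

```r
# Toy data: two turns with coded categories 1-3
toy <- data.frame(turn = c(1, 1, 1, 2, 2),
                  form = factor(c(1, 1, 2, 3, 3), levels = 1:3))

# Counts of each category per turn, then row-wise proportions
counts <- table(toy$turn, toy$form)
props  <- prop.table(counts, margin = 1)  # each row sums to 1
```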
Create a separate data set to calculate proportions for listener intent.
newdata_listener_intent <- newdata_listener[, c("id", "turn", "seg", "intent")]
Calculate the proportion of each utterance type for each turn, then merge into one data set.
# Count (i.e., sum) each utterance intent type for each dyad ID and turn
intent_categories1 <- aggregate(intent == 1 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories2 <- aggregate(intent == 2 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories3 <- aggregate(intent == 3 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories4 <- aggregate(intent == 4 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories5 <- aggregate(intent == 5 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories6 <- aggregate(intent == 6 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories7 <- aggregate(intent == 7 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories8 <- aggregate(intent == 8 ~ id + turn, sum, data = newdata_listener_intent)
intent_categories9 <- aggregate(intent == 9 ~ id + turn, sum, data = newdata_listener_intent)

# Merge the counts together by dyad ID and turn number
merged_listener_intent <- Reduce(function(x, y) merge(x, y, by = c("id", "turn"), all = TRUE),
                                 list(intent_categories1, intent_categories2, intent_categories3,
                                      intent_categories4, intent_categories5, intent_categories6,
                                      intent_categories7, intent_categories8, intent_categories9))

# Calculate the total count by adding the counts of each type
merged_listener_intent$totalintent <- merged_listener_intent$`intent == 1` +
  merged_listener_intent$`intent == 2` +
  merged_listener_intent$`intent == 3` +
  merged_listener_intent$`intent == 4` +
  merged_listener_intent$`intent == 5` +
  merged_listener_intent$`intent == 6` +
  merged_listener_intent$`intent == 7` +
  merged_listener_intent$`intent == 8` +
  merged_listener_intent$`intent == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_listener_intent$propintent1 <- merged_listener_intent$`intent == 1`/merged_listener_intent$totalintent
merged_listener_intent$propintent2 <- merged_listener_intent$`intent == 2`/merged_listener_intent$totalintent
merged_listener_intent$propintent3 <- merged_listener_intent$`intent == 3`/merged_listener_intent$totalintent
merged_listener_intent$propintent4 <- merged_listener_intent$`intent == 4`/merged_listener_intent$totalintent
merged_listener_intent$propintent5 <- merged_listener_intent$`intent == 5`/merged_listener_intent$totalintent
merged_listener_intent$propintent6 <- merged_listener_intent$`intent == 6`/merged_listener_intent$totalintent
merged_listener_intent$propintent7 <- merged_listener_intent$`intent == 7`/merged_listener_intent$totalintent
merged_listener_intent$propintent8 <- merged_listener_intent$`intent == 8`/merged_listener_intent$totalintent
merged_listener_intent$propintent9 <- merged_listener_intent$`intent == 9`/merged_listener_intent$totalintent
Merge listener form and intent data sets.
# Merge data
data_listener <- merge(merged_listener_form, merged_listener_intent, by = c("id", "turn"))
# Examine column names
names(data_listener)
## [1] "id" "turn" "form == 1" "form == 2" "form == 3"
## [6] "form == 4" "form == 5" "form == 6" "form == 7" "form == 8"
## [11] "form == 9" "totalform" "propform1" "propform2" "propform3"
## [16] "propform4" "propform5" "propform6" "propform7" "propform8"
## [21] "propform9" "intent == 1" "intent == 2" "intent == 3" "intent == 4"
## [26] "intent == 5" "intent == 6" "intent == 7" "intent == 8" "intent == 9"
## [31] "totalintent" "propintent1" "propintent2" "propintent3" "propintent4"
## [36] "propintent5" "propintent6" "propintent7" "propintent8" "propintent9"
# Partition to variables for the cluster analysis
data_listener <- data_listener[, c("id", "turn", "propform1", "propform2",
                                   "propform3", "propform4", "propform5",
                                   "propform6", "propform7",
                                   "propform8", "propform9",
                                   "propintent1", "propintent2", "propintent3",
                                   "propintent4", "propintent5", "propintent6",
                                   "propintent7", "propintent8", "propintent9")]
# Re-order rows by dyad ID and turn number
data_listener <- data_listener[order(data_listener$id, data_listener$turn), ]
# View the first 10 rows of the Listener turn proportion data
head(data_listener, 10)
## id turn propform1 propform2 propform3 propform4 propform5 propform6
## 1903 3 2 0.0 0.0 0 0 0.0 1.0
## 1914 3 4 0.0 0.0 0 0 0.0 1.0
## 1925 3 6 0.0 0.0 0 0 0.0 1.0
## 1929 3 8 0.0 0.0 0 0 1.0 0.0
## 1898 3 10 0.0 0.0 0 0 0.0 1.0
## 1899 3 12 0.0 0.0 0 0 0.5 0.5
## 1900 3 14 0.0 0.5 0 0 0.0 0.5
## 1901 3 16 0.0 0.0 0 0 0.0 1.0
## 1902 3 18 0.5 0.5 0 0 0.0 0.0
## 1904 3 20 0.0 1.0 0 0 0.0 0.0
## propform7 propform8 propform9 propintent1 propintent2 propintent3
## 1903 0 0 0 0.0 0.0 0
## 1914 0 0 0 0.0 0.0 0
## 1925 0 0 0 0.0 0.0 0
## 1929 0 0 0 0.0 0.0 0
## 1898 0 0 0 0.0 0.0 0
## 1899 0 0 0 0.0 0.0 0
## 1900 0 0 0 0.0 0.0 0
## 1901 0 0 0 0.0 0.0 0
## 1902 0 0 0 0.5 0.5 0
## 1904 0 0 0 1.0 0.0 0
## propintent4 propintent5 propintent6 propintent7 propintent8 propintent9
## 1903 0 0.0 1 0.0 0 0
## 1914 0 0.0 1 0.0 0 0
## 1925 0 0.0 1 0.0 0 0
## 1929 0 1.0 0 0.0 0 0
## 1898 0 0.0 1 0.0 0 0
## 1899 0 0.5 0 0.5 0 0
## 1900 0 0.0 0 1.0 0 0
## 1901 0 0.0 1 0.0 0 0
## 1902 0 0.0 0 0.0 0 0
## 1904 0 0.0 0 0.0 0 0
Now, we will go through the same process to calculate the proportion of each utterance type for discloser form and intent.
We calculate the proportion scores for form and intent separately and then bring these proportion scores together to create a final data set with 18 proportion scores (9 for form and 9 for intent).
Create new data set that only contains discloser turns (role = 1).
newdata_discloser <- newdata[which(newdata$role == 1), ]
Create a separate data set to calculate proportions for discloser form.
newdata_discloser_form <- newdata_discloser[, c("id", "turn", "seg", "form")]
Calculate the proportion of each utterance type for each turn, then merge into one data set.
# Count (i.e., sum) each utterance form type for each dyad ID and turn
form_categories1 <- aggregate(form == 1 ~ id + turn, sum, data = newdata_discloser_form)
form_categories2 <- aggregate(form == 2 ~ id + turn, sum, data = newdata_discloser_form)
form_categories3 <- aggregate(form == 3 ~ id + turn, sum, data = newdata_discloser_form)
form_categories4 <- aggregate(form == 4 ~ id + turn, sum, data = newdata_discloser_form)
form_categories5 <- aggregate(form == 5 ~ id + turn, sum, data = newdata_discloser_form)
form_categories6 <- aggregate(form == 6 ~ id + turn, sum, data = newdata_discloser_form)
form_categories7 <- aggregate(form == 7 ~ id + turn, sum, data = newdata_discloser_form)
form_categories8 <- aggregate(form == 8 ~ id + turn, sum, data = newdata_discloser_form)
form_categories9 <- aggregate(form == 9 ~ id + turn, sum, data = newdata_discloser_form)

# Merge the counts together by dyad ID and turn number
merged_discloser_form <- Reduce(function(x, y) merge(x, y, by = c("id", "turn"), all = TRUE),
                                list(form_categories1, form_categories2, form_categories3,
                                     form_categories4, form_categories5, form_categories6,
                                     form_categories7, form_categories8, form_categories9))

# Calculate the total count by adding the counts of each type
merged_discloser_form$totalform <- merged_discloser_form$`form == 1` +
  merged_discloser_form$`form == 2` +
  merged_discloser_form$`form == 3` +
  merged_discloser_form$`form == 4` +
  merged_discloser_form$`form == 5` +
  merged_discloser_form$`form == 6` +
  merged_discloser_form$`form == 7` +
  merged_discloser_form$`form == 8` +
  merged_discloser_form$`form == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_discloser_form$propform1 <- merged_discloser_form$`form == 1`/merged_discloser_form$totalform
merged_discloser_form$propform2 <- merged_discloser_form$`form == 2`/merged_discloser_form$totalform
merged_discloser_form$propform3 <- merged_discloser_form$`form == 3`/merged_discloser_form$totalform
merged_discloser_form$propform4 <- merged_discloser_form$`form == 4`/merged_discloser_form$totalform
merged_discloser_form$propform5 <- merged_discloser_form$`form == 5`/merged_discloser_form$totalform
merged_discloser_form$propform6 <- merged_discloser_form$`form == 6`/merged_discloser_form$totalform
merged_discloser_form$propform7 <- merged_discloser_form$`form == 7`/merged_discloser_form$totalform
merged_discloser_form$propform8 <- merged_discloser_form$`form == 8`/merged_discloser_form$totalform
merged_discloser_form$propform9 <- merged_discloser_form$`form == 9`/merged_discloser_form$totalform
Create a separate data set to calculate proportions for discloser intent.
newdata_discloser_intent <- newdata_discloser[, c("id", "turn", "seg", "intent")]
Calculate the proportion of each utterance type for each turn, then merge into one data set.
# Count (i.e., sum) each utterance intent type for each dyad ID and turn
intent_categories1 <- aggregate(intent == 1 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories2 <- aggregate(intent == 2 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories3 <- aggregate(intent == 3 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories4 <- aggregate(intent == 4 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories5 <- aggregate(intent == 5 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories6 <- aggregate(intent == 6 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories7 <- aggregate(intent == 7 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories8 <- aggregate(intent == 8 ~ id + turn, sum, data = newdata_discloser_intent)
intent_categories9 <- aggregate(intent == 9 ~ id + turn, sum, data = newdata_discloser_intent)

# Merge the counts together by dyad ID and turn number
merged_discloser_intent <- Reduce(function(x, y) merge(x, y, by = c("id", "turn"), all = TRUE),
                                  list(intent_categories1, intent_categories2, intent_categories3,
                                       intent_categories4, intent_categories5, intent_categories6,
                                       intent_categories7, intent_categories8, intent_categories9))

# Calculate the total count by adding the counts of each type
merged_discloser_intent$totalintent <- merged_discloser_intent$`intent == 1` +
  merged_discloser_intent$`intent == 2` +
  merged_discloser_intent$`intent == 3` +
  merged_discloser_intent$`intent == 4` +
  merged_discloser_intent$`intent == 5` +
  merged_discloser_intent$`intent == 6` +
  merged_discloser_intent$`intent == 7` +
  merged_discloser_intent$`intent == 8` +
  merged_discloser_intent$`intent == 9`

# Calculate proportions by dividing the count of each type by the total count
merged_discloser_intent$propintent1 <- merged_discloser_intent$`intent == 1`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent2 <- merged_discloser_intent$`intent == 2`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent3 <- merged_discloser_intent$`intent == 3`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent4 <- merged_discloser_intent$`intent == 4`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent5 <- merged_discloser_intent$`intent == 5`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent6 <- merged_discloser_intent$`intent == 6`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent7 <- merged_discloser_intent$`intent == 7`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent8 <- merged_discloser_intent$`intent == 8`/merged_discloser_intent$totalintent
merged_discloser_intent$propintent9 <- merged_discloser_intent$`intent == 9`/merged_discloser_intent$totalintent
Merge discloser form and intent data sets.
# Merge data
data_discloser <- merge(merged_discloser_form, merged_discloser_intent, by = c("id", "turn"))
# Examine column names
names(data_discloser)
## [1] "id" "turn" "form == 1" "form == 2" "form == 3"
## [6] "form == 4" "form == 5" "form == 6" "form == 7" "form == 8"
## [11] "form == 9" "totalform" "propform1" "propform2" "propform3"
## [16] "propform4" "propform5" "propform6" "propform7" "propform8"
## [21] "propform9" "intent == 1" "intent == 2" "intent == 3" "intent == 4"
## [26] "intent == 5" "intent == 6" "intent == 7" "intent == 8" "intent == 9"
## [31] "totalintent" "propintent1" "propintent2" "propintent3" "propintent4"
## [36] "propintent5" "propintent6" "propintent7" "propintent8" "propintent9"
# Partition to variables for the cluster analysis
data_discloser <- data_discloser[, c("id", "turn", "propform1", "propform2",
                                     "propform3", "propform4", "propform5",
                                     "propform6", "propform7",
                                     "propform8", "propform9",
                                     "propintent1", "propintent2", "propintent3",
                                     "propintent4", "propintent5", "propintent6",
                                     "propintent7", "propintent8", "propintent9")]
# Re-order rows by dyad ID and turn number
data_discloser <- data_discloser[order(data_discloser$id, data_discloser$turn), ]
# View the first 10 rows of the Discloser turn proportion data
head(data_discloser, 10)
## id turn propform1 propform2 propform3 propform4 propform5 propform6
## 1908 3 1 0.0000000 0.500 0 0 0.000 0.5
## 1919 3 3 0.8571429 0.000 0 0 0.000 0.0
## 1930 3 5 0.7000000 0.200 0 0 0.000 0.0
## 1939 3 7 1.0000000 0.000 0 0 0.000 0.0
## 1940 3 9 0.0000000 1.000 0 0 0.000 0.0
## 1909 3 11 0.2500000 0.750 0 0 0.000 0.0
## 1910 3 13 1.0000000 0.000 0 0 0.000 0.0
## 1911 3 15 0.2500000 0.125 0 0 0.125 0.5
## 1912 3 17 1.0000000 0.000 0 0 0.000 0.0
## 1913 3 19 0.0000000 0.000 0 0 0.000 1.0
## propform7 propform8 propform9 propintent1 propintent2 propintent3
## 1908 0 0 0.0000000 0.00 0.5000000 0
## 1919 0 0 0.1428571 0.00 0.8571429 0
## 1930 0 0 0.1000000 0.40 0.5000000 0
## 1939 0 0 0.0000000 0.00 1.0000000 0
## 1940 0 0 0.0000000 0.00 1.0000000 0
## 1909 0 0 0.0000000 0.25 0.7500000 0
## 1910 0 0 0.0000000 0.00 1.0000000 0
## 1911 0 0 0.0000000 0.25 0.2500000 0
## 1912 0 0 0.0000000 1.00 0.0000000 0
## 1913 0 0 0.0000000 0.00 0.0000000 0
## propintent4 propintent5 propintent6 propintent7 propintent8 propintent9
## 1908 0 0.000 0.500 0 0 0.0000000
## 1919 0 0.000 0.000 0 0 0.1428571
## 1930 0 0.000 0.000 0 0 0.1000000
## 1939 0 0.000 0.000 0 0 0.0000000
## 1940 0 0.000 0.000 0 0 0.0000000
## 1909 0 0.000 0.000 0 0 0.0000000
## 1910 0 0.000 0.000 0 0 0.0000000
## 1911 0 0.125 0.375 0 0 0.0000000
## 1912 0 0.000 0.000 0 0 0.0000000
## 1913 0 0.000 1.000 0 0 0.0000000
Finally, we remove turns that are 100% filler (i.e., uncodable utterances) because they lack theoretical meaning.
How many filler turns are there? That is, how many turns are composed entirely of uncodable utterances (propform9 = 1 and propintent9 = 1)?
# Listener filler turns
length(which(data_listener$propform9 == 1 & data_listener$propintent9 == 1))
## [1] 34
# 34 filler turns for Listeners
# Listener proportion of filler turns
length(which(data_listener$propform9 == 1 & data_listener$propintent9 == 1))/nrow(data_listener)
## [1] 0.01317829
# 1.3% of Listener turns are filler turns
# Discloser filler turns
length(which(data_discloser$propform9 == 1 & data_discloser$propintent9 == 1))
## [1] 39
# 39 filler turns for Disclosers
# Discloser proportion of filler turns
length(which(data_discloser$propform9 == 1 & data_discloser$propintent9 == 1))/nrow(data_discloser)
## [1] 0.0150289
# 1.5% of Discloser turns are filler turns
Remove filler turns. If propform9 and propintent9 both equal 1, then remove the turn from the listener and discloser data sets.
# Listener data set
data_listener <- data_listener %>%
  filter(!(data_listener$propform9 == 1 &
           data_listener$propintent9 == 1)) %>%
  as.data.frame()

# Discloser data set
data_discloser <- data_discloser %>%
  filter(!(data_discloser$propform9 == 1 &
           data_discloser$propintent9 == 1)) %>%
  as.data.frame()
Now our data are ready for a cluster analysis!
Cluster analysis.
We conduct separate cluster analyses for the listener and discloser turns to explore whether the different conversational roles influence the potential types of conversational acts. The cluster analysis involves several steps including (1) removing missing data, (2) scaling the data (i.e., rescaling all of the variables to have mean = 0 and standard deviation = 1), (3) running the cluster analysis, (4) examining the dendrogram to choose the appropriate number of clusters, and (5) saving the cluster assignments for each turn. We walk through these steps in more detail below.
We begin with the listener cluster analysis.
First, we remove rows with missing data from our data set since cluster analysis cannot handle missing data. If your data set has high levels of missingness, it might be worth considering imputation methods to handle the missing data.
list_subset <- data_listener[complete.cases(data_listener), ]
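Before dropping incomplete rows, it can be useful to know how many rows you would lose. A quick illustration on toy data (an aside, not part of the original tutorial):

```r
# Toy data with a missing value in two of three rows
toy <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA))

sum(!complete.cases(toy))                   # number of rows that would be dropped
toy_complete <- toy[complete.cases(toy), ]  # keep only fully observed rows
nrow(toy_complete)                          # rows retained
```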
Second, we scale all of the variables (i.e., the proportion scores) so the mean of each variable is equal to 0 and the standard deviation of each variable is equal to 1. This is typically done when variables in the cluster analysis are on very different scales (e.g., age and income have huge differences in the range of values available to measure each variable). Although differences in scale are not an issue here, we follow this common practice.
Also, make sure that you only scale the variables that will be included in the cluster analysis. In this case, we don’t want to scale the dyad ID or turn number (the first two columns in our data set).
list_scale <- data.frame(scale(list_subset[3:20]))
Add the dyad ID and turn number variables back into the data set and rearrange so dyad ID and turn number are the first two columns of the data set.
# Add dyad ID and turn number variables
list_scale$id <- list_subset$id
list_scale$turn <- list_subset$turn

# Reorder columns
list_scale <- list_scale[, c(19, 20, 1:18)]
# View the first 10 rows of the scaled Listener turn proportion data
head(list_scale, 10)
## id turn propform1 propform2 propform3 propform4 propform5 propform6
## 1 3 2 -0.4516489 -0.5099395 -0.1222075 -0.06964092 -0.4694416 1.18776821
## 2 3 4 -0.4516489 -0.5099395 -0.1222075 -0.06964092 -0.4694416 1.18776821
## 3 3 6 -0.4516489 -0.5099395 -0.1222075 -0.06964092 -0.4694416 1.18776821
## 4 3 8 -0.4516489 -0.5099395 -0.1222075 -0.06964092 2.4229863 -1.00136226
## 5 3 10 -0.4516489 -0.5099395 -0.1222075 -0.06964092 -0.4694416 1.18776821
## 6 3 12 -0.4516489 -0.5099395 -0.1222075 -0.06964092 0.9767724 0.09320297
## 7 3 14 -0.4516489 1.0120514 -0.1222075 -0.06964092 -0.4694416 0.09320297
## 8 3 16 -0.4516489 -0.5099395 -0.1222075 -0.06964092 -0.4694416 1.18776821
## 9 3 18 1.2488045 1.0120514 -0.1222075 -0.06964092 -0.4694416 -1.00136226
## 10 3 20 -0.4516489 2.5340423 -0.1222075 -0.06964092 -0.4694416 -1.00136226
## propform7 propform8 propform9 propintent1 propintent2 propintent3
## 1 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 2 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 3 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 4 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 5 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 6 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 7 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 8 -0.1252172 -0.1940528 -0.22738 -0.3308599 -0.4763249 -0.1490803
## 9 -0.1252172 -0.1940528 -0.22738 1.8783940 1.0657089 -0.1490803
## 10 -0.1252172 -0.1940528 -0.22738 4.0876480 -0.4763249 -0.1490803
## propintent4 propintent5 propintent6 propintent7 propintent8 propintent9
## 1 -0.2259633 -0.4335979 1.3652738 -0.3740743 -0.2815835 -0.2306533
## 2 -0.2259633 -0.4335979 1.3652738 -0.3740743 -0.2815835 -0.2306533
## 3 -0.2259633 -0.4335979 1.3652738 -0.3740743 -0.2815835 -0.2306533
## 4 -0.2259633 2.6225772 -0.8479337 -0.3740743 -0.2815835 -0.2306533
## 5 -0.2259633 -0.4335979 1.3652738 -0.3740743 -0.2815835 -0.2306533
## 6 -0.2259633 1.0944897 -0.8479337 1.4352054 -0.2815835 -0.2306533
## 7 -0.2259633 -0.4335979 -0.8479337 3.2444852 -0.2815835 -0.2306533
## 8 -0.2259633 -0.4335979 1.3652738 -0.3740743 -0.2815835 -0.2306533
## 9 -0.2259633 -0.4335979 -0.8479337 -0.3740743 -0.2815835 -0.2306533
## 10 -0.2259633 -0.4335979 -0.8479337 -0.3740743 -0.2815835 -0.2306533
It’s also good to double check that the variables were scaled properly. We do so by examining whether the mean of the scaled variables is 0 and the standard deviation of the scaled variables is 1.
describe(list_scale)
## vars n mean sd median trimmed mad min max range
## id* 1 2546 31.12 16.94 31.00 31.27 20.76 1.00 59.00 58.00
## turn 2 2546 46.67 29.21 44.00 44.96 32.62 1.00 147.00 146.00
## propform1 3 2546 0.00 1.00 -0.45 -0.28 0.00 -0.45 2.95 3.40
## propform2 4 2546 0.00 1.00 -0.51 -0.25 0.00 -0.51 2.53 3.04
## propform3 5 2546 0.00 1.00 -0.12 -0.12 0.00 -0.12 10.66 10.78
## propform4 6 2546 0.00 1.00 -0.07 -0.07 0.00 -0.07 19.32 19.39
## propform5 7 2546 0.00 1.00 -0.47 -0.24 0.00 -0.47 2.42 2.89
## propform6 8 2546 0.00 1.00 -0.27 -0.02 1.08 -1.00 1.19 2.19
## propform7 9 2546 0.00 1.00 -0.13 -0.13 0.00 -0.13 10.16 10.28
## propform8 10 2546 0.00 1.00 -0.19 -0.19 0.00 -0.19 5.90 6.10
## propform9 11 2546 0.00 1.00 -0.23 -0.23 0.00 -0.23 10.59 10.82
## propintent1 12 2546 0.00 1.00 -0.33 -0.30 0.00 -0.33 4.09 4.42
## propintent2 13 2546 0.00 1.00 -0.48 -0.26 0.00 -0.48 2.61 3.08
## propintent3 14 2546 0.00 1.00 -0.15 -0.15 0.00 -0.15 8.29 8.44
## propintent4 15 2546 0.00 1.00 -0.23 -0.23 0.00 -0.23 5.64 5.87
## propintent5 16 2546 0.00 1.00 -0.43 -0.27 0.00 -0.43 2.62 3.06
## propintent6 17 2546 0.00 1.00 -0.85 -0.06 0.00 -0.85 1.37 2.21
## propintent7 18 2546 0.00 1.00 -0.37 -0.30 0.00 -0.37 3.24 3.62
## propintent8 19 2546 0.00 1.00 -0.28 -0.28 0.00 -0.28 4.13 4.41
## propintent9 20 2546 0.00 1.00 -0.23 -0.23 0.00 -0.23 10.09 10.32
## skew kurtosis se
## id* -0.05 -1.17 0.34
## turn 0.48 -0.33 0.58
## propform1 2.12 3.11 0.02
## propform2 1.77 1.60 0.02
## propform3 9.21 89.04 0.02
## propform4 16.51 292.27 0.02
## propform5 1.83 1.58 0.02
## propform6 0.19 -1.79 0.02
## propform7 8.87 81.88 0.02
## propform8 5.34 27.61 0.02
## propform9 4.91 26.62 0.02
## propintent1 3.21 9.45 0.02
## propintent2 1.88 1.93 0.02
## propintent3 7.35 55.19 0.02
## propintent4 4.80 22.71 0.02
## propintent5 2.05 2.45 0.02
## propintent6 0.50 -1.60 0.02
## propintent7 2.60 5.29 0.02
## propintent8 3.57 11.40 0.02
## propintent9 5.04 29.41 0.02
Our data look ready for the cluster analysis!
Third, we conduct the cluster analysis.
# Set a seed so the analysis is fully reproducible
set.seed(1234)
# Calculate the dissimilarity matrix
# between all turns in the data set using Euclidian distance
# Make sure to only include variables of interest
# (i.e., do not include dyad ID or the turn number variable)
dist_list <- daisy(list_scale[, 3:20], metric = "euclidean", stand = FALSE)

# Compute the agglomerative hierarchical cluster analysis
# using Ward's linkage method
clusterward_listener <- agnes(dist_list, diss = TRUE, method = "ward")
Fourth, we examine the resulting dendrogram to determine an appropriate number of clusters for the data at hand. We examine the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of turns within each group (we don’t want a group with too few turns).
plot(clusterward_listener, which.plot = 2, main = "Ward Clustering of the Listener Data")
Finally, based on the dendrogram (and examining the contents of several cluster solutions), we chose a 6-cluster solution. Using this solution, each turn is assigned to one of the six clusters. We also include code to examine statistics about the chosen cluster solution (e.g., within-cluster heterogeneity), but do not go through those results here.
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 6)
# Insert cluster analysis results object ("clusterward_listener")
# and the number of cut points
wardcluster6_list <- cutree(clusterward_listener, k = 6)
# Cluster statistics
# cluster.stats(dist_list, clustering = wardcluster6_list,
# silhouette = TRUE, sepindex = TRUE)
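Beyond eyeballing the dendrogram, one quantitative check on the choice of k is to compare average silhouette widths across candidate solutions (larger values indicate better-separated clusters). A sketch using the cluster package's silhouette() function, with an illustrative range of k values:

```r
# Compare average silhouette width for several candidate cluster solutions
# (the range 2:8 is illustrative, not prescribed by the tutorial)
for (k in 2:8) {
  assignments <- cutree(clusterward_listener, k = k)
  sil <- silhouette(assignments, dist_list)
  cat("k =", k, " average silhouette width =",
      round(mean(sil[, 3]), 3), "\n")
}
```

Silhouette width is only one criterion; interpretability and cluster sizes matter too, which is why we also examined the contents of several solutions.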
# Create cluster labels; in this case, we have six clusters and label them Type 1, ..., Type 6
cluster6_label_list <- factor(wardcluster6_list,
                              labels = c("Type 1", "Type 2", "Type 3",
                                         "Type 4", "Type 5", "Type 6"))

# Add cluster assignments to the unscaled listener data (list_subset)
list_subset$wardcluster6 <- wardcluster6_list

# Change structure of cluster categories to a factor variable
list_subset$wardcluster6 <- as.factor(list_subset$wardcluster6)
Examine the number of each turn type. The interpretation of each type will be easier once we examine the contents of each turn type below.
listener_freq <- table(list_subset$wardcluster6)
listener_freq
##
## 1 2 3 4 5 6
## 826 368 997 125 49 181
round(prop.table(listener_freq), 2)
##
## 1 2 3 4 5 6
## 0.32 0.14 0.39 0.05 0.02 0.07
Now, we will go through the same process for the discloser cluster analysis.
First, we remove rows with missing data from our data set since cluster analysis cannot handle missing data. If your data set has high levels of missingness, it might be worth considering imputation methods to handle the missing data.
disc_subset <- data_discloser[complete.cases(data_discloser), 1:20]
Second, we scale all of the variables (i.e., the proportion scores) so the mean of each variable is equal to 0 and the standard deviation of each variable is equal to 1. This is typically done when variables in the cluster analysis are on very different scales (e.g., age and income have huge differences in the range of values available to measure each variable). Although differences in scale are not an issue here, we follow this common practice.
Also, make sure that you only scale the variables that will be included in the cluster analysis. In this case, we don’t want to scale the dyad ID or turn number (the first two columns in our data set).
disc_scale <- data.frame(scale(disc_subset[3:20]))
Add the dyad ID and turn number variables back into the data set and rearrange so dyad ID and turn number are the first two columns of the data set.
# Add dyad ID and turn number variables
disc_scale$id <- disc_subset$id
disc_scale$turn <- disc_subset$turn

# Reorder columns
disc_scale <- disc_scale[, c(19, 20, 1:18)]
# View the first 10 rows of the scaled Discloser turn proportion data
head(disc_scale, 10)
## id turn propform1 propform2 propform3 propform4 propform5 propform6
## 1 3 1 -0.8563934 0.5256717 -0.06486956 -0.06280547 -0.2842116 0.6968319
## 2 3 3 1.3274107 -0.7896930 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 3 3 5 0.9270466 -0.2635472 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 4 3 7 1.6913780 -0.7896930 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 5 3 9 -0.8563934 1.8410364 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 6 3 11 -0.2194505 1.1833540 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 7 3 13 1.6913780 -0.7896930 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 8 3 15 -0.2194505 -0.4608519 -0.06486956 -0.06280547 0.2596740 0.6968319
## 9 3 17 1.6913780 -0.7896930 -0.06486956 -0.06280547 -0.2842116 -0.6455197
## 10 3 19 -0.8563934 -0.7896930 -0.06486956 -0.06280547 -0.2842116 2.0391834
## propform7 propform8 propform9 propintent1 propintent2 propintent3
## 1 -0.06854934 -0.1187548 -0.3231609 -0.5672444 0.07739987 -0.07313752
## 2 -0.06854934 -0.1187548 0.8890512 -0.5672444 0.89952221 -0.07313752
## 3 -0.06854934 -0.1187548 0.5253876 0.6751414 0.07739987 -0.07313752
## 4 -0.06854934 -0.1187548 -0.3231609 -0.5672444 1.22837114 -0.07313752
## 5 -0.06854934 -0.1187548 -0.3231609 -0.5672444 1.22837114 -0.07313752
## 6 -0.06854934 -0.1187548 -0.3231609 0.2092467 0.65288551 -0.07313752
## 7 -0.06854934 -0.1187548 -0.3231609 -0.5672444 1.22837114 -0.07313752
## 8 -0.06854934 -0.1187548 -0.3231609 0.2092467 -0.49808577 -0.07313752
## 9 -0.06854934 -0.1187548 -0.3231609 2.5387202 -1.07357141 -0.07313752
## 10 -0.06854934 -0.1187548 -0.3231609 -0.5672444 -1.07357141 -0.07313752
## propintent4 propintent5 propintent6 propintent7 propintent8 propintent9
## 1 -0.2917414 -0.2711234 1.1117651 -0.1690681 -0.1341955 -0.3267685
## 2 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 0.8181791
## 3 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 0.4746948
## 4 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 -0.3267685
## 5 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 -0.3267685
## 6 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 -0.3267685
## 7 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 -0.3267685
## 8 -0.2917414 0.3002788 0.7159172 -0.1690681 -0.1341955 -0.3267685
## 9 -0.2917414 -0.2711234 -0.4716268 -0.1690681 -0.1341955 -0.3267685
## 10 -0.2917414 -0.2711234 2.6951570 -0.1690681 -0.1341955 -0.3267685
It’s also good to double check that the variables were scaled properly. We do so by examining whether the mean of the scaled variables is 0 and the standard deviation of the scaled variables is 1.
describe(disc_scale)
## vars n mean sd median trimmed mad min max range
## id* 1 2556 31.14 16.90 31.00 31.31 20.76 1.00 59.00 58.00
## turn 2 2556 46.28 29.26 44.00 44.59 32.62 1.00 146.00 145.00
## propform1 3 2556 0.00 1.00 -0.86 -0.10 0.00 -0.86 1.69 2.55
## propform2 4 2556 0.00 1.00 -0.79 -0.13 0.00 -0.79 1.84 2.63
## propform3 5 2556 0.00 1.00 -0.06 -0.06 0.00 -0.06 20.34 20.40
## propform4 6 2556 0.00 1.00 -0.06 -0.06 0.00 -0.06 26.69 26.76
## propform5 7 2556 0.00 1.00 -0.28 -0.28 0.00 -0.28 4.07 4.35
## propform6 8 2556 0.00 1.00 -0.65 -0.17 0.00 -0.65 2.04 2.68
## propform7 9 2556 0.00 1.00 -0.07 -0.07 0.00 -0.07 21.65 21.72
## propform8 10 2556 0.00 1.00 -0.12 -0.12 0.00 -0.12 10.53 10.65
## propform9 11 2556 0.00 1.00 -0.32 -0.30 0.00 -0.32 8.16 8.49
## propintent1 12 2556 0.00 1.00 -0.57 -0.24 0.00 -0.57 2.54 3.11
## propintent2 13 2556 0.00 1.00 0.08 -0.02 1.71 -1.07 1.23 2.30
## propintent3 14 2556 0.00 1.00 -0.07 -0.07 0.00 -0.07 21.81 21.88
## propintent4 15 2556 0.00 1.00 -0.29 -0.29 0.00 -0.29 4.54 4.83
## propintent5 16 2556 0.00 1.00 -0.27 -0.27 0.00 -0.27 4.30 4.57
## propintent6 17 2556 0.00 1.00 -0.47 -0.28 0.00 -0.47 2.70 3.17
## propintent7 18 2556 0.00 1.00 -0.17 -0.17 0.00 -0.17 7.11 7.28
## propintent8 19 2556 0.00 1.00 -0.13 -0.13 0.00 -0.13 8.78 8.92
## propintent9 20 2556 0.00 1.00 -0.33 -0.30 0.00 -0.33 7.69 8.01
## skew kurtosis se
## id* -0.06 -1.16 0.33
## turn 0.47 -0.33 0.58
## propform1 0.66 -1.13 0.02
## propform2 0.87 -0.77 0.02
## propform3 18.14 349.41 0.02
## propform4 19.84 453.62 0.02
## propform5 3.54 11.15 0.02
## propform6 1.25 -0.05 0.02
## propform7 18.11 361.59 0.02
## propform8 9.31 89.69 0.02
## propform9 3.44 12.80 0.02
## propintent1 1.63 1.31 0.02
## propintent2 0.11 -1.71 0.02
## propintent3 16.67 312.42 0.02
## propintent4 3.70 13.00 0.02
## propintent5 3.76 12.81 0.02
## propintent6 2.01 2.52 0.02
## propintent7 6.37 40.70 0.02
## propintent8 8.01 64.94 0.02
## propintent9 3.52 14.02 0.02
Our data look ready for the cluster analysis!
Third, we conduct the cluster analysis.
# Set a seed so the analysis is fully reproducible
set.seed(1234)
# Calculate the dissimilarity matrix
# between all turns in the data set using Euclidian distance
# Make sure to only include variables of interest
# (i.e., do not include dyad ID or the turn number variable)
dist_disc <- daisy(disc_scale[, 3:20], metric = "euclidean", stand = FALSE)

# Compute the agglomerative hierarchical cluster analysis using Ward's linkage method
clusterward_discloser <- agnes(dist_disc, diss = TRUE, method = "ward")
Fourth, we examine the resulting dendrogram to determine an appropriate number of clusters for the data at hand. We examine the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of turns within each group (we don’t want a group with too few turns).
plot(clusterward_discloser, which.plot = 2, main = "Ward Clustering of the Discloser Data")
Finally, based on the dendrogram (and examining the contents of several cluster solutions), we chose a 6-cluster solution. Using this solution, each turn is assigned to one of the six clusters. We also include code to examine statistics about the chosen cluster solution (e.g., within-cluster heterogeneity), but do not go through those results here.
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 6)
# Insert cluster analysis results object ("clusterward_discloser")
# and the number of cut points
wardcluster6_disc <- cutree(clusterward_discloser, k = 6)
# Cluster statistics
# cluster.stats(dist_disc, clustering = wardcluster6_disc,
# silhouette = TRUE, sepindex = TRUE)
# Create cluster labels; in this case, we have six clusters and label them Type 1, ..., Type 6
cluster6_label_disc <- factor(wardcluster6_disc,
                              labels = c("Type 1", "Type 2", "Type 3",
                                         "Type 4", "Type 5", "Type 6"))

# Add cluster assignments to the unscaled discloser data (disc_subset)
disc_subset$wardcluster6 <- wardcluster6_disc

# Change structure of cluster categories to a factor variable
disc_subset$wardcluster6 <- as.factor(disc_subset$wardcluster6)
Examine the number of each turn type. The interpretation of each type will be easier once we examine the contents of each turn type below.
disc_freq <- table(disc_subset$wardcluster6)
disc_freq
##
## 1 2 3 4 5 6
## 1676 308 394 114 50 14
round(prop.table(disc_freq), 2)
##
## 1 2 3 4 5 6
## 0.66 0.12 0.15 0.04 0.02 0.01
Plot Clusters.
Listener cluster results.
Merge cluster assignments data with original listener turn proportion data and partition the data to only include the proportion scores and the cluster assignments.
# Merge data
list_plot <- merge(data_listener, list_subset[, c("id", "turn", "wardcluster6")])

# Partition only proportion scores and cluster assignments
list_plot <- list_plot[, c("wardcluster6", "propform1", "propintent1",
                           "propform2", "propintent2",
                           "propform3", "propintent3",
                           "propform4", "propintent4",
                           "propform5", "propintent5",
                           "propform6", "propintent6",
                           "propform7", "propintent7",
                           "propform8", "propintent8",
                           "propform9", "propintent9")]
# View the first 10 rows of the Listener turn proportion data with the cluster assignment
head(list_plot, 10)
## wardcluster6 propform1 propintent1 propform2 propintent2 propform3
## 1 3 0 0 0.0 0.0 0
## 2 3 0 0 1.0 1.0 0
## 3 4 0 0 0.5 0.5 0
## 4 3 1 1 0.0 0.0 0
## 5 1 0 0 0.0 0.0 0
## 6 1 0 0 0.0 0.0 0
## 7 3 0 0 0.0 0.0 0
## 8 1 0 0 0.0 0.0 0
## 9 3 0 0 0.0 0.0 0
## 10 3 0 0 1.0 0.0 0
## propintent3 propform4 propintent4 propform5 propintent5 propform6
## 1 0 0 0 0 0 1
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 1
## 6 0 0 0 0 0 1
## 7 0 0 0 0 0 1
## 8 0 0 0 0 0 1
## 9 0 0 1 0 0 1
## 10 0 0 0 0 0 0
## propintent6 propform7 propintent7 propform8 propintent8 propform9
## 1 0 0 1 0 0 0.0
## 2 0 0 0 0 0 0.0
## 3 0 0 0 0 0 0.5
## 4 0 0 0 0 0 0.0
## 5 1 0 0 0 0 0.0
## 6 1 0 0 0 0 0.0
## 7 0 0 1 0 0 0.0
## 8 1 0 0 0 0 0.0
## 9 0 0 0 0 0 0.0
## 10 0 0 1 0 0 0.0
## propintent9
## 1 0.0
## 2 0.0
## 3 0.5
## 4 0.0
## 5 0.0
## 6 0.0
## 7 0.0
## 8 0.0
## 9 0.0
## 10 0.0
The data need to be “melted” for plotting. Melting reshapes the data into long format, such that the proportion scores are stacked in one column and the utterance type and cluster assignment associated with each proportion are in two other columns.
# Melt data
list_melt <- melt(list_plot, id = "wardcluster6")
# View the first 10 rows of the melted Listener data
head(list_melt, 10)
## wardcluster6 variable value
## 1 3 propform1 0
## 2 3 propform1 0
## 3 4 propform1 0
## 4 3 propform1 1
## 5 1 propform1 0
## 6 1 propform1 0
## 7 3 propform1 0
## 8 1 propform1 0
## 9 3 propform1 0
## 10 3 propform1 0
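This tutorial uses melt() from the reshape package; if you prefer the tidyverse, tidyr offers an equivalent reshape (an assumption — tidyr is not loaded in this tutorial):

```r
# Hypothetical alternative to reshape::melt() using tidyr
library(tidyr)

list_melt_tidy <- pivot_longer(list_plot,
                               cols = -wardcluster6,
                               names_to = "variable",
                               values_to = "value")
```

Note that melt() stacks column by column while pivot_longer() works row by row, so the row ordering differs between the two results even though the contents are the same; this does not matter for the summary plots below.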
Plot the average proportion of utterance contents for the 6 listener clusters. The naming and ordering of the clusters can be rearranged depending on how you would like your clusters ordered. In this case, we ordered the clusters alphabetically based upon the conceptual labels we gave the turn types.
# Change the structure of the cluster variable to
# rearrange the order of the clusters in the plot
list_melt$wardcluster6_factor <- factor(list_melt$wardcluster6,
                                        levels = c('1', '5', '3', '4', '2', '6'))

# Create labels for the clusters to be used in the plot
cluster_labels_list <- c('1' = "Acknowledge", '5' = "Advice",
                         '3' = "Elaboration", '4' = "Hedged Disc",
                         '2' = "Question", '6' = "Reflection")
# Plot the contents of each cluster with the list_melt data,
# the utterance types ("variable") on the x-axis,
# the proportion of each utterance type ("value") on the y-axis,
# and the color of the bars differing based upon the utterance type ("variable")
ggplot(list_melt, aes(x = variable, y = value, fill = factor(variable))) +
# Calculate mean proportion for each utterance type and display as a bar chart
stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
# Create a different panel for each cluster and
# label each cluster with the labels we created above
facet_grid(~wardcluster6_factor,
labeller = labeller(wardcluster6_factor = cluster_labels_list)) +
# Do not include utterance type labels on the x-axis
theme(axis.text.x=element_blank()) +
# Create legend: name the legend ("Utterance Type") and
# order the contents (breaks) and labels (labels) in the same order
scale_fill_discrete(name = "Utterance Type",
breaks = c("propform1", "propintent1", "propform2", "propintent2",
"propform3", "propintent3", "propform4", "propintent4",
"propform5", "propintent5", "propform6", "propintent6",
"propform7", "propintent7", "propform8", "propintent8",
"propform9", "propintent9"),
labels = c("Disclosure Form", "Disclosure Intent",
"Edification Form", "Edification Intent",
"Advisement Form", "Advisement Intent",
"Confirmation Form", "Confirmation Intent",
"Question Form", "Question Intent",
"Acknowledgement Form", "Acknowledgement Intent",
"Interpretation Form", "Interpretation Intent",
"Reflection Form", "Reflection Intent",
"Uncodable Form", "Uncodable Intent")) +
# X-axis label
xlab("Cluster") +
# Y- axis label
ylab("Proportion") +
# Change background
theme_classic()
Discloser cluster results.
Merge cluster assignments data with original discloser turn proportion data and partition the data to only include the proportion scores and the cluster assignments.
# Merge data
disc_plot <- merge(data_discloser, disc_subset[, c("id", "turn", "wardcluster6")])

# Partition only proportion scores and cluster assignments
disc_plot <- disc_plot[, c("wardcluster6", "propform1", "propintent1",
                           "propform2", "propintent2",
                           "propform3", "propintent3",
                           "propform4", "propintent4",
                           "propform5", "propintent5",
                           "propform6", "propintent6",
                           "propform7", "propintent7",
                           "propform8", "propintent8",
                           "propform9", "propintent9")]
# View the first 10 rows of the Discloser turn proportion data with the cluster assignment
head(disc_plot, 10)
## wardcluster6 propform1 propintent1 propform2 propintent2 propform3
## 1 4 0.0000000 0.0000000 0.0000000 0.0000000 0
## 2 4 0.0000000 0.0000000 0.0000000 0.0000000 0
## 3 1 0.0000000 0.0000000 0.5000000 0.5000000 0
## 4 2 0.0000000 0.0000000 0.5000000 0.5000000 0
## 5 1 0.5000000 0.0000000 0.5000000 1.0000000 0
## 6 1 0.7142857 0.1428571 0.2857143 0.8571429 0
## 7 1 0.4444444 0.3333333 0.5555556 0.6666667 0
## 8 2 0.0000000 0.0000000 0.6666667 0.6666667 0
## 9 1 0.0000000 0.0000000 1.0000000 1.0000000 0
## 10 1 0.0000000 0.0000000 1.0000000 1.0000000 0
## propintent3 propform4 propintent4 propform5 propintent5 propform6
## 1 0 0 0 1 1 0.0
## 2 0 0 0 1 1 0.0
## 3 0 0 0 0 0 0.5
## 4 0 0 0 0 0 0.0
## 5 0 0 0 0 0 0.0
## 6 0 0 0 0 0 0.0
## 7 0 0 0 0 0 0.0
## 8 0 0 0 0 0 0.0
## 9 0 0 0 0 0 0.0
## 10 0 0 0 0 0 0.0
## propintent6 propform7 propintent7 propform8 propintent8 propform9
## 1 0.0 0 0 0 0 0.0000000
## 2 0.0 0 0 0 0 0.0000000
## 3 0.5 0 0 0 0 0.0000000
## 4 0.0 0 0 0 0 0.5000000
## 5 0.0 0 0 0 0 0.0000000
## 6 0.0 0 0 0 0 0.0000000
## 7 0.0 0 0 0 0 0.0000000
## 8 0.0 0 0 0 0 0.3333333
## 9 0.0 0 0 0 0 0.0000000
## 10 0.0 0 0 0 0 0.0000000
## propintent9
## 1 0.0000000
## 2 0.0000000
## 3 0.0000000
## 4 0.5000000
## 5 0.0000000
## 6 0.0000000
## 7 0.0000000
## 8 0.3333333
## 9 0.0000000
## 10 0.0000000
The data need to be “melted” for plotting. Melting reshapes the data into long format, such that the proportion scores are stacked in one column and the utterance type and cluster assignment associated with each proportion are in two other columns.
# Melt data
disc_melt <- melt(disc_plot, id = "wardcluster6")
# View the first 10 rows of the melted Discloser data
head(disc_melt, 10)
## wardcluster6 variable value
## 1 4 propform1 0.0000000
## 2 4 propform1 0.0000000
## 3 1 propform1 0.0000000
## 4 2 propform1 0.0000000
## 5 1 propform1 0.5000000
## 6 1 propform1 0.7142857
## 7 1 propform1 0.4444444
## 8 2 propform1 0.0000000
## 9 1 propform1 0.0000000
## 10 1 propform1 0.0000000
Plot the average proportion of utterance contents for the 6 discloser clusters. The naming and ordering of the clusters can be rearranged depending on how you would like your clusters ordered. In this case, we ordered the clusters alphabetically based upon the conceptual labels we gave the turn types.
# Change the structure of the cluster variable to
# rearrange the order of the clusters in the plot
disc_melt$wardcluster6_factor <- factor(disc_melt$wardcluster6,
                                        levels = c('3', '6', '1', '2', '4', '5'))

# Create labels for the clusters to be used in the plot
cluster_labels_disc <- c('3' = "Acknowledge", '6' = "Advice",
                         '1' = "Elaboration", '2' = "Hedged Disc",
                         '4' = "Question", '5' = "Reflection")
# Plot the contents of each cluster with the disc_melt data,
# the utterance types ("variable") on the x-axis,
# the proportion of each utterance type ("value") on the y-axis,
# and the color of the bars differing based upon the utterance type ("variable")
ggplot(disc_melt, aes(x = variable, y = value, fill = factor(variable))) +
# Calculate mean proportion for each utterance type and display as a bar chart
stat_summary(fun = mean, geom = "bar", position = position_dodge(1)) +
# Create a different panel for each cluster and
# label each cluster with the labels we created above
facet_grid(~wardcluster6_factor,
labeller = labeller(wardcluster6_factor = cluster_labels_disc)) +
# Do not include utterance type labels on the x-axis
theme(axis.text.x=element_blank()) +
# Create legend: name the legend ("Utterance Type") and
# order the contents (breaks) and labels (labels) in the same order
scale_fill_discrete(name = "Utterance Type",
breaks = c("propform1", "propintent1", "propform2", "propintent2",
"propform3", "propintent3", "propform4", "propintent4",
"propform5", "propintent5", "propform6", "propintent6",
"propform7", "propintent7", "propform8", "propintent8",
"propform9", "propintent9"),
labels = c("Disclosure Form", "Disclosure Intent",
"Edification Form", "Edification Intent",
"Advisement Form", "Advisement Intent",
"Confirmation Form", "Confirmation Intent",
"Question Form", "Question Intent",
"Acknowledgement Form", "Acknowledgement Intent",
"Interpretation Form", "Interpretation Intent",
"Reflection Form", "Reflection Intent",
"Uncodable Form", "Uncodable Intent")) +
# X-axis label
xlab("Cluster") +
# Y- axis label
ylab("Proportion") +
# Change background
theme_classic()
Ta-da!
Additional Information
We created this tutorial with a system environment and versions of R and packages that might be different from yours. If R reports errors when you attempt to run this tutorial, running the code chunk below and comparing your output and the tutorial posted on the LHAMA website may be helpful.
session_info(pkgs = c("attached"))
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os macOS Big Sur/Monterey 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-08-20
## pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## cluster * 2.1.3 2022-03-28 [1] CRAN (R 4.2.0)
## devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
## psych * 2.2.5 2022-05-10 [1] CRAN (R 4.2.0)
## reshape * 0.8.9 2022-04-12 [1] CRAN (R 4.2.0)
## usethis * 2.1.6 2022-05-25 [1] CRAN (R 4.2.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────