---
title: "State Space Grid Metric Tutorial"
author: Miriam Brinberg
output: 
  rmdformats::robobook:
  html_document: default
  word_document: default
editor_options:
  chunk_output_type: console
---

# Overview
This tutorial will provide R code on establishing a state space grid and deriving several metrics.

State space grids are typically used to depict and quantify longitudinal dyadic (or bivariate) data in a 2-dimensional space (Hollenstein, 2013). In this example, we plot the speaking turn behaviors enacted by each dyad member during a support conversation in which one dyad member shared a current problem they were facing. Each turn in the conversations was coded into one of six turn types e.g., as an acknowledgement, question, advice, etc.). 

Here, we demonstrate how to calculate three metrics illustrated in the accompanying paper ("Using State Space Grids to Quantify and Examine Dynamics of Dyadic Conversation" by Brinberg, Solomon, Bodie, Jones, & Ram in *Communication Methods and Measures*):      

+ *Entropy* as a measure of behavioral flexibility      
+ *Time using problem description behaviors* as a measure of an attractor      
+ *Time of exit from the problem description attractor* as a measure of a phase shift      
   
We also demonstrate how to examine the association between the state space grid-derived metrics and a conversational outcome (in this case, distress) using multiple regression. 

Note that the accompanying "SSGMetric_Tutorial.rmd" file contains all of the code presented in this tutorial and can be opened in RStudio (a somewhat more friendly user interface to R). This file can be used so you don't have to copy and paste code or so you can easily rename the variables in the current code.

Thank you to Jon Helm for writing the original state space grid plotting and entropy code. 

# Outline
In this tutorial, we'll cover...  

* Reading in the data and loading needed packages.  
* General data preparation.  
* Entropy.   
* Problem description attractor.   
* Attractor exit time.   
* SSG metrics and between-dyad differences.  

# Read in the data and load needed libraries.

Let's read the data into R.  

The exemplar data set we are working with is called "StrangerConversations_N59" and is stored as a .csv file (comma-separated values file, which can be created by saving an Excel file as a csv document) on my computer's desktop.
```{r}
# Set working directory (i.e., where your data file is stored)
# This can be done by going to the top bar of RStudio and selecting "Session" --> "Set Working Directory" --> "Choose Directory" --> finding the location of your file
setwd("~/Desktop")

# Read in the repeated measures data
data <- read.csv(file = "StrangerConversations_N59.csv", head = TRUE, sep = ",")

# View the first 10 rows of the repeated measures data
head(data, 10)

# Read in the outcomes data
outcomes <- read.csv(file = "StrangerConversations_N59_Outcomes.csv", head = TRUE, sep = ",")

# View the first 10 rows of the outcomes data
head(outcomes, 10)
```

In the data, we can see each row contains information for one utterance and there are multiple rows (i.e., multiple utterances) for each dyad. Specifically, there is a column for:  

* Dyad ID (`id`)  
* Time variable - in this case, turn in the conversation (`turn`)  
* Dyad member ID - in this case, role in the conversation (`role`; discloser = 1, listener = 2)  
* Turn type - in this case, a typology of six different speaking turn behaviors: acknowledgement, advice, elaboration, hedged disclosure, question, and reflection (`turn_type`)

In the outcome data ("outcomes"), we can see there is one row for each dyad and there are columns for:  

* Dyad ID (`id`)  
* Outcome variable - in this case, post-conversation report of distress by the support receiver (`distress`)  

Load the R packages we need.  

Packages in R are a collection of functions (and their documentation/explanations) that enable us to conduct particular tasks, such as plotting or fitting a statistical model.
```{r, message = FALSE, warning = FALSE}
# install.packages("devtools") # Install package if you have never used it before
library(devtools) # For version control

# install.packages("dplyr") # Install package if you have never used it before
library(dplyr) # For data management

# install.packages("ggplot2") # Install package if you have never used it before
library(ggplot2) # For plotting

# install.packages("psych") # Install package if you have never used it before
library(psych) # For descriptive statistics

# install.packages("tidyr") # Install package if you have never used it before
library(tidyr) # For data management

# install.package("vctrs") # Install package if you have never used it before
library(vctrs) # For data management
```

# General data preparation. 

Before calculating our state space grid metrics, we need to create a new variable that will assign each speaking turn transition pair to a cell (or state) in the state space grid.

To do so, we first create two new variables that represent the listeners' turn types and disclosers' turn types separately.
```{r}
# Add "Discloser" to turn_type variable, then set all Listener (role = 2) turns to NA
data$discloser_turntype <- paste("Discloser", data$turn_type, sep = " ")
data$discloser_turntype[data$role == 2] <- NA

# Add "Listener" to turn_type variable, then set all Discloser (role = 1) turns to NA
data$listener_turntype <- paste("Listener", data$turn_type, sep = " ")
data$listener_turntype[data$role == 1] <- NA

# Reset missing values
data$listener_turntype[data$listener_turntype == "Listener NA"] <- NA
data$discloser_turntype[data$discloser_turntype == "Discloser NA"] <- NA

# View the first 10 rows of the data
head(data, 10)
```

We then carry forward each listener and discloser turn type. Specifically, we assign the following row (i.e., the following turn) that is missing to the turn type prior for both the listener turn type column and the discloser turn type column.
```{r}
data <- # Select data
        data %>%
        # Select grouping variable, in this case, dyad ID (id)
        dplyr::group_by(id) %>% 
        # Fill in the turn type for "listener_turntype" and "discloser_turntype"
        # such that the turn type is carried forward one row
        # in order to capture turn transitions
        dplyr::mutate(listener_turntype = vec_fill_missing(listener_turntype, max_fill = 1),
                      discloser_turntype = vec_fill_missing(discloser_turntype, max_fill = 1)) %>%
        # Save the data as a data.frame
        as.data.frame()

# View the first 10 rows of the data
head(data, 10)
```

Next, we create a new variable ("ssg_state") in which we label each combination of the disclosers' and listeners' turns to a cell in the state space grid. Since we have six turn types for both disclosers and listeners, we will have 36 states in which the conversation can exist. 
```{r}
data$ssg_cell <- NA


data$ssg_cell[data$listener_turntype == "Listener Acknowledgement" &
              data$discloser_turntype == "Discloser Elaboration"] <- "LAck_DElab"

data$ssg_cell[data$listener_turntype == "Listener Question" &
              data$discloser_turntype == "Discloser Elaboration"] <- "LQues_DElab"

data$ssg_cell[data$listener_turntype == "Listener Reflection" &
              data$discloser_turntype == "Discloser Elaboration"] <- "LRefl_DElab"

data$ssg_cell[data$listener_turntype == "Listener Acknowledgement" &
              data$discloser_turntype == "Discloser HedgedDisclosure"] <- "LAck_DHedg"

data$ssg_cell[data$listener_turntype == "Listener Question" &
              data$discloser_turntype == "Discloser HedgedDisclosure"] <- "LQues_DHedg"

data$ssg_cell[data$listener_turntype == "Listener Reflection" &
              data$discloser_turntype == "Discloser HedgedDisclosure"] <- "LRefl_DHedg"

data$ssg_cell[data$listener_turntype == "Listener Elaboration" &
              data$discloser_turntype == "Discloser Elaboration"] <- "LElab_DElab"

data$ssg_cell[data$listener_turntype == "Listener Elaboration" &
              data$discloser_turntype == "Discloser HedgedDisclosure"] <- "LElab_DHedg"

data$ssg_cell[data$listener_turntype == "Listener Elaboration" &
              data$discloser_turntype == "Discloser Reflection"] <- "LElab_DRefl"

data$ssg_cell[data$listener_turntype == "Listener Elaboration" &
              data$discloser_turntype == "Discloser Question"] <- "LElab_DQues"

data$ssg_cell[data$listener_turntype == "Listener Elaboration" &
              data$discloser_turntype == "Discloser Acknowledgement"] <- "LElab_DAck"

data$ssg_cell[data$listener_turntype == "Listener HedgedDisclosure" &
              data$discloser_turntype == "Discloser Elaboration"] <- "LHedg_DElab"

data$ssg_cell[data$listener_turntype == "Listener HedgedDisclosure" &
              data$discloser_turntype == "Discloser HedgedDisclosure"] <- "LHedg_DHedg"

data$ssg_cell[data$listener_turntype == "Listener HedgedDisclosure" &
              data$discloser_turntype == "Discloser Reflection"] <- "LHedg_DRefl"

data$ssg_cell[data$listener_turntype == "Listener HedgedDisclosure" &
              data$discloser_turntype == "Discloser Question"] <- "LHedg_DQues"

data$ssg_cell[data$listener_turntype == "Listener HedgedDisclosure" &
              data$discloser_turntype == "Discloser Acknowledgement"] <- "LHedg_DAck"

data$ssg_cell[data$listener_turntype == "Listener Acknowledgement" &
              data$discloser_turntype == "Discloser Reflection"] <- "LAck_DRefl"

data$ssg_cell[data$listener_turntype == "Listener Acknowledgement" &
              data$discloser_turntype == "Discloser Question"] <- "LAck_DQues"

data$ssg_cell[data$listener_turntype == "Listener Acknowledgement" &
              data$discloser_turntype == "Discloser Acknowledgement"] <- "LAck_DAck"

data$ssg_cell[data$listener_turntype == "Listener Question" &
              data$discloser_turntype == "Discloser Reflection"] <- "LQues_DRefl"

data$ssg_cell[data$listener_turntype == "Listener Question" &
              data$discloser_turntype == "Discloser Question"] <- "LQues_DQues"

data$ssg_cell[data$listener_turntype == "Listener Question" &
              data$discloser_turntype == "Discloser Acknowledgement"] <- "LQues_DAck"

data$ssg_cell[data$listener_turntype == "Listener Reflection" &
              data$discloser_turntype == "Discloser Reflection"] <- "LRefl_DRefl"

data$ssg_cell[data$listener_turntype == "Listener Reflection" &
              data$discloser_turntype == "Discloser Question"] <- "LRefl_DQues"

data$ssg_cell[data$listener_turntype == "Listener Reflection" &
              data$discloser_turntype == "Discloser Acknowledgement"] <- "LRefl_DAck"

data$ssg_cell[data$listener_turntype == "Listener Advice" &
              data$discloser_turntype == "Discloser Elaboration"] <- "LAdv_DElab"

data$ssg_cell[data$listener_turntype == "Listener Advice" &
              data$discloser_turntype == "Discloser HedgedDisclosure"] <- "LAdv_DHedg"

data$ssg_cell[data$listener_turntype == "Listener Advice" &
              data$discloser_turntype == "Discloser Reflection"] <- "LAdv_DRefl"

data$ssg_cell[data$listener_turntype == "Listener Advice" &
              data$discloser_turntype == "Discloser Question"] <- "LAdv_DQues"

data$ssg_cell[data$listener_turntype == "Listener Advice" &
              data$discloser_turntype == "Discloser Acknowledgement"] <- "LAdv_DAck"

data$ssg_cell[data$listener_turntype == "Listener Advice" &
              data$discloser_turntype == "Discloser Advice"] <- "LAdv_DAdv"

data$ssg_cell[data$listener_turntype == "Listener Acknowledgement" &
              data$discloser_turntype == "Discloser Advice"] <- "LAck_DAdv"

data$ssg_cell[data$listener_turntype == "Listener Question" &
              data$discloser_turntype == "Discloser Advice"] <- "LQues_DAdv"

data$ssg_cell[data$listener_turntype == "Listener Reflection" &
              data$discloser_turntype == "Discloser Advice"] <- "LRefl_DAdv"

data$ssg_cell[data$listener_turntype == "Listener HedgedDisclosure" &
              data$discloser_turntype == "Discloser Advice"] <- "LHedg_DAdv"

data$ssg_cell[data$listener_turntype == "Listener Elaboration" &
              data$discloser_turntype == "Discloser Advice"] <- "LElab_DAdv"

# View the first 10 rows of the data
head(data, 10)
```

Finally, we count the number of turns per conversation, which will serve as a control variable in a later analysis.
```{r}
#create variable with total number of turns per dyad
total_turns <- # Select data
               data %>%
               # Select grouping variable, in this case, dyad ID (id)
               dplyr::group_by(id) %>%
               # Create a new variable ("total_turns") that represents the turn number of turns
               # within the conversation - i.e., the maximum turn number
               dplyr::summarise(total_turns = max(turn)) %>%
               # Save the data as a data.frame
               as.data.frame()

# View the first 10 rows of the total_turns data
head(total_turns, 10)
```

Merge "total_turns" into "outcomes."
```{r}
outcomes <- merge(outcomes, total_turns, by = "id")

# View the first 10 rows of the outcomes data
head(outcomes, 10)
```

# Entropy.   

In this section, we calculate entropy for each dyad.

Create function to calculate entropy. Thank you to [blinded for review] for creating this function!
```{r}
# Create the function called iEntropy which calculates entropy for each conversation

# This function takes in two vectors. x is the observed vector on which means are calculated, id is a vector identification values.
Entropy = function(x = x, base = exp(1)){
  
# First, create a vector that counts all of the responses of x
      count_x = table(x)
      
# Divide that table by its sum to convert the counts of the responses to probabilities of the responses
      prob_x  = count_x / sum(count_x)

# Take the logarithm of all of the probabilities with the user-specified base    
      log_x = log(prob_x, base)
      
# If any of the values within log_x equal -Inf (i.e. negative infinity) then replace them with 0      
      if(any(log_x == -Inf)){
          log_x_corrected =  log_x
          log_x_corrected[which(log_x == -Inf)] = 0
      }
      

# If all of the values do not equal -Inf, then do not replace any of them      
      if(all(log_x != -Inf)){
          log_x_corrected =  log_x
      }     

# Multiply the probablilities by the corresponding logs of the probabilities            
      prod_x = prob_x * log_x_corrected
      
# sum the products of the probabilities and log probabilities, and multiply by negative one      
      entropy = -1*sum(prod_x)

# return entropy            
      return(entropy)
}


# This second function takes the originally defined Entropy function and applies it to each individual
# I entropy takes in a vector of obesrvations, x, and id vector, and a user-specified base of the logarithm
iEntropy = function(x = x, id = id, base = exp(1)){
  
# The aggregate function splits x by id, and then applies the entropy function to each of the subsets of x  
      out = aggregate(x, by = list(id), FUN = Entropy, base = base)
      
# Rename the 'out' to 'id' and 'iEntropy'
      names(out) = c('id', 'iEntropy')
      
# Return the calculated iEntropy values
      return(out)
}
```

Calculate entropy.
```{r}
# Remove missing data for the sake of calculating entropy
data_noNA <- data[!is.na(data$ssg_cell), ]

# Change "ssg_cell" to factor so the function recognizes the variable as categorical (rather than as a character)
data_noNA$ssg_cell <- as.factor(data_noNA$ssg_cell)

# Calculate entropy and save to new data frame ("entropy_data")
entropy_data <-           # Select variable
                aggregate(data_noNA$ssg_cell, 
                          # Apply variable to each group, in this case, dyad id ("id")
                          by = list(data_noNA$id), 
                          # Apply the Entropy function we created
                          FUN = Entropy, base = exp(1))

# Rename columns in new data set
names(entropy_data) = c('id', 'iEntropy')

# View the first 10 rows of the entropy data
head(entropy_data, 10)
```

Entropy descriptives.
```{r}
# Examine descriptive statistics of iEntropy
describe(entropy_data$iEntropy)

# Plot iEntropy distribution
      # Select data and variable of interest (iEntropy)
ggplot(entropy_data, aes(x = iEntropy)) + 
      # Create histogram and set width of bars in histogram
      geom_histogram(binwidth = 0.10) + 
      # Label x-axis
      xlab("Entropy - Conversation Behavioral Flexibility") +
      # Label y-axis
      ylab("Frequency") +
      # Update aesthetics
      theme_classic()
```

Merge "entropy_data" into "outcomes."
```{r}
outcomes <- merge(outcomes, entropy_data, by = "id")

# View the first 10 rows of the outcomes data
head(outcomes, 10)
```

# Problem description attractor.  

In this next section, we calculate the relative proportion of time spent in the problem description attractor.

To begin, we first have to define the attractor. Specifically, we define the problem description attractor as the cells that were created from the intersection of discloser elaboration and hedged disclosure and listener acknowledgement, question, and reflection.

We create a new variable that labels each turn as either IN the attractor or OUT of the attractor.
```{r}
# Create a variable that labels all turns as "OUT" of the attractor
data$attractor <- "OUT"

# Update the "attractor" variable so that turns that occur in one of the six problem description attractor cells is now labeled "IN"
data$attractor[data$ssg_cell == "LAck_DElab"] <- "IN"
data$attractor[data$ssg_cell == "LAck_DHedg"] <- "IN"
data$attractor[data$ssg_cell == "LQues_DElab"] <- "IN"
data$attractor[data$ssg_cell == "LQues_DHedg"] <- "IN"
data$attractor[data$ssg_cell == "LRefl_DElab"] <- "IN"
data$attractor[data$ssg_cell == "LRefl_DElab"] <- "IN"

# View the first 10 rows of the data
head(data, 10)
```

Calculate the proportion of time in the attractor for each dyad.
```{r}
attractor_time <- # Select data
                  data %>%
                  # Select grouping variable, in this case, dyad ID (id)
                  dplyr::group_by(id) %>%
                  # Count the occurrence of each category ("IN", "OUT") in the "attractor" variable
                  dplyr::count(attractor) %>%
                  # Remove any missing data
                  tidyr::drop_na() %>%
                  # Calculate the proportion of time in each category
                  dplyr::mutate(attractor_prop = prop.table(n)) %>%
                  # Save the data as a data.frame
                  as.data.frame()

# View the first 10 rows of the attractor_time data
head(attractor_time, 10)
```

Proportion of time in attractor descriptives.
```{r}
# Examine descriptive statistics of attractor_prop
describe(attractor_time$attractor_prop)

# Plot attractor_prop distribution
      # Select data and variable of interest (attractor_prop)
ggplot(attractor_time, aes(x = attractor_prop)) + 
      # Create histogram and set width of bars in histogram
      geom_histogram(binwidth = 0.10) + 
      # Label x-axis
      xlab("Proportion of Turns in the Problem Description Attractor for Each Dyad") +
      # Label y-axis
      ylab("Frequency") +
      # Update aesthetics
      theme_classic()
```

Merge proportion of time in the attractor (attractor_prop == "IN") into "outcomes."
```{r}
outcomes <- merge(outcomes,
                  attractor_time[attractor_time$attractor == "IN", c("id", "attractor_prop")], 
                  by = "id", all.x = TRUE)

# View the first 10 rows of the outcomes data
head(outcomes, 10)
```

# Attractor exit time.   

Finally, we will calculate the time (i.e., turn number) when the dyad first leaves the problem description attractor.

Find first row in which zone is not equal to 1.
```{r}
attractor_exit <- # Select data
                  data %>%
                  # Select grouping variable, in this case, dyad ID (id)
                  dplyr::group_by(id) %>%
                  # Remove the first row of each group since the first turn will
                  # never be in a cell (since we need two data points to locate a turn
                  # transition in a cell)
                  dplyr::slice(2:n()) %>%
                  # Keep rows in which the dyad is out of the attractor 
                  # (and is not due to missing data)
                  dplyr::filter(attractor == "OUT" & !is.na(ssg_cell)) %>%
                  # Keep the first row in which the dyad is out of the attractor
                  dplyr::filter(row_number()==1) %>%
                  # Save the data as a data.frame
                  as.data.frame()

# Rename turn variable to "attractor_exit"
colnames(attractor_exit)[2] <- "attractor_exit"
```

Time of attractor exit descriptives.
```{r}
# Examine descriptive statistics of attractor_exit
describe(attractor_exit$attractor_exit)

# Plot attractor_exit distribution
      # Select data and variable of interest (attractor_exit)
ggplot(attractor_exit, aes(x = attractor_exit)) + 
      # Create histogram and set width of bars in histogram
      geom_histogram(binwidth = 5) + 
      # Label x-axis
      xlab("Time (in turns) of Problem Description Attractor Exit") +
      # Label y-axis
      ylab("Frequency") +
      # Update aesthetics
      theme_classic()
```

Merge "attractor_exit" variable into "outcomes."
```{r}
outcomes <- merge(outcomes,
                  attractor_exit[, c("id", "attractor_exit")], 
                  by = "id")

# View the first 10 rows of the outcomes data
head(outcomes, 10)
```

# SSG metrics and between-dyad differences. 

In this final step, we will examine how our state space grid metrics are associated with a between-dyad difference (distress). We will examine these associations using multiple regression, and we will control for the total number of turns in the conversation.

Entropy.
```{r}
entropy_regression <- lm(# Post-conversation distress is predicted by entropy and total turns
                         distress ~ iEntropy + total_turns, 
                         data = outcomes)
summary(entropy_regression)

# Plot result
      # Select data and set x-axis (iEntropy - i.e., the predictor)
      # and y-axis (distress - i.e., the outcome)
ggplot(outcomes, aes(x = iEntropy, y = distress)) + 
      # Plot each dyad's iEntropy and distress
      geom_point() +
      # Plot smoothed regression line
      stat_smooth(method='lm', formula = y ~ x, color = "blue", size = 1) +
      # Label x-axis
      xlab("Conversation Behavior Flexibility") + 
      # Label y-axis
      ylab("Disclosers' Post-conversation Distress") +
      # Update aesthetics
      theme_classic()
```
Entropy of conversation behaviors during a support conversation is not associated with discloser's post-conversation distress, while controlling for the total number of turns within the conversation.

Problem description attractor.
```{r}
attractor_regression <- lm(# Post-conversation distress is predicted by time in 
                           # attractor and total turns
                           distress ~ attractor_prop + total_turns, 
                           data = outcomes)
summary(attractor_regression)

# Plot result
      # Select data and set x-axis (attractor_prop - i.e., the predictor)
      # and y-axis (distress - i.e., the outcome)
ggplot(outcomes, aes(x = attractor_prop, y = distress)) + 
      # Plot each dyad's attractor_prop and distress
      geom_point() +
      # Plot smoothed regression line
      stat_smooth(method='lm', formula = y ~ x, color = "blue", size = 1) +
      # Label x-axis
      xlab("Proportion of Turns in the Problem Description Attractor") + 
      # Label y-axis
      ylab("Disclosers' Post-conversation Distress") +
      # Update aesthetics
      theme_classic()
```
The proportion of time spent in the problem description attractor during a support conversation is not associated with discloser's post-conversation distress, while controlling for the total number of turns within the conversation.

Attractor exit time.
```{r}
exit_regression <- lm(# Post-conversation distress is predicted by time of first
                      # attractor exit and total turns
                      distress ~ attractor_exit + total_turns, 
                      data = outcomes)
summary(exit_regression)

# Plot result
      # Select data and set x-axis (attractor_exit - i.e., the predictor)
      # and y-axis (distress - i.e., the outcome)
ggplot(outcomes, aes(x = attractor_exit, y = distress)) + 
      # Plot each dyad's attractor_exit and distress
      geom_jitter() +
      # Plot smoothed regression line
      stat_smooth(method='lm', formula = y ~ x, color = "blue", size = 1) +
      # Label x-axis
      xlab("Time (in turns) of Problem Description Attractor Exit") + 
      # Label y-axis
      ylab("Disclosers' Post-conversation Distress") +
      # Update aesthetics
      theme_classic()
```
The timing of first exit from the problem description attractor during a support conversation is not associated with discloser's post-conversation distress, while controlling for the total number of turns within the conversation.

## Additional Information

We created this tutorial with a system environment and versions of R and packages that might be different from yours. If R reports errors when you attempt to run this tutorial, running the code chunk below and comparing your output may be helpful.

```{r}
session_info(pkgs = c("attached"))
```