--- title: "Sequence Analysis Tutorial" output: rmdformats::robobook: html_document: default word_document: default editor_options: chunk_output_type: console --- # Overview This tutorial provides R code on conducting sequence analysis (MacIndoe & Abbott, 2004). Sequence analysis is a descriptive analytic technique to capture within-sequence patterns and allow for between-sequence comparisons. This analytic technique has previously been used in biology to identify and group DNA sequences (i.e., categorical sequences depicting the order of the four nucleotides - A, C, T, and G) that are similar, and in sociology to examine occupational trajectories (e.g., Halpin & Cban, 1998), dance rituals (MacIndoe & Abbott, 2004), and residential mobility (Stovel & Bolan, 2004). In sum, sequence analysis is suitable to identify potential patterns in series of categorical data, to group participants based upon the similarity of their patterns, and to examine differences across pattern groups. In this example, we will examine the use of turn types during a subset of conversations between strangers in which one dyad member disclosed about a current problem. Each turn in these conversations was coded (e.g., as an acknowledgement, question, hedged disclosure, etc; see Bodie et al., 2021 in the *Journal of Language and Social Psychology* for more details about the creation of the turn typology). We are interested in whether there are specific patterns of turn type use during supportive conversations and whether the general ordering of turn types during a conversation is associated with post-conversation reports of distress by the support receiver. Note: the code provided here has been modified from another tutorial we created on [sequence analysis](https://quantdev.ssri.psu.edu/tutorials/sequence-analysis-clustering-edit-distance). In contrast to that tutorial which used data from an ecological momentary assessment study, this tutorial uses data from a laboratory based study. In addition, the accompanying "SequenceAnalysis_Tutorial_2022August14.rmd" file contains all of the code presented in this tutorial and can be opened in RStudio (a somewhat more friendly user interface to R). # Outline In this tutorial, we'll cover... * Reading in the data and loading needed packages. * Data descriptives. * Creating sequences. * Establishing a cost matrix and conducting sequence analysis. * Determining the number of clusters. * Examining group differences among clusters. * Cautions. # Read in the data and load needed packages. **Let's read the data into R.** We are working with two data sets in this tutorial. One data set contains repeated measures ("StrangerConversations_N59"), specifically the conversation turn codes for each dyad. The other data set contains time-invariant outcome data ("StrangerConversations_N59_Outcomes"), specifically the reports of change in discloser distress following the conversation. Both data sets were stored as .csv files (comma-separated values file, which can be created by saving an Excel file as a csv document) on my computer's desktop. 
```{r}
# Set working directory (i.e., where your data file is stored)
# This can be done by going to the top bar of RStudio and selecting
# "Session" --> "Set Working Directory" --> "Choose Directory" -->
# finding the location of your file
setwd("~/Desktop") # Note: You can skip this line if you have
# the data files and this .rmd file stored in the same directory

# Read in the repeated measures data
data <- read.csv(file = "StrangerConversations_N59.csv", header = TRUE, sep = ",")

# View the first 10 rows of the repeated measures data
head(data, 10)

# Read in the outcomes data
outcomes <- read.csv(file = "StrangerConversations_N59_Outcomes.csv", header = TRUE, sep = ",")

# View the first 10 rows of the outcomes data
head(outcomes, 10)
```

In the repeated measures data ("data"), we can see that each row contains information for one turn and there are multiple rows (i.e., multiple turns) for each dyad. In this data set, there are columns for:

* Dyad ID (`id`)
* Time variable - in this case, turn in the conversation (`turn`)
* Dyad member ID - in this case, role in the conversation (`role`; discloser = 1, listener = 2)
* Turn type - in this case, based upon a typology derived in Bodie et al. (2021; `turn_type`)

In the outcome data ("outcomes"), we can see there is one row for each dyad and there are columns for:

* Dyad ID (`id`)
* Outcome variable - in this case, post-conversation report of distress by the support receiver (`distress`)

**Load the R packages we need.**

Packages in R are collections of functions (and their documentation/explanations) that enable us to conduct particular tasks, such as plotting or fitting a statistical model.

```{r, warning = FALSE, message = FALSE}
# install.packages("cluster") # Install package if you have never used it before
library(cluster) # For hierarchical cluster analysis

# install.packages("devtools") # Install package if you have never used it before
require(devtools) # For version control

# install.packages("dplyr") # Install package if you have never used it before
library(dplyr) # For data management

# install.packages("ggplot2") # Install package if you have never used it before
library(ggplot2) # For plotting

# install.packages("psych") # Install package if you have never used it before
library(psych) # For descriptive statistics

# install.packages("reshape") # Install package if you have never used it before
library(reshape) # For data management

# install.packages("TraMineR") # Install package if you have never used it before
library(TraMineR) # For sequence analysis

# install.packages("TraMineRextras") # Install package if you have never used it before
library(TraMineRextras) # For sequence analysis
```

# Data Descriptives.

Let's begin by getting a feel for our data. Specifically, let's examine: (1) how many dyads we have in the data set, (2) how many conversation turns there are for each dyad, and (3) the frequency of each turn type across all dyads.

1. Number of dyads.

```{r, warning = FALSE}
# Number of dyads in the repeated measures data
# Length (i.e., number) of unique ID values
length(unique(data$id))

# Number of dyads in the outcome data
# Length (i.e., number) of unique ID values
length(unique(outcomes$id))
```

There are 59 dyads in both data sets.
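Since cluster membership (derived from the repeated measures data) will later be attached to the outcomes data by row position, it can also be worth confirming that the dyad IDs line up across the two data sets. A minimal sketch of such a check:

```{r}
# Do the two data sets contain exactly the same set of dyad IDs?
all(sort(unique(data$id)) == sort(unique(outcomes$id)))

# Are the rows of the outcomes data ordered the same way as the dyads
# first appear in the repeated measures data? (This ordering is assumed
# later when cluster membership is added to the outcomes data.)
all(unique(data$id) == outcomes$id)
```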
2. Number of conversation turns for each dyad.

```{r, message = FALSE}
num_occ <-
  # Select data
  data %>%
  # Select grouping variable, in this case, dyad ID (id)
  group_by(id) %>%
  # Count the number of turns in each conversation
  summarise(count = n()) %>%
  # Save the data as a data.frame
  as.data.frame()

# Calculate descriptives on the number of turns per conversation
describe(num_occ$count)
```

The average dyad in this subset of the data had approximately 88 turns in their conversation (*M* = 87.71, *SD* = 19.69), with dyads ranging from 60 to 147 turns over the course of a conversation.

Plot the distribution of the number of turns per conversation.

```{r}
# Select data (num_occ) and value on the x-axis (number of turns per conversation: "count")
ggplot(data = num_occ, aes(x = count)) +
  # Create a histogram with binwidth = 5 and white bars outlined in black
  geom_histogram(binwidth = 5, fill = "white", color = "black") +
  # Label x-axis
  labs(x = "Number of Turns per Conversation") +
  # Change background aesthetics of plot
  theme_classic()
```

3. The number of total turns for each turn type.

```{r}
# Create table that calculates the number of turns for each turn type
turntype_table <- table(data$turn_type)

# Display the table
turntype_table
```

We can see that participants overall used *Elaboration* turns the most and *Advice* turns the least.

# Create Sequences.

Now that we know a little bit more about our repeated measures data, we can move on to the data preparation required for sequence analysis. In this step, we: (1) reformat the repeated measures data ("data") from long to wide, (2) create an "alphabet" that represents each of our categories, and (3) create and plot the categorical sequence data.

1. Reformat the data from long to wide.

Our repeated measures data ("data") are currently in a "long" format, which means that each dyad has multiple rows, with each turn of the conversation having its own row. For sequence analysis, we need our repeated measures data to be in a "wide" format, which means that each dyad should have only one row and each turn of the conversation should have its own column.

```{r, message = FALSE}
# Reformat repeated measures data ("data") from long to wide
data_wide <- reshape(
  # Select data and specific columns that should be kept
  # (dyad ID: id, time variable: turn,
  # categorical variable: turn_type)
  data = data[, c("id", "turn", "turn_type")],
  # Identify time variable ("turn")
  timevar = "turn",
  # Identify dyad ID variable ("id")
  idvar = "id",
  # Identify categorical variable ("turn_type")
  v.names = "turn_type",
  # Note direction of data reformat
  direction = "wide",
  # No spaces in new column names
  sep = "")

# View the first 10 rows and first 10 columns of the wide data
head(data_wide[1:10, 1:10])
```

2. Create alphabet.

We create an alphabet that represents each possible category within the categorical variable of interest (in this case, "turn_type"). The actual naming of these values is not important, but we are going to name them in a way that facilitates interpretation.

```{r}
# Create a vector that contains the categories that appear in the wide data set
turntype_alphabet <- c("Acknowledgement", "Advice", "Elaboration",
                       "HedgedDisclosure", "Question", "Reflection")

# Create a vector that allows for more helpful labels if applicable
turntype_labels <- c("Acknowledgement", "Advice", "Elaboration",
                     "HedgedDisclosure", "Question", "Reflection")
```

3. Create sequences.
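Before building the sequence object, it is worth confirming that the alphabet covers every turn type that actually appears in the data, since `seqdef()` will complain about states that are missing from the supplied alphabet. A quick check:

```{r}
# States observed in the repeated measures data
sort(unique(data$turn_type))

# Any observed states missing from our alphabet? (character(0) means none)
setdiff(unique(data$turn_type), turntype_alphabet)
```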
Now that the data are in the correct (i.e., wide) format and we have created an alphabet for our sequence analysis, we next need to create a sequence object that can be understood by the R package for sequence analysis (`TraMineR`).

Before creating the sequences, we first assign colors to each of the categories, which will help us when viewing plots of the sequences. This step is not required since there is a default color palette, but it gives us control over which color is assigned to which category. We do this by assigning each category (i.e., turn type) a hex code (https://www.color-hex.com/). The categories should be written as they appear in the alphabet created above.

A note on accessibility: To make your plots accessible, you may consider adopting a colorblind-friendly palette. David Nichols’ website (https://davidmathlogic.com/colorblind/) provides a great explainer on this issue, as well as a color picking tool.

```{r}
Acknowledgement <- "#F8766D" # Red
Advice <- "#93AA00" # Olive
Elaboration <- "#00BA38" # Green
HedgedDisclosure <- "#00C19F" # Teal
Question <- "#619CFF" # Blue
Reflection <- "#DB72FB" # Purple
```

Next, we create an object ("turn_seq") that contains all of the sequences in the format needed for the sequence analysis package.

```{r}
turn_seq <- seqdef(data_wide, # Select data
                   var = 2:148, # Columns containing repeated measures data
                   alphabet = turntype_alphabet, # Alphabet
                   labels = turntype_labels, # Labels
                   xtstep = 25, # Steps between tick marks
                   cpal = c(Acknowledgement, Advice, Elaboration,
                            HedgedDisclosure, Question, Reflection)) # Color palette
```

A lot of text will appear after the sequence object is created. This text tells you about the number of sequences (which should be equal to the number of dyads in the sample), the states (i.e., categories) that appear in the sequence object, and the alphabet and labels of the categories.

Finally, we can plot the sequences.

```{r}
seqIplot(turn_seq, # Sequence object
         with.legend = "right", # Display legend on right side of plot
         cex.legend = 0.6, # Change size of legend
         main = "Turn Type Use during a Support Conversation") # Plot title
```

To read this plot, each row represents one dyad's conversation and each color represents a different turn type. We can see that the conversations varied in both length and content, although elaboration turns (green) seem to appear frequently across all conversations.

# Establish a Cost Matrix and Conduct Sequence Analysis.

There is one last step before conducting sequence analysis: establishing a cost matrix. Sequence analysis aims to minimize the "cost" of transforming one sequence into another and relies on an optimal matching algorithm to do this transformation. There are costs for inserting, deleting, and substituting categories, as well as costs for replacing missing values. Insertion/deletion costs are typically set to 1.0, and missingness costs are typically set to half the highest cost within the matrix. We set these costs when conducting the sequence analysis.

There are a number of ways to set substitution costs. Typically, substitution costs are established as the distance between cells. However, we do not have an ordinal scale for the categories (i.e., there is no logical order or distance between our turn types; e.g., what turn type is "closest" to acknowledgement?). In this case, we use a constant cost matrix (i.e., the distance between any two turn types is the same). If we were to use a theoretical rationale to sort turn types that were more or less similar to each other, we could instead use Manhattan (city-block) or Euclidean distance.
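To make that alternative concrete, here is a purely illustrative sketch of how a theory-based substitution cost matrix could be specified by hand. The cost values and the choice of which turn types to treat as more similar are hypothetical and are not used anywhere in this tutorial:

```{r}
# Hypothetical example only: hand-specified substitution costs in which
# Acknowledgement and Reflection are treated as more similar (cost of 1)
# than all other pairs of turn types (cost of 2). The rows/columns should
# follow the order of the alphabet supplied to seqdef().
custom_costs <- matrix(2, nrow = 6, ncol = 6,
                       dimnames = list(turntype_alphabet, turntype_alphabet))
diag(custom_costs) <- 0
custom_costs["Acknowledgement", "Reflection"] <- 1
custom_costs["Reflection", "Acknowledgement"] <- 1
custom_costs
```

A matrix like this (expanded with a row and column for the missing state if `with.missing = TRUE` is used) could be passed to the `sm` argument of `seqdist()` in place of the constant cost matrix we create next.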
We need to create a substitution cost matrix before conducting the sequence analysis. The substitution cost matrix will be a (k+1) by (k+1) matrix, with k = number of categories and an additional right-most column and bottom row to represent missingness costs (half of the highest cost, which in this case is half of 2, or 1). In our case, the substitution cost matrix will be a 7 x 7 matrix, since we have 6 turn type categories + 1 row and column for the cost of missing values.

Here, we establish our substitution cost matrix.

```{r}
# Create substitution cost matrix and save to the object "costmatrix"
costmatrix <- seqsubm(turn_seq, # Sequence object
                      method = "CONSTANT", # Method to determine costs
                      cval = 2, # Substitution cost
                      with.missing = TRUE, # Allows for missingness state
                      miss.cost = 1, # Cost for substituting a missing state
                      time.varying = FALSE, # Does not allow the cost to vary over time
                      weighted = TRUE) # Allows weights to be used when applicable

# Examine substitution cost matrix
costmatrix
```

Now that we have created the substitution cost matrix, we can conduct our sequence analysis. We use an optimal matching algorithm for sequence analysis. The output of the sequence analysis is an _n_ x _n_ (_n_ = number of dyads) dissimilarity matrix, with the cost of transforming one sequence into every other sequence in the corresponding cell of the matrix.

Note: Other algorithms are available, and they can be specified in the method = "" argument below. To see other algorithms available in the TraMineR package, type ?seqdist in the console or type seqdist in the search bar at the top of the Help tab on the right.

```{r}
# Conduct sequence analysis
dist_om <- seqdist(turn_seq, # Sequence object
                   method = "OM", # Optimal matching algorithm
                   indel = 1.0, # Insertion/deletion costs set to 1
                   sm = costmatrix, # Substitution cost matrix
                   with.missing = TRUE)

# Examine the top left corner of the dissimilarity matrix
dist_om[1:10, 1:10]
```

# Determine Number of Clusters.

We next use the dissimilarity matrix obtained in the prior step to determine an appropriate number of clusters that represent the different turn use patterns within our supportive conversations. Several clustering techniques are available, but we use hierarchical cluster analysis with Ward's method. Other possible methods include k-medoids clustering or latent mixture models. After determining the number of clusters that work for the data, we create an object that contains cluster membership for each dyad (which will be used in the final step) and plot the clusters.

Conduct hierarchical cluster analysis and save cluster analysis results to the object "clusterward".

```{r}
# Insert dissimilarity matrix ("dist_om"),
# indicate that we are using a dissimilarity matrix, and
# indicate that we want to use Ward's clustering method
clusterward <- agnes(dist_om, diss = TRUE, method = "ward")

# Plot the results of the cluster analysis using a dendrogram
# Insert cluster analysis results object ("clusterward")
plot(clusterward, which.plot = 2)
```

In this example, the resulting dendrogram indicates two clusters. We reached this conclusion by examining the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of dyads within each group (we didn't want a group with too few dyads).
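As a quick supplementary check on the second criterion, we can tabulate how many dyads fall into each cluster for a few candidate solutions (here, 2 through 4 clusters). This is a minimal sketch using the same `cutree()` call we use below:

```{r}
# Cluster sizes for 2-, 3-, and 4-cluster solutions
table(cutree(clusterward, k = 2))
table(cutree(clusterward, k = 3))
table(cutree(clusterward, k = 4))
```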
After selecting a two-cluster solution, we plotted the sequences of the two clusters for visual comparison.

```{r, message = FALSE}
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 2)
# Insert cluster analysis results object ("clusterward")
# and the number of cut points
cl2 <- cutree(clusterward, k = 2)

# Turn cut points into a factor variable and label them
# Insert cut point object ("cl2") and
# create labels by combining the text "Type" with either 1 or 2
cl2fac <- factor(cl2, labels = paste("Type", 1:2))

# Plot the sequences for each cluster
seqplot(turn_seq, # Sequence object
        group = cl2fac, # Grouping factor level variable
        type = "I", # Create whole sequence plot
        sortv = "from.start", # Sort sequences based upon the category in which they begin
        with.legend = "right", # Display legend on right side of plot
        cex.legend = 0.8, # Change size of legend
        border = NA) # No plot border
```

It appears that "Type 1" dyads use more elaboration turns compared to "Type 2" dyads. These plots can sometimes be difficult to distinguish, so further descriptives (e.g., the percentage of each turn type within each cluster) can be helpful. In the next (and final) step, we will formally test whether the dyads within these clusters differ on support receivers' post-conversation reports of distress.

# Examine Group Differences among Clusters.

The final step of our analysis is to examine group differences among the clusters. One can use a variety of methods to examine group differences, and the choice of method will depend on the number of clusters chosen and the research question. For example, if more than two clusters are chosen and one wants to examine differences on an outcome variable between the clusters, then one would use an analysis of variance (ANOVA). In this case, we use a t-test since we are examining the difference between two groups.

We examined whether support receivers' post-conversation reports of distress differed by cluster membership, which represented turn type usage patterns. As you can see below, there was no significant difference between clusters in levels of support receivers' post-conversation distress.

```{r, eval = FALSE}
# Add grouping variable to outcomes data set
# (assumes the rows of "outcomes" are ordered by dyad ID in the same way
# as the rows of "data_wide")
outcomes$cl2 <- cl2

# Run t-test
t.test(distress ~ cl2, data = outcomes)
```

# Cautions.

Although there are several distinct advantages of sequence analysis, there are several limitations and considerations to the process, which include:

1. The length of time series needed (dependent on the process under examination, but could be lengthy).
2. The need for an ordinal or categorical variable.
3. The determination of the cost matrices (which in turn affects the prioritization of left/right shifts vs. substitution of letters in the sequence).
4. The extent of missingness.

To the best of our knowledge, there are no "rules of thumb" for the number of time points needed, the acceptable amount of variability in time series length, the choice of cost matrices, and the acceptable amount of missingness. We recommend that researchers explore their data (e.g., by trying different cost matrices or inserting missingness) to better understand how these factors impact their results, and be transparent in reporting the steps and specifications required to obtain those results. One simple sensitivity check of this kind is sketched below.
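For example, the dissimilarity matrix could be recomputed under a different specification and compared with the original results. In this brief sketch, the insertion/deletion cost of 2 is arbitrary and chosen only for illustration:

```{r}
# Recompute the dissimilarity matrix with a different insertion/deletion cost
dist_om_indel2 <- seqdist(turn_seq,
                          method = "OM",
                          indel = 2.0,
                          sm = costmatrix,
                          with.missing = TRUE)

# How strongly do the two sets of pairwise dissimilarities agree?
cor(as.vector(dist_om), as.vector(dist_om_indel2))
```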
-----

### Additional Information

We created this tutorial with a system environment and versions of R and packages that might be different from yours. If R reports errors when you attempt to run this tutorial, running the code chunk below and comparing your output with the tutorial posted on the LHAMA website may be helpful.

```{r}
session_info(pkgs = c("attached"))
```