Sequence Analysis Tutorial
Overview
This tutorial provides R code for conducting sequence analysis (MacIndoe & Abbott, 2004). Sequence analysis is a descriptive analytic technique that captures within-sequence patterns and allows for between-sequence comparisons. The technique has previously been used in biology to identify and group similar DNA sequences (i.e., categorical sequences depicting the order of the four nucleotides: A, C, T, and G), and in sociology to examine occupational trajectories (e.g., Halpin & Chan, 1998), dance rituals (MacIndoe & Abbott, 2004), and residential mobility (Stovel & Bolan, 2004). In sum, sequence analysis is suitable for identifying potential patterns in series of categorical data, grouping participants based upon the similarity of their patterns, and examining differences across pattern groups.
In this example, we will examine the use of turn types during a subset of conversations between strangers in which one dyad member disclosed about a current problem. Each turn in these conversations was coded (e.g., as an acknowledgement, question, hedged disclosure, etc; see Bodie et al., 2021 in the Journal of Language and Social Psychology for more details about the creation of the turn typology). We are interested in whether there are specific patterns of turn type use during supportive conversations and whether the general ordering of turn types during a conversation is associated with post-conversation reports of distress by the support receiver.
Note: the code provided here has been modified from another tutorial we created on sequence analysis. In contrast to that tutorial, which used data from an ecological momentary assessment study, this tutorial uses data from a laboratory-based study.
In addition, the accompanying “SequenceAnalysis_Tutorial_2022August14.rmd” file contains all of the code presented in this tutorial and can be opened in RStudio (a somewhat more friendly user interface to R).
Outline
In this tutorial, we’ll cover…
- Reading in the data and loading needed packages.
- Data descriptives.
- Creating sequences.
- Establishing a cost matrix and conducting sequence analysis.
- Determining the number of clusters.
- Examining group differences among clusters.
- Cautions.
Read in the data and load needed packages.
Let’s read the data into R.
We are working with two data sets in this tutorial. One data set contains repeated measures (“StrangerConversations_N59”), specifically the conversation turn codes for each dyad. The other data set contains time-invariant outcome data (“StrangerConversations_N59_Outcomes”), specifically the reports of change in discloser distress following the conversation.
Both data sets were stored as .csv files (comma-separated values file, which can be created by saving an Excel file as a csv document) on my computer’s desktop.
# Set working directory (i.e., where your data file is stored)
# This can be done by going to the top bar of RStudio and selecting
# "Session" --> "Set Working Directory" --> "Choose Directory" -->
# finding the location of your file
setwd("~/Desktop") # Note: You can skip this line if you have
#the data files and this .rmd file stored in the same directory
# Read in the repeated measures data
data <- read.csv(file = "StrangerConversations_N59.csv", header = TRUE, sep = ",")
# View the first 10 rows of the repeated measures data
head(data, 10)
## id turn role turn_type
## 1 105 1 1 Question
## 2 105 2 2 Acknowledgement
## 3 105 3 1 Elaboration
## 4 105 4 2 Acknowledgement
## 5 105 5 1 Elaboration
## 6 105 6 2 Acknowledgement
## 7 105 7 1 Elaboration
## 8 105 8 2 Elaboration
## 9 105 9 1 Elaboration
## 10 105 10 2 Reflection
# Read in the outcomes data
outcomes <- read.csv(file = "StrangerConversations_N59_Outcomes.csv", header = TRUE, sep = ",")
# View the first 10 rows of the outcomes data
head(outcomes, 10)
## id distress
## 1 3 3
## 2 11 1
## 3 12 2
## 4 14 2
## 5 31 1
## 6 38 2
## 7 45 1
## 8 54 3
## 9 55 2
## 10 58 3
In the repeated measures data (“data”), we can see each row contains information for one turn and there are multiple rows (i.e., multiple turns) for each dyad. In this data set, there are columns for:
- Dyad ID (id)
- Time variable - in this case, turn in the conversation (turn)
- Dyad member ID - in this case, role in the conversation (role; discloser = 1, listener = 2)
- Turn type - in this case, based upon a typology derived in Bodie et al. (2021; turn_type)
In the outcome data (“outcomes”), we can see there is one row for each dyad and there are columns for:
- Dyad ID (id)
- Outcome variable - in this case, post-conversation report of distress by the support receiver (distress)
Load the R packages we need.
Packages in R are a collection of functions (and their documentation/explanations) that enable us to conduct particular tasks, such as plotting or fitting a statistical model.
# install.packages("cluster") # Install package if you have never used it before
library(cluster) # For hierarchical cluster analysis
# install.packages("devtools") # Install package if you have never used it before
library(devtools) # For session information
# install.packages("dplyr") # Install package if you have never used it before
library(dplyr) # For data management
# install.packages("ggplot2") # Install package if you have never used it before
library(ggplot2) # For plotting
# install.packages("psych") # Install package if you have never used it before
library(psych) # For descriptive statistics
# install.packages("reshape") # Install package if you have never used it before
library(reshape) # For data management
# install.packages("TraMineR") # Install package if you have never used it before
library(TraMineR) # For sequence analysis
# install.packages("TraMineRextras") # Install package if you have never used it before
library(TraMineRextras) # For sequence analysis
Data Descriptives.
Let’s begin by getting a feel for our data. Specifically, let’s examine:
- how many dyads we have in the data set,
- how many conversation turns there are for each dyad, and
- the frequency of each turn type across all dyads.
- Number of dyads.
# Number of dyads in the repeated measures data
# Length (i.e., number) of unique ID values
length(unique(data$id))
## [1] 59
# Number of dyads in the outcome data
# Length (i.e., number) of unique ID values
length(unique(outcomes$id))
## [1] 59
There are 59 dyads in both data sets.
- Number of conversation turns for each dyad.
num_occ <- # Select data
  data %>%
  # Select grouping variable, in this case, dyad ID (id)
  group_by(id) %>%
  # Count the number of turns in each conversation
  summarise(count = n()) %>%
  # Save the data as a data.frame
  as.data.frame()
# Calculate descriptives on the number of turns per conversation
describe(num_occ$count)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 59 87.71 19.69 85 85.98 19.27 60 147 87 0.85 0.42 2.56
The average dyad in this subset of the data had approximately 88 turns in their conversation (M = 87.71, SD = 19.69), with conversations ranging from 60 to 147 turns.
Plot the distribution of the number of turns per conversation.
# Select data (num_occ) and value on the x-axis (number of turns per conversation: "count")
ggplot(data = num_occ, aes(x = count)) +
#Create a histogram with binwidth = 5 and white bars outlined in black
geom_histogram(binwidth = 5, fill = "white", color = "black") +
# Label x-axis
labs(x = "Number of Turns per Conversation") +
# Change background aesthetics of plot
theme_classic()
- The total number of turns for each turn type.
# Create table that calculates the number of turns for each turn type
turntype_table <- table(data$turn_type)
# Display the table
turntype_table
##
## Acknowledgement Advice Elaboration HedgedDisclosure
## 1265 72 2422 455
## Question Reflection
## 392 496
We can see that participants overall used Elaboration turns the most and Advice turns the least.
Create Sequences.
Now that we know a little bit more about our repeated measures data, we can move on to the data preparation required for sequence analysis.
In this step, we:
- reformat the repeated measures data (“data”) from long to wide,
- create an “alphabet” that represents each of our categories, and
- create and plot the categorical sequence data.
- Reformat the data from long to wide.
Our repeated measures data (“data”) are currently in a “long” format, which means that each dyad has multiple rows with each turn of the conversation having its own row. For sequence analysis, we need our repeated measures data to be in a “wide” format, which means that each dyad should only have one row and each turn of the conversation should have its own column.
# Reformat repeated measures data ("data") from long to wide
data_wide <- reshape(
  # Select data and specific columns that should be kept
  # (dyad ID: id, time variable: turn,
  # categorical variable: turn_type)
  data = data[, c("id", "turn", "turn_type")],
  # Identify time variable ("turn")
  timevar = "turn",
  # Identify dyad ID variable ("id")
  idvar = "id",
  # Identify categorical variable ("turn_type")
  v.names = "turn_type",
  # Note direction of data reformat
  direction = "wide",
  # No spaces in new column names
  sep = "")
# View the first rows of the first 10 columns of the wide data
head(data_wide[, 1:10])
## id turn_type1 turn_type2 turn_type3 turn_type4
## 1 105 Question Acknowledgement Elaboration Acknowledgement
## 105 107 Elaboration Acknowledgement Elaboration Acknowledgement
## 179 11 Elaboration Acknowledgement Elaboration Question
## 239 119 HedgedDisclosure <NA> Advice <NA>
## 329 12 Elaboration Question Elaboration Acknowledgement
## 411 123 Elaboration Acknowledgement Elaboration Acknowledgement
## turn_type5 turn_type6 turn_type7 turn_type8
## 1 Elaboration Acknowledgement Elaboration Elaboration
## 105 Elaboration HedgedDisclosure Elaboration Acknowledgement
## 179 Elaboration Question Elaboration Question
## 239 <NA> Elaboration Advice HedgedDisclosure
## 329 Elaboration Reflection Acknowledgement HedgedDisclosure
## 411 Elaboration Acknowledgement Elaboration Acknowledgement
## turn_type9
## 1 Elaboration
## 105 Elaboration
## 179 Elaboration
## 239 Acknowledgement
## 329 Acknowledgement
## 411 Elaboration
- Create alphabet.
We create an alphabet that represents each possible category within the categorical variable of interest (in this case, “turn_type”). The actual naming of these values is not important, but we are going to name them in a way that facilitates interpretation.
# Create a vector that contains the categories that appear in the wide data set
turntype_alphabet <- c("Acknowledgement", "Advice", "Elaboration",
                       "HedgedDisclosure", "Question", "Reflection")
# Create a vector that allows for more helpful labels if applicable
turntype_labels <- c("Acknowledgement", "Advice", "Elaboration",
                     "HedgedDisclosure", "Question", "Reflection")
- Create sequences.
Now that the data are in the correct (i.e., wide) format and we have created an alphabet for our sequence analysis, we next need to create a sequence object that can be understood by the R package for sequence analysis (TraMineR).
Before creating the sequences, we first assign colors to each of the categories, which will help us when viewing plots of the sequences. This step is not required since there is a default color palette, but it gives us control over which color is assigned to which category. We do this by assigning each category (i.e., turn type) a hex code (https://www.color-hex.com/). The categories should be written as they appear in the alphabet created above.
A note on accessibility: To make your plots accessible, you may consider adopting a colorblind-friendly palette. David Nichols’ website (https://davidmathlogic.com/colorblind/) provides a great explainer on this issue, as well as a color picking tool.
Acknowledgement <- "#F8766D" # Red
Advice <- "#93AA00" # Olive
Elaboration <- "#00BA38" # Green
HedgedDisclosure <- "#00C19F" # Teal
Question <- "#619CFF" # Blue
Reflection <- "#DB72FB" # Purple
Next, we create an object (“turn_seq”) that contains all of the sequences in the format needed for the sequence analysis package.
turn_seq <- seqdef(data_wide, # Select data
                   var = 2:148, # Columns containing repeated measures data
                   alphabet = turntype_alphabet, # Alphabet
                   labels = turntype_labels, # Labels
                   xtstep = 25, # Steps between tick marks
                   cpal = c(Acknowledgement, Advice, Elaboration, # Color palette
                            HedgedDisclosure, Question, Reflection))
## [>] found missing values ('NA') in sequence data
## [>] preparing 59 sequences
## [>] coding void elements with '%' and missing values with '*'
## [>] 6 distinct states appear in the data:
## 1 = Acknowledgement
## 2 = Advice
## 3 = Elaboration
## 4 = HedgedDisclosure
## 5 = Question
## 6 = Reflection
## [>] state coding:
## [alphabet] [label] [long label]
## 1 Acknowledgement Acknowledgement Acknowledgement
## 2 Advice Advice Advice
## 3 Elaboration Elaboration Elaboration
## 4 HedgedDisclosure HedgedDisclosure HedgedDisclosure
## 5 Question Question Question
## 6 Reflection Reflection Reflection
## [>] 59 sequences in the data set
## [>] min/max sequence length: 59/147
A lot of text will appear after the sequence object is created. This text tells you about the number of sequences (which should be equal to the number of dyads in the sample), the states (i.e., categories) that appear in the sequence object, and the alphabet and labels of the categories.
Finally, we can plot the sequences.
seqIplot(turn_seq, # Sequence object
with.legend = "right", # Display legend on right side of plot
cex.legend = 0.6, # Change size of legend
main = "Turn Type Use during a Support Conversation") # Plot title
In this plot, each row represents one dyad’s conversation and each color represents a different turn type. We can see that the conversations varied in both length and content, although elaboration turns (green) appear frequently across all conversations.
Establish a Cost Matrix and Conduct Sequence Analysis.
There is one last step before conducting sequence analysis: establishing a cost matrix.
Sequence analysis aims to minimize the “cost” of transforming one sequence into another and relies on an optimal matching algorithm to do this transformation. There are costs for inserting, deleting, and substituting categories, as well as costs for replacing missing values. Insertion/deletion costs are typically set to 1.0 and missingness costs are typically set to half the highest cost within the matrix. We set these costs when conducting the sequence analysis.
There are a number of ways to set substitution costs. Typically, substitution costs are established as the distance between categories. However, we do not have an ordinal scale for our categories (i.e., there is no logical order or distance between turn types; e.g., what turn type is “closest” to acknowledgement?). In this case, we use a constant cost matrix (i.e., the distance between any two turn types is the same). If we had a theoretical rationale for sorting turn types that were more or less similar to each other, we could use Manhattan (city-block) distance or Euclidean distance.
We need to create a substitution cost matrix before conducting the sequence analysis. The substitution cost matrix will be a (k+1) by (k+1) matrix with k = number of categories and an additional right-most column and bottom row to represent missingness costs (half of the highest cost, which in this case is half of 2). In our case, the substitution cost matrix will be a 7 x 7 matrix since we have 6 turn type categories + 1 row and column for the cost of missing values.
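To build intuition for what optimal matching computes, the base-R sketch below (an illustration we added; the function name om_distance is our own, and this is not part of the TraMineR pipeline) applies the standard dynamic-programming recursion to two toy turn sequences, using the same costs as in this tutorial: insertion/deletion = 1 and constant substitution = 2.

```r
# Toy illustration of optimal matching: minimum cost of transforming
# sequence a into sequence b with insertion/deletion cost 1 and a
# constant substitution cost of 2
om_distance <- function(a, b, indel = 1, sub = 2) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel # Cost of deleting the first i elements of a
  d[1, ] <- (0:m) * indel # Cost of inserting the first j elements of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else sub
      d[i + 1, j + 1] <- min(d[i, j] + cost,      # Substitute (or match)
                             d[i, j + 1] + indel, # Delete from a
                             d[i + 1, j] + indel) # Insert into a
    }
  }
  d[n + 1, m + 1]
}

seq1 <- c("Question", "Acknowledgement", "Elaboration")
seq2 <- c("Question", "Elaboration", "Elaboration")
om_distance(seq1, seq2) # Returns 2: one substitution (cost 2) costs the
                        # same as one deletion plus one insertion
```

TraMineR’s seqdist() performs this computation for every pair of sequences, which is what produces the dissimilarity matrix in the next step.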
Here, we establish our substitution cost matrix.
# Create substitution cost matrix and save to the object "costmatrix"
costmatrix <- seqsubm(turn_seq, # Sequence object
                      method = "CONSTANT", # Method to determine costs
                      cval = 2, # Substitution cost
                      with.missing = TRUE, # Allows for missingness state
                      miss.cost = 1, # Cost for substituting a missing state
                      time.varying = FALSE, # Does not allow the cost to vary over time
                      weighted = TRUE) # Allows weights to be used when applicable
## [>] creating 7x7 substitution-cost matrix using 2 as constant value
# Examine substitution cost matrix
costmatrix
## Acknowledgement Advice Elaboration HedgedDisclosure Question
## Acknowledgement 0 2 2 2 2
## Advice 2 0 2 2 2
## Elaboration 2 2 0 2 2
## HedgedDisclosure 2 2 2 0 2
## Question 2 2 2 2 0
## Reflection 2 2 2 2 2
## * 1 1 1 1 1
## Reflection *
## Acknowledgement 2 1
## Advice 2 1
## Elaboration 2 1
## HedgedDisclosure 2 1
## Question 2 1
## Reflection 0 1
## * 1 0
Now that we have created the substitution cost matrix, we can conduct our sequence analysis.
We use an optimal matching algorithm for sequence analysis. The output of the sequence analysis is an n x n (n = number of dyads) dissimilarity matrix, with the cost of transforming one sequence into every other sequence in the corresponding cell of the matrix.
Note: Other algorithms are available, and they can be specified in the method = “” argument below. To see other algorithms available in the TraMineR package, type ?seqdist in the console or type seqdist in the search bar at the top of the Help tab on the right.
# Conduct sequence analysis
dist_om <- seqdist(turn_seq, # Sequence object
                   method = "OM", # Optimal matching algorithm
                   indel = 1.0, # Insertion/deletion costs set to 1
                   sm = costmatrix, # Substitution cost matrix
                   with.missing = TRUE) # Allows for missingness state
## [>] including missing values as an additional state
## [>] 59 sequences with 7 distinct states
## [>] checking 'sm' (size and triangle inequality)
## [>] 59 distinct sequences
## [>] min/max sequence lengths: 59/147
## [>] computing distances using the OM metric
## [>] elapsed time: 0.124 secs
# Examine the top left corner of the dissimilarity matrix
dist_om[1:10, 1:10]
## 1 105 179 239 329 411 502 601 703 811
## 1 0 69 74 91 64 57 79 58 64 63
## 105 69 0 54 70 50 56 71 72 78 55
## 179 74 54 0 52 58 61 62 76 72 43
## 239 91 70 52 0 76 84 81 95 91 67
## 329 64 50 58 76 0 57 60 60 72 54
## 411 57 56 61 84 57 0 69 53 63 53
## 502 79 71 62 81 60 69 0 68 72 65
## 601 58 72 76 95 60 53 68 0 54 64
## 703 64 78 72 91 72 63 72 54 0 74
## 811 63 55 43 67 54 53 65 64 74 0
Determine Number of Clusters.
We next use the dissimilarity matrix obtained in the prior step to determine an appropriate number of clusters that represent the different turn use patterns within our supportive conversations. Several clustering techniques are available, but we use hierarchical (agglomerative) cluster analysis with Ward’s method. Other possible methods include k-medoids clustering or latent mixture models. After determining the number of clusters that works for the data, we create an object that contains cluster membership for each dyad (which will be used in the final step) and plot the clusters.
Conduct hierarchical cluster analysis and save cluster analysis results to the object “clusterward”.
# Insert dissimilarity matrix ("dist_om"),
# indicate that we are using a dissimilarity matrix, and
# indicate that we want to use Ward's clustering method
clusterward <- agnes(dist_om, diss = TRUE, method = "ward")
# Plot the results of the cluster analysis using a dendrogram
# Insert cluster analysis results object ("clusterward")
plot(clusterward, which.plot = 2)
In this example, the resulting dendrogram indicates two clusters. We reached this conclusion by examining the length of the vertical lines (longer vertical lines indicate greater differences between groups) and the number of dyads within each group (we didn’t want a group with too few dyads). After selecting a two cluster solution, we plotted the sequences of the two clusters for visual comparison.
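As a complement to visually inspecting the dendrogram, average silhouette widths can be computed for candidate cluster solutions (higher average width indicates better-separated clusters). The helper below is a sketch we added, not part of the original workflow; the function name avg_silhouette is our own.

```r
library(cluster) # For the silhouette() function

# Sketch of a helper: average silhouette width for each candidate number
# of clusters, given a clustering tree (e.g., from agnes() or hclust())
# and a dissimilarity matrix
avg_silhouette <- function(tree, dmat, k_range = 2:5) {
  sapply(k_range, function(k) {
    # Cluster membership for a k-cluster solution
    cl <- cutree(tree, k = k)
    # Mean silhouette width across all observations
    mean(silhouette(cl, dmatrix = as.matrix(dmat))[, "sil_width"])
  })
}

# Usage with the objects created above (sketch):
# avg_silhouette(clusterward, dist_om)
```

Higher values favor a given k, but silhouette widths should be weighed alongside interpretability and cluster sizes, as described above.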
# Cut dendrogram (or tree) by the number of
# determined groups (in this case, 2)
# Insert cluster analysis results object ("clusterward")
# and the number of cut points
cl2 <- cutree(clusterward, k = 2)
# Turn cut points into a factor variable and label them
# Insert cut point object ("cl2") and
# create labels by combining the text "Type" with either 1 or 2
cl2fac <- factor(cl2, labels = paste("Type", 1:2))
# Plot the sequences for each cluster
seqplot(turn_seq, # Sequence object
group = cl2fac, # Grouping factor level variable
type = "I", # Create whole sequence plot
sortv = "from.start", # Sort sequences based upon the category in which they begin
with.legend = "right", # Display legend on right side of plot
cex.legend = 0.8, # Change size of legend
border = NA) # No plot border
It appears that “Type 1” dyads use more elaboration turns than “Type 2” dyads. These plots can sometimes be difficult to distinguish, so further descriptives (e.g., the percentage of each turn type within each cluster) can be helpful.
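One way to obtain such per-cluster descriptives is sketched below. The helper is our own addition (the function name cluster_category_pct is not from the tutorial), and the usage lines assume the cluster assignments in cl2fac follow the row order of data_wide.

```r
# Sketch of a helper: percentage of each category within each cluster,
# given the long data's id and category columns plus a named vector of
# cluster assignments (one per id)
cluster_category_pct <- function(ids, categories, cluster_by_id) {
  # Look up each row's cluster by its id
  cl <- cluster_by_id[as.character(ids)]
  # Row-wise percentages: each cluster's turn types sum to 100
  round(prop.table(table(cl, categories), margin = 1) * 100, 1)
}

# Usage with the objects created above (sketch):
# names(cl2fac) <- data_wide$id
# cluster_category_pct(data$id, data$turn_type, cl2fac)
```

The resulting table makes it easy to compare, for example, the share of elaboration turns across the two cluster types.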
In the next (and final) step, we will formally test whether the dyads within these clusters differ on support receivers’ post-conversation report of distress.
Examine Group Differences among Clusters.
The final step of our analysis is to examine group differences among the clusters. One can use a variety of methods to examine group differences, and the choice of method depends on the number of clusters chosen and the research question. For example, if more than two clusters are chosen and one wants to examine differences on an outcome variable across the clusters, one would use an analysis of variance (ANOVA). In this case, we use a t-test since we are examining the difference between two groups.
We examined whether support receivers’ post-conversation report of distress differed by cluster membership, which represented turn type usage patterns. As you can see below, there was no significant difference between clusters in levels of support receivers’ post-conversation report of distress.
# Add grouping variable to outcomes data set
outcomes$cl2 <- cl2
# Run t-test
t.test(distress ~ cl2, data = outcomes)
Cautions.
Although sequence analysis has several distinct advantages, there are also limitations and considerations to keep in mind, including:
- The length of time series needed (dependent on the process under examination, but could be lengthy).
- The need for an ordinal or categorical variable.
- The determination of the cost matrices (which in turn affects the prioritization of left/right shifts vs. substitution of letters in the sequence).
- The extent of missingness.
To the best of our knowledge, there are no “rules of thumb” for the number of time points needed, the acceptable amount of variability in time series length, the choice of cost matrices, and the acceptable amount of missingness. We recommend that researchers explore their data (e.g., by trying different cost matrices or inserting missingness) to better understand how these factors impact their results, and be transparent in reporting the steps and specifications required to obtain those results.
Additional Information
We created this tutorial with a system environment and versions of R and packages that might differ from yours. If R reports errors when you attempt to run this tutorial, running the code chunk below and comparing your output with the tutorial posted on the LHAMA website may be helpful.
session_info(pkgs = c("attached"))
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os macOS Big Sur/Monterey 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-08-14
## pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## cluster * 2.1.3 2022-03-28 [1] CRAN (R 4.2.0)
## devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
## psych * 2.2.5 2022-05-10 [1] CRAN (R 4.2.0)
## reshape * 0.8.9 2022-04-12 [1] CRAN (R 4.2.0)
## TraMineR * 2.2-4 2022-06-09 [1] CRAN (R 4.2.0)
## TraMineRextras * 0.6.4 2022-06-13 [1] CRAN (R 4.2.0)
## usethis * 2.1.6 2022-05-25 [1] CRAN (R 4.2.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────