These slides and notes accompany a paper presented at ICMI 2012 on October 23, 2012. The notes were used to prepare the presentation and should generally be very similar to what was said, but may not match the exact wording. The published paper should be considered the authoritative source wherever there are disagreements or omissions.
The slideshow should work with modern browsers; it has been tested with Firefox 16 and Chrome 22. There are likely issues with Internet Explorer.
Good morning, and thank you for your time and attention. Today I'll be presenting "Changes in Verbal and Nonverbal Conversational Behavior in Long-Term Interaction".
I'll just start off with the broad-picture overview of what this paper is about. We're interested in modeling verbal and nonverbal behavior in long-term multi-conversation dyadic interaction, which we take to mean multiple conversations, spanning some significant period of time, between the same dyad and addressing one overall conversational task. Our goal is to produce conversational agents whose behavior is engaging and realistic over this kind of long-term interaction with users, and I'll approach things mainly from that angle, but I hope some of this work will also be relevant to those in the audience interested in areas like behavior recognition and interpretation.
This paper is primarily an empirical and observational study. We collected a corpus of this kind of multi-conversation interaction, and have done some analysis of verbal and nonverbal behavior in that corpus. For this study, we focus specifically on conversation openings: the beginning of each conversation. I'll justify that approach in a little while.
Conversational agents — and I'll focus on embodied conversational agents, which attempt to simulate face-to-face conversation — have become a widely-applied technology (at least in research projects) for a variety of applications; there are a few examples listed here. One reason for that is that we hope that the affordances of face-to-face conversation can be leveraged by an agent to promote engagement and rapport. To do that, we believe, or at least assume, that we need to produce sufficiently realistic behavior, that matches how a human analogue of the agent would promote engagement and rapport in conversation.
Many of these applications put the agents in roles where multiple interactions between the user and agent are required. It's common for a counseling intervention, conducted by a human counselor, to last months, years, or longer — indefinitely. Given this, we are motivated to produce conversational agents that have realistic behavior in the context of long-term interaction.
Put simply, an agent's behavior in the tenth conversation with a user should be appropriate for a tenth conversation, in the context of all the previous interactions. If a tenth conversation looks different from a first conversation, an agent's behavior should reflect that, or risk looking increasingly unrealistic and losing engagement over time.
We also argue that behavior in all tenth conversations may not look the same: there are a lot of different aspects of a long-term relationship that may be changing over time and that could be associated with changes in behavior. I'll focus on two. Interaction history includes the number of previous conversations. Interpersonal relationship includes constructs such as trust and intimacy. These two are related but separate: we may see different behavior in a dyad with a long history and a close relationship compared to a dyad with a similarly long history and a distant or weak relationship.
As I mentioned before, we also look specifically at the beginnings of conversation. Some of our previous work in this area hints that where there are differences between conversations, those differences are particularly pronounced at the beginning of conversations. To the extent that differences in behavior are associated with participants' beliefs — or mutual beliefs — about their interpersonal relationship, the beginning of a conversation is where we expect those beliefs to be negotiated or communicated. So if you're looking for differences between conversations, and are going to pick a particular part of those conversations to examine, both of those suggest that openings are a good choice.
There is some previous work that indicates differences in behavior predicted by prior history and by relationship. Much of the prior work is cross-sectional, comparing pairs of friends to pairs of strangers, for example. Cross-sectional studies are valuable, but have some limits in this area. Often, you can show that some change occurs, but it is more difficult to show a pattern of change over time. A trickier issue is separating change over time from differences between dyads: if, for example, we see a difference between friends and strangers, we could explain this as a change that occurs over time, but also as a difference that predicts whether people are more likely to become friends.
In the two items cited here, we see some differences between friends and strangers; both studies are cross-sectional. Cassell et al. found a decrease or minimization in acknowledgment behavior in friends compared to strangers.
Finally, I note that we've done some previous work showing that fairly small changes in behavior — things like adding or removing some lexical variability, or switching between first-person or third-person language — can have a significant effect on user engagement when someone is interacting with a conversational agent for months or years.
Tickle-Degnen and Rosenthal give a model of rapport and nonverbal behavior that I want to pull out as background work, because it specifically addresses the idea of change over time. They suggest that which nonverbal behaviors indicate strong rapport vary over time, with those indicating positivity being more important in early conversations, while coordination is more important later on. Mutual attention is always associated with strong rapport.
We have also published two previous studies looking at changes in verbal and nonverbal behavior over time, both actually using the same corpus as in this study. We found that posture shifts occur more frequently at the beginning of conversations than at the end, and that this decrease in the rate of posture shifts is significantly greater in later conversations.
We also looked at changes in articulation rate, defined here in terms of a normalized duration of a word, not counting any pauses or silence between words. We found that the duration of some words decreased — words that appeared as a pause group by themselves, with silence before and after — and that these words were mostly acknowledgments and discourse markers, so broadly speaking the words that mainly had a conversation management role got minimized over time.
We collected a corpus we call the Exercise Counseling Corpus: a longitudinal videotaped corpus of conversations. These are videotapes of weekly conversations between an exercise trainer and clients, with the trainer acting as a counselor to try and change the clients' attitudes about physical activity. These are in a laboratory setting, with participants recruited specifically for a study, but besides that we tried to make it fairly naturalistic. There was a real, not role-played intervention occurring, and a meaningful task.
We recruited 6 different clients, and a single counselor conducted all sessions. All clients, when recruited, were asked to come in once a week for six weeks, and all but one did come in every week. We have a total of 32 conversations, and a fairly large amount of video.
Since we're interested partly in looking at behavior relative to the perceived strength of interpersonal relationship, we used a self-report measure of therapeutic alliance, which is a conceptualization of interpersonal relationship developed specifically in the context of counseling interactions. Advantages of this construct are that it has good, validated measures, and that we know it's meaningful: a strong alliance has been shown to significantly predict positive outcomes in counseling.
Both the counselor and the client completed a therapeutic alliance survey after every conversation. We see a fairly strong alliance, as illustrated in this plot — this is on a 1 to 5 scale — and we see a clear pattern of increasing alliance over time. Generally, this corpus contains examples of successful counseling and fairly good development of interpersonal relationship. It's weaker on bad examples.
All coding for nonverbal behavior was done in ANVIL, and looked only at the first minute of each conversation. The paper has more detail about the coding and related matters like inter-rater reliability. I'll skip those here for time.
We looked at a fairly large set of different behaviors. Based on preliminary analysis, and on our previous studies, we saw no evidence of major differences that were observable within a single minute of conversation (although a larger corpus might well have turned up some), so all of these variables are aggregates over a video clip.
Our choice of outcome variables was based on a survey of the prior literature, trying to pick out those that might change across multiple conversations. For more detailed references, see the paper, but briefly: The proportion of time speaking is reported as a difference between friends and strangers, as is the use of head nods, specifically for acknowledgment. Smiling, the frequency and expressivity of gestures, and the use of eyebrows are all cited in the literature on immediacy. Gaze-aways are associated with immediacy as well, and at least one study reported an association between gaze-away and topic intimacy. Self-adaptors are associated with anxiety in conversation, so we might expect more in early conversations.
And to look at how those behaviors change over time, we considered a number of possible predictors. First, looking at interaction history, we have the number of previous sessions. But we added a second predictor, which is whether this was the last session. This one didn't initially have strong justification from previous work or theory: primarily, we noticed while doing coding that the final sessions looked qualitatively different from earlier ones, and were easy to tell apart.
To look at interpersonal relationship, we have a participant's self-reported therapeutic alliance as a predictor of their behavior. Since we assessed this at the end of a conversation, this is lagged: it's the alliance reported in the previous conversation.
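As a rough sketch of that lagging step, assuming a simple per-participant list of survey scores (the function name, the record layout, and the example scores are invented for illustration and are not taken from the corpus): each session is paired with the alliance that the same participant reported after the previous session, so a first session has no value.

```python
from typing import Dict, List, Optional, Tuple


def lag_alliance(
    reported: Dict[str, List[float]],
) -> Dict[str, List[Tuple[int, Optional[float]]]]:
    """Pair each session with the alliance reported after the PREVIOUS one.

    The session index doubles as the number of previous sessions.
    The first session (index 0) has no earlier report, so its value is None.
    """
    lagged = {}
    for participant, scores in reported.items():
        rows = []
        for session in range(len(scores)):
            previous = scores[session - 1] if session > 0 else None
            rows.append((session, previous))
        lagged[participant] = rows
    return lagged


# Hypothetical scores on the 1-to-5 scale, one per weekly session.
example = {"client_a": [3.5, 3.9, 4.2]}
lagged_example = lag_alliance(example)
```

Under this layout, the score reported after the last session is never used as a predictor, since no later conversation follows it.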
Finally, since the counselor and client have very different roles in this conversation, we looked at that as a predictor.
Since we didn't want to start out assuming that any particular behavior would be associated with interaction history, or with the quality of the interpersonal relationship, we looked at four different models for each behavior, each picking a different subset of those predictors.
The first (showing A) just includes predictors related to interaction history: the number of previous sessions, and whether it's the last session. The next (showing B) adds the quality of interpersonal relationship: self-reported therapeutic alliance, and an interaction effect with the number of sessions. Finally, we look at variants (showing C and D) which add the participants' role in the interaction, allowing effects to vary separately for the counselor and clients.
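One way to write down the four predictor sets is sketched here. The term names are hypothetical labels, and the exact form of models C and D (taken here to be A and B, each with every term allowed to vary by role) is an assumption; the paper gives the actual specifications.

```python
# Hypothetical term names; "x:y" denotes an interaction between x and y.
HISTORY = ["n_previous_sessions", "is_last_session"]
ALLIANCE = ["lagged_alliance", "lagged_alliance:n_previous_sessions"]


def with_role(terms):
    """Add a role main effect and let every listed term vary by role."""
    return terms + ["role"] + [f"{term}:role" for term in terms]


MODELS = {
    "A": HISTORY,                        # interaction history only
    "B": HISTORY + ALLIANCE,             # adds alliance and alliance x sessions
    "C": with_role(HISTORY),             # assumed: A with role-specific effects
    "D": with_role(HISTORY + ALLIANCE),  # assumed: B with role-specific effects
}
```

Listing the candidate models this way makes the nesting explicit: A is contained in B and in C, and all three are contained in D.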
These are all the gory details of the statistics, or at least some of the details: again, in the interests of time I won't say too much about this, and I'll refer you to the paper for details, but please feel free to ask afterwards if you have any questions.
For mouth positions, our best-fitting model looked only at interaction history, not therapeutic alliance. We see the same trend for both the counselor and clients. Participants spent significantly more time smiling and frowning in the early sessions and this decreased over time. However, the final sessions look different. We see significantly more smiling and frowning in final sessions relative to the trend: it looks a lot more like an initial session.
For gaze-away, we came up with a similar model. Participants gazed away from their conversation partner while speaking significantly more in later conversations. But again, the final sessions between each dyad look different: we see significantly fewer gaze-aways in those sessions, and again it looks a lot more like an initial session.
However, looking at nodding when not speaking — which is primarily nodding while the conversation partner is speaking — we have a more complex model that includes both interaction history and therapeutic alliance: participants nodded more in sessions where they had previously reported low perceived alliance, but only in early sessions. The difference attenuates over time. There's a trend toward less nodding in the last session, but it's not significant.
The previous slides covered three of the seven behavioral variables we looked at, and those were the cleaner of the results. I'll briefly cover here the remaining ones, which are a bit messier.
For amount of time speaking, we see the counselor and client having trends in opposite directions. I should mention they're not becoming more even: the counselor starts off talking less than the clients, and the difference gets larger, not smaller. For adaptors and other hand gestures, we see significant changes only for the counselor. For eyebrow movements, we don't see much of anything significant at all.
To summarize the main results, we see systematic changes across sessions in three behaviors for both the counselor and client. In all of these cases, the trends reverse in the last sessions, which look much more like an initial interaction.
I want to note that our results are only partially in agreement with prior work. Tickle-Degnen and Rosenthal give a model of nonverbal behaviors and rapport which predicts that high rapport is associated with the communication of "positivity" in early conversations, but not in late ones. We see something similar, where both smiles and frowns are more common early on. We did a quick qualitative look at our video, and it was apparent that most of the smiles in our corpus are not Duchenne smiles, which are generally thought to represent felt emotion: so we conjecture that much of that is changes in how much people intentionally communicate positive and negative affect. This is only a slight modification, I think, from the notion of "positivity" to being very explicit about communicating appropriate affective responses in early conversations.
Where we come closer to a conflict with previous work is in the results here on head nodding. It's been reported that friends tend to use less nodding than strangers, for acknowledgment. We see something different here: strong alliance dyads have less nodding than weak alliance dyads, but only in early sessions. One possible conjecture is that we're picking up early differences that are predictive of whether people become friends, rather than differences that would appear over time: the earlier work is cross-sectional, and can't make that distinction.
The last topic I'll discuss is implementing these findings in a nonverbal behavior generation system for a conversational agent. Our basic approach is to implement this kind of long-term change as an adjustment to the probability of generating some nonverbal behavior event: a smile or frown, or a gaze-away, for example.
We've implemented this on top of a rule-based nonverbal behavior generation system, but it should be workable on top of something that is based more directly on a machine-learning model, as long as it can output generation probabilities for behaviors. The disadvantage is that we have to make some untested assumptions: the main one is that our results apply additively with other predictors of behaviors.
Here are the adjustments that we make for the probability of generating a smile or frown, a gaze-away, and a headnod, based on the number of previous sessions, whether it's the final session, and an estimate of the strength of therapeutic alliance. These come directly from the results of the observational study. There are also some adjustments, not shown on this slide, based on our earlier results on posture shift and on articulation rates.
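To make the idea concrete, here is a minimal sketch of an additive probability adjustment of this kind. The coefficient values are placeholders invented for illustration (only their signs follow the trends reported in this talk; the real values are on the slide and in the paper), and the names Context and adjust, as well as the linear attenuation of the alliance effect on nodding, are assumptions rather than the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Context:
    n_previous_sessions: int
    is_final_session: bool
    alliance: float  # estimated therapeutic alliance, 1 (weak) to 5 (strong)


# PLACEHOLDER coefficients: signs match the reported trends, magnitudes invented.
SMILE_PER_SESSION = -0.02     # less smiling and frowning in later sessions
SMILE_FINAL = 0.08            # ...but more again in the final session
GAZE_PER_SESSION = 0.03       # more gaze-aways in later sessions
GAZE_FINAL = -0.10            # ...but fewer again in the final session
NOD_LOW_ALLIANCE = 0.05       # per point of alliance below the scale maximum
NOD_DECAY_PER_SESSION = 0.25  # assumed attenuation of the alliance effect


def clamp(p: float) -> float:
    """Keep an adjusted value inside the valid probability range."""
    return max(0.0, min(1.0, p))


def adjust(behavior: str, base_p: float, ctx: Context) -> float:
    """Add a context-dependent offset to a generator's base probability."""
    n = ctx.n_previous_sessions
    final = 1.0 if ctx.is_final_session else 0.0
    if behavior in ("smile", "frown"):
        delta = SMILE_PER_SESSION * n + SMILE_FINAL * final
    elif behavior == "gaze_away":
        delta = GAZE_PER_SESSION * n + GAZE_FINAL * final
    elif behavior == "nod":
        # Low alliance raises nodding, but only in early sessions.
        weight = max(0.0, 1.0 - NOD_DECAY_PER_SESSION * n)
        delta = NOD_LOW_ALLIANCE * (5.0 - ctx.alliance) * weight
    else:
        delta = 0.0  # behaviors without a long-term adjustment
    return clamp(base_p + delta)
```

Here base_p stands for whatever probability the underlying generator would otherwise assign; adding the offset to it is exactly the untested additivity assumption mentioned earlier.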
This is showing an example of different behavior generated for the same utterance, both at a couple minutes into a conversation, but otherwise in very different contexts. You can see that in an early conversation, with low therapeutic alliance, we've generated a nod and some smiling. In a later conversation, these drop out, but we see an added gaze-away during the utterance.
That wraps up the work reported in this paper. I'll just briefly give some conclusions, and mention our current work and some possible future work. Currently, we're running a longitudinal web-based evaluation study to test the effect of implementing this model in a conversational agent. Participants interact with the agent once a week for six weeks, seeing either the behavior model I discussed here, a model where nothing changes across sessions, or an exaggerated model which basically just multiplies all parameters in the behavior generation slide by 3.
Our main outcomes are self-reported measures of engagement and perceived behavioral realism. I can't yet give you results of that, since it's still ongoing, but it'll be finishing up shortly.
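The three study conditions can be thought of as one scale factor applied to the model's parameters, as in the sketch below. The parameter names and values are hypothetical, and treating the unchanging condition as a scale of zero is an assumption about how that condition is realized.

```python
# Hypothetical parameter names and values, for illustration only.
MODEL_PARAMS = {"smile_per_session": -0.02, "gaze_away_per_session": 0.03}

# Assumed mapping from study condition to a multiplier on every parameter.
CONDITION_SCALE = {"static": 0.0, "model": 1.0, "exaggerated": 3.0}


def params_for(condition: str) -> dict:
    """Return the behavior-model parameters scaled for one study condition."""
    scale = CONDITION_SCALE[condition]
    return {name: value * scale for name, value in MODEL_PARAMS.items()}
```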
In conclusion, we find that there are indeed systematic changes in verbal and nonverbal behavior that occur from the first conversation to later conversations, and that these changes are a complex product of multiple aspects of interpersonal relationship, including the interaction history and the strength of the relationship. Given these changes, applications that deal with verbal and nonverbal behavior in long-term interaction — whether dealing with behavior generation or interpretation — should consider carefully the context of the behavior, and avoid assumptions that behavior is unchanging across multiple sessions.
I'll wrap up by giving some future work, and future research questions raised here. First, most obviously, we're interested in validating these results in a larger population, and in generalizing them beyond the specific scenario — the specific conversational task, kind of interpersonal relationship, and fixed schedule of interaction — present in the Exercise Counseling Corpus.
Finally, we're interested in following up on the finding that final sessions do not follow the trend of the previous interactions, and tend to look a lot more like a first session. This wasn't something we were looking for — maybe we should've been — and we conjecture that it happened specifically because all participants knew that it was the last session. Like the first session, this represents a change in their interpersonal relationship, in their roles toward each other. As the use of conversational agents moves more toward long-term interaction, it's likely that such changes will become more common, so we're interested in whether there are characteristic patterns of conversational behavior, and whether that is what we are seeing here.
Thank you for your time, and I'll take questions now.