\chapter{Discussion}
\label{ch:discussion}
This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study.
\section{Interpretation of Findings}
\subsection{Research Question 1: Accessibility}
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated.

The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance with speech content punctuation and with a choice block attempt that ultimately failed.

The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), from a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). With programming backgrounds balanced across conditions---each condition contains one wizard with \emph{None} programming experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience---a cross-background comparison is possible: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference.

The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience finished training in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification completely rather than toward machinery the task does not require. Choregraphe's general-purpose programming model invites the opposite, and both W-03 and W-04 took that path, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
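
For reference, the DFS and SUS condition means quoted in this subsection are simple arithmetic means of the three per-wizard scores:
\[
\overline{\mathrm{DFS}}_{\text{Chor.}} = \frac{42.5 + 65 + 62.5}{3} \approx 56.7,
\qquad
\overline{\mathrm{DFS}}_{\text{HRIS.}} = \frac{100 + 100 + 100}{3} = 100,
\]
\[
\overline{\mathrm{SUS}}_{\text{Chor.}} = \frac{60 + 75 + 42.5}{3} \approx 59.2,
\qquad
\overline{\mathrm{SUS}}_{\text{HRIS.}} = \frac{90 + 70 + 70}{3} \approx 76.7.
\]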
\subsection{Research Question 2: Reproducibility}
The second research question asked whether HRIStudio produces more reliable execution of a designed interaction than Choregraphe does. The most instructive finding from W-01's session is not a score but an incident: without any technical failure, the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the written protocol. This deviation was not caught during the design phase, was not flagged by the tool, and was only discovered during the live trial.

This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.

HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in any of the three HRIStudio sessions. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe condition produced the only speech substitution in the dataset (W-01) and two sessions in which conditional branch logic was either absent from the design (W-03) or attempted but never made functional (W-04).

ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were 95, 95, and 100 (mean 96.7). The branching item is particularly instructive: in the Choregraphe condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In all three HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session did branch execution depend on tool-operation guidance from the researcher.
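
For reference, the ERS condition means are simple arithmetic means of the per-wizard scores:
\[
\overline{\mathrm{ERS}}_{\text{Chor.}} = \frac{65 + 60 + 75}{3} \approx 66.7,
\qquad
\overline{\mathrm{ERS}}_{\text{HRIS.}} = \frac{95 + 95 + 100}{3} \approx 96.7,
\]
a difference of 30 points in condition means.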
\subsection{Session Timing and Downstream Effects}
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. The overrun reflected two factors: the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied.

Across all six sessions, design phase overruns are concentrated in the Choregraphe condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. With programming backgrounds balanced across conditions, the timing data corroborates the DFS and SUS findings as a supplementary accessibility indicator and supports attributing the overrun pattern to tool condition rather than to prior programming experience.
\section{Comparison to Prior Work}
The accessibility findings are consistent with prior characterizations of both tools. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study confirms that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards---spanning no, moderate, and extensive programming experience---achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform.

The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data confirms this design intent: no speech content deviations occurred across all three HRIStudio sessions, and the condition ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial.

The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. With programming backgrounds balanced across conditions, the overall 17.5-point gap in condition means reflects a genuine tool-level effect rather than a sampling artifact. The gap is largest at the \emph{Moderate} experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the \emph{Extensive} level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), indicating that extensive programming experience narrows the tool-level usability gap while the accessibility advantage remains pronounced for non-programmers and moderate programmers.
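
The per-pair SUS differences behind this pattern can be written out explicitly (HRIStudio minus Choregraphe at each experience level):
\[
\Delta_{\text{None}} = 70 - 60 = +10,
\qquad
\Delta_{\text{Moderate}} = 90 - 42.5 = +47.5,
\qquad
\Delta_{\text{Extensive}} = 70 - 75 = -5,
\]
with an overall gap in condition means of $76.7 - 59.2 = 17.5$ points.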
\section{Limitations}
This study has several limitations that must be considered when interpreting the findings.

\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.

\textbf{Trial execution without a separate test subject.} Following scheduling difficulties, the study protocol was adjusted so that the wizard executes the designed interaction directly rather than running it for a separate test subject. Because the DFS and ERS are scored against the exported project file and live observation rather than a subject's behavioral responses, this change does not affect the primary quantitative measures. The trial phase evaluates whether the wizard's design executes as specified; the presence or absence of a separate subject does not alter that criterion.

\textbf{Single task.} Both conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.

\textbf{Balance not guaranteed by design.} Random assignment produced a programming-background distribution that happens to be balanced: each condition contains one wizard with no programming experience, one with moderate experience, and one with extensive experience. While this balance is favorable for interpretation, it was not guaranteed by design. The small $N$ means that balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was not assessed or controlled.

\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. The Choregraphe condition produced a mean DFS of 56.7, mean ERS of 66.7, and mean SUS of 59.2, with two design phase overruns and one incomplete design across the three sessions, and with no session executing the conditional branch through programmed logic. The three HRIStudio sessions produced mean DFS 100, mean ERS 96.7, and mean SUS 76.7, all three design phases within the allocation, and no speech content deviations. W-06 produced the only perfect ERS in the dataset. The specification deviation observed in W-01 illustrates the reproducibility problem concretely; its absence across all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across conditions, strengthening the cross-background comparisons. The limitations of this pilot study, including sample size, task simplicity, and the evolving platform version, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.