\chapter{Discussion}
\label{ch:discussion}
This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study.
\section{Interpretation of Findings}
\subsection{Research Question 1: Accessibility}
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe study condition provides the baseline against which this question is evaluated.
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe sessions, design fidelity scores (DFS) were 42.5, 65, and 62.5 (mean 56.7); across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention reflecting an inability to operate the platform: the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted a fundamental operational barrier. By contrast, Choregraphe produced design difficulties in all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice-block attempt.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The lowest single SUS score in the study (42.5) came from W-04 in the Choregraphe study condition, a wizard who described the platform as getting in the way of their attempt; the highest (90) came from W-02 in the HRIStudio study condition. Because assignment was stratified by programming background, each condition contains exactly one wizard with \emph{None} experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience, enabling a direct cross-background comparison: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); and W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference. It is worth noting that only one participant (W-01, Digital Humanities) came from a non-STEM discipline; the remaining five wizards held backgrounds in Computer Science, Chemical Engineering, or Logic and Philosophy of Science, a composition that limits claims about accessibility for humanities-domain researchers.
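
For reference, the condition means used throughout this chapter follow directly from the individual session scores; the computation is restated here only to make the arithmetic explicit:
\[
\overline{\mathrm{DFS}}_{\mathrm{Chor}} = \frac{42.5 + 65 + 62.5}{3} = 56.7, \qquad
\overline{\mathrm{DFS}}_{\mathrm{HRIS}} = \frac{100 + 100 + 100}{3} = 100,
\]
\[
\overline{\mathrm{SUS}}_{\mathrm{Chor}} = \frac{60 + 75 + 42.5}{3} = 59.2, \qquad
\overline{\mathrm{SUS}}_{\mathrm{HRIS}} = \frac{90 + 70 + 70}{3} = 76.7,
\]
so the between-condition SUS difference is $76.7 - 59.2 = 17.5$ points, the gap discussed in the comparison to prior work below.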
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility research question. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on the first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model leaves that elaboration open, and both W-03 and W-04 pursued it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
\subsection{Research Question 2: Reproducibility}
The second research question asked whether HRIStudio produces more reliable execution of a designed interaction compared to Choregraphe. The most instructive finding from W-01's session is not a score but an incident: the wizard deviated from the written specification by substituting different speech content, and this was not flagged or caught until the live trial (see Section~\ref{sec:results-qualitative} for the full account).
This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in any of the three HRIStudio sessions. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe study condition produced the only speech substitution in the dataset (W-01) and two sessions in which branching was absent from the design entirely (W-03, W-04).
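
To make the enforcement idea concrete, the sketch below models design-time locking in TypeScript. It is illustrative only: the type names and structure are hypothetical rather than HRIStudio's actual implementation, and the point is simply that when speech content is a fixed property of the saved design, the execution layer offers no pathway through which a wizard can substitute different text.

\begin{verbatim}
// Hypothetical sketch of design-time content locking.
// NOT HRIStudio's actual data model; names and types are
// invented for illustration.

interface SpeechAction {
  readonly kind: "speech";
  readonly text: string; // fixed when the design is saved
}

interface ExecutionEvent {
  stepId: string;
  spokenText: string;
  timestamp: number;
}

function executeSpeech(action: SpeechAction, stepId: string): ExecutionEvent {
  // The wizard can trigger the step but cannot edit its content here,
  // so a W-01-style speech substitution is structurally impossible.
  return { stepId, spokenText: action.text, timestamp: Date.now() };
}
\end{verbatim}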
ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were 95, 95, and 100 (mean 96.7). The branching item is particularly instructive: in the Choregraphe study condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In all three HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session did branch execution depend on tool-operation guidance from the researcher.
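
The same structural argument applies to branching. As a further hypothetical illustration (again, not HRIStudio's actual model), a conditional branch can be represented so that the wizard's only runtime input is the observed participant response, while the steps each outcome triggers remain fixed at design time:

\begin{verbatim}
// Hypothetical branch representation; invented for illustration.

type Outcome = "wants_story" | "declines_story";

interface BranchStep {
  readonly kind: "branch";
  // Each outcome maps to a fixed, design-time sequence of step ids.
  readonly outcomes: Readonly<Record<Outcome, readonly string[]>>;
}

function resolveBranch(step: BranchStep, observed: Outcome): readonly string[] {
  // The wizard reports which outcome occurred; what executes next is
  // exactly what the design specified, nothing more.
  return step.outcomes[observed];
}
\end{verbatim}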
\subsection{Session Timing and Downstream Effects}
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. Two factors were at play: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. Even so, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied.
Across all six sessions, design-phase overruns are concentrated in the Choregraphe study condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe study condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. Because programming experience was balanced across conditions by stratified assignment, the design-phase timing difference cannot be attributed to prior programming experience; the timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator and points to the assigned tool, rather than wizard background, as the driver of the overrun pattern.
\section{Comparison to Prior Work}
Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study is consistent with that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards---spanning no, moderate, and extensive programming experience---achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform.
The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data is consistent with this design intent: no speech content deviations occurred across the three HRIStudio sessions, and the HRIStudio ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial.
The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. Because programming experience is balanced across conditions by design, the 17.5-point gap between the two conditions' means reflects a genuine tool-level effect rather than an artifact of the sample's background composition. The gap is largest at the \emph{Moderate} experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the \emph{Extensive} level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70); as noted above, extensive programming experience largely attenuates the tool-level usability difference, while the accessibility advantage remains pronounced for non-programmers and moderate programmers.
\section{Limitations}
This study has several limitations that must be considered when interpreting the findings.
\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.
\textbf{Trial execution without a separate test subject.} Owing to scheduling difficulties, the study protocol was adjusted so that each wizard executed the designed interaction directly rather than running it for a separate test subject. Because the DFS and ERS are scored against the exported project file and live observation rather than a subject's behavioral responses, this change does not affect the primary quantitative measures. The trial phase evaluates whether the wizard's design executes as specified; the presence or absence of a separate subject does not alter that criterion.
\textbf{Single task.} Both study conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Uncontrolled dimensions.} Programming experience was balanced across conditions by stratified assignment (see Section~\ref{sec:measures} and Chapter~\ref{ch:evaluation}): each condition contains one wizard at each of the three experience levels (\emph{None}, \emph{Moderate}, \emph{Extensive}). This controls for programming background as a potential confounder but does not extend to other dimensions. Given the small $N$, balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was neither assessed nor controlled, and these dimensions remain sources of variability unaddressed in this pilot.
\textbf{Platform version.} HRIStudio is continuously evolving. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. HRIStudio wizards uniformly achieved perfect design fidelity (DFS 100) and near-perfect execution reliability (mean ERS 96.7), while Choregraphe wizards averaged DFS 56.7 and ERS 66.7, with two design-phase overruns, one incomplete design, and T-type researcher guidance required in two of the three sessions. The W-01 content deviation (see Section~\ref{sec:results-qualitative}) illustrates the reproducibility problem concretely; its absence in all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across study conditions by stratified assignment, strengthening the cross-background comparisons. The limitations of this pilot study, including sample size, task simplicity, the absence of a separate test subject, and uncontrolled participant dimensions, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.