\chapter{Conclusion and Future Work}
\label{ch:conclusion}
This thesis set out to address two persistent problems in Wizard-of-Oz-based Human-Robot Interaction research. The first is the Accessibility Problem: a high technical barrier prevents domain experts who are not programmers from conducting HRI studies independently. The second is the Reproducibility Problem: the fragmented landscape of custom tools makes it difficult to verify or replicate experimental results across studies and labs. This chapter summarizes the contributions of the work, reflects on what the pilot study results suggest, and identifies directions for future investigation.
\section{Contributions}
This thesis makes three contributions to the field of HRI research infrastructure.
\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot or institution; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.
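The hierarchical specification model named above can be sketched as a plain data model. This is an illustrative sketch only; every class and field name below is an assumption for exposition, not HRIStudio's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the hierarchical specification model
# (Study -> Experiment -> Step -> Action). All names here are
# hypothetical, not HRIStudio's actual schema.

@dataclass
class Action:
    kind: str        # e.g. "speech" or "gesture"
    payload: str     # content fixed at design time

@dataclass
class Step:
    name: str
    actions: list[Action] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    steps: list[Step] = field(default_factory=list)

@dataclass
class Study:
    title: str
    experiments: list[Experiment] = field(default_factory=list)

# The whole protocol is a single serializable tree, which is what
# makes a specification shareable and re-executable across labs.
study = Study("Interactive Storyteller", [
    Experiment("Condition A", [
        Step("Greeting", [Action("speech", "Hello! Shall we begin?")]),
    ]),
])
```

The point of the sketch is that the protocol is data, not code: a wizard edits a tree like this in the visual designer, and the execution layer interprets it, which is what separates protocol design from live trial control.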
\textbf{HRIStudio: a complete, operational platform.} The second contribution is HRIStudio, an open-source, web-based platform that fully realizes the design principles described above. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms, including the NAO6 used in this study. HRIStudio demonstrates that the design principles are not only technically feasible but can be delivered as a complete system that real researchers can use without programming expertise, making it both an artifact and an instrument of validation. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.
\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that went undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.
\section{Reflection on Research Questions}
The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small-$N$ directional study.
On accessibility, the evidence from all six sessions is consistent and directional. The Choregraphe condition produced a mean DFS of 56.7 across three wizards, with design phases averaging 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. All three HRIStudio sessions produced a DFS of 100, with design phases averaging 21 minutes, all within the time allocation. The most direct demonstration comes from W-05: a Chemical Engineering faculty member with no programming background trained in 6 minutes, completed a perfect design in 18 minutes, and ran the trial to completion without tool-operation difficulty. Choregraphe's finite-state-machine model, with boxes connected by signals, imposed cognitive overhead that domain knowledge of the task alone could not resolve; HRIStudio's timeline-based model produced no such friction for any wizard, regardless of background. SUS scores reflect the same pattern: a Choregraphe mean of 59.2 (below the commonly cited SUS average of 68) versus an HRIStudio mean of 76.7 (above it).
On reproducibility, the specification deviation observed in W-01's Choregraphe session, a substituted rock color in the robot's speech that went undetected until execution, illustrates the failure mode the reproducibility problem predicts. No equivalent speech content deviation occurred in any of the three HRIStudio sessions. Branching, the other primary reliability measure, was present in the design and executed in all three HRIStudio sessions. W-05's branch fired cleanly via programmed conditional logic; W-02's session experienced a brief platform-side misfire corrected immediately by manual step selection, logged as an H-type (platform behavior) event rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session was branching absent from the design or dependent on tool-operation guidance from the researcher. By contrast, branching was absent from two Choregraphe designs entirely (W-03, W-04) and resolved by manual re-routing in a third (W-01). ERS condition means reflect the outcome: 66.7 for Choregraphe, 96.7 for HRIStudio. W-06 produced the only perfect ERS in the dataset (100), with a three-minute trial run entirely without researcher intervention. The enforcement model's design intent, locking speech at design time and presenting it during execution rather than requiring re-entry, appears to produce the reliability difference the architecture was designed to achieve.
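The enforcement idea, speech locked at design time and only presented, never retyped, during execution, can be illustrated in a few lines. This is a minimal sketch of the concept; the class and function names are hypothetical, not HRIStudio's actual API, and the utterance is invented.

```python
# Minimal sketch of execution enforcement: speech content is captured
# once at design time, and the live interface can only present it.
# Names are hypothetical, not HRIStudio's actual API.

class LockedSpeech:
    """Speech content fixed at design time; read-only thereafter."""
    def __init__(self, text: str):
        self._text = text

    @property
    def text(self) -> str:   # no setter: mid-trial edits have no pathway
        return self._text

def execute(action: LockedSpeech, trial_log: list) -> str:
    # The wizard triggers the action; the content comes verbatim from
    # the specification and is logged automatically for verification.
    trial_log.append(action.text)
    return action.text

log: list = []
spoken = execute(LockedSpeech("The blue rock hums softly."), log)
```

Under this scheme the kind of silent substitution observed in the Choregraphe session cannot occur: there is simply no interface through which the wizard could alter the utterance during the trial, and every presented utterance is logged against the specification.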
\section{Future Directions}
The work described in this thesis suggests several directions for future investigation.
\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis, for example, whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
\textbf{Multi-task evaluation.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.
\textbf{Longitudinal use.} This study evaluated first-session performance, which captures the initial learning curve but not longer-term practice. A longitudinal study tracking wizard performance across multiple sessions would reveal whether HRIStudio's benefits persist or diminish as wizards become proficient, and whether the tool's structured approach continues to enforce reproducibility over time.
\textbf{Observer and researcher roles.} HRIStudio's role-based architecture includes Observer and Researcher roles that were not formally evaluated in this study. Future work should investigate how these roles support team coordination in multi-experimenter studies, and whether the annotation and logging capabilities they enable produce analysis workflows that are meaningfully more efficient than manual video coding.
\textbf{Platform expansion.} The NAO integration used in this study is one instance of HRIStudio's plugin architecture. Extending the plugin ecosystem to include mobile robots, socially assistive robots, and non-humanoid platforms would broaden the system's applicability and test whether the plugin abstraction is sufficiently general to accommodate the range of robot capabilities used in published HRI research.
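The generality question raised above turns on how narrow the plugin interface can be while still covering heterogeneous robots. A sketch of the idea, with names that are illustrative assumptions rather than HRIStudio's actual plugin API:

```python
from abc import ABC, abstractmethod

# Sketch of a robot-plugin abstraction: experiment logic depends only
# on a narrow interface, and each platform supplies its own
# implementation. Names are hypothetical, not HRIStudio's plugin API.

class RobotPlugin(ABC):
    @abstractmethod
    def capabilities(self) -> set[str]:
        """Declare what this platform supports (speech, motion, ...)."""

    @abstractmethod
    def say(self, text: str) -> None:
        """Render an utterance on the robot."""

class SimulatedHumanoid(RobotPlugin):
    """Stand-in for a humanoid integration such as the NAO6 plugin."""
    def __init__(self):
        self.spoken: list[str] = []

    def capabilities(self) -> set[str]:
        return {"speech", "gesture"}

    def say(self, text: str) -> None:
        self.spoken.append(text)

def run_step(robot: RobotPlugin, utterances: list[str]) -> None:
    # Experiment logic never references a concrete platform; it checks
    # declared capabilities instead of assuming them.
    if "speech" not in robot.capabilities():
        raise ValueError("step requires a speech-capable platform")
    for u in utterances:
        robot.say(u)

robot = SimulatedHumanoid()
run_step(robot, ["Hello! Shall we begin?"])
```

Whether a capability-declaration scheme of this shape is expressive enough for mobile and non-humanoid platforms, whose action vocabularies differ in kind rather than degree, is precisely the open question the proposed platform expansion would test.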
\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs, including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.
\section{Closing Remarks}
The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it rigorous. Studies are conducted with custom tools that are never shared, by wizards whose behavior is never verified against a protocol, producing results that cannot be replicated because the conditions that produced them were never precisely recorded.
HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable: a web-based platform can make WoZ research accessible to domain experts who are not engineers, and execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.