\chapter{Conclusion and Future Work}
\label{ch:conclusion}
This thesis set out to address two persistent problems in Wizard-of-Oz-based Human-Robot Interaction research. The first is the Accessibility Problem: a high technical barrier prevents domain experts who are not programmers from conducting HRI studies independently. The second is the Reproducibility Problem: the fragmented landscape of custom tools makes it difficult to verify or replicate experimental results across studies and labs. This chapter summarizes the contributions of the work, reflects on what the pilot study results suggest, and identifies directions for future investigation.

\section{Contributions}

This thesis makes three contributions to the field of Human-Robot Interaction research infrastructure.

\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot or institution; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.
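The hierarchical specification model can be sketched as plain data types. The following is a minimal illustration of the Study $\to$ Experiment $\to$ Step $\to$ Action containment structure; the field names and the example utterance are assumptions for exposition, not HRIStudio's actual schema.

```python
# Illustrative sketch of the hierarchical specification model.
# Field names are hypothetical, not HRIStudio's real data model.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str      # e.g. "speech", "gesture", "wait"
    payload: dict  # robot-agnostic parameters, resolved by a plugin

@dataclass
class Step:
    name: str
    actions: list[Action] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    steps: list[Step] = field(default_factory=list)

@dataclass
class Study:
    title: str
    experiments: list[Experiment] = field(default_factory=list)

# A minimal specification: one experiment, one step, one locked utterance.
study = Study("Interactive Storyteller", [
    Experiment("Condition A", [
        Step("greeting",
             [Action("speech", {"text": "Hello! Ready for a story?"})]),
    ]),
])
```

Because the specification is ordinary structured data rather than executable code, it can be serialized, versioned, and shared alongside the study it describes.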

\textbf{HRIStudio: a reference implementation.} The second contribution is HRIStudio, an open-source, web-based platform that realizes the design principles described above. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO V6 used in this study. As a reference implementation, HRIStudio demonstrates that the design principles are technically feasible and can be delivered in a form that real researchers can use without programming expertise. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.

\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that went undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.

\section{Reflection on Research Questions}

The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small-N directional study.

On accessibility, the Choregraphe condition demonstrates that even a tool described as suitable for non-programmers creates significant interface friction in practice. A wizard with programming experience required more time than allocated, generated a high volume of tool-level help requests, and rated the tool below the average SUS benchmark. The finite state machine model --- boxes connected by signals --- imposed cognitive overhead that domain knowledge of the task alone could not resolve. If HRIStudio's timeline-based model and guided workflow reduce that overhead, the difference should appear as higher DFS scores, fewer tool-operation interventions, and higher SUS ratings across the full sample.

On reproducibility, the specification deviation observed in the Choregraphe session illustrates why enforcement matters. A tool that allows wizards to freely edit speech content at any point in the design process creates opportunities for drift that are invisible until they surface during execution. HRIStudio's protocol enforcement forecloses this class of deviation by construction --- speech is locked at design time and surfaced during execution rather than re-entered. Whether this architectural choice translates into measurably higher execution reliability scores, and whether the proportion of tool-assisted branching resolution differs between conditions, are questions the full dataset will answer.
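The by-construction argument can be made concrete with a small sketch: if the execution interface exposes only the actions frozen into the specification, free-text re-entry has no code path, and any attempt to act outside the protocol is refused and visible at the moment it occurs. The class and method names below are assumptions for illustration, not HRIStudio's actual API.

```python
# Illustrative sketch of design-time locking. The wizard can only
# trigger actions that exist in the frozen specification; there is no
# interface for entering new speech during a trial.
class ProtocolExecutor:
    def __init__(self, step_actions):
        # Freeze the specification at load time (copies, not references).
        self._actions = {a["id"]: dict(a) for a in step_actions}

    def trigger(self, action_id):
        """Execute one predefined action. Unknown ids raise immediately,
        so a deviation surfaces when attempted, not after the trial."""
        action = self._actions.get(action_id)
        if action is None:
            raise KeyError(f"action {action_id!r} is not in the specification")
        return action["payload"]  # surfaced verbatim; never re-entered

executor = ProtocolExecutor([
    {"id": "greet", "payload": {"text": "Hello! Ready for a story?"}},
])
print(executor.trigger("greet")["text"])  # the locked utterance
```

The contrast with a freely editable speech box is the point: here, deviation requires modifying the specification itself, which is exactly the step that version control and audit logging can capture.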

% TODO: Once all sessions are complete, rewrite the Reflection section with actual condition means for DFS, ERS, and SUS.
% TODO: Replace the forward-looking framing in both RQ paragraphs with concrete comparative analysis.
% TODO: Update the chapter intro sentence ("The evidence suggests yes...") to reflect the actual direction of the findings.

\section{Future Directions}

The work described in this thesis suggests several directions for future investigation.

\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis --- for example, whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
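A back-of-the-envelope power calculation makes the required scale concrete. The sketch below uses the standard normal-approximation formula for a two-sample comparison of means, $n \approx 2\,\big((z_{1-\alpha/2} + z_{1-\beta})/d\big)^2$ per group; the effect sizes are assumptions chosen for illustration, not estimates from the pilot.

```python
# Normal-approximation sample-size sketch for a two-sample t-style
# comparison of condition means (e.g. DFS or ERS). Effect sizes (d)
# are illustrative assumptions, not pilot-derived estimates.
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha / 2), z(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.8, 1.0, 1.2):  # large assumed effects
    print(f"d={d}: ~{n_per_group(d)} wizards per group")
```

Under these assumptions the sketch suggests roughly 25, 16, and 11 wizards per group respectively, which is one way to see why only a large between-condition effect would be detectable at the proposed sample size.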

\textbf{Multi-task evaluation.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.

\textbf{Longitudinal use.} This study evaluated first-session performance, which captures the initial learning curve but not longer-term practice. A longitudinal study tracking wizard performance across multiple sessions would reveal whether HRIStudio's benefits persist or diminish as wizards become proficient, and whether the tool's structured approach continues to enforce reproducibility over time.

\textbf{Observer and researcher roles.} HRIStudio's role-based architecture includes Observer and Researcher roles that were not formally evaluated in this study. Future work should investigate how these roles support team coordination in multi-experimenter studies, and whether the annotation and logging capabilities they enable produce analysis workflows that are meaningfully more efficient than manual video coding.

\textbf{Platform expansion.} The NAO integration used in this study is one instance of HRIStudio's plugin architecture. Extending the plugin ecosystem to include mobile robots, socially assistive robots, and non-humanoid platforms would broaden the system's applicability and test whether the plugin abstraction is sufficiently general to accommodate the range of robot capabilities used in published HRI research.
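The generality question comes down to whether one capability interface can cover robots as different as a humanoid and a mobile base. A minimal sketch of such an interface follows; the class and method names are hypothetical, not HRIStudio's actual plugin API.

```python
# Sketch of a robot-plugin abstraction: plugins declare which action
# kinds they support and translate robot-agnostic payloads into
# platform-specific commands. Names are illustrative assumptions.
from abc import ABC, abstractmethod

class RobotPlugin(ABC):
    """Maps robot-agnostic actions onto one platform's command set."""

    @abstractmethod
    def capabilities(self) -> set:
        """Action kinds this platform supports (e.g. 'speech', 'drive')."""

    @abstractmethod
    def execute(self, kind: str, payload: dict) -> str:
        """Dispatch one action; returns a log line for the trial record."""

class MobileBasePlugin(RobotPlugin):
    """A non-humanoid example: no speech, but navigation primitives."""

    def capabilities(self):
        return {"drive", "rotate"}

    def execute(self, kind, payload):
        if kind not in self.capabilities():
            raise ValueError(f"unsupported action kind: {kind}")
        return f"{kind} -> {payload}"

plugin = MobileBasePlugin()
print(plugin.execute("drive", {"distance_m": 1.5}))
```

A specification that requests an unsupported capability fails loudly at dispatch, which is the property that would let a designer validate a study against a target platform before any trial is run.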

\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs --- including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.

\section{Closing Remarks}

The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it rigorous. Studies are conducted with custom tools that are never shared, by wizards whose behavior is never verified against a protocol, producing results that cannot be replicated because the conditions that produced them were never precisely recorded.

HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable --- that a web-based platform can make WoZ research accessible to domain experts who are not engineers, and that execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.