\chapter{Reproducibility Challenges in WoZ-based HRI Research}
\label{ch:reproducibility}

Having established the landscape of existing WoZ platforms and their limitations, I now turn to a more fundamental question: what makes WoZ experiments difficult to reproduce, and how can software infrastructure help address these challenges? This chapter analyzes the sources of variability in WoZ studies, examines how current practices in infrastructure and reporting contribute to reproducibility problems, and derives specific platform requirements that can mitigate these issues. Understanding these challenges is essential for designing a system that makes experiments not only easier to conduct but also more scientifically rigorous.

\section{Sources of Variability}

Reproducibility in experimental research requires that independent investigators can obtain consistent results when following the same procedures. In WoZ-based HRI studies, however, multiple sources of variability can compromise this goal.

The wizard is at once the strength and the weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. Even with a detailed script, the wizard may vary in timing: delays between a participant's action and the robot's response fluctuate with the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may choose differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols. Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training.
Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.

Beyond wizard behavior, the ``one-off'' nature of many WoZ control systems introduces technical variability. When each research group builds custom software for each study, several problems arise:
\begin{itemize}
  \item Custom interfaces may have undocumented capabilities, hidden features, default behaviors, or timing characteristics that are never formally described.
  \item Software tightly coupled to specific robot models or operating system versions may become unusable when hardware is upgraded or replaced.
  \item Each system logs data differently, with different file formats, different levels of granularity, and different choices about what to record.
\end{itemize}
This fragmentation means that replicating a study often requires not just following an experimental protocol but also reverse-engineering or rebuilding the original software infrastructure.

Even when researchers intend for their work to be reproducible, practical constraints on publication length lead to incomplete documentation. Exact timing parameters are often omitted. Decision rules for wizard actions remain unspecified. Details of the wizard interface go unreported. Specifications of data collection, including which sensor streams were recorded and at what sampling rate, are frequently missing. Without this information, other researchers cannot faithfully recreate the experimental conditions, limiting both direct replication and conceptual extensions of prior work.

\section{Infrastructure Requirements for Enhanced Reproducibility}

Based on this analysis, I identify specific ways that software infrastructure can mitigate reproducibility challenges. Rather than merely providing tools for wizard control, an ideal WoZ platform should actively guide wizards through scripted procedures.
This means:
\begin{itemize}
  \item presenting actions in a prescribed sequence to prevent out-of-order execution,
  \item highlighting the current step in the protocol,
  \item recording any deviations from the script as explicit events in the data log, and
  \item supporting repeatable decision logic through clearly defined conditional branches.
\end{itemize}
By constraining wizard behavior within the bounds of the experimental design, the system reduces unintended variability across trials and participants.

Manual data collection is error-prone and often incomplete. The platform should automatically record:
\begin{itemize}
  \item every action triggered by the wizard, with precise timestamps;
  \item all robot sensor data and state changes;
  \item timing information indicating when actions were requested, when they began executing, and when they completed; and
  \item the full experimental protocol, embedded in the log file so that the script used for any session can be recovered later.
\end{itemize}
This ``data by default'' approach ensures that critical information is never accidentally omitted.

The experimental design itself should serve as documentation. When interaction protocols are defined using structured formats such as visual flowcharts or declarative scripts rather than imperative code, they become simultaneously executable and human-readable. Researchers can then share complete, unambiguous descriptions of their experimental procedures alongside their results.

To maximize the lifespan and transferability of experimental designs, the platform must separate the high-level logic of an interaction from the low-level details of how specific robots execute those behaviors. This abstraction allows experiments designed for one robot to be adapted to another, extending the reproducibility of interaction designs even when the original hardware becomes obsolete.

\section{Gap Between Current Practice and Requirements}

Existing platforms address some but not all of these requirements.
OpenWoZ provides comprehensive logging and supports runtime adaptability, but does not enforce scripted protocols. Choregraphe offers visual programming that is somewhat self-documenting, but tightly couples designs to specific hardware. WoZ4U is accessible and intuitive, but provides limited logging capabilities and no platform independence. The persistent gap is the absence of a platform that addresses reproducibility holistically by combining enforced experimental protocols, automatic comprehensive logging, self-documenting design interfaces, and a platform-agnostic architecture. Closing this gap requires a fundamental rethinking of how WoZ infrastructure is designed, treating methodological rigor as a first-class design goal rather than an afterthought.

\section{Chapter Summary}

This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. I derived four specific infrastructure requirements that can mitigate these challenges: enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. Current platforms address these requirements only partially, revealing a clear opportunity for a new approach that prioritizes reproducibility from its inception. The following chapters describe the design and implementation of such a system.