post-m04-ch03 edits

This commit is contained in:
Sean O'Connor
2026-02-19 23:50:30 -05:00
parent 5288007c8b
commit 02c40dde96

\chapter{Reproducibility Challenges in WoZ-based HRI Research}
\label{ch:reproducibility}
Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce and how software infrastructure can address them. This chapter analyzes the sources of variability in WoZ studies and examines how current practices in infrastructure and reporting contribute to reproducibility problems. Understanding these challenges is essential for designing a system that supports experimentation at scale while remaining scientifically rigorous.
\section{Sources of Variability}
Reproducibility in experimental research requires that independent investigators can obtain consistent results when following the same procedures. In WoZ-based HRI studies, however, multiple sources of variability can compromise this goal. The wizard is simultaneously the greatest strength and the greatest weakness of the WoZ paradigm: while human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating with the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.
Beyond wizard behavior, the custom nature of many WoZ control systems introduces technical variability. When each research group builds custom software for each study, several problems arise. Custom interfaces may have undocumented capabilities, hidden features, default behaviors, or timing characteristics researchers never formally describe. Software tightly coupled to specific robot models or operating system versions may become unusable when hardware or software is upgraded or replaced. Each system logs data differently, with different file formats, different levels of granularity, and different choices about what to record. This fragmentation means that replicating a study often requires not just following an experimental protocol but also reverse-engineering or rebuilding the original software and hardware infrastructure.
Even when researchers intend for their work to be reproducible, practical constraints on publication length lead to incomplete documentation. Papers often omit exact timing parameters. Authors leave decision rules for wizard actions unspecified and fail to report details of the wizard interface. Specifications of data collection, including which sensor streams were recorded and at what sampling rate, frequently go missing. Without this information, other researchers cannot faithfully recreate the experimental conditions, limiting both direct replication and conceptual extensions of prior work.
\section{Infrastructure Requirements for Enhanced Reproducibility}
Based on this analysis, I identify specific ways that software infrastructure can mitigate reproducibility challenges:
\begin{enumerate}
\item \textbf{Guided wizard execution.} Rather than merely providing tools for wizard control, an ideal WoZ platform should actively guide wizards through scripted procedures. This means presenting actions in a prescribed sequence to prevent out-of-order execution, highlighting the current step in the protocol, recording any deviations from the script as explicit events in the data log, and supporting repeatable decision logic through clearly defined conditional branches. By constraining wizard behavior within the bounds of the experimental design, the system reduces unintended variability across trials and participants.
\item \textbf{Comprehensive automatic logging.} Manual data collection is error-prone and often incomplete. The platform should automatically record every action triggered by the wizard with precise timestamps, all robot sensor data and state changes, and timing information indicating when actions were requested, when they began executing, and when they completed. The full experimental protocol should be embedded in each log file so that researchers can recover the exact script used for any session. Note that recording precise timestamps does not imply that trials must have identical timing--human-robot interactions naturally vary in duration--but rather that the system captures what actually occurred for later analysis.
\item \textbf{Self-documenting protocol specifications.} The protocol specification itself should serve as documentation. When interaction protocols are defined using structured formats such as visual flowcharts or declarative scripts rather than imperative code, they become simultaneously executable and human-readable. Researchers can then share complete, unambiguous descriptions of their experimental procedures alongside their results.
\item \textbf{Platform-independent abstractions.} To maximize the lifespan and transferability of experimental designs, the platform must separate the high-level control logic--the sequence of wizard and robot actions--from the low-level details of how specific robots execute those behaviors. This abstraction allows experiments designed for one robot to be more easily adapted to another, extending the reproducibility of interaction designs even when the original hardware becomes obsolete.
\end{enumerate}
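As a concrete illustration of the first two requirements, guided execution and logging-by-default can be combined in a small sketch. All step names, field names, and class names below are hypothetical, not drawn from any existing platform; the point is only that the protocol is enforced step by step and embedded verbatim in the session log.

```python
import json
import time

# Hypothetical declarative protocol: each step names an allowed action and,
# where relevant, the choices the wizard may make at that point.
PROTOCOL = [
    {"step": "greet", "action": "say", "args": {"text": "Hello!"}},
    {"step": "offer_help", "action": "choose",
     "options": ["point_to_exit", "guide_participant"]},
    {"step": "farewell", "action": "say", "args": {"text": "Goodbye!"}},
]

class GuidedSession:
    """Presents protocol steps in a prescribed order and logs by default."""

    def __init__(self, protocol):
        self.protocol = protocol
        self.cursor = 0
        # Embed the full protocol in the log so the exact script used
        # for this session can be recovered later.
        self.log = {"protocol": protocol, "events": []}

    def trigger(self, step_name, choice=None):
        expected = self.protocol[self.cursor]["step"]
        t_requested = time.time()
        if step_name != expected:
            # Deviations are recorded as explicit events, not silently allowed.
            self.log["events"].append(
                {"type": "deviation", "expected": expected,
                 "got": step_name, "t": t_requested})
            return False
        # ... the robot would execute the action here (stubbed out) ...
        t_completed = time.time()
        self.log["events"].append(
            {"type": "action", "step": step_name, "choice": choice,
             "t_requested": t_requested, "t_completed": t_completed})
        self.cursor += 1
        return True

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.log, f, indent=2)
```

An out-of-order trigger is refused and logged rather than executed, so the resulting log records both what the script prescribed and what the wizard actually did.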
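The platform-independent abstraction described above can be realized, in one minimal form, as an interface that high-level protocol logic targets while per-robot adapters supply the hardware-specific details. The class and method names here are illustrative assumptions, not an existing API.

```python
from abc import ABC, abstractmethod

class Robot(ABC):
    """Hypothetical platform-independent interface: protocol logic calls
    these methods without knowing which robot executes them."""

    @abstractmethod
    def say(self, text: str) -> None: ...

    @abstractmethod
    def gesture(self, name: str) -> None: ...

class NaoAdapter(Robot):
    # A real adapter would call the robot vendor's SDK; stubbed here.
    def say(self, text):
        print(f"[nao] tts: {text}")

    def gesture(self, name):
        print(f"[nao] animation: {name}")

class SimulatedRobot(Robot):
    # A simulator backend lets the same protocol run with no hardware.
    def __init__(self):
        self.history = []

    def say(self, text):
        self.history.append(("say", text))

    def gesture(self, name):
        self.history.append(("gesture", name))

def run_greeting(robot: Robot):
    # The interaction logic is written once, against the abstraction.
    robot.say("Hello!")
    robot.gesture("wave")
```

Because `run_greeting` depends only on the abstract interface, the same experimental design runs unchanged against a physical robot or a simulator, which is what allows a design to outlive any single piece of hardware.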
\section{Connecting Reproducibility Challenges to Infrastructure Requirements}
The reproducibility challenges identified above directly motivate the infrastructure requirements established in Chapter~\ref{ch:background}. Inconsistent wizard behavior creates the need for enforced experimental protocols (R1, R2) that guide wizards systematically. The lack of comprehensive data undermines analysis, motivating automatic logging requirements (R4). Technical fragmentation--where each lab builds custom software tied to specific hardware--violates platform agnosticism (R5), as these custom systems become obsolete when hardware evolves. Incomplete documentation reflects the need for self-documenting protocol specifications (R1, R2) that are simultaneously executable and shareable. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts.
\section{Chapter Summary}