feat: draft discussion chapter and update thesis structure with preliminary results and placeholder sections.

Enhance architectural design, implementation, and evaluation chapters with detailed specifications and pilot validation study
Refactor implementation and evaluation chapters for clarity and detail
2026-05-08 15:18:54 -04:00 · 2026-04-01 17:22:53 -04:00 · 2026-03-26 13:50:07 -04:00 · 2026-03-05 23:28:59 -05:00 · 2026-03-05 14:09:57 -05:00 · 2026-03-04 13:24:36 -05:00
24 changed files with 1141 additions and 151 deletions
@@ -22,3 +22,6 @@ context

 # OS files
 .DS_Store
+
+# IDE files
+.vscode/
@@ -10,20 +10,30 @@
 \setboolean{@twoadv}{false}
 \DeclareOption{numbib}{\PassOptionsToPackage{numbers}{natbib}}
 \DeclareOption{twoadv}{\setboolean{@twoadv}{true}}
+
+\renewcommand\frontmatter{%
+    \if@openright\cleardoublepage\else\clearpage\fi
+    \@mainmatterfalse
+    \pagenumbering{roman}
+    \setcounter{page}{4}
+}
 \ProcessOptions\relax
 %
 \RequirePackage{bm}
 \RequirePackage[square]{natbib}
 \RequirePackage[bf,hang,small]{caption}
-\pagestyle{headings}
+\RequirePackage{geometry}
+\geometry{
+    left=1.5in,
+    right=1.0in,
+    top=1.0in,
+    bottom=1.5in
+}
+\RequirePackage{setspace}
+\doublespacing
+\pagestyle{myheadings}
 %\markright{}
 \setlength{\parskip}{0.2in}
-\setlength{\topmargin}{0.0in}
-\setlength{\oddsidemargin}{0.5in}
-\setlength{\evensidemargin}{0.5in}
-\setlength{\textwidth}{6.0in}
-\addtolength{\textheight}{0.4in}
-%\setlength{\footskip}{1.0in}
 \newcommand{\advisor}[1]{\newcommand{\advisorname}{#1}}
 \newcommand{\advisorb}[1]{\newcommand{\advisornameb}{#1}}
 \newcommand{\chair}[1]{\newcommand{\chairname}{#1}}
@@ -31,7 +41,6 @@
 \newcommand{\butitle}[1]{\newcommand{\titletext}{#1}}
 \newcommand{\degree}[1]{\newcommand{\degreename}{#1}}
 \newcommand{\acknowledgments}[1]{\thispagestyle{myheadings}\markright{}
-                                 \setcounter{page}{2}
                                 \mbox{}
                                 \vspace{1.5in}

@@ -69,15 +78,18 @@
                    \thispagestyle{myheadings}\markright{}
                    #1
                    \pagestyle{headings}}
-\renewcommand{\maketitle}{\begin{titlepage}
+\renewcommand{\maketitle}{
+\thispagestyle{empty}
+\mbox{}
+\newpage
+\begin{titlepage}
+\setcounter{page}{3}
+\begin{singlespace}
 \mbox{}
-\addtolength{\textheight}{1.0in}
 \begin{center}
-\renewcommand{\baselinestretch}{1.2}
 \large
 {\bf \MakeUppercase{\titletext}}

-\renewcommand{\baselinestretch}{1.}
 \normalsize
 \vspace{0.1in}

@@ -129,6 +141,7 @@ Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.3in}}\chairname\\
 \mbox{\hspace{1.3in}}Chair of the Department of \departmentname}
+\end{singlespace}
 \vfill
 \end{titlepage}}
 \endinput
@@ -1,24 +1,32 @@
 \chapter{Introduction}
 \label{ch:intro}

+Human-Robot Interaction (HRI) is an essential field of study for understanding how robots should communicate, collaborate, and coexist with people. As researchers work to develop social robots capable of natural interaction, they face a fundamental challenge: how to prototype and evaluate interaction designs before the underlying autonomous systems are fully developed. This chapter introduces the technical and methodological barriers that currently limit HRI research, describes a generalized approach to address these challenges, and establishes the research objectives and thesis statement for this work.
+
 \section{Motivation}

-To build the social robots of tomorrow, researchers must find ways to convincingly simulate them today. The process of designing and optimizing interactions between human and robot is essential to the field of Human-Robot Interaction (HRI), a discipline dedicated to ensuring these technologies are safe, effective, and accepted by the public. However, current practices for prototyping these interactions are often hindered by complex technical requirements and inconsistent methodologies.
+To build the social robots of tomorrow, researchers must study how people respond to robot behavior today. That requires interactions that feel real even when autonomy is incomplete. The process of designing and optimizing interactions between human and robot is essential to HRI, a discipline dedicated to ensuring these technologies are safe, effective, and accepted by the public \cite{Bartneck2024}. However, current practices for prototyping these interactions are often hindered by complex technical requirements and inconsistent methodologies.

-In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human interaction is inherently unpredictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade. To overcome this, researchers utilize the Wizard-of-Oz (WoZ) technique, where a human operator--the ``wizard''--controls the robot's actions in real-time, creating the illusion of autonomy. This allows for rapid prototyping and testing of interaction designs before the underlying artificial intelligence is fully matured.
+Social robotics focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human interaction is inherently unpredictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.

-Despite its versaility, WoZ research faces two critical challenges. First, a high technical barrier prevents many non-programmers, such as experts in psychology or sociology, from conducting their own studies without engineering support. Second, the hardware landscape is highly fragmented. Researchers frequently build bespoke, ``one-off'' control interfaces for specific robots and specific experiments. These ad-hoc tools are rarely shared, making it difficult for the scientific community to replicate studies or verify findings. This has led to a replication crisis in HRI, where a lack of standardized tooling undermines the reliability of the field's body of knowledge.
+To overcome this limitation, researchers use the Wizard-of-Oz (WoZ) technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the ``great and powerful'' Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In HRI, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher uses a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.

-\section{HRIStudio Overview}
+Despite its versatility, WoZ research faces two critical challenges. The first is the \emph{Accessibility Problem}: a high technical barrier prevents many non-programmers, such as experts in psychology or sociology, from conducting their own studies without engineering support. The second is the \emph{Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. These tools are rarely shared, making it difficult for the scientific community to replicate results or compare findings across labs.

-To address these challenges, this thesis presents HRIStudio, a web-based platform designed to manage the entire lifecycle of a WoZ experiment: from interaction design, through live execution, to final analysis.
+\section{Proposed Approach}

-HRIStudio is built on three core design principles: disciplinary accessibility, scientific reproducibility, and platform sustainability. To achieve accessibility, the platform replaces complex code with a visual, drag-and-drop interface, allowing domain experts to design interaction flows much like creating a storyboard. To ensure reproducibility, HRIStudio enforces a structured experimental workflow that acts as a ``smart co-pilot'' for the wizard. It guides them through a standardized script to minimize human error while automatically logging synchronized data streams for analysis. Finally, unlike tools tightly coupled to specific hardware, HRIStudio utilizes a robot-agnostic architecture to ensure sustainability. This design ensures that the platform remains a viable tool for the community even as individual robot platforms become obsolete.
+To address the accessibility and reproducibility problems in WoZ-based HRI research, I propose a web-based software framework that integrates three key capabilities. First, the framework must provide an intuitive interface for experiment design that does not require programming expertise, enabling domain experts from psychology, sociology, or other fields to create interaction protocols independently. Second, it must enforce methodological rigor during experiment execution by guiding the wizard through standardized procedures and preventing deviations from the experimental script that could compromise validity. Third, it must be platform-agnostic, meaning the same experiment design can be reused across different robot hardware as technology evolves.
+
+This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for the HRI research community. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.
+
+The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a protocol/trial separation with explicit deviation logging. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is one implementation of this architecture: an open-source reference system that realizes those principles and serves as the instrument for empirical validation.

 \section{Research Objectives}

-The primary objective of this work is to demonstrate that a unified, web-based software framework can significantly improve both the accessibility and reproducibility of HRI research. Specifically, this thesis aims to develop a production-ready platform, validate its accessibility for non-programmers, and assess its impact on experimental rigor.
+This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established the vision and the prototype. This thesis formalizes the contribution: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a reference implementation of those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.

-First, this work translates the foundational architecture proposed in prior publications into a stable, full-featured software platform capable of supporting real-world experiments. Second, through a formal user study, we evaluate whether HRIStudio allows participants with no robotics experience to successfully design and execute a robot interaction, comparing their performance against industry-standard software. Finally, we quantify the impact of the platform's guided execution features on the consistency of wizard behavior and the accuracy of data collection.
+The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and how consistently they can do it.

-This work builds upon preliminary concepts reported in two peer-reviewed publications \cite{OConnor2024, OConnor2025}. It extends that research by delivering the complete implementation of the system and a comprehensive empirical evaluation of its efficacy.
+\section{Chapter Summary}
+
+This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question (can a hierarchical, event-driven specification model with explicit deviation logging lower the technical barrier and improve reproducibility of WoZ experiments?) and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations.
@@ -1,20 +1,47 @@
-\chapter{Background and Context}
+\chapter{Background and Related Work}
 \label{ch:background}

-\section{Human-Robot Interaction and Wizard-of-Oz}
+This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis relative to prior work on this topic.

-HRI is a multidisciplinary field dedicated to understanding, designing, and evaluating robotic systems for use by or with humans. Unlike industrial robotics, where safety often means physical separation, social robotics envisions a future where robots operate in shared spaces, collaborating with people in roles ranging from healthcare assistants and educational tutors to customer service agents.
+As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field that brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers. First, each research group builds custom software for specific robots, creating tool fragmentation across the field. Second, high technical barriers prevent many domain experts from conducting independent studies.

-For these interactions to be effective, robots must exhibit social intelligence. They must recognize and respond to human social cues--such as speech, gaze, and gesture--in a manner that is natural and intuitive. However, developing the artificial intelligence required for fully autonomous social interaction is an immense technical challenge. Perception systems often struggle in noisy environments, and natural language understanding remains an area of active research.
+\section{Existing WoZ Platforms and Tools}

-To bridge the gap between current technical limitations and desired interaction capabilities, researchers employ the WoZ technique. In a WoZ experiment, a human operator (the ``wizard'') remotely controls the robot's behaviors, unaware to the study participant. To the participant, the robot appears to be acting autonomously. This methodology allows researchers to test hypotheses about human responses to robot behaviors without needing to solve the underlying engineering challenges first.
+Over the last two decades, multiple frameworks to support and automate the WoZ paradigm have been reported in the literature. These frameworks can be broadly categorized based on their primary design emphases, generality, and the methodological practices they encourage. Foundational work by Steinfeld et al. \cite{Steinfeld2009} articulated the methodological importance of WoZ simulation, distinguishing between the human simulating the robot (Wizard of Oz) and the robot simulating the human. In the latter case (Oz of Wizard), the robot acts as if controlled by a person when it is actually autonomous. This distinction has influenced how subsequent tools approach the design and execution of WoZ experiments.

-\section{Prior Work}
+Early platform-agnostic tools focused on providing robust, flexible interfaces for technically sophisticated users. These systems were designed to work with multiple robot types rather than a single hardware platform. Polonius \cite{Lu2011}, built on the Robot Operating System (ROS) \cite{Quigley2009}, exemplifies this generation. It provides a graphical interface for defining finite state machine scripts that control robot behaviors, with integrated logging capabilities to streamline post-experiment analysis. The system was explicitly designed to enable robotics engineers to create experiments that their non-technical collaborators could then execute. However, the initial setup and configuration still required substantial programming expertise. Similarly, OpenWoZ \cite{Hoffman2016} introduced a cloud-based, runtime-configurable architecture using web protocols. Its design allows multiple operators or observers to connect simultaneously, and its plugin system enables researchers to extend functionality such as adding new robot behaviors or sensor integrations. Most importantly, OpenWoZ allows runtime modification of robot behaviors, enabling wizards to deviate from scripts when unexpected situations arise. While architecturally sophisticated and highly flexible, OpenWoZ requires programming knowledge to create custom behaviors and configure experiments, which leaves the \emph{Accessibility Problem} unresolved for non-technical researchers.

-This thesis represents the culmination of a multi-year research effort to address critical infrastructure gaps in the HRI community. The ideas presented here build upon a foundational trajectory established through two peer-reviewed publications.
+A second wave of tools shifted focus toward usability, often achieving accessibility by coupling tightly with specific hardware platforms. WoZ4U \cite{Rietz2021} was explicitly designed as an ``easy-to-use'' tool for conducting experiments with Aldebaran's Pepper robot. It provides an intuitive graphical interface that allows non-programmers to design interaction flows, and it successfully lowers the technical barrier. However, this usability comes at the cost of generalizability. WoZ4U is unusable with other robot platforms, and manufacturer-provided software follows a similar pattern.

-We first introduced the concept for HRIStudio as a Late-Breaking Report at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}. In that work, we identified the lack of accessible tooling as a primary barrier to entry in HRI and proposed the high-level vision of a web-based, collaborative platform. We established the core requirements for the system: disciplinary accessibility, robot agnosticism, and reproducibility.
+Choregraphe \cite{Pot2009}, developed by Aldebaran Robotics for the NAO and Pepper robots, offers a visual programming environment based on connected behavior boxes. Researchers can create complex interaction flows using drag-and-drop blocks without writing code in traditional programming languages. However, when new robot platforms emerge or when hardware becomes obsolete, tools like Choregraphe and WoZ4U lose their utility. Pettersson and Wik, in their review of WoZ tools \cite{Pettersson2015}, note that platform-specific systems often fall out of use as technology evolves, forcing researchers to constantly rebuild their experimental infrastructure.

-Following the initial proposal, we published the detailed system architecture and preliminary prototype as a full paper at RO-MAN 2025 \cite{OConnor2025}. That publication validated the technical feasibility of our web-based approach, detailing the communication protocols and data models necessary to support real-time robot control using standard web technologies.
+Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high-fidelity data collection.

-While those prior publications established the ``what'' and the ``how'' of HRIStudio, this thesis focuses on the realization and validation of the platform. We extend our previous research in two key ways. First, we move beyond prototypes to deliver a complete, production-ready software platform (v1.0), resolving complex engineering challenges related to stability, latency, and deployment. Second, and crucially, we provide the first rigorous user study of the platform. By comparing HRIStudio against industry-standard tools, this work provides empirical evidence to support our claims of improved accessibility and experimental consistency.
+This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor. By methodological rigor, I refer to systematic features that guide experimenters toward best practices like standardized protocols, comprehensive logging, and reproducible experimental designs.
+
+Moreover, few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
+
+\section{Requirements for Modern WoZ Infrastructure}
+
+This thesis is the latest step in a multi-year effort to build infrastructure that addresses the challenges identified in the WoZ platform landscape. Based on the analysis of existing platforms and identified methodological gaps, I derived requirements for a modern WoZ research infrastructure. Through our preliminary work \cite{OConnor2024}, we identified six critical capabilities that a comprehensive platform should provide:
+
+\begin{description}
+\item[R1: Integrated workflow.] All phases of the experimental workflow (design, execution, and analysis) should be integrated within a single unified environment to minimize context switching and tool fragmentation.
+\item[R2: Low technical barrier.] Creating interaction protocols should require minimal to no programming expertise, enabling domain experts from psychology, education, or other fields to work independently \cite{Bartneck2024}.
+\item[R3: Real-time control.] The system must support fine-grained, responsive real-time control during live experiment sessions across a variety of robotic platforms.
+\item[R4: Automated logging.] All actions, timings, and sensor data should be automatically logged with synchronized timestamps to facilitate analysis.
+\item[R5: Platform agnosticism.] The architecture should decouple experimental logic from robot-specific implementations. This allows experiments designed for one robot type to be adapted to others, ensuring the platform remains viable as hardware evolves.
+\item[R6: Collaborative support.] Multiple team members should be able to contribute to experiment design and review execution data, supporting truly interdisciplinary research.
+\end{description}
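
As an illustration of how R4 and R5 might fit together, the following is a minimal Python sketch and not the HRIStudio implementation. It assumes a hypothetical `RobotAdapter` interface with two abstract actions (`say` and `gesture`), a stand-in `ConsoleRobot` backend, and a `LoggedSession` wrapper; all of these names are invented here for illustration. The protocol is written once against the abstract actions, and every executed action is appended to a timestamped log.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Protocol


class RobotAdapter(Protocol):
    """Hypothetical robot-specific backend; one implementation per platform."""

    def say(self, text: str) -> None: ...
    def gesture(self, name: str) -> None: ...


class ConsoleRobot:
    """Stand-in adapter that records commands instead of driving hardware."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def say(self, text: str) -> None:
        self.sent.append(f"say:{text}")

    def gesture(self, name: str) -> None:
        self.sent.append(f"gesture:{name}")


@dataclass
class LoggedSession:
    """Runs abstract actions against any adapter and logs each one (R4)."""

    robot: RobotAdapter
    log: list[dict] = field(default_factory=list)

    def run(self, action: str, argument: str) -> None:
        # Dispatch on the abstract action name, never on the robot type (R5).
        if action == "say":
            self.robot.say(argument)
        elif action == "gesture":
            self.robot.gesture(argument)
        else:
            raise ValueError(f"unknown action: {action}")
        # Monotonic clock: timestamps are ordered even if wall time changes.
        self.log.append(
            {"t": time.monotonic(), "action": action, "argument": argument}
        )


# The protocol is plain data and does not mention any specific robot.
protocol = [("say", "You are doing great!"), ("gesture", "thumbs_up")]
robot = ConsoleRobot()
session = LoggedSession(robot)
for action, argument in protocol:
    session.run(action, argument)
```

Under these assumptions, moving the same protocol to a different robot would mean writing another adapter class with the same two methods; the protocol list and the logging code would stay unchanged.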
+
+To the best of my knowledge, no existing platform satisfies all six requirements. Most critically, the trade-off between accessibility and flexibility remains unresolved. Few tools embed methodological best practices directly into their design to guide experimenters toward sound methodology by default.
+
+This work builds on two prior peer-reviewed publications. We first introduced the concept for HRIStudio as a Late-Breaking Report at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}. In that position paper, we identified the lack of accessible tooling as a primary barrier to entry in HRI and proposed the high-level vision of a web-based, collaborative platform. We established the core requirements listed above and argued for a web-based approach to achieve them.
+
+Following the initial proposal, we published the detailed system architecture and preliminary prototype as a full paper at RO-MAN 2025 \cite{OConnor2025}. That publication validated the technical feasibility of our approach, detailing the communication protocols, data models, and plugin architecture necessary to support real-time robot control using standard web technologies while maintaining platform independence.
+
+While those prior publications established the conceptual framework and technical architecture, this thesis formalizes those design principles, realizes them in a complete implementation, and tests whether they produce measurably different outcomes in a pilot validation study. The pilot study compares design fidelity and execution reliability between HRIStudio and a representative baseline tool, showing whether these principles translate into better outcomes for real researchers.
+
+\section{Chapter Summary}
+
+This chapter has established the technical and methodological context for this thesis. Existing WoZ platforms fall into two categories: general-purpose tools like Polonius and OpenWoZ that offer flexibility but high technical barriers, and platform-specific systems like WoZ4U and Choregraphe that prioritize usability at the cost of cross-platform generality. Recent approaches such as VR-based frameworks attempt to bridge this gap, yet to the best of my knowledge, no existing tool successfully combines accessibility, flexibility, and embedded methodological rigor. Based on this landscape analysis, I identified six critical requirements for modern WoZ infrastructure (R1--R6): integrated workflows, low technical barriers, real-time control across platforms, automated logging, platform-agnostic design, and collaborative support. These requirements are the standard against which the proposed design is evaluated in Chapter~\ref{ch:evaluation}. The next chapter examines the broader reproducibility challenges that justify why these requirements are essential.
@@ -1,18 +0,0 @@
-\chapter{Related Work and State of the Art}
-\label{ch:related_work}
-
-\section{Existing Frameworks}
-
-The HRI community has a long history of developing custom tools to support WoZ studies. Early efforts focused on providing robust interfaces for technical users. For example, Polonius \cite{Lu2011} was designed to give robotics engineers a flexible way to create experiments for their collaborators, emphasizing integrated logging to streamline analysis. Similarly, OpenWoZ \cite{Hoffman2016} introduced a cloud-based, runtime-configurable architecture that allowed researchers to modify robot behaviors on the fly. These tools represented significant advancements in experimental infrastructure, moving the field away from purely hard-coded scripts. However, they largely targeted users with significant technical expertise, requiring knowledge of specific programming languages or network protocols to configure and extend.
-
-\section{General vs. Domain-Specific Tools}
-
-A recurring tension in the design of HRI tools is the trade-off between specialization and generalizability. Some tools prioritize usability by coupling tightly with specific hardware. WoZ4U \cite{Rietz2021}, for instance, provides an intuitive graphical interface specifically for the Pepper robot, making it accessible to non-technical researchers but unusable for other platforms. Manufacturer-provided software like Choregraphe \cite{Pot2009} for the NAO robot follows a similar pattern: it offers a powerful visual programming environment but locks the user into a single vendor's ecosystem. Conversely, generic tools like Ozlab seek to support a wide range of devices but often struggle to maintain relevance as hardware evolves \cite{Pettersson2015}. This fragmentation forces labs to constantly switch tools or reinvent infrastructure, hindering the accumulation of shared methodological knowledge.
-
-\section{Methodological Critiques}
-
-Beyond software architecture, the methodological rigor of WoZ studies has been a subject of critical review. In a seminal systematic review, Riek \cite{Riek2012} analyzed 54 HRI studies and uncovered a widespread lack of consistency in how wizard behaviors were controlled and reported. The review noted that very few researchers reported standardized wizard training or measured wizard error rates, raising concerns about the internal validity of many experiments. This lack of rigor is often exacerbated by the tools themselves; when interfaces are ad-hoc or poorly designed, they increase the cognitive load on the wizard, leading to inconsistent timing and behavior that can confound study results.
-
-\section{Research Gaps}
-
-Despite the rich landscape of existing tools, a critical gap remains for a platform that is simultaneously accessible, reproducible, and sustainable. Existing accessible tools are often too platform-specific to be widely adopted, while flexible, general-purpose frameworks often present a prohibitively high technical barrier. Furthermore, few tools directly address the methodological crisis identified by Riek by enforcing standardized protocols or actively guiding the wizard during execution. HRIStudio aims to fill this void by providing a web-based, robot-agnostic platform that not only lowers the barrier to entry for interdisciplinary researchers but also embeds methodological best practices directly into the experimental workflow.
@@ -0,0 +1,38 @@
+\chapter{Reproducibility Challenges}
+\label{ch:reproducibility}
+
+Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce and how software infrastructure can address them. This chapter analyzes the sources of variability in WoZ studies and examines how current practices in infrastructure and reporting contribute to reproducibility problems. Understanding these challenges is essential for designing a system that supports reproducible, rigorous experimentation.
+
+\section{Sources of Variability}
+
+Reproducibility in experimental research requires that independent investigators can obtain consistent results when following the same procedures. In WoZ-based HRI studies, however, multiple sources of variability can compromise this goal. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
+
+Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.
+
+Beyond wizard behavior, the custom nature of many WoZ control systems introduces technical variability. When each research group builds custom software for each study, several problems arise. Custom interfaces may have undocumented capabilities, hidden features, default behaviors, or timing characteristics researchers never formally describe. Software tightly coupled to specific robot models or operating system versions may become unusable when hardware or software is upgraded or replaced. Each system logs data differently, with different file formats, different levels of granularity, and different choices about what to record. This fragmentation means that replicating a study often requires not just following an experimental protocol but also reverse-engineering or rebuilding the original software and hardware infrastructure.
+
+Even when researchers intend for their work to be reproducible, practical constraints on publication length lead to incomplete documentation. Papers often omit exact timing parameters. Authors leave decision rules for wizard actions unspecified and fail to report details of the wizard interface. Specifications of data collection, including which sensor streams were recorded and at what sampling rate, frequently go missing. Without this information, other researchers cannot faithfully recreate the experimental conditions, limiting both direct replication and conceptual extensions of prior work.
+
+\section{Infrastructure Requirements for Enhanced Reproducibility}
+
+Based on this analysis, I identify specific ways that software infrastructure can mitigate reproducibility challenges:
+
+\begin{enumerate}
+\item \textbf{Guided wizard execution.} Rather than merely providing tools for wizard control, an ideal WoZ platform should actively guide wizards through scripted procedures. This means presenting actions in a prescribed sequence to prevent out-of-order execution, highlighting the current step in the protocol, recording any deviations from the script as explicit events in the data log, and supporting repeatable decision logic through clearly defined conditional branches. By constraining wizard behavior within the bounds of the experimental design, the system reduces unintended variability across trials and participants.
+
+\item \textbf{Comprehensive automatic logging.} Manual data collection is error-prone and often incomplete. The platform should automatically record every action triggered by the wizard with precise timestamps, all robot sensor data and state changes, and timing information indicating when actions were requested, when they began executing, and when they completed. The full experimental protocol should be embedded in each log file so that researchers can recover the exact script used for any session. Note that recording precise timestamps does not imply that trials must have identical timing, since human-robot interactions naturally vary in duration; rather, the system captures what actually occurred for later analysis.
+
+\item \textbf{Self-documenting protocol specifications.} The protocol specification itself should serve as documentation. When interaction protocols are defined using structured formats such as visual flowcharts or declarative scripts rather than imperative code, they become simultaneously executable and human-readable. Researchers can then share complete, unambiguous descriptions of their experimental procedures alongside their results.
+
+\item \textbf{Platform-independent abstractions.} To maximize the lifespan and transferability of experimental designs, the platform must separate the high-level control logic, the sequence of wizard and robot actions, from the low-level details of how specific robots execute those behaviors. This abstraction allows experiments designed for one robot to be more easily adapted to another, extending the reproducibility of interaction designs even when the original hardware becomes obsolete.
+\end{enumerate}
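+
+As a minimal sketch of the logging requirement above, the following Python fragment shows one way a trial log could carry the three timestamps for each action and embed the protocol it was produced from. The class names \texttt{ActionRecord} and \texttt{TrialLog}, their fields, and the JSON layout are illustrative assumptions, not the schema of any existing platform.
+

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ActionRecord:
    """One logged action with the three timestamps named in the text."""
    action_id: str
    scripted: bool                         # False marks a deviation from the script
    requested_at: float                    # when the wizard triggered the action
    started_at: Optional[float] = None     # when the robot began executing it
    completed_at: Optional[float] = None   # when the robot finished

    def latency(self) -> Optional[float]:
        """Delay between the wizard's request and the start of execution."""
        if self.started_at is None:
            return None
        return self.started_at - self.requested_at


@dataclass
class TrialLog:
    """A self-contained log: the protocol travels with the recorded events."""
    participant_id: str
    protocol: dict                         # full protocol specification, embedded
    records: List[ActionRecord] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested ActionRecord instances.
        return json.dumps(asdict(self), indent=2)


# Hypothetical session: one scripted action logged for participant P01.
log = TrialLog(participant_id="P01",
               protocol={"steps": ["Intro", "Story Telling", "Recall Test"]})
log.records.append(
    ActionRecord("greet", scripted=True,
                 requested_at=10.0, started_at=10.5, completed_at=12.0)
)

# Reading the file back recovers both the events and the script that produced them.
restored = json.loads(log.to_json())
```

+
+Because the protocol is serialized alongside the records, a researcher who receives only the log file can still recover the script used for that session, which is the property the requirement asks for.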
+
+
+
+\section{Connecting Reproducibility Challenges to Infrastructure Requirements}
+
+The reproducibility challenges identified above directly motivate the infrastructure requirements (R1--R6) established in Chapter~\ref{ch:background}. Inconsistent wizard behavior creates the need for enforced execution protocols (R1) that guide wizards step by step, and for automatic logging (R4) that captures any deviations that occur. Timing errors specifically motivate responsive, fine-grained real-time control (R3): a wizard working with a sluggish interface introduces latency that disrupts the interaction and confounds timing analysis. Technical fragmentation forces each lab to rebuild infrastructure as hardware changes, which motivates platform agnosticism (R5). Incomplete documentation reflects the need for self-documenting, code-free protocol specifications (R1, R2) that are simultaneously executable and shareable. Finally, the isolation of individual research groups motivates collaborative support (R6): allowing multiple team members to observe and review trials enables the shared scrutiny that reproducibility requires. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts.
+
+\section{Chapter Summary}
+
+This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. Rather than treating these challenges as inherent to the WoZ paradigm, I showed how each stems from gaps in current infrastructure. Software design can systematically mitigate these challenges through enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. These design goals directly address the six infrastructure requirements identified in Chapter~\ref{ch:background}. The following chapters describe the design, implementation, and pilot validation of a system that prioritizes reproducibility as a foundational design principle from inception.
@@ -1,11 +0,0 @@
-\chapter{Reproducibility Challenges in WoZ-based HRI Research}
-\label{ch:reproducibility}
-
-\section{Sources of Variability}
-% TODO
-
-\section{Infrastructure and Reporting}
-% TODO
-
-\section{Platform Requirements}
-% TODO
@@ -0,0 +1,335 @@
+\chapter{Architectural Design}
+\label{ch:design}
+
+Chapter~\ref{ch:background} established six requirements for modern WoZ infrastructure, labeled R1 through R6, and Chapter~\ref{ch:reproducibility} showed the reproducibility problems that motivate them. This chapter presents the architectural contribution of this thesis: a hierarchical specification model, an event-driven execution model, a modular interface architecture, and an integrated data flow that together address all six requirements. These are design principles, not implementation details; they apply to any system built with the same goals.
+
+\section{Hierarchical Organization of Experiments}
+
+WoZ studies involve multiple reusable conditions, shared protocol phases, and platform-specific behaviors that span the full research lifecycle. To organize these components without requiring researchers to write code, the system structures every study as a four-level hierarchy: \emph{study} $\rightarrow$ \emph{experiment} $\rightarrow$ \emph{step} $\rightarrow$ \emph{action}. This structure separates high-level protocol design from low-level execution behavior, keeping the authoring process code-free while integrating design, execution, and analysis into a single unified workflow.
+
+The terms in this hierarchy are used in a strict way. A \emph{study} is the top-level research container that groups related protocol conditions. An \emph{experiment} is one reusable condition within that study (for example, a control versus experimental condition). A \emph{step} is one phase of the protocol timeline (for example, an introduction, telling a story, or testing recall). An \emph{action} is the smallest executable unit inside a step (for example, trigger a gesture, play audio, or speak a prompt).
+
+Figure~\ref{fig:experiment-hierarchy} shows a representation of this hierarchical structure for social robotics studies. Reading top-down, one study contains one or more experiments, each experiment contains one or more steps, and each step contains one or more actions. Figure~\ref{fig:trial-instantiation} shows the protocol-versus-instance separation in isolation. The left column holds the protocol designed once before the study begins; the right column shows the separate trial records produced each time a participant runs it. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed.
+
+To illustrate the schema with a concrete example, consider an interactive storytelling study with the research question: \emph{Does robot interaction modality influence participant recall performance?} The two conditions differ in how the robot looks and behaves: the NAO6 has a human-like form and uses expressive gestures, while the TurtleBot is visibly machine-like and offers no social movement cues. This keeps the narrative task the same across both conditions while changing only how the robot delivers it.
+
+Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same ordered steps (Intro, Story Telling, Recall Test), and each step contains actions. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure. Figures~\ref{fig:experiment-hierarchy}, \ref{fig:trial-instantiation}, and~\ref{fig:example-hierarchy} together progress from abstract schema, to protocol-versus-instance separation, to a concrete instantiation.
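+
+The hierarchy and the protocol-versus-trial separation can also be sketched as plain data types. The following Python fragment is an illustration under stated assumptions: the class names, the \texttt{instantiate} helper, and the use of immutable records for the protocol are hypothetical choices made for this example, and only the Story Telling step is populated, mirroring the figure.
+

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


# Protocol side: frozen (immutable) because it is designed once, before any trial.
@dataclass(frozen=True)
class Action:
    name: str


@dataclass(frozen=True)
class Step:
    name: str
    actions: Tuple[Action, ...]


@dataclass(frozen=True)
class Experiment:
    name: str
    steps: Tuple[Step, ...]


@dataclass(frozen=True)
class Study:
    name: str
    experiments: Tuple[Experiment, ...]


# Trial side: one mutable record per participant, generated during a live session.
@dataclass
class Trial:
    experiment: Experiment
    participant_id: str
    started_at: datetime
    events: List[str] = field(default_factory=list)


def instantiate(experiment: Experiment, participant_id: str) -> Trial:
    """Create a fresh, timestamped trial record from a shared protocol."""
    return Trial(experiment, participant_id, datetime.now(timezone.utc))


nao = Experiment("NAO6 with Gestures", (
    Step("Story Telling",
         (Action("Gesture Hand"), Action("Gesture Head"), Action("Speak"))),
))
turtlebot = Experiment("TurtleBot with Voice", (
    Step("Story Telling",
         (Action("Play Audio"), Action("Beep"), Action("Speak"))),
))
study = Study("Recall Study", (nao, turtlebot))

# Two participants run the same experiment; each gets a separate record.
trial_p01 = instantiate(nao, "P01")
trial_p02 = instantiate(nao, "P02")
```

+
+Both trial records point at the same experiment object but keep separate event lists, which is the sense in which one specification yields a distinct record per participant.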
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+	nodebox/.style={rectangle, draw=black, thick, fill=gray!15, align=center,
+		text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
+	nodeboxdark/.style={rectangle, draw=black, thick, fill=gray!35, align=center,
+		text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
+	arrow/.style={->, thick},
+	label/.style={font=\small\itshape, fill=white, inner sep=2pt}]
+
+	\node[nodebox]     (study)  at (0,  6.0) {Study};
+	\node[nodebox]     (exp)    at (0,  4.0) {Experiment};
+	\node[nodebox]     (step)   at (0,  2.0) {Step};
+	\node[nodeboxdark] (action) at (0,  0.0) {Action};
+
+	\draw[arrow] (study.south)  -- node[label, right=6pt] {has one or more} (exp.north);
+	\draw[arrow] (exp.south)    -- node[label, right=6pt] {has one or more} (step.north);
+	\draw[arrow] (step.south)   -- node[label, right=6pt] {has one or more} (action.north);
+
+\end{tikzpicture}
+\caption{The four-level experiment specification hierarchy.}
+\label{fig:experiment-hierarchy}
+\end{figure}
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+	spec/.style={rectangle, draw=black, thick, fill=gray!15, align=center,
+		text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
+	trial/.style={rectangle, draw=black, thick, dashed, fill=gray!5, align=center,
+		text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
+	arrow/.style={->, thick},
+	darrow/.style={->, thick, dashed}]
+
+	%% ---- Column headers ----
+	\node[font=\small\bfseries] at (1.9,  7.0) {Protocol (designed once)};
+	\node[font=\small\bfseries] at (7.9,  7.0) {Trials (run per participant)};
+
+	%% ---- Protocol column ----
+	\node[spec] (study) at (1.9, 5.8) {Study};
+	\node[spec] (exp)   at (1.9, 4.2) {Experiment};
+	\node[spec] (step)  at (1.9, 2.6) {Step};
+
+	\draw[arrow] (study.south) -- (exp.north);
+	\draw[arrow] (exp.south)   -- (step.north);
+
+	%% ---- Trial column ----
+	\node[trial] (t1) at (7.9, 5.5) {Trial --- P01\\{\footnotesize timestamped log}};
+	\node[trial] (t2) at (7.9, 4.2) {Trial --- P02\\{\footnotesize timestamped log}};
+	\node[trial] (t3) at (7.9, 2.9) {Trial --- P03\\{\footnotesize timestamped log}};
+
+	%% ---- Separator ----
+	\draw[gray!60, thick, dashed] (4.85, 1.8) -- (4.85, 6.6);
+	\node[font=\footnotesize\itshape, gray!80] at (4.85, 1.4) {protocol\,/\,trial boundary};
+
+	%% ---- Instantiation arrows + label ----
+	\node[font=\small\itshape] at (6.35, 6.3) {instantiates};
+	\draw[darrow] (exp.east) -- (t1.west);
+	\draw[darrow] (exp.east) -- (t2.west);
+	\draw[darrow] (exp.east) -- (t3.west);
+
+\end{tikzpicture}
+\caption{One experiment protocol instantiated as a separate trial record per participant.}
+\label{fig:trial-instantiation}
+\end{figure}
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+	nodebox/.style={rectangle, draw=black, thick, fill=gray!15, align=center, text width=2.0cm, font=\small, minimum height=1.2cm, inner sep=2pt},
+	nodeboxdark/.style={rectangle, draw=black, thick, fill=gray!30, align=center, text width=1.6cm, font=\small, minimum height=1.2cm, inner sep=2pt},
+	arrow/.style={->, thick}]
+
+	% Study
+	\node[nodebox] (study) at (0, 7.0) {\textit{Study}\\Recall Study};
+
+	% Experiments
+	\node[nodebox] (nao_exp) at (-3.8, 5.0) {\textit{Experiment}\\NAO6 with Gestures};
+	\node[nodebox] (tb_exp) at (3.8, 5.0) {\textit{Experiment}\\TurtleBot with Voice};
+	\draw[arrow] (study.south) -- (nao_exp.north);
+	\draw[arrow] (study.south) -- (tb_exp.north);
+
+	% NAO steps (independent branch)
+	\node[nodebox] (nao_s1) at (-6.1, 3.0) {\textit{Step 1}\\Intro};
+	\node[nodebox] (nao_s2) at (-3.8, 3.0) {\textit{Step 2}\\Story Telling};
+	\node[nodebox] (nao_s3) at (-1.5, 3.0) {\textit{Step 3}\\Recall Test};
+	\draw[arrow] (nao_exp.south) -- (nao_s1.north);
+	\draw[arrow] (nao_exp.south) -- (nao_s2.north);
+	\draw[arrow] (nao_exp.south) -- (nao_s3.north);
+
+	% TurtleBot steps (independent branch)
+	\node[nodebox] (tb_s1) at (1.5, 3.0) {\textit{Step 1}\\Intro};
+	\node[nodebox] (tb_s2) at (3.8, 3.0) {\textit{Step 2}\\Story Telling};
+	\node[nodebox] (tb_s3) at (6.1, 3.0) {\textit{Step 3}\\Recall Test};
+	\draw[arrow] (tb_exp.south) -- (tb_s1.north);
+	\draw[arrow] (tb_exp.south) -- (tb_s2.north);
+	\draw[arrow] (tb_exp.south) -- (tb_s3.north);
+
+	% NAO: multiple real actions for Story Telling
+	\node[nodeboxdark] (nao_a1) at (-5.9, 1.0) {\textit{Action 1}\\Gesture Hand};
+	\node[nodeboxdark] (nao_a2) at (-3.8, 1.0) {\textit{Action 2}\\Gesture Head};
+	\node[nodeboxdark] (nao_a3) at (-1.7, 1.0) {\textit{Action 3}\\Speak};
+	\draw[arrow] (nao_s2.south) -- (nao_a1.north);
+	\draw[arrow] (nao_s2.south) -- (nao_a2.north);
+	\draw[arrow] (nao_s2.south) -- (nao_a3.north);
+
+	% TurtleBot: multiple real actions for Story Telling
+	\node[nodeboxdark] (tb_a1) at (1.7, 1.0) {\textit{Action 1}\\Play Audio};
+	\node[nodeboxdark] (tb_a2) at (3.8, 1.0) {\textit{Action 2}\\Beep};
+	\node[nodeboxdark] (tb_a3) at (5.9, 1.0) {\textit{Action 3}\\Speak};
+	\draw[arrow] (tb_s2.south) -- (tb_a1.north);
+	\draw[arrow] (tb_s2.south) -- (tb_a2.north);
+	\draw[arrow] (tb_s2.south) -- (tb_a3.north);
+
+\end{tikzpicture}
+\caption{A recall study with two conditions mapped onto the four-level hierarchy.}
+\label{fig:example-hierarchy}
+\end{figure}
+
+Together, these three figures illustrate why the hierarchy is useful in practice. The layered structure lets researchers define protocols at any level of granularity without writing code, which keeps the tool accessible to non-programmers (R2). The step and action levels also align naturally with trial flow, so the wizard stays guided by the protocol while retaining control over timing, which supports the real-time control requirement (R3). Action-level execution provides a natural unit for timestamped logging and post-trial analysis, satisfying the automated logging requirement (R4). Finally, keeping experiment definitions separate from trial instances means the same protocol can be reproduced across participants and conditions, supporting both the integrated workflow and collaborative support requirements (R1, R6).
+
+\section{Event-Driven Execution Model}
+
+To achieve real-time responsiveness while maintaining methodological rigor (R3, R4), the system uses an event-driven execution model rather than a time-driven one. In a time-driven approach, the system advances through actions on a fixed schedule regardless of what the participant is doing, so the robot might speak over a participant who is still talking, or move on before a response has been given. The event-driven model avoids this by letting the wizard trigger each action when the interaction is ready for it. Figure~\ref{fig:event-driven-timeline} contrasts the two approaches using the same four-action sequence: Greet (G), Begin Story (BS), Ask Question (AQ), and End (E). In the time-driven row, fixed intervals $t_0$ through $t_2$ define when each event fires, and dashed vertical lines show where those moments fall relative to the event-driven rows below. In both event-driven rows, the wizard fires the same four labeled events at different real-time positions: trial T1 (a faster participant) finishes well before trial T2 (a slower one), while both preserve the same action order.
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+	dot/.style={circle, fill=black, minimum size=6pt, inner sep=0pt},
+	tline/.style={->, thick}]
+
+	% Row y positions
+	% 3.5 = Time-Driven, 2.0 = Event-Driven T1, 0.5 = Event-Driven T2
+
+	% Timelines
+	\draw[tline] (0, 3.5) -- (11.5, 3.5);
+	\draw[tline] (0, 2.0) -- (11.5, 2.0);
+	\draw[tline] (0, 0.5) -- (11.5, 0.5);
+
+	% Row labels
+	\node[font=\small, anchor=east] at (-0.15, 3.5) {Time-Driven};
+	\node[font=\small, anchor=east] at (-0.15, 2.0) {Event-Driven (T1)};
+	\node[font=\small, anchor=east] at (-0.15, 0.5) {Event-Driven (T2)};
+
+	% Time-driven events at fixed positions
+	\node[dot] at (1.0,  3.5) {};
+	\node[dot] at (3.5,  3.5) {};
+	\node[dot] at (7.0,  3.5) {};
+	\node[dot] at (10.5, 3.5) {};
+
+	% Action labels above time-driven row
+	\node[font=\scriptsize, above=3pt] at (1.0,  3.5) {Greet};
+	\node[font=\scriptsize, above=3pt] at (3.5,  3.5) {Begin Story};
+	\node[font=\scriptsize, above=3pt] at (7.0,  3.5) {Ask Question};
+	\node[font=\scriptsize, above=3pt] at (10.5, 3.5) {End};
+
+	%% ---- Time interval braces below time-driven row ----
+	\draw[decorate, decoration={brace, amplitude=4pt, mirror}]
+		(1.0, 3.2) -- (3.5, 3.2) node[midway, below=6pt, font=\scriptsize] {$t_0$};
+	\draw[decorate, decoration={brace, amplitude=4pt, mirror}]
+		(3.5, 3.2) -- (7.0, 3.2) node[midway, below=6pt, font=\scriptsize] {$t_1$};
+	\draw[decorate, decoration={brace, amplitude=4pt, mirror}]
+		(7.0, 3.2) -- (10.5, 3.2) node[midway, below=6pt, font=\scriptsize] {$t_2$};
+
+	% Dashed vertical alignment lines
+	\draw[dashed, gray!70] (1.0,  3.35) -- (1.0,  0.35);
+	\draw[dashed, gray!70] (3.5,  3.35) -- (3.5,  0.35);
+	\draw[dashed, gray!70] (7.0,  3.35) -- (7.0,  0.35);
+	\draw[dashed, gray!70] (10.5, 3.35) -- (10.5, 0.35);
+
+	% Event-driven T1 (fast participant)
+	\node[dot] at (1.0, 2.0) {};
+	\node[dot] at (2.5, 2.0) {};
+	\node[dot] at (5.5, 2.0) {};
+	\node[dot] at (7.8, 2.0) {};
+
+	% Event-driven T1 labels
+	\node[font=\scriptsize, below=3pt] at (1.0, 2.0) {G};
+	\node[font=\scriptsize, below=3pt] at (2.5, 2.0) {BS};
+	\node[font=\scriptsize, below=3pt] at (5.5, 2.0) {AQ};
+	\node[font=\scriptsize, below=3pt] at (7.8, 2.0) {E};
+
+	% Event-driven T2 (slower participant)
+	\node[dot] at (1.0,  0.5) {};
+	\node[dot] at (4.3,  0.5) {};
+	\node[dot] at (8.5,  0.5) {};
+	\node[dot] at (10.8, 0.5) {};
+
+	% Event-driven T2 labels
+	\node[font=\scriptsize, below=3pt] at (1.0,  0.5) {G};
+	\node[font=\scriptsize, below=3pt] at (4.3,  0.5) {BS};
+	\node[font=\scriptsize, below=3pt] at (8.5,  0.5) {AQ};
+	\node[font=\scriptsize, below=3pt] at (10.8, 0.5) {E};
+
+	% Time axis label
+	\node[font=\small\itshape] at (5.75, -0.25) {time};
+
+\end{tikzpicture}
+\caption{Time-driven (top) versus event-driven (bottom, two trials) execution of the same four-action protocol.}
+\label{fig:event-driven-timeline}
+\end{figure}
+
+This approach has several implications. First, not all trials of the same experiment will have identical timing or duration; the length of a learning task, for example, depends on the participant's progress. The system records the actual timing of actions, permitting researchers to capture these natural variations in their data. Second, the event-driven model enables the wizard to respond contextually without departing from the protocol; the wizard remains guided by the sequence of available actions while having control over when to advance based on participant cues.
+
+The system guides the wizard through the protocol step-by-step, ensuring the intended sequence is followed. Every action is logged with a timestamp whether it was scripted or not, and anything outside the protocol is flagged as a deviation. This means inconsistent wizard behavior shows up in the data rather than disappearing into it.
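+
+The execution behavior described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the platform's runtime: the \texttt{TrialRunner} class, its \texttt{trigger} method, and the rule that any action other than the next scripted one counts as a deviation are assumptions made for the example. The clock is injected so that the example is deterministic.
+

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LogEntry:
    action: str
    timestamp: float
    deviation: bool        # True when the action was not the next scripted one


@dataclass
class TrialRunner:
    """Advances through a protocol only when the wizard triggers an action."""
    protocol: List[str]
    clock: Callable[[], float] = time.monotonic
    position: int = 0
    log: List[LogEntry] = field(default_factory=list)

    def next_scripted(self) -> Optional[str]:
        """The action the wizard is currently guided toward, if any remain."""
        if self.position < len(self.protocol):
            return self.protocol[self.position]
        return None

    def trigger(self, action: str) -> LogEntry:
        """Log every action; advance the protocol only for the scripted one."""
        scripted = action == self.next_scripted()
        entry = LogEntry(action, self.clock(), deviation=not scripted)
        self.log.append(entry)
        if scripted:
            self.position += 1
        return entry


# Deterministic stand-in for a real clock, in seconds.
ticks = iter([1.0, 2.5, 4.0, 5.5, 7.8])
runner = TrialRunner(
    protocol=["Greet", "Begin Story", "Ask Question", "End"],
    clock=lambda: next(ticks),
)

runner.trigger("Greet")
runner.trigger("Begin Story")
runner.trigger("Answer Participant")   # unscripted: logged as a deviation
runner.trigger("Ask Question")
runner.trigger("End")

deviations = [entry.action for entry in runner.log if entry.deviation]
```

+
+Nothing in the runner depends on elapsed time: the protocol position changes only in response to a wizard event, while the timestamps record when each event actually occurred.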
+
+\section{Modular Interface Architecture}
+
+Researchers interact with the system through three interfaces, each one encapsulating a specific phase of an experimental study: designing a protocol, running a trial, and reviewing the results.
+
+\subsection{Design Interface}
+
+The \emph{Design} interface gives researchers a drag-and-drop canvas for building experiment protocols, creating a visual programming environment. Researchers drag pre-built action components, including robot movements, speech, wizard instructions, and conditional logic, onto the canvas and drop them into sequence. Clicking a component opens a side panel where its parameters can be set, such as the text for a speech action or the gesture name for a movement.
+
+By treating experiment design as a visual specification task, the interface lowers technical barriers (R2): researchers assemble interaction logic and set parameters in plain language, without writing code. The resulting protocol specification is also human-readable and shareable alongside research results. The specification is stored in a structured format that can be displayed as a timeline for analysis and executed directly by the platform's runtime.
+
+\subsection{Execution Interface}
+
+During trials, the \emph{Execution} interface shows the wizard exactly where they are in the protocol: the current step, the available actions, and the robot's current state, all updated in real time as the trial progresses.
+
+The Execution interface also exposes a set of manual controls for actions that fall outside the scripted protocol. Consider a participant who asks an unexpected question mid-trial: the wizard can trigger an unscripted speech response on the spot rather than leaving the interaction to stall. This keeps the interaction feeling natural for the participant. Critically, the system does not simply ignore these moments. Every unscripted action is timestamped and written to the trial log as an explicit deviation, giving researchers a complete picture of what actually happened versus what was planned. This makes unscripted actions a feature rather than a source of noise: the wizard retains real-time control over the interaction, and the logging infrastructure captures everything needed for post-trial analysis.
+
+Additional researchers can simultaneously access this same live view through the platform's Dashboard by selecting a trial to ``spectate.'' Multiple researchers observing the same trial view the identical synchronized display of the wizard's controls, participant interactions, and robot state, supporting real-time collaboration and interdisciplinary observation (R6). Observers can take notes and mark significant moments without interfering with the wizard's control or the participant's experience.
+
+\subsection{Analysis Interface}
+
+After a trial concludes, the \emph{Analysis} interface lets researchers review everything that was recorded: video of the interaction, audio, timestamped action logs, and robot sensor data, all scrubbable from a single timeline. Researchers can annotate significant moments and export segments for further analysis. Because the same platform produced both the protocol and the recording, the interface can show exactly where the execution matched the design and where it deviated, without any manual cross-referencing.
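+
+The comparison between design and execution can be illustrated with a short Python sketch. The function name \texttt{compare}, the representation of a log as pairs of action name and deviation flag, and the three output categories are hypothetical; they show the kind of cross-referencing that becomes automatic when protocol and log share one format.
+

```python
from typing import Dict, List, Tuple


def compare(protocol: List[str],
            log: List[Tuple[str, bool]]) -> Dict[str, List[str]]:
    """Split a trial log into matched, deviating, and skipped actions.

    Each log entry is (action_name, is_deviation).
    """
    matched = [action for action, is_deviation in log if not is_deviation]
    return {
        "matched": matched,
        "deviations": [action for action, is_deviation in log if is_deviation],
        "skipped": [action for action in protocol if action not in matched],
    }


protocol = ["Greet", "Begin Story", "Ask Question", "End"]
trial_log = [
    ("Greet", False),
    ("Begin Story", False),
    ("Answer Participant", True),   # unscripted response
    ("End", False),                 # the wizard never triggered "Ask Question"
]

report = compare(protocol, trial_log)
```

+
+In this example the report lists one deviation and one skipped action, the two kinds of mismatch a reviewer would otherwise have to find by reading the log against the script by hand.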
+
+\section{Data Flow and Infrastructure Implementation}
+
+To ensure that data from every experimental phase remains traceable, the system organizes its internals into three architectural layers and defines a clear data pathway from protocol design through post-trial analysis, covering how experiment specifications, control commands, and recorded data move through the system.
+
+\subsection{Architectural Layers}
+
+The system is structured as three layers, each with a specific responsibility:
+
+\begin{description}
+\item[User Interface layer.] Runs in researchers' web browsers and exposes the three interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging and dropping experiment components, and reviewing experimental results.
+\item[Application Logic layer.] Operates as a server process that manages experiment data, coordinates trial execution, authenticates users, and orchestrates communication between the interface and the robot.
+\item[Data and Robot Control layer.] Encompasses long-term storage of experiment protocols and trial data, as well as direct communication with robot hardware.
+\end{description}
+
+This separation of concerns provides two concrete benefits. First, each layer can evolve independently: improving the user interface requires no changes to robot control logic, and swapping in a different storage backend requires no changes to the execution engine. Second, the separation enforces clear responsibilities: the user interface never directly commands robot hardware; all robot actions flow through the application logic layer, which maintains consistent logging. Figure~\ref{fig:three-tier} shows that HRIStudio separates interface behavior, execution logic, and robot/data operations into distinct layers with explicit boundaries.
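+
+The following Python sketch illustrates the two benefits under simplifying assumptions. The adapter classes, the command strings, and the \texttt{ApplicationLogic} class are invented for this example and do not correspond to any real robot API; the point is only that every request passes through one logging component and that the robot-specific translation can be swapped without touching it.
+

```python
from typing import List, Protocol, Tuple


class RobotAdapter(Protocol):
    """Robot control layer: turns an abstract action into a robot command."""
    def translate(self, action: str) -> str: ...


class NaoAdapter:
    # Command strings are made up for illustration only.
    COMMANDS = {"speak": "nao/say", "play_audio": "nao/audio"}

    def translate(self, action: str) -> str:
        return self.COMMANDS[action]


class TurtleBotAdapter:
    COMMANDS = {"speak": "turtlebot/tts", "play_audio": "turtlebot/sound"}

    def translate(self, action: str) -> str:
        return self.COMMANDS[action]


class ApplicationLogic:
    """Application logic layer: the only path from the interface to the robot."""

    def __init__(self, robot: RobotAdapter) -> None:
        self.robot = robot
        self.log: List[Tuple[str, str]] = []

    def request(self, action: str) -> str:
        command = self.robot.translate(action)
        self.log.append((action, command))   # logged on every request
        return command


def run(protocol: List[str], robot: RobotAdapter) -> List[Tuple[str, str]]:
    """Stand-in for the interface layer: it talks only to ApplicationLogic."""
    logic = ApplicationLogic(robot)
    for action in protocol:
        logic.request(action)
    return logic.log


protocol = ["speak", "play_audio"]
nao_log = run(protocol, NaoAdapter())
turtlebot_log = run(protocol, TurtleBotAdapter())
```

+
+The same abstract protocol produces the same sequence of logged action names on both adapters, while the robot-specific commands differ.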
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+    layer/.style={rectangle, draw=black, thick, fill, minimum width=6.5cm, minimum height=1cm, align=center, text width=6.2cm},
+    arrow/.style={->, thick, line width=1.5pt}]
+    
+    % Layer 1: UI
+    \node[layer, fill=gray!15] (ui) at (0, 3.5) {
+        \textbf{User Interface}\\[0.1cm]
+        {\small Design, Execution, Analysis}
+    };
+    
+    % Layer 2: Logic
+    \node[layer, fill=gray!30] (logic) at (0, 1.8) {
+        \textbf{Application Logic}\\[0.1cm]
+        {\small Execution, Authentication, Logger}
+    };
+    
+    % Layer 3: Data
+    \node[layer, fill=gray!45] (data) at (0, 0.1) {
+        \textbf{Data \& Robot Control}\\[0.1cm]
+        {\small Database, File Storage, ROS}
+    };
+    
+    % Arrows
+    \draw[arrow] (ui.south) -- (logic.north);
+    \draw[arrow] (logic.south) -- (data.north);
+    
+\end{tikzpicture}
+\caption{The three-layer architecture separates the user interface, application logic, and data/robot control.}
+\label{fig:three-tier}
+\end{figure}
+
+\subsection{Data Flow Through Experimental Phases}
+
+During the design phase, researchers create experiment specifications that are stored in the system database. During a trial, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete records. After the trial, researchers can inspect these records through the Analysis interface.
+
+The flow of data proceeds through six distinct phases, as shown in Figure~\ref{fig:trial-dataflow}. First, a researcher creates an experiment protocol using the Design interface. Second, when a trial begins, the application server loads the protocol and begins stepping through it, sending commands to the robot and waiting for events such as wizard inputs, sensor readings, or timeouts. Third, every action, both planned protocol steps and unexpected events, is immediately written to the trial log with precise timing information. Fourth, the Execution interface continuously displays the current state, allowing the wizard and observers to monitor the progress of a trial in real time. Fifth, when the trial concludes, all recorded media (video and audio) is transferred from the browser to the server and persisted in a database as part of the trial record. Sixth, the Analysis interface retrieves the stored trial data and reconstructs exactly what happened, synchronizing notable events with the video and audio recordings.
+
+This design ensures comprehensive documentation of every trial, supporting both fine-grained analysis and reproducibility. Researchers can review not just what they intended to happen, but what actually did happen, including timing variations and unexpected events.
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+    stage/.style={rectangle, draw, thick, rounded corners, minimum width=3.5cm, minimum height=1cm, align=center, font=\footnotesize},
+    arrow/.style={->, thick, line width=1.3pt}]
+    
+    % Six stages stacked vertically with descriptions inside
+    \node[stage, fill=gray!10] (s1) at (0, 7.5) {1. Design Protocol\\{\scriptsize Researcher creates workflow}};
+    \node[stage, fill=gray!15] (s2) at (0, 6) {2. Load \& Execute\\{\scriptsize System loads and runs trial}};
+    \node[stage, fill=gray!20] (s3) at (0, 4.5) {3. Log Events\\{\scriptsize Actions recorded with timestamps}};
+    \node[stage, fill=gray!25] (s4) at (0, 3) {4. Display Live State\\{\scriptsize Wizard sees real-time progress}};
+    \node[stage, fill=gray!30] (s5) at (0, 1.5) {5. Transfer Media\\{\scriptsize Video/audio saved to server}};
+    \node[stage, fill=gray!35] (s6) at (0, 0) {6. Analyze \& Playback\\{\scriptsize Review data with synchronized media}};
+    
+    % Downward arrows
+    \draw[arrow] (s1.south) -- (s2.north);
+    \draw[arrow] (s2.south) -- (s3.north);
+    \draw[arrow] (s3.south) -- (s4.north);
+    \draw[arrow] (s4.south) -- (s5.north);
+    \draw[arrow] (s5.south) -- (s6.north);
+    
+\end{tikzpicture}
+\caption{Trial data flow: from protocol design through execution and recording, to analysis and playback.}
+\label{fig:trial-dataflow}
+\end{figure}
+
+\subsection{Requirements Satisfaction}
+
+The design choices described in this chapter were made to meet the requirements from Chapter~\ref{ch:background}. Having the researcher work through a single platform from protocol creation to post-trial review satisfies R1 (integrated workflow: design, execution, and analysis in one environment) without extra tooling. The visual drag-and-drop Design interface removes the need for programming knowledge, satisfying R2 (low technical barriers) by keeping the system accessible to researchers without a software background. Event-driven execution satisfies R3 (real-time control) by giving the wizard control over pacing while keeping the trial on protocol. All actions are logged automatically at the system level, satisfying R4 (automated logging) without requiring researchers to add logging by hand. The three-layer architecture decouples action specifications from robot-specific commands, satisfying R5 (platform agnosticism) by letting the same protocol run on different hardware without modification. Finally, shared live views and multi-user access let interdisciplinary teams observe and annotate the same trial simultaneously, satisfying R6 (collaborative support).
+
+\section{Chapter Summary}
+
+This chapter described the architectural design with emphasis on how each design choice directly implements the infrastructure requirements identified in Chapter~\ref{ch:background}. The hierarchical organization of experiment specifications enables intuitive, executable design. The event-driven execution model balances protocol consistency with realistic interaction dynamics. The modular interface architecture separates concerns across design, execution, and analysis phases while maintaining data coherence. The integrated data flow ensures that reproducibility is supported by design rather than as an afterthought. The following chapter presents HRIStudio as a reference implementation of these design principles, discussing specific technologies and architectural components.
@@ -0,0 +1,184 @@
+\chapter{Implementation}
+\label{ch:implementation}
+
+HRIStudio is a reference implementation of the design principles established in Chapter~\ref{ch:design}. The central contribution of this work is not the tool itself but the design principles that underpin it: the hierarchical specification model, the event-driven execution model, and the integrated data flow. Any system built on those principles would satisfy the same requirements. This chapter explains how HRIStudio realizes them, covering the architectural choices and mechanisms behind how the platform stores experiments, executes trials, integrates robot hardware, and controls access. The specific technologies used in this particular implementation are presented in Appendix~\ref{app:tech_docs}.
+
+\section{Platform Architecture}
+
+HRIStudio follows the model of a web application. Users access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connects to the same shared system. This eliminates the need for a local installation and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables easy collaboration (R6): multiple team members can access experiment data and observe trials simultaneously from different machines without any additional configuration.
+
+I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is shown in Figure~\ref{fig:three-tier}. In the implementation of this architecture, it is essential that the application server and the robot control hardware run on the same local network. This keeps communication latency low during trials: a noticeable delay between the wizard's input and the robot's response would break the interaction.
+
+I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
+
+\section{Experiment Storage and Trial Logging}
+
+The system saves experiments to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification. In this chapter, a trial means one concrete run of an experiment protocol with one human subject; this is where spontaneous wizard deviations can occur.
+
+When a trial begins, the system creates a new trial record linked to that experiment. The system writes every action the wizard triggers to that record with a precise timestamp, whether the action is scripted or triggered outside the protocol, and flags unscripted actions as deviations. The Execution interface records video, audio, and robot sensor data alongside the action log for the duration of the trial. The Analysis interface can directly compare what was planned against what was executed for any trial, without any manual work by the researcher, because the trial record and the experiment reference the same underlying specification. Figure~\ref{fig:trial-record} shows the structure of a completed trial record: action log entries, video, audio, and robot sensor data all share a common timestamp reference so the Analysis interface can align them without manual synchronization; dashed lines mark step boundaries; and the system flags any deviation from the experiment specification at the appropriate position in the timeline.
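+
+A minimal sketch of how such a record could be represented, written in TypeScript to match the implementation language. The type and function names (\texttt{TrialRecord}, \texttt{ActionLogEntry}, \texttt{logAction}, \texttt{deviations}) are hypothetical and are not taken from the HRIStudio codebase. The sketch assumes that an action counts as scripted when its identifier appears in the experiment specification.

```typescript
// Hypothetical types: field names are illustrative, not HRIStudio's actual schema.
interface ActionLogEntry {
  actionId: string;    // identifier of the action the wizard triggered
  timestampMs: number; // time of the trigger, in milliseconds
  deviation: boolean;  // true when the action is not part of the specification
}

interface TrialRecord {
  experimentId: string; // link back to the reusable experiment specification
  log: ActionLogEntry[];
}

// Append an entry, flagging any action outside the scripted set as a deviation.
function logAction(
  trial: TrialRecord,
  scriptedActionIds: string[],
  actionId: string,
  timestampMs: number
): ActionLogEntry {
  const deviation = scriptedActionIds.indexOf(actionId) === -1;
  const entry: ActionLogEntry = { actionId, timestampMs, deviation };
  trial.log.push(entry);
  return entry;
}

// Return the deviations recorded in a trial, in logged order.
function deviations(trial: TrialRecord): ActionLogEntry[] {
  return trial.log.filter((entry) => entry.deviation);
}
```

+Because the deviation flag is stored with each entry, a planned-versus-executed comparison reduces to filtering the log rather than inspecting the recordings.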
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+    dot/.style={circle, fill=black, minimum size=6pt, inner sep=0pt},
+    devdot/.style={rectangle, draw=black, thick, fill=gray!50, minimum size=7pt, inner sep=0pt, rotate=45},
+    stepbox/.style={rectangle, draw=black, thick, fill=gray!15, align=center, font=\scriptsize, inner sep=3pt, minimum height=0.55cm},
+    mediabar/.style={rectangle, draw=black, thick, fill=gray!30, minimum height=0.45cm},
+    track/.style={font=\small, anchor=east}]
+
+    % Time axis
+    \draw[->, thick] (0, -0.5) -- (11.5, -0.5) node[right, font=\small\itshape] {time};
+    \node[font=\small] at (0.1, -0.8) {$t_0$};
+    \node[font=\small] at (10.9, -0.8) {$t_n$};
+
+    % Track labels
+    \node[track] at (-0.2, 5.2) {Experiment};
+    \node[track] at (-0.2, 3.9) {Action Log};
+    \node[track] at (-0.2, 2.9) {Video};
+    \node[track] at (-0.2, 1.9) {Audio};
+    \node[track] at (-0.2, 0.9) {Sensor Data};
+
+    % Track dividers
+    \foreach \y in {4.5, 3.4, 2.4, 1.4, 0.4} {
+        \draw[gray!35, thin] (0, \y) -- (11.0, \y);
+    }
+
+    % Experiment step boxes
+    \node[stepbox, minimum width=2.5cm] at (1.5, 5.2) {Intro};
+    \node[stepbox, minimum width=4.0cm] at (5.2, 5.2) {Story Telling};
+    \node[stepbox, minimum width=2.5cm] at (9.5, 5.2) {Recall Test};
+
+    % Step boundary markers
+    \draw[dashed, gray!60] (3.0, 4.5) -- (3.0, 0.4);
+    \draw[dashed, gray!60] (7.5, 4.5) -- (7.5, 0.4);
+
+    % Scripted actions
+    \node[dot] at (0.5, 3.9) {};
+    \node[dot] at (1.4, 3.9) {};
+    \node[dot] at (2.3, 3.9) {};
+    \node[dot] at (3.8, 3.9) {};
+    \node[dot] at (5.0, 3.9) {};
+    \node[dot] at (6.1, 3.9) {};
+    \node[dot] at (7.2, 3.9) {};
+    \node[dot] at (9.0, 3.9) {};
+    \node[dot] at (10.5, 3.9) {};
+
+    % Deviation marker
+    \node[devdot] at (5.6, 3.9) {};
+    \node[font=\scriptsize, above=5pt] at (5.6, 3.9) {deviation};
+
+    % Video bar
+    \node[mediabar, minimum width=10.8cm] at (5.4, 2.9) {};
+
+    % Audio bar
+    \node[mediabar, minimum width=10.8cm, fill=gray!20] at (5.4, 1.9) {};
+
+    % Sensor data (continuous sampled line)
+    \draw[thick, gray!60] plot[smooth] coordinates {
+        (0.0, 0.90) (1.0, 0.97) (2.0, 0.84) (3.0, 1.01) (4.0, 0.87)
+        (5.0, 0.96) (6.0, 0.83) (7.0, 0.99) (8.0, 0.86) (9.0, 0.95)
+        (10.0, 0.88) (11.0, 0.93)
+    };
+
+\end{tikzpicture}
+\caption{Structure of a completed trial record, showing synchronized action log, media, and sensor tracks.}
+\label{fig:trial-record}
+\end{figure}
+
+Video and audio are recorded locally in the researcher's browser during the trial rather than streamed to the server in real time. This prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the browser transfers the complete recordings to the server and associates them with the trial record. The Analysis interface can align video and audio with the logged actions without any manual synchronization, because the timestamp when recording starts is logged alongside the action log.
+
+The system stores structured and media data separately. Experiment specifications and trial records are stored in the same structured database, which makes it efficient to query across trials (for example, retrieving all trials for a specific participant or comparing action timing across conditions). Video and audio files are stored in a dedicated file store, since their size makes them unsuitable for a database and the system never queries their content directly.
+
+\section{The Execution Engine}
+
+The execution engine is the component that runs a trial: it loads the experiment, manages the wizard's connection, sends robot commands, and keeps all connected clients in sync.
+
+When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The execution engine does not advance through the actions of an experiment on a timer; instead, the wizard controls how time advances from action to action. This preserves the natural pacing of the interaction: the wizard advances only when the participant is ready, while the experiment structure ensures the protocol is followed. When the wizard triggers an action, the server sends the related command to the robot, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation, keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
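+
+The wizard-paced step described above can be sketched as follows. This is an illustration under stated assumptions, not the HRIStudio source: \texttt{RobotBridge}, \texttt{Client}, \texttt{TrialState}, and \texttt{ExecutionEngine} are hypothetical names, the protocol is reduced to a flat list of action names, and the clock is injected so the example stays deterministic.

```typescript
// Hypothetical stand-ins for the robot bridge and the connected browsers.
interface RobotBridge { send(command: string): void; }
interface TrialState { currentIndex: number; lastAction: string | null; }
interface Client { push(state: TrialState): void; }

class ExecutionEngine {
  private state: TrialState = { currentIndex: 0, lastAction: null };
  readonly log: { action: string; timestampMs: number }[] = [];

  constructor(
    private actions: string[],  // scripted actions, in protocol order
    private robot: RobotBridge,
    private clients: Client[],  // wizard and observer connections
    private now: () => number   // injected clock
  ) {}

  // Runs only when the wizard triggers the next action; there is no timer.
  advance(): TrialState {
    if (this.state.currentIndex >= this.actions.length) {
      throw new Error("Protocol already complete");
    }
    const action = this.actions[this.state.currentIndex];
    this.robot.send(action);                             // 1. command the robot
    this.log.push({ action, timestampMs: this.now() });  // 2. write the log entry
    this.state = { currentIndex: this.state.currentIndex + 1, lastAction: action };
    for (const client of this.clients) {
      client.push(this.state);                           // 3. update every view
    }
    return this.state;
  }
}
```

+Keeping the three steps inside one method call is what prevents the wizard's view, the observers' views, and the log from drifting apart in this sketch.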
+
+No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. A fully programmed robot has no answer for that third subject: the interaction stalls, or immersion breaks. The wizard exists to fill that gap: where the program runs out of instructions, the wizard draws on their knowledge of human social interaction to keep the exchange coherent. Unscripted actions give the wizard the tools to exercise that judgment in the moment. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
+
+\section{Robot Integration}
+
+A configuration file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the configuration file.
+
+The execution engine treats control-flow elements such as branches and conditionals the same way as robot actions, even though they are programming constructs rather than physical behaviors. These control-flow elements appear as action groups in the experiment and are evaluated during the trial, so researchers can freely mix logical decisions and physical robot behaviors when designing an experiment without any special handling.
+
+Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and TurtleBot as examples. Actions a platform does not support (such as \texttt{raise\_arm} on TurtleBot) appear as explicitly unsupported in the configuration file rather than silently failing. Because all hardware-specific logic lives in the configuration file, the experiment itself does not change between platforms.
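+
+A minimal sketch of this lookup, assuming a configuration shape of my own invention: \texttt{RobotConfig}, \texttt{ActionMapping}, and \texttt{resolveTopic} are hypothetical names, and the topic strings are the ones shown in Figure~\ref{fig:plugin-architecture}. The actual configuration file format is not reproduced here.

```typescript
// Hypothetical configuration shape: an action either maps to a robot topic
// or is declared explicitly unsupported.
type ActionMapping = { topic: string } | { unsupported: true };
type RobotConfig = { platform: string; actions: Record<string, ActionMapping> };

const nao6: RobotConfig = {
  platform: "NAO6",
  actions: {
    speak: { topic: "/nao/tts" },
    raise_arm: { topic: "/nao/arm" },
    move_forward: { topic: "/nao/move" },
  },
};

const turtleBot: RobotConfig = {
  platform: "TurtleBot",
  actions: {
    speak: { topic: "/tts/say" },
    raise_arm: { unsupported: true }, // declared, so it fails loudly
    move_forward: { topic: "/cmd_vel" },
  },
};

// Resolve an abstract action to a platform-specific topic, or fail with a clear error.
function resolveTopic(config: RobotConfig, action: string): string {
  const mapping = config.actions[action];
  if (mapping === undefined) {
    throw new Error(`Unknown action: ${action}`);
  }
  if ("unsupported" in mapping) {
    throw new Error(`${action} is not supported on ${config.platform}`);
  }
  return mapping.topic;
}
```

+The same abstract action name resolves differently per platform, and an unsupported action raises an error instead of being dropped.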
+
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+    expbox/.style={rectangle, draw=black, thick, fill=gray!10, align=left, font=\small, inner sep=10pt},
+    cfgbox/.style={rectangle, draw=black, thick, dashed, fill=white, align=center, font=\small\itshape, inner sep=6pt},
+    robotbox/.style={rectangle, draw=black, thick, fill=gray!25, align=left, font=\small, inner sep=10pt},
+    arrow/.style={->, thick}]
+
+    % Experiment box
+    \node[expbox] (exp) at (0, 0) {
+        \textbf{Experiment}\\[4pt]
+        \texttt{speak(text)}\\[2pt]
+        \texttt{raise\_arm()}\\[2pt]
+        \texttt{move\_forward()}
+    };
+
+    % Configuration file node (intermediate)
+    \node[cfgbox] (cfg) at (4.5, 0) {configuration\\file};
+
+    % NAO6 box
+    \node[robotbox] (nao) at (9.5, 1.6) {
+        \textbf{NAO6}\\[4pt]
+        \texttt{speak} $\to$ \texttt{/nao/tts}\\[2pt]
+        \texttt{raise\_arm} $\to$ \texttt{/nao/arm}\\[2pt]
+        \texttt{move\_forward} $\to$ \texttt{/nao/move}
+    };
+
+    % TurtleBot box
+    \node[robotbox] (tb) at (9.5, -1.6) {
+        \textbf{TurtleBot}\\[4pt]
+        \texttt{speak} $\to$ \texttt{/tts/say}\\[2pt]
+        \texttt{raise\_arm} $\to$ \textit{(not supported)}\\[2pt]
+        \texttt{move\_forward} $\to$ \texttt{/cmd\_vel}
+    };
+
+    % Arrows
+    \draw[arrow] (exp.east) -- (cfg.west);
+    \draw[arrow] (cfg.east) -- (nao.west);
+    \draw[arrow] (cfg.east) -- (tb.west);
+
+\end{tikzpicture}
+\caption{Abstract experiment actions translated to platform-specific robot commands through per-platform configuration files.}
+\label{fig:plugin-architecture}
+\end{figure}
+
+\section{Access Control}
+
+I implemented access control using a role-based access control (RBAC) model. Each study has a membership list, and each member is assigned one of four roles that define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires.
+
+\begin{description}
+    \item[Owner.] Full control over the study: can invite or remove members, configure the study settings, and access all data.
+    \item[Researcher.] Can create and modify experiment designs and review all collected trial data, but cannot manage team membership.
+    \item[Wizard.] Can trigger actions during a trial and view the execution interface, but cannot modify the experiment design or access other wizards' sessions.
+    \item[Observer.] Read-only access: can watch a trial in real time and annotate significant moments, but cannot trigger actions or modify any data.
+\end{description}
+
+The role definitions above determine who can view and change data during normal study operation. The role system also supports what is known as a double-blind design~\cite{Bartneck2024}, where neither the wizard nor the researcher has access to condition assignments or results until the study concludes. For example, the Owner can restrict a Wizard's view of which condition a human subject has been assigned to, and can prevent Researchers from accessing result data until all trials are complete, without any changes to the underlying experiment.
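+
+The four roles can be sketched as a permission table. This is a simplified illustration, not the HRIStudio authorization code: the permission names and the \texttt{can} helper are hypothetical, and the table encodes only the capabilities stated in the role descriptions above.

```typescript
type Role = "owner" | "researcher" | "wizard" | "observer";
type Permission =
  | "manage_members"
  | "edit_design"
  | "review_data"
  | "trigger_action"
  | "watch_trial"
  | "annotate";

// Hypothetical permission table derived from the role descriptions.
const permissions: Record<Role, Permission[]> = {
  owner: ["manage_members", "edit_design", "review_data", "trigger_action", "watch_trial", "annotate"],
  researcher: ["edit_design", "review_data"],
  wizard: ["trigger_action", "watch_trial"],
  observer: ["watch_trial", "annotate"],
};

// Check whether a role grants a permission.
function can(role: Role, permission: Permission): boolean {
  return permissions[role].indexOf(permission) !== -1;
}
```

+A study-level restriction, such as the double-blind setup, would then amount to the Owner narrowing an entry in this table for one study rather than changing the experiment.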
+
+\section{Architectural Challenges}
+
+The following two problems required specific solutions during implementation.
+
+\begin{description}
+    \item[Execution latency.] During a trial, the execution engine must respond quickly to wizard input --- a noticeable delay between the button press and the robot's action can disrupt the interaction. I addressed this by maintaining a persistent network connection to the robot bridge for the duration of each trial. The connection is established once at trial start and kept open, eliminating per-action setup overhead.
+
+    \item[Multi-source synchronization.] The Analysis interface requires aligning data streams captured at different sampling rates by different components: video, audio, action logs, and sensor data. The solution is a shared time reference: every data source records its timestamps relative to the same trial start time, $t_0$, so the Analysis interface can align all tracks without requiring manual calibration.
+\end{description}
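+
+The shared time reference can be illustrated with two small helpers. They are a sketch under the assumption that each component can report an absolute clock reading in milliseconds; the function names are hypothetical.

```typescript
// Offset of an event from the trial start time t0, in milliseconds.
function offsetFromStart(absoluteMs: number, t0Ms: number): number {
  return absoluteMs - t0Ms;
}

// Playback position inside a recording for a logged action, given the
// offset at which that recording started. Actions logged before the
// recording began clamp to the first frame.
function playbackPositionMs(actionOffsetMs: number, recordingStartOffsetMs: number): number {
  return Math.max(0, actionOffsetMs - recordingStartOffsetMs);
}
```

+With every track expressed as an offset from $t_0$, seeking the video to a logged action is a subtraction rather than a calibration step.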
+
+\section{Implementation Status}
+
+HRIStudio has reached minimum viable product status. The Design, Execution, and Analysis interfaces are operational. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. The platform can run a controlled WoZ study without modification to its core architecture or execution workflow.
+
+Work remaining for future development includes broader validation of the configuration file approach on robot platforms beyond NAO6.
+
+\section{Chapter Summary}
+
+This chapter described how HRIStudio realizes the design principles from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. Per-platform configuration files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is at minimum viable product status and can run a controlled WoZ study today. HRIStudio is one realization of these principles; the contribution lies in the design principles themselves, which any implementation could adopt.
@@ -1,23 +0,0 @@
-\chapter{System Design: HRIStudio Platform}
-\label{ch:design}
-
-\section{Design Goals}
-% TODO
-
-\section{High-Level Architecture}
-% TODO
-
-\section{Hierarchical Experimental Model}
-% TODO
-
-\section{Visual Experiment Designer}
-% TODO
-
-\section{Execution Interfaces}
-% TODO
-
-\section{Robot Integration and Plugins}
-% TODO
-
-\section{Data Management}
-% TODO
@@ -0,0 +1,130 @@
+\chapter{Pilot Validation Study}
+\label{ch:evaluation}
+
+This chapter presents the pilot validation study used to evaluate whether HRIStudio improves accessibility and reproducibility in WoZ-based HRI research. It defines the research questions, study design, participant roles, task, apparatus, procedure, and measurement instruments.
+
+\section{Research Questions}
+
+The evaluation targets the two problems established in Chapter~\ref{ch:background}. The first is the \emph{Accessibility Problem}: existing tools require substantial programming expertise, which prevents domain experts from conducting independent HRI studies. The second is the \emph{Reproducibility Problem}: without structured logging and protocol enforcement, experiment execution varies across participants and wizards in ways that are difficult to detect or control after the fact.
+
+These problems give rise to two research questions. The first asks whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second asks whether HRIStudio produces more reliable execution of that interaction compared to standard practice.
+
+I hypothesized that wizards using HRIStudio would implement the written specification more completely and correctly, and that their designs would execute more reliably during the trial, than wizards using the baseline tool. Choregraphe served as that baseline, representing the ad hoc programs typically created for specific social robotics experiments.
+
+\section{Study Design}
+
+I used what Bartneck et al.~\cite{Bartneck2024} call a between-subjects design, in which each participant is assigned to only one condition. I randomly assigned each wizard participant to one of two conditions: HRIStudio or Choregraphe. Both groups received the same task, the same time allocation, and the same training structure. Measuring each participant in only one condition prevents carryover effects, meaning performance changes caused by prior exposure to another condition rather than by the assigned condition itself.
+
+In this study, I defined two types of participants with distinct roles. Wizards were faculty members drawn from across departments who designed and ran the robot interaction. Test subjects were undergraduate students who interacted with the robot during the trial. This separation ensures that the evaluation captures both the design experience and the quality of the resulting interaction. The next section details recruitment, inclusion criteria, and sample rationale for both groups.
+
+\section{Participants}
+
+\textbf{Wizards.} I recruited six Bucknell University faculty members drawn from across departments to serve as wizards. I deliberately recruited from both ends of the programming experience spectrum, targeting participants with substantial programming backgrounds as well as those who described themselves as non-programmers or having minimal coding experience. This cross-departmental recruitment was intentional. A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
+
+The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email. Participation was framed as a voluntary software evaluation unrelated to any professional obligations.
+
+\textbf{Test subjects.} I recruited one undergraduate student per wizard session to serve as a test subject, for a total matching the wizard sample. Their role was to serve as the subjects for the experimental protocol coded by each wizard. To eliminate any risk of coercion, I screened participants to ensure that no test subject was enrolled in a course taught by the wizard they were paired with. Recruitment used campus flyers inviting volunteers to interact with a robot for approximately 15 minutes. There was no compensation for participation.
+
+\textbf{Sample size rationale.} With six wizard participants ($N = 6$) and a matched number of test subjects, this sample size is appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. The size matches the scope and constraints of this honors thesis: two academic semesters, one undergraduate researcher, and no funded research assistant support. It also reflects the target population and recruitment context. Faculty domain experts outside computer science with no prior NAO or Choregraphe experience are a limited pool at a small liberal arts university and have high competing time demands. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{Steinfeld2009}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof.
+
+\section{Task}
+
+Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a glowing rock on Mars, asks the human subject a comprehension question about the story, and delivers one of two responses depending on whether the answer is correct. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:materials}.
+
+The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching based on human-subject input, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
+
+\section{Robot Platform and Software Apparatus}
+
+Both conditions used the same NAO humanoid robot (Figure~\ref{fig:nao6-photo}), a platform approximately 0.58 meters tall capable of speech synthesis, animated gestures, and head movement. Using the same hardware ensured that any differences in execution quality were attributable to the software, not the robot.
+
+
+\begin{figure}[htbp]
+\centering
+\includegraphics[width=0.45\textwidth]{images/nao6.jpg}
+\caption{The NAO V6 humanoid robot used in both conditions of the pilot study.}
+\label{fig:nao6-photo}
+\end{figure}
+
+
+The control condition used Choregraphe~\cite{Pot2009}, a proprietary visual programming tool developed by Aldebaran Robotics and the standard software for NAO programming. Choregraphe organizes behavior as a finite state machine: nodes represent states and edges represent transitions triggered by conditions or timers.
+
+The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through configuration files, though for this study both tools controlled the same NAO platform.
+
+\section{Procedure}
+
+Each wizard completed a single 60-minute session structured in four phases. Each session was run by one wizard and included one test subject during the trial phase.
+
+\subsection{Phase 1: Training (15 minutes)}
+
+I opened each session with a standardized tutorial tailored to the wizard's assigned tool. The tutorial covered how to create speech actions, specify gestures, define conditional branches, and save the completed design. Training was intentionally allocated 15 minutes to allow enough time for wizards to ask clarifying questions about the tool before the design challenge began, while still simulating first encounter with a new tool without extensive onboarding. I answered clarification questions during this phase but did not offer hints about the design challenge.
+
+\subsection{Phase 2: Design Challenge (30 minutes)}
+
+The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. I observed and recorded a screen capture of the wizard's workflow throughout. Using a structured observer data sheet, I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
+
+\subsection{Phase 3: Live Trial (10 minutes)}
+
+After the design phase, a test subject entered the room and the wizard ran their completed program to control the robot during an actual interaction. I video-recorded the full trial to capture robot behavior and timing. I told the test subject they were helping evaluate the robot's performance, not being evaluated themselves. I continued logging any researcher interventions during the trial using the same type categories, noting the relevant ERS rubric item for any tool-operation intervention.
+
+\subsection{Phase 4: Debrief (5 minutes)}
+
+Following the trial, the wizard completed the System Usability Scale survey. The screen recording and video recording served as the primary artifacts for post-session scoring.
+
+\section{Measures}
+\label{sec:measures}
+
+The study collected four measures, two primary and two supplementary.
+
+\subsection{Design Fidelity Score}
+
+The Design Fidelity Score (DFS) measures how completely and correctly the wizard implemented the paper specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored as present, correct, and independently achieved.
+
+The DFS rubric includes an \emph{Assisted} column. For each rubric item, the researcher marks T if a tool-operation intervention was given specifically for that item during the design phase, for example, if the researcher explained how to add a gesture node or how to wire a conditional branch. T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. Interventions in the other categories (task clarification, hardware or technical issues, and general help such as momentary forgetfulness) are not marked T, because those categories of difficulty are independent of the tool under evaluation.
+
+This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses accessibility: did the tool allow a wizard to independently produce a correct design?
+
+\subsection{Execution Reliability Score}
+
+The Execution Reliability Score (ERS) measures whether the designed interaction executed as intended during the live trial. I reviewed the video recording against the specification and the wizard's design. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch resolved correctly based on the test subject's answer, and whether any errors, disconnections, or hangs occurred.
+
+The ERS rubric applies the same \emph{Assisted} modifier as the DFS, extended to the trial phase. Any tool-operation intervention I provided during the trial --- for example, explaining to the wizard how to launch or advance their program --- caps the affected ERS item at half points. This is scored separately from design-phase interventions: a wizard who needed help only during design can still achieve a full ERS score if the trial runs without assistance, and vice versa. The rubric also records whether the trial reached its conclusion step and whether the test subject was a recruited participant or the researcher, since foreknowledge of the specification on the part of the test subject represents a qualitatively different trial condition. I additionally note whether any branch resolved through programmed conditional logic or through manual intervention by the wizard during execution.
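+
+As a sketch of how the trial-phase cap could be applied, suppose ERS item $j$ carries a maximum of $w_j$ points and would receive $p_j \le w_j$ points from the video review alone. Both symbols are illustrative notation introduced here rather than taken from the rubric, and the sketch assumes that ``half points'' means half of the item's maximum. The recorded score for an item marked T during the trial would then be
+\[
+s_j = \min\left(p_j, \frac{w_j}{2}\right),
+\]
+while unmarked items keep $s_j = p_j$. Under this reading, a hypothetical 10-point item that executed flawlessly but required a tool-operation intervention would be recorded as 5 points.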
+
+This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses reproducibility: did the design translate reliably into execution without researcher support?
+
+\subsection{System Usability Scale}
+
+The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability~\cite{Brooke1996}. Wizards completed the SUS during the debrief phase, after the live trial. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:materials}.
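+
+For reference, the standard SUS scoring procedure can be written compactly. Let $x_i \in \{1, \ldots, 5\}$ denote the wizard's response to item $i$ on the five-point agreement scale; the symbol $x_i$ is notation introduced here for illustration. Odd-numbered items are positively worded and even-numbered items are negatively worded, so
+\[
+\mathrm{SUS} = 2.5 \left[ \sum_{i \in \{1,3,5,7,9\}} (x_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - x_i) \right].
+\]
+Each item therefore contributes between 0 and 4 points, and the factor of 2.5 rescales the 40-point maximum to the 0--100 range.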
+
+\subsection{Intervention Log and Session Timing}
+
+During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. Intervention type codes are: T (tool-operation), C (task or specification clarification), H (hardware or technical issue), and G (general or forgetfulness). Only T-type interventions affect rubric scoring; the others are recorded to provide context for interpreting session flow and wizard experience. I also recorded the actual duration of each session phase and the time at which the wizard completed or abandoned the design, providing supplementary evidence about tool accessibility beyond the DFS score itself.
+
+\section{Measurement Instruments}
+
+Table~\ref{tbl:measurement_instruments} summarizes the five instruments, when they were collected, and which research question each addresses.
+
+\begin{table}[htbp]
+\centering
+\footnotesize
+\begin{tabular}{|p{3.0cm}|p{4.4cm}|p{2.4cm}|p{2.8cm}|}
+\hline
+\textbf{Instrument} & \textbf{What it captures} & \textbf{When collected} & \textbf{Research question} \\
+\hline
+Design Fidelity Score (DFS) & Completeness and correctness of the wizard's implementation; caps items where tool-operation assistance was given & Post-session file review & Accessibility \\
+\hline
+Execution Reliability Score (ERS) & Whether the interaction executed as designed during the trial; caps items where trial-phase tool assistance occurred & Post-trial video review & Reproducibility \\
+\hline
+System Usability Scale (SUS) & Wizard's perceived usability of the assigned tool & Debrief phase & User experience \\
+\hline
+Intervention Log & Timestamped record of all researcher assistance by type (T/C/H/G) and affected rubric item & Throughout session & Supplementary \\
+\hline
+Session Timing & Actual duration of each phase; time to design completion & Throughout session & Supplementary \\
+\hline
+\end{tabular}
+\caption{Measurement instruments used in the pilot validation study.}
+\label{tbl:measurement_instruments}
+\end{table}
+
+\section{Chapter Summary}
+
+This chapter described a pilot between-subjects study I designed to test whether the design principles formalized in Chapters~\ref{ch:design} and~\ref{ch:implementation} produce measurably different outcomes from existing practice. Six wizard participants ($N = 6$), drawn from across departments and spanning the programming experience spectrum, each designed and ran the Interactive Storyteller task on a NAO robot using either HRIStudio or Choregraphe. Each 60-minute session was structured in four phases: a 15-minute standardized tutorial, a 30-minute design challenge, a 10-minute live trial, and a 5-minute debrief. I measured design fidelity (DFS) and execution reliability (ERS) against the written specification, applying a per-item scoring modifier that caps any rubric criterion for which tool-operation assistance was given. I also collected perceived usability via the SUS, a structured intervention log categorizing all researcher assistance by type, and session phase timings. Chapter~\ref{ch:results} presents the results.
@@ -1,11 +0,0 @@
-\chapter{Implementation Details}
-\label{ch:implementation}
-
-\section{Technology Stack}
-% TODO
-
-\section{Technical Challenges}
-% TODO
-
-\section{System Capabilities}
-% TODO
@@ -1,11 +0,0 @@
-\chapter{Experimental Evaluation of HRIStudio}
-\label{ch:evaluation}
-
-\section{Evaluation Goals}
-% TODO
-
-\section{Study Design}
-% TODO
-
-\section{Procedure}
-% TODO
@@ -0,0 +1,109 @@
+\chapter{Results}
+\label{ch:results}
+
+This chapter presents the results of the pilot validation study described in Chapter~\ref{ch:evaluation}. Because this is a small pilot, I report descriptive statistics and qualitative observations rather than inferential tests. The goal is directional evidence: do the patterns in the data suggest that HRIStudio changes what wizards can produce and how reliably they can produce it?
+
+\section{Participant Overview}
+
+% TODO: Update session counts when all sessions are complete.
+Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. Demographic information (programming background: programmer or non-programmer) was collected during recruitment.
+
+\begin{table}[htbp]
+\centering
+\footnotesize
+\begin{tabular}{|l|l|l|l|l|l|l|}
+\hline
+\textbf{ID} & \textbf{Condition} & \textbf{Background} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} & \textbf{Design Time} \\
+\hline
+W-01 & Choregraphe & Programmer & 70 & 65 & 60 & 35 min \\
+\hline
+W-02 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
+\hline
+W-03 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
+\hline
+W-04 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
+\hline
+\end{tabular}
+\caption{Summary of wizard participants, conditions, and scores. Rows marked PLACEHOLDER are pending completion.}
+\label{tbl:sessions}
+\end{table}
+
+\section{Primary Measures}
+
+\subsection{Design Fidelity Score}
+
+The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification. Scores range from 0 to 100, with full points awarded only when a component is both present and correct.
+
+W-01 (Choregraphe) received a DFS of 70. Analysis of the exported project file indicated that all four interaction steps were present and correctly sequenced, and the conditional branch was implemented and functional. However, W-01 deviated from the specification by modifying the color of the rock from red to a different value, causing the narrative speech and comprehension question to no longer match the written protocol. This reduced the ``Correct'' scores for speech items 2 and 3. The open-hand introduction gesture was present and correctly executed; at least one narrative gesture was included; and both branch responses were implemented, though the correct-branch response speech was also modified to reflect the changed rock color.
+
+% TODO: Add DFS scores for remaining participants and compute condition means when data collection is complete.
+% TODO: Add a bar chart or table comparing DFS by condition.
+\textit{[PLACEHOLDER: DFS results for W-02 through W-0X will be reported here. Condition means and ranges will be summarized in a table.]}
+
+\subsection{Execution Reliability Score}
+
+The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes, which was shorter than anticipated due to the design phase overrunning the scheduled window. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color, as described above. The comprehension question was delivered, the branching logic resolved correctly based on the test subject's response, and the appropriate branch response was given. Gesture synchronization was partial: the pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
+
+% TODO: Add ERS scores for remaining participants and compute condition means.
+% TODO: Note any systematic patterns in execution failures across conditions.
+\textit{[PLACEHOLDER: ERS results for W-02 through W-0X will be reported here.]}
+
+\subsection{System Usability Scale}
+
+W-01's responses yielded a SUS score of 60 for Choregraphe. The commonly cited benchmark places the average SUS score at 68; scores below 68 are generally considered to indicate below-average usability~\cite{Brooke1996}. A score of 60 suggests that W-01 found Choregraphe marginal in usability despite having a programming background, which is consistent with the large number of help requests observed during the design phase.
+
+% TODO: Add SUS scores for remaining participants. Report condition means.
+\textit{[PLACEHOLDER: SUS scores for W-02 through W-0X will be reported here.]}
+
+\section{Supplementary Measures}
+
+\subsection{Session Timing}
+
+Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
+
+\begin{table}[htbp]
+\centering
+\footnotesize
+\begin{tabular}{|l|l|l|l|l|l|}
+\hline
+\textbf{ID} & \textbf{Training} & \textbf{Design} & \textbf{Trial} & \textbf{Debrief} & \textbf{Total} \\
+\hline
+W-01 & 15 min & 35 min & 5 min & 5 min & 60 min \\
+\hline
+W-02 & --- & --- & --- & --- & --- \\
+\hline
+W-03 & --- & --- & --- & --- & --- \\
+\hline
+W-04 & --- & --- & --- & --- & --- \\
+\hline
+\end{tabular}
+\caption{Time spent in each session phase per wizard participant.}
+\label{tbl:timing}
+\end{table}
+
+W-01's design phase extended to 35 minutes, five minutes past the 30-minute allocation, compressing the trial from its planned 10 minutes to 5; the debrief ran for its scheduled 5 minutes. Despite this, W-01 declared the design complete rather than abandoning it, and the robot did execute a recognizable version of the specification during the trial.
+
+\subsection{Help Requests}
+
+% TODO: Report help request counts and types for all sessions.
+W-01 generated a substantial number of help requests during the design phase, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern --- understanding the goal but struggling with the mechanism --- is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
+
+\textit{[PLACEHOLDER: Help request counts and categories for all sessions will be reported here.]}
+
+\section{Qualitative Findings}
+
+\subsection{Observed Specification Deviation}
+
+A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when the researcher --- serving as test subject --- did not correctly identify the rock color and triggered the incorrect-answer branch. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.
+
+\subsection{Wizard Experience}
+
+% TODO: Add qualitative observations from remaining sessions.
+W-01 expressed that the training was comprehensible and that the underlying logic of the task was clear. The primary source of frustration was Choregraphe's interface for handling conditional branches and managing the timing of parallel behaviors. Post-session comments suggested that the wizard would not use Choregraphe independently for future HRI work without technical support.
+
+\textit{[PLACEHOLDER: Qualitative observations from remaining sessions will be reported here.]}
+
+\section{Chapter Summary}
+
+% TODO: Update summary when all sessions are complete.
+This chapter presented the results from the pilot validation study. To date, one Choregraphe condition session has been completed (W-01), yielding a DFS of 70, ERS of 65, and SUS of 60. Qualitative observations from this session provide preliminary evidence for both the accessibility problem (substantial help requests and design phase overrun) and the reproducibility problem (unprompted specification deviation undetected until the live trial). Remaining sessions will add data for both conditions; Chapter~\ref{ch:discussion} interprets the available findings in the context of the research questions.
@@ -0,0 +1,66 @@
+\chapter{Discussion}
+\label{ch:discussion}
+
+This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study. Where the pilot data derives from an initial subset of sessions, I treat those observations as preliminary evidence and establish the analytical framework that governs interpretation of the full dataset.
+
+\section{Interpretation of Findings}
+
+\subsection{Research Question 1: Accessibility}
+
+The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated.
+
+W-01's session offers preliminary evidence consistent with the accessibility problem described in Chapter~\ref{ch:background}. W-01 was a Digital Humanities faculty member with no programming background --- precisely the intended user population for tools like Choregraphe. Despite this framing, W-01 required significantly more time than allocated and generated a high volume of help requests, the majority of which concerned the tool's interface rather than the task itself. This distinction matters: W-01 understood what the specification required but could not efficiently translate that understanding into Choregraphe's behavior model. The finite state machine paradigm --- boxes, signals, and explicit connection routing --- imposed cognitive overhead on a domain expert who had no prior exposure to this abstraction.
+
+W-01's SUS score of 60, below the average benchmark of 68~\cite{Brooke1996}, corroborates this observation. Post-session comments indicated that the wizard would not use Choregraphe for future HRI work without technical support, despite completing the design challenge. Together these observations establish a concrete baseline: a tool nominally designed for non-programmers nonetheless required substantial researcher support, produced a high volume of interface-level help requests, and was rated below average in usability by a domain expert with no programming background.
+
+The HRIStudio sessions are evaluated against this baseline. The central comparison is whether wizards using HRIStudio produce higher DFS scores with fewer tool-operation interventions and higher SUS ratings. If HRIStudio's timeline-based interaction model reduces the interface friction observed with Choregraphe, those differences should appear across all three measures simultaneously; a pattern limited to one measure would call for a more qualified interpretation.
+
+% TODO: Replace the forward-looking framing above with the actual condition-level comparison once HRIStudio sessions are complete.
+% TODO: Report mean DFS, SUS, and T-type intervention counts per condition. Discuss what any gap implies for the accessibility claim.
+
+\subsection{Research Question 2: Reproducibility}
+
+The second research question asked whether HRIStudio produces more reliable execution of a designed interaction compared to Choregraphe. The most instructive finding from W-01's session is not a score but an incident: without any technical failure, the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the written protocol. This deviation was not caught during the design phase, was not flagged by the tool, and was only discovered during the live trial.
+
+This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.
+
+HRIStudio's protocol enforcement model is designed to prevent this class of deviation. By locking speech content at design time and presenting it to the wizard during execution rather than requiring re-entry, HRIStudio eliminates the structural opportunity for this substitution. Whether enforcement translates into measurably higher ERS scores is the empirical question the full dataset addresses. Complementing the ERS, the intervention log records whether any branch during the trial was resolved through programmed conditional logic or by manual re-routing, providing a parallel measure of execution reliability that is independent of the test subject's responses.
+
+% TODO: Replace the forward-looking framing above with actual ERS condition means once HRIStudio sessions are complete.
+% TODO: Report whether any HRIStudio sessions produced specification deviations or required trial-phase T interventions.
+
+\subsection{Session Timing and Downstream Effects}
+
+W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and compressing the trial window to approximately five minutes, well short of the intended ten. This timing pattern is itself evidence for the accessibility claim. If a tool reliably causes design phases to overrun their allocation, the downstream quality of the trial is compromised: a shorter trial produces a less complete ERS and a less representative interaction for the test subject. The difficulty of a tool does not only affect the design experience; it degrades the quality of the data that follow from it. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score.
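+
+The arithmetic behind this compression follows from the fixed 60-minute session reported in Table~\ref{tbl:timing}. With 15 minutes of training, 35 minutes of design, and 5 minutes of debrief, the time remaining for the trial was
+\[
+60 - 15 - 35 - 5 = 5 \ \mathrm{minutes},
+\]
+half of the planned 10. Because the training and debrief phases ran to schedule, the five-minute design overrun was absorbed entirely by the trial.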
+
+% TODO: Report mean design phase duration per condition and note whether overruns cluster in the Choregraphe condition.
+
+\section{Comparison to Prior Work}
+
+The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. The help request pattern observed --- conceptual understanding blocked by interface friction --- aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple.
+
+The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice --- whether it reduces deviations in practice --- is what the ERS comparison will reveal.
+
+The SUS score of 60 for Choregraphe falls below scores reported for general-purpose visual programming tools in other HCI studies, though direct comparison is complicated by task and population differences. It is consistent with the finding that domain-specific visual programming environments carry learning curves that programming experience alone does not fully resolve~\cite{Bartneck2024}.
+
+% TODO: Add HRIStudio condition SUS mean to this section and compare to the Choregraphe baseline once sessions are complete.
+
+\section{Limitations}
+
+This study has several limitations that must be considered when interpreting the findings.
+
+\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.
+
+\textbf{Researcher as test subject.} In W-01's session, the researcher served as the test subject due to participant unavailability. The researcher had foreknowledge of the specification and the study design, which may have introduced familiarity bias into the interaction. Because the DFS and ERS are scored against recordings and exported files rather than the test subject's behavior, this limitation primarily affects the qualitative character of the trial rather than the quantitative scores.
+
+\textbf{Compressed trial window.} W-01's trial lasted approximately five minutes rather than the intended ten. This limits the completeness of the ERS for that session, since several interaction steps were abbreviated under time pressure. Future sessions should enforce the transition to the trial phase at the 30-minute design mark regardless of completion status, consistent with the observer's role defined in the study protocol.
+
+\textbf{Single task.} Both conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
+
+\textbf{Condition imbalance.} Because participants were randomly assigned, the final sample may distribute programmers unevenly across conditions, confounding the comparison. With a small $N$, random assignment does not guarantee balance across programming background.
+
+\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time; future iterations may behave differently.
+
+\section{Chapter Summary}
+
+This chapter interpreted the results of the pilot study in the context of the two research questions and connected the findings to prior work. The W-01 session provides preliminary evidence for both the accessibility problem and the reproducibility problem: Choregraphe produced significant interface friction for a Digital Humanities faculty member with no programming background, and permitted a specification deviation that was undetected until the live trial. These observations are consistent with the motivating analysis in Chapter~\ref{ch:background} and anchor the comparisons that the full dataset will resolve. The limitations of this pilot study --- sample size, researcher as test subject, compressed trial window, and single task --- are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
@@ -1,8 +0,0 @@
-\chapter{Results}
-\label{ch:results}
-
-\section{Quantitative Results}
-% TODO
-
-\section{Qualitative Findings}
-% TODO
@@ -0,0 +1,48 @@
+\chapter{Conclusion and Future Work}
+\label{ch:conclusion}
+
+This thesis set out to address two persistent problems in Wizard-of-Oz-based Human-Robot Interaction research. The first is the Accessibility Problem: a high technical barrier prevents domain experts who are not programmers from conducting HRI studies independently. The second is the Reproducibility Problem: the fragmented landscape of custom tools makes it difficult to verify or replicate experimental results across studies and labs. This chapter summarizes the contributions of the work, reflects on what the pilot study results suggest, and identifies directions for future investigation.
+
+\section{Contributions}
+
+This thesis makes three contributions to the field of Human-Robot Interaction research infrastructure.
+
+\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot or institution; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.
+
+\textbf{HRIStudio: a reference implementation.} The second contribution is HRIStudio, an open-source, web-based platform that realizes the design principles described above. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO V6 used in this study. As a reference implementation, HRIStudio demonstrates that the design principles are technically feasible and can be delivered in a form that real researchers can use without programming expertise. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.
+
+\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that was undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.
+
+\section{Reflection on Research Questions}
+
+The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small-N directional study.
+
+On accessibility, the Choregraphe condition demonstrates that even a tool described as suitable for non-programmers creates significant interface friction in practice. A wizard with programming experience required more time than allocated, generated a high volume of tool-level help requests, and rated the tool below the average SUS benchmark. The finite state machine model --- boxes connected by signals --- imposed cognitive overhead that domain knowledge of the task alone could not resolve. If HRIStudio's timeline-based model and guided workflow reduce that overhead, the difference should appear as higher DFS scores, fewer tool-operation interventions, and higher SUS ratings across the full sample.
+
+On reproducibility, the specification deviation observed in the Choregraphe session illustrates why enforcement matters. A tool that allows wizards to freely edit speech content at any point in the design process creates opportunities for drift that are invisible until they surface during execution. HRIStudio's protocol enforcement forecloses this class of deviation by construction --- speech is locked at design time and surfaced during execution rather than re-entered. Whether this architectural choice translates into measurably higher execution reliability scores, and whether the proportion of tool-assisted branching resolution differs between conditions, are the questions the full dataset answers.
+
+% TODO: Once all sessions are complete, rewrite the Reflection section with actual condition means for DFS, ERS, and SUS.
+% TODO: Replace the forward-looking framing in both RQ paragraphs with concrete comparative analysis.
+% TODO: Update the chapter intro sentence ("The evidence suggests yes...") to reflect the actual direction of the findings.
+
+\section{Future Directions}
+
+The work described in this thesis suggests several directions for future investigation.
+
+\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis --- for example, whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
+
+\textbf{Multi-task evaluation.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.
+
+\textbf{Longitudinal use.} This study evaluated first-session performance, which captures the initial learning curve but not longer-term practice. A longitudinal study tracking wizard performance across multiple sessions would reveal whether HRIStudio's benefits persist or diminish as wizards become proficient, and whether the tool's structured approach continues to enforce reproducibility over time.
+
+\textbf{Observer and researcher roles.} HRIStudio's role-based architecture includes Observer and Researcher roles that were not formally evaluated in this study. Future work should investigate how these roles support team coordination in multi-experimenter studies, and whether the annotation and logging capabilities they enable produce analysis workflows that are meaningfully more efficient than manual video coding.
+
+\textbf{Platform expansion.} The NAO integration used in this study is one instance of HRIStudio's plugin architecture. Extending the plugin ecosystem to include mobile robots, socially assistive robots, and non-humanoid platforms would broaden the system's applicability and test whether the plugin abstraction is sufficiently general to accommodate the range of robot capabilities used in published HRI research.
+
+\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs --- including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.
+
+\section{Closing Remarks}
+
+The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it rigorous. Studies are conducted with custom tools that are never shared, by wizards whose behavior is never verified against a protocol, producing results that cannot be replicated because the conditions that produced them were never precisely recorded.
+
+HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable --- that a web-based platform can make WoZ research accessible to domain experts who are not engineers, and that execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.
@@ -1,11 +0,0 @@
-\chapter{Discussion}
-\label{ch:discussion}
-
-\section{Interpretation of Findings}
-% TODO
-
-\section{Comparison to Prior Work}
-% TODO
-
-\section{Limitations}
-% TODO
@@ -1,8 +0,0 @@
-\chapter{Conclusion and Future Work}
-\label{ch:conclusion}
-
-\section{Contributions}
-% TODO
-
-\section{Future Directions}
-% TODO
@@ -1,11 +1,4 @@
 \chapter{Study Materials}
 \label{app:materials}

-\section{Protocols}
-% TODO
-
-\section{IRB Materials}
-% TODO
-
-\section{Questionnaires}
-% TODO
+\textit{[PLACEHOLDER: Study materials will be inserted here. Content includes recruitment materials, paper specification, consent forms, SUS questionnaire, Design Fidelity Score rubric, Execution Reliability Score rubric, observer data sheet, and training protocol.]}
@@ -1,8 +1,49 @@
 \chapter{Technical Documentation}
 \label{app:tech_docs}

+This appendix documents the specific technologies and libraries used to build HRIStudio, organized by the three architectural layers described in Chapter~\ref{ch:design}. The goal here is reference, not justification; Chapter~\ref{ch:implementation} explains the reasoning behind the major architectural choices.
+
+\section{Technology Stack}
+
+\subsection{User Interface Layer}
+
+The frontend is built on Next.js (App Router) using React and TypeScript. TypeScript is used throughout the entire codebase, including the server and data access layers, so that type inconsistencies between layers are caught at compile time rather than at runtime. Styling is handled with Tailwind CSS and the shadcn/ui component library. The drag-and-drop canvas in the Design interface uses the \texttt{@dnd-kit} library (\texttt{@dnd-kit/core} and \texttt{@dnd-kit/sortable}) to manage nested drag operations for arranging steps and action blocks.
+
+\subsection{Application Logic Layer}
+
+The server runs as a Next.js application on Node.js. API routes use tRPC over HTTP for typed request/response calls; real-time communication during live trials uses a persistent WebSocket connection via the \texttt{ws} package. Authentication and session management are handled by NextAuth.js (v5 beta) with the \texttt{@auth/drizzle-adapter}, and passwords are hashed with bcryptjs. Currently, only credential-based (username and password) authentication is supported.
+
+\subsection{Data and Robot Control Layer}
+
+Experiment protocols, trial records, and user data are stored in PostgreSQL. The schema and all database queries are managed through Drizzle ORM, which provides compile-time type safety for database interactions. Action configuration parameters and plugin-specific fields are stored as JSONB columns, which allows the same schema to accommodate any robot's action types.
+
+Video and audio recordings captured during trials are stored in a self-hosted MinIO instance, an S3-compatible object storage service. Recordings are captured in the browser using the native MediaRecorder API (assisted by \texttt{react-webcam}) and uploaded to MinIO as a chunked transfer when the trial concludes.
+
+Robot communication is handled through a ROS Bridge (\texttt{rosbridge\_suite} or \texttt{ros2-web-bridge}) running on the robot's local network. The server connects to the bridge over a WebSocket and exchanges JSON-encoded ROS messages; it does not run as a ROS node itself. The bridge address is configured per robot in the plugin file (for example, \texttt{"rosbridgeUrl": "ws://localhost:9090"} in the NAO6 plugin).
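As an illustration of the JSON exchange described above, the following TypeScript sketch encodes two rosbridge protocol operations, \texttt{advertise} and \texttt{publish}, as the text frames a client would write to the bridge WebSocket. The \texttt{op}, \texttt{topic}, \texttt{type}, and \texttt{msg} fields follow the rosbridge v2 protocol; the topic name \texttt{/speech}, the choice of \texttt{std\_msgs/String}, and the helper names are assumptions made for this example and are not taken from the HRIStudio codebase.

```typescript
// rosbridge v2 "advertise" operation: declares a topic and its ROS message type.
interface RosbridgeAdvertise {
  op: "advertise";
  topic: string;
  type: string;
}

// rosbridge v2 "publish" operation: carries one JSON-encoded ROS message.
interface RosbridgePublish {
  op: "publish";
  topic: string;
  msg: { [field: string]: unknown };
}

// Encode an advertise frame as the JSON text sent over the WebSocket.
function encodeAdvertise(topic: string, type: string): string {
  const frame: RosbridgeAdvertise = { op: "advertise", topic: topic, type: type };
  return JSON.stringify(frame);
}

// Encode a publish frame as the JSON text sent over the WebSocket.
function encodePublish(topic: string, msg: { [field: string]: unknown }): string {
  const frame: RosbridgePublish = { op: "publish", topic: topic, msg: msg };
  return JSON.stringify(frame);
}

// Hypothetical usage: the topic and payload are illustrative only.
const advertiseFrame = encodeAdvertise("/speech", "std_msgs/String");
const speechFrame = encodePublish("/speech", { data: "Hello" });
```

In a running system these strings would be passed to the \texttt{send} method of an open WebSocket connected to the address given by \texttt{rosbridgeUrl}; the sketch stops at encoding so that it has no network dependency.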
+
 \section{Deployment}
-% TODO
+
+The full stack is orchestrated using Docker Compose. The \texttt{docker-compose.yml} file defines three services: the PostgreSQL database (\texttt{postgres:15}), the MinIO storage instance, and the Next.js application server. Starting the entire system on any machine with Docker installed requires a single \texttt{docker compose up} command. This configuration is intended for on-premises deployment, which matters for studies whose participant data cannot leave the institution's network.

 \section{Plugin Specification}
-% TODO
+
+Robot capabilities are defined in JSON plugin files. Each file describes a robot platform and the actions it supports. The structure of a plugin file is as follows:
+
+\begin{itemize}
+  \item \textbf{Metadata}: name, version, and a human-readable description of the platform.
+  \item \textbf{ROS configuration} (\texttt{ros2Config}): the bridge URL and any global connection parameters.
+  \item \textbf{Actions}: an array of action definitions. Each action specifies:
+    \begin{itemize}
+      \item A unique action type identifier (e.g., \texttt{speak}, \texttt{raise\_arm})
+      \item A human-readable label shown in the Design interface
+      \item A parameter schema defining the input fields the researcher configures
+      \item The target ROS topic and message type
+      \item A mapping from parameter names to message fields
+    \end{itemize}
+\end{itemize}
+
+When the server dispatches a robot command, it loads the active plugin, locates the matching action definition, constructs the ROS message by applying the parameter mapping, and sends it to the bridge. Adding a new robot means writing a new plugin file; no server code changes are required.
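The dispatch steps above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the HRIStudio implementation: only \texttt{ros2Config}, \texttt{rosbridgeUrl}, and the \texttt{speak} action type appear in the text, while the remaining field names (\texttt{parameterMapping}, \texttt{messageType}, and so on), the topic \texttt{/speech}, and the function \texttt{buildRosCommand} are hypothetical.

```typescript
// One action entry in a plugin file. Field names are illustrative assumptions.
interface ActionDefinition {
  type: string;        // unique action type identifier, e.g. "speak"
  label: string;       // human-readable label shown in the Design interface
  topic: string;       // target ROS topic
  messageType: string; // ROS message type
  parameterMapping: { [param: string]: string }; // parameter name -> message field
}

interface RobotPlugin {
  name: string;
  version: string;
  ros2Config: { rosbridgeUrl: string };
  actions: ActionDefinition[];
}

interface RosCommand {
  topic: string;
  messageType: string;
  msg: { [field: string]: unknown };
}

// Locate the action definition and apply its parameter mapping to build the message.
function buildRosCommand(
  plugin: RobotPlugin,
  actionType: string,
  params: { [param: string]: unknown }
): RosCommand {
  const action = plugin.actions.filter((a) => a.type === actionType)[0];
  if (!action) {
    throw new Error("Unknown action type: " + actionType);
  }
  const mapping = action.parameterMapping;
  const msg: { [field: string]: unknown } = {};
  Object.keys(mapping).forEach((param) => {
    const field = mapping[param];
    if (field === undefined) return;
    if (!(param in params)) {
      throw new Error("Missing parameter: " + param);
    }
    msg[field] = params[param];
  });
  return { topic: action.topic, messageType: action.messageType, msg: msg };
}

// Hypothetical plugin with a single action; all values are placeholders.
const exampleNaoPlugin: RobotPlugin = {
  name: "NAO6 (example)",
  version: "0.0.0",
  ros2Config: { rosbridgeUrl: "ws://localhost:9090" },
  actions: [
    {
      type: "speak",
      label: "Say text",
      topic: "/speech",
      messageType: "std_msgs/String",
      parameterMapping: { text: "data" },
    },
  ],
};
```

With this example plugin, a \texttt{speak} action whose \texttt{text} parameter is set to Hello produces a command for the \texttt{/speech} topic whose message carries Hello in its \texttt{data} field. An unknown action type or a missing parameter raises an error before anything is sent to the bridge.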
+
+\section{Role-Based Access Control}
+
+HRIStudio uses a two-layer role system. System roles (\texttt{systemRoleEnum}) govern what a user can do across the platform: \emph{administrator}, \emph{researcher}, \emph{wizard}, and \emph{observer}. Study roles (\texttt{studyMemberRoleEnum}) govern what a user can see and do within a specific study: \emph{owner}, \emph{researcher}, \emph{wizard}, and \emph{observer}. A user's system role and study role are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration.
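To make the independence of the two checks concrete, the following TypeScript sketch evaluates a system role and a study role separately before allowing an operation. The role names are those listed above; the \texttt{AccountRecord} shape, the \texttt{canRunTrial} function, and the particular sets of roles permitted to run a trial are assumptions chosen for illustration and may not match the policy HRIStudio actually applies.

```typescript
type SystemRole = "administrator" | "researcher" | "wizard" | "observer";
type StudyRole = "owner" | "researcher" | "wizard" | "observer";

// Hypothetical account shape: one system role plus one study role per study.
interface AccountRecord {
  id: string;
  systemRole: SystemRole;
  studyRoles: { [studyId: string]: StudyRole | undefined };
}

// Assumed policy for this example only: observers may not run trials at either layer.
const SYSTEM_ROLES_ALLOWED_TO_RUN: SystemRole[] = ["administrator", "researcher", "wizard"];
const STUDY_ROLES_ALLOWED_TO_RUN: StudyRole[] = ["owner", "researcher", "wizard"];

// Both layers are checked independently; failing either one denies access.
function canRunTrial(account: AccountRecord, studyId: string): boolean {
  const studyRole = account.studyRoles[studyId];
  if (studyRole === undefined) {
    return false; // not a member of this study
  }
  const systemOk = SYSTEM_ROLES_ALLOWED_TO_RUN.indexOf(account.systemRole) !== -1;
  const studyOk = STUDY_ROLES_ALLOWED_TO_RUN.indexOf(studyRole) !== -1;
  return systemOk && studyOk;
}

// A user who is a wizard on one study and an observer on another.
const exampleAccount: AccountRecord = {
  id: "user-1",
  systemRole: "wizard",
  studyRoles: { "study-a": "wizard", "study-b": "observer" },
};
```

Under the assumed policy, the example account may run trials in the first study but not in the second, and is denied outright for any study in which it has no membership.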
@@ -1,3 +1,11 @@
+@book{Baum1900,
+  title={{The Wonderful Wizard of Oz}},
+  author={Baum, L. Frank},
+  year={1900},
+  publisher={George M. Hill Company},
+  address={Chicago, IL}
+}
+
@article{Lu2011,
  title={{Polonius: A Wizard of Oz Interface for HRI Experiments}},
  author={Lu, David V. and Smart, William D.},
@@ -34,6 +42,14 @@
  publisher={IEEE}
 }

+@inproceedings{Quigley2009,
+  title={{ROS: an open-source Robot Operating System}},
+  author={Quigley, Morgan and Conley, Ken and Gerkey, Brian and Faust, Josh and Foote, Tully and Leibs, Jeremy and Wheeler, Rob and Ng, Andrew Y},
+  booktitle={IEEE International Conference on Robotics and Automation},
+  year={2009},
+  url={https://api.semanticscholar.org/CorpusID:6324125}
+}
+
@article{Riek2012,
 author = {Riek, Laurel D.},
 title = {{Wizard of Oz studies in HRI: a systematic review and new reporting guidelines}},
@@ -129,4 +145,75 @@ series = {OzCHI '15}
  keywords={Humanoid robots;Robot programming;Mobile robots;Human robot interaction;Programming environments;Prototypes;Microcomputers;Software tools;Software prototyping;Man machine systems},
  doi={10.1109/ROMAN.2009.5326209}}

+@book{Bartneck2024,
+  title={{Human-Robot Interaction -- An Introduction}},
+  author={Bartneck, Christoph and Belpaeme, Tony and Eyssel, Friederike and Kanda, Takayuki and Keijsers, Merel and {\v{S}}abanovi{\'c}, Selma},
+  year={2024},
+  edition={2nd},
+  publisher={Cambridge University Press},
+  address={Cambridge}
+}

+@inproceedings{Steinfeld2009,
+  author = {Steinfeld, Aaron and Jenkins, Odest Chadwicke and Scassellati, Brian},
+  title = {{The oz of wizard: simulating the human for interaction research}},
+  year = {2009},
+  isbn = {9781605582934},
+  publisher = {Association for Computing Machinery},
+  booktitle = {Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction},
+  pages = {101--108},
+  doi = {10.1145/1514095.1514115}
+}
+
+@inproceedings{Gibert2013,
+  author = {Gibert, Guillaume and Petit, Maxime and Lance, Florian and Pointeau, Gr{\'e}goire and Dominey, Peter F.},
+  title = {{What makes human so different? Analysis of human-humanoid robot interaction with a super wizard of oz platform}},
+  year = {2013},
+  booktitle = {Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
+  pages = {931--938},
+  doi = {10.1109/IROS.2013.6696465}
+}
+
+@article{Strazdas2020,
+  author = {Strazdas, Daniel and Hintz, Jonathan and Felßberg, Anna Maria and Al-Hamadi, Ayoub},
+  title = {{Robots and wizards: An investigation into natural human–robot interaction}},
+  journal = {IEEE Access},
+  volume = {8},
+  pages = {218808--218821},
+  year = {2020},
+  doi = {10.1109/ACCESS.2020.3042287}
+}
+
+@inproceedings{Helgert2024,
+  author = {Helgert, Anna and Straßmann, Carolin and Eimler, Sabrina C.},
+  title = {{Unlocking potentials of virtual reality as a research tool in human-robot interaction: A wizard-of-oz approach}},
+  year = {2024},
+  booktitle = {Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction},
+  pages = {123--132},
+  doi = {10.1145/3610978.3640741}
+}
+
+@InProceedings{TypeScript2014,
+author="Bierman, Gavin
+and Abadi, Mart{\'i}n
+and Torgersen, Mads",
+editor="Jones, Richard",
+title="Understanding TypeScript",
+booktitle="ECOOP 2014 -- Object-Oriented Programming",
+year="2014",
+publisher="Springer Berlin Heidelberg",
+address="Berlin, Heidelberg",
+pages="257--281",
+abstract="TypeScript is an extension of JavaScript intended to enable easier development of large-scale JavaScript applications. While every JavaScript program is a TypeScript program, TypeScript offers a module system, classes, interfaces, and a rich gradual type system. The intention is that TypeScript provides a smooth transition for JavaScript programmers---well-established JavaScript programming idioms are supported without any major rewriting or annotations. One interesting consequence is that the TypeScript type system is not statically sound by design. The goal of this paper is to capture the essence of TypeScript by giving a precise definition of this type system on a core set of constructs of the language. Our main contribution, beyond the familiar advantages of a robust, mathematical formalization, is a refactoring into a safe inner fragment and an additional layer of unsafe rules.",
+isbn="978-3-662-44202-9"
+}
+
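+@incollection{Brooke1996,
+author = {Brooke, John},
+title = {{SUS: A quick and dirty usability scale}},
+booktitle = {Usability Evaluation in Industry},
+editor = {Jordan, P. W. and Thomas, B. and Weerdmeester, B. A. and McClelland, I. L.},
+publisher = {Taylor \& Francis},
+address = {London},
+year = {1996},
+pages = {189--194}
+}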
@@ -4,6 +4,16 @@
 %\usepackage{graphics}            %Select graphics package
 \usepackage{graphicx}             %
 %\usepackage{amsthm}              %Add other packages as necessary
+\usepackage{array}                %Extended column types and \arraybackslash
+\usepackage{tabularx}             %Auto-width table columns
+\usepackage{tikz}                 %For programmatic diagrams
+\usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing}
+\usepackage[
+    hidelinks,
+    linktoc=all,
+    pdfpagemode=UseOutlines
+]{hyperref}  %Enable hyperlinks and PDF bookmarks
+\hyphenation{HRIStudio}
 \begin{document}
 \butitle{A Web-Based Wizard-of-Oz Platform for Collaborative and Reproducible Human-Robot Interaction Research}
 \author{Sean O'Connor}
@@ -13,7 +23,6 @@
 \advisorb{Brian King}
 \chair{Alan Marchiori}
 \maketitle
-\frontmatter

 \acknowledgments{
    (Draft Acknowledgments)
@@ -33,14 +42,13 @@

 \include{chapters/01_introduction}
 \include{chapters/02_background}
-\include{chapters/03_related_work}
-\include{chapters/04_reproducibility}
-\include{chapters/05_system_design}
-\include{chapters/06_implementation}
-\include{chapters/07_evaluation}
-\include{chapters/08_results}
-\include{chapters/09_discussion}
-\include{chapters/10_conclusion}
+\include{chapters/03_reproducibility}
+\include{chapters/04_system_design}
+\include{chapters/05_implementation}
+\include{chapters/06_evaluation}
+\include{chapters/07_results}
+\include{chapters/08_discussion}
+\include{chapters/09_conclusion}

 \backmatter
 %\bibliographystyle{thesis_num}   %This uses BU thesis file thesis_num.bst
@@ -60,6 +68,7 @@
 %J.~Good Phys., {\bf 2}, 294 (2004).
 %\end{thebibliography}

+\makeatletter\@mainmattertrue\makeatother
 \appendix
 \include{chapters/app_materials}
 \include{chapters/app_tech_docs}
Author	SHA1	Message	Date
soconnor	ab48109f64	feat: draft discussion chapter and update thesis structure with preliminary results and placeholder sections.	2026-04-01 17:22:53 -04:00
soconnor	96057e1bf8	Enhance architectural design, implementation, and evaluation chapters with detailed specifications and pilot validation study	2026-03-26 13:50:07 -04:00
soconnor	7757046eec	Refactor implementation and evaluation chapters for clarity and detail - Revised the implementation chapter to emphasize HRIStudio as a reference implementation of design principles, detailing architectural choices and mechanisms. - Enhanced descriptions of platform architecture, experiment storage, execution engine, and access control. - Updated evaluation chapter to reflect the study as a pilot validation study, clarifying research questions, study design, participant roles, and measures. - Improved consistency in language and structure throughout both chapters. - Added details on participant recruitment and task specifications to better contextualize the study. - Adjusted measurement instruments table to align with the new chapter title. - Updated LaTeX document to include additional TikZ library for improved diagram capabilities.	2026-03-05 23:28:59 -05:00
soconnor	4d960b0ca9	Revise evaluation chapter and appendices; enhance clarity on study design, participant roles, and consent forms for HRIStudio evaluation.	2026-03-05 14:09:57 -05:00
soconnor	fed059252c	Enhance system design and implementation chapters; clarify design decisions, improve technical documentation, and update role-based access control details.	2026-03-04 13:24:36 -05:00
soconnor	88bd10bebb	post-m06-ch04 revisions	2026-03-02 17:00:22 -05:00
soconnor	9128900bc7	Refine introduction, background, reproducibility, and implementation chapters; enhance clarity by emphasizing key challenges and updating references	2026-03-02 12:40:27 -05:00
soconnor	ad940986c7	Enhance clarity and structure in introduction, background, reproducibility, system design, and implementation chapters; add new references and include TikZ for diagrams	2026-02-23 22:24:41 -05:00
soconnor	92ef1b7ef0	Refine background and system design chapters; enhance clarity and structure in experiment protocols and trial execution descriptions	2026-02-23 13:32:09 -05:00
soconnor	a172c6ce0a	Refine introduction and background chapters; enhance clarity and structure in system design section	2026-02-22 22:13:48 -05:00
soconnor	02c40dde96	post-m04-ch03 edits	2026-02-19 23:50:30 -05:00
soconnor	5288007c8b	ch02 - merge 2.2 and 2.3	2026-02-19 23:12:33 -05:00
soconnor	c417f22209	post-m04-ch02 edits	2026-02-19 23:11:07 -05:00
soconnor	9423fc09b6	post-m04-ch01 revisions	2026-02-19 22:48:32 -05:00
soconnor	b75f31271b	Update thesis content and improve reproducibility framework - Refine introduction and background chapters for clarity and coherence. - Enhance reproducibility chapter by connecting challenges to infrastructure requirements. - Add new references to support the thesis arguments. - Update .gitignore to include IDE files. - Modify hyperref package usage to hide colored boxes in the document.	2026-02-12 10:26:49 -05:00
soconnor	b29e14c054	fill chapter03 gap, write new chapter 3	2026-02-10 00:56:19 -05:00
soconnor	a9cfd1a52c	clean up background	2026-02-10 00:45:24 -05:00
soconnor	bc8e137f5b	update introduction after milestone 3	2026-02-10 00:08:57 -05:00
soconnor	b12066c08b	Update thesis layout and front matter handling Add geometry package with explicit margins and remove manual page dimension adjustments. Enable double spacing via setspace and switch to myheadings pagestyle. Redefine \frontmatter to start roman numbering at page 4. Modify \maketitle to set page 3 and use singlespace for the title page. Remove the explicit \frontmatter call from thesis.tex.	2026-02-04 13:57:37 -05:00