refactor: update thesis protocol to remove test subjects and screen recordings, add tracking documentation, and refine bibliography entries.

This commit is contained in:
2026-04-08 22:43:20 -04:00
parent ab48109f64
commit 659a4b0683
12 changed files with 235 additions and 107 deletions
+3 -3
View File
@@ -19,14 +19,14 @@ To address the accessibility and reproducibility problems in WoZ-based HRI resea
This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for the HRI research community. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.
The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a protocol/trial separation with explicit deviation logging. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is one implementation of this architecture: an open-source reference system that realizes those principles and serves as the instrument for empirical validation.
The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a protocol/trial separation with explicit deviation logging. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is a complete realization of this architecture: an open-source, web-based platform that serves as both the primary artifact of this thesis and the instrument for empirical validation.
\section{Research Objectives}
This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established the vision and the prototype. This thesis formalizes the contribution: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a reference implementation of those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.
This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established the vision and the prototype. This thesis formalizes the contribution: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a complete platform that realizes those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.
The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and how consistently they can do it.
\section{Chapter Summary}
This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question (can a hierarchical, event-driven specification model with explicit deviation logging lower the technical barrier and improve reproducibility of WoZ experiments?) and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations.
This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question (can a hierarchical, event-driven specification model with explicit deviation logging lower the technical barrier and improve reproducibility of WoZ experiments?) and described how this thesis addresses it through formal design, a complete platform, and a pilot validation study. The next chapters establish the technical and methodological foundations.
+1 -1
View File
@@ -15,7 +15,7 @@ A second wave of tools shifted focus toward usability, often achieving accessibi
Choregraphe \cite{Pot2009}, developed by Aldebaran Robotics for the NAO and Pepper robots, offers a visual programming environment based on connected behavior boxes. Researchers can create complex interaction flows using drag-and-drop blocks without writing code in traditional programming languages. However, when new robot platforms emerge or when hardware becomes obsolete, tools like Choregraphe and WoZ4U lose their utility. Pettersson and Wik, in their review of WoZ tools \cite{Pettersson2015}, note that platform-specific systems often fall out of use as technology evolves, forcing researchers to constantly rebuild their experimental infrastructure.
Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high fidelity data collection.
Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high-fidelity data collection.
This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor. By methodological rigor, I refer to systematic features that guide experimenters toward best practices like standardized protocols, comprehensive logging, and reproducible experimental designs.
+5 -5
View File
@@ -62,9 +62,9 @@ Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The
\draw[arrow] (exp.south) -- (step.north);
%% ---- Trial column ----
\node[trial] (t1) at (7.9, 5.5) {Trial --- P01\\{\footnotesize timestamped log}};
\node[trial] (t2) at (7.9, 4.2) {Trial --- P02\\{\footnotesize timestamped log}};
\node[trial] (t3) at (7.9, 2.9) {Trial --- P03\\{\footnotesize timestamped log}};
\node[trial] (t1) at (7.9, 5.5) {Trial: P01\\{\footnotesize timestamped log}};
\node[trial] (t2) at (7.9, 4.2) {Trial: P02\\{\footnotesize timestamped log}};
\node[trial] (t3) at (7.9, 2.9) {Trial: P03\\{\footnotesize timestamped log}};
%% ---- Separator ----
\draw[gray!60, thick, dashed] (4.85, 1.8) -- (4.85, 6.6);
@@ -138,7 +138,7 @@ Together, these three figures motivate why the hierarchy is useful in practice.
\section{Event-Driven Execution Model}
To achieve real-time responsiveness while maintaining methodological rigor (R3, R5), the system uses an event-driven execution model rather than a time-driven one. In a time-driven approach, the system advances through actions on a fixed schedule regardless of what the participant is doing, so the robot might speak over a participant who is still talking, or move on before a response has been given. The event-driven model avoids this by letting the wizard trigger each action when the interaction is ready for it. Figure~\ref{fig:event-driven-timeline} contrasts the two approaches using the same four-action sequence: Greet (G), Begin Story (BS), Ask Question (AQ), and End (E). In the time-driven row, fixed intervals $t_0$ through $t_2$ define when each event fires, and dashed vertical lines show where those moments fall relative to the event-driven rows below. In both event-driven rows, the wizard fires the same four labeled events at different real-time positions --- T1 (a faster participant) finishes well before T2 (a slower one) --- while both preserve the same action order.
To achieve real-time responsiveness while maintaining methodological rigor (R3, R5), the system uses an event-driven execution model rather than a time-driven one. In a time-driven approach, the system advances through actions on a fixed schedule regardless of what the participant is doing, so the robot might speak over a participant who is still talking, or move on before a response has been given. The event-driven model avoids this by letting the wizard trigger each action when the interaction is ready for it. Figure~\ref{fig:event-driven-timeline} contrasts the two approaches using the same four-action sequence: Greet (G), Begin Story (BS), Ask Question (AQ), and End (E). In the time-driven row, fixed intervals $t_0$ through $t_2$ define when each event fires, and dashed vertical lines show where those moments fall relative to the event-driven rows below. In both event-driven rows, the wizard fires the same four labeled events at different real-time positions (T1, a faster participant, finishes well before T2, a slower one), while both preserve the same action order.
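To make the contrast concrete, the sketch below shows the two models side by side in TypeScript, the implementation language used in Chapter~\ref{ch:implementation}. The identifiers are illustrative only, not HRIStudio's actual API: a time-driven executor advances on a fixed schedule, while an event-driven executor advances only when the wizard fires the next event, preserving the action order in both cases.
\begin{verbatim}
// Illustrative sketch only; identifiers are hypothetical, not HRIStudio's API.
type Action = { label: string; run: () => Promise<void> };

// Time-driven: actions fire on a fixed schedule regardless of the
// participant's state, so the robot can talk over a slow participant.
async function runTimeDriven(actions: Action[], intervalMs: number): Promise<void> {
  for (const action of actions) {
    await action.run();
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Event-driven: the same ordered sequence, but each action waits for an
// explicit wizard trigger, so pacing follows the participant (T1 vs. T2)
// while the action order stays fixed.
async function runEventDriven(
  actions: Action[],
  nextWizardTrigger: () => Promise<void>,
): Promise<void> {
  for (const action of actions) {
    await nextWizardTrigger(); // blocks until the wizard fires the event
    await action.run();
  }
}
\end{verbatim}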
\begin{figure}[htbp]
\centering
@@ -332,4 +332,4 @@ The design choices described in this chapter were made to meet the requirements
\section{Chapter Summary}
This chapter described the architectural design with emphasis on how each design choice directly implements the infrastructure requirements identified in Chapter~\ref{ch:background}. The hierarchical organization of experiment specifications enables intuitive, executable design. The event-driven execution model balances protocol consistency with realistic interaction dynamics. The modular interface architecture separates concerns across design, execution, and analysis phases while maintaining data coherence. The integrated data flow ensures that reproducibility is supported by design rather than by afterthought. The following chapter presents HRIStudio as a reference implementation of these design principles, discussing specific technologies and architectural components.
This chapter described the architectural design with emphasis on how each design choice directly implements the infrastructure requirements identified in Chapter~\ref{ch:background}. The hierarchical organization of experiment specifications enables intuitive, executable design. The event-driven execution model balances protocol consistency with realistic interaction dynamics. The modular interface architecture separates concerns across design, execution, and analysis phases while maintaining data coherence. The integrated data flow ensures that reproducibility is supported by design rather than by afterthought. The following chapter presents HRIStudio, the platform built on these design principles, describing the specific technologies and architectural components that bring them to life.
+6 -6
View File
@@ -1,7 +1,7 @@
\chapter{Implementation}
\label{ch:implementation}
HRIStudio is a reference implementation of the design principles established in Chapter~\ref{ch:design}. The central contribution of this work is not the tool itself but the design principles that underpin it: the hierarchical specification model, the event-driven execution model, and the integrated data flow. Any system built on those principles would satisfy the same requirements. This chapter explains how HRIStudio realizes them, covering the architectural choices and mechanisms behind how the platform stores experiments, executes trials, integrates robot hardware, and controls access. The specific technologies used in this particular implementation are presented in Appendix~\ref{app:tech_docs}.
HRIStudio is a complete, operational platform that realizes the design principles established in Chapter~\ref{ch:design}. As the primary artifact of this thesis, it demonstrates that those principles are not merely theoretical: the hierarchical specification model, the event-driven execution model, and the integrated data flow can be built into a system that real researchers use without programming expertise. Any system built on those principles could satisfy the same requirements; HRIStudio is the implementation that proves they work in practice. This chapter explains how HRIStudio realizes those principles, covering the architectural choices and mechanisms behind how the platform stores experiments, executes trials, integrates robot hardware, and controls access. The specific technologies used are presented in Appendix~\ref{app:tech_docs}.
\section{Platform Architecture}
@@ -9,7 +9,7 @@ HRIStudio follows the model of a web application. Users access it through a stan
I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is shown in Figure~\ref{fig:three-tier}. In the implementation of this architecture, it is essential that the application server and the robot control hardware run on the same local network. This keeps communication latency low during trials: a noticeable delay between the wizard's input and the robot's response would break the interaction.
I implemented all three layers in the same language TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
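As a brief illustration of that benefit (a minimal sketch; the type names are hypothetical, not the platform's actual schema), a single shared module can define the shape of experiment data for both the server and the browser client:
\begin{verbatim}
// shared/types.ts: a hypothetical module imported by both the
// server and the browser client.
export type ActionKind = "speech" | "gesture" | "branch";

export interface ActionSpec {
  id: string;
  kind: ActionKind;
  payload: string; // e.g., speech text or gesture name
}

export interface StepSpec {
  name: string;
  actions: ActionSpec[];
}

// Renaming or retyping a field here makes every server handler and
// client component that touches it fail to compile, instead of
// failing at runtime in the middle of a trial.
\end{verbatim}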
\section{Experiment Storage and Trial Logging}
@@ -93,7 +93,7 @@ The system stores structured and media data separately. Experiment specification
The execution engine is the component that runs a trial: it loads the experiment, manages the wizard's connection, sends robot commands, and keeps all connected clients in sync.
When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The execution engine does not advance through the actions of an experiment on a timer; instead, the wizard controls how time advances from action to action. This preserves the natural pacing of the interaction: the wizard advances only when the participant is ready, while the experiment structure ensures the protocol is followed. When the wizard triggers an action, the server sends the related command to the robot, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The execution engine does not advance through the actions of an experiment on a timer; instead, the wizard controls how time advances from action to action. This preserves the natural pacing of the interaction: the wizard advances only when the participant is ready, while the experiment structure ensures the protocol is followed. When the wizard triggers an action, the server sends the related command to the robot, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation, keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. A fully programmed robot has no answer for that third subject: the interaction stalls, or immersion breaks. The wizard exists to fill that gap: where the program runs out of instructions, the wizard draws on their knowledge of human social interaction to keep the exchange coherent. Unscripted actions give the wizard the tools to exercise that judgment in the moment. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
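A minimal sketch of this trigger path (hypothetical identifiers, not the platform's actual code) shows how a single operation covers the robot command, the timestamped log entry with its deviation flag, and the state broadcast to connected clients:
\begin{verbatim}
// Illustrative sketch; identifiers are hypothetical.
interface LogEntry {
  actionId: string;
  timestampMs: number; // relative to trial start
  deviation: boolean;  // true for unscripted, wizard-initiated actions
}

async function triggerAction(
  trial: { startMs: number; log: LogEntry[] },
  robot: { send: (cmd: string) => Promise<void> },
  clients: { broadcast: (state: object) => void },
  actionId: string,
  command: string,
  unscripted = false,
) {
  await robot.send(command);                      // robot command
  trial.log.push({                                // timestamped log entry
    actionId,
    timestampMs: Date.now() - trial.startMs,
    deviation: unscripted,                        // deviation flag
  });
  clients.broadcast({ currentAction: actionId }); // sync all connected views
}
\end{verbatim}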
@@ -168,17 +168,17 @@ The role definitions above determine who can view and change data during normal
The following two problems required specific solutions during implementation.
\begin{description}
\item[Execution latency.] During a trial, the execution engine must respond quickly to wizard input --- a noticeable delay between the button press and the robot's action can disrupt the interaction. I addressed this by maintaining a persistent network connection to the robot bridge for the duration of each trial. The connection is established once at trial start and kept open, eliminating per-action setup overhead.
\item[Execution latency.] During a trial, the execution engine must respond quickly to wizard input, as a noticeable delay between the button press and the robot's action can disrupt the interaction. I addressed this by maintaining a persistent network connection to the robot bridge for the duration of each trial. The connection is established once at trial start and kept open, eliminating per-action setup overhead.
\item[Multi-source synchronization.] The Analysis interface requires aligning data streams captured at different sampling rates by different components: video, audio, action logs, and sensor data. The solution is a shared time reference: every data source records its timestamps relative to the same trial start time, $t_0$, so the Analysis interface can align all tracks without requiring manual calibration (a brief sketch follows this list).
\end{description}
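The alignment itself reduces to subtracting the shared trial start $t_0$ from each source's timestamps. A minimal sketch in TypeScript, with hypothetical names:
\begin{verbatim}
// Illustrative sketch; stream and field names are hypothetical.
interface Sample { tMs: number; value: unknown }

// Each source records absolute timestamps; subtracting the shared
// trial start t0 puts every track on the same clock.
function alignToTrialStart(samples: Sample[], t0Ms: number): Sample[] {
  return samples.map((s) => ({ ...s, tMs: s.tMs - t0Ms }));
}

// Video, audio, action logs, and sensor data can then be merged and
// sorted on the common tMs axis without manual calibration.
\end{verbatim}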
\section{Implementation Status}
HRIStudio has reached minimum viable product status. The Design, Execution, and Analysis interfaces are operational. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. The platform can run a controlled WoZ study without modification to its core architecture or execution workflow.
HRIStudio is fully operational for controlled Wizard-of-Oz studies. The Design, Execution, and Analysis interfaces are complete and integrated. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. A researcher can design an experiment, run a live trial with a wizard, and review the resulting logs and recordings without modification to the platform's core architecture or execution workflow.
Work remaining for future development includes broader validation of the configuration file approach on robot platforms beyond NAO6.
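As a purely illustrative sketch of the configuration-file idea (the field names are hypothetical, not HRIStudio's actual schema), a per-platform definition might map the platform-agnostic action vocabulary onto robot-specific commands:
\begin{verbatim}
// Hypothetical sketch of a per-platform configuration; the field
// names are illustrative, not HRIStudio's actual schema.
const nao6Config = {
  platform: "NAO6",
  connection: { host: "nao.local", port: 9559 },
  // Maps the platform-agnostic action vocabulary onto robot-specific
  // commands; a new robot would supply its own mapping here.
  actions: {
    say:  { command: "tts.say" },
    wave: { command: "motion.animate" },
  },
};
\end{verbatim}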
\section{Chapter Summary}
This chapter described how HRIStudio realizes the design principles from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. Per-platform configuration files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is at minimum viable product status and can run a controlled WoZ study today. HRIStudio is one realization of these principles; the contribution lies in the design principles themselves, which any implementation could adopt.
This chapter described how HRIStudio realizes the design principles from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. Per-platform configuration files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is fully operational for controlled WoZ studies today, demonstrated through the pilot validation study presented in Chapter~\ref{ch:evaluation}. The design principles are general; HRIStudio shows they are workable.
+29 -22
View File
@@ -1,7 +1,7 @@
\chapter{Pilot Validation Study}
\label{ch:evaluation}
This chapter presents the pilot validation study used to evaluate whether HRIStudio improves accessibility and reproducibility in WoZ-based HRI research. It defines the research questions, study design, participant roles, task, apparatus, procedure, and measurement instruments.
This chapter presents the pilot validation study used to evaluate whether HRIStudio improves accessibility and reproducibility in WoZ-based HRI research. It defines the research questions, study design, task, apparatus, procedure, and measurement instruments.
\section{Research Questions}
@@ -9,29 +9,25 @@ The evaluation targets the two problems established in Chapter~\ref{ch:backgroun
These problems give rise to two research questions. The first asks whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second asks whether HRIStudio produces more reliable execution of that interaction compared to standard practice.
I hypothesized that wizards using HRIStudio would more completely and correctly implement the written specification, and that their designs would execute more reliably during the trial, compared to wizards using ad hoc programs created for specific social robotics experiments, with Choregraphe as the baseline tool in this study.
I hypothesized that wizards using HRIStudio would more completely and correctly implement the written specification, and that their designs would execute more reliably during the trial, compared to wizards using ad-hoc programs created for specific social robotics experiments, with Choregraphe as the baseline tool in this study.
\section{Study Design}
I used what Bartneck et al.~\cite{Bartneck2024} call a between-subjects design, in which each participant is assigned to only one condition. I randomly assigned each wizard participant to one of two conditions: HRIStudio or Choregraphe. Both groups received the same task, the same time allocation, and the same training structure. Measuring each participant in only one condition prevents carryover effects, meaning performance changes caused by prior exposure to another condition rather than by the assigned condition itself.
In this study, I defined two types of participants with distinct roles. Wizards were faculty members drawn from across departments who designed and ran the robot interaction. Test subjects were undergraduate students who interacted with the robot during the trial. This separation ensures that the evaluation captures both the design experience and the quality of the resulting interaction. The next section details recruitment, inclusion criteria, and sample rationale for both groups.
\section{Participants}
\textbf{Wizards.} I recruited six Bucknell University faculty members drawn from across departments to serve as wizards. I deliberately recruited from both ends of the programming experience spectrum, targeting participants with substantial programming backgrounds as well as those who described themselves as non-programmers or having minimal coding experience. This cross-departmental recruitment was intentional. A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email. Participation was framed as a voluntary software evaluation unrelated to any professional obligations.
\textbf{Test subjects.} I recruited one undergraduate student per wizard session to serve as a test subject, for a total matching the wizard sample. Their role was to serve as the subjects for the experimental protocol coded by each wizard. To eliminate any risk of coercion, I screened participants to ensure that no test subject was enrolled in a course taught by the wizard they were paired with. Recruitment used campus flyers inviting volunteers to interact with a robot for approximately 15 minutes. There was no compensation for participation.
\textbf{Sample size rationale.} With six wizard participants ($N = 6$) and a matched number of test subjects, this sample size is appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. The size matches the scope and constraints of this honors thesis: two academic semesters, one undergraduate researcher, and no funded research assistant support. It also reflects the target population and recruitment context. Faculty domain experts outside computer science with no prior NAO or Choregraphe experience are a limited pool at a small liberal arts university and have high competing time demands. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{Steinfeld2009}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof.
\textbf{Sample size rationale.} With six wizard participants ($N = 6$), this sample size is appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. The size matches the scope and constraints of this honors thesis: two academic semesters, one undergraduate researcher, and no funded research assistant support. It also reflects the target population and recruitment context. Faculty domain experts outside computer science with no prior NAO or Choregraphe experience are a limited pool at a small liberal arts university and have high competing time demands. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{HoffmanZhao2021}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof.
\section{Task}
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a glowing rock on Mars, asks the human subject a comprehension question about the story, and delivers one of two responses depending on whether the answer is correct. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:materials}.
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a glowing rock on Mars, asks a comprehension question, and delivers a response according to the answer given. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:blank_templates}.
The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching based on human-subject input, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
\section{Robot Platform and Software Apparatus}
@@ -41,7 +37,7 @@ Both conditions used the same NAO humanoid robot (Figure~\ref{fig:nao6-photo}),
\begin{figure}[htbp]
\centering
\includegraphics[width=0.45\textwidth]{images/nao6.jpg}
\caption{The NAO V6 humanoid robot used in both conditions of the pilot study.}
\caption{The NAO6 humanoid robot used in both conditions of the pilot study.}
\label{fig:nao6-photo}
\end{figure}
@@ -52,7 +48,7 @@ The experimental condition used HRIStudio, described in Chapter~\ref{ch:implemen
\section{Procedure}
Each wizard completed a single 60-minute session structured in four phases. Each session was run by one wizard and included one test subject during the trial phase.
Each wizard completed a single 60-minute session structured in four phases.
\subsection{Phase 1: Training (15 minutes)}
@@ -60,15 +56,15 @@ I opened each session with a standardized tutorial tailored to the wizard's assi
\subsection{Phase 2: Design Challenge (30 minutes)}
The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. I observed and recorded a screen capture of the wizard's workflow throughout. Using a structured observer data sheet, I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. Using a structured observer data sheet, I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
\subsection{Phase 3: Live Trial (10 minutes)}
After the design phase, a test subject entered the room and the wizard ran their completed program to control the robot during an actual interaction. I video-recorded the full trial to capture robot behavior and timing. I told the test subject they were helping evaluate the robot's performance, not being evaluated themselves. I continued logging any researcher interventions during the trial using the same type categories, noting the relevant ERS rubric item for any tool-operation intervention.
After the design phase, the wizard ran their completed program to execute the designed interaction on the robot. I continued logging any researcher interventions during the trial using the same type categories, noting the relevant ERS rubric item for any tool-operation intervention.
\subsection{Phase 4: Debrief (5 minutes)}
Following the trial, the wizard completed the System Usability Scale survey. The screen recording and video recording served as the primary artifacts for post-session scoring.
Following the trial, the wizard completed the System Usability Scale survey. The DFS and ERS were scored during and immediately after the session using live observation and the Observer Data Sheet.
\section{Measures}
\label{sec:measures}
@@ -79,28 +75,39 @@ The study collected four measures, two primary and two supplementary.
The Design Fidelity Score (DFS) measures how completely and correctly the wizard implemented the paper specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored as present, correct, and independently achieved.
The DFS rubric includes an \emph{Assisted} column. For each rubric item, the researcher marks T if a tool-operation intervention was given specifically for that item during the design phase --- for example, if the researcher explained how to add a gesture node or how to wire a conditional branch. T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions --- task clarification, hardware issues, or momentary forgetfulness --- are not marked T, because those categories of difficulty are independent of the tool under evaluation.
The DFS rubric includes an \emph{Assisted} column. For each rubric item, the researcher marks T if a tool-operation intervention was given specifically for that item during the design phase (for example, if the researcher explained how to add a gesture node or how to wire a conditional branch). T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions (task clarification, hardware issues, or momentary forgetfulness) are not marked T, because those categories of difficulty are independent of the tool under evaluation.
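To make the scoring rule concrete, the sketch below shows the arithmetic under the half-point convention applied in the results chapter; the field names and weights are placeholders, and the authoritative rubric appears in Appendix~\ref{app:blank_templates}:
\begin{verbatim}
// Placeholder sketch; weights are illustrative, not the actual rubric.
interface DfsItem {
  present: boolean;  // the component appears in the exported design
  correct: boolean;  // it matches the written specification
  weight: number;    // the item's share of the 100-point total
  assisted: boolean; // T-type mark; reported separately, never scored
}

function scoreDfs(items: DfsItem[]): { points: number; assists: number } {
  let points = 0;
  let assists = 0;
  for (const item of items) {
    // Half points for present-but-incorrect, per the results convention.
    if (item.present) points += item.correct ? item.weight : item.weight / 2;
    if (item.assisted) assists += 1; // parallel record; no effect on points
  }
  return { points, assists };
}
\end{verbatim}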
This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses accessibility: did the tool allow a wizard to independently produce a correct design?
This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses accessibility: did the tool allow a wizard to independently produce a correct design?
\subsection{Execution Reliability Score}
The Execution Reliability Score (ERS) measures whether the designed interaction executed as intended during the live trial. I reviewed the video recording against the specification and the wizard's design. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch resolved correctly based on the test subject's answer, and whether any errors, disconnections, or hangs occurred.
The Execution Reliability Score (ERS) measures whether the designed interaction executed as intended during the live trial. I scored the ERS live and immediately after the session, using the Observer Data Sheet and the wizard's exported project file. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch was present in the design and executed during the trial, and whether any errors, disconnections, or hangs occurred.
The ERS rubric applies the same \emph{Assisted} modifier as the DFS, extended to the trial phase. Any tool-operation intervention I provided during the trial --- for example, explaining to the wizard how to launch or advance their program --- caps the affected ERS item at half points. This is scored separately from design-phase interventions: a wizard who needed help only during design can still achieve a full ERS score if the trial runs without assistance, and vice versa. The rubric also records whether the trial reached its conclusion step and whether the test subject was a recruited participant or the researcher, since foreknowledge of the specification on the part of the test subject represents a qualitatively different trial condition. I additionally note whether any branch resolved through programmed conditional logic or through manual intervention by the wizard during execution.
The ERS rubric applies the same \emph{Assisted} modifier as the DFS, extended to the trial phase. Any tool-operation intervention I provided during the trial (for example, explaining to the wizard how to launch or advance their program) caps the affected ERS item at half points. This is scored separately from design-phase interventions: a wizard who needed help only during design can still achieve a full ERS score if the trial runs without assistance, and vice versa. The rubric records whether the trial reached its conclusion step. I additionally note whether any branch resolved through programmed conditional logic or through manual intervention by the wizard during execution.
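The cap rule can be stated precisely. A minimal sketch of the arithmetic (placeholder names; the authoritative rubric is in Appendix~\ref{app:blank_templates}):
\begin{verbatim}
// Placeholder sketch; weights are illustrative, not the actual rubric.
interface ErsItem {
  points: number;         // points earned on the item's own merits
  maxPoints: number;      // the item's weight in the rubric
  trialAssisted: boolean; // T-type intervention during the trial phase
}

// A trial-phase tool-operation intervention caps the affected item at
// half of its maximum; design-phase assistance is tracked separately.
function scoreErs(items: ErsItem[]): number {
  return items.reduce((total, item) => {
    const capped = item.trialAssisted
      ? Math.min(item.points, item.maxPoints / 2)
      : item.points;
    return total + capped;
  }, 0);
}
\end{verbatim}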
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses reproducibility: did the design translate reliably into execution without researcher support?
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses reproducibility: did the design translate reliably into execution without researcher support?
\subsection{System Usability Scale}
The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability \cite{Brooke1996}. Wizards completed the SUS after the debrief phase. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:materials}.
The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability \cite{Brooke1996}. Wizards completed the SUS after the debrief phase. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:blank_templates}.
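For reference, the standard SUS computation from Brooke~\cite{Brooke1996} maps each 1--5 response $r_i$ onto a 0--4 contribution, with odd (positively worded) and even (negatively worded) items scored in opposite directions, and scales the sum to the 0--100 range:
\[
\mathrm{SUS} = 2.5 \left( \sum_{i \in \{1,3,5,7,9\}} (r_i - 1) \; + \; \sum_{i \in \{2,4,6,8,10\}} (5 - r_i) \right)
\]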
\subsection{Intervention Log and Session Timing}
During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. Intervention type codes are: T (tool-operation), C (task or specification clarification), H (hardware or technical issue), and G (general or forgetfulness. Only T-type interventions affect rubric scoring; the others are recorded to provide context for interpreting session flow and wizard experience. I also recorded the actual duration of each session phase and the time at which the wizard completed or abandoned the design, providing supplementary evidence about tool accessibility beyond the DFS score itself.
During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. The four intervention type codes are:
\begin{description}
\item[T (tool-operation).] The researcher explained how to operate a specific feature of the assigned software tool.
\item[C (task clarification).] The researcher clarified the written specification or an aspect of the task design.
\item[H (hardware or technical).] The researcher addressed a robot connection issue or other technical problem outside the wizard's control.
\item[G (general).] Brief assistance not attributable to the tool or the task, such as momentary forgetfulness.
\end{description}
Only T-type interventions affect rubric scoring; the others are recorded to provide context for interpreting session flow and wizard experience. I also recorded the actual duration of each session phase and the time at which the wizard completed or abandoned the design, providing supplementary evidence about tool accessibility beyond the DFS score itself.
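A single entry in this log can be represented compactly; the following sketch uses hypothetical field names that mirror the values recorded on the sheet:
\begin{verbatim}
// Hypothetical field names mirroring the values recorded on the sheet.
type InterventionType = "T" | "C" | "H" | "G";

interface InterventionEntry {
  timestamp: string;      // when the assistance occurred in the session
  type: InterventionType; // only "T" entries affect rubric scoring
  rubricItem?: number;    // affected rubric item, recorded for T entries
  description: string;    // brief note on the assistance given
}
\end{verbatim}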
\section{Measurement Instruments}
The four primary and supplementary measures are designed to work together. The DFS and ERS address separate phases of the session: DFS captures what was designed, and ERS captures whether that design translated faithfully into execution. Taken together, they make it possible to distinguish a wizard who implemented the specification correctly but whose design failed during the trial from one whose design was incomplete but executed without researcher assistance. The SUS grounds both scores in the wizard's subjective experience of the tool. The intervention log and session timing are supplementary: they do not directly answer the research questions but provide context for interpreting the primary scores, particularly for understanding whether help requests concerned the tool itself or the task.
Table~\ref{tbl:measurement_instruments} summarizes the five instruments, when they were collected, and which research question each addresses.
\begin{table}[htbp]
@@ -112,7 +119,7 @@ Table~\ref{tbl:measurement_instruments} summarizes the five instruments, when th
\hline
Design Fidelity Score (DFS) & Completeness and correctness of the wizard's implementation; tool-operation assistance marked separately, without affecting points & Post-session file review & Accessibility \\
\hline
Execution Reliability Score (ERS) & Whether the interaction executed as designed during the trial; caps items where trial-phase tool assistance occurred & Post-trial video review & Reproducibility \\
Execution Reliability Score (ERS) & Whether the interaction executed as designed during the trial; caps items where trial-phase tool assistance occurred & Live and post-trial (Observer Data Sheet) & Reproducibility \\
\hline
System Usability Scale (SUS) & Wizard's perceived usability of the assigned tool & Debrief phase & User experience \\
\hline
+90 -32
View File
@@ -5,26 +5,30 @@ This chapter presents the results of the pilot validation study described in Cha
\section{Participant Overview}
% TODO: Update session counts when all sessions are complete.
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. Demographic information (programming background: programmer or non-programmer) was collected during recruitment.
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University faculty members recruited from departments across the university. Demographic information (home department and programming experience) was collected during recruitment.
% TODO: Fill in W-06 row once session is complete.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Condition} & \textbf{Background} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} & \textbf{Design Time} \\
\textbf{ID} & \textbf{Condition} & \textbf{Background} & \makecell[l]{\textbf{Programming}\\\textbf{Experience}} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} \\
\hline
W-01 & Choregraphe & Programmer & 70 & 65 & 60 & 35 min \\
W-01 & Choregraphe & Digital Humanities & None & 42.5 & 65 & 60 \\
\hline
W-02 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
W-02 & HRIStudio & Computer Science & Moderate & 100 & 95 & 90 \\
\hline
W-03 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
W-03 & Choregraphe & Computer Science & Extensive & 65 & 60 & 75 \\
\hline
W-04 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
W-04 & Choregraphe & Chemical Engineering & Moderate & 62.5 & 75 & 42.5 \\
\hline
W-05 & HRIStudio & Chemical Engineering & None & 100 & 95 & 70 \\
\hline
W-06 & HRIStudio & Computer Science & Extensive & --- & --- & --- \\
\hline
\end{tabular}
\caption{Summary of wizard participants, conditions, and scores. Rows marked PLACEHOLDER are pending completion.}
\caption{Summary of wizard participants, assigned conditions, and scores.}
\label{tbl:sessions}
\end{table}
@@ -34,26 +38,47 @@ W-04 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification. Scores range from 0 to 100, with full points awarded only when a component is both present and correct.
W-01 (Choregraphe) received a DFS of 70. Analysis of the exported project file indicated that all four interaction steps were present and correctly sequenced, and the conditional branch was implemented and functional. However, W-01 deviated from the specification by modifying the color of the rock from red to a different value, causing the narrative speech and comprehension question to no longer match the written protocol. This reduced the ``Correct'' scores for speech items 2 and 3. The open-hand introduction gesture was present and correctly executed; at least one narrative gesture was included; and both branch responses were implemented, though the correct-branch response speech was also modified to reflect the changed rock color.
W-01 (Choregraphe) received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
% TODO: Add DFS scores for remaining participants and compute condition means when data collection is complete.
% TODO: Add a bar chart or table comparing DFS by condition.
\textit{[PLACEHOLDER: DFS results for W-02 through W-0X will be reported here. Condition means and ranges will be summarized in a table.]}
W-02 (HRIStudio, programmer) received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation.
W-03 (Choregraphe, programmer) received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, and not all of those were delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation.
W-04 (Choregraphe, Chemical Engineering, moderate programmer) received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15).
W-05 (HRIStudio, Chemical Engineering, no programming experience) received a DFS of 100. The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification.
% TODO: Add DFS scores and per-item breakdown for W-06 when complete.
% TODO: Add condition means once W-06 is complete.
\subsection{Execution Reliability Score}
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes, which was shorter than anticipated due to the design phase overrunning the scheduled window. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color, as described above. The comprehension question was delivered, the branching logic resolved correctly based on the test subject's response, and the appropriate branch response was given. Gesture synchronization was partial: the pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore run the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
% TODO: Add ERS scores for remaining participants and compute condition means.
% TODO: Note any systematic patterns in execution failures across conditions.
\textit{[PLACEHOLDER: ERS results for W-02 through W-0X will be reported here.]}
W-02 (HRIStudio) received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. The incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item nonetheless scored 5 out of 10: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred.
W-03 (Choregraphe) received an ERS of 60. The trial ran for approximately five minutes. Speech execution was partial: only two of the three specified items were present, and not all of those were delivered correctly. Gesture and speech synchronization was poor throughout the interaction; motion cues were present but did not coordinate reliably with corresponding speech actions. The conditional branch, absent from W-03's design, was not executed during the trial; the interaction proceeded without a branch resolution step. No system disconnections or crashes occurred.
W-04 (Choregraphe) received an ERS of 75. The trial ran for approximately four minutes. Introduction and narrative speech executed correctly. The comprehension question, absent from the design, was not delivered; the interaction proceeded directly to the branch step. A T-type trial intervention was required to remind W-04 how to trigger the branch; the yes-branch response was delivered following that prompt, capping item 4 at 5/10 (T-assisted). Gesture execution was strong: introduction wave, narrative gesture, and nod or head shake all executed correctly. Speech and gesture synchronization scored full points. The pause before the comprehension question scored zero, as no question was delivered. No system errors occurred.
W-05 (HRIStudio) received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. One trial intervention pair was logged: the wizard briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute robot-side on the NAO6; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed.
% TODO: Add ERS score and breakdown for W-06 when complete.
% TODO: Once all sessions complete, report condition ERS means and note patterns in execution failures across conditions.
\subsection{System Usability Scale}
W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS scores places 68 as the average; scores below 68 are generally considered below average usability~\cite{Brooke1996}. A score of 60 suggests that W-01 found Choregraphe marginal in usability despite having a programming background, which is consistent with the large number of help requests observed during the design phase.
W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS scores places 68 as the average; scores below 68 are generally considered below average usability~\cite{Brooke1996}. A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.
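For reference, the SUS values reported in this section follow Brooke's standard scoring procedure: each of the ten items is rated on a 1--5 scale, odd-numbered (positively worded) items contribute $s_i - 1$, even-numbered (negatively worded) items contribute $5 - s_i$, and the sum is scaled by 2.5 onto a 0--100 range:
\[
\mathrm{SUS} \;=\; 2.5 \left( \sum_{i \in \{1,3,5,7,9\}} (s_i - 1) \;+\; \sum_{i \in \{2,4,6,8,10\}} (5 - s_i) \right).
\]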
% TODO: Add SUS scores for remaining participants. Report condition means.
\textit{[PLACEHOLDER: SUS scores for W-02 through W-0X will be reported here.]}
W-02 rated HRIStudio with a SUS score of 90, well above the average benchmark of 68 and the highest score observed so far. W-02, a programmer with a combined CS and psychology background, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.
W-03 rated Choregraphe with a SUS score of 75, above the average benchmark of 68. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.
W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study and well below the average benchmark of 68. Researcher notes recorded that W-04 attempted the task with evident self-driven engagement but that the platform appeared to get in the way. W-04 was a motivated wizard who exceeded the time allocation without completing the design and who required four T-type interventions; that gap between effort and outcome is directly reflected in this rating.
W-05 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.
% TODO: Add SUS score for W-06 when complete. Then report condition means.
\section{Supplementary Measures}
@@ -61,6 +86,7 @@ W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS sc
Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
% TODO: Fill in W-06 timing row once session is complete.
\begin{table}[htbp]
\centering
\footnotesize
@@ -70,40 +96,72 @@ Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
\hline
W-01 & 15 min & 35 min & 5 min & 5 min & 60 min \\
\hline
W-02 & --- & --- & --- & --- & --- \\
W-02 & 7 min & 24 min & 5 min & 5 min & 41 min \\
\hline
W-03 & --- & --- & --- & --- & --- \\
W-03 & 12 min & 37 min & 5 min & 5 min & 59 min \\
\hline
W-04 & --- & --- & --- & --- & --- \\
W-04 & 17 min & 35 min & 4 min & 4 min & 60 min \\
\hline
W-05 & 6 min & 18 min & 4 min & 4 min & 32 min \\
\hline
W-06 & --- & --- & --- & --- & --- \\
\hline
\end{tabular}
\caption{Time spent in each session phase per wizard participant.}
\label{tbl:timing}
\end{table}
W-01's design phase extended to 35 minutes, nearly double the 20-minute allocation, compressing the trial and debrief to 5 minutes each. Despite this, W-01 declared the design complete rather than abandoning it, and the robot did execute a recognizable version of the specification during the trial.
W-01's design phase extended to 35 minutes, five minutes over the 30-minute allocation, compressing the trial and debrief to 5 minutes each. Despite this, W-01 declared the design complete rather than abandoning it, and the robot executed a recognizable version of the specification during the trial.
\subsection{Help Requests}
W-02's training phase concluded in 7 minutes, roughly half the standard 15-minute allocation. This likely reflects HRIStudio's more intuitive onboarding rather than W-02's technical background alone: the platform's guided workflow and timeline-based model required less explanation before the wizard was ready to begin the design phase. W-02's design phase then concluded in 24 minutes, within the allocation, and the trial ran for approximately five minutes.
% TODO: Report help request counts and types for all sessions.
W-01 generated a substantial number of help requests during the design phase, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern --- understanding the goal but struggling with the mechanism --- is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
W-03's design phase extended to 37 minutes, the longest design phase observed so far, despite W-03's programming background. The overrun reflects not conventional interface friction but time spent constructing and then revising an over-engineered design; sessions from W-02 onward enforce the 30-minute transition, so W-03's overrun constitutes a procedural exception noted in the observer log.
\textit{[PLACEHOLDER: Help request counts and categories for all sessions will be reported here.]}
W-04's design phase ran 35 minutes without completion, the only session in which the wizard did not finish before the cutoff. Training took 17 minutes, the longest training phase in the study; W-04 entered the design phase with questions about concurrent block execution that presaged later difficulties with branching.
W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering and neither had a robotics background, and W-04 additionally had moderate programming experience while W-05 had none. Yet the difference in tool condition produced a gap of at least 17 minutes in design time (W-04's design was still incomplete at the cutoff) and a qualitatively different session experience.
Across the five completed sessions, Choregraphe design phases averaged approximately 35.7 minutes. W-01 and W-03 both exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across two completed sessions, both well within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 6.5 minutes.
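These condition means follow directly from the phase durations in Table~\ref{tbl:timing}:
\[
\bar{t}^{\,\mathrm{Chor}}_{\text{design}} = \frac{35 + 37 + 35}{3} \approx 35.7\ \text{min}, \qquad
\bar{t}^{\,\mathrm{HRIS}}_{\text{design}} = \frac{24 + 18}{2} = 21\ \text{min},
\]
\[
\bar{t}^{\,\mathrm{Chor}}_{\text{train}} = \frac{15 + 12 + 17}{3} \approx 14.7\ \text{min}, \qquad
\bar{t}^{\,\mathrm{HRIS}}_{\text{train}} = \frac{7 + 6}{2} = 6.5\ \text{min}.
\]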
% TODO: Update condition means once W-06 is complete.
\subsection{Intervention Log}
W-01 generated a high volume of help requests during the design phase, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern, understanding the goal but struggling with the mechanism, is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
W-02 generated minimal interventions. No T-type tool-operation assistance was required during the design phase; the wizard navigated HRIStudio's interface without guidance. One H-type intervention was logged during the trial phase, corresponding to the branch step misfire described in the ERS section above.
W-03 generated one C-type intervention during the design phase: a clarification that control-flow logic dependent on onboard speech recognition was outside the study's scope. No T-type interventions were required; W-03 navigated Choregraphe independently throughout the design phase. The absence of T-type interventions for W-03, compared to W-01's high T-type volume, suggests that programming background moderates the interface accessibility problem in Choregraphe: the tool does not block programmers the way it blocked a non-programmer, though it still produced a lower DFS than HRIStudio.
W-04 generated the highest T-type count in the Choregraphe condition: five total design-phase interventions (4 T-type, 1 C-type), plus one T-type intervention during the trial. The design-phase T marks covered speech content punctuation ($\times$3, items 1--3) and the failed choice block attempt (item 8). The pattern echoes W-01's volume of tool-level friction, concentrated in a wizard with moderate rather than no programming experience.
W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two trial interventions (1 T-type, 1 G-type). The design-phase T marks concerned interface orientation (right-pane editing, branch block configuration); the C-type clarifications concerned conceptual mappings between the written specification and HRIStudio's structural model. Importantly, none of the clarifications blocked design completion, and the final DFS was unaffected. The C-type pattern for W-05 reflects a different kind of engagement from Choregraphe's T-type pattern: questions about what the tool's abstractions represent rather than how to operate its interface.
% TODO: Compile a summary intervention table once W-06 is complete.
\section{Qualitative Findings}
\subsection{Observed Specification Deviation}
A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when the researcher --- serving as test subject --- did not correctly identify the rock color and triggered the incorrect-answer branch. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.
A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when it surfaced during execution. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.
No specification deviations from the written protocol were observed in W-02, W-04, or W-05. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred in the Choregraphe condition.
\subsection{Wizard Experience}
% TODO: Add qualitative observations from remaining sessions.
W-01 expressed that the training was comprehensible and that the underlying logic of the task was clear. The primary source of frustration was Choregraphe's interface for handling conditional branches and managing the timing of parallel behaviors. Post-session comments suggested that the wizard would not use Choregraphe independently for future HRI work without technical support.
\textit{[PLACEHOLDER: Qualitative observations from remaining sessions will be reported here.]}
W-02 engaged with HRIStudio's timeline-based interface without requiring tool-operation guidance. The session proceeded efficiently, and W-02's combined CS and psychology background appeared to support both the technical implementation and the contextual understanding of the interaction scenario. No notable sources of friction were observed during design or trial phases.
W-03 approached the task as a programming challenge, applying Choregraphe's full feature set beyond what the specification required. When the WoZ framing was clarified (specifically that branching should reflect wizard decisions rather than onboard robot logic), W-03 revised the design but the over-engineered structure introduced earlier persisted in the final export and was reflected in the DFS score. W-03 described Choregraphe as a powerful block programming environment, but noted that applying it to this task was harder than its general capability implied, a characterization consistent with the tool-task mismatch the study is designed to surface.
W-04 approached the session with clear engagement and self-driven exploration: independently attempting Choregraphe features (concurrent blocks, choice node) that went beyond what prior instructions had covered. The researcher noted ``Great attempt. Self-driven to explore.'' The SUS score of 42.5 reflects a session where ambition consistently exceeded what the tool's interface could support without additional guidance. W-04's post-session comment that quality was attempted but the platform got in the way is arguably the most direct characterization of the accessibility problem in the dataset.
W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes recorded: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggest that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe condition.
% TODO: Add qualitative observations for W-06 when complete.
\section{Chapter Summary}
% TODO: Update summary when all sessions are complete.
This chapter presented the results from the pilot validation study. To date, one Choregraphe condition session has been completed (W-01), yielding a DFS of 70, ERS of 65, and SUS of 60. Qualitative observations from this session provide preliminary evidence for both the accessibility problem (substantial help requests and design phase overrun) and the reproducibility problem (unprompted specification deviation undetected until the live trial). Remaining sessions will add data for both conditions; Chapter~\ref{ch:discussion} interprets the available findings in the context of the research questions.
% TODO: Update condition means and summary once W-06 is complete.
This chapter presented results from five completed sessions of the pilot validation study. Across the three Choregraphe sessions (W-01, W-03, W-04), DFS scores were 42.5, 65, and 62.5 (mean 56.7); ERS scores were 65, 60, and 75 (mean 66.7); and SUS scores were 60, 75, and 42.5 (mean 59.2). Design phases in the Choregraphe condition averaged 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. Across the two completed HRIStudio sessions (W-02, W-05), DFS scores were both 100 (mean 100); ERS scores were both 95 (mean 95); and SUS scores were 90 and 70 (mean 80). HRIStudio design phases averaged 21 minutes, both within the allocation. The only unprompted speech content deviation observed in the dataset occurred in the Choregraphe condition (W-01). Branching was absent or required tool-operation assistance in two of three Choregraphe sessions (W-03, W-04) and was resolved by manual re-routing rather than programmed conditional logic in the third (W-01); neither completed HRIStudio session exhibited a wizard-attributable branching failure. The direction of the evidence across all five measures consistently favors HRIStudio. One HRIStudio session (W-06) remains; Chapter~\ref{ch:discussion} interprets the available findings in the context of the research questions.
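For transparency, the condition means reported here are simple unweighted averages of the per-wizard scores:
\[
\overline{\mathrm{DFS}}_{\mathrm{Chor}} = \frac{42.5 + 65 + 62.5}{3} \approx 56.7, \qquad
\overline{\mathrm{ERS}}_{\mathrm{Chor}} = \frac{65 + 60 + 75}{3} \approx 66.7, \qquad
\overline{\mathrm{SUS}}_{\mathrm{Chor}} = \frac{60 + 75 + 42.5}{3} \approx 59.2,
\]
\[
\overline{\mathrm{DFS}}_{\mathrm{HRIS}} = \frac{100 + 100}{2} = 100, \qquad
\overline{\mathrm{ERS}}_{\mathrm{HRIS}} = \frac{95 + 95}{2} = 95, \qquad
\overline{\mathrm{SUS}}_{\mathrm{HRIS}} = \frac{90 + 70}{2} = 80.
\]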
+16 -18
View File
@@ -9,14 +9,13 @@ This chapter interprets the results presented in Chapter~\ref{ch:results} agains
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated.
W-01's session offers preliminary evidence consistent with the accessibility problem described in Chapter~\ref{ch:background}. W-01 was a Digital Humanities faculty member with no programming background --- precisely the intended user population for tools like Choregraphe. Despite this framing, W-01 required significantly more time than allocated and generated a high volume of help requests, the majority of which concerned the tool's interface rather than the task itself. This distinction matters: W-01 understood what the specification required but could not efficiently translate that understanding into Choregraphe's behavior model. The finite state machine paradigm --- boxes, signals, and explicit connection routing --- imposed cognitive overhead on a domain expert who had no prior exposure to this abstraction.
The five completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the two completed HRIStudio sessions, both wizards achieved a DFS of 100. No HRIStudio wizard required a T-type tool-operation intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation (where to find property editing, how the branch block is configured) rather than fundamental operational barriers. By contrast, two of the three Choregraphe wizards required T-type assistance for core design tasks: W-01 for connection routing and branch wiring, and W-04 for speech content punctuation and a failed choice block attempt. W-03 required no T-type assistance but over-engineered the design beyond the specification, which is reflected in the DFS.
W-01's SUS score of 60, below the average benchmark of 68~\cite{Brooke1996}, corroborates this observation. Post-session comments indicated that the wizard would not use Choregraphe for future HRI work without technical support, despite completing the design challenge. Together these observations establish a concrete baseline: a tool nominally designed for non-programmers nonetheless required substantial researcher support, produced a high volume of interface-level help requests, and was rated below average in usability by a domain expert with no programming background.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90 and 70 (mean 80), both above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). Across programming backgrounds, the gap is consistent: W-01 (non-programmer, Choregraphe, SUS 60) versus W-05 (non-programmer, HRIStudio, SUS 70); W-04 (moderate programmer, Choregraphe, SUS 42.5) versus W-02 (programmer, HRIStudio, SUS 90).
The HRIStudio sessions are evaluated against this baseline. The central comparison is whether wizards using HRIStudio produce higher DFS scores with fewer tool-operation interventions and higher SUS ratings. If HRIStudio's timeline-based interaction model reduces the interface friction observed with Choregraphe, those differences should appear across all three measures simultaneously; a pattern limited to one measure would call for a more qualified interpretation.
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (both within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than toward elaborating the tool's architecture. Choregraphe's general-purpose programming model leaves the opposite path open, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. Neither HRIStudio wizard had that option, and both scored 100 on the DFS.
% TODO: Replace the forward-looking framing above with the actual condition-level comparison once HRIStudio sessions are complete.
% TODO: Report mean DFS, SUS, and T-type intervention counts per condition. Discuss what any gap implies for the accessibility claim.
% TODO: Add W-06 data to condition means once session is complete.
\subsection{Research Question 2: Reproducibility}
@@ -24,26 +23,27 @@ The second research question asked whether HRIStudio produces more reliable exec
This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation. By locking speech content at design time and presenting it to the wizard during execution rather than requiring re-entry, HRIStudio eliminates the structural opportunity for this substitution. Whether enforcement translates into measurably higher ERS scores is the empirical question the full dataset addresses. Complementing the ERS, the intervention log records whether any branch during the trial was resolved through programmed conditional logic or by manual re-routing, providing a parallel measure of execution reliability that is independent of the test subject's responses.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in either completed HRIStudio session. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe condition produced the only speech substitution in the dataset (W-01) and two sessions in which branching was absent from the design entirely (W-03, W-04).
% TODO: Replace the forward-looking framing above with actual ERS condition means once HRIStudio sessions are complete.
% TODO: Report whether any HRIStudio sessions produced specification deviations or required trial-phase T interventions.
ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were both 95 (mean 95). The branching item is particularly instructive: in the Choregraphe condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In both completed HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error. In neither HRIStudio session did branch execution depend on tool-operation guidance from the researcher.
% TODO: Add W-06 ERS data once session is complete.
\subsection{Session Timing and Downstream Effects}
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and compressing the trial window to approximately five minutes, well short of the intended ten. This timing pattern is itself evidence for the accessibility claim. If a tool reliably causes design phases to overrun their allocation, the downstream quality of the trial is compromised: a shorter trial produces a less complete ERS and a less representative interaction for the test subject. The difficulty of a tool does not only affect the design experience; it degrades the quality of the data that follow from it. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score.
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforce the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score.
% TODO: Report mean design phase duration per condition and note whether overruns cluster in the Choregraphe condition.
Across the five completed sessions, design phase overruns are concentrated in the Choregraphe condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (programmer) both overran in the Choregraphe condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator, and supports the conclusion that the overrun pattern is attributable to tool condition rather than wizard background alone.
\section{Comparison to Prior Work}
The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. The help request pattern observed --- conceptual understanding blocked by interface friction --- aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple.
The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. The pattern of help requests observed, in which W-01 understood the task but struggled with the tool's interface mechanisms, aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple.
The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice --- whether it reduces deviations in practice --- is what the ERS comparison will reveal.
The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice (whether it reduces deviations in practice) is reflected in the ERS comparison reported above.
The SUS score of 60 for Choregraphe falls below scores reported for general-purpose visual programming tools in other HCI studies, though direct comparison is complicated by task and population differences. It is consistent with the finding that domain-specific visual programming environments carry learning curves that programming experience alone does not fully resolve~\cite{Bartneck2024}.
% TODO: Add HRIStudio condition SUS mean to this section and compare to the Choregraphe baseline once sessions are complete.
The HRIStudio SUS mean of 80 (across two completed sessions) compared to the Choregraphe mean of 59.2 is consistent with this expectation. A gap of roughly 20 points is practically significant even in a pilot sample: it places the Choregraphe condition below average usability and the HRIStudio condition well above it, across wizards with different programming backgrounds. The Choregraphe score of 42.5 from W-04 falls in a range typically characterized as poor usability, a finding that is especially notable given that W-04 had moderate programming experience and engaged with the tool actively rather than passively.
\section{Limitations}
@@ -51,16 +51,14 @@ This study has several limitations that must be considered when interpreting the
\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.
\textbf{Researcher as test subject.} In W-01's session, the researcher served as the test subject due to participant unavailability. The researcher had foreknowledge of the specification and the study design, which may have introduced familiarity bias into the interaction. Because the DFS and ERS are scored against recordings and exported files rather than the test subject's behavior, this limitation primarily affects the qualitative character of the trial rather than the quantitative scores.
\textbf{Compressed trial window.} W-01's trial lasted approximately five minutes rather than the intended ten. This limits the completeness of the ERS for that session, since several interaction steps were abbreviated under time pressure. Future sessions should enforce the transition to the trial phase at the 30-minute design mark regardless of completion status, consistent with the observer's role defined in the study protocol.
\textbf{Trial execution without a separate test subject.} Following scheduling difficulties, the study protocol was adjusted so that the wizard executes the designed interaction directly rather than running it for a separate test subject. Because the DFS and ERS are scored against the exported project file and live observation rather than a subject's behavioral responses, this change does not affect the primary quantitative measures. The trial phase evaluates whether the wizard's design executes as specified; the presence or absence of a separate subject does not alter that criterion.
\textbf{Single task.} Both conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Condition imbalance.} Because participants were randomly assigned, the final sample may distribute programmers unevenly across conditions, confounding the comparison. With a small $N$, random assignment does not guarantee balance across programming background.
\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time; future iterations may behave differently.
\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of the pilot study in the context of the two research questions and connected the findings to prior work. The W-01 session provides preliminary evidence for both the accessibility problem and the reproducibility problem: Choregraphe produced significant interface friction for a Digital Humanities faculty member with no programming background, and permitted a specification deviation that was undetected until the live trial. These observations are consistent with the motivating analysis in Chapter~\ref{ch:background} and anchor the comparisons that the full dataset will resolve. The limitations of this pilot study --- sample size, researcher as test subject, compressed trial window, and single task --- are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
This chapter interpreted the results of five completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. The Choregraphe condition produced a mean DFS of 56.7, mean ERS of 66.7, and mean SUS of 59.2, with design phase overruns in all three sessions and no session in which the conditional branch executed as programmed logic. The two completed HRIStudio sessions produced mean DFS and ERS scores of 100 and 95 respectively, mean SUS of 80, both design phases within the allocation, and no speech content deviations. The specification deviation observed in W-01 illustrates the reproducibility problem concretely; its absence in the HRIStudio condition is consistent with the enforcement model's design intent. One HRIStudio session (W-06) remains; its results will complete the condition comparison. The limitations of this pilot study, including sample size, task simplicity, and condition imbalance by programming background, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
+8 -12
View File
@@ -5,31 +5,27 @@ This thesis set out to address two persistent problems in Wizard-of-Oz-based Hum
\section{Contributions}
This thesis makes three contributions to the field of Human-Robot Interaction research infrastructure.
This thesis makes three contributions to the field of HRI research infrastructure.
\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot or institution; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.
\textbf{HRIStudio: a reference implementation.} The second contribution is HRIStudio, an open-source, web-based platform that realizes the design principles described above. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO V6 used in this study. As a reference implementation, HRIStudio demonstrates that the design principles are technically feasible and can be delivered in a form that real researchers can use without programming expertise. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.
\textbf{HRIStudio: a complete, operational platform.} The second contribution is HRIStudio, an open-source, web-based platform that fully realizes the design principles described above. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO6 used in this study. HRIStudio demonstrates that the design principles are not only technically feasible but can be delivered as a complete system that real researchers use without programming expertise, making it both an artifact and an instrument of validation. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.
\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that was undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.
\section{Reflection on Research Questions}
The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small-N directional study.
The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small-$N$ directional study.
On accessibility, the Choregraphe condition demonstrates that even a tool described as suitable for non-programmers creates significant interface friction in practice. A wizard with programming experience required more time than allocated, generated a high volume of tool-level help requests, and rated the tool below the average SUS benchmark. The finite state machine model --- boxes connected by signals --- imposed cognitive overhead that domain knowledge of the task alone could not resolve. If HRIStudio's timeline-based model and guided workflow reduce that overhead, the difference should appear as higher DFS scores, fewer tool-operation interventions, and higher SUS ratings across the full sample.
On accessibility, the evidence from five completed sessions is consistent and directional. The Choregraphe condition produced a mean DFS of 56.7 across three wizards, with design phases averaging 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. The two completed HRIStudio sessions each produced a DFS of 100, with design phases averaging 21 minutes, both within the allocation. The most direct demonstration comes from W-05: a Chemical Engineering faculty member with no programming background trained in 6 minutes, completed a perfect design in 18 minutes, and ran the trial to completion without tool-operation difficulty. Choregraphe's finite state machine model, with boxes connected by signals, imposed cognitive overhead that domain knowledge of the task alone could not resolve; HRIStudio's timeline-based model did not produce this friction for any wizard regardless of background. SUS scores reflect the same pattern: Choregraphe mean 59.2 (below average), HRIStudio mean 80 (above average).
On reproducibility, the specification deviation observed in the Choregraphe session illustrates why enforcement matters. A tool that allows wizards to freely edit speech content at any point in the design process creates opportunities for drift that are invisible until they surface during execution. HRIStudio's protocol enforcement forecloses this class of deviation by construction --- speech is locked at design time and surfaced during execution rather than re-entered. Whether this architectural choice translates into measurably higher execution reliability scores, and whether the proportion of tool-assisted branching resolution differs between conditions, are the questions the full dataset answers.
% TODO: Once all sessions are complete, rewrite the Reflection section with actual condition means for DFS, ERS, and SUS.
% TODO: Replace the forward-looking framing in both RQ paragraphs with concrete comparative analysis.
% TODO: Update the chapter intro sentence ("The evidence suggests yes...") to reflect the actual direction of the findings.
On reproducibility, the specification deviation observed in W-01's Choregraphe session, a substituted rock color in the robot's speech that was undetected until execution, illustrates the failure mode the reproducibility problem predicts. No equivalent speech content deviation occurred in either HRIStudio session. Branching, the other primary reliability measure, was present in the design and executed in both HRIStudio sessions. W-05's branch fired cleanly via programmed conditional logic; W-02's session experienced a brief platform-side misfire corrected immediately by manual step selection, logged as an H-type (platform behavior) rather than a wizard error. In neither HRIStudio session was branching absent from the design or dependent on tool-operation guidance from the researcher. By contrast, branching was absent from two Choregraphe designs entirely (W-03, W-04) and resolved by manual re-routing in a third (W-01). ERS condition means reflect the outcome: 66.7 for Choregraphe, 95 for HRIStudio (two sessions complete). The enforcement model's design intent, locking speech at design time and presenting it during execution rather than requiring re-entry, appears to produce the reliability difference the architecture was designed to achieve. One HRIStudio session (W-06) remains; its inclusion will complete the condition comparison and may refine these means, but is unlikely to reverse the direction of the evidence.
\section{Future Directions}
The work described in this thesis suggests several directions for future investigation.
\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis --- for example, whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis, for example whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
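As a rough planning sketch rather than a formal power analysis, Lehr's approximation for a two-sample $t$-test at $\alpha = .05$ (two-sided) and power $0.8$ relates the required per-group sample size $n$ to the standardized effect size $d$:
\[
n \;\approx\; \frac{16}{d^{2}},
\]
so detecting a large effect ($d = 1.0$) would require roughly 16 wizards per condition, while a somewhat smaller effect ($d = 0.8$) would require roughly 25 per condition. A 20-participant study is therefore adequately powered only if the true effect is large, which the pilot's directional gaps suggest but cannot establish.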
\textbf{Multi-task evaluation.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.
@@ -39,10 +35,10 @@ The work described in this thesis suggests several directions for future investi
\textbf{Platform expansion.} The NAO integration used in this study is one instance of HRIStudio's plugin architecture. Extending the plugin ecosystem to include mobile robots, socially assistive robots, and non-humanoid platforms would broaden the system's applicability and test whether the plugin abstraction is sufficiently general to accommodate the range of robot capabilities used in published HRI research.
\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs --- including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.
\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs, including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.
\section{Closing Remarks}
The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it rigorous. Studies are conducted with custom tools that are never shared, by wizards whose behavior is never verified against a protocol, producing results that cannot be replicated because the conditions that produced them were never precisely recorded.
HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable --- that a web-based platform can make WoZ research accessible to domain experts who are not engineers, and that execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.
HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable: a web-based platform can make WoZ research accessible to domain experts who are not engineers, and execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.
+15
View File
@@ -0,0 +1,15 @@
\chapter{Blank Study Templates}
\label{app:blank_templates}
This appendix contains the blank versions of all study instruments used in the pilot validation study. These templates were used to produce the completed materials in Appendix~\ref{app:completed_materials}.
A note on the Informed Consent Form (ICF): the ICF was submitted with the original protocol to the Bucknell University Institutional Review Board (Protocol \#2526-025) and reflects the study design as initially proposed. The protocol was refined before data collection began; the key differences between the ICF and the executed protocol are as follows. First, phase durations were adjusted: Training was planned at 15 minutes, the Design Challenge at 30 minutes, the Live Trial at 10 minutes, and the Debrief at 5 minutes, rather than the 10/20/15/15-minute allocations stated in the ICF. Second, screen recording during the design phase was not implemented, as the DFS is scored from the exported project file rather than from screen footage. Third, the live trial was conducted without a separately recruited student volunteer: the researcher served as the test subject in the first session, and subsequent sessions ran with the wizard executing the designed interaction directly, as discussed in Chapter~\ref{ch:evaluation}. The ODS, DFS, ERS, and SUS templates reflect the protocol as executed.
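Both allocations sum to the same 60-minute session:
\[
15 + 30 + 10 + 5 \;=\; 60 \;=\; 10 + 20 + 15 + 15 \quad \text{(minutes)},
\]
so the refinement redistributed time toward the design challenge rather than lengthening the session.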
\medskip
\noindent\textbf{Contents of this appendix, in order:} ODS, DFS, ERS, SUS, ICF
\includepdf[pages=-,pagecommand={}]{pdfs/templates/ODS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/DFS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/ERS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/SUS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/ICF-Template.pdf}
+43 -3
View File
@@ -1,4 +1,44 @@
\chapter{Study Materials}
\label{app:materials}
\chapter{Completed Study Materials}
\label{app:completed_materials}
\textit{[PLACEHOLDER: Study materials will be inserted here. Content includes recruitment materials, paper specification, consent forms, SUS questionnaire, Design Fidelity Score rubric, Execution Reliability Score rubric, observer data sheet, and training protocol.]}
This appendix contains the completed study instruments for each of the five sessions conducted prior to the submission of this thesis (W-01 through W-05). The DFS and ERS were scored during and immediately after each session using live observation and the Observer Data Sheet; the SUS was completed by the wizard during the debrief phase.
\medskip
\noindent\textbf{Contents of this appendix, in order:}
\begin{itemize}
\item \textbf{W-01 (Choregraphe):} ODS, DFS, ERS, SUS
\item \textbf{W-02 (HRIStudio):} ODS, DFS, ERS, SUS
\item \textbf{W-03 (Choregraphe):} ODS, DFS, ERS, SUS
\item \textbf{W-04 (Choregraphe):} ODS, DFS, ERS, SUS
\item \textbf{W-05 (HRIStudio):} ODS, DFS, ERS, SUS
\end{itemize}
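% Filename convention (inferred from the session list above, not stated
% elsewhere): the two-digit number is the session ID, and the C/H suffix on
% the SUS files marks the tool condition (C = Choregraphe, H = HRIStudio).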
% --- W-01 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/ODS-01.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/DFS-01.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/ERS-01.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/SUS-01C.pdf}
% --- W-02 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/ODS-02.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/DFS-02.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/ERS-02.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/SUS-02H.pdf}
% --- W-03 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/ODS-03.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/DFS-03.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/ERS-03.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/SUS-03C.pdf}
% --- W-04 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/ODS-04.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/DFS-04.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/ERS-04.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/SUS-04C.pdf}
% --- W-05 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/ODS-05.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/DFS-05.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/ERS-05.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/SUS-05.pdf}
+15 -4
View File
@@ -69,7 +69,7 @@ keywords = {systematic review, reporting guidelines, methodology, human-robot in
}
@inproceedings{Pettersson2015,
author = {{Pettersson, John S\"{o}ren and Wik, Malin}},
author = {Pettersson, John S\"{o}ren and Wik, Malin},
title = {{The longevity of general purpose Wizard-of-Oz tools}},
year = {2015},
isbn = {9781450336734},
@@ -79,7 +79,7 @@ url = {https://doi.org/10.1145/2838739.2838825},
doi = {10.1145/2838739.2838825},
abstract = {The Wizard-of-Oz method has been around for decades, allowing researchers and practitioners to conduct prototyping without programming. An extensive literature review conducted by the authors revealed, however, that the re-usable tools supporting the method did not seem to last more than a few years. While generic systems start to appear around the turn of the millennium, most have already fallen out of use. Our interest in undertaking this review was inspired by the ongoing re-development of our own Wizard-of-Oz tool, the Ozlab, into a system based on web technology. We found three factors that arguably explain why Ozlab is still in use after 15 years instead of the two-three years lifetime of other generic systems: the general approach used from its inception; its inclusion in introductory HCI curricula, and the flexible and situation-dependent design of the wizard's user interface.},
booktitle = {Proceedings of the Annual Meeting of the Australian Special Interest Group for Computer Human Interaction},
pages = {422426},
pages = {422--426},
numpages = {5},
keywords = {Wizard user interface, Wizard of Oz, Software Sustainability, Non-functional requirements, GUI articulation},
location = {Parkville, VIC, Australia},
@@ -130,13 +130,13 @@ series = {OzCHI '15}
title = {{A Web-Based Wizard-of-Oz Platform for Collaborative and Reproducible Human-Robot Interaction Research}},
author = {Sean O'Connor and L. Felipe Perrone},
year = {2025},
organization = {2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)},
booktitle = {Proceedings of the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)},
abstract = {Human-robot interaction (HRI) research plays a pivotal role in shaping how robots communicate and collaborate with humans. However, conducting HRI studies can be challenging, particularly those employing the Wizard-of-Oz (WoZ) technique. WoZ user studies can have technical and methodological complexities that may render the results irreproducible. We propose to address these challenges with HRIStudio, a modular web-based platform designed to streamline the design, the execution, and the analysis of WoZ experiments. HRIStudio offers an intuitive interface for experiment creation, real-time control and monitoring during experimental runs, and comprehensive data logging and playback tools for analysis and reproducibility. By lowering technical barriers, promoting collaboration, and offering methodological guidelines, HRIStudio aims to make human-centered robotics research easier and empower researchers to develop scientifically rigorous user studies.},
}
@INPROCEEDINGS{Pot2009,
author={Pot, E. and Monceaux, J. and Gelin, R. and Maisonnier, B.},
booktitle={RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication},
booktitle={Proceedings of the 18th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2009)},
title={Choregraphe: a graphical tool for humanoid robot programming},
year={2009},
volume={},
@@ -217,3 +217,14 @@ title = {SUS: A quick and dirty usability scale},
volume = {189},
journal = {Usability Evaluation in Industry}
}
@article{HoffmanZhao2021,
author = {Hoffman, Guy and Zhao, Xuan},
title = {A Primer for Conducting Experiments in Human--Robot Interaction},
journal = {ACM Transactions on Human-Robot Interaction},
volume = {10},
number = {3},
articleno = {14},
year = {2021},
doi = {10.1145/3412374}
}
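% THRI assigns article numbers rather than page ranges, hence articleno in
% place of pages; standard BibTeX styles ignore the field, while ACM styles
% typeset it.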
+3
View File
@@ -3,8 +3,10 @@
%\documentclass[twoadv]{buthesis} %Allows entry of second advisor
%\usepackage{graphics} %Select graphics package
\usepackage{graphicx} %
\usepackage{pdfpages} %For including PDF pages in appendices
%\usepackage{amsthm} %Add other packages as necessary
\usepackage{array} %Extended column types and \arraybackslash
\usepackage{makecell} %Multi-line table header cells
\usepackage{tabularx} %Auto-width table columns
\usepackage{tikz} %For programmatic diagrams
\usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing}
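%shapes/arrows add node shapes and arrow tips; positioning enables relative
%placement (e.g., right=of); fit and backgrounds support grouping boxes;
%decorations.pathreplacing provides braces for annotations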
@@ -70,6 +72,7 @@
\makeatletter\@mainmattertrue\makeatother
\appendix
\include{chapters/app_blank_templates}
\include{chapters/app_materials}
\include{chapters/app_tech_docs}
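% \appendix switches chapter numbering to letters, so the files \include'd
% after it appear as lettered appendices in the order listed here.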