Refactor implementation and evaluation chapters for clarity and detail

- Revised the implementation chapter to emphasize HRIStudio as a reference implementation of design principles, detailing architectural choices and mechanisms.
- Enhanced descriptions of platform architecture, experiment storage, execution engine, and access control.
- Updated the evaluation chapter to frame the study as a pilot validation study, clarifying research questions, study design, participant roles, and measures.
- Improved consistency in language and structure throughout both chapters.
- Added details on participant recruitment and task specifications to better contextualize the study.
- Adjusted measurement instruments table to align with the new chapter title.
- Updated the LaTeX document to include an additional TikZ library for improved diagram capabilities.
2026-03-05 23:28:59 -05:00
parent 4d960b0ca9
commit 7757046eec
7 changed files with 160 additions and 117 deletions
@@ -19,14 +19,14 @@ To address the accessibility and reproducibility problems in WoZ-based HRI resea
This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for the HRI research community. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.
The implementation of this approach, realized as HRIStudio, demonstrates the feasibility of web-based control for real-time robot interaction studies. HRIStudio is an open-source proof-of-concept implementation that validates the proposed framework and serves as the reference system evaluated in this thesis.
The design principles behind this approach (a hierarchical specification model, an event-driven execution model, and a protocol/trial separation with explicit deviation logging) are the contribution of this thesis. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is my take on one such implementation: an open-source reference system that realizes those principles and serves as the instrument for empirical validation.
\section{Research Objectives}
This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. These publications form the foundation upon which this thesis asks its central research question: can a unified, web-based software framework for Wizard-of-Oz experiments measurably improve both disciplinary accessibility and scientific reproducibility of Human-Robot Interaction research compared to existing platform-specific tools?
This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established the vision and the prototype. This thesis formalizes the contribution: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a reference implementation of those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.
To answer this question, this thesis validates the framework through a user study, in which I implement the architectural concepts from the prior work in a complete, functional software platform and evaluate it with real users. The study compares setup effort, protocol adherence, and usability between HRIStudio and a representative baseline. The successful demonstration of this approach would provide evidence that thoughtful software infrastructure can lower barriers to entry in HRI while simultaneously improving the methodological rigor of the field.
The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and how consistently they can do it.
\section{Chapter Summary}
This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I articulated a central research question and outlined how this thesis validates that approach through implementation and a user study. To validate this approach, the next chapters establish the technical and methodological foundations.
This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question (can a hierarchical, event-driven specification model with explicit deviation logging lower the technical barrier and improve reproducibility of WoZ experiments?) and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations.
@@ -23,7 +23,7 @@ Moreover, few platforms directly address the methodological concerns raised by s
\section{Requirements for Modern WoZ Infrastructure}
This thesis represents the culmination of a multi-year research effort to develop infrastructure that addresses the challenges identified in the WoZ platform landscape. Based on the analysis of existing platforms and identified methodological gaps, I derived requirements for a modern WoZ research infrastructure. Through our preliminary work \cite{OConnor2024}, we identified six critical capabilities that a comprehensive platform should provide:
This thesis is the latest step in a multi-year effort to build infrastructure that addresses the challenges identified in the WoZ platform landscape. Based on the analysis of existing platforms and identified methodological gaps, I derived requirements for a modern WoZ research infrastructure. Through our preliminary work \cite{OConnor2024}, we identified six critical capabilities that a comprehensive platform should provide:
\begin{description}
\item[R1: Integrated workflow.] All phases of the experimental workflow (design, execution, and analysis) should be integrated within a single unified environment to minimize context switching and tool fragmentation.
@@ -34,14 +34,14 @@ This thesis represents the culmination of a multi-year research effort to develo
\item[R6: Collaborative support.] Multiple team members should be able to contribute to experiment design and review execution data, supporting truly interdisciplinary research.
\end{description}
To the best of my knowledge, no existing platform satisfies all six requirements. Most critically, the trade-off between accessibility and flexibility remains unresolved, and few tools embed methodological best practices directly into their design, like training wheels on a bicycle, guiding experimenters to follow sound methodology by default.
To the best of my knowledge, no existing platform satisfies all six requirements. Most critically, the trade-off between accessibility and flexibility remains unresolved. Few tools embed methodological best practices directly into their design to guide experimenters toward sound methodology by default.
The ideas presented here build upon prior work established in two peer-reviewed publications. We first introduced the concept for HRIStudio as a Late-Breaking Report at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}. In that position paper, we identified the lack of accessible tooling as a primary barrier to entry in HRI and proposed the high-level vision of a web-based, collaborative platform. We established the core requirements listed above and argued for a web-based approach to achieve them.
This work builds on two prior peer-reviewed publications. We first introduced the concept for HRIStudio as a Late-Breaking Report at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}. In that position paper, we identified the lack of accessible tooling as a primary barrier to entry in HRI and proposed the high-level vision of a web-based, collaborative platform. We established the core requirements listed above and argued for a web-based approach to achieve them.
Following the initial proposal, we published the detailed system architecture and preliminary prototype as a full paper at RO-MAN 2025 \cite{OConnor2025}. That publication validated the technical feasibility of our approach, detailing the communication protocols, data models, and plugin architecture necessary to support real-time robot control using standard web technologies while maintaining platform independence.
While those prior publications established the conceptual framework and technical architecture, this thesis focuses on the realization and empirical validation of the platform. I extend that research in two key ways. First, I implement a functional software system that addresses engineering challenges related to stability, latency, and deployment, providing a minimum viable product for evaluation. Second, I provide a rigorous user study comparing the proposed framework against a representative baseline tool. This empirical evaluation provides evidence to support the claim that thoughtful infrastructure design can improve both accessibility and reproducibility in HRI research.
While those prior publications established the conceptual framework and technical architecture, this thesis formalizes those design principles, realizes them in a complete implementation, and tests whether they produce measurably different outcomes in a pilot validation study. The pilot study compares design fidelity and execution reliability between HRIStudio and a representative baseline tool, showing whether these principles translate into better outcomes for real researchers.
\section{Chapter Summary}
This chapter has established the technical and methodological context for this thesis. Existing WoZ platforms fall into two categories: general-purpose tools like Polonius and OpenWoZ that offer flexibility but high technical barriers, and platform-specific systems like WoZ4U and Choregraphe that prioritize usability at the cost of cross-platform generality. Recent approaches such as VR-based frameworks attempt to bridge this gap, yet to the best of my knowledge, no existing tool successfully combines accessibility, flexibility, and embedded methodological rigor. Based on this landscape analysis, I identified six critical requirements for modern WoZ infrastructure (R1-R6): integrated workflows, low technical barriers, real-time control across platforms, automated logging, platform-agnostic design, and collaborative support. These requirements form the foundation for evaluating how the proposed framework advances the state of WoZ research infrastructure. The next chapter examines the broader reproducibility challenges that justify why these requirements are essential.
This chapter has established the technical and methodological context for this thesis. Existing WoZ platforms fall into two categories: general-purpose tools like Polonius and OpenWoZ that offer flexibility but high technical barriers, and platform-specific systems like WoZ4U and Choregraphe that prioritize usability at the cost of cross-platform generality. Recent approaches such as VR-based frameworks attempt to bridge this gap, yet to the best of my knowledge, no existing tool successfully combines accessibility, flexibility, and embedded methodological rigor. Based on this landscape analysis, I identified six critical requirements for modern WoZ infrastructure (R1--R6): integrated workflows, low technical barriers, real-time control across platforms, automated logging, platform-agnostic design, and collaborative support. These requirements are the standard against which the proposed design is evaluated in Chapter~\ref{ch:evaluation}. The next chapter examines the broader reproducibility challenges that justify why these requirements are essential.
@@ -1,7 +1,7 @@
\chapter{Reproducibility Challenges}
\label{ch:reproducibility}
Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce and how software infrastructure can address them. This chapter analyzes the sources of variability in WoZ studies and examines how current practices in infrastructure and reporting contribute to reproducibility problems. Understanding these challenges is essential for designing a system that supports experimentation at scale while remaining scientifically rigorous.
Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce and how software infrastructure can address them. This chapter analyzes the sources of variability in WoZ studies and examines how current practices in infrastructure and reporting contribute to reproducibility problems. Understanding these challenges is essential for designing a system that supports reproducible, rigorous experimentation.
\section{Sources of Variability}
@@ -31,8 +31,8 @@ Based on this analysis, I identify specific ways that software infrastructure ca
\section{Connecting Reproducibility Challenges to Infrastructure Requirements}
The reproducibility challenges identified above directly motivate the infrastructure requirements (R1-R6) established in Chapter~\ref{ch:background}. Inconsistent wizard behavior creates the need for enforced experimental protocols (R1, R2) that guide wizards systematically. The lack of comprehensive data undermines analysis, motivating automatic logging requirements (R4). Technical fragmentation violates platform agnosticism (R5). Each lab builds custom software tied to specific hardware, and these custom systems become obsolete when hardware evolves. Incomplete documentation reflects the need for self-documenting protocol specifications (R1, R2) that are simultaneously executable and shareable. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts.
The reproducibility challenges identified above directly motivate the infrastructure requirements (R1--R6) established in Chapter~\ref{ch:background}. Inconsistent wizard behavior creates the need for enforced execution protocols (R1) that guide wizards step by step, and for automatic logging (R4) that captures any deviations that occur. Timing errors specifically motivate responsive, fine-grained real-time control (R3): a wizard working with a sluggish interface introduces latency that disrupts the interaction and confounds timing analysis. Technical fragmentation forces each lab to rebuild infrastructure as hardware changes, violating platform agnosticism (R5). Incomplete documentation reflects the need for self-documenting, code-free protocol specifications (R1, R2) that are simultaneously executable and shareable. Finally, the isolation of individual research groups motivates collaborative support (R6): allowing multiple team members to observe and review live trials enables the shared scrutiny that reproducibility requires. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts.
\section{Chapter Summary}
This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. Rather than treating these challenges as inherent to the WoZ paradigm, I showed how each stems from gaps in current infrastructure. Software design can systematically mitigate these challenges through enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. These design goals directly address the six infrastructure requirements identified in Chapter~\ref{ch:background}. The following chapters describe the design, implementation, and empirical evaluation of a system that prioritizes reproducibility as a foundational design principle from inception.
This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. Rather than treating these challenges as inherent to the WoZ paradigm, I showed how each stems from gaps in current infrastructure. Software design can systematically mitigate these challenges through enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. These design goals directly address the six infrastructure requirements identified in Chapter~\ref{ch:background}. The following chapters describe the design, implementation, and pilot validation of a system that prioritizes reproducibility as a foundational design principle from inception.
@@ -1,7 +1,7 @@
\chapter{System Design}
\chapter{Architectural Design}
\label{ch:design}
Chapter~\ref{ch:background} established six requirements for modern WoZ infrastructure, labeled R1 through R6. This chapter presents the design decisions that address them: the hierarchical organization of experiment specifications, the event-driven execution model, the modular interface architecture, and the integrated data flow.
Chapter~\ref{ch:background} established six requirements for modern WoZ infrastructure, labeled R1 through R6, and Chapter~\ref{ch:reproducibility} showed the reproducibility problems that motivate them. This chapter presents the architectural contribution of this thesis: a hierarchical specification model, an event-driven execution model, a modular interface architecture, and an integrated data flow that together address all six requirements. These are design principles, not implementation details; they apply to any system built with the same goals.
\section{Hierarchical Organization of Experiments}
@@ -9,68 +9,78 @@ WoZ studies involve multiple reusable conditions, shared protocol phases, and pl
The terms in this hierarchy are used in a strict way. A \emph{study} is the top-level research container that groups related protocol conditions. An \emph{experiment} is one reusable condition within that study (for example, a control versus experimental condition). A \emph{step} is one phase of the protocol timeline (for example, an introduction, telling a story, or testing recall). An \emph{action} is the smallest executable unit inside a step (for example, trigger a gesture, play audio, or speak a prompt).
Figure~\ref{fig:experiment-hierarchy} shows the generic schema. Reading top-down, one study contains many experiments, each experiment contains many steps, and each step contains many actions. The dashed trial nodes indicate execution instances of a protocol, not new protocols. This protocol-versus-instance separation is central for reproducibility because researchers can repeat the same designed experiment across participants while preserving traceability of what was specified versus what was executed.
Figure~\ref{fig:experiment-hierarchy} shows the generic schema as a linear chain. Reading top-down, one study contains one or more experiments, each experiment contains one or more steps, and each step contains one or more actions. Figure~\ref{fig:trial-instantiation} shows the protocol-versus-instance separation in isolation. The left column holds the protocol designed once before the study begins; the right column shows the separate trial records produced each time a participant runs it. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed.
To illustrate the same schema with a concrete case, consider an interactive storytelling study with the research question: \emph{Does robot interaction modality influence participant recall performance?} The two conditions differ in how the robot looks and behaves: NAO6 has a human-like form and uses expressive gestures, while TurtleBot is visibly machine-like with no social movement cues. This keeps the narrative task the same across both conditions while changing only how the robot delivers it.
To illustrate the schema with a concrete example, consider an interactive storytelling study with the research question: \emph{Does robot interaction modality influence participant recall performance?} The two conditions differ in how the robot looks and behaves: NAO6 has a human-like form and uses expressive gestures, while TurtleBot is visibly machine-like with no social movement cues. This keeps the narrative task the same across both conditions while changing only how the robot delivers it.
Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same ordered steps (Intro, Story Telling, Recall Test), and each step contains actions. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure. Together, Figure~\ref{fig:experiment-hierarchy} and Figure~\ref{fig:example-hierarchy} move from abstract schema to concrete instantiation.
Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same ordered steps (Intro, Story Telling, Recall Test), and each step contains actions. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure. Figures~\ref{fig:experiment-hierarchy}, \ref{fig:trial-instantiation}, and~\ref{fig:example-hierarchy} together progress from abstract schema, to protocol-versus-instance separation, to a concrete instantiation.
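To make the hierarchy concrete in code, the sketch below models the four levels as TypeScript types, with each level containing one or more of the next. This is an illustrative sketch only; the field names and the action-type union are assumptions for exposition, not HRIStudio's actual schema.
\begin{verbatim}
// Illustrative sketch; field names are hypothetical, not HRIStudio's schema.
interface Study      { id: string; title: string; experiments: Experiment[]; }
interface Experiment { id: string; condition: string; steps: Step[]; }
interface Step       { id: string; name: string; actions: Action[]; }
interface Action {
  id: string;
  type: "speech" | "gesture" | "audio" | "wizard_note";
  parameters: Record<string, unknown>;   // e.g. { text: "..." } for speech
}
\end{verbatim}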
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
    nodebox/.style={rectangle, draw=black, thick, fill=gray!15, align=center,
        text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
    nodeboxdark/.style={rectangle, draw=black, thick, fill=gray!35, align=center,
        text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
    arrow/.style={->, thick},
    label/.style={font=\small\itshape, fill=white, inner sep=2pt}]
    % Four-level chain: Study -> Experiment -> Step -> Action
    \node[nodebox] (study) at (0, 6.0) {Study};
    \node[nodebox] (exp) at (0, 4.0) {Experiment};
    \node[nodebox] (step) at (0, 2.0) {Step};
    \node[nodeboxdark] (action) at (0, 0.0) {Action};
    \draw[arrow] (study.south) -- node[label, right=6pt] {has one or more} (exp.north);
    \draw[arrow] (exp.south) -- node[label, right=6pt] {has one or more} (step.north);
    \draw[arrow] (step.south) -- node[label, right=6pt] {has one or more} (action.north);
\end{tikzpicture}
\caption{The four-level experiment specification hierarchy.}
\label{fig:experiment-hierarchy}
\end{figure}
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
spec/.style={rectangle, draw=black, thick, fill=gray!15, align=center,
text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
trial/.style={rectangle, draw=black, thick, dashed, fill=gray!5, align=center,
text width=3.2cm, minimum height=1.0cm, font=\small, inner sep=4pt},
arrow/.style={->, thick},
darrow/.style={->, thick, dashed}]
%% ---- Column headers ----
\node[font=\small\bfseries] at (1.9, 7.0) {Protocol (designed once)};
\node[font=\small\bfseries] at (7.9, 7.0) {Trials (run per participant)};
%% ---- Protocol column ----
\node[spec] (study) at (1.9, 5.8) {Study};
\node[spec] (exp) at (1.9, 4.2) {Experiment};
\node[spec] (step) at (1.9, 2.6) {Step};
\draw[arrow] (study.south) -- (exp.north);
\draw[arrow] (exp.south) -- (step.north);
%% ---- Trial column ----
\node[trial] (t1) at (7.9, 5.5) {Trial --- P01\\{\footnotesize timestamped log}};
\node[trial] (t2) at (7.9, 4.2) {Trial --- P02\\{\footnotesize timestamped log}};
\node[trial] (t3) at (7.9, 2.9) {Trial --- P03\\{\footnotesize timestamped log}};
%% ---- Separator ----
\draw[gray!60, thick, dashed] (4.85, 1.8) -- (4.85, 6.6);
\node[font=\footnotesize\itshape, gray!80] at (4.85, 1.4) {protocol\,/\,trial boundary};
%% ---- Instantiation arrows + label ----
\node[font=\small\itshape] at (6.35, 6.3) {instantiates};
\draw[darrow] (exp.east) -- (t1.west);
\draw[darrow] (exp.east) -- (t2.west);
\draw[darrow] (exp.east) -- (t3.west);
\end{tikzpicture}
\caption{One experiment protocol instantiated as a separate trial record per participant.}
\label{fig:trial-instantiation}
\end{figure}
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
@@ -120,15 +130,15 @@ Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The
\draw[arrow] (tb_s2.south) -- (tb_a3.north);
\end{tikzpicture}
\caption{Example hierarchy in the same structure as Figure~\ref{fig:experiment-hierarchy}: labels are embedded in each box, each experiment has independent steps, and Story Telling expands to multiple concrete actions.}
\caption{A recall study with two conditions mapped onto the four-level hierarchy.}
\label{fig:example-hierarchy}
\end{figure}
Together, these two figures motivate why the hierarchy is useful in practice. The layered structure lets researchers define protocols at whatever level they care about without writing code, which keeps the tool accessible to non-programmers. The step and action levels also align naturally with live trial flow, so the wizard stays guided by the protocol while retaining control over timing, which supports the real-time control requirement. Action-level execution provides a natural unit for timestamped logging and post-hoc analysis, satisfying the automated logging requirement. Finally, keeping experiment definitions separate from trial instances means the same protocol can be reproduced across participants and conditions, supporting both the integrated workflow and collaborative support requirements.
Together, these three figures motivate why the hierarchy is useful in practice. The layered structure lets researchers define protocols at whatever level they care about without writing code, which keeps the tool accessible to non-programmers. The step and action levels also align naturally with live trial flow, so the wizard stays guided by the protocol while retaining control over timing, which supports the real-time control requirement. Action-level execution provides a natural unit for timestamped logging and post-trial analysis, satisfying the automated logging requirement. Finally, keeping experiment definitions separate from trial instances means the same protocol can be reproduced across participants and conditions, supporting both the integrated workflow and collaborative support requirements.
\section{Event-Driven Execution Model}
To achieve real-time responsiveness while maintaining methodological rigor (R3, R5), the system uses an event-driven execution model rather than a time-driven one. In a time-driven approach, the system advances through actions on a fixed schedule regardless of what the participant is doing, so the robot might speak over a participant who is still talking, or move on before a response has been given. The event-driven model avoids this by letting the wizard trigger each action when the interaction is ready for it. Figure~\ref{fig:event-driven-timeline} contrasts the two approaches across two trials of the same experiment.
To achieve real-time responsiveness while maintaining methodological rigor (R3, R5), the system uses an event-driven execution model rather than a time-driven one. In a time-driven approach, the system advances through actions on a fixed schedule regardless of what the participant is doing, so the robot might speak over a participant who is still talking, or move on before a response has been given. The event-driven model avoids this by letting the wizard trigger each action when the interaction is ready for it. Figure~\ref{fig:event-driven-timeline} contrasts the two approaches using the same four-action sequence: Greet (G), Begin Story (BS), Ask Question (AQ), and End (E). In the time-driven row, fixed intervals $t_0$ through $t_2$ define when each event fires, and dashed vertical lines show where those moments fall relative to the event-driven rows below. In both event-driven rows, the wizard fires the same four labeled events at different real-time positions: T1 (a faster participant) finishes well before T2 (a slower one), while both preserve the same action order.
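A minimal sketch of this execution model follows, using the Action type from the hierarchy sketch; the function names and signatures are assumptions, not HRIStudio's actual engine API.
\begin{verbatim}
// Hypothetical helpers, declared for the sketch.
declare function dispatchToRobot(action: Action): Promise<void>;
declare function logEvent(e: { at: Date; actionId: string; deviation: boolean }): void;

// The engine waits for the wizard, never for a clock; the protocol still
// fixes the order in which scripted actions may fire.
async function runStep(
  actions: Action[],
  nextTrigger: () => Promise<{ actionIndex: number; at: Date }>,
): Promise<void> {
  for (let i = 0; i < actions.length; i++) {
    const trigger = await nextTrigger();        // the wizard decides *when*
    if (trigger.actionIndex !== i) {
      throw new Error("out-of-order trigger");  // the protocol decides the order
    }
    await dispatchToRobot(actions[i]);
    logEvent({ at: trigger.at, actionId: actions[i].id, deviation: false });
  }
}
\end{verbatim}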
\begin{figure}[htbp]
\centering
@@ -161,6 +171,14 @@ To achieve real-time responsiveness while maintaining methodological rigor (R3,
\node[font=\scriptsize, above=3pt] at (7.0, 3.5) {Ask Question};
\node[font=\scriptsize, above=3pt] at (10.5, 3.5) {End};
%% ---- Time interval braces below time-driven row ----
\draw[decorate, decoration={brace, amplitude=4pt, mirror}]
(1.0, 3.2) -- (3.5, 3.2) node[midway, below=6pt, font=\scriptsize] {$t_0$};
\draw[decorate, decoration={brace, amplitude=4pt, mirror}]
(3.5, 3.2) -- (7.0, 3.2) node[midway, below=6pt, font=\scriptsize] {$t_1$};
\draw[decorate, decoration={brace, amplitude=4pt, mirror}]
(7.0, 3.2) -- (10.5, 3.2) node[midway, below=6pt, font=\scriptsize] {$t_2$};
% Dashed vertical alignment lines
\draw[dashed, gray!70] (1.0, 3.35) -- (1.0, 0.35);
\draw[dashed, gray!70] (3.5, 3.35) -- (3.5, 0.35);
@@ -173,31 +191,43 @@ To achieve real-time responsiveness while maintaining methodological rigor (R3,
\node[dot] at (5.5, 2.0) {};
\node[dot] at (7.8, 2.0) {};
% Event-driven S1 labels
\node[font=\scriptsize, below=3pt] at (1.0, 2.0) {G};
\node[font=\scriptsize, below=3pt] at (2.5, 2.0) {BS};
\node[font=\scriptsize, below=3pt] at (5.5, 2.0) {AQ};
\node[font=\scriptsize, below=3pt] at (7.8, 2.0) {E};
% Event-driven S2 (slower participant)
\node[dot] at (1.0, 0.5) {};
\node[dot] at (4.3, 0.5) {};
\node[dot] at (8.5, 0.5) {};
\node[dot] at (10.8, 0.5) {};
% Event-driven S2 labels
\node[font=\scriptsize, below=3pt] at (1.0, 0.5) {G};
\node[font=\scriptsize, below=3pt] at (4.3, 0.5) {BS};
\node[font=\scriptsize, below=3pt] at (8.5, 0.5) {AQ};
\node[font=\scriptsize, below=3pt] at (10.8, 0.5) {E};
% Time axis label
\node[font=\small\itshape] at (5.75, -0.25) {time};
\end{tikzpicture}
\caption{The same four-action protocol executed under time-driven (top) and event-driven (bottom, two trials) models. Dashed lines mark the fixed schedule. Under the event-driven model, the wizard advances each action when the participant is ready, so trials differ in duration while preserving action order.}
\caption{Time-driven (top) versus event-driven (bottom, two trials) execution of the same four-action protocol.}
\label{fig:event-driven-timeline}
\end{figure}
This approach has several implications. First, not all trials of the same experiment will have identical timing or duration; the length of a learning task, for example, depends on the participant's progress. The system records the actual timing of actions, permitting researchers to capture these natural variations in their data. Second, the event-driven model enables the wizard to respond contextually without departing from the protocol; the wizard remains guided by the sequence of available actions while having control over when to advance based on participant cues.
The system guides the wizard through the protocol step by step, ensuring the intended sequence is followed. Every action is logged with a timestamp whether it was scripted or not, and anything outside the protocol is flagged as a deviation. This means inconsistent wizard behavior shows up in the data rather than disappearing into it.
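The unscripted path can be sketched in the same vocabulary, again with hypothetical names and reusing dispatchToRobot and logEvent from the sketch above; the only difference from a scripted action is the deviation flag.
\begin{verbatim}
// Unscripted actions take the same path, flagged rather than hidden.
async function triggerManualAction(action: Action): Promise<void> {
  await dispatchToRobot(action);      // runs like any scripted action
  logEvent({ at: new Date(), actionId: action.id, deviation: true });
}
\end{verbatim}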
\section{Modular Interface Architecture}
Researchers interact with the system through three interfaces, one per phase of a study: designing a protocol, running a live trial, and reviewing the results.
Researchers interact with the system through three interfaces, each one encapsulating a specific phase of an experimental study: designing a protocol, running a live trial, and reviewing the results.
\subsection{Design Interface}
The \emph{Design} interface gives researchers a drag-and-drop canvas for building experiment protocols. Researchers drag pre-built action components, including robot movements, speech, wizard instructions, and conditional logic, onto the canvas and drop them into sequence. Clicking a component opens a side panel where its parameters can be set, such as the text for a speech action or the gesture name for a movement.
The \emph{Design} interface gives researchers a drag-and-drop canvas for building experiment protocols, creating a visual programming environment. Researchers drag pre-built action components, including robot movements, speech, wizard instructions, and conditional logic, onto the canvas and drop them into sequence. Clicking a component opens a side panel where its parameters can be set, such as the text for a speech action or the gesture name for a movement.
By treating experiment design as a visual specification task, the interface lowers technical barriers (R2) and ensures that the resulting protocol specification is human-readable and shareable alongside research results. The specification is stored in a structured format that can be both displayed as a timeline for analysis and executed by the platform's runtime.
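To suggest what such a stored specification might look like, here is a hypothetical step from the storytelling example, expressed against the types sketched earlier; HRIStudio's actual serialization format may differ.
\begin{verbatim}
// Hypothetical serialized step; the real format may differ.
const storyTelling: Step = {
  id: "step-2",
  name: "Story Telling",
  actions: [
    { id: "a1", type: "speech",  parameters: { text: "Once upon a time..." } },
    { id: "a2", type: "gesture", parameters: { name: "wave" } },
    { id: "a3", type: "wizard_note",
      parameters: { text: "Wait for the participant to settle" } },
  ],
};
\end{verbatim}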
@@ -215,13 +245,13 @@ After a trial concludes, the \emph{Analysis} interface lets researchers review e
\section{Data Flow and Infrastructure Implementation}
To ensure that data from every experimental phase remains traceable and accessible, the system organizes its internals into three architectural layers and defines a clear data pathway from protocol design through post-trial analysis, covering how experiment specifications, control commands, and recorded data move through the system.
To ensure that data from every experimental phase remains traceable, the system organizes its internals into three architectural layers and defines a clear data pathway from protocol design through post-trial analysis, covering how experiment specifications, control commands, and recorded data move through the system.
\subsection{Architectural Layers}
The architecture separates the system into three distinct layers, each with a specific responsibility. The \emph{user interface layer} runs in researchers' web browsers and handles all visual interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging experiment components, and viewing live trial status. The \emph{application logic layer} operates as a server process that manages experiment data, coordinates trial execution, authenticates users, and orchestrates communication between the interface and the robot. The \emph{data and robot control layer} encompasses long-term storage of experiment protocols and trial data, as well as direct communication with robot hardware.
The system is structured as a three-layer architecture in which each layer has a specific responsibility. The \emph{user interface layer} runs in researchers' web browsers and handles all visual interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging experiment components, and viewing live trial status. The \emph{application logic layer} operates as a server process that manages experiment data, coordinates trial execution, authenticates users, and orchestrates communication between the interface and the robot. The \emph{data and robot control layer} encompasses long-term storage of experiment protocols and trial data, as well as direct communication with robot hardware.
This separation provides several benefits. Different parts of the system can evolve independently; for example, improving the user interface does not require changes to robot control logic. The separation also clarifies responsibilities: the user interface never directly commands robot hardware; all robot actions flow through the application logic layer, which can enforce safety constraints and maintain consistent logging. Figure~\ref{fig:three-tier} illustrates this layered architecture.
This separation of concerns provides two concrete benefits. First, each layer can evolve independently: improving the user interface requires no changes to robot control logic, and swapping in a different storage backend requires no changes to the execution engine. Second, the separation enforces clear responsibilities: the user interface never directly commands robot hardware; all robot actions flow through the application logic layer, which maintains consistent logging. Figure~\ref{fig:three-tier} illustrates this layered architecture.
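A small sketch illustrates the layering rule; the route and layout are hypothetical. The browser code can only ask the application server, and only the server's engine touches robot drivers.
\begin{verbatim}
// ui layer (browser): no robot access, only HTTP calls to the server.
async function onTriggerClick(trialId: string, actionId: string): Promise<void> {
  await fetch(`/api/trials/${trialId}/actions/${actionId}`, { method: "POST" });
}
// application logic layer (server): validates the trigger, writes the log
// entry, then invokes the robot driver in the data/robot control layer.
\end{verbatim}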
\begin{figure}[htbp]
\centering
@@ -258,11 +288,11 @@ This separation provides several benefits. Different parts of the system can evo
\subsection{Data Flow Through Experimental Phases}
During the design phase, researchers create experiment specifications that are stored in the system database. During a live experiment session, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete session records. After the experiment, researchers access these records through the Analysis interface.
During the design phase, researchers create experiment specifications that are stored in the system database. During a live experiment session, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete session records. After the experiment, researchers can inspect these records through the Analysis interface.
The flow of data during a trial proceeds through six distinct phases, as shown in Figure~\ref{fig:trial-dataflow}. First, a researcher creates an experiment protocol using the Design interface. Second, when a trial begins, the application server loads the protocol and begins stepping through it, sending commands to the robot and waiting for events such as wizard inputs, sensor readings, or timeouts. Third, every action, both planned protocol steps and unexpected events, is immediately written to the trial log with precise timing information. Fourth, the Execution interface continuously displays the current state, allowing the wizard and observers to monitor progress in real-time. Fifth, when the trial concludes, all recorded media (video and audio) is transferred from the browser to the server and associated with the trial record. Sixth, the Analysis interface retrieves the stored trial data and reconstructs exactly what happened, synchronized with the video and audio recordings.
This design ensures comprehensive documentation of every trial, supporting both fine-grained analysis and reproducibility. Researchers can review not just what they planned to happen, but what actually occurred, including timing variations and unexpected events.
This design ensures comprehensive documentation of every trial, supporting both fine-grained analysis and reproducibility. Researchers can review not just what they intended to happen, but what actually occurred, including timing variations and unexpected events.
\begin{figure}[htbp]
\centering
@@ -292,8 +322,8 @@ This design ensures comprehensive documentation of every trial, supporting both
\subsection{Requirements Satisfaction}
The design choices described in this chapter map directly onto the requirements from Chapter~\ref{ch:background}. Having the researcher work through a single platform from protocol creation to post-trial review satisfies R1 (integrated workflow) without extra tooling. The visual drag-and-drop Design interface removes the need for programming knowledge, satisfying R2 (low technical barriers) by keeping the system accessible to researchers without a software background. Event-driven execution satisfies R3 (real-time control) by giving the wizard control over pacing while keeping the trial on protocol. All actions are logged automatically at the system level, satisfying R4 (automated logging) without requiring researchers to instrument their studies manually. The three-layer architecture decouples action specifications from robot-specific commands, satisfying R5 (platform agnosticism) by letting the same protocol run on different hardware without modification. Finally, shared live views and multi-user access let interdisciplinary teams observe and annotate the same trial simultaneously, satisfying R6 (collaborative support).
The design choices described in this chapter map directly onto the requirements from Chapter~\ref{ch:background}. Having the researcher work through a single platform from protocol creation to post-trial review satisfies R1 (integrated workflow) without extra tooling. The visual drag-and-drop Design interface removes the need for programming knowledge, satisfying R2 (low technical barriers) by keeping the system accessible to researchers without a software background. Event-driven execution satisfies R3 (real-time control) by giving the wizard control over pacing while keeping the trial on protocol. All actions are logged automatically at the system level, satisfying R4 (automated logging) without requiring researchers to add logging by hand. The three-layer architecture decouples action specifications from robot-specific commands, satisfying R5 (platform agnosticism) by letting the same protocol run on different hardware without modification. Finally, shared live views and multi-user access let interdisciplinary teams observe and annotate the same trial simultaneously, satisfying R6 (collaborative support).
\section{Chapter Summary}
This chapter described a system design with emphasis on how architectural choices directly implement the infrastructure requirements identified in Chapter~\ref{ch:background}. The hierarchical organization of experiment specifications enables intuitive, executable design. The event-driven execution model balances protocol consistency with realistic interaction dynamics. The modular interface architecture separates concerns across design, execution, and analysis phases while maintaining data coherence. The integrated data flow ensures that reproducibility is supported by design rather than by afterthought. The following chapter presents HRIStudio as a reference implementation of these design principles, discussing specific technologies and architectural components.
This chapter described the architectural design with emphasis on how each design choice directly implements the infrastructure requirements identified in Chapter~\ref{ch:background}. The hierarchical organization of experiment specifications enables intuitive, executable design. The event-driven execution model balances protocol consistency with realistic interaction dynamics. The modular interface architecture separates concerns across design, execution, and analysis phases while maintaining data coherence. The integrated data flow ensures that reproducibility is supported by design rather than by afterthought. The following chapter presents HRIStudio as a reference implementation of these design principles, discussing specific technologies and architectural components.
@@ -1,19 +1,21 @@
\chapter{Implementation}
\label{ch:implementation}
This chapter explains how HRIStudio implements the design from Chapter~\ref{ch:design}. It covers the architectural choices and mechanisms behind how the platform stores experiments, executes trials, integrates robot hardware, and controls access. Technology stack specifics are in Appendix~\ref{app:tech_docs}.
HRIStudio is a reference implementation of the design principles established in Chapter~\ref{ch:design}. The central contribution of this work is not the tool itself but the design concepts that underpin it: the hierarchical specification model, the event-driven execution model, and the integrated data flow. Any system built on those concepts would satisfy the same requirements. This chapter explains how HRIStudio realizes them, covering the architectural choices and mechanisms behind how the platform stores experiments, executes trials, integrates robot hardware, and controls access. Technology stack specifics are presented in Appendix~\ref{app:tech_docs}.
\section{Platform Architecture}
HRIStudio runs as a web application. Researchers access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connect to the same shared system. This eliminates installation complexity and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables natural collaboration (R6): multiple team members can access experiment data and observe live trials simultaneously from different machines without any additional configuration.
HRIStudio is built as a web application. Researchers access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connects to the same shared system. This eliminates the need for a local installation and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables easy collaboration (R6): multiple team members can access experiment data and observe live trials simultaneously from different machines without any additional configuration.
The system is organized into three layers: a browser-based user interface, an application server that manages execution, authentication, and logging, and a data and robot control layer covering storage and hardware communication. These layers are described architecturally in Chapter~\ref{ch:design}; what matters for implementation is that the server runs on the same local network as the robot hardware. This keeps communication latency low during live trials, where a delay between the wizard's input and the robot's response would disrupt the interaction. All three layers are implemented in the same language--TypeScript \cite{TypeScript2014}, a statically-typed superset of JavaScript. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to surface as runtime failures during a live trial.
I organized the system into three layers: a browser-based user interface, an application server that manages execution, authentication, and logging, and a data and robot control layer covering storage and hardware communication. This layered structure is shown in Figure~\ref{fig:three-tier}. A key deployment constraint is that the application server runs on the same local network as the robot hardware. This keeps communication latency low during live trials: a noticeable delay between the wizard's input and the robot's response would break the interaction.
I implemented all three layers in the same language, TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a live trial.
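As a sketch of what full-stack type sharing buys in practice (the module path and type here are assumptions): a single request type defined once and imported by both browser and server code.
\begin{verbatim}
// shared/types.ts (hypothetical module): imported by browser and server alike.
export interface TriggerActionRequest {
  trialId: string;
  actionId: string;
  triggeredAt: string;  // ISO 8601 timestamp
}
// Renaming a field here breaks every client call site and server handler
// at compile time, not during a live trial.
\end{verbatim}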
\section{Experiment Storage and Trial Logging}
Experiments are saved to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that can be run across any number of trials without modification.
The system saves experiments to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification.
When a trial begins, the system creates a new trial record linked to that experiment. Every action the wizard triggers during the trial is written to that record with a precise timestamp, whether it was scripted or not. Video, audio, and robot sensor data are recorded alongside the action log for the duration of the trial. Unscripted actions are flagged as deviations. Because the trial record and the experiment reference the same underlying specification, the Analysis interface can directly compare what was planned against what was executed for any trial, without any manual work by the researcher. Figure~\ref{fig:trial-record} shows the structure of a completed trial record.
When a trial begins, the system creates a new trial record linked to that experiment. The system writes every action the wizard triggers to that record with a precise timestamp, whether scripted or not, including any unscripted actions triggered outside the protocol. The system flags those unscripted actions as deviations. The browser records video, audio, and robot sensor data alongside the action log for the duration of the trial. The Analysis interface can directly compare what was planned against what was executed for any trial, without any manual work by the researcher, because the trial record and the experiment reference the same underlying specification. Figure~\ref{fig:trial-record} shows the structure of a completed trial record: action log entries, video, audio, and robot sensor data all share a common timestamp reference so the Analysis interface can align them without manual synchronization; dashed lines mark step boundaries; and the system flags any deviation from the experiment specification inline.
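The shape of such a record can be sketched as follows; the field names are illustrative assumptions, not the platform's actual schema.
\begin{verbatim}
// Illustrative trial record; all tracks share one timestamp reference.
interface TrialRecord {
  trialId: string;
  experimentId: string;         // the specification this trial instantiates
  participantId: string;        // e.g. "P01"
  recordingStartedAt: string;   // common reference for aligning all tracks
  events:  { at: string; actionId?: string; deviation: boolean }[];
  media:   { kind: "video" | "audio"; fileUrl: string }[];
  sensors: { at: string; channel: string; value: number }[];
}
\end{verbatim}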
\begin{figure}[htbp]
\centering
@@ -79,27 +81,29 @@ When a trial begins, the system creates a new trial record linked to that experi
};
\end{tikzpicture}
\caption{Structure of a completed trial record. Action log entries, video, audio, and robot sensor data share a common timestamp reference so the Analysis interface can align them without manual synchronization. Deviations from the experiment specification are flagged inline. Dashed lines mark step boundaries.}
\caption{Structure of a completed trial record, showing synchronized action log, media, and sensor tracks.}
\label{fig:trial-record}
\end{figure}
Video and audio are recorded locally in the researcher's browser during the trial rather than streamed to the server in real time. This prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the browser transfers the complete recordings to the server and associates them with the trial record. Because the timestamp when recording starts is logged alongside the action log, the Analysis interface can align video and audio with the logged actions without any manual synchronization.
Video and audio are recorded locally in the researcher's browser during the trial rather than streamed to the server in real time. This prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the browser transfers the complete recordings to the server and associates them with the trial record. The Analysis interface can align video and audio with the logged actions without any manual synchronization, because the timestamp when recording starts is logged alongside the action log.
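The browser-side recording can be sketched with the standard MediaRecorder API; the upload endpoint and header are assumptions for illustration.
\begin{verbatim}
// Sketch using the standard MediaRecorder API; endpoint is hypothetical.
async function recordTrial(stream: MediaStream, trialId: string) {
  const chunks: Blob[] = [];
  const recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  const startedAt = new Date().toISOString();  // logged for later alignment
  recorder.start();
  // Returned closure stops recording and uploads the complete file.
  return async function stopAndUpload(): Promise<void> {
    await new Promise<void>((resolve) => {
      recorder.onstop = () => resolve();
      recorder.stop();
    });
    const blob = new Blob(chunks, { type: recorder.mimeType });
    await fetch(`/api/trials/${trialId}/media`, {
      method: "POST",
      headers: { "x-recording-started-at": startedAt },
      body: blob,
    });
  };
}
\end{verbatim}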
The system stores structured and media data separately. Experiment specifications and trial records live in a structured database, which makes it efficient to query across trials (for example, retrieving all trials for a specific participant or comparing action timing across conditions). Video and audio files live in a dedicated file store, since their size makes them unsuitable for a database and the system never queries their content directly.
\section{The Execution Engine}
The execution engine is the component that runs a live trial: it loads the experiment, manages the wizard's connection, dispatches robot commands, and keeps all connected clients in sync.
When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The execution engine does not advance the experiment on a timer; it waits for the wizard to trigger each step. This preserves the natural pacing of the interaction: the wizard advances only when the participant is ready, while the experiment structure ensures the protocol is followed. When the wizard triggers an action, the server dispatches the robot command, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation, keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
No two participants respond identically. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. A fully programmed robot has no answer for that third subject: the interaction stalls, or immersion breaks. The wizard exists to fill that gap: where the program runs out of instructions, the wizard draws on their knowledge of human social interaction to keep the exchange coherent. Unscripted actions give the wizard the tools to exercise that judgment in the moment. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
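The sketch below illustrates this single dispatch path. All names are illustrative assumptions; the point is that the robot command, the log entry, and the state broadcast happen together, and that unscripted actions differ only in their deviation flag.

\begin{verbatim}
// Hypothetical dispatch path; names are illustrative.
interface WizardAction { type: string; stepId: string | null }
interface Client { send(msg: string): void }
interface TrialState {
  t0: number;                      // trial start, epoch ms
  log: object[];                   // timestamped action log
  clients: Client[];               // wizard + observer connections
  dispatch(a: WizardAction): Promise<void>;  // robot bridge
}

async function handleWizardAction(trial: TrialState, a: WizardAction) {
  await trial.dispatch(a);         // 1. send the robot command
  trial.log.push({                 // 2. write the log entry
    tMs: Date.now() - trial.t0,
    actionType: a.type,
    scripted: a.stepId !== null,   // false => deviation flag
  });
  const state = JSON.stringify({ kind: "state", log: trial.log });
  for (const c of trial.clients) c.send(state); // 3. sync all views
}
\end{verbatim}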
\section{Robot Integration}
A configuration file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the configuration file.
The execution engine treats control flow elements such as branches and conditionals the same way as robot actions. They appear as action groups in the experiment and resolve at runtime, so researchers can freely mix logical decisions and physical robot behaviors when designing an experiment without any special handling.
Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and TurtleBot as an example. Actions a platform does not support (such as \texttt{raise\_arm} on TurtleBot) appear as explicitly unsupported in the configuration file rather than silently failing. The experiment itself does not change between platforms.
\begin{figure}[htbp]
\centering
@@ -142,30 +146,39 @@ Figure~\ref{fig:plugin-architecture} illustrates how the same abstract actions m
\draw[arrow] (cfg.east) -- (tb.west);
\end{tikzpicture}
\caption{Abstract experiment actions translated to platform-specific robot commands through per-platform configuration files.}
\label{fig:plugin-architecture}
\end{figure}
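A hypothetical excerpt of such a configuration file, written here as a TypeScript object for illustration, might look like the following; the actual file format and the ROS-style command names are assumptions and may differ from HRIStudio's.

\begin{verbatim}
// Hypothetical platform configuration; the real format and the
// command names are assumptions for illustration.
const turtlebotConfig = {
  platform: "turtlebot",
  actions: {
    say:          { supported: true, command: "sound_play",
                    params: ["text"] },
    move_forward: { supported: true, command: "cmd_vel",
                    params: ["distance_m", "speed_mps"] },
    // Declared explicitly rather than failing silently:
    raise_arm:    { supported: false },
  },
};
\end{verbatim}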
\section{Access Control}
I implemented access control using a role-based access control (RBAC) model. Each study has a membership list, and every member is assigned one of four roles that define a clear separation of duties: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees only what their role requires.
\begin{description}
\item[Owner.] Full control over the study: can invite or remove members, configure the study settings, and access all data.
\item[Researcher.] Can create and modify experiment designs and review all collected trial data, but cannot manage team membership.
\item[Wizard.] Can trigger actions during a live trial and view the execution interface, but cannot modify the experiment design or access other wizards' sessions.
\item[Observer.] Read-only access: can watch a live trial in real time and annotate significant moments, but cannot trigger actions or modify any data.
\end{description}
The role system also supports double-blind designs~\cite{Bartneck2024}: the Owner can restrict a Wizard's view of condition assignments, and restrict Researchers from accessing result data until the study concludes, without any changes to the underlying experiment.
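A minimal sketch of this role-to-permission mapping follows; the permission names are assumptions, not HRIStudio's actual identifiers.

\begin{verbatim}
// Minimal RBAC sketch; permission names are illustrative.
type Role = "owner" | "researcher" | "wizard" | "observer";
type Permission =
  | "manage_members" | "edit_design" | "trigger_actions"
  | "view_results"   | "annotate";

const grants: Record<Role, Permission[]> = {
  owner:      ["manage_members", "edit_design", "trigger_actions",
               "view_results", "annotate"],
  researcher: ["edit_design", "view_results", "annotate"],
  wizard:     ["trigger_actions"],
  observer:   ["annotate"],
};

// A double-blind study could withhold "view_results" from
// researchers until the study concludes.
function can(role: Role, p: Permission): boolean {
  return grants[role].includes(p);
}
\end{verbatim}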
\section{Architectural Challenges}
Two problems required specific solutions during implementation.
\begin{description}
\item[Execution latency.] During a live trial, the execution engine must respond quickly to wizard input; a noticeable delay between the button press and the robot's action can disrupt the interaction. I addressed this by maintaining a persistent connection for the duration of each trial. The connection is established once at trial start and kept open, eliminating per-action setup overhead.
\item[Multi-source synchronization.] Analysis requires aligning data streams captured at different sampling rates by different components: video, audio, action logs, and sensor data. The solution is a shared time reference: every data source records its timestamps relative to the same trial start time, $t_0$, so the Analysis interface can align all tracks without requiring manual calibration. This is the timestamp structure shown in Figure~\ref{fig:trial-record}.
\end{description}
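As a concrete illustration of the shared time reference, the sketch below converts absolute capture timestamps from independently recorded tracks onto the common trial-relative timeline; the track names and sample shape are assumptions.

\begin{verbatim}
// Align independently captured tracks onto one timeline using
// the shared trial start t0 (epoch ms). Names are illustrative.
interface Sample { capturedAt: number; payload: unknown }

function alignToTrial(
  t0: number,
  tracks: Map<string, Sample[]>,  // "video", "audio", "actions", ...
): Map<string, { tMs: number; payload: unknown }[]> {
  const aligned = new Map<string, { tMs: number; payload: unknown }[]>();
  for (const [name, samples] of tracks) {
    aligned.set(name, samples.map((s) => ({
      tMs: s.capturedAt - t0,  // same reference for every track
      payload: s.payload,
    })));
  }
  return aligned;
}
\end{verbatim}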
\section{Implementation Status}
HRIStudio has reached minimum viable product status. The Design, Execution, and Analysis interfaces are operational. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. The platform can run a controlled WoZ study without modification.
Remaining work includes support for studies that use more than one robot at a time and validation of the configuration file approach on robot platforms beyond NAO6.
\section{Chapter Summary}
This chapter described how HRIStudio realizes the design concepts from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. Per-platform configuration files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is at minimum viable product status and can run a controlled WoZ study today. HRIStudio is one realization of these concepts; the contribution lies in the design principles themselves, which any implementation could adopt.
@@ -1,7 +1,7 @@
\chapter{Pilot Validation Study}
\label{ch:evaluation}
Chapters~\ref{ch:design} and~\ref{ch:implementation} described a platform designed to address two specific problems in WoZ-based HRI research: the high technical barrier that limits who can design robot interactions, and the methodological inconsistency that limits how reproducible those interactions are once designed. HRIStudio is a reference implementation of those design concepts; the underlying contribution is not the tool itself but the principles that govern it. A study comparing HRIStudio against existing practice therefore tests whether those design concepts produce measurably better outcomes in the hands of real researchers. This chapter describes that study: participant selection, task, procedure, and measures.
\section{Research Questions}
@@ -13,31 +13,31 @@ I hypothesized that wizards using HRIStudio would more completely and correctly
\section{Study Design}
I used a between-subjects design~\cite{Bartneck2024}. I randomly assigned each wizard participant to one of two conditions: HRIStudio or Choregraphe. Both groups received the same task, the same time allocation, and the same training structure. Measuring each participant in only one condition prevents carryover effects that would arise from using the same tool twice in sequence.
Two types of participants took part with distinct roles. Wizards were faculty members drawn from across departments who designed and ran the robot interaction. Test subjects were undergraduate students who interacted with the robot during the live trial. This separation ensures that the evaluation captures both the design experience and the quality of the resulting interaction.
\section{Participants}
\subsection{Wizards}
I recruited eight Bucknell University faculty members drawn from across departments to serve as wizards. I deliberately recruited from both ends of the programming experience spectrum: four had substantial programming backgrounds, and four described themselves as non-programmers or having minimal coding experience. This cross-departmental recruitment was intentional. A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email. Participation was framed as a voluntary software evaluation unrelated to any professional obligations.
\subsection{Test Subjects}
I recruited eight undergraduate students from Bucknell University to serve as test subjects. Their role was to interact with the robot during the live trial portion of each wizard's session. I screened participants to ensure that no test subject was enrolled in a course taught by the wizard they were paired with, to eliminate any risk of coercion. I recruited test subjects through campus flyers inviting volunteers to interact with a robot for approximately 15 minutes. I provided all participants with international snacks and refreshments upon arrival, regardless of whether they completed the full session.
\subsection{Sample Size Rationale}
With $N = 16$ total participants, this study is small by the standards of a mature research program. That is intentional and appropriate given three constraints. First, this is an honors thesis project conducted over two academic semesters by a single undergraduate researcher with no funded research assistant support. The total person-hours available for participant recruitment, scheduling, session facilitation, and data processing are genuinely bounded. Second, the scope of the study is validation rather than definitive evaluation: the goal is to determine whether HRIStudio produces measurably different outcomes from Choregraphe and to identify failure modes, not to establish effect sizes for a broad population. Third, recruiting faculty from outside computer science for a 75-minute technology evaluation at a small liberal arts university is practically difficult. The target population --- domain experts with no prior robotics tool exposure --- is limited in size and has high competing time demands. Eight participants span the available pool without relaxing the inclusion criteria.
This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{Steinfeld2009}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof.
\section{Task}
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Dara, narrates her discovery of an anomalous glowing rock on Mars, asks the participant a comprehension question about the story, and delivers one of two responses depending on whether the answer is correct. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:materials}.
The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching based on participant input, and a defined conclusion. This exercises the core features of both tools and produces an artifact that can be evaluated against a clear specification.
@@ -55,15 +55,15 @@ Each wizard completed a single 75-minute session structured in four phases. Test
\subsection{Phase 1: Training (15 minutes)}
I opened each session with a standardized tutorial tailored to the wizard's assigned tool. The tutorial covered how to create speech actions, specify gestures, define conditional branches, and save the completed design. Training was intentionally brief to simulate a domain expert encountering a new tool without dedicated onboarding. I answered clarification questions but did not offer hints about the design challenge.
\subsection{Phase 2: Design Challenge (30 minutes)}
The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. I observed silently and recorded a screen capture of the wizard's workflow throughout. I noted time to completion, help requests, and any observable errors or misconceptions. If the wizard declared completion before the 30-minute limit, the remaining time was used to review and refine the design.
\subsection{Phase 3: Live Trial (15 minutes)}
After the design phase, a test subject entered the room and the wizard ran their completed program to control the robot during an actual interaction. I video-recorded the full trial to capture robot behavior and timing. I told the test subject they were helping evaluate the robot's performance, not being evaluated themselves.
\subsection{Phase 4: Debrief (15 minutes)}
@@ -72,19 +72,19 @@ Following the live trial, the wizard exported their completed project file and c
\section{Measures}
\label{sec:measures}
The study collected four measures, two primary and two supplementary.
\subsection{Design Fidelity Score}
The Design Fidelity Score (DFS) measures how completely and correctly the wizard implemented the paper specification. I evaluated the exported project file against five criteria: whether all four interaction steps were present, whether robot speech matched the specification word-for-word, whether gestures were assigned to the correct steps, whether the conditional branch triggered on the correct condition, and whether both response branches were complete and correctly ordered. I scored each criterion as met or not met; the DFS is the proportion of criteria satisfied.
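Under this scoring, a design that satisfies four of the five criteria earns $\mathrm{DFS} = 4/5 = 0.8$.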
This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained what the wizard could recognize and fewer than 6\% described wizard training procedures, meaning the vast majority of WoZ studies never verified whether the wizard's design matched any formal specification. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI, and the preliminary design of HRIStudio identified specification adherence as a primary evaluation target~\cite{OConnor2024}. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses accessibility: did the tool allow a non-expert to produce a correct design?
\subsection{Execution Reliability Score}
The Execution Reliability Score (ERS) measures whether the designed interaction executed as intended during the live trial. I reviewed the video recording against the specification and the wizard's design. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch resolved correctly based on the test subject's answer, and whether any errors, disconnections, or hangs occurred. The score is the proportion of the interaction that executed without error.
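For example, a trial in which one of, say, ten scored elements fails to execute correctly would earn $\mathrm{ERS} = 0.9$.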
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent. Without an execution-level metric, a study could report a technically correct design that nonetheless failed during the live trial due to timing errors, disconnections, or mishandled branches, exactly the kind of problem HRIStudio was designed to detect and log~\cite{OConnor2024, OConnor2025}. The ERS captures those deviations quantitatively. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses reproducibility: did the design translate reliably into execution?
\subsection{System Usability Scale}
@@ -114,10 +114,10 @@ System Usability Scale & Wizard's perceived usability of the assigned tool & Deb
Time-to-Completion \& Help Requests & Task duration and support requests during design & Throughout design phase & Supplementary \\
\hline
\end{tabular}
\caption{Measurement instruments used in the pilot validation study.}
\label{tbl:measurement_instruments}
\end{table}
\section{Chapter Summary}
This chapter described a pilot between-subjects study I designed to test whether the design principles formalized in Chapters~\ref{ch:design} and~\ref{ch:implementation} produce measurably different outcomes from existing practice. Eight wizard participants (four with programming backgrounds and four without) each designed and ran the Interactive Storyteller task on a NAO robot using either HRIStudio or Choregraphe. I measured design fidelity against the written specification, execution reliability during the live trial, perceived usability via the SUS, and supplementary timing and help data. Chapter~\ref{ch:results} presents the results.
@@ -7,7 +7,7 @@
\usepackage{array} %Extended column types and \arraybackslash
\usepackage{tabularx} %Auto-width table columns
\usepackage{tikz} %For programmatic diagrams
\usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing}
\usepackage[
hidelinks,
linktoc=all,