Add appendix on AI-assisted development workflow for HRIStudio
This commit introduces a new appendix detailing the role of AI coding assistants in the development of HRIStudio. It covers the context of the project, tools used, division of responsibility, interaction patterns, and reflections on research integrity. The workflow is documented to provide transparency and insight into the development process, emphasizing the collaboration between human decisions and AI assistance.
[Four binary image assets are added in this commit; their contents are not shown in the diff. Sizes: 230 KiB, 254 KiB, 260 KiB, and 297 KiB.]
@@ -9,7 +9,7 @@ To build the social robots of tomorrow, researchers must study how people respon
Social robotics, a subfield of HRI, focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human reactions to robot behaviors are not always predictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.
To overcome this limitation, researchers use the WoZ technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the ``great and powerful'' Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In WoZ experiments, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher may use a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.
Despite its versatility, WoZ research faces two critical challenges. The first is \emph{The Accessibility Problem}: many non-programmers, such as experts in psychology or sociology, may find it challenging to conduct their own studies without engineering support. The second is \emph{The Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. Because these tools are tightly coupled to particular hardware, running the same social interaction script on a different robot platform typically requires rebuilding the implementation from scratch. These tools are rarely shared, making it difficult for a researcher to reproduce the same study across different robot platforms or for other labs to replicate results.
@@ -3,7 +3,7 @@
This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis within the context of prior work on this topic.
As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. This thesis is situated within a specific subset of HRI activity: social robotics, a subfield concerned with robots designed for direct social interaction with humans, and more narrowly within that, WoZ experiments used to prototype and evaluate social robot behaviors. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers in WoZ-based HRI studies. First, high technical barriers prevent many domain experts from conducting independent studies. Second, each research group builds custom software for specific robots, creating tool fragmentation across the field.
\section{Existing WoZ Platforms and Tools}
@@ -253,7 +253,7 @@ To ensure that data from every experimental phase remains traceable, the system
\subsection{Architectural Layers}
HRIStudio separates its communicative and functional responsibilities into distinct layers, in a manner analogous to the layered reference models used in networking software. More specifically, the system is organized as a three-layer architecture, as shown in Figure~\ref{fig:three-tier}, each layer with a specific responsibility:
\begin{description}
\item[User Interface layer.] Runs in researchers' web browsers and exposes the three interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging and dropping experiment components, and reviewing experimental results.
@@ -11,6 +11,15 @@ I organized the system into three layers: User Interface, Application Logic, and
I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
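To illustrate the benefit concretely, the sketch below shows a single action type shared by the design interface, the server, and the dispatch code; the type and field names (\texttt{ExperimentAction}, \texttt{ActionKind}) are hypothetical and do not reproduce HRIStudio's actual definitions. If the shape of an action changes, every consumer of the type fails to compile rather than failing during a trial.
\begin{verbatim}
// shared/types.ts -- one definition consumed by every layer (hypothetical names)
export type ActionKind = "say" | "gesture" | "wait";

export interface ExperimentAction {
  id: string;
  kind: ActionKind;
  text?: string;        // spoken text; only meaningful when kind === "say"
  gesture?: string;     // gesture identifier resolved by the robot plugin
  durationMs?: number;  // pause length; only meaningful when kind === "wait"
}

// ui/designer.ts -- the Design interface builds actions of the same type.
export function makeSpeechAction(id: string, text: string): ExperimentAction {
  return { id, kind: "say", text };
}

// server/dispatch.ts -- the server consumes the same type when dispatching.
export function describeAction(action: ExperimentAction): string {
  if (action.kind === "say") return `say: ${action.text ?? ""}`;
  if (action.kind === "gesture") return `gesture: ${action.gesture ?? "unknown"}`;
  return `wait ${action.durationMs ?? 0} ms`;
}
\end{verbatim}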
HRIStudio is released as open-source software under the MIT License, with the application hosted at a public repository~\cite{HRIStudioRepo} and the companion robot plugin repository hosted separately~\cite{RobotPluginsRepo}. Both are available for inspection, extension, and deployment by other research groups.
\subsection{Working with AI Coding Assistants}
\label{sec:ai-ws}
The implementation described in this chapter, a full-stack TypeScript application spanning user interface, application logic, persistent storage, and real-time robot control, could not have been completed at this scale within the timeframe of this thesis without the use of AI coding assistants. I distinguish clearly between the engineering and implementation roles in this work: I architected the system, made the design decisions documented in Chapter~\ref{ch:design} and this chapter, specified the behavior and constraints of each component, and reviewed and integrated all code before it entered the codebase. AI agents acted as software developers working under that direction, producing TypeScript code in response to the specifications I provided and the feedback I gave as the implementation evolved. The division of labor was consistent throughout: I engineered, they implemented.
The tools I used in this capacity spanned several vendors and interaction paradigms, and the set evolved as the AI landscape changed over the course of the project. Claude~\cite{Anthropic2024Claude} was the conversational model I relied on most consistently for design discussions and code review. I used Claude Code~\cite{AnthropicClaudeCode}, OpenCode~\cite{OpenCode}, the Gemini CLI~\cite{GeminiCLI}, and Google Antigravity~\cite{GoogleAntigravity} as terminal- and editor-integrated coding agents for implementing the features I specified; the Zed editor~\cite{ZedEditor} served as the surrounding development environment and provided its own AI-assisted editing features. These tools overlapped in places, but I generally used one at a time and switched between them as new capabilities became available and as I learned which tool suited which kind of work. Appendix~\ref{app:ai_workflow} documents this workflow in more detail: the division of responsibility between me and the agents, the kinds of tasks each category of tool handled well, and the limits I ran into.
\section{Experiment Storage and Trial Logging}
The system saves experiment descriptions to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification.
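As a rough illustration of what a saved specification might contain (the shape below is an assumption for exposition, not the platform's actual schema), an experiment can be represented as an ordered list of steps, each listing the actions available to the wizard:
\begin{verbatim}
// Hypothetical shape of a saved experiment specification.
interface ExperimentSpec {
  id: string;
  name: string;
  robotPlugin: string;       // which plugin file resolves the actions
  steps: ExperimentStep[];   // ordered protocol steps
}

interface ExperimentStep {
  id: string;
  title: string;
  actions: StepAction[];     // actions the wizard can trigger in this step
}

interface StepAction {
  id: string;
  type: "say" | "gesture" | "wait";
  parameters: Record<string, string | number>;
}

// A minimal "Interactive Storyteller"-style specification.
const spec: ExperimentSpec = {
  id: "exp-001",
  name: "Interactive Storyteller",
  robotPlugin: "nao",
  steps: [
    {
      id: "step-1",
      title: "Greeting",
      actions: [
        { id: "a1", type: "say", parameters: { text: "Hello! Want to hear a story?" } },
        { id: "a2", type: "gesture", parameters: { name: "wave" } },
      ],
    },
  ],
};

console.log(`${spec.name}: ${spec.steps.length} step(s)`);
\end{verbatim}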
@@ -85,10 +94,19 @@ When a trial begins, the system creates a new trial record linked to that experi
\label{fig:trial-record}
\end{figure}
Video and audio are recorded locally in the wizard's browser during the trial rather than streamed to the server in real time. The wizard's browser is the canonical recording client because the wizard is the only role required for a trial to run; observer and researcher roles connect in read-only capacities and do not capture media. Recording locally prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the wizard's browser transfers the complete recordings to the server and associates them with the trial record. The Analysis interface can align video and audio with the logged actions without any manual synchronization, because the timestamp when recording starts is logged alongside the action log.
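The alignment itself reduces to timestamp arithmetic. The sketch below is illustrative only, with assumed field names such as \texttt{recordingStartedAt}: the offset into the video at which a logged action occurred is the difference between the action's timestamp and the logged recording-start time.
\begin{verbatim}
// Hypothetical records, mirroring the description above.
interface LoggedAction {
  label: string;
  timestamp: number;       // epoch milliseconds when the action was logged
  deviation: boolean;      // true when the wizard acted outside the script
}

interface TrialMedia {
  recordingStartedAt: number;  // epoch ms logged when the browser began recording
  videoUrl: string;
}

// Offset (in seconds) into the recording at which an action occurred.
function videoOffsetSeconds(media: TrialMedia, action: LoggedAction): number {
  return Math.max(0, (action.timestamp - media.recordingStartedAt) / 1000);
}

const media: TrialMedia = { recordingStartedAt: 1_700_000_000_000, videoUrl: "trial-42.webm" };
const action: LoggedAction = { label: "say: encouragement", timestamp: 1_700_000_012_500, deviation: false };

console.log(videoOffsetSeconds(media, action)); // 12.5 -> seek the player to 12.5 s
\end{verbatim}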
The system stores structured and media data separately. Experiment specifications and trial records are stored in the same structured database, which makes it efficient to query across trials (for example, retrieving all trials for a specific participant or comparing action timing across conditions). Video and audio files are stored in a dedicated file store, since their size makes them unsuitable for a database and the system never queries their content directly.
Figure~\ref{fig:trial-report} shows the Analysis interface reconstructing a completed trial. The recorded video is presented alongside a synchronized action log, with each logged event linked to its moment in the recording so researchers can jump directly to the corresponding interaction without manual cross-referencing.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.95\textwidth]{assets/trial-report.png}
\caption{The HRIStudio Analysis interface showing a completed trial with video and a synchronized, timestamped action log.}
\label{fig:trial-report}
\end{figure}
\section{The Execution Engine}
The execution engine is the component that runs a trial: it loads the experiment, manages the wizard's connection, sends robot commands, and keeps all connected clients in sync.
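A minimal sketch of the synchronization responsibility follows; the class and event names are assumptions, and the real transport layer is not shown. The idea is simply that every trial event is fanned out to all connected clients so their views never drift apart.
\begin{verbatim}
// Illustrative only: an in-memory event fan-out for connected trial clients.
type TrialEvent =
  | { kind: "step_changed"; stepId: string }
  | { kind: "action_executed"; actionId: string; deviation: boolean }
  | { kind: "trial_ended" };

type ClientSend = (event: TrialEvent) => void;

class TrialBroadcaster {
  private clients = new Map<string, ClientSend>();

  // Called when a wizard, observer, or researcher client connects.
  addClient(clientId: string, send: ClientSend): void {
    this.clients.set(clientId, send);
  }

  removeClient(clientId: string): void {
    this.clients.delete(clientId);
  }

  // Every state change goes through here, so all clients stay in sync.
  broadcast(event: TrialEvent): void {
    for (const send of this.clients.values()) {
      send(event);
    }
  }
}

const broadcaster = new TrialBroadcaster();
broadcaster.addClient("wizard", (e) => console.log("wizard sees", e));
broadcaster.addClient("observer", (e) => console.log("observer sees", e));
broadcaster.broadcast({ kind: "step_changed", stepId: "step-1" });
\end{verbatim}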
@@ -97,6 +115,15 @@ When a trial begins, the server loads the experiment and maintains a live connec
No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. Unscripted actions give the wizard a way to respond when deviations from the script are required, and to record how those interactions unfold. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than being hidden within it.
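As a concrete illustration (a sketch under assumed names, not the actual logging code), an unscripted action's log entry differs from a scripted one only in its \texttt{deviation} flag:
\begin{verbatim}
// Illustrative action-log entry; field names are assumptions.
interface ActionLogEntry {
  trialId: string;
  actionType: string;
  parameters: Record<string, string>;
  timestamp: number;   // epoch milliseconds
  deviation: boolean;  // true when triggered from the manual controls
}

function logUnscriptedAction(
  trialId: string,
  actionType: string,
  parameters: Record<string, string>,
): ActionLogEntry {
  return { trialId, actionType, parameters, timestamp: Date.now(), deviation: true };
}

const entry = logUnscriptedAction("trial-42", "say", {
  text: "That's a great question! Let me think about it.",
});
console.log(entry.deviation); // true -- the deviation is explicit in the record
\end{verbatim}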
Figure~\ref{fig:execution-view} shows the Execution interface as it appears to a wizard during a live trial. The current step is highlighted in the protocol sidebar, the available actions for that step are surfaced as triggerable buttons, and the wizard has manual-control affordances for introducing unscripted actions that the system will flag as deviations in the trial log.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.95\textwidth]{assets/execution-view.png}
\caption{The HRIStudio Execution interface during a live trial, showing the current step, available actions, and manual deviation controls.}
\label{fig:execution-view}
\end{figure}
\section{Robot Integration}
A plugin file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the plugin file.
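The sketch below illustrates the idea of plugin-driven dispatch; the plugin format, topics, and names are hypothetical and do not reproduce the repository's actual plugin schema. The engine resolves an abstract action against the loaded plugin and assembles a robot-specific message without containing any hardware knowledge itself.
\begin{verbatim}
// Hypothetical plugin descriptor: maps abstract actions to robot commands.
interface RobotPlugin {
  platform: string;
  actions: Record<string, { topic: string; template: Record<string, unknown> }>;
}

const naoPlugin: RobotPlugin = {
  platform: "nao",
  actions: {
    say:     { topic: "/speech/say",        template: { text: "" } },
    gesture: { topic: "/animation/trigger", template: { name: "" } },
  },
};

// The engine looks up the action type, fills the template, and sends the
// message to the bridge process; it never hard-codes robot specifics.
function buildRobotMessage(
  plugin: RobotPlugin,
  actionType: string,
  parameters: Record<string, unknown>,
): { topic: string; payload: Record<string, unknown> } {
  const action = plugin.actions[actionType];
  if (!action) {
    throw new Error(`Plugin '${plugin.platform}' does not support '${actionType}'`);
  }
  return { topic: action.topic, payload: { ...action.template, ...parameters } };
}

const msg = buildRobotMessage(naoPlugin, "say", { text: "Hello there!" });
console.log(msg); // { topic: "/speech/say", payload: { text: "Hello there!" } }
\end{verbatim}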
@@ -13,7 +13,7 @@ I hypothesized that HRIStudio would improve both accessibility and reproducibili
\section{Study Design}
I used what Bartneck et al.~\cite{Bartneck2024} call a between-subjects design, in which each participant is assigned to only one condition. To ensure that programming experience was balanced across conditions, I stratified assignment by self-reported programming background: each wizard was first classified into one of three strata (\emph{None}, \emph{Moderate}, or \emph{Extensive} programming experience), and then randomly assigned within their stratum to one of the two conditions (HRIStudio or Choregraphe). This produced a design in which each condition contained exactly one wizard at each experience level, allowing the tool effect to be evaluated without confounding from the distribution of programming experience. Both groups received the same task, the same time allocation, and a similar training structure. Measuring each participant in only one condition prevents carryover effects, meaning performance changes caused by prior exposure to another condition rather than by the assigned condition itself.
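A sketch of this assignment procedure is shown below (illustrative code, not the procedure I actually ran): wizards are grouped by self-reported experience stratum and then split at random within each stratum between the two conditions.
\begin{verbatim}
// Illustrative stratified assignment for a two-condition between-subjects design.
type Stratum = "None" | "Moderate" | "Extensive";
type Condition = "HRIStudio" | "Choregraphe";

interface Wizard { id: string; stratum: Stratum; }

function assignStratified(wizards: Wizard[]): Map<string, Condition> {
  const assignment = new Map<string, Condition>();
  const strata: Stratum[] = ["None", "Moderate", "Extensive"];
  for (const stratum of strata) {
    // Simplistic shuffle of this stratum's members, then alternate conditions.
    const group = wizards
      .filter((w) => w.stratum === stratum)
      .sort(() => Math.random() - 0.5);
    group.forEach((w, i) => {
      assignment.set(w.id, i % 2 === 0 ? "HRIStudio" : "Choregraphe");
    });
  }
  return assignment;
}

const wizards: Wizard[] = [
  { id: "W-01", stratum: "None" },      { id: "W-05", stratum: "None" },
  { id: "W-02", stratum: "Moderate" },  { id: "W-04", stratum: "Moderate" },
  { id: "W-03", stratum: "Extensive" }, { id: "W-06", stratum: "Extensive" },
];
console.log(assignStratified(wizards)); // one wizard per condition in each stratum
\end{verbatim}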
\section{Participants}
@@ -48,6 +48,26 @@ The control condition used Choregraphe \cite{Pot2009}, a proprietary visual prog
The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through plugin files, though for this study both tools controlled the same NAO platform.
Figure~\ref{fig:design-tool-compare} places the two design environments side by side. On the left, Choregraphe's behavior-box canvas (Figure~\ref{fig:choregraphe-ui}) lets the wizard wire nodes and transitions in a finite-state-machine layout. On the right, HRIStudio's experiment designer (Figure~\ref{fig:hristudio-designer}) presents the same protocol as a vertical action timeline with dedicated blocks for speech, gesture, and conditional branching.
\begin{figure}[htbp]
\centering
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/choregraphe.png}
\subcaption{Choregraphe: behavior-box canvas with nodes and transitions.}
\label{fig:choregraphe-ui}
\end{minipage}\hfill
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/experiment-designer.png}
\subcaption{HRIStudio: vertical action timeline with structured step and action blocks.}
\label{fig:hristudio-designer}
\end{minipage}
\caption{The two design environments compared. Each wizard used one of these tools to implement the Interactive Storyteller specification.}
\label{fig:design-tool-compare}
\end{figure}
\section{Procedure}
Each wizard completed a single 60-minute session structured in four phases.
@@ -77,7 +97,7 @@ The study collected five measures, two primary and three supplementary, operatio
I define the Design Fidelity Score (DFS) as a measure of how completely and correctly the wizard implemented the specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored on whether it is present, correct, and independently achieved.
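To make the scoring rule concrete, the sketch below computes a weighted score from rubric items; the items and weights shown are placeholders, and the real nine-item rubric appears in Appendix~\ref{app:blank_templates}. Full weight is earned only when an item is both present and correct, and assistance marks are recorded without affecting the total.
\begin{verbatim}
// Illustrative DFS computation; weights and items are placeholders, not the real rubric.
interface RubricItem {
  description: string;
  weight: number;       // points available for this criterion
  present: boolean;
  correct: boolean;
  assisted: boolean;    // recorded alongside the score, never subtracted from it
}

function designFidelityScore(items: RubricItem[]): number {
  const possible = items.reduce((sum, item) => sum + item.weight, 0);
  const earned = items.reduce(
    (sum, item) => sum + (item.present && item.correct ? item.weight : 0),
    0,
  );
  return (earned / possible) * 100; // normalized to a 0--100 scale
}

const items: RubricItem[] = [
  { description: "Greeting speech action", weight: 10, present: true,  correct: true,  assisted: false },
  { description: "Wave gesture",           weight: 10, present: true,  correct: false, assisted: true  },
  { description: "Conditional branch",     weight: 20, present: false, correct: false, assisted: false },
];
console.log(designFidelityScore(items)); // 25 -- only the first item earns its weight
\end{verbatim}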
The DFS rubric includes an \emph{Assisted} column. For each rubric item, I marked a T if I provided a tool-operation intervention specifically for that item during the design phase (for example, if I explained how to add a gesture node or how to wire a conditional branch). T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions (task clarification, hardware issues, or momentary forgetfulness) are not marked T, because those categories of difficulty are independent of the tool under evaluation.
DFS is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses the question: did the tool allow a wizard to independently produce a correct design?
@@ -98,9 +118,9 @@ The System Usability Scale (SUS) is a validated 10-item questionnaire measuring
During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. The four intervention type codes are:
\begin{description}
\item[T (tool-operation).] I explained how to operate a specific feature of the assigned software tool.
\item[C (task clarification).] I clarified the written specification or an aspect of the task design.
\item[H (hardware or technical).] I addressed a robot connection issue or other technical problem outside the wizard's control.
\item[G (general).] Brief assistance not attributable to the tool or the task, such as momentary forgetfulness.
\end{description}
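A sketch of one intervention-log entry as a typed record (field names assumed for illustration) makes the structure of the log concrete:
\begin{verbatim}
// Illustrative structure of one intervention-log entry.
type InterventionCode = "T" | "C" | "H" | "G";

interface InterventionEntry {
  sessionId: string;        // e.g. "W-01"
  timestamp: string;        // clock time noted on the observer data sheet
  code: InterventionCode;   // T, C, H, or G as defined above
  rubricItem?: number;      // affected rubric item, when applicable (T marks)
  description: string;
}

const example: InterventionEntry = {
  sessionId: "W-01",
  timestamp: "00:18:42",
  code: "T",
  rubricItem: 7,
  description: "Explained how to wire a conditional branch between behavior boxes.",
};
console.log(`${example.sessionId} [${example.code}] ${example.description}`);
\end{verbatim}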
@@ -38,7 +38,7 @@ This table also presents numerical data representing the study's results, which
\subsection{Design Fidelity Score (DFS)}
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification of their assigned experiment. Scores range from 0 to 100, with full points awarded only when a component — a rubric criterion representing a required speech action, gesture, or control-flow element — is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)
Across the six participants, DFS scores divided sharply by study condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.
@@ -92,37 +92,63 @@ W-06 rated HRIStudio with a SUS score of 70. W-06, a Computer Science faculty me
HRIStudio study condition SUS scores were 90, 70, and 70 (mean 76.7). Choregraphe study condition SUS scores were 60, 75, and 42.5 (mean 59.2).
Figure~\ref{fig:results-chart} summarizes the three primary measures side-by-side. In each group, the left bar represents the Choregraphe mean and the right bar represents the HRIStudio mean. HRIStudio exceeds Choregraphe on every measure, with the largest gap on DFS (43.3 points) and the smallest on SUS (17.5 points).
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
% Axes
\draw[thick] (0,0) -- (0,6.3);
\draw[thick] (0,0) -- (11.2,0);

% Y-axis ticks and labels (0--100, with 1 unit = 0.06 cm)
\foreach \tick/\val in {0/0, 1.2/20, 2.4/40, 3.6/60, 4.8/80, 6.0/100} {
\draw (-0.08, \tick) -- (0, \tick);
\node[left, font=\footnotesize] at (-0.05, \tick) {\val};
}
\node[rotate=90, font=\small] at (-1.05, 3.0) {Mean Score (0--100)};

% Horizontal gridlines
\foreach \tick in {1.2, 2.4, 3.6, 4.8, 6.0} {
\draw[gray!25, thin] (0.02, \tick) -- (11.2, \tick);
}

% DFS group
\fill[gray!40, draw=black] (1.0, 0) rectangle (2.3, 3.402);
\fill[gray!75, draw=black] (2.4, 0) rectangle (3.7, 6.000);
\node[font=\footnotesize] at (1.65, 3.60) {56.7};
\node[font=\footnotesize] at (3.05, 6.20) {100};
\node[font=\small] at (2.35, -0.38) {DFS};

% ERS group
\fill[gray!40, draw=black] (4.5, 0) rectangle (5.8, 4.002);
\fill[gray!75, draw=black] (5.9, 0) rectangle (7.2, 5.802);
\node[font=\footnotesize] at (5.15, 4.20) {66.7};
\node[font=\footnotesize] at (6.55, 6.00) {96.7};
\node[font=\small] at (5.85, -0.38) {ERS};

% SUS group
\fill[gray!40, draw=black] (8.0, 0) rectangle (9.3, 3.552);
\fill[gray!75, draw=black] (9.4, 0) rectangle (10.7, 4.602);
\node[font=\footnotesize] at (8.65, 3.75) {59.2};
\node[font=\footnotesize] at (10.05, 4.80) {76.7};
\node[font=\small] at (9.35, -0.38) {SUS};

% Legend
\fill[gray!40, draw=black] (2.6, -1.25) rectangle (3.0, -1.00);
\node[anchor=west, font=\footnotesize] at (3.1, -1.125) {Choregraphe};
\fill[gray!75, draw=black] (7.0, -1.25) rectangle (7.4, -1.00);
\node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio};

\end{tikzpicture}
\caption{Mean scores by condition across the three primary outcome measures. Within each group, the left bar is Choregraphe and the right bar is HRIStudio.}
\label{fig:results-chart}
\end{figure}
\section{Supplementary Measures}
\subsection{Session Timing}
Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Training} & \textbf{Design} & \textbf{Trial} & \textbf{Debrief} & \textbf{Total} \\
\hline
W-01 & 15 min & 35 min & 5 min & 5 min & 60 min \\
\hline
W-02 & 7 min & 24 min & 5 min & 5 min & 41 min \\
\hline
W-03 & 12 min & 37 min & 5 min & 5 min & 59 min \\
\hline
W-04 & 17 min & 35 min & 4 min & 4 min & 60 min \\
\hline
W-05 & 6 min & 18 min & 4 min & 4 min & 32 min \\
\hline
W-06 & 8 min & 21 min & 3 min & 5 min & 37 min \\
\hline
\end{tabular}
\caption{Time spent in each session phase per wizard participant.}
\label{tbl:timing}
\end{table}
W-01's design phase extended to 35 minutes, five minutes over the 30-minute allocation, compressing the trial and debrief to 5 minutes each. Despite this, W-01 declared the design complete rather than abandoning it, and the robot executed a recognizable version of the specification during the trial.
W-02's training phase concluded in 7 minutes, roughly half the standard 15-minute allocation. This reflects HRIStudio's more intuitive onboarding rather than simply W-02's technical background: the platform's guided workflow and timeline-based model required less explanation before the wizard was ready to begin the design phase. W-02's design phase then concluded in 24 minutes, within the allocation, and the trial ran for approximately five minutes.
@@ -137,6 +163,59 @@ W-06's training phase concluded in 8 minutes and the design phase completed in 2
Across all six sessions, Choregraphe design phases averaged approximately 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs before the session time limit, while W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across three sessions, all within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 7 minutes.
Figure~\ref{fig:timing-chart} compares the per-condition means for training, design, and total session duration. The gap is concentrated in the design phase and carries through to the total session length; training duration also diverges, with Choregraphe wizards requiring roughly twice as long to reach readiness.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
% Axes (1 minute = 0.1 cm, so 60 min = 6 cm)
\draw[thick] (0,0) -- (0,6.3);
\draw[thick] (0,0) -- (11.2,0);

% Y-axis ticks and labels (0--60 minutes)
\foreach \tick/\val in {0/0, 1/10, 2/20, 3/30, 4/40, 5/50, 6/60} {
\draw (-0.08, \tick) -- (0, \tick);
\node[left, font=\footnotesize] at (-0.05, \tick) {\val};
}
\node[rotate=90, font=\small] at (-1.05, 3.0) {Mean Duration (minutes)};

% Horizontal gridlines
\foreach \tick in {1,2,3,4,5,6} {
\draw[gray!25, thin] (0.02, \tick) -- (11.2, \tick);
}

% Training group — Choregraphe 14.7, HRIStudio 7.0
\fill[gray!40, draw=black] (1.0, 0) rectangle (2.3, 1.47);
\fill[gray!75, draw=black] (2.4, 0) rectangle (3.7, 0.70);
\node[font=\footnotesize] at (1.65, 1.67) {14.7};
\node[font=\footnotesize] at (3.05, 0.90) {7.0};
\node[font=\small] at (2.35, -0.38) {Training};

% Design group — Choregraphe 35.7, HRIStudio 21.0
\fill[gray!40, draw=black] (4.5, 0) rectangle (5.8, 3.57);
\fill[gray!75, draw=black] (5.9, 0) rectangle (7.2, 2.10);
\node[font=\footnotesize] at (5.15, 3.77) {35.7};
\node[font=\footnotesize] at (6.55, 2.30) {21.0};
\node[font=\small] at (5.85, -0.38) {Design};

% Total group — Choregraphe 59.7, HRIStudio 36.7
\fill[gray!40, draw=black] (8.0, 0) rectangle (9.3, 5.97);
\fill[gray!75, draw=black] (9.4, 0) rectangle (10.7, 3.67);
\node[font=\footnotesize] at (8.65, 6.17) {59.7};
\node[font=\footnotesize] at (10.05, 3.87) {36.7};
\node[font=\small] at (9.35, -0.38) {Total Session};

% Legend
\fill[gray!40, draw=black] (2.6, -1.25) rectangle (3.0, -1.00);
\node[anchor=west, font=\footnotesize] at (3.1, -1.125) {Choregraphe};
\fill[gray!75, draw=black] (7.0, -1.25) rectangle (7.4, -1.00);
\node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio};

\end{tikzpicture}
\caption{Mean phase durations (in minutes) by condition. Within each group, the left bar is Choregraphe and the right bar is HRIStudio.}
\label{fig:timing-chart}
\end{figure}
\subsection{Intervention Log}
W-01 generated a high volume of help requests during the design phase, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern, understanding the goal but struggling with the mechanism, is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
@@ -11,7 +11,7 @@ The first research question asked whether HRIStudio enables domain experts witho
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a Choregraphe mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe study condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio study condition produced the highest (90, W-02). Because assignment was stratified by programming background, each condition contains exactly one wizard with \emph{None} experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience, enabling a direct cross-background comparison: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference. It is worth noting that only one participant (W-01, Digital Humanities) came from a non-STEM discipline; the remaining five wizards held backgrounds in Computer Science, Chemical Engineering, or Logic and Philosophy of Science, a composition that limits claims about accessibility for humanities-domain researchers.
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility research question. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
@@ -29,15 +29,15 @@ ERS scores reflect the downstream effect of these design differences. Choregraph
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied.
Across all six sessions, design phase overruns are concentrated in the Choregraphe study condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe study condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator, and supports the conclusion that the overrun pattern is attributable to assigned tool rather than wizard background alone. Because programming experience was balanced across conditions by stratified assignment, the design-phase timing difference cannot be attributed to prior programming experience.
\section{Comparison to Prior Work}
Because assignment was stratified by programming experience, the overall 17.5-point gap in both means reflects a genuine tool-level effect rather than a sampling artifact. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study confirms that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards---spanning no, moderate, and extensive programming experience---achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform.
|
||||||
|
|
||||||
The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data confirms this design intent: no speech content deviations occurred across all three HRIStudio sessions, and the HRIStudio ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial.
The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. Because programming experience is balanced across conditions by design, the overall 17.5-point gap in the two conditions' means reflects a genuine tool-level effect rather than an artifact of the sample's background composition. The gap is largest at the Moderate experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the Extensive level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference while the accessibility advantage remains pronounced for non-programmers and moderate programmers.
\section{Limitations}
@@ -49,10 +49,10 @@ This study has several limitations that must be considered when interpreting the
\textbf{Single task.} Both study conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Uncontrolled dimensions.} Programming experience was balanced across conditions by stratified assignment (see Section~\ref{sec:measures} and Chapter~\ref{ch:evaluation}): each condition contains one wizard at each of the three experience levels (\emph{None}, \emph{Moderate}, \emph{Extensive}). This controls for programming background as a potential confounder but does not extend to other dimensions. The small $N$ means that balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was not assessed or controlled and remains a source of variability not addressed in this pilot.
\textbf{Platform version.} HRIStudio is continuously evolving. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. HRIStudio wizards uniformly achieved perfect design fidelity (DFS 100) and near-perfect execution reliability (mean ERS 96.7), while Choregraphe wizards averaged DFS 56.7 and ERS 66.7, with design overruns in all three sessions and no session completing without researcher guidance. The W-01 content deviation (see Section~\ref{sec:results-qualitative}) illustrates the reproducibility problem concretely; its absence in all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across study conditions by stratified assignment, strengthening the cross-background comparisons. The limitations of this pilot study, including sample size, task simplicity, and the single-session design, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
@@ -0,0 +1,107 @@
\chapter{AI-Assisted Development Workflow}
\label{app:ai_workflow}

This appendix documents the role that AI coding assistants played in the construction of HRIStudio. It is included both for transparency about how the system was built and because the workflow itself is, in my view, one of the more interesting artifacts produced by the project. Section~\ref{sec:ai-ws} in Chapter~\ref{ch:implementation} introduces the topic briefly; here I describe the division of labor, the specific tools I used, the tasks each handled well, the limits I ran into, and the integrity controls I maintained between implementation work and the evaluation reported in Chapter~\ref{ch:results}.
\section{Context}
\label{sec:ai-context}

HRIStudio was built by a single undergraduate in parallel with a full course load, a thesis writeup, and the pilot validation study described in Chapter~\ref{ch:evaluation}. The feature surface described in Chapters~\ref{ch:design} and~\ref{ch:implementation} is larger than what a solo developer on that schedule could reasonably have produced without assistance, and the deadline constraints did not allow for the kind of team that a system of this scope would normally involve. AI coding assistants made the scope tractable. They did not replace design judgment, but they substantially reduced the cost of the mechanical work that sits between a well-specified design and a working feature: scaffolding new modules, implementing well-defined CRUD and validation code, applying consistent patterns across files, and producing the many small edits that a project of this size accumulates.

The set of tools available to a solo developer changed substantially during the project's timeline. When I began, agentic coding tools were still early and most of my AI use was conversational. By the end of the project, multiple mature terminal- and editor-integrated agents were available. I changed tools as the landscape evolved and used what was available to me at each point. Tools overlapped in places, but I generally used one at a time for a given task; I did not operate a fleet of agents in parallel or maintain a consistent pipeline across tools.
\section{Tools Used}
\label{sec:ai-tools}

Table~\ref{tbl:ai-tools} lists the tools I used during development and the capacity in which I used each. The split between them was determined partly by capability and partly by availability over time.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|p{3.4in}|}
\hline
\textbf{Tool} & \textbf{Category} & \textbf{Primary use} \\
\hline
Claude~\cite{Anthropic2024Claude} & Chat model & Design discussions, architectural review, debugging assistance, refactoring proposals, occasional help drafting commit messages. \\
\hline
Claude Code~\cite{AnthropicClaudeCode} & Terminal agent & Multi-file feature implementation against a written spec; codemod-style refactors; test scaffolding. \\
\hline
OpenCode~\cite{OpenCode} & Terminal agent & Same class of task as Claude Code, used when I preferred an open-source workflow or a different backing model. \\
\hline
Gemini CLI~\cite{GeminiCLI} & Terminal agent & Occasional cross-check on changes produced by a different agent, and work against Google's models when I wanted a second reading of a larger diff. \\
\hline
Google Antigravity~\cite{GoogleAntigravity} & IDE agent & Editor-integrated agentic coding work, primarily late in the project as the tool became available. \\
\hline
Zed~\cite{ZedEditor} & Editor & Day-to-day development environment; provided its own AI-assisted editing features alongside the agents listed above. \\
\hline
\end{tabular}
\caption{AI tools used during HRIStudio development.}
\label{tbl:ai-tools}
\end{table}
I did not use these tools as a coordinated pipeline. I used whichever one fit the task in front of me at the moment, with the set of options expanding as the year progressed. Some of the work overlapped between tools --- any of the agents can, in principle, produce the same diff for a well-scoped task --- but I generally used one at a time and did not run multiple agents against the same code simultaneously.
|
||||||
|
|
||||||
|
\section{Division of Responsibility}
|
||||||
|
\label{sec:ai-division}
|
||||||
|
|
||||||
|
My working rule throughout the project was that I did the engineering and the agents did the implementation. In practice, this meant that I was responsible for every decision that had downstream consequences for the shape of the system, and the agents were responsible for producing the code that realized those decisions. Concretely, I did the following work directly, without delegating it to an agent:
|
||||||
|
|
||||||
|
\begin{itemize}
\item \textbf{Architecture.} The three-tier structure described in Chapter~\ref{ch:design}, the separation between experiment specifications and trial records, the choice to route all robot communication through plugin files, and the overall shape of the event-driven execution model were mine. I wrote these decisions as prose before any code was written.

\item \textbf{Data model.} The PostgreSQL schema and the tRPC procedure boundaries were designed by me. Because downstream type safety depends on the shape of the schema and the API, I was unwilling to let an agent make those choices.

\item \textbf{Research design.} The pilot validation study in Chapter~\ref{ch:evaluation} was designed and analyzed entirely by me. The Observer Data Sheet, Design Fidelity Score rubric, and Execution Reliability Score rubric were written by hand. No AI tool was used to score sessions, compute results, or draft claims about what the data showed.

\item \textbf{The prose of this thesis.} Every chapter was written by me. AI tools occasionally helped me reword an awkward sentence or catch an inconsistency between sections, but the structure of the argument and the specific claims I make are my own.
\end{itemize}
The agents handled the work that sat inside those decisions: implementing tRPC procedures from a written signature, generating the Drizzle migration files that matched a schema change I had specified, producing React components from a layout sketch and a list of props, writing the serializer that turned a plugin definition into the JSON format the runtime expected, and applying consistent edits across files when I changed a shared interface. I read every diff before accepting it. When a diff was wrong, I either explained what was wrong and asked for a revision with specifics, or I discarded it and wrote the code myself.
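To make the scale of a ``well-scoped task'' concrete, the sketch below shows the kind of procedure I would hand over as a written signature and receive back as a diff. It is a condensed illustration rather than an excerpt from the codebase: the router name, the \texttt{steps} table, and the \texttt{createTRPCRouter} and \texttt{protectedProcedure} helpers are hypothetical stand-ins for the project's actual identifiers.

\begin{verbatim}
// Hypothetical sketch of a delegated, well-scoped task: "add a
// procedure that renames a step and returns the updated row."
// createTRPCRouter and protectedProcedure are assumed project
// helpers; the steps table and import paths are illustrative.
import { z } from "zod";
import { eq } from "drizzle-orm";
import { createTRPCRouter, protectedProcedure } from "../trpc";
import { steps } from "../../db/schema";

export const stepRouter = createTRPCRouter({
  rename: protectedProcedure
    .input(z.object({ stepId: z.string().uuid(), name: z.string().min(1) }))
    .mutation(async ({ ctx, input }) => {
      // Update the row and return it so the client can refresh its cache.
      const [updated] = await ctx.db
        .update(steps)
        .set({ name: input.name })
        .where(eq(steps.id, input.stepId))
        .returning();
      return updated;
    }),
});
\end{verbatim}

A change of this size was cheap to specify, cheap to review line by line, and rarely needed more than one revision.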
\section{A Representative Interaction Pattern}
\label{sec:ai-pattern}

The typical loop I followed for a medium-sized feature proceeded in five steps.

First, I wrote the specification. This was usually a short markdown document I kept in a scratch file: a statement of what the feature should do, the tRPC procedure signature it would expose, the tables it would touch, the React components that would consume it, and the acceptance criteria that would let me know it was complete. Writing the specification was design work, and I did it myself.
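A condensed, hypothetical example of what one of these scratch-file specifications looked like follows; the feature, identifiers, and acceptance criteria here are invented for illustration and are not copied from an actual specification.

\begin{verbatim}
# Feature: archive a trial
What: let a researcher archive a completed trial so it is hidden from
      the default trial list but never deleted.
API:  trialRouter.archive({ trialId }) -> { id, archivedAt }
Data: trials table gains a nullable archivedAt timestamp column.
UI:   TrialList hides archived rows by default; the trial detail page
      gets an "Archive" action visible to the study owner only.
Done when:
  - archiving is reversible from the trial detail page
  - archived trials still appear in exports and analysis views
  - the action is denied for users without the appropriate role
\end{verbatim}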
Second, I handed the specification to an agent with the repository open. The agent read the relevant existing files, produced a diff that implemented the specification, and reported what it had done.

Third, I read the diff. This step was non-negotiable: I did not accept code I had not read. For small changes I read directly; for larger ones I asked the agent for a summary first and then read the diff file by file.

Fourth, I ran the code. I ran the development server, exercised the feature manually, checked the database state where relevant, and ran whatever tests existed. If the feature did not work, I returned to step three with a specific failure to investigate.

Fifth, I either accepted the diff, asked for a revision, or discarded it. A revision request described the specific thing that was wrong, not a vague instruction to \textit{try again}. Discarding happened when the agent had misunderstood the specification in a way that made a revision more expensive than rewriting from scratch.

This loop is unremarkable. It is the same loop I would follow if I were reviewing a pull request from a junior engineer. The key point is that the agent's output was treated as a draft pull request that I, as the engineer, either accepted, requested changes on, or rejected --- not as finished work.
\section{What Worked and What Did Not}
\label{sec:ai-limits}

The tasks that agents handled well were those with a narrow and well-specified interface. Implementing a tRPC procedure from a signature, writing a Drizzle migration that matched a schema diff, adding a new field through an existing form, or applying a consistent rename across files --- these were cheap to specify and the agent's output was usually accepted on the first or second iteration. Agents were also good at scaffolding: producing the initial shape of a component, test file, or API route that I then edited to completion.
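As an illustration of the scale of these schema-level tasks (with a hypothetical table and column, not the actual HRIStudio schema), a delegated change often amounted to a single added field plus the migration SQL that drizzle-kit generates for it, both of which I checked by hand before applying.

\begin{verbatim}
// Hypothetical Drizzle schema change handed to an agent: add an
// optional "notes" column to a trials table. Names are illustrative.
import { pgTable, uuid, text, timestamp } from "drizzle-orm/pg-core";

export const trials = pgTable("trials", {
  id: uuid("id").primaryKey().defaultRandom(),
  name: text("name").notNull(),
  notes: text("notes"), // the new, nullable field
  createdAt: timestamp("created_at").defaultNow().notNull(),
});

// Corresponding generated migration:
// ALTER TABLE "trials" ADD COLUMN "notes" text;
\end{verbatim}

Verifying a diff of this shape took minutes; that asymmetry between specification cost and review cost is what made delegation worthwhile.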
The tasks that agents handled poorly were those that required reasoning across more of the system than the context window could hold, or that depended on a piece of context I had not written down. Cross-cutting changes to the experiment and trial data models, for example, required careful coordination across the schema, the tRPC procedures, the execution runtime, and the analysis interface; when I tried to delegate changes of this shape to an agent, the diffs were often locally plausible but globally inconsistent. I ended up doing that work myself. Subtle concurrency and timing questions in the execution layer were another category the agents did not handle well; the event-driven execution model in Chapter~\ref{ch:design} has enough non-obvious ordering constraints that an agent without the full picture tended to introduce races. Those parts of the codebase I wrote by hand.

Across the full set of tools I used, the differences in capability on the kinds of work I asked of them were smaller than I expected. Any of the agents could, in principle, produce a correct diff for a well-scoped task, and when one tool failed it was usually because the task was underspecified rather than because of a difference in model capability. The practical differences between tools mattered more at the workflow level --- which shell integration I preferred, how the tool handled long diffs, how it behaved when it needed to ask for clarification --- than at the capability level.
\section{Research Integrity}
\label{sec:ai-integrity}

Because this thesis reports an empirical evaluation, I treat the boundary between AI-assisted development and the evaluation itself as a matter of research integrity rather than a matter of preference. The following statements reflect the actual workflow I followed:

\begin{itemize}
\item No AI tool generated, modified, or interpreted any of the evaluation data reported in Chapter~\ref{ch:results}. Every Design Fidelity Score, Execution Reliability Score, and System Usability Scale rating was recorded by me during or immediately after each session from direct observation, using the rubrics in Appendix~\ref{app:blank_templates}.

\item No AI tool produced the tables, means, or comparative claims in Chapter~\ref{ch:results}. The numbers were tabulated by hand from the completed Observer Data Sheets reproduced in Appendix~\ref{app:completed_materials}, and the claims about what those numbers support or do not support are mine.

\item No AI tool drafted the prose of this thesis. The chapters were written by me, in my own voice, and I am responsible for every claim they make and every argument they advance. AI tools were occasionally used as a proofreading aid --- catching typos, flagging awkward phrasing, or suggesting an alternative word --- but the sentences are mine.

\item The code that implements HRIStudio and that was the subject of the evaluation was written under the workflow described in Sections~\ref{sec:ai-division} and~\ref{sec:ai-pattern}. Agents produced drafts; I read, tested, and accepted or rejected every one. The final state of the code is the product of my engineering decisions, regardless of who wrote any particular line.
\end{itemize}
\section{A Note on the Workflow as a Contribution}
\label{sec:ai-reflection}

The workflow described in this appendix is not a contribution of the thesis, and I do not claim that it is generalizable or optimal. I describe it because it is the actual workflow under which the system was built, and because a reader evaluating the claims in Chapter~\ref{ch:results} is entitled to know how the system being evaluated came into existence.

The more interesting observation, at least to me, is about where the boundary between human and agent naturally fell in practice. It fell at the point where a task required a decision with downstream consequences for the shape of the system. Tasks that realized a decision were inexpensive to delegate and inexpensive to verify; tasks that made a decision were neither, and delegating them produced diffs that were locally plausible and globally wrong. Whether that boundary will move as tools improve is a question I cannot answer from the evidence of a single project, but the boundary was stable across every tool I used during this one.
@@ -228,3 +228,67 @@ doi = {10.1201/9781498710411-35}
year = {2021},
doi = {10.1145/3412374}
}
@misc{HRIStudioRepo,
  author = {O'Connor, Sean},
  title = {{HRIStudio: A Web-Based Wizard-of-Oz Platform for Human-Robot Interaction Research}},
  howpublished = {GitHub repository},
  year = {2026},
  url = {https://github.com/soconnor0919/hristudio}
}

@misc{RobotPluginsRepo,
  author = {O'Connor, Sean},
  title = {{HRIStudio Robot Plugins Repository}},
  howpublished = {GitHub repository},
  year = {2026},
  url = {https://github.com/soconnor0919/robot-plugins}
}

@misc{Anthropic2024Claude,
  author = {{Anthropic}},
  title = {{Claude}},
  howpublished = {Large language model},
  year = {2024--2026},
  url = {https://www.anthropic.com/claude}
}

@misc{AnthropicClaudeCode,
  author = {{Anthropic}},
  title = {{Claude Code}},
  howpublished = {Agentic coding assistant},
  year = {2024--2026},
  url = {https://www.anthropic.com/claude-code}
}

@misc{OpenCode,
  author = {{sst}},
  title = {{OpenCode}},
  howpublished = {Open-source AI coding agent},
  year = {2024--2026},
  url = {https://opencode.ai}
}

@misc{GeminiCLI,
  author = {{Google}},
  title = {{Gemini CLI}},
  howpublished = {Open-source AI agent},
  year = {2025--2026},
  url = {https://github.com/google-gemini/gemini-cli}
}

@misc{GoogleAntigravity,
  author = {{Google}},
  title = {{Antigravity}},
  howpublished = {Agentic development platform},
  year = {2025--2026},
  url = {https://antigravity.google}
}

@misc{ZedEditor,
  author = {{Zed Industries}},
  title = {{Zed}},
  howpublished = {Collaborative code editor},
  year = {2023--2026},
  url = {https://zed.dev}
}
@@ -4,6 +4,7 @@
%\usepackage{graphics} %Select graphics package
\usepackage{graphicx} %
\usepackage{pdfpages} %For including PDF pages in appendices
\usepackage{subcaption} %For sub-figures with captions
%\usepackage{amsthm} %Add other packages as necessary
\usepackage{array} %Extended column types and \arraybackslash
\usepackage{makecell} %Multi-line table header cells
@@ -52,9 +53,9 @@
\abstract{
\begin{spacing}{1.3}
{\setlength{\parskip}{0.1in}
The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction (HRI) research, but two persistent problems limit its effectiveness: existing tools impose technical barriers that exclude non-engineering domain experts (the Accessibility Problem), and the fragmented landscape of robot-specific implementations makes interaction scripts difficult to port across platforms (the Reproducibility Problem --- concerning execution consistency and portability, not third-party replication). Through a literature review, I identified three design principles to address both: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. I realized these principles in HRIStudio, an open-source, web-based platform providing a visual experiment designer, a guided wizard execution interface, automated timestamped logging with deviation tracking, and role-based access control.

I evaluated HRIStudio in a pilot between-subjects study (N=6) against Choregraphe, the standard programming tool for the NAO robot. HRIStudio wizards achieved higher design fidelity, execution reliability, and perceived usability across all six sessions; the only unprompted specification deviation in the dataset occurred in the Choregraphe condition. While the pilot scale precludes inferential claims, the directional evidence across all measures supports the position that a tool built to realize the identified design principles can have a significant impact on accessibility and reproducibility in WoZ-based HRI research.
}
\end{spacing}
}
@@ -94,5 +95,6 @@
\include{chapters/app_blank_templates}
\include{chapters/app_materials}
\include{chapters/app_tech_docs}
\include{chapters/app_ai_development}

\end{document}