From 5d8ef0ce762f6c43e2a9cf288edc2b814b2c3af1 Mon Sep 17 00:00:00 2001 From: Sean O'Connor Date: Thu, 30 Apr 2026 00:19:02 -0400 Subject: [PATCH] revisions of the revisions --- thesis/chapters/01_introduction.tex | 2 +- thesis/chapters/02_background.tex | 6 ++-- thesis/chapters/03_reproducibility.tex | 4 +-- thesis/chapters/04_system_design.tex | 14 ++++---- thesis/chapters/05_implementation.tex | 2 +- thesis/chapters/06_evaluation.tex | 2 +- thesis/chapters/07_results.tex | 50 +++++++++++++++++++------- thesis/chapters/app_ai_development.tex | 14 ++++---- thesis/chapters/app_tech_docs.tex | 16 ++++----- 9 files changed, 66 insertions(+), 44 deletions(-) diff --git a/thesis/chapters/01_introduction.tex b/thesis/chapters/01_introduction.tex index d452c98..fadd316 100644 --- a/thesis/chapters/01_introduction.tex +++ b/thesis/chapters/01_introduction.tex @@ -29,4 +29,4 @@ The central question this thesis addresses is: \emph{can the right software arch \section{Chapter Summary} -This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations. +This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. 
The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. diff --git a/thesis/chapters/02_background.tex b/thesis/chapters/02_background.tex index 357c4dd..b566977 100644 --- a/thesis/chapters/02_background.tex +++ b/thesis/chapters/02_background.tex @@ -61,13 +61,11 @@ This expanding landscape reveals a persistent fundamental gap in the design spac \node[desc] at (8.25, 1.7) {Requires specialized\\training\\No methodological rigor}; \end{tikzpicture} -\caption{The design space of WoZ tools categorized by technical barrier and methodological rigor. A fundamental gap exists for a platform that is both accessible and rigorous.} +\caption{WoZ tool design space by technical barrier and methodological rigor.} \label{fig:tool-matrix} \end{figure} -By methodological rigor, I refer to systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs. - -Moreover, few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. 
Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity---that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data. +The missing quadrant in Figure~\ref{fig:tool-matrix} matters because methodological rigor requires systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs. Few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity---that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. 
\cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data. \section{Requirements for Modern WoZ Infrastructure} diff --git a/thesis/chapters/03_reproducibility.tex b/thesis/chapters/03_reproducibility.tex index 9f2dc58..f9f661a 100644 --- a/thesis/chapters/03_reproducibility.tex +++ b/thesis/chapters/03_reproducibility.tex @@ -5,10 +5,10 @@ Having established the landscape of existing WoZ platforms and their limitations \section{Sources of Variability} -\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. A third interpretation of the term — independent replication of a published study by researchers at other institutions — is distinct from both and is not what this thesis evaluates. It is also worth noting that execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study. +\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. 
The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. It is important to note that the term reproducibility may also refer to \emph{allowing independent replications of published studies}; this is not what this thesis evaluates. Execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study. In WoZ-based HRI studies, multiple sources of variability can compromise execution consistency. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. - Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols. + Even with a detailed script, the wizard may vary in timing, with the delay between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. 
When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols. Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior. diff --git a/thesis/chapters/04_system_design.tex b/thesis/chapters/04_system_design.tex index 38b0e93..b9450ae 100644 --- a/thesis/chapters/04_system_design.tex +++ b/thesis/chapters/04_system_design.tex @@ -13,7 +13,7 @@ Figure~\ref{fig:experiment-hierarchy} shows this hierarchical structure. Reading Figure~\ref{fig:trial-instantiation} illustrates how a protocol definition relates to its instantiation. The left column holds the protocol, defined before the study begins; the right column shows how the abstraction defined as a protocol is instantiated as independent trials. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed. 
-To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how a human will remember the story?} The two experiments use different robots: the NAO6, a humanoid robot with expressive gestures and a human-like form, and the TurtleBot, a wheeled mobile robot that is visibly machine-like with no social movement cues. The narrative task remains the same across both experiments; only how the robot delivers it changes. +To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how a human will remember the story?} The study might use different robots, for instance Pepper, NAO6, and TurtleBot. Figure~\ref{fig:robot-morphologies} shows the morphologies of these three robots: Pepper and NAO6 are humanoid social robots with expressive gestures and human-like forms, while TurtleBot is a wheeled mobile robot with a visibly machine-like form and no social movement cues. In the example below, the narrative task remains the same across two robot-specific experiments; only how the robot delivers it changes. \begin{figure}[htbp] \centering @@ -37,7 +37,7 @@ To illustrate the hierarchy with a concrete example, consider an interactive sto \caption{TurtleBot (Mechanical)} \label{fig:robot-turtlebot} \end{subfigure} -\caption{Diverse robot morphologies supported by the HRIStudio architecture, ranging from expressive humanoid forms to purely mechanical platforms.} +\caption{Three robot morphologies supported by the HRIStudio architecture.} \label{fig:robot-morphologies} \end{figure} @@ -293,7 +293,7 @@ This separation of concerns provides two concrete benefits. 
First, each layer ca \centering \begin{tikzpicture}[ layer/.style={rectangle, draw=black, thick, fill, minimum width=6.5cm, minimum height=1cm, align=center, text width=6.2cm}, - arrow/.style={->, thick, line width=1.5pt}] + arrow/.style={-, thick, line width=1.5pt}] % Layer 1: UI \node[layer, fill=gray!15] (ui) at (0, 3.5) { @@ -314,8 +314,8 @@ This separation of concerns provides two concrete benefits. First, each layer ca }; % Arrows (bidirectional) - \draw[<->, thick, line width=1.5pt] (ui.south) -- (logic.north); - \draw[<->, thick, line width=1.5pt] (logic.south) -- (data.north); + \draw[-, thick, line width=1.5pt] (ui.south) -- (logic.north); + \draw[-, thick, line width=1.5pt] (logic.south) -- (data.north); \end{tikzpicture} \caption{Three-layer architecture separates user interface, application logic, and data/robot control.} @@ -326,7 +326,7 @@ This separation of concerns provides two concrete benefits. First, each layer ca During the design phase, researchers create experiment specifications that are stored in the system database. During a trial, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete records. After the trial, researchers can inspect these records through the Analysis interface. -The flow of data during a trial proceeds through six distinct phases, as shown in Figure~\ref{fig:trial-dataflow}: +The flow of data during a trial proceeds through six distinct phases as discussed below; these phases are summarized in Figure~\ref{fig:trial-dataflow}: \begin{enumerate} \item A researcher creates an experiment protocol using the Design interface. 
@@ -361,7 +361,7 @@ This design automatically creates comprehensive documentation of every trial, \draw[arrow] (s5.south) -- (s6.north); \end{tikzpicture} -\caption{Trial data flow: from protocol design through execution and recording, to analysis and playback.} +\caption{Six-phase trial data flow.} \label{fig:trial-dataflow} \end{figure} diff --git a/thesis/chapters/05_implementation.tex b/thesis/chapters/05_implementation.tex index 6a8860a..e22fee9 100644 --- a/thesis/chapters/05_implementation.tex +++ b/thesis/chapters/05_implementation.tex @@ -75,7 +75,7 @@ HRIStudio is implemented as a set of containerized services that work together t \draw[arrow] (bridge_cont.east) -- node[above, font=\scriptsize, align=center] {NAOqi\\API} (robot.west); \end{tikzpicture} -\caption{The containerized architecture of HRIStudio and the NAO6 integration bridge. The wizard's browser maintains two independent WebSocket connections: one for system state and logging, and one for direct robot control.} +\caption{Containerized HRIStudio and NAO6 integration architecture.} \label{fig:system-architecture} \end{figure} diff --git a/thesis/chapters/06_evaluation.tex b/thesis/chapters/06_evaluation.tex index 08a01c3..84f1284 100644 --- a/thesis/chapters/06_evaluation.tex +++ b/thesis/chapters/06_evaluation.tex @@ -13,7 +13,7 @@ I hypothesized that HRIStudio would improve both accessibility and reproducibili \section{Study Design} -I used what Bartneck et al.~\cite{Bartneck2024} call a between-subjects design, in which each participant is assigned to only one condition. To ensure that programming experience was balanced across conditions, I stratified assignment by self-reported programming background: each wizard was first classified into one of three strata (\emph{None}, \emph{Moderate}, or \emph{Extensive} programming experience), and then randomly assigned within their stratum to one of the two conditions (HRIStudio or Choregraphe). 
This produced a design in which each condition contained exactly one wizard at each experience level, allowing the tool effect to be evaluated without confounding from the distribution of programming experience. Both groups received the same task, the same time allocation, and a similar training structure. Measuring each participant in only one condition prevents carryover effects, meaning performance changes caused by prior exposure to another condition rather than by the assigned condition itself. +I used what Bartneck et al.~\cite{Bartneck2024} call a \emph{between-subjects design}, in which each participant is assigned to only one condition. To ensure that programming experience was balanced across conditions, I stratified assignment by self-reported programming background: each wizard was first classified as having \emph{None}, \emph{Moderate}, or \emph{Extensive} programming experience, and then randomly assigned within that stratum to HRIStudio or Choregraphe. This produced a design in which each condition contained exactly one wizard at each experience level, reducing the risk that tool effects would be confused with differences in programming experience. Both groups received the same task, the same time allocation, and a similar training structure. Because each wizard used only one tool, the design also avoided carryover effects from prior exposure to the other condition. \section{Participants} diff --git a/thesis/chapters/07_results.tex b/thesis/chapters/07_results.tex index 20adb76..8619afb 100644 --- a/thesis/chapters/07_results.tex +++ b/thesis/chapters/07_results.tex @@ -5,37 +5,61 @@ This chapter presents the results of the pilot validation study described in Cha \section{Participant Overview} -Table~\ref{tbl:sessions} summarizes the personas and their assigned conditions. Wizards are identified by code to protect confidentiality. 
All six participants were Bucknell University professors drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment. +Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University professors drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment. \begin{table}[htbp] \centering \footnotesize -\begin{tabular}{|l|l|l|l|l|l|l|} +\begin{tabular}{|l|l|l|l|} \hline -\textbf{ID} & \textbf{Condition} & \textbf{Background} & \makecell[l]{\textbf{Programming}\\\textbf{Experience}} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} \\ +\textbf{ID} & \textbf{Condition} & \textbf{Background} & \makecell[l]{\textbf{Programming}\\\textbf{Experience}} \\ \hline -W-01 & Choregraphe & Digital Humanities & None & 42.5 & 65 & 60 \\ +W-01 & Choregraphe & Digital Humanities & None \\ \hline -W-02 & HRIStudio & Logic and Philosophy of Science & Moderate & 100 & 95 & 90 \\ +W-02 & HRIStudio & Logic and Philosophy of Science & Moderate \\ \hline -W-03 & Choregraphe & Computer Science & Extensive & 65 & 60 & 75 \\ +W-03 & Choregraphe & Computer Science & Extensive \\ \hline -W-04 & Choregraphe & Chemical Engineering & Moderate & 62.5 & 75 & 42.5 \\ +W-04 & Choregraphe & Chemical Engineering & Moderate \\ \hline -W-05 & HRIStudio & Chemical Engineering & None & 100 & 95 & 70 \\ +W-05 & HRIStudio & Chemical Engineering & None \\ \hline -W-06 & HRIStudio & Computer Science & Extensive & 100 & 100 & 70 \\ +W-06 & HRIStudio & Computer Science & Extensive \\ \hline \end{tabular} -\caption{Summary of wizard participants, assigned conditions, and scores.} +\caption{Summary of wizard participants and assigned 
conditions.} \label{tbl:sessions} \end{table} -This table also presents numerical data representing the study's results, which is discussed next. +Table~\ref{tbl:primary-outcomes} presents the primary outcome scores, which are discussed next. \section{Primary Measures} +\begin{table}[htbp] +\centering +\footnotesize +\begin{tabular}{|l|l|r|r|r|} +\hline +\textbf{ID} & \textbf{Condition} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} \\ +\hline +W-01 & Choregraphe & 42.5 & 65 & 60 \\ +\hline +W-02 & HRIStudio & 100 & 95 & 90 \\ +\hline +W-03 & Choregraphe & 65 & 60 & 75 \\ +\hline +W-04 & Choregraphe & 62.5 & 75 & 42.5 \\ +\hline +W-05 & HRIStudio & 100 & 95 & 70 \\ +\hline +W-06 & HRIStudio & 100 & 100 & 70 \\ +\hline +\end{tabular} +\caption{Primary outcome scores by wizard and condition.} +\label{tbl:primary-outcomes} +\end{table} + \subsection{Design Fidelity Score (DFS)} The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification of their assigned experiment. Scores range from 0 to 100, with full points awarded only when a component — a rubric criterion representing a required speech action, gesture, or control-flow element — is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.) @@ -141,7 +165,7 @@ Figure~\ref{fig:results-chart} summarizes the three primary measures side-by-sid \node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio}; \end{tikzpicture} -\caption{Mean scores by condition across the three primary outcome measures. 
Within each group, the left bar is Choregraphe and the right bar is HRIStudio.} +\caption{Mean scores by condition across the three primary outcome measures.} \label{fig:results-chart} \end{figure} @@ -212,7 +236,7 @@ Figure~\ref{fig:timing-chart} compares the per-condition means for training, des \node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio}; \end{tikzpicture} -\caption{Mean phase durations (in minutes) by condition. Within each group, the left bar is Choregraphe and the right bar is HRIStudio.} +\caption{Mean phase durations by condition.} \label{fig:timing-chart} \end{figure} diff --git a/thesis/chapters/app_ai_development.tex b/thesis/chapters/app_ai_development.tex index ad9046a..49b7069 100644 --- a/thesis/chapters/app_ai_development.tex +++ b/thesis/chapters/app_ai_development.tex @@ -1,14 +1,14 @@ \chapter{AI-Assisted Development Workflow} \label{app:ai_workflow} -This appendix documents the role that AI coding assistants played in the construction of HRIStudio. It is included both for transparency about how the system was built and because the workflow itself is, in my view, one of the more interesting artifacts produced by the project. Section~\ref{sec:ai-ws} in Chapter~\ref{ch:implementation} introduces the topic briefly; here I describe the division of labor, the specific tools I used, the tasks each handled well, the limits I encountered, and the integrity controls I maintained between implementation work and the evaluation reported in Chapter~\ref{ch:results}. +This appendix documents the role that AI coding assistants played in the construction of HRIStudio. It is included both for transparency about how the system was built and because the workflow itself is, in my view, one of the more interesting artifacts produced by the project. 
Section~\ref{sec:ai-ws} in Chapter~\ref{ch:implementation} introduces the topic briefly; here I describe the specific responsibilities I kept for myself, the tasks I delegated to coding agents, the tools I used, the limits I encountered, and the integrity controls I maintained between implementation work and the evaluation reported in Chapter~\ref{ch:results}. \section{Context} \label{sec:ai-context} -HRIStudio was built by a single undergraduate in parallel with a full course load, a thesis writeup, and the pilot validation study described in Chapter~\ref{ch:evaluation}. The feature surface described in Chapters~\ref{ch:design} and~\ref{ch:implementation} is larger than what a solo developer on that schedule could reasonably have produced without assistance, and the deadline constraints did not allow for the kind of team that a system of this scope would normally involve. AI coding assistants made the scope tractable. They did not replace design judgment, but they substantially reduced the cost of the mechanical work that sits between a well-specified design and a working feature: scaffolding new modules, implementing well-defined CRUD and validation code, applying consistent patterns across files, and producing the many small edits that a project of this size accumulates. +I built HRIStudio while also carrying a full course load, writing this thesis, and running the pilot validation study described in Chapter~\ref{ch:evaluation}. The feature surface described in Chapters~\ref{ch:design} and~\ref{ch:implementation} is larger than what I could reasonably have produced on that schedule without assistance, given both the scope and the level of ambition of the work. AI coding assistants made that scope tractable. 
They did not replace design judgment; they reduced the cost of the mechanical work that sits between a well-specified design and a working feature: scaffolding new modules, implementing well-defined create/read/update/delete (CRUD) and validation code, applying consistent patterns across files, and producing the many small edits that a project of this size accumulates. -The set of tools available to a solo developer changed substantially during the project's timeline. When I began, agentic coding tools were still early and most of my AI use was conversational, primarily through Cursor~\cite{CursorEditor} and Zed~\cite{ZedEditor}. By the end of the project, multiple mature terminal- and editor-integrated agents were available. I changed tools as the landscape evolved, eventually moving into a mixed workflow between Visual Studio Code, Antigravity~\cite{GoogleAntigravity}, Claude Code~\cite{AnthropicClaudeCode}, and OpenCode~\cite{OpenCode}. +The set of tools available to me as a solo developer changed substantially during the project's timeline. When I began, agentic coding tools were still early and most of my AI use was conversational, primarily through Cursor~\cite{CursorEditor} and Zed~\cite{ZedEditor}. By the end of the project, multiple mature terminal- and editor-integrated agents were available. I changed tools as the landscape evolved, eventually moving into a mixed workflow between Visual Studio Code, Antigravity~\cite{GoogleAntigravity}, Claude Code~\cite{AnthropicClaudeCode}, and OpenCode~\cite{OpenCode}. \section{Tools and Hardware} \label{sec:ai-tools} @@ -46,7 +46,7 @@ Beyond cloud-hosted models, I experimented with local execution using \texttt{ll \section{Division of Responsibility} \label{sec:ai-division} -My working rule throughout the project was that I did the engineering and the agents did the implementation. 
In practice, this meant that I was responsible for every decision that had downstream consequences for the shape of the system, and the agents were responsible for producing the code that realized those decisions. Concretely, I did the following work directly, without delegating it to an agent: +My working rule throughout the project was that I handled the engineering and the agents fleshed out the implementation. In practice, this meant that I was responsible for every decision that had downstream consequences for the shape of the system, and the agents were responsible for producing code that realized those decisions. Concretely, I did the following work directly, without delegating it to an agent: \begin{itemize} \item \textbf{Architecture.} The three-tier structure described in Chapter~\ref{ch:design}, the separation between experiment specifications and trial records, the choice to route all robot communication through plugin files, and the overall shape of the event-driven execution model were mine. I wrote these decisions as prose before any code was written. @@ -55,15 +55,15 @@ My working rule throughout the project was that I did the engineering and the ag \item \textbf{Research design.} The pilot validation study in Chapter~\ref{ch:evaluation} was designed and analyzed entirely by me. The Observer Data Sheet, Design Fidelity Score rubric, and Execution Reliability Score rubric were written by hand. No AI tool was used to score sessions, compute results, or draft claims about what the data showed. -\item \textbf{The prose of this thesis.} Every chapter was written by me. The structure of the argument and the specific claims I make are my own. 
While AI assisted with the nuances of \LaTeX{} formatting (particularly the generation of TikZ diagrams and complex chart syntax), the content is mine. \end{itemize} \section{Evolution of the Workflow} \label{sec:ai-pattern} -The way I used these tools changed as they improved. Early in the project, I treated the agent's output as a draft that required line-by-line review. The typical loop followed five steps: writing a specification, generating a diff, reading the diff, running the code, and then accepting or rejecting. +My use of these tools evolved over the course of the project as the models improved. Early on, I treated the agent's output as a draft that required line-by-line review. The typical loop followed five steps: writing a specification, generating a diff, reading the diff, running the code, and then accepting or rejecting the change. -As the models improved and the agents became more reliable, the focus of my effort shifted. By the final stages of development, I spent significantly less time on manual line-by-line diff reviews and more time on empirical testing. I moved from being a ``code reviewer'' to a ``test-driven supervisor.'' If the agent produced a feature that passed my manual acceptance tests and integrated correctly with the existing system, I was more likely to accept the implementation without a complete audit of every semicolon. This shift allowed me to increase the velocity of development significantly in the weeks leading up to the evaluation. +As the models improved and the agents became more reliable, the focus of my effort shifted. By the final stages of development, I spent significantly less time on manual line-by-line reviews and more time on empirical testing. 
I moved from being a ``code reviewer'' to a ``test-driven supervisor.'' If the agent produced a feature that passed my manual acceptance tests and integrated correctly with the existing system, I was more likely to accept the implementation without auditing every line in the program. This shift allowed me to increase the velocity of development significantly in the weeks leading up to the evaluation. \section{What Worked and What Did Not} \label{sec:ai-limits} diff --git a/thesis/chapters/app_tech_docs.tex b/thesis/chapters/app_tech_docs.tex index 7536152..53adccb 100644 --- a/thesis/chapters/app_tech_docs.tex +++ b/thesis/chapters/app_tech_docs.tex @@ -84,12 +84,12 @@ The Next.js application server and the Bun WebSocket server run outside Docker o The NAO6 integration stack is defined in a separate repository and provides three ROS~2 services that collectively bridge HRIStudio to the physical robot. \begin{enumerate} -\item The \textbf{nao\_driver} service runs the NaoQi driver ROS~2 node, which connects to the NAO's proprietary framework over the local network and publishes sensor data (joint states, camera feeds) as standard ROS~2 topics. -\item The \textbf{ros\_bridge} service runs the rosbridge WebSocket server, which exposes all ROS~2 topics over a WebSocket interface on a configurable port (default~9090). This is the endpoint that the HRIStudio server connects to. +\item The \textbf{nao\_driver} service runs the NAOqi driver ROS~2 node, which connects to the NAO's proprietary framework over the local network and publishes sensor data (joint states, camera feeds) as standard ROS~2 topics. +\item The \textbf{ros\_bridge} service runs the \texttt{rosbridge} WebSocket server, which exposes all ROS~2 topics over a WebSocket interface on a configurable port (default~9090). This is the endpoint that the HRIStudio server connects to. \item The \textbf{ros\_api} service provides runtime introspection of available ROS~2 topics, services, and parameters. 
\end{enumerate} -All three services are built from a single Dockerfile based on the ROS~2 Humble base image (Ubuntu~22.04). The image installs the NaoQi driver and rosbridge server packages along with their dependencies (NaoQi libraries, bridge message types, OpenCV bridge, and TF2) and builds them with colcon. All services use host networking so that ROS~2 discovery and the NaoQi connection operate without port forwarding. +All three services are built from a single Dockerfile based on the ROS~2 Humble base image (Ubuntu~22.04). The image installs the NAOqi driver and \texttt{rosbridge} server packages along with their dependencies (NAOqi libraries, bridge message types, OpenCV bridge, and TF2) and builds them with \texttt{colcon}. All services use host networking so that ROS~2 discovery and the NAOqi connection operate without port forwarding. Before starting the driver, an initialization script connects to the NAO via SSH and prepares it for external control: @@ -103,7 +103,7 @@ Environment variables for the robot IP address, credentials, and bridge port are \subsection{Communication Between Stacks} -Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker stacks and the components that run on the host. The HRIStudio server communicates with the robot integration stack over a single WebSocket connection to the \texttt{rosbridge\_websocket} endpoint. For actions that bypass ROS entirely (posture changes, animation playback), the server connects directly to the NAO via SSH and invokes NaoQi commands through the \texttt{qicli} command-line tool. Both communication paths are configured per-robot in the plugin file. +Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker stacks and the components that run on the host. The HRIStudio server communicates with the robot integration stack over a single WebSocket connection to the \texttt{rosbridge\_websocket} endpoint.
For actions that bypass ROS entirely (posture changes, animation playback), the server connects directly to the NAO via SSH and invokes NAOqi commands through the \texttt{qicli} command-line tool. Both communication paths are configured per-robot in the plugin file. \begin{figure}[htbp] \centering @@ -159,7 +159,7 @@ Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker s %% ---- NAO Robot ---- \node[box, fill=gray!40, minimum width=2.8cm] (nao) at (0, -0.8) - {NAO6 Robot\\[-1pt]{\scriptsize NaoQi}}; + {NAO6 Robot\\[-1pt]{\scriptsize NAOqi}}; %% ---- Arrows: browser to host ---- \draw[arrow] (browser.south west) -- node[lbl, left] {HTTP} (nextjs.north); @@ -231,13 +231,13 @@ Each action definition specifies: \item A ROS~2 dispatch block containing the target topic, message type, and a payload mapping. \end{itemize} -The payload mapping supports two modes. In \emph{static} mode, the plugin defines a fixed message template with placeholder tokens (e.g., \texttt{\{\{text\}\}}) that the execution engine fills from the researcher's parameters. In \emph{SSH} mode, the action bypasses ROS entirely and executes a shell command on the robot via SSH; this is used for NaoQi-native operations such as posture changes and animation playback that are not exposed as ROS~2 topics. +The payload mapping supports two modes. In \emph{static} mode, the plugin defines a fixed message template with placeholder tokens (e.g., \texttt{\{\{text\}\}}) that the execution engine fills from the researcher's parameters. In \emph{SSH} mode, the action bypasses ROS entirely and executes a shell command on the robot via SSH; this is used for NAOqi-native operations such as posture changes and animation playback that are not exposed as ROS~2 topics. 
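The static-mode substitution described above can be sketched as follows. This is a minimal TypeScript illustration, assuming a template object whose string values carry \texttt{\{\{token\}\}} placeholders; the function name and signature are hypothetical, not the actual execution-engine code:

```typescript
// Sketch of static-mode placeholder filling (hypothetical helper, not
// the real HRIStudio execution engine). Every {{name}} token in a string
// field is replaced with the matching researcher-supplied parameter.
function fillTemplate(
  template: Record<string, unknown>,
  params: Record<string, string>,
): Record<string, unknown> {
  const filled: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(template)) {
    if (typeof value === "string") {
      // Substitute each {{name}} token; unknown tokens become empty strings.
      filled[key] = value.replace(/\{\{(\w+)\}\}/g, (_, name) => params[name] ?? "");
    } else {
      // Non-string fields pass through unchanged.
      filled[key] = value;
    }
  }
  return filled;
}

// Example: a speech action's message template filled from a parameter.
const msg = fillTemplate({ data: "{{text}}" }, { text: "Hello!" });
// msg is { data: "Hello!" }
```

SSH-mode actions skip this step entirely and run a shell command on the robot instead.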
-The NAO6 plugin defines 20 actions across three categories: speech (say text, say with emotion), movement (walk forward/backward, turn, stop, wake up, rest, stand, sit, crouch), and animation (bow, wave, nod, head shake, shrug, enthusiastic gesture, and others). Movement actions publish ROS~2 Twist messages to the velocity command topic. Animation actions publish animation path strings to the animation topic. Posture and lifecycle commands use SSH mode to call NaoQi services directly via the \texttt{qicli} command-line tool. +The NAO6 plugin defines 20 actions across three categories: speech (say text, say with emotion), movement (walk forward/backward, turn, stop, wake up, rest, stand, sit, crouch), and animation (bow, wave, nod, head shake, shrug, enthusiastic gesture, and others). Movement actions publish ROS~2 Twist messages to the velocity command topic. Animation actions publish animation path strings to the animation topic. Posture and lifecycle commands use SSH mode to call NAOqi services directly via the \texttt{qicli} command-line tool. \subsection{Adding a New Robot} -Adding support for a new robot platform requires writing a single JSON plugin file and placing it in the repository. No changes to the HRIStudio server code are required. The plugin author defines the robot's capabilities, maps each action to a ROS~2 topic or SSH command, and specifies the parameter schema for each action. After the repository is synced, the new robot's actions appear in the experiment designer and can be used in any study. +Adding support for a new robot platform requires writing a single JSON plugin file and placing it in the plugin repository. No changes to the HRIStudio server code are required. The plugin author defines the robot's capabilities, maps each action to a ROS~2 topic or SSH command, and specifies the parameter schema for each action. After the repository is synced, the new robot's actions appear in the experiment designer and can be used in any study. 
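To make the plugin-authoring workflow concrete, the sketch below models a single action entry as a TypeScript type. This is an illustration only: the field names, the topic, and the message type are assumptions for the sake of the example, not the actual HRIStudio plugin schema.

```typescript
// Hypothetical shape of one plugin action entry; all names are
// illustrative assumptions, not the real HRIStudio schema.
interface ActionDefinition {
  id: string;
  category: "speech" | "movement" | "animation";
  // Parameter schema shown to the researcher in the experiment designer.
  parameters: Record<string, { type: string; required?: boolean }>;
  // Either a ROS 2 dispatch block or an SSH command (SSH mode).
  dispatch:
    | {
        mode: "ros2";
        topic: string;
        messageType: string;
        payload: Record<string, unknown>;
      }
    | { mode: "ssh"; command: string };
}

// A "say text" speech action in static mode (topic and message type
// are made up for this sketch).
const sayText: ActionDefinition = {
  id: "say_text",
  category: "speech",
  parameters: { text: { type: "string", required: true } },
  dispatch: {
    mode: "ros2",
    topic: "/speech",                  // hypothetical topic name
    messageType: "std_msgs/msg/String", // hypothetical message type
    payload: { data: "{{text}}" },      // static-mode template token
  },
};
```

A posture action would instead use the SSH branch, carrying the \texttt{qicli} invocation as its command string.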
\section{Database Schema}