mirror of
https://github.com/soconnor0919/honors-thesis.git
post-defense revisions complete
@@ -17,7 +17,55 @@ Choregraphe \cite{Pot2009}, developed by Aldebaran Robotics for the NAO and Pepp
Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high-fidelity data collection.
This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
scale=1.0,
quadbox/.style={rectangle, draw=white, ultra thick, minimum width=5.5cm, minimum height=4.5cm, align=center},
title/.style={font=\small\bfseries, align=center},
desc/.style={font=\footnotesize, text=gray!60, align=center},
axislabel/.style={font=\small\bfseries, align=center}
]

% Quadrant Backgrounds
\fill[gray!20] (0, 4.5) rectangle (5.5, 9.0); % Top Left (HRIStudio)
\fill[gray!15] (5.5, 4.5) rectangle (11.0, 9.0); % Top Right (Polonius)
\fill[gray!10] (0, 0) rectangle (5.5, 4.5); % Bottom Left (WoZ4U)
\fill[gray!5] (5.5, 0) rectangle (11.0, 4.5); % Bottom Right (Choregraphe)

% Quadrant Lines
\draw[white, ultra thick] (5.5, 0) -- (5.5, 9.0);
\draw[white, ultra thick] (0, 4.5) -- (11.0, 4.5);

% Axis Labels
\node[axislabel, above] at (2.75, 9.2) {Low technical barrier};
\node[axislabel, above] at (8.25, 9.2) {High technical barrier};
\node[axislabel, left] at (-0.2, 6.75) {More rigorous};
\node[axislabel, left] at (-0.2, 2.25) {Less rigorous};

% Top Left: The Gap
\node[axislabel] at (2.75, 6.75) {\Huge ?};

% Top Right: Polonius, OpenWoZ, SWoOZ
\node[title] at (8.25, 7.4) {Polonius, OpenWoZ\\SWoOZ, VR Environments};
\node[desc] at (8.25, 6.0) {Flexible and powerful,\\but requires significant\\programming expertise};

% Bottom Left: WoZ4U
\node[title] at (2.75, 2.7) {WoZ4U};
\node[desc] at (2.75, 1.7) {Accessible, but\\platform-specific\\No methodological rigor};

% Bottom Right: Choregraphe
\node[title] at (8.25, 2.7) {Choregraphe};
\node[desc] at (8.25, 1.7) {Requires specialized\\training\\No methodological rigor};

\end{tikzpicture}
\caption{The design space of WoZ tools categorized by technical barrier and methodological rigor. A fundamental gap exists for a platform that is both accessible and rigorous.}
\label{fig:tool-matrix}
\end{figure}
By methodological rigor, I refer to systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs.
Moreover, few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity, that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
@@ -7,7 +7,8 @@ Having established the landscape of existing WoZ platforms and their limitations
\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. A third interpretation of the term, independent replication of a published study by researchers at other institutions, is distinct from both and is not what this thesis evaluates. It is also worth noting that execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study.
In WoZ-based HRI studies, multiple sources of variability can compromise execution consistency. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants.
Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.
@@ -15,6 +15,32 @@ Figure~\ref{fig:trial-instantiation} illustrates how a protocol definition relat
To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how a human will remember the story?} The two experiments use different robots: the NAO6, a humanoid robot with expressive gestures and a human-like form, and the TurtleBot, a wheeled mobile robot that is visibly machine-like with no social movement cues. The narrative task remains the same across both experiments; only how the robot delivers it changes.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{images/nao6.jpg}
\caption{NAO6 (Humanoid)}
\label{fig:robot-nao}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{images/pepper.png}
\caption{Pepper (Social)}
\label{fig:robot-pepper}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{images/turtlebot.png}
\caption{TurtleBot (Mechanical)}
\label{fig:robot-turtlebot}
\end{subfigure}
\caption{Diverse robot morphologies supported by the HRIStudio architecture, ranging from expressive humanoid forms to purely mechanical platforms.}
\label{fig:robot-morphologies}
\end{figure}
Figure~\ref{fig:example-hierarchy} maps the study presented above onto the hierarchical elements defined in Figure~\ref{fig:experiment-hierarchy}. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same sequence of ordered steps (Intro, Story Telling, Recall Test), and each step defines the specific actions the robot will perform. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure.
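To make this mapping concrete in code, the sketch below shows how the storytelling study could be represented as nested data. It is a minimal illustration, not HRIStudio's actual schema; the type and field names are hypothetical simplifications.

\begin{verbatim}
// Hypothetical, simplified representation of the study hierarchy.
interface Action { type: string; parameters: Record<string, unknown>; }
interface Step { name: string; actions: Action[]; }
interface Experiment { name: string; platform: string; steps: Step[]; }
interface Study { question: string; experiments: Experiment[]; }

const storytellingStudy: Study = {
  question: "Does how the robot tells a story affect recall?",
  experiments: [
    {
      name: "Voice only",
      platform: "turtlebot",
      steps: [
        { name: "Intro",
          actions: [{ type: "speak", parameters: { text: "Hello!" } }] },
        { name: "Story Telling",
          actions: [{ type: "speak", parameters: { text: "Once upon a time..." } }] },
        { name: "Recall Test",
          actions: [{ type: "speak", parameters: { text: "What happened first?" } }] },
      ],
    },
    {
      name: "Voice with gestures",
      platform: "nao6",
      steps: [
        { name: "Story Telling",
          actions: [
            { type: "speak", parameters: { text: "Once upon a time..." } },
            { type: "gesture", parameters: { name: "wave" } },
          ] },
        // Intro and Recall Test mirror the structure above.
      ],
    },
  ],
};
\end{verbatim}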
Together, these three figures motivate why the hierarchy is useful in practice. They are interrelated as follows: Figure~\ref{fig:experiment-hierarchy} defines the experimental structure as an abstraction; Figure~\ref{fig:trial-instantiation} shows how that abstract structure is instantiated as concrete trial records; and Figure~\ref{fig:example-hierarchy} shows how each element of the structure expands in a concrete study.
@@ -7,11 +7,81 @@ HRIStudio is a complete, operational platform that realizes the design principle
HRIStudio follows the model of a web application. Users access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connects to the same shared system. This eliminates the need for a local installation and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables easy collaboration (R6): multiple team members can access experiment data and observe trials simultaneously from different machines without any additional configuration.
I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is presented in Chapter~\ref{ch:design} and shown in Figure~\ref{fig:three-tier}. In practice, the User Interface layer runs in each researcher's browser (the client), while the Application Logic and Data \& Robot Control layers run on a shared application server.
While the system can run entirely on a single machine for local testing, this architecture allows the components to be distributed across different systems. The application server can be hosted centrally or even in a remote data center, enabling observers to connect to a live trial from any location with internet access. In such a configuration, it is essential that the robot control hardware and the client computer running the wizard's Execution interface stay on the same local network as the robot. This ensures that the WebSocket-based communication between the wizard and the robot bridge maintains low latency, as a noticeable delay between the wizard's input and the robot's response would break the interaction.
This flexibility of deployment also addresses the varying data security and compliance needs of different research institutions. A lab may choose to host HRIStudio on a public-facing server to prioritize collaborative ease and accessibility for remote team members. Alternatively, a lab with strict data privacy requirements or institutional review board (IRB) constraints can deploy the entire stack on a private, air-gapped network. Because the platform is self-contained and does not rely on external cloud services for its core execution logic, researchers have full control over where their experimental data is stored and who can access it.
I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
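As a minimal sketch of this benefit, consider a single shared type consumed on both sides of the stack. The names here are hypothetical, not drawn from HRIStudio's actual codebase.

\begin{verbatim}
// shared/types.ts: one definition imported by server and client alike.
export interface TrialEvent {
  trialId: string;
  timestamp: number; // Unix epoch milliseconds
  kind: "action" | "annotation";
  payload: string;
}

// If a field of TrialEvent is renamed or retyped, every consumer below
// fails to compile, surfacing the mismatch before any trial runs.
export function logEvent(event: TrialEvent): void {
  console.log(`[${event.timestamp}] ${event.kind}: ${event.payload}`);
}

logEvent({ trialId: "t-001", timestamp: Date.now(),
           kind: "action", payload: "speak" });
\end{verbatim}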
HRIStudio is released as open-source software under the MIT License, with the application hosted at a public repository~\cite{HRIStudioRepo}. The companion robot plugin repository~\cite{RobotPluginsRepo} is maintained as a git submodule and is updated whenever HRIStudio's schema or protocol changes. Both repositories are available for inspection, extension, and deployment by other research groups.
HRIStudio is implemented as a set of containerized services that work together to provide the platform's functionality. This modular architecture ensures that each component can be scaled or replaced independently as requirements change.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
node distance=0.8cm and 1.8cm,
servicebox/.style={rectangle, draw=black, thick, fill=gray!15, align=center, font=\small, inner sep=5pt, minimum width=2.2cm},
containerbox/.style={rectangle, draw=black, thick, dashed, fill=gray!5, align=center, font=\small\bfseries, inner sep=12pt},
wsbox/.style={rectangle, draw=black, ultra thick, fill=white, align=center, font=\scriptsize\bfseries, inner sep=3pt},
arrow/.style={->, thick, >=stealth},
darrow/.style={<->, thick, >=stealth, dashed},
labelstyle/.style={font=\scriptsize\itshape, align=center}
]

% HRIStudio System Container Services
\node[servicebox] (nextjs) {Next.js\\Server};
\node[servicebox, below=of nextjs] (postgres) {PostgreSQL\\Database};
\node[servicebox, below=of postgres] (minio) {MinIO\\Object Storage};
\draw[arrow] (nextjs) -- (postgres);
\draw[arrow] (nextjs) -- (minio);

% HRIStudio Container Boundary
\begin{scope}[on background layer]
\node[containerbox, fit=(nextjs) (postgres) (minio), inner sep=15pt] (hri_cont) {};
\node[anchor=south, font=\small\bfseries, yshift=2pt] at (hri_cont.north) {HRIStudio System};
\end{scope}

% NAO6 Integration Bridge Container Services
\node[servicebox, right=4.5cm of nextjs] (driver) {NAOqi\\Driver};
\node[servicebox, below=of driver] (ros) {ROS 2\\Core};
\node[servicebox, below=of ros] (adapter) {HRIStudio\\Adapter};
\draw[darrow] (driver) -- (ros);
\draw[darrow] (ros) -- (adapter);

% Bridge Container Boundary
\begin{scope}[on background layer]
\node[containerbox, fit=(driver) (ros) (adapter), inner sep=15pt] (bridge_cont) {};
\node[anchor=south, font=\small\bfseries, yshift=2pt] at (bridge_cont.north) {NAO6 Bridge};
\end{scope}

% Client/Wizard
\node[servicebox] (client) at ($(hri_cont.north)!0.5!(bridge_cont.north) + (0, 2.2)$) {Wizard Browser};

% WebSocket Connections
\node[wsbox] (sys_ws) at ($(client.south)!0.5!(hri_cont.north)$) {System WebSocket};
\node[wsbox] (robot_ws) at ($(client.south)!0.5!(bridge_cont.north)$) {Robot WebSocket};

\draw[darrow] (client.south) -- (sys_ws.north);
\draw[darrow] (sys_ws.south) -- (hri_cont.north);

\draw[darrow] (client.south) -- (robot_ws.north);
\draw[darrow] (robot_ws.south) -- (bridge_cont.north);

% Hardware
\node[servicebox, right=1.5cm of bridge_cont] (robot) {NAO6\\Robot};
\draw[arrow] (bridge_cont.east) -- node[above, font=\scriptsize, align=center] {NAOqi\\API} (robot.west);

\end{tikzpicture}
\caption{The containerized architecture of HRIStudio and the NAO6 integration bridge. The wizard's browser maintains two independent WebSocket connections: one for system state and logging, and one for direct robot control.}
\label{fig:system-architecture}
\end{figure}
The HRIStudio system consists of three primary services: a Next.js application server that handles the user interface and business logic, a PostgreSQL database for persistent storage of experiment and trial data, and a MinIO object storage service for managing large media files like video and audio recordings. For robot integration, the \texttt{nao6-hristudio-integration} bridge also employs a containerized structure consisting of the NAOqi driver, a ROS 2 core for message routing, and a specialized adapter that communicates with HRIStudio.
During a live trial, the wizard's browser establishes two independent WebSocket connections. The System WebSocket connects to the HRIStudio server to manage trial state, protocol progression, and logging. The Robot WebSocket connects directly to the integration bridge to provide low-latency control of the robot platform. This split-connection model ensures that system-level management does not introduce latency into the robot's physical responses.
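A minimal sketch of the split-connection model, from the wizard client's point of view, might look like the following. The endpoint URLs and message shapes are hypothetical stand-ins, not HRIStudio's actual wire protocol.

\begin{verbatim}
// Hypothetical endpoints: trial management versus direct robot control.
const systemWs = new WebSocket("wss://hristudio.example.edu/ws/trial"); // state + logging
const robotWs = new WebSocket("ws://192.168.1.50:9090/ws/robot");       // low-latency control

systemWs.onmessage = (msg) => {
  const update = JSON.parse(msg.data);
  console.log("trial state:", update.state); // protocol progression updates
};

// Robot commands bypass the application server entirely, so system-level
// bookkeeping can never delay the robot's physical response.
function dispatchAction(type: string, parameters: object): void {
  robotWs.send(JSON.stringify({ type, parameters }));
  systemWs.send(JSON.stringify({ log: { type, parameters, at: Date.now() } }));
}
\end{verbatim}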
\subsection{Working with AI Coding Assistants}
\label{sec:ai-ws}
@@ -126,7 +196,7 @@ Figure~\ref{fig:execution-view} shows the Execution interface as it appears to a
\section{Robot Integration}
A plugin file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. For the NAO6 platform, I developed a specialized ROS-based bridge called \texttt{nao6-hristudio-integration}~\cite{NaoIntegrationRepo} that translates HRIStudio commands into the NAOqi API calls required by the robot. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the plugin file.
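The sketch below illustrates the idea of plugin-driven dispatch. The structure, command names, and helper function are hypothetical; they show only how an action type can be looked up in a plugin definition and assembled into a bridge message.

\begin{verbatim}
// Hypothetical plugin shape: each supported action maps to a bridge command.
interface ActionSpec {
  command: string; // the command name the bridge understands
  toPayload(params: Record<string, unknown>): object;
}

const nao6Plugin: Record<string, ActionSpec> = {
  speak:   { command: "naoqi.tts.say",        toPayload: (p) => ({ text: p.text }) },
  gesture: { command: "naoqi.motion.animate", toPayload: (p) => ({ name: p.name }) },
};

// The engine knows nothing about the NAO6 itself: it looks up the action
// type in whichever plugin is loaded and forwards the assembled message.
function dispatch(plugin: Record<string, ActionSpec>, type: string,
                  params: Record<string, unknown>): object {
  const spec = plugin[type];
  if (!spec) throw new Error(`Unsupported action type: ${type}`);
  return { command: spec.command, payload: spec.toPayload(params) };
}

dispatch(nao6Plugin, "speak", { text: "Hello!" }); // -> { command: "naoqi.tts.say", ... }
\end{verbatim}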
The execution engine treats control-flow elements such as branches and conditionals, which function like constructs in a conventional program, the same way it treats robot actions. These control-flow elements appear as action groups in the experiment and are evaluated during the trial, so researchers can freely mix logical decisions and physical robot behaviors when designing an experiment without any special handling.
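For instance, a branch might be modeled as just another group that the engine evaluates in sequence. The shape below is a hypothetical sketch rather than the platform's actual representation:

\begin{verbatim}
// A hypothetical control-flow group, evaluated like any other action group.
interface BranchGroup {
  kind: "branch";
  condition: (ctx: { lastResponse: string }) => boolean;
  whenTrue: string[];  // names of the steps to execute next
  whenFalse: string[];
}

const recallBranch: BranchGroup = {
  kind: "branch",
  condition: (ctx) => ctx.lastResponse.includes("yes"),
  whenTrue: ["Recall Test"],
  whenFalse: ["Story Telling"], // repeat the story before testing recall
};
\end{verbatim}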
@@ -177,6 +247,10 @@ Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and Tur
\label{fig:plugin-architecture}
\end{figure}
\subsection{Containerized Development Environment}
To support development and testing for the NAO platform, I also developed \texttt{nao-workspace}, a containerized workspace~\cite{NaoWorkspaceRepo}. This was motivated by the technical constraints of Choregraphe and its related libraries, which only supported x86-64 systems running Ubuntu 22.04. The containerized structure was the only way I could run the proprietary NAO development tools on modern hardware. While I developed this stack primarily to enable technical testing and material preparation during the project, the resulting tooling may be useful to other HRI researchers facing similar platform constraints.
\section{Access Control}
I implemented access control using a role-based access control (RBAC) model with two layers. System-level roles govern what a user can do across the platform (administrator, researcher, wizard, observer), while study-level roles govern what a user can see and do within a specific study (owner, researcher, wizard, observer). The two layers are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. Within a study, the four study-level roles define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires. The capabilities and constraints for each role are described below:
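Before that enumeration, the following sketch shows how the two independent checks might compose in code. The type and function names, and the exact capability mapping, are hypothetical rather than HRIStudio's actual implementation.

\begin{verbatim}
type SystemRole = "administrator" | "researcher" | "wizard" | "observer";
type StudyRole = "owner" | "researcher" | "wizard" | "observer";

interface User {
  id: string;
  systemRole: SystemRole;
  studyRoles: Map<string, StudyRole>; // studyId -> role within that study
}

// Both layers are checked independently: a platform-level capability
// check, then a need-to-know check scoped to the specific study.
function canRunTrial(user: User, studyId: string): boolean {
  const systemOk = user.systemRole !== "observer"; // assumed capability mapping
  const studyRole = user.studyRoles.get(studyId);
  const studyOk = studyRole === "owner" || studyRole === "wizard";
  return systemOk && studyOk;
}
\end{verbatim}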
@@ -1,16 +1,16 @@
\chapter{AI-Assisted Development Workflow}
\label{app:ai_workflow}
This appendix documents the role that AI coding assistants played in the construction of HRIStudio. It is included both for transparency about how the system was built and because the workflow itself is, in my view, one of the more interesting artifacts produced by the project. Section~\ref{sec:ai-ws} in Chapter~\ref{ch:implementation} introduces the topic briefly; here I describe the division of labor, the specific tools I used, the tasks each handled well, the limits I encountered, and the integrity controls I maintained between implementation work and the evaluation reported in Chapter~\ref{ch:results}.
\section{Context}
\label{sec:ai-context}
HRIStudio was built by a single undergraduate in parallel with a full course load, a thesis writeup, and the pilot validation study described in Chapter~\ref{ch:evaluation}. The feature surface described in Chapters~\ref{ch:design} and~\ref{ch:implementation} is larger than what a solo developer on that schedule could reasonably have produced without assistance, and the deadline constraints did not allow for the kind of team that a system of this scope would normally involve. AI coding assistants made the scope tractable. They did not replace design judgment, but they substantially reduced the cost of the mechanical work that sits between a well-specified design and a working feature: scaffolding new modules, implementing well-defined CRUD and validation code, applying consistent patterns across files, and producing the many small edits that a project of this size accumulates.
The set of tools available to a solo developer changed substantially during the project's timeline. When I began, agentic coding tools were still early and most of my AI use was conversational, primarily through Cursor~\cite{CursorEditor} and Zed~\cite{ZedEditor}. By the end of the project, multiple mature terminal- and editor-integrated agents were available. I changed tools as the landscape evolved, eventually settling into a mixed workflow across Visual Studio Code, Antigravity~\cite{GoogleAntigravity}, Claude Code~\cite{AnthropicClaudeCode}, and OpenCode~\cite{OpenCode}.
\section{Tools and Hardware}
\label{sec:ai-tools}
Table~\ref{tbl:ai-tools} lists the tools I used during development and the capacity in which I used each. The split between them was determined partly by capability and partly by availability over time.
@@ -22,24 +22,26 @@ Table~\ref{tbl:ai-tools} lists the tools I used during development and the capac
\hline
\textbf{Tool} & \textbf{Category} & \textbf{Primary use} \\
\hline
Claude~\cite{Anthropic2024Claude} & Chat model & Design discussions, architectural review, debugging assistance, and refactoring proposals. \\
\hline
Claude Code~\cite{AnthropicClaudeCode} & Terminal agent & Multi-file feature implementation against a written spec; codemod-style refactors; and test scaffolding. \\
\hline
OpenCode~\cite{OpenCode} & Terminal agent & Same class of task as Claude Code, used when I preferred an open-source workflow or a different backing model. \\
\hline
Gemini CLI~\cite{GeminiCLI} & Terminal agent & Occasional cross-check on changes produced by a different agent, and work against Google's models when I wanted a second reading of a larger diff. \\
\hline
Antigravity~\cite{GoogleAntigravity} & IDE agent & Editor-integrated agentic coding work, primarily late in the project as the tool became available. \\
\hline
Cursor~\cite{CursorEditor} & Editor & Early development; AI-native editing and indexing. \\
\hline
Zed~\cite{ZedEditor} & Editor & High-performance editing; transition phase before moving to specialized agents. \\
\hline
\end{tabular}
\caption{AI tools used during HRIStudio development.}
\label{tbl:ai-tools}
\end{table}
Beyond cloud-hosted models, I experimented with local execution using \texttt{llama.cpp} to run various open-weights models on my local hardware (Apple M4 Pro, 14-core CPU, 48GB RAM). While the hardware was capable of running 7B and 14B parameter models with high throughput, the reasoning performance of the local models frequently lagged behind the state-of-the-art frontier models. I found that the additional cognitive overhead of correcting errors in local model output outweighed the benefits of offline execution, leading me to rely primarily on the cloud-hosted agents for complex implementation tasks.
\section{Division of Responsibility}
\label{sec:ai-division}
@@ -53,36 +55,24 @@ My working rule throughout the project was that I did the engineering and the ag
\item \textbf{Research design.} The pilot validation study in Chapter~\ref{ch:evaluation} was designed and analyzed entirely by me. The Observer Data Sheet, Design Fidelity Score rubric, and Execution Reliability Score rubric were written by hand. No AI tool was used to score sessions, compute results, or draft claims about what the data showed.
\item \textbf{The prose of this thesis.} Every chapter was written by me. The structure of the argument and the specific claims I make are my own. While AI assisted with the nuances of \LaTeX{} formatting (particularly the generation of TikZ diagrams and complex chart syntax), the linguistic content is mine.
\end{itemize}
The agents handled the work that sat inside those decisions: implementing tRPC procedures from a written signature, generating the Drizzle migration files that matched a schema change I had specified, producing React components from a layout sketch and a list of props, writing the serializer that turned a plugin definition into the JSON format the runtime expected, and applying consistent edits across files when I changed a shared interface. I read every diff before accepting it. When a diff was wrong, I either explained what was wrong and asked for a revision with specifics, or I discarded it and wrote the code myself.
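To give a sense of the granularity, a written signature in one of these specifications typically pinned down the input schema and return shape, leaving the body to the agent. The sketch below is illustrative only; the router, procedure name, and schema are hypothetical, not HRIStudio's actual API.

\begin{verbatim}
import { initTRPC } from "@trpc/server";
import { z } from "zod";

const t = initTRPC.create();

// Spec line: "createStep(experimentId: uuid, name: string) -> { id }"
export const stepRouter = t.router({
  createStep: t.procedure
    .input(z.object({ experimentId: z.string().uuid(), name: z.string().min(1) }))
    .mutation(async ({ input }) => {
      // Persistence elided; a real implementation writes via Drizzle here.
      return { id: crypto.randomUUID(), experimentId: input.experimentId };
    }),
});
\end{verbatim}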
\section{Evolution of the Workflow}
\label{sec:ai-pattern}
The way I used these tools changed as they improved. Early in the project, I treated the agent's output as a draft that required line-by-line review. The typical loop followed five steps: writing a specification, generating a diff, reading the diff, running the code, and then accepting or rejecting.
As the models improved and the agents became more reliable, the focus of my effort shifted. By the final stages of development, I spent significantly less time on manual line-by-line diff reviews and more time on empirical testing. I moved from being a ``code reviewer'' to a ``test-driven supervisor.'' If the agent produced a feature that passed my manual acceptance tests and integrated correctly with the existing system, I was more likely to accept the implementation without a complete audit of every semicolon. This shift allowed me to increase the velocity of development significantly in the weeks leading up to the evaluation.
\section{What Worked and What Did Not}
\label{sec:ai-limits}
The tasks that agents handled well were those with a narrow and well-specified interface. Implementing a tRPC procedure from a signature, writing a Drizzle migration that matched a schema diff, adding a new field through an existing form, or applying a consistent rename across files: these were cheap to specify and the agent's output was usually accepted on the first or second iteration. Agents were also good at scaffolding: producing the initial shape of a component, test file, or API route that I then edited to completion.
The tasks that agents handled poorly were those that required reasoning across more of the system than the context window could hold, or that depended on a piece of context I had not written down. Cross-cutting changes to the experiment and trial data models, for example, required careful coordination across the schema, the tRPC procedures, the execution runtime, and the analysis interface. When I tried to delegate changes of this shape to an agent, the diffs were often locally plausible but globally inconsistent; I ended up doing that work myself. Subtle concurrency and timing questions in the execution layer were another category the agents did not handle well. The event-driven execution model in Chapter~\ref{ch:design} has enough non-obvious ordering constraints that an agent without the full picture tended to introduce races; those parts of the codebase I wrote by hand.
Across the full set of tools I used, the differences in capability for the work I asked of them were smaller than I expected. Any of the agents could, in principle, produce a correct diff for a well-scoped task, and when one tool failed it was usually because the task was underspecified rather than because of a difference in model capability. The practical differences between tools mattered more at the workflow level (which shell integration I preferred, how the tool handled long diffs, how it behaved when it needed to ask for clarification) than at the capability level.
\section{Research Integrity}
\label{sec:ai-integrity}
@@ -94,7 +84,7 @@ Because this thesis reports an empirical evaluation, I treat the boundary betwee
\item No AI tool produced the tables, means, or comparative claims in Chapter~\ref{ch:results}. The numbers were tabulated by hand from the completed Observer Data Sheets reproduced in Appendix~\ref{app:completed_materials}, and the claims about what those numbers support or do not support are mine.
\item No AI tool drafted the prose of this thesis. The chapters were written by me, in my own voice, and I am responsible for every claim they make and every argument they advance. AI tools were occasionally used as a proofreading aid to catch typos, flag awkward phrasing, or suggest an alternative word; however, the sentences are mine.
\item The code that implements HRIStudio and that was the subject of the evaluation was written under the workflow described in Sections~\ref{sec:ai-division} and~\ref{sec:ai-pattern}. Agents produced drafts; I read, tested, and accepted or rejected every one. The final state of the code is the product of my engineering decisions, regardless of who wrote any particular line.
\end{itemize}