Compare commits


6 Commits

Author  SHA1  Message  Date
soconnor  5d8ef0ce76  revisions of the revisions  2026-04-30 00:19:02 -04:00
soconnor  51009cd1ce  add signed cover page  2026-04-29 12:42:41 -04:00
soconnor  28c852a867  feat: add honors council representative and update department name in thesis  2026-04-21 11:00:20 -04:00
soconnor  1404945756  post-defense revisions complete  2026-04-21 00:25:54 -04:00
soconnor  5017133cfb  add embedded PDFs to git  2026-04-20 23:19:10 -04:00
soconnor  a7508c5698  Add appendix on AI-assisted development workflow for HRIStudio  2026-04-20 23:15:23 -04:00
  This commit introduces a new appendix detailing the role of AI coding assistants in the development of HRIStudio. It covers the context of the project, tools used, division of responsibility, interaction patterns, and reflections on research integrity. The workflow is documented to provide transparency and insight into the development process, emphasizing the collaboration between human decisions and AI assistance.
57 changed files with 595 additions and 114 deletions
+2
@@ -12,6 +12,8 @@
 *.synctex.gz
 *.dvi
 *.pdf
+!pdfs/**
+!thesis/pdfs/**
 # Build directory
 build/
-20
@@ -1,20 +0,0 @@
-.DS_Store
-# LaTeX build artifacts (if they leak into root)
-*.aux
-*.bbl
-*.blg
-*.log
-*.out
-*.toc
-*.lof
-*.lot
-*.fls
-*.fdb_latexmk
-*.synctex.gz
-*.dvi
-*.pdf
-# Directories
-build/
-out/
Binary image file added (230 KiB).
Binary image file added (254 KiB).
Binary image file added (260 KiB).
Binary image file added (297 KiB).
+25 -11
@@ -36,6 +36,7 @@
 \setlength{\parskip}{0.2in}
 \newcommand{\advisor}[1]{\newcommand{\advisorname}{#1}}
 \newcommand{\advisorb}[1]{\newcommand{\advisornameb}{#1}}
+\newcommand{\honorscouncilrep}[1]{\newcommand{\honorscouncilrepname}{#1}}
 \newcommand{\chair}[1]{\newcommand{\chairname}{#1}}
 \newcommand{\department}[1]{\newcommand{\departmentname}{#1}}
 \newcommand{\butitle}[1]{\newcommand{\titletext}{#1}}
@@ -114,33 +115,46 @@ in Partial Fulfillment of the Requirements for the Degree of\\
 \today
 \end{center}
+\vspace{0.03in}
+{\small
 \ifthenelse{\boolean{@twoadv}}{
-\vspace{0.25in}
 Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.3in}}\advisorname\\
 \mbox{\hspace{1.3in}}Thesis Advisor
-\vspace{0.25in}
+\vspace{0.03in}
 \mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.3in}}\advisornameb\\
-\mbox{\hspace{1.3in}}Second Reader
-\vspace{0.25in}
+\mbox{\hspace{1.3in}}Reader
+\vspace{0.03in}
 \mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.3in}}\chairname\\
-\mbox{\hspace{1.3in}}Chair of the Department of \departmentname}
-{\vspace{1.0in}
-Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
+\mbox{\hspace{1.3in}}Chair of the Department of \departmentname
+\vspace{0.03in}
+\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
+\mbox{\hspace{1.3in}}\honorscouncilrepname\\
+\mbox{\hspace{1.3in}}Honors Council Representative}
+{Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.3in}}\advisorname \\
 \mbox{\hspace{1.3in}}Thesis Advisor
-\vspace{0.5in}
+\vspace{0.03in}
+\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
+\mbox{\hspace{1.3in}}\advisornameb\\
+\mbox{\hspace{1.3in}}Reader
+\vspace{0.03in}
 \mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
 \mbox{\hspace{1.3in}}\chairname\\
-\mbox{\hspace{1.3in}}Chair of the Department of \departmentname}
+\mbox{\hspace{1.3in}}Chair of the Department of \departmentname
+\vspace{0.03in}
+\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
+\mbox{\hspace{1.3in}}\honorscouncilrepname\\
+\mbox{\hspace{1.3in}}Honors Council Representative}
+}
 \end{singlespace}
 \vfill
 \end{titlepage}}
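
The hunk above adds \honorscouncilrep alongside the existing setter macros; each setter stores its argument in a \...name command (\honorscouncilrepname, \advisorname, and so on) that the approval block prints. A minimal sketch of the preamble usage this implies, with placeholder names rather than the thesis's actual values:

    % Hypothetical usage of the updated class; every name is a placeholder.
    \advisor{Prof. A. Advisor}          % defines \advisorname
    \advisorb{Prof. B. Reader}          % defines \advisornameb
    \chair{Prof. C. Chair}              % defines \chairname
    \department{Computer Science}       % defines \departmentname
    \honorscouncilrep{Prof. D. Rep}     % defines \honorscouncilrepname (new)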
+2 -2
@@ -9,7 +9,7 @@ To build the social robots of tomorrow, researchers must study how people respon
 Social robotics, a subfield of HRI, focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human reactions to robot behaviors are not always predictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.
-To overcome this limitation, researchers use the WoZ technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the "great and powerful" Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In WoZ experiments, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher may use a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.
+To overcome this limitation, researchers use the WoZ technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the ``great and powerful'' Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In WoZ experiments, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher may use a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.
 Despite its versatility, WoZ research faces two critical challenges. The first is \emph{The Accessibility Problem}: many non-programmers, such as experts in psychology or sociology, may find it challenging to conduct their own studies without engineering support. The second is \emph{The Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. Because these tools are tightly coupled to particular hardware, running the same social interaction script on a different robot platform typically requires rebuilding the implementation from scratch. These tools are rarely shared, making it difficult for a researcher to reproduce the same study across different robot platforms or for other labs to replicate results.
@@ -29,4 +29,4 @@ The central question this thesis addresses is: \emph{can the right software arch
 \section{Chapter Summary}
-This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations.
+This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study.
+49 -3
@@ -3,7 +3,7 @@
 This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis within the context of prior work on this topic.
-As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. This thesis is situated within a specific subset of HRI activity: social robotics, a subfield concerned with robots designed for direct social interaction with humans, and more narrowly within that, WoZ experiments used to prototype and evaluate social robot behaviors. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers in WoZ-based HRI studies. First, each research group builds custom software for specific robots, creating tool fragmentation across the field. Second, high technical barriers prevent many domain experts from conducting independent studies.
+As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. This thesis is situated within a specific subset of HRI activity: social robotics, a subfield concerned with robots designed for direct social interaction with humans, and more narrowly within that, WoZ experiments used to prototype and evaluate social robot behaviors. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers in WoZ-based HRI studies. First, high technical barriers prevent many domain experts from conducting independent studies. Second, each research group builds custom software for specific robots, creating tool fragmentation across the field.
 \section{Existing WoZ Platforms and Tools}
@@ -17,9 +17,55 @@ Choregraphe \cite{Pot2009}, developed by Aldebaran Robotics for the NAO and Pepp
 Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high-fidelity data collection.
-This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor. By methodological rigor, I refer to systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs.
+This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor.
-Moreover, few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity---that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+scale=1.0,
+quadbox/.style={rectangle, draw=white, ultra thick, minimum width=5.5cm, minimum height=4.5cm, align=center},
+title/.style={font=\small\bfseries, align=center},
+desc/.style={font=\footnotesize, text=gray!60, align=center},
+axislabel/.style={font=\small\bfseries, align=center}
+]
+% Quadrant Backgrounds
+\fill[gray!20] (0, 4.5) rectangle (5.5, 9.0); % Top Left (HRIStudio)
+\fill[gray!15] (5.5, 4.5) rectangle (11.0, 9.0); % Top Right (Polonius)
+\fill[gray!10] (0, 0) rectangle (5.5, 4.5); % Bottom Left (WoZ4U)
+\fill[gray!5] (5.5, 0) rectangle (11.0, 4.5); % Bottom Right (Choregraphe)
+% Quadrant Lines
+\draw[white, ultra thick] (5.5, 0) -- (5.5, 9.0);
+\draw[white, ultra thick] (0, 4.5) -- (11.0, 4.5);
+% Axis Labels
+\node[axislabel, above] at (2.75, 9.2) {Low technical barrier};
+\node[axislabel, above] at (8.25, 9.2) {High technical barrier};
+\node[axislabel, left] at (-0.2, 6.75) {More rigorous};
+\node[axislabel, left] at (-0.2, 2.25) {Less rigorous};
+% Top Left: The Gap
+\node[axislabel] at (2.75, 6.75) {\Huge ?};
+% Top Right: Polonius, OpenWoZ, SWoOZ
+\node[title] at (8.25, 7.4) {Polonius, OpenWoZ\\SWoOZ, VR Environments};
+\node[desc] at (8.25, 6.0) {Flexible and powerful,\\but requires significant\\programming expertise};
+% Bottom Left: WoZ4U
+\node[title] at (2.75, 2.7) {WoZ4U};
+\node[desc] at (2.75, 1.7) {Accessible, but\\platform-specific\\No methodological rigor};
+% Bottom Right: Choregraphe
+\node[title] at (8.25, 2.7) {Choregraphe};
+\node[desc] at (8.25, 1.7) {Requires specialized\\training\\No methodological rigor};
+\end{tikzpicture}
+\caption{WoZ tool design space by technical barrier and methodological rigor.}
+\label{fig:tool-matrix}
+\end{figure}
+The missing quadrant in Figure~\ref{fig:tool-matrix} matters because methodological rigor requires systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs. Few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity---that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
 \section{Requirements for Modern WoZ Infrastructure}
+3 -2
@@ -5,9 +5,10 @@ Having established the landscape of existing WoZ platforms and their limitations
 \section{Sources of Variability}
-\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. A third interpretation of the term — independent replication of a published study by researchers at other institutions — is distinct from both and is not what this thesis evaluates. It is also worth noting that execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study.
+\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. It is important to note that the term reproducibility may also refer to \emph{allowing independent replications of published studies}; this is not what this thesis evaluates. Execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study.
-In WoZ-based HRI studies, multiple sources of variability can compromise execution consistency. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
+In WoZ-based HRI studies, multiple sources of variability can compromise execution consistency. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants.
+Even with a detailed script, the wizard may vary in timing, with the delay between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
 Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.
+33 -7
@@ -13,7 +13,33 @@ Figure~\ref{fig:experiment-hierarchy} shows this hierarchical structure. Reading
 Figure~\ref{fig:trial-instantiation} illustrates how a protocol definition relates to its instantiation. The left column holds the protocol, defined before the study begins; the right column shows how the abstraction defined as a protocol is instantiated as independent trials. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed.
-To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how a human will remember the story?} The two experiments use different robots: the NAO6, a humanoid robot with expressive gestures and a human-like form, and the TurtleBot, a wheeled mobile robot that is visibly machine-like with no social movement cues. The narrative task remains the same across both experiments; only how the robot delivers it changes.
+To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how a human will remember the story?} The experiment might use different robots, for instance Pepper, NAO6, and TurtleBot. Figure~\ref{fig:robot-morphologies} shows the morphology of these three different robots: Pepper and NAO6 are humanoid social robots with expressive gestures and human-like forms, while TurtleBot is a wheeled mobile robot with a visibly machine-like form and no social movement cues. In the example below, the narrative task remains the same across two robot-specific experiments; only how the robot delivers it changes.
+\begin{figure}[htbp]
+\centering
+\begin{subfigure}[b]{0.3\textwidth}
+\centering
+\includegraphics[width=\textwidth]{images/nao6.jpg}
+\caption{NAO6 (Humanoid)}
+\label{fig:robot-nao}
+\end{subfigure}
+\hfill
+\begin{subfigure}[b]{0.3\textwidth}
+\centering
+\includegraphics[width=\textwidth]{images/pepper.png}
+\caption{Pepper (Social)}
+\label{fig:robot-pepper}
+\end{subfigure}
+\hfill
+\begin{subfigure}[b]{0.3\textwidth}
+\centering
+\includegraphics[width=\textwidth]{images/turtlebot.png}
+\caption{TurtleBot (Mechanical)}
+\label{fig:robot-turtlebot}
+\end{subfigure}
+\caption{Three robot morphologies supported by the HRIStudio architecture.}
+\label{fig:robot-morphologies}
+\end{figure}
 Figure~\ref{fig:example-hierarchy} maps the study presented above onto the hierarchical elements defined in Figure~\ref{fig:experiment-hierarchy}. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same sequence of ordered steps (Intro, Story Telling, Recall Test), and each step defines the specific actions the robot will perform. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure.
@@ -253,7 +279,7 @@ To ensure that data from every experimental phase remains traceable, the system
 \subsection{Architectural Layers}
-Like the ISO/OSI reference model for networking software, HRIStudio separates its communicative and functional responsibilities into distinct layers, as shown in Figure~\ref{fig:three-tier}. More specifically, the system is organized as a three-layer architecture, each layer with a specific responsibility:
+HRIStudio separates its communicative and functional responsibilities into distinct layers, in a manner analogous to the layered reference models used in networking software. More specifically, the system is organized as a three-layer architecture, as shown in Figure~\ref{fig:three-tier}, each layer with a specific responsibility:
 \begin{description}
 \item[User Interface layer.] Runs in researchers' web browsers and exposes the three interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging and dropping experiment components, and reviewing experimental results.
@@ -267,7 +293,7 @@ This separation of concerns provides two concrete benefits. First, each layer ca
 \centering
 \begin{tikzpicture}[
 layer/.style={rectangle, draw=black, thick, fill, minimum width=6.5cm, minimum height=1cm, align=center, text width=6.2cm},
-arrow/.style={->, thick, line width=1.5pt}]
+arrow/.style={-, thick, line width=1.5pt}]
 % Layer 1: UI
 \node[layer, fill=gray!15] (ui) at (0, 3.5) {
@@ -288,8 +314,8 @@ This separation of concerns provides two concrete benefits. First, each layer ca
 };
 % Arrows (bidirectional)
-\draw[<->, thick, line width=1.5pt] (ui.south) -- (logic.north);
-\draw[<->, thick, line width=1.5pt] (logic.south) -- (data.north);
+\draw[-, thick, line width=1.5pt] (ui.south) -- (logic.north);
+\draw[-, thick, line width=1.5pt] (logic.south) -- (data.north);
 \end{tikzpicture}
 \caption{Three-layer architecture separates user interface, application logic, and data/robot control.}
@@ -300,7 +326,7 @@ This separation of concerns provides two concrete benefits. First, each layer ca
 During the design phase, researchers create experiment specifications that are stored in the system database. During a trial, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete records. After the trial, researchers can inspect these records through the Analysis interface.
-The flow of data during a trial proceeds through six distinct phases, as shown in Figure~\ref{fig:trial-dataflow}:
+The flow of data during a trial proceeds through six distinct phases as discussed below; these phases are summarized in Figure~\ref{fig:trial-dataflow}:
 \begin{enumerate}
 \item A researcher creates an experiment protocol using the Design interface.
@@ -335,7 +361,7 @@ This design creates automatically a comprehensive documentation of every trial,
 \draw[arrow] (s5.south) -- (s6.north);
 \end{tikzpicture}
-\caption{Trial data flow: from protocol design through execution and recording, to analysis and playback.}
+\caption{Six-phase trial data flow.}
 \label{fig:trial-dataflow}
 \end{figure}
+104 -3
@@ -7,10 +7,89 @@ HRIStudio is a complete, operational platform that realizes the design principle
 HRIStudio follows the model of a web application. Users access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connect to the same shared system. This eliminates the need for a local installation and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables easy collaboration (R6): multiple team members can access experiment data and observe trials simultaneously from different machines without any additional configuration.
-I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is presented in Chapter~\ref{ch:design} and shown in Figure~\ref{fig:three-tier}. In practice, the User Interface layer runs in each researcher's browser (the client), while the Application Logic and Data \& Robot Control layers run on a shared application server. It is essential that this server and the robot control hardware run on the same local network. This keeps communication latency low during trials: a noticeable delay between the wizard's input and the robot's response would break the interaction.
+I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is presented in Chapter~\ref{ch:design} and shown in Figure~\ref{fig:three-tier}. In practice, the User Interface layer runs in each researcher's browser (the client), while the Application Logic and Data \& Robot Control layers run on a shared application server.
+While the system can run entirely on a single machine for local testing, this architecture allows the components to be distributed across different systems. The application server can be hosted centrally or even in a remote data center, enabling observers to connect to a live trial from any location with internet access. In such a configuration, it is essential that the robot control hardware and the client computer running the wizard's Execution interface stay on the same local network as the robot. This ensures that the WebSocket-based communication between the wizard and the robot bridge maintains low latency, as a noticeable delay between the wizard's input and the robot's response would break the interaction.
+This flexibility of deployment also addresses the varying data security and compliance needs of different research institutions. A lab may choose to host HRIStudio on a public-facing server to prioritize collaborative ease and accessibility for remote team members. Alternatively, a lab with strict data privacy requirements or institutional review board (IRB) constraints can deploy the entire stack on a private, air-gapped network. Because the platform is self-contained and does not rely on external cloud services for its core execution logic, researchers have full control over where their experimental data is stored and who can access it.
 I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
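
As a concrete illustration of this single-language benefit, consider a minimal TypeScript sketch; the type and field names are assumptions for illustration, not taken from the HRIStudio codebase:

    // Hypothetical shared type imported by both the browser UI layer and
    // the server-side execution engine.
    export interface ExperimentAction {
      type: string;                        // e.g. "say" or "gesture"
      parameters: Record<string, unknown>; // action-specific arguments
    }

    export interface ExperimentStep {
      id: string;
      name: string;
      actions: ExperimentAction[];
    }
    // Renaming a field above is a compile-time error at every stale usage
    // across the stack, rather than a runtime failure during a trial.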
+HRIStudio is released as open-source software under the MIT License, with the application hosted at a public repository~\cite{HRIStudioRepo}. The companion robot plugin repository~\cite{RobotPluginsRepo} is maintained as a git submodule and is updated whenever HRIStudio requires schema or protocol updates. Both repositories are available for inspection, extension, and deployment by other research groups.
+HRIStudio is implemented as a set of containerized services that work together to provide the platform's functionality. This modular architecture ensures that each component can be scaled or replaced independently as requirements change.
+\begin{figure}[htbp]
+\centering
+\begin{tikzpicture}[
+node distance=0.8cm and 1.8cm,
+servicebox/.style={rectangle, draw=black, thick, fill=gray!15, align=center, font=\small, inner sep=5pt, minimum width=2.2cm},
+containerbox/.style={rectangle, draw=black, thick, dashed, fill=gray!5, align=center, font=\small\bfseries, inner sep=12pt},
+wsbox/.style={rectangle, draw=black, ultra thick, fill=white, align=center, font=\scriptsize\bfseries, inner sep=3pt},
+arrow/.style={->, thick, >=stealth},
+darrow/.style={<->, thick, >=stealth, dashed},
+labelstyle/.style={font=\scriptsize\itshape, align=center}
+]
+% HRIStudio System Container Services
+\node[servicebox] (nextjs) {Next.js\\Server};
+\node[servicebox, below=of nextjs] (postgres) {PostgreSQL\\Database};
+\node[servicebox, below=of postgres] (minio) {MinIO\\Object Storage};
+\draw[arrow] (nextjs) -- (postgres);
+\draw[arrow] (nextjs) -- (minio);
+% HRIStudio Container Boundary
+\begin{scope}[on background layer]
+\node[containerbox, fit=(nextjs) (postgres) (minio), inner sep=15pt] (hri_cont) {};
+\node[anchor=south, font=\small\bfseries, yshift=2pt] at (hri_cont.north) {HRIStudio System};
+\end{scope}
+% NAO6 Integration Bridge Container Services
+\node[servicebox, right=4.5cm of nextjs] (driver) {NAOqi\\Driver};
+\node[servicebox, below=of driver] (ros) {ROS 2\\Core};
+\node[servicebox, below=of ros] (adapter) {HRIStudio\\Adapter};
+\draw[darrow] (driver) -- (ros);
+\draw[darrow] (ros) -- (adapter);
+% Bridge Container Boundary
+\begin{scope}[on background layer]
+\node[containerbox, fit=(driver) (ros) (adapter), inner sep=15pt] (bridge_cont) {};
+\node[anchor=south, font=\small\bfseries, yshift=2pt] at (bridge_cont.north) {NAO6 Bridge};
+\end{scope}
+% Client/Wizard
+\node[servicebox] (client) at ($(hri_cont.north)!0.5!(bridge_cont.north) + (0, 2.2)$) {Wizard Browser};
+% WebSocket Connections
+\node[wsbox] (sys_ws) at ($(client.south)!0.5!(hri_cont.north)$) {System WebSocket};
+\node[wsbox] (robot_ws) at ($(client.south)!0.5!(bridge_cont.north)$) {Robot WebSocket};
+\draw[darrow] (client.south) -- (sys_ws.north);
+\draw[darrow] (sys_ws.south) -- (hri_cont.north);
+\draw[darrow] (client.south) -- (robot_ws.north);
+\draw[darrow] (robot_ws.south) -- (bridge_cont.north);
+% Hardware
+\node[servicebox, right=1.5cm of bridge_cont] (robot) {NAO6\\Robot};
+\draw[arrow] (bridge_cont.east) -- node[above, font=\scriptsize, align=center] {NAOqi\\API} (robot.west);
+\end{tikzpicture}
+\caption{Containerized HRIStudio and NAO6 integration architecture.}
+\label{fig:system-architecture}
+\end{figure}
+The HRIStudio system consists of three primary services: a Next.js application server that handles the user interface and business logic, a PostgreSQL database for persistent storage of experiment and trial data, and a MinIO object storage service for managing large media files like video and audio recordings. For robot integration, the \texttt{nao6-hristudio-integration} bridge also employs a containerized structure consisting of the NAOqi driver, a ROS 2 core for message routing, and a specialized adapter that communicates with HRIStudio.
+During a live trial, the wizard's browser establishes two independent WebSocket connections. The System WebSocket connects to the HRIStudio server to manage trial state, protocol progression, and logging. The Robot WebSocket connects directly to the integration bridge to provide low-latency control of the robot platform. This split-connection model ensures that system-level management does not introduce latency into the robot's physical responses.
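
A minimal browser-side sketch of this split-connection model; the endpoint URLs and message shapes below are assumptions for illustration, not the actual HRIStudio protocol:

    // Two independent WebSocket connections, as described above.
    const systemWs = new WebSocket("wss://hristudio.example.edu/ws/trial"); // state + logging
    const robotWs = new WebSocket("ws://192.168.1.50:9090");                // low-latency control

    function triggerAction(actionId: string): void {
      // The robot command goes straight to the bridge, so management
      // traffic never adds latency to the robot's physical response.
      robotWs.send(JSON.stringify({ type: "execute", actionId }));
      // The corresponding log entry travels over the system connection.
      systemWs.send(JSON.stringify({ type: "log", actionId, at: Date.now() }));
    }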
+\subsection{Working with AI Coding Assistants}
+\label{sec:ai-ws}
+The scale of the implementation described in this chapter, a full-stack TypeScript application spanning user interface, application logic, persistent storage, and real-time robot control, would not have been possible within the timeframe of this thesis without the use of AI coding assistants. I distinguish clearly between the engineering and implementation roles in this work: I architected the system, made the design decisions documented in Chapter~\ref{ch:design} and this chapter, specified the behavior and constraints of each component, and reviewed and integrated all code before it entered the codebase. AI agents acted as software developers working under that direction, producing TypeScript code in response to the specifications I provided and the feedback I gave as the implementation evolved. The division of labor was consistent throughout: I engineered, they implemented.
+The tools I used in this capacity spanned several vendors and interaction paradigms, and the set evolved as the AI landscape changed over the course of the project. Claude~\cite{Anthropic2024Claude} was the conversational model I relied on most consistently for design discussions and code review. I used Claude Code~\cite{AnthropicClaudeCode}, OpenCode~\cite{OpenCode}, the Gemini CLI~\cite{GeminiCLI}, and Google Antigravity~\cite{GoogleAntigravity} as terminal- and editor-integrated coding agents for implementing the features I specified; the Zed editor~\cite{ZedEditor} served as the surrounding development environment and provided its own AI-assisted editing features. These tools overlapped in places, but I generally used one at a time and switched between them as new capabilities became available and as I learned which tool suited which kind of work. Appendix~\ref{app:ai_workflow} documents this workflow in more detail: the division of responsibility between me and the agents, the kinds of tasks each category of tool handled well, and the limits I ran into.
 \section{Experiment Storage and Trial Logging}
 The system saves experiment descriptions to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification.
@@ -85,10 +164,19 @@ When a trial begins, the system creates a new trial record linked to that experi
 \label{fig:trial-record}
 \end{figure}
-Video and audio are recorded locally in the researcher's browser during the trial rather than streamed to the server in real time. This prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the browser transfers the complete recordings to the server and associates them with the trial record. The Analysis interface can align video and audio with the logged actions without any manual synchronization, because the timestamp when recording starts is logged alongside the action log.
+Video and audio are recorded locally in the wizard's browser during the trial rather than streamed to the server in real time. The wizard's browser is the canonical recording client because the wizard is the only role required for a trial to run; observer and researcher roles connect in read-only capacities and do not capture media. Recording locally prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the wizard's browser transfers the complete recordings to the server and associates them with the trial record. The Analysis interface can align video and audio with the logged actions without any manual synchronization, because the timestamp when recording starts is logged alongside the action log.
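
A sketch of how such browser-local recording can be built on the standard MediaRecorder API; the upload endpoint and function names are assumptions, not the actual implementation:

    // Record locally during the trial; upload only after it concludes.
    async function recordTrial(trialId: string, logStart: (at: number) => void) {
      const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
      const recorder = new MediaRecorder(stream);
      const chunks: Blob[] = [];

      recorder.ondataavailable = (e) => chunks.push(e.data);
      // Log the recording-start timestamp alongside the action log so the
      // Analysis interface can align media and events automatically.
      recorder.onstart = () => logStart(Date.now());
      recorder.onstop = async () => {
        const media = new Blob(chunks, { type: "video/webm" });
        await fetch(`/api/trials/${trialId}/media`, { method: "POST", body: media });
      };

      recorder.start();
      return recorder; // caller invokes recorder.stop() when the trial ends
    }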
 The system stores structured and media data separately. Experiment specifications and trial records are stored in the same structured database, which makes it efficient to query across trials (for example, retrieving all trials for a specific participant or comparing action timing across conditions). Video and audio files are stored in a dedicated file store, since their size makes them unsuitable for a database and the system never queries their content directly.
+Figure~\ref{fig:trial-report} shows the Analysis interface reconstructing a completed trial. The recorded video is presented alongside a synchronized action log, with each logged event linked to its moment in the recording so researchers can jump directly to the corresponding interaction without manual cross-referencing.
+\begin{figure}[htbp]
+\centering
+\includegraphics[width=0.95\textwidth]{assets/trial-report.png}
+\caption{The HRIStudio Analysis interface showing a completed trial with video and a synchronized, timestamped action log.}
+\label{fig:trial-report}
+\end{figure}
 \section{The Execution Engine}
 The execution engine is the component that runs a trial: it loads the experiment, manages the wizard's connection, sends robot commands, and keeps all connected clients in sync.
@@ -97,9 +185,18 @@ When a trial begins, the server loads the experiment and maintains a live connec
 No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. Unscripted actions give the wizard the tools to record how these interactions unfold when deviations from the script are required. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
+Figure~\ref{fig:execution-view} shows the Execution interface as it appears to a wizard during a live trial. The current step is highlighted in the protocol sidebar, the available actions for that step are surfaced as triggerable buttons, and the wizard has manual-control affordances for introducing unscripted actions that the system will flag as deviations in the trial log.
+\begin{figure}[htbp]
+\centering
+\includegraphics[width=0.95\textwidth]{assets/execution-view.png}
+\caption{The HRIStudio Execution interface during a live trial, showing the current step, available actions, and manual deviation controls.}
+\label{fig:execution-view}
+\end{figure}
 \section{Robot Integration}
-A plugin file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the plugin file.
+A plugin file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. For the NAO6 platform, I developed a specialized ROS-based bridge called \texttt{nao6-hristudio-integration}~\cite{NaoIntegrationRepo} that translates HRIStudio commands into the NAOqi API calls required by the robot. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the plugin file.
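
A sketch of what a plugin entry might look like; the schema and field names are assumptions for illustration (the NAOqi method names are real, but this mapping format is not taken from the plugin repository):

    // Hypothetical plugin shape: abstract action types mapped to the
    // platform-specific commands the bridge dispatches.
    interface ActionMapping {
      actionType: string;   // name used in the Design interface
      command: string;      // robot-specific call
      parameters: string[]; // expected argument names
    }

    const nao6Plugin: { platform: string; actions: ActionMapping[] } = {
      platform: "nao6",
      actions: [
        { actionType: "say", command: "ALTextToSpeech.say", parameters: ["text"] },
        { actionType: "gesture", command: "ALAnimationPlayer.run", parameters: ["animation"] },
      ],
    };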
 The execution engine treats control flow elements such as branches and conditionals, which function as elements of a computer program, the same way as robot actions. These control-flow elements appear as action groups in the experiment and are evaluated during the trial, so researchers can freely mix logical decisions and physical robot behaviors when designing an experiment without any special handling.
@@ -150,6 +247,10 @@ Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and Tur
 \label{fig:plugin-architecture}
 \end{figure}
+\subsection{Containerized Development Environment}
+To support development and testing for the NAO platform, I also developed \texttt{nao-workspace}, a containerized workspace~\cite{NaoWorkspaceRepo}. This was motivated by the technical constraints of Choregraphe and its related libraries, which only supported x86-64 systems running Ubuntu 22.04. The containerized structure was the only way I could run the proprietary NAO development tools on modern hardware. While I developed this stack primarily to enable technical testing and material preparation during the project, the resulting tooling may be useful to other HRI researchers facing similar platform constraints.
 \section{Access Control}
 I implemented access control using a role-based access control (RBAC) model with two layers. System-level roles govern what a user can do across the platform (administrator, researcher, wizard, observer), while study-level roles govern what a user can see and do within a specific study (owner, researcher, wizard, observer). The two layers are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. Within a study, the four study-level roles define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires. The capabilities and constraints for each role are described below:
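
A minimal sketch of the two-layer check this implies; the type names and the specific capability rule are illustrative assumptions, not the production authorization code:

    // System-level and study-level roles are checked independently.
    type SystemRole = "administrator" | "researcher" | "wizard" | "observer";
    type StudyRole = "owner" | "researcher" | "wizard" | "observer";

    interface User {
      systemRole: SystemRole;
      studyRoles: Map<string, StudyRole>; // studyId -> role within that study
    }

    // Example capability check: may this user run trials for this study?
    function canRunTrial(user: User, studyId: string): boolean {
      const systemOk = user.systemRole !== "observer";                 // platform layer
      const studyRole = user.studyRoles.get(studyId);
      const studyOk = studyRole === "owner" || studyRole === "wizard"; // study layer
      return systemOk && studyOk;
    }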
+25 -5
View File
@@ -13,7 +13,7 @@ I hypothesized that HRIStudio would improve both accessibility and reproducibili
\section{Study Design}
I used what Bartneck et al.~\cite{Bartneck2024} call a \emph{between-subjects design}, in which each participant is assigned to only one condition. To ensure that programming experience was balanced across conditions, I stratified assignment by self-reported programming background: each wizard was first classified as having \emph{None}, \emph{Moderate}, or \emph{Extensive} programming experience, and then randomly assigned within that stratum to HRIStudio or Choregraphe. This produced a design in which each condition contained exactly one wizard at each experience level, reducing the risk that tool effects would be confounded with differences in programming experience. Both groups received the same task, the same time allocation, and a similar training structure. Because each wizard used only one tool, the design also avoided carryover effects from prior exposure to the other condition.
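The assignment logic itself is small; the sketch below restates it in TypeScript, assuming (as in this study) exactly two wizards per experience stratum. The function name and data shapes are illustrative, not part of any study tooling.

\begin{verbatim}
type Experience = "None" | "Moderate" | "Extensive";
type Condition = "HRIStudio" | "Choregraphe";

// Sketch of stratified assignment: each stratum is assumed to
// contain exactly two wizards, randomly split between conditions.
function assignConditions(
  wizards: { id: string; experience: Experience }[],
): Map<string, Condition> {
  const strata = new Map<Experience, string[]>();
  for (const w of wizards) {
    const ids = strata.get(w.experience) ?? [];
    ids.push(w.id);
    strata.set(w.experience, ids);
  }
  const assignment = new Map<string, Condition>();
  for (const [first, second] of strata.values()) {
    const flip = Math.random() < 0.5;
    assignment.set(first, flip ? "HRIStudio" : "Choregraphe");
    assignment.set(second, flip ? "Choregraphe" : "HRIStudio");
  }
  return assignment;
}
\end{verbatim}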
\section{Participants}
@@ -48,6 +48,26 @@ The control condition used Choregraphe \cite{Pot2009}, a proprietary visual prog
The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through plugin files, though for this study both tools controlled the same NAO platform.
Figure~\ref{fig:design-tool-compare} places the two design environments side by side. On the left, Choregraphe's behavior-box canvas (Figure~\ref{fig:choregraphe-ui}) lets the wizard wire nodes and transitions in a finite-state-machine layout. On the right, HRIStudio's experiment designer (Figure~\ref{fig:hristudio-designer}) presents the same protocol as a vertical action timeline with dedicated blocks for speech, gesture, and conditional branching.
\begin{figure}[htbp]
\centering
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/choregraphe.png}
\subcaption{Choregraphe: behavior-box canvas with nodes and transitions.}
\label{fig:choregraphe-ui}
\end{minipage}\hfill
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/experiment-designer.png}
\subcaption{HRIStudio: vertical action timeline with structured step and action blocks.}
\label{fig:hristudio-designer}
\end{minipage}
\caption{The two design environments compared. Each wizard used one of these tools to implement the Interactive Storyteller specification.}
\label{fig:design-tool-compare}
\end{figure}
\section{Procedure}
Each wizard completed a single 60-minute session structured in four phases.
@@ -77,7 +97,7 @@ The study collected five measures, two primary and three supplementary, operatio
I define the Design Fidelity Score (DFS) as a measure of how completely and correctly the wizard implemented the specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored as present, correct, and independently achieved.
The DFS rubric includes an \emph{Assisted} column. For each rubric item, I marked a T if I provided a tool-operation intervention specifically for that item during the design phase (for example, if I explained how to add a gesture node or how to wire a conditional branch). T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions (task clarification, hardware issues, or momentary forgetfulness) are not marked T, because those categories of difficulty are independent of the tool under evaluation.
DFS is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses the question: did the tool allow a wizard to independently produce a correct design?
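Mechanically, the score is a normalized weighted sum. The sketch below shows that computation with illustrative fields; it is not the actual nine-criterion rubric, which is reproduced in Appendix~\ref{app:blank_templates}.

\begin{verbatim}
interface RubricItem {
  criterion: string;
  weight: number;    // points available for this item
  present: boolean;
  correct: boolean;
  assisted: boolean; // the T mark, reported separately
}

// Sketch: full weight is earned only when an item is both present
// and correct; T marks never change the point total.
function designFidelityScore(rubric: RubricItem[]): number {
  const earned = rubric
    .filter((item) => item.present && item.correct)
    .reduce((sum, item) => sum + item.weight, 0);
  const available = rubric.reduce((sum, item) => sum + item.weight, 0);
  return (earned / available) * 100; // normalized to the 0-100 scale
}
\end{verbatim}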
@@ -98,9 +118,9 @@ The System Usability Scale (SUS) is a validated 10-item questionnaire measuring
During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. The four intervention type codes are:
\begin{description}
\item[T (tool-operation).] I explained how to operate a specific feature of the assigned software tool.
\item[C (task clarification).] I clarified the written specification or an aspect of the task design.
\item[H (hardware or technical).] I addressed a robot connection issue or other technical problem outside the wizard's control.
\item[G (general).] Brief assistance not attributable to the tool or the task, such as momentary forgetfulness.
\end{description}
+142 -39
View File
@@ -5,40 +5,64 @@ This chapter presents the results of the pilot validation study described in Cha
\section{Participant Overview}
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University professors drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Condition} & \textbf{Background} & \makecell[l]{\textbf{Programming}\\\textbf{Experience}} \\
\hline
W-01 & Choregraphe & Digital Humanities & None \\
\hline
W-02 & HRIStudio & Logic and Philosophy of Science & Moderate \\
\hline
W-03 & Choregraphe & Computer Science & Extensive \\
\hline
W-04 & Choregraphe & Chemical Engineering & Moderate \\
\hline
W-05 & HRIStudio & Chemical Engineering & None \\
\hline
W-06 & HRIStudio & Computer Science & Extensive \\
\hline
\end{tabular}
\caption{Summary of wizard participants and assigned conditions.}
\label{tbl:sessions}
\end{table}
Table~\ref{tbl:primary-outcomes} presents the primary outcome scores, which are discussed next.
\section{Primary Measures}
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|r|r|r|}
\hline
\textbf{ID} & \textbf{Condition} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} \\
\hline
W-01 & Choregraphe & 42.5 & 65 & 60 \\
\hline
W-02 & HRIStudio & 100 & 95 & 90 \\
\hline
W-03 & Choregraphe & 65 & 60 & 75 \\
\hline
W-04 & Choregraphe & 62.5 & 75 & 42.5 \\
\hline
W-05 & HRIStudio & 100 & 95 & 70 \\
\hline
W-06 & HRIStudio & 100 & 100 & 70 \\
\hline
\end{tabular}
\caption{Primary outcome scores by wizard and condition.}
\label{tbl:primary-outcomes}
\end{table}
\subsection{Design Fidelity Score (DFS)}
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification of their assigned experiment. Scores range from 0 to 100, with full points awarded only when a component (a rubric criterion representing a required speech action, gesture, or control-flow element) is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)
Across the six participants, DFS scores divided sharply by study condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.
@@ -92,37 +116,63 @@ W-06 rated HRIStudio with a SUS score of 70. W-06, a Computer Science faculty me
HRIStudio study condition SUS scores were 90, 70, and 70 (mean 76.7). Choregraphe study condition SUS scores were 60, 75, and 42.5 (mean 59.2).
Figure~\ref{fig:results-chart} summarizes the three primary measures side-by-side. In each group, the left bar represents the Choregraphe mean and the right bar represents the HRIStudio mean. HRIStudio exceeds Choregraphe on every measure, with the largest gap on DFS (43.3 points) and the smallest on SUS (17.5 points).
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
% Axes
\draw[thick] (0,0) -- (0,6.3);
\draw[thick] (0,0) -- (11.2,0);
% Y-axis ticks and labels (0--100, with 1 unit = 0.06 cm)
\foreach \tick/\val in {0/0, 1.2/20, 2.4/40, 3.6/60, 4.8/80, 6.0/100} {
\draw (-0.08, \tick) -- (0, \tick);
\node[left, font=\footnotesize] at (-0.05, \tick) {\val};
}
\node[rotate=90, font=\small] at (-1.05, 3.0) {Mean Score (0--100)};
% Horizontal gridlines
\foreach \tick in {1.2, 2.4, 3.6, 4.8, 6.0} {
\draw[gray!25, thin] (0.02, \tick) -- (11.2, \tick);
}
% DFS group
\fill[gray!40, draw=black] (1.0, 0) rectangle (2.3, 3.402);
\fill[gray!75, draw=black] (2.4, 0) rectangle (3.7, 6.000);
\node[font=\footnotesize] at (1.65, 3.60) {56.7};
\node[font=\footnotesize] at (3.05, 6.20) {100};
\node[font=\small] at (2.35, -0.38) {DFS};
% ERS group
\fill[gray!40, draw=black] (4.5, 0) rectangle (5.8, 4.002);
\fill[gray!75, draw=black] (5.9, 0) rectangle (7.2, 5.802);
\node[font=\footnotesize] at (5.15, 4.20) {66.7};
\node[font=\footnotesize] at (6.55, 6.00) {96.7};
\node[font=\small] at (5.85, -0.38) {ERS};
% SUS group
\fill[gray!40, draw=black] (8.0, 0) rectangle (9.3, 3.552);
\fill[gray!75, draw=black] (9.4, 0) rectangle (10.7, 4.602);
\node[font=\footnotesize] at (8.65, 3.75) {59.2};
\node[font=\footnotesize] at (10.05, 4.80) {76.7};
\node[font=\small] at (9.35, -0.38) {SUS};
% Legend
\fill[gray!40, draw=black] (2.6, -1.25) rectangle (3.0, -1.00);
\node[anchor=west, font=\footnotesize] at (3.1, -1.125) {Choregraphe};
\fill[gray!75, draw=black] (7.0, -1.25) rectangle (7.4, -1.00);
\node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio};
\end{tikzpicture}
\caption{Mean scores by condition across the three primary outcome measures.}
\label{fig:results-chart}
\end{figure}
\section{Supplementary Measures}
\subsection{Session Timing}
Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Training} & \textbf{Design} & \textbf{Trial} & \textbf{Debrief} & \textbf{Total} \\
\hline
W-01 & 15 min & 35 min & 5 min & 5 min & 60 min \\
\hline
W-02 & 7 min & 24 min & 5 min & 5 min & 41 min \\
\hline
W-03 & 12 min & 37 min & 5 min & 5 min & 59 min \\
\hline
W-04 & 17 min & 35 min & 4 min & 4 min & 60 min \\
\hline
W-05 & 6 min & 18 min & 4 min & 4 min & 32 min \\
\hline
W-06 & 8 min & 21 min & 3 min & 5 min & 37 min \\
\hline
\end{tabular}
\caption{Time spent in each session phase per wizard participant.}
\label{tbl:timing}
\end{table}
W-01's design phase extended to 35 minutes, five minutes over the 30-minute allocation, compressing the trial and debrief to 5 minutes each. Despite this, W-01 declared the design complete rather than abandoning it, and the robot executed a recognizable version of the specification during the trial.
W-02's training phase concluded in 7 minutes, roughly half the standard 15-minute allocation. This reflects HRIStudio's more intuitive onboarding rather than simply W-02's technical background: the platform's guided workflow and timeline-based model required less explanation before the wizard was ready to begin the design phase. W-02's design phase then concluded in 24 minutes, within the allocation, and the trial ran for approximately five minutes.
@@ -137,6 +187,59 @@ W-06's training phase concluded in 8 minutes and the design phase completed in 2
Across all six sessions, Choregraphe design phases averaged approximately 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs before the session time limit, while W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across three sessions, all within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 7 minutes.
Figure~\ref{fig:timing-chart} compares the per-condition means for training, design, and total session duration. The gap is concentrated in the design phase and carries through to the total session length; training duration also diverges, with Choregraphe wizards requiring roughly twice as long to reach readiness.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
% Axes (1 minute = 0.1 cm, so 60 min = 6 cm)
\draw[thick] (0,0) -- (0,6.3);
\draw[thick] (0,0) -- (11.2,0);
% Y-axis ticks and labels (0--60 minutes)
\foreach \tick/\val in {0/0, 1/10, 2/20, 3/30, 4/40, 5/50, 6/60} {
\draw (-0.08, \tick) -- (0, \tick);
\node[left, font=\footnotesize] at (-0.05, \tick) {\val};
}
\node[rotate=90, font=\small] at (-1.05, 3.0) {Mean Duration (minutes)};
% Horizontal gridlines
\foreach \tick in {1,2,3,4,5,6} {
\draw[gray!25, thin] (0.02, \tick) -- (11.2, \tick);
}
% Training group — Choregraphe 14.7, HRIStudio 7.0
\fill[gray!40, draw=black] (1.0, 0) rectangle (2.3, 1.47);
\fill[gray!75, draw=black] (2.4, 0) rectangle (3.7, 0.70);
\node[font=\footnotesize] at (1.65, 1.67) {14.7};
\node[font=\footnotesize] at (3.05, 0.90) {7.0};
\node[font=\small] at (2.35, -0.38) {Training};
% Design group — Choregraphe 35.7, HRIStudio 21.0
\fill[gray!40, draw=black] (4.5, 0) rectangle (5.8, 3.57);
\fill[gray!75, draw=black] (5.9, 0) rectangle (7.2, 2.10);
\node[font=\footnotesize] at (5.15, 3.77) {35.7};
\node[font=\footnotesize] at (6.55, 2.30) {21.0};
\node[font=\small] at (5.85, -0.38) {Design};
% Total group — Choregraphe 59.7, HRIStudio 36.7
\fill[gray!40, draw=black] (8.0, 0) rectangle (9.3, 5.97);
\fill[gray!75, draw=black] (9.4, 0) rectangle (10.7, 3.67);
\node[font=\footnotesize] at (8.65, 6.17) {59.7};
\node[font=\footnotesize] at (10.05, 3.87) {36.7};
\node[font=\small] at (9.35, -0.38) {Total Session};
% Legend
\fill[gray!40, draw=black] (2.6, -1.25) rectangle (3.0, -1.00);
\node[anchor=west, font=\footnotesize] at (3.1, -1.125) {Choregraphe};
\fill[gray!75, draw=black] (7.0, -1.25) rectangle (7.4, -1.00);
\node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio};
\end{tikzpicture}
\caption{Mean phase durations by condition.}
\label{fig:timing-chart}
\end{figure}
\subsection{Intervention Log}
W-01 generated a high volume of help requests during the design phase, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern, understanding the goal but struggling with the mechanism, is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
+6 -6
View File
@@ -11,7 +11,7 @@ The first research question asked whether HRIStudio enables domain experts witho
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a Choregraphe mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe study condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio study condition produced the highest (90, W-02). Because assignment was stratified by programming background, each condition contains exactly one wizard with \emph{None} experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience, enabling a direct cross-background comparison: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference. It is worth noting that only one participant (W-01, Digital Humanities) came from a non-STEM discipline; the remaining five wizards held backgrounds in Computer Science, Chemical Engineering, or Logic and Philosophy of Science, a composition that limits claims about accessibility for humanities-domain researchers.
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility research question. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
@@ -29,15 +29,15 @@ ERS scores reflect the downstream effect of these design differences. Choregraph
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied.
Across all six sessions, design phase overruns are concentrated in the Choregraphe study condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe study condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator, and supports the conclusion that the overrun pattern is attributable to assigned tool rather than wizard background alone. Because programming experience was balanced across conditions by stratified assignment, the design-phase timing difference cannot be attributed to prior programming experience.
\section{Comparison to Prior Work}
Because assignment was stratified by programming experience, the overall 17.5-point gap between the two conditions' mean SUS scores reflects a genuine tool-level effect rather than a sampling artifact. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study confirms that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards (spanning no, moderate, and extensive programming experience) achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform.
The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data confirms this design intent: no speech content deviations occurred across all three HRIStudio sessions, and the HRIStudio ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial.
The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. Because programming experience is balanced across conditions by design, the overall 17.5-point gap in the two conditions' means reflects a genuine tool-level effect rather than an artifact of the sample's background composition. The gap is largest at the Moderate experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the Extensive level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference while the accessibility advantage remains pronounced for non-programmers and moderate programmers.
\section{Limitations}
@@ -49,10 +49,10 @@ This study has several limitations that must be considered when interpreting the
\textbf{Single task.} Both study conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Uncontrolled dimensions.} Programming experience was balanced across conditions by stratified assignment (see Section~\ref{sec:measures} and Chapter~\ref{ch:evaluation}): each condition contains one wizard at each of the three experience levels (\emph{None}, \emph{Moderate}, \emph{Extensive}). This controls for programming background as a potential confounder but does not extend to other dimensions. The small $N$ means that balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was not assessed or controlled and remains a source of variability not addressed in this pilot.
\textbf{Platform version.} HRIStudio is continuously evolving. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. HRIStudio wizards uniformly achieved perfect design fidelity (DFS 100) and near-perfect execution reliability (mean ERS 96.7), while Choregraphe wizards averaged DFS 56.7 and ERS 66.7, with design overruns in all three sessions and no session completing without researcher guidance. The W-01 content deviation (see Section~\ref{sec:results-qualitative}) illustrates the reproducibility problem concretely; its absence in all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across study conditions by stratified assignment, strengthening the cross-background comparisons. The limitations of this pilot study, including sample size, task simplicity, and the single-session design, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
+97
View File
@@ -0,0 +1,97 @@
\chapter{AI-Assisted Development Workflow}
\label{app:ai_workflow}
This appendix documents the role that AI coding assistants played in the construction of HRIStudio. It is included both for transparency about how the system was built and because the workflow itself is, in my view, one of the more interesting artifacts produced by the project. Section~\ref{sec:ai-ws} in Chapter~\ref{ch:implementation} introduces the topic briefly; here I describe the specific responsibilities I kept for myself, the tasks I delegated to coding agents, the tools I used, the limits I encountered, and the integrity controls I maintained between implementation work and the evaluation reported in Chapter~\ref{ch:results}.
\section{Context}
\label{sec:ai-context}
I built HRIStudio while also carrying a full course load, writing this thesis, and running the pilot validation study described in Chapter~\ref{ch:evaluation}. The feature surface described in Chapters~\ref{ch:design} and~\ref{ch:implementation} is larger than what I could reasonably have produced on that schedule without assistance, given both the scope and the level of ambition of the work. AI coding assistants made that scope tractable. They did not replace design judgment; they reduced the cost of the mechanical work that sits between a well-specified design and a working feature: scaffolding new modules, implementing well-defined create/read/update/delete (CRUD) and validation code, applying consistent patterns across files, and producing the many small edits that a project of this size accumulates.
The set of tools available to me as a solo developer changed substantially during the project's timeline. When I began, agentic coding tools were still early and most of my AI use was conversational, primarily through Cursor~\cite{CursorEditor} and Zed~\cite{ZedEditor}. By the end of the project, multiple mature terminal- and editor-integrated agents were available. I changed tools as the landscape evolved, eventually moving into a mixed workflow between Visual Studio Code, Antigravity~\cite{GoogleAntigravity}, Claude Code~\cite{AnthropicClaudeCode}, and OpenCode~\cite{OpenCode}.
\section{Tools and Hardware}
\label{sec:ai-tools}
Table~\ref{tbl:ai-tools} lists the tools I used during development and the capacity in which I used each. The split between them was determined partly by capability and partly by availability over time.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|p{3.4in}|}
\hline
\textbf{Tool} & \textbf{Category} & \textbf{Primary use} \\
\hline
Claude~\cite{Anthropic2024Claude} & Chat model & Design discussions, architectural review, debugging assistance, and refactoring proposals. \\
\hline
Claude Code~\cite{AnthropicClaudeCode} & Terminal agent & Multi-file feature implementation against a written spec; codemod-style refactors; and test scaffolding. \\
\hline
OpenCode~\cite{OpenCode} & Terminal agent & Same class of task as Claude Code, used when I preferred an open-source workflow or a different backing model. \\
\hline
Gemini CLI~\cite{GeminiCLI} & Terminal agent & Occasional cross-check on changes produced by a different agent, and work against Google's models when I wanted a second reading of a larger diff. \\
\hline
Antigravity~\cite{GoogleAntigravity} & IDE agent & Editor-integrated agentic coding work, primarily late in the project as the tool became available. \\
\hline
Cursor~\cite{CursorEditor} & Editor & Early development; AI-native editing and indexing. \\
\hline
Zed~\cite{ZedEditor} & Editor & High-performance editing; transition phase before moving to specialized agents. \\
\hline
\end{tabular}
\caption{AI tools used during HRIStudio development.}
\label{tbl:ai-tools}
\end{table}
Beyond cloud-hosted models, I experimented with local execution using \texttt{llama.cpp} to run various open-weights models on my local hardware (Apple M4 Pro, 14-core CPU, 48GB RAM). While the hardware was capable of running 7B and 14B parameter models with high throughput, the reasoning performance of the local models frequently lagged behind the state-of-the-art frontier models. I found that the additional cognitive overhead of correcting errors in local model output outweighed the benefits of offline execution, leading me to rely primarily on the cloud-hosted agents for complex implementation tasks.
\section{Division of Responsibility}
\label{sec:ai-division}
My working rule throughout the project was for me to handle the engineering and for the agents to flesh out the implementation. In practice, this meant that I was responsible for every decision that had downstream consequences for the shape of the system, and the agents were responsible for producing code that realized those decisions. Concretely, I did the following work directly, without delegating it to an agent:
\begin{itemize}
\item \textbf{Architecture.} The three-tier structure described in Chapter~\ref{ch:design}, the separation between experiment specifications and trial records, the choice to route all robot communication through plugin files, and the overall shape of the event-driven execution model were mine. I wrote these decisions as prose before any code was written.
\item \textbf{Data model.} The PostgreSQL schema and the tRPC procedure boundaries were designed by me. Because downstream type safety depends on the shape of the schema and the API, I was unwilling to let an agent make those choices.
\item \textbf{Research design.} The pilot validation study in Chapter~\ref{ch:evaluation} was designed and analyzed entirely by me. The Observer Data Sheet, Design Fidelity Score rubric, and Execution Reliability Score rubric were written by hand. No AI tool was used to score sessions, compute results, or draft claims about what the data showed.
\item \textbf{The prose of this thesis.} Every chapter was written by me. The structure of the argument and the specific claims I make are my own. While AI assisted with the nuances of \LaTeX{} formatting (particularly the generation of TikZ diagrams and complex chart syntax), the content is mine.
\end{itemize}
\section{Evolution of the Workflow}
\label{sec:ai-pattern}
My use of these tools evolved over the course of the project as the models improved. Early on, I treated the agent's output as a draft that required line-by-line review. The typical loop followed five steps: writing a specification, generating a diff, reading the diff, running the code, and then accepting or rejecting the change.
As the models improved and the agents became more reliable, the focus of my effort shifted. By the final stages of development, I spent significantly less time on manual line-by-line reviews and more time on empirical testing. I moved from being a ``code reviewer'' to a ``test-driven supervisor.'' If the agent produced a feature that passed my manual acceptance tests and integrated correctly with the existing system, I was more likely to accept the implementation without auditing every line in the program. This shift allowed me to increase the velocity of development significantly in the weeks leading up to the evaluation.
\section{What Worked and What Did Not}
\label{sec:ai-limits}
The tasks that agents handled well were those with a narrow and well-specified interface. Implementing a tRPC procedure from a signature, writing a Drizzle migration that matched a schema diff, adding a new field through an existing form, or applying a consistent rename across files: these were cheap to specify and the agent's output was usually accepted on the first or second iteration. Agents were also good at scaffolding: producing the initial shape of a component, test file, or API route that I then edited to completion.
The tasks that agents handled poorly were those that required reasoning across more of the system than the context window could hold, or that depended on a piece of context I had not written down. Cross-cutting changes to the experiment and trial data models, for example, required careful coordination across the schema, the tRPC procedures, the execution runtime, and the analysis interface. When I tried to delegate changes of this shape to an agent, the diffs were often locally plausible but globally inconsistent; I ended up doing that work myself. Subtle concurrency and timing questions in the execution layer were another category the agents did not handle well. The event-driven execution model in Chapter~\ref{ch:design} has enough non-obvious ordering constraints that an agent without the full picture tended to introduce races; those parts of the codebase I wrote by hand.
Across the full set of tools I used, the differences in capability for the work I asked of them were smaller than I expected. Any of the agents could, in principle, produce a correct diff for a well-scoped task, and when one tool failed it was usually because the task was underspecified rather than because of a difference in model capability. The practical differences between tools mattered more at the workflow level (which shell integration I preferred, how the tool handled long diffs, and how it behaved when it needed to ask for clarification) than at the capability level.
\section{Research Integrity}
\label{sec:ai-integrity}
Because this thesis reports an empirical evaluation, I treat the boundary between AI-assisted development and the evaluation itself as a matter of research integrity rather than a matter of preference. The following statements reflect the actual workflow I followed:
\begin{itemize}
\item No AI tool generated, modified, or interpreted any of the evaluation data reported in Chapter~\ref{ch:results}. Every Design Fidelity Score, Execution Reliability Score, and System Usability Scale rating was recorded by me during or immediately after each session from direct observation, using the rubrics in Appendix~\ref{app:blank_templates}.
\item No AI tool produced the tables, means, or comparative claims in Chapter~\ref{ch:results}. The numbers were tabulated by hand from the completed Observer Data Sheets reproduced in Appendix~\ref{app:completed_materials}, and the claims about what those numbers support or do not support are mine.
\item No AI tool drafted the prose of this thesis. The chapters were written by me, in my own voice, and I am responsible for every claim they make and every argument they advance. AI tools were occasionally used as a proofreading aid to catch typos, flag awkward phrasing, or suggest an alternative word; however, the sentences are mine.
\item The code that implements HRIStudio and that was the subject of the evaluation was written under the workflow described in Sections~\ref{sec:ai-division} and~\ref{sec:ai-pattern}. Agents produced drafts; I read, tested, and accepted or rejected every one. The final state of the code is the product of my engineering decisions, regardless of who wrote any particular line.
\end{itemize}
\section{A Note on the Workflow as a Contribution}
\label{sec:ai-reflection}
The workflow described in this appendix is not a contribution of the thesis, and I do not claim that it is generalizable or optimal. I describe it because it is the actual workflow under which the system was built, and because a reader evaluating the claims in Chapter~\ref{ch:results} is entitled to know how the system being evaluated came into existence.
The more interesting observation, at least to me, is about where the boundary between human and agent naturally fell in practice. It fell at the point where a task required a decision with downstream consequences for the shape of the system. Tasks that realized a decision were inexpensive to delegate and inexpensive to verify; tasks that made a decision were neither, and delegating them produced diffs that were locally plausible and globally wrong. Whether that boundary will move as tools improve is a question I cannot answer from the evidence of a single project, but the boundary was stable across every tool I used during this one.
+8 -8
View File
@@ -84,12 +84,12 @@ The Next.js application server and the Bun WebSocket server run outside Docker o
The NAO6 integration stack is defined in a separate repository and provides three ROS~2 services that collectively bridge HRIStudio to the physical robot. The NAO6 integration stack is defined in a separate repository and provides three ROS~2 services that collectively bridge HRIStudio to the physical robot.
\begin{enumerate} \begin{enumerate}
\item The \textbf{nao\_driver} service runs the NaoQi driver ROS~2 node, which connects to the NAO's proprietary framework over the local network and publishes sensor data (joint states, camera feeds) as standard ROS~2 topics. \item The \textbf{nao\_driver} service runs the NAOqi driver ROS~2 node, which connects to the NAO's proprietary framework over the local network and publishes sensor data (joint states, camera feeds) as standard ROS~2 topics.
\item The \textbf{ros\_bridge} service runs the rosbridge WebSocket server, which exposes all ROS~2 topics over a WebSocket interface on a configurable port (default~9090). This is the endpoint that the HRIStudio server connects to. \item The \textbf{ros\_bridge} service runs the \texttt{rosbridge} WebSocket server, which exposes all ROS~2 topics over a WebSocket interface on a configurable port (default~9090). This is the endpoint that the HRIStudio server connects to.
\item The \textbf{ros\_api} service provides runtime introspection of available ROS~2 topics, services, and parameters. \item The \textbf{ros\_api} service provides runtime introspection of available ROS~2 topics, services, and parameters.
\end{enumerate} \end{enumerate}
All three services are built from a single Dockerfile based on the ROS~2 Humble base image (Ubuntu~22.04). The image installs the NaoQi driver and rosbridge server packages along with their dependencies (NaoQi libraries, bridge message types, OpenCV bridge, and TF2) and builds them with colcon. All services use host networking so that ROS~2 discovery and the NaoQi connection operate without port forwarding. All three services are built from a single Dockerfile based on the ROS~2 Humble base image (Ubuntu~22.04). The image installs the NAOqi driver and \texttt{rosbridge} server packages along with their dependencies (NAOqi libraries, bridge message types, OpenCV bridge, and TF2) and builds them with colcon. All services use host networking so that ROS~2 discovery and the NAOqi connection operate without port forwarding.
Before starting the driver, an initialization script connects to the NAO via SSH and prepares it for external control: Before starting the driver, an initialization script connects to the NAO via SSH and prepares it for external control:
@@ -103,7 +103,7 @@ Environment variables for the robot IP address, credentials, and bridge port are
\subsection{Communication Between Stacks} \subsection{Communication Between Stacks}
Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker stacks and the components that run on the host. The HRIStudio server communicates with the robot integration stack over a single WebSocket connection to the \texttt{rosbridge\_websocket} endpoint. For actions that bypass ROS entirely (posture changes, animation playback), the server connects directly to the NAO via SSH and invokes NaoQi commands through the \texttt{qicli} command-line tool. Both communication paths are configured per-robot in the plugin file. Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker stacks and the components that run on the host. The HRIStudio server communicates with the robot integration stack over a single WebSocket connection to the \texttt{rosbridge\_websocket} endpoint. For actions that bypass ROS entirely (posture changes, animation playback), the server connects directly to the NAO via SSH and invokes NAOqi commands through the \texttt{qicli} command-line tool. Both communication paths are configured per-robot in the plugin file.
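To make the first path concrete, the following is a minimal sketch of publishing to a ROS~2 topic through a rosbridge WebSocket endpoint using the roslib client library. It is illustrative only: the host, port, and topic are placeholders, and the HRIStudio server's actual client code may differ.

import ROSLIB from "roslib";

// Connect to the rosbridge WebSocket endpoint (host and port are placeholders).
const ros = new ROSLIB.Ros({ url: "ws://robot-host:9090" });

ros.on("connection", () => {
  // Bind to the velocity command topic.
  const cmdVel = new ROSLIB.Topic({
    ros,
    name: "/cmd_vel",
    messageType: "geometry_msgs/msg/Twist",
  });

  // A "walk forward" command: forward linear velocity, no rotation.
  cmdVel.publish(
    new ROSLIB.Message({
      linear: { x: 0.5, y: 0.0, z: 0.0 },
      angular: { x: 0.0, y: 0.0, z: 0.0 },
    })
  );
});

ros.on("error", (err) => console.error("rosbridge connection error:", err));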
\begin{figure}[htbp] \begin{figure}[htbp]
\centering \centering
@@ -159,7 +159,7 @@ Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker s
%% ---- NAO Robot ---- %% ---- NAO Robot ----
\node[box, fill=gray!40, minimum width=2.8cm] (nao) at (0, -0.8) \node[box, fill=gray!40, minimum width=2.8cm] (nao) at (0, -0.8)
{NAO6 Robot\\[-1pt]{\scriptsize NaoQi}}; {NAO6 Robot\\[-1pt]{\scriptsize NAOqi}};
%% ---- Arrows: browser to host ---- %% ---- Arrows: browser to host ----
\draw[arrow] (browser.south west) -- node[lbl, left] {HTTP} (nextjs.north); \draw[arrow] (browser.south west) -- node[lbl, left] {HTTP} (nextjs.north);
@@ -231,13 +231,13 @@ Each action definition specifies:
\item A ROS~2 dispatch block containing the target topic, message type, and a payload mapping. \item A ROS~2 dispatch block containing the target topic, message type, and a payload mapping.
\end{itemize} \end{itemize}
The payload mapping supports two modes. In \emph{static} mode, the plugin defines a fixed message template with placeholder tokens (e.g., \texttt{\{\{text\}\}}) that the execution engine fills from the researcher's parameters. In \emph{SSH} mode, the action bypasses ROS entirely and executes a shell command on the robot via SSH; this is used for NaoQi-native operations such as posture changes and animation playback that are not exposed as ROS~2 topics. The payload mapping supports two modes. In \emph{static} mode, the plugin defines a fixed message template with placeholder tokens (e.g., \texttt{\{\{text\}\}}) that the execution engine fills from the researcher's parameters. In \emph{SSH} mode, the action bypasses ROS entirely and executes a shell command on the robot via SSH; this is used for NAOqi-native operations such as posture changes and animation playback that are not exposed as ROS~2 topics.
The NAO6 plugin defines 20 actions across three categories: speech (say text, say with emotion), movement (walk forward/backward, turn, stop, wake up, rest, stand, sit, crouch), and animation (bow, wave, nod, head shake, shrug, enthusiastic gesture, and others). Movement actions publish ROS~2 Twist messages to the velocity command topic. Animation actions publish animation path strings to the animation topic. Posture and lifecycle commands use SSH mode to call NaoQi services directly via the \texttt{qicli} command-line tool. The NAO6 plugin defines 20 actions across three categories: speech (say text, say with emotion), movement (walk forward/backward, turn, stop, wake up, rest, stand, sit, crouch), and animation (bow, wave, nod, head shake, shrug, enthusiastic gesture, and others). Movement actions publish ROS~2 Twist messages to the velocity command topic. Animation actions publish animation path strings to the animation topic. Posture and lifecycle commands use SSH mode to call NAOqi services directly via the \texttt{qicli} command-line tool.
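To make the static dispatch mode concrete, the sketch below shows the approximate shape of one action definition. The real plugins are plain JSON files; it is written here as a TypeScript literal so it can carry comments, and every field name is illustrative rather than the actual plugin schema.

// Hypothetical static-mode action definition; field names are invented
// for illustration and do not reproduce the actual plugin schema.
const sayText = {
  id: "say_text",
  name: "Say Text",
  category: "speech",
  // Parameter schema presented to the researcher in the designer.
  parameters: [{ id: "text", type: "string", required: true }],
  ros2: {
    mode: "static",                 // fixed template filled from parameters
    topic: "/speech",               // target ROS 2 topic (placeholder)
    messageType: "std_msgs/msg/String",
    payload: { data: "{{text}}" },  // token replaced by the execution engine
  },
};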
\subsection{Adding a New Robot} \subsection{Adding a New Robot}
Adding support for a new robot platform requires writing a single JSON plugin file and placing it in the repository. No changes to the HRIStudio server code are required. The plugin author defines the robot's capabilities, maps each action to a ROS~2 topic or SSH command, and specifies the parameter schema for each action. After the repository is synced, the new robot's actions appear in the experiment designer and can be used in any study. Adding support for a new robot platform requires writing a single JSON plugin file and placing it in the plugin repository. No changes to the HRIStudio server code are required. The plugin author defines the robot's capabilities, maps each action to a ROS~2 topic or SSH command, and specifies the parameter schema for each action. After the repository is synced, the new robot's actions appear in the experiment designer and can be used in any study.
\section{Database Schema} \section{Database Schema}
Binary file not shown. (After: 153 KiB)
Binary file not shown. (After: 165 KiB)
Binary files not shown. (36 additional binary files changed)
+88
View File
@@ -228,3 +228,91 @@ doi = {10.1201/9781498710411-35}
year = {2021}, year = {2021},
doi = {10.1145/3412374} doi = {10.1145/3412374}
} }
@misc{HRIStudioRepo,
author = {O'Connor, Sean},
title = {{HRIStudio: A Web-Based Wizard-of-Oz Platform for Human-Robot Interaction Research}},
howpublished = {GitHub repository},
year = {2026},
url = {https://github.com/soconnor0919/hristudio}
}
@misc{RobotPluginsRepo,
author = {O'Connor, Sean},
title = {{HRIStudio Robot Plugins Repository}},
howpublished = {GitHub repository, maintained as a submodule of HRIStudio},
year = {2026},
url = {https://github.com/soconnor0919/robot-plugins}
}
@misc{NaoWorkspaceRepo,
author = {O'Connor, Sean},
title = {{nao-workspace: A Containerized Choregraphe Development Environment}},
howpublished = {GitHub repository},
year = {2026},
url = {https://github.com/soconnor0919/nao-workspace}
}
@misc{NaoIntegrationRepo,
author = {O'Connor, Sean},
title = {{nao6-hristudio-integration: ROS/NAOqi Bridge for HRIStudio}},
howpublished = {GitHub repository},
year = {2026},
url = {https://github.com/soconnor0919/nao6-hristudio-integration}
}
@misc{Anthropic2024Claude,
author = {{Anthropic}},
title = {{Claude 3.5 Sonnet}},
howpublished = {Large Language Model},
year = {2024},
url = {https://www.anthropic.com/claude}
}
@misc{AnthropicClaudeCode,
author = {{Anthropic}},
title = {{Claude Code}},
howpublished = {Agentic CLI Developer Tool},
year = {2025},
url = {https://www.anthropic.com/claude-code}
}
@misc{OpenCode,
author = {{SST}},
title = {{OpenCode}},
howpublished = {Open-source AI Coding Agent},
year = {2024},
url = {https://opencode.ai}
}
@misc{GeminiCLI,
author = {{Google}},
title = {{Gemini CLI}},
howpublished = {Agentic CLI Developer Tool},
year = {2025},
url = {https://github.com/google-gemini/gemini-cli}
}
@misc{GoogleAntigravity,
author = {{Google}},
title = {{Antigravity}},
howpublished = {Integrated Agentic Development Environment},
year = {2025},
url = {https://antigravity.google}
}
@misc{CursorEditor,
author = {{Anysphere}},
title = {{Cursor Code Editor}},
howpublished = {AI-Native Code Editor},
year = {2023},
url = {https://cursor.com}
}
@misc{ZedEditor,
author = {{Zed Industries}},
title = {{Zed Code Editor}},
howpublished = {High-performance Code Editor with AI Integration},
year = {2023},
url = {https://zed.dev}
}
+10 -7
View File
@@ -4,12 +4,13 @@
%\usepackage{graphics} %Select graphics package %\usepackage{graphics} %Select graphics package
\usepackage{graphicx} % \usepackage{graphicx} %
\usepackage{pdfpages} %For including PDF pages in appendices \usepackage{pdfpages} %For including PDF pages in appendices
\usepackage{subcaption} %For sub-figures with captions
%\usepackage{amsthm} %Add other packages as necessary %\usepackage{amsthm} %Add other packages as necessary
\usepackage{array} %Extended column types and \arraybackslash \usepackage{array} %Extended column types and \arraybackslash
\usepackage{makecell} %Multi-line table header cells \usepackage{makecell} %Multi-line table header cells
\usepackage{tabularx} %Auto-width table columns \usepackage{tabularx} %Auto-width table columns
\usepackage{tikz} %For programmatic diagrams \usepackage{tikz} %For programmatic diagrams
\usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing} \usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing,calc}
\usepackage[ \usepackage[
hidelinks, hidelinks,
linktoc=all, linktoc=all,
@@ -20,12 +21,13 @@
\butitle{A Web-Based Wizard-of-Oz Platform for Collaborative and Reproducible Human-Robot Interaction Research} \butitle{A Web-Based Wizard-of-Oz Platform for Collaborative and Reproducible Human-Robot Interaction Research}
\author{Sean O'Connor} \author{Sean O'Connor}
\degree{Bachelor of Science} \degree{Bachelor of Science}
\department{Computer Science} \department{Computer Science and Engineering}
\advisor{L. Felipe Perrone} \advisor{L. Felipe Perrone}
% \advisorb{Brian King} \advisorb{Brian King}
\honorscouncilrep{Abigail Kopec}
\chair{Alan Marchiori} \chair{Alan Marchiori}
\maketitle % \maketitle
\includepdf[pages=-,pagecommand={}]{pdfs/CoverPage-Signed.pdf}
\frontmatter \frontmatter
\acknowledgments{ \acknowledgments{
@@ -52,9 +54,9 @@
\abstract{ \abstract{
\begin{spacing}{1.3} \begin{spacing}{1.3}
{\setlength{\parskip}{0.1in} {\setlength{\parskip}{0.1in}
The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction research, but two persistent problems limit its effectiveness: existing tools impose technical barriers that exclude non-engineering domain experts (the Accessibility Problem), and the fragmented landscape of robot-specific implementations makes interaction scripts difficult to port across platforms (the Reproducibility Problem --- concerning execution consistency and portability, not third-party replication). Through a literature review, I identified three design principles to address both: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. I realized these principles in HRIStudio, an open-source, web-based platform providing a visual experiment designer, a guided wizard execution interface, automated timestamped logging with deviation tracking, and role-based access control. The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction (HRI) research, but two persistent problems limit its effectiveness: existing tools impose technical barriers that exclude non-engineering domain experts (the Accessibility Problem), and the fragmented landscape of robot-specific implementations makes interaction scripts difficult to port across platforms (the Reproducibility Problem --- concerning execution consistency and portability, not third-party replication). Through a literature review, I identified three design principles to address both: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. I realized these principles in HRIStudio, an open-source, web-based platform providing a visual experiment designer, a guided wizard execution interface, automated timestamped logging with deviation tracking, and role-based access control.
I evaluated HRIStudio in a pilot between-subjects study (N=6) against Choregraphe, the standard NAO programming tool. HRIStudio wizards achieved higher design fidelity, execution reliability, and perceived usability across all six sessions; the only unprompted specification deviation in the dataset occurred in the Choregraphe condition. While the pilot scale precludes inferential claims, the directional evidence across all measures supports the position that a tool built to realize the identified design principles can have significant impact on accessibility and reproducibility in WoZ-based HRI research. I evaluated HRIStudio in a pilot between-subjects study (N=6) against Choregraphe, the standard programming tool for the NAO robot. HRIStudio wizards achieved higher design fidelity, execution reliability, and perceived usability across all six sessions; the only unprompted specification deviation in the dataset occurred in the Choregraphe condition. While the pilot scale precludes inferential claims, the directional evidence across all measures supports the position that a tool built to realize the identified design principles can have significant impact on accessibility and reproducibility in WoZ-based HRI research.
} }
\end{spacing} \end{spacing}
} }
@@ -94,5 +96,6 @@
\include{chapters/app_blank_templates} \include{chapters/app_blank_templates}
\include{chapters/app_materials} \include{chapters/app_materials}
\include{chapters/app_tech_docs} \include{chapters/app_tech_docs}
\include{chapters/app_ai_development}
\end{document} \end{document}