diff --git a/thesis/chapters/01_introduction.tex b/thesis/chapters/01_introduction.tex index ea9d457..316b95e 100644 --- a/thesis/chapters/01_introduction.tex +++ b/thesis/chapters/01_introduction.tex @@ -19,7 +19,7 @@ To address the accessibility and reproducibility problems in WoZ-based HRI resea This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for the HRI research community. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility. -The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a protocol/trial separation with explicit deviation logging. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is a complete realization of this architecture: an open-source, web-based platform that serves as both the primary artifact of this thesis and the instrument for empirical validation. +The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is a complete realization of this architecture: an open-source, web-based platform that serves as both the primary artifact of this thesis and the instrument for empirical validation. 
\section{Research Objectives} diff --git a/thesis/chapters/03_reproducibility.tex b/thesis/chapters/03_reproducibility.tex index 52a545c..e4860e8 100644 --- a/thesis/chapters/03_reproducibility.tex +++ b/thesis/chapters/03_reproducibility.tex @@ -31,7 +31,7 @@ Based on this analysis, I identify specific ways that software infrastructure ca \section{Connecting Reproducibility Challenges to Infrastructure Requirements} -The reproducibility challenges identified above directly motivate the infrastructure requirements (R1--R6) established in Chapter~\ref{ch:background}. Inconsistent wizard behavior creates the need for enforced execution protocols (R1) that guide wizards step by step, and for automatic logging (R4) that captures any deviations that occur. Timing errors specifically motivate responsive, fine-grained real-time control (R3): a wizard working with a sluggish interface introduces latency that disrupts the interaction and confounds timing analysis. Technical fragmentation forces each lab to rebuild infrastructure as hardware changes, violating platform agnosticism (R5). Incomplete documentation reflects the need for self-documenting, code-free protocol specifications (R1, R2) that are simultaneously executable and shareable. Finally, the isolation of individual research groups motivates collaborative support (R6): allowing multiple team members to observe and review trials enables the shared scrutiny that reproducibility requires. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts. +The reproducibility challenges identified above directly motivate the infrastructure requirements (R1--R6) established in Chapter~\ref{ch:background}. 
Inconsistent wizard behavior creates the need for real-time control mechanisms (R3) that guide wizards step by step, and for automatic logging (R4) that captures any deviations that occur. Timing errors further motivate responsive, fine-grained real-time control (R3): a wizard working with a sluggish interface introduces latency that disrupts the interaction and confounds timing analysis. Technical fragmentation forces each lab to rebuild infrastructure as hardware changes, violating platform agnosticism (R5). Incomplete documentation reflects the need for self-documenting, code-free protocol specifications (R2) that are simultaneously executable and shareable, integrated into a single workflow (R1) so that the specification and the execution environment are never separated. Finally, the isolation of individual research groups motivates collaborative support (R6): allowing multiple team members to observe and review trials enables the shared scrutiny that reproducibility requires. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts. \section{Chapter Summary} diff --git a/thesis/chapters/05_implementation.tex b/thesis/chapters/05_implementation.tex index d3f7a17..7ae77c0 100644 --- a/thesis/chapters/05_implementation.tex +++ b/thesis/chapters/05_implementation.tex @@ -99,11 +99,11 @@ No two human subjects respond identically to an experimental protocol. One subje \section{Robot Integration} -A configuration file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. 
The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the configuration file. +A plugin file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the plugin file. The execution engine treats control-flow elements such as branches and conditionals, which function like constructs in a conventional program, the same way as robot actions. These control-flow elements appear as action groups in the experiment and are evaluated during the trial, so researchers can freely mix logical decisions and physical robot behaviors when designing an experiment without any special handling. -Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and TurtleBot as an example. Actions a platform does not support (such as \texttt{raise\_arm} on TurtleBot) appear as explicitly unsupported in the configuration file rather than silently failing. Because all hardware-specific logic lives in the configuration file, the experiment itself does not change between platforms. +Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and TurtleBot as an example. Actions a platform does not support (such as \texttt{raise\_arm} on TurtleBot) appear as explicitly unsupported in the plugin file rather than silently failing. 
Because all hardware-specific logic lives in the plugin file, the experiment itself does not change between platforms. \begin{figure}[htbp] \centering @@ -146,13 +146,13 @@ Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and Tur \draw[arrow] (cfg.east) -- (tb.west); \end{tikzpicture} -\caption{Abstract experiment actions translated to platform-specific robot commands through per-platform configuration files.} +\caption{Abstract experiment actions translated to platform-specific robot commands through per-platform plugin files.} \label{fig:plugin-architecture} \end{figure} \section{Access Control} -I implemented access control using a role-based access control (RBAC) model. Each study has a membership list, and each member is assigned one of four roles that define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires. +I implemented a role-based access control (RBAC) model with two layers. System-level roles govern what a user can do across the platform (administrator, researcher, wizard, observer), while study-level roles govern what a user can see and do within a specific study (owner, researcher, wizard, observer). The two layers are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. Within a study, the four study-level roles define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires. \begin{description} \item[Owner.] Full control over the study: can invite or remove members, configure the study settings, and access all data. 
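The two-layer role check described above can be sketched as follows. The role names come from the text; the function name, capability table, and data structures are illustrative assumptions, not HRIStudio's actual API.

```python
# Illustrative sketch of the two-layer RBAC check described in the text.
# Role names mirror the thesis; everything else is hypothetical.
SYSTEM_ROLES = {"administrator", "researcher", "wizard", "observer"}

# Hypothetical capability table: what each study-level role may do.
STUDY_CAPABILITIES = {
    "owner": {"invite", "configure", "design", "run", "view"},
    "researcher": {"design", "view"},
    "wizard": {"run", "view"},
    "observer": {"view"},
}

def can_perform(user, study_id, capability):
    """Both layers are checked independently: the user must hold a valid
    system-level role AND a study-level role granting the capability."""
    if user["system_role"] not in SYSTEM_ROLES:
        return False
    study_role = user["study_roles"].get(study_id)  # per-study membership
    if study_role is None:
        return False  # not a member of this study at all
    return capability in STUDY_CAPABILITIES[study_role]

# A user can be a wizard on one study and an observer on another,
# with no extra configuration beyond the per-study membership entries.
alice = {"system_role": "researcher",
         "study_roles": {"study-A": "wizard", "study-B": "observer"}}
```

The per-study lookup is what enforces need-to-know access: membership in one study grants nothing in any other.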
@@ -177,8 +177,8 @@ The following two problems required specific solutions during implementation. HRIStudio is fully operational for controlled Wizard-of-Oz studies. The Design, Execution, and Analysis interfaces are complete and integrated. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. A researcher can design an experiment, run a live trial with a wizard, and review the resulting logs and recordings without modification to the platform's core architecture or execution workflow. -Work remaining for future development includes broader validation of the configuration file approach on robot platforms beyond NAO6. +Remaining work includes broader validation of the plugin-file approach on robot platforms beyond NAO6. \section{Chapter Summary} -This chapter described how HRIStudio realizes the design principles from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. Per-platform configuration files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is fully operational for controlled WoZ studies today, demonstrated through the pilot validation study presented in Chapter~\ref{ch:evaluation}. The design principles are general; HRIStudio shows they are workable. +This chapter described how HRIStudio realizes the design principles from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. 
Per-platform plugin files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is fully operational for controlled WoZ studies today, demonstrated through the pilot validation study presented in Chapter~\ref{ch:evaluation}. The design principles are general; HRIStudio shows they are workable. diff --git a/thesis/chapters/06_evaluation.tex b/thesis/chapters/06_evaluation.tex index 022420f..4156ba1 100644 --- a/thesis/chapters/06_evaluation.tex +++ b/thesis/chapters/06_evaluation.tex @@ -5,7 +5,7 @@ This chapter presents the pilot validation study used to evaluate whether HRIStu \section{Research Questions} -The evaluation targets the two problems established in Chapter~\ref{ch:background}. The first is the \emph{Accessibility Problem}: existing tools require substantial programming expertise, which prevents domain experts from conducting independent HRI studies. The second is the \emph{Reproducibility Problem}: without structured logging and protocol enforcement, experiment execution varies across participants and wizards in ways that are difficult to detect or control after the fact. +The validation study targets the two problems established in Chapter~\ref{ch:background}. The first is the \emph{Accessibility Problem}: existing tools require substantial programming expertise, which prevents domain experts from conducting independent HRI studies. The second is the \emph{Reproducibility Problem}: without structured logging and protocol enforcement, experiment execution varies across participants and wizards in ways that are difficult to detect or control after the fact. These problems give rise to two research questions. The first asks whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second asks whether HRIStudio produces more reliable execution of that interaction compared to standard practice. 
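The plugin mechanism referenced above (abstract experiment actions resolved to platform-specific commands, with unsupported actions made explicit rather than silently failing) can be sketched as follows. The schema, function names, and command strings are illustrative assumptions, not HRIStudio's actual plugin format; "raise_arm" is the unsupported-action example from Figure~\ref{fig:plugin-architecture}.

```python
# Illustrative plugin definitions: each maps abstract experiment actions
# to platform-specific commands. The schema and command strings are
# hypothetical, not HRIStudio's real plugin format.
NAO6_PLUGIN = {
    "platform": "NAO6",
    "actions": {
        "say": {"command": "ALTextToSpeech.say"},
        "raise_arm": {"command": "ALMotion.angleInterpolation"},
    },
    "unsupported": set(),
}

TURTLEBOT_PLUGIN = {
    "platform": "TurtleBot",
    "actions": {
        "move_forward": {"command": "cmd_vel"},
    },
    # Listed explicitly so the action fails loudly at dispatch time.
    "unsupported": {"raise_arm"},
}

def dispatch(plugin, action_type):
    """Resolve an abstract action to the platform command that would be
    sent over the bridge; unsupported actions raise instead of failing
    silently, and the experiment itself never changes between platforms."""
    if action_type in plugin["unsupported"]:
        raise ValueError(
            f"{plugin['platform']} does not support '{action_type}'")
    entry = plugin["actions"].get(action_type)
    if entry is None:
        raise KeyError(f"unknown action '{action_type}'")
    return entry["command"]
```

Because the experiment references only the abstract action names, swapping the plugin swaps the hardware without touching the protocol.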
@@ -44,7 +44,7 @@ Both conditions used the same NAO humanoid robot (Figure~\ref{fig:nao6-photo}), The control condition used Choregraphe \cite{Pot2009}, a proprietary visual programming tool developed by Aldebaran Robotics and the standard software for NAO programming. Choregraphe organizes behavior as a finite state machine: nodes represent states and edges represent transitions triggered by conditions or timers. -The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through configuration files, though for this study both tools controlled the same NAO platform. +The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through plugin files, though for this study both tools controlled the same NAO platform. \section{Procedure} @@ -69,7 +69,7 @@ Following the trial, the wizard completed the System Usability Scale survey. The \section{Measures} \label{sec:measures} -The study collected four measures, two primary and two supplementary. +The study collected five measures, two primary and three supplementary, operationalized through five instruments. \subsection{Design Fidelity Score} @@ -106,7 +106,7 @@ Only T-type interventions affect rubric scoring; the others are recorded to prov \section{Measurement Instruments} -The four primary and supplementary measures are designed to work together. The DFS and ERS address separate phases of the session: DFS captures what was designed, and ERS captures whether that design translated faithfully into execution. 
Taken together, they make it possible to distinguish a wizard who implemented the specification correctly but whose design failed during the trial from one whose design was incomplete but executed without researcher assistance. The SUS grounds both scores in the wizard's subjective experience of the tool. The intervention log and session timing are supplementary: they do not directly answer the research questions but provide context for interpreting the primary scores, particularly for understanding whether help requests concerned the tool itself or the task. +The five measures are designed to work together. The DFS and ERS address separate phases of the session: DFS captures what was designed, and ERS captures whether that design translated faithfully into execution. Taken together, they make it possible to distinguish a wizard who implemented the specification correctly but whose design failed during the trial from one whose design was incomplete but executed without researcher assistance. The SUS grounds both scores in the wizard's subjective experience of the tool. The intervention log and session timing are supplementary: they do not directly answer the research questions but provide context for interpreting the primary scores, particularly for understanding whether help requests concerned the tool itself or the task. Table~\ref{tbl:measurement_instruments} summarizes the five instruments, when they were collected, and which research question each addresses. diff --git a/thesis/chapters/07_results.tex b/thesis/chapters/07_results.tex index 7ef7dbc..453f1b7 100644 --- a/thesis/chapters/07_results.tex +++ b/thesis/chapters/07_results.tex @@ -5,9 +5,9 @@ This chapter presents the results of the pilot validation study described in Cha \section{Participant Overview} -Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. 
All six participants were Bucknell University faculty members recruited from departments outside Computer Science. Demographic information (programming background) was collected during recruitment. +Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University faculty members drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment. + -% TODO: Fill in W-06 row once session is complete. \begin{table}[htbp] \centering \footnotesize @@ -17,7 +17,7 @@ Table~\ref{tbl:sessions} summarizes the participants and their assigned conditio \hline W-01 & Choregraphe & Digital Humanities & None & 42.5 & 65 & 60 \\ \hline -W-02 & HRIStudio & Computer Science & Moderate & 100 & 95 & 90 \\ +W-02 & HRIStudio & Logic and Philosophy of Science & Moderate & 100 & 95 & 90 \\ \hline W-03 & Choregraphe & Computer Science & Extensive & 65 & 60 & 75 \\ \hline @@ -25,7 +25,7 @@ W-04 & Choregraphe & Chemical Engineering & Moderate & 62.5 & 75 & 42.5 \\ \hline W-05 & HRIStudio & Chemical Engineering & None & 100 & 95 & 70 \\ \hline -W-06 & HRIStudio & Computer Science & Extensive & --- & --- & --- \\ +W-06 & HRIStudio & Computer Science & Extensive & 100 & 100 & 70 \\ \hline \end{tabular} \caption{Summary of wizard participants, assigned conditions, and scores.} @@ -38,22 +38,23 @@ W-06 & HRIStudio & Computer Science & Extensive & --- & --- & --- \\ The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification. Scores range from 0 to 100, with full points awarded only when a component is both present and correct. -W-01 (Choregraphe) received a DFS of 42.5. 
Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time. +W-01 (Choregraphe, Digital Humanities, no programming experience) received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. 
Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time. -W-02 (HRIStudio, programmer) received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation. +W-02 (HRIStudio, Logic and Philosophy of Science, moderate programming) received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation. -W-03 (Choregraphe, programmer) received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. 
One C-type clarification was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, with not all delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation. +W-03 (Choregraphe, Computer Science, extensive programming) received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, though not all were delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation. 
W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15). +W-04 (Choregraphe, Chemical Engineering, moderate programming experience) received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15). W-05 (HRIStudio, Chemical Engineering, no programming experience) received a DFS of 100. 
The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification. -% TODO: Add DFS scores and per-item breakdown for W-06 when complete. -% TODO: Add condition means once W-06 is complete. +W-06 (HRIStudio, Computer Science, extensive programming) received a DFS of 100. Two T-type interventions were logged during the design phase, both pertaining to item 6 (narrative gesture): at 15:21, W-06 attempted to use parallel execution for a gesture action and was unable to edit the action node; at 15:24, W-06 encountered difficulty resetting the robot's posture and was directed to recommended posture blocks. In both cases, W-06 resolved the issue independently after the initial prompt. W-06's programming background led to a more elaborate design than the specification required, including extra posture-reset actions that were ultimately redundant since the robot was already in the correct starting position; these additions did not affect scoring since all required actions were present and correct in the exported project file. 
The conditional branch was wired correctly, and all speech and gesture items matched the specification. W-06 completed the design in 21 minutes, within the 30-minute allocation. + +Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100). Across the three Choregraphe sessions, DFS scores were 42.5, 65, and 62.5 (mean 56.7). \subsection{Execution Reliability Score} -The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore run the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred. +The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. 
Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred. W-02 (HRIStudio) received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. This incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item scored 5 out of 10 on its own merits: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred. @@ -63,14 +64,15 @@ W-04 (Choregraphe) received an ERS of 75. The trial ran for approximately four m W-05 (HRIStudio) received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. 
One trial intervention pair was logged: the wizard briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute on the NAO6 robot side; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed. -% TODO: Add ERS score and breakdown for W-06 when complete. -% TODO: Once all sessions complete, report condition ERS means and note patterns in execution failures across conditions. +W-06 (HRIStudio) received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement. + +Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7). In the HRIStudio condition, branching was present in every design and executed in every trial; no trial required tool-operation guidance from the researcher to complete. In the Choregraphe condition, branching was absent or non-functional in two of three designs (W-03, W-04) and was resolved by manual redesign during the trial in the third (W-01).
A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase. -W-02 rated HRIStudio with a SUS score of 90, well above the average benchmark of 68 and the highest score observed so far. W-02, a programmer with a combined CS and psychology background, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions. +W-02 rated HRIStudio with a SUS score of 90, well above the average benchmark of 68 and the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions. W-03 rated Choregraphe with a SUS score of 75, above the average benchmark of 68. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating. @@ -78,7 +80,9 @@ W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study a W-05 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution. -% TODO: Add SUS score for W-06 when complete. Then report condition means. 
+W-06 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds. + +HRIStudio condition SUS scores were 90, 70, and 70 (mean 76.7), all above the average benchmark of 68. Choregraphe condition SUS scores were 60, 75, and 42.5 (mean 59.2); only W-03's 75 exceeded the benchmark, and the condition mean fell well below it. \section{Supplementary Measures} @@ -86,7 +90,7 @@ W-05 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. Table~\ref{tbl:timing} summarizes the time spent in each phase per session. -% TODO: Fill in W-06 timing row once session is complete. + \begin{table}[htbp] \centering \footnotesize @@ -104,7 +108,7 @@ W-04 & 17 min & 35 min & 4 min & 4 min & 60 min \\ \hline W-05 & 6 min & 18 min & 4 min & 4 min & 32 min \\ \hline -W-06 & --- & --- & --- & --- & --- \\ +W-06 & 8 min & 21 min & 3 min & 5 min & 37 min \\ \hline \end{tabular} \caption{Time spent in each session phase per wizard participant.}
The overrun reflects not conventional interface friction but the time spent constructing and then revising an over-engineered design; beginning sessions from W-02 onward enforce the 30-minute transition, so W-03's overrun constitutes a procedural exception noted in the observer log. +W-03's design phase extended to 37 minutes, the longest design phase in the study, despite W-03's programming background. The overrun reflects not conventional interface friction but the time spent constructing and then revising an over-engineered design; sessions from W-02 onward enforced the 30-minute transition, so W-03's overrun constitutes a procedural exception noted in the observer log. W-04's design phase ran 35 minutes without completion, the only session in which the wizard did not finish before the cutoff. Training took 17 minutes, the longest training phase in the study; W-04 entered the design phase with questions about concurrent block execution that presaged later difficulties with branching. W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering, both with no robotics background, yet the difference in tool condition produced a 17-minute gap in design completion time and a qualitatively different session experience. -Across the five completed sessions, Choregraphe design phases averaged approximately 35.7 minutes. W-01 and W-03 both exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across two completed sessions, both well within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 6.5 minutes.
+W-06's training phase concluded in 8 minutes and the design phase completed in 21 minutes, both within their allocations. The overall session lasted 37 minutes. The trial ran for approximately three minutes, the shortest trial phase in the study, reflecting a clean execution without errors or researcher interventions. -% TODO: Update condition means once W-06 is complete. +Across all six sessions, Choregraphe design phases averaged approximately 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs before the session time limit, while W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across three sessions, all within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 7 minutes. \subsection{Intervention Log} @@ -137,7 +141,7 @@ W-04 generated the highest T-type count in the Choregraphe condition: five total W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two trial interventions (1 T-type, 1 G-type). The design-phase T marks concerned interface orientation (right-pane editing, branch block configuration); the C-type clarifications concerned conceptual mappings between the written specification and HRIStudio's structural model. Importantly, none of the clarifications blocked design completion, and the final DFS was unaffected. The C-type pattern for W-05 reflects a different kind of engagement from Choregraphe's T-type pattern: questions about what the tool means rather than how to operate it. -% TODO: Compile a summary intervention table once W-06 is complete. +W-06 generated two T-type interventions during the design phase, both pertaining to item 6 (narrative gesture): one for an attempted use of parallel action execution, and one for difficulty resetting the robot's posture, for which specific recommended blocks were suggested. 
W-06 resolved both issues independently after the initial prompts. No interventions of any type were logged during the trial phase, making W-06 the only wizard in the study to complete the trial with zero interventions. \section{Qualitative Findings} @@ -145,13 +149,13 @@ W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two tria A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when it surfaced during execution. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data. -No specification deviations from the written protocol were observed in W-02, W-04, or W-05. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe condition. +No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. 
W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe condition. \subsection{Wizard Experience} W-01 expressed that the training was comprehensible and that the underlying logic of the task was clear. The primary source of frustration was Choregraphe's interface for handling conditional branches and managing the timing of parallel behaviors. Post-session comments suggested that the wizard would not use Choregraphe independently for future HRI work without technical support. -W-02 engaged with HRIStudio's timeline-based interface without requiring tool-operation guidance. The session proceeded efficiently, and W-02's combined CS and psychology background appeared to support both the technical implementation and the contextual understanding of the interaction scenario. No notable sources of friction were observed during design or trial phases. +W-02 engaged with HRIStudio's timeline-based interface without requiring tool-operation guidance. The session proceeded efficiently, and W-02's Logic and Philosophy of Science background, combined with moderate programming experience, appeared to support both the technical implementation and the contextual understanding of the interaction scenario. No notable sources of friction were observed during design or trial phases. W-03 approached the task as a programming challenge, applying Choregraphe's full feature set beyond what the specification required. 
When the WoZ framing was clarified (specifically that branching should reflect wizard decisions rather than onboard robot logic), W-03 revised the design but the over-engineered structure introduced earlier persisted in the final export and was reflected in the DFS score. W-03 described Choregraphe as a powerful block programming environment, but noted that applying it to this task was harder than its general capability implied, a characterization consistent with the tool-task mismatch the study is designed to surface. @@ -159,9 +163,8 @@ W-04 approached the session with clear engagement and self-driven exploration: i W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe condition. -% TODO: Add qualitative observations for W-06 when complete. +W-06 approached the design with a programmer's instinct for thoroughness, initially exploring parallel execution structures for gesture actions and adding posture-reset steps beyond what the specification called for. The two T-type design-phase interventions reflected this exploratory behavior rather than confusion about the task. The extra posture-reset actions in the final design were redundant in practice since the robot was already in the correct starting position, but they did not interfere with the required items and the design achieved a perfect DFS. 
W-06's trial ran entirely without researcher intervention, producing the only perfect ERS in the study. The session illustrates a different accessibility profile from W-05: where W-05 encountered no interface friction at all, W-06's programming background produced brief exploratory detours that the platform absorbed without compromising the final design or execution. \section{Chapter Summary} -% TODO: Update condition means and summary once W-06 is complete. -This chapter presented results from five completed sessions of the pilot validation study. Across the three Choregraphe sessions (W-01, W-03, W-04), DFS scores were 42.5, 65, and 62.5 (mean 56.7); ERS scores were 65, 60, and 75 (mean 66.7); and SUS scores were 60, 75, and 42.5 (mean 59.2). Design phases in the Choregraphe condition averaged 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. Across the two completed HRIStudio sessions (W-02, W-05), DFS scores were both 100 (mean 100); ERS scores were both 95 (mean 95); and SUS scores were 90 and 70 (mean 80). HRIStudio design phases averaged 21 minutes, both within the allocation. The only unprompted speech content deviation observed in the dataset occurred in the Choregraphe condition (W-01). Branching failures or absences appeared in two of three Choregraphe sessions (W-03, W-04) and in neither completed HRIStudio session. The direction of the evidence across all five measures consistently favors HRIStudio. One HRIStudio session (W-06) remains; Chapter~\ref{ch:discussion} interprets the available findings in the context of the research questions. +This chapter presented results from all six sessions of the pilot validation study. Across the three Choregraphe sessions (W-01, W-03, W-04), DFS scores were 42.5, 65, and 62.5 (mean 56.7); ERS scores were 65, 60, and 75 (mean 66.7); and SUS scores were 60, 75, and 42.5 (mean 59.2). 
Design phases in the Choregraphe condition averaged 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. Across the three HRIStudio sessions (W-02, W-05, W-06), DFS scores were 100, 100, and 100 (mean 100); ERS scores were 95, 95, and 100 (mean 96.7); and SUS scores were 90, 70, and 70 (mean 76.7). HRIStudio design phases averaged 21 minutes, all within the allocation. The only unprompted speech content deviation observed in the dataset occurred in the Choregraphe condition (W-01). Branching problems appeared in all three Choregraphe sessions (absent or non-functional in the W-03 and W-04 designs, resolved only by manual redesign during W-01's trial) and in none of the three HRIStudio sessions. The direction of the evidence across all measures consistently favors HRIStudio. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions. diff --git a/thesis/chapters/08_discussion.tex b/thesis/chapters/08_discussion.tex index 0bd1419..32c76f0 100644 --- a/thesis/chapters/08_discussion.tex +++ b/thesis/chapters/08_discussion.tex @@ -1,7 +1,7 @@ \chapter{Discussion} \label{ch:discussion} -This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study. Where the pilot data derives from an initial subset of sessions, I treat those observations as preliminary evidence and establish the analytical framework that governs interpretation of the full dataset. +This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study.
With all six sessions complete, the interpretation that follows draws on the full dataset rather than a preliminary subset. \section{Interpretation of Findings} @@ -9,13 +9,11 @@ This chapter interprets the results presented in Chapter~\ref{ch:results} agains The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated. -The five completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the two completed HRIStudio sessions, both wizards achieved a DFS of 100. No HRIStudio wizard required a T-type tool-operation intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation (where to find property editing, how the branch block is configured) rather than fundamental operational barriers. By contrast, all three Choregraphe wizards required T-type assistance for core design tasks: W-01 for connection routing and branch wiring, W-03 for none but over-engineered the design beyond the specification, and W-04 for speech content punctuation and a failed choice block attempt. +The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100.
No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt. -The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90 and 70 (mean 80), both above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). Across programming backgrounds, the gap is consistent: W-01 (non-programmer, Choregraphe, SUS 60) versus W-05 (non-programmer, HRIStudio, SUS 70); W-04 (moderate programmer, Choregraphe, SUS 42.5) versus W-02 (programmer, HRIStudio, SUS 90). +The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2); only W-03's 75 exceeded the average usability benchmark of 68~\cite{Brooke1996}, and the condition mean fell well below it. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02).
With programming backgrounds now balanced across conditions---each condition contains one wizard with no programming experience, one with moderate experience, and one with extensive experience---a cross-background comparison is possible: W-01 (non-programmer, Choregraphe, SUS 60) versus W-05 (non-programmer, HRIStudio, SUS 70); W-04 (moderate programmer, Choregraphe, SUS 42.5) versus W-02 (moderate programmer, HRIStudio, SUS 90); W-03 (extensive programmer, Choregraphe, SUS 75) versus W-06 (extensive programmer, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the None and Moderate levels; at the Extensive level the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference. -The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (both within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. 
Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. Neither HRIStudio wizard had that option, and both scored 100 on the DFS. - -% TODO: Add W-06 data to condition means once session is complete. +The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. HRIStudio's constrained model left little room for such elaboration, and all three of its wizards scored 100 on the DFS.
Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action. -HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in either completed HRIStudio session. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe condition produced the only speech substitution in the dataset (W-01) and two sessions in which branching was absent from the design entirely (W-03, W-04). +HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in any of the three HRIStudio sessions. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe condition produced the only speech substitution in the dataset (W-01) and two sessions in which branching was absent or non-functional in the design (W-03, W-04). -ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were both 95 (mean 95).
The branching item is particularly instructive: in the Choregraphe condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In both completed HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error. In neither HRIStudio session did branch execution depend on tool-operation guidance from the researcher. - -% TODO: Add W-06 ERS data once session is complete. +ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were 95, 95, and 100 (mean 96.7). The branching item is particularly instructive: in the Choregraphe condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In all three HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session did branch execution depend on tool-operation guidance from the researcher. 
\subsection{Session Timing and Downstream Effects} -W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforce the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score. +W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied. -Across the five completed sessions, design phase overruns are concentrated in the Choregraphe condition. 
W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (programmer) both overran in the Choregraphe condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator, and supports the conclusion that the overrun pattern is attributable to tool condition rather than wizard background alone. +Across all six sessions, design phase overruns are concentrated in the Choregraphe condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator, and supports the conclusion that the overrun pattern is attributable to tool condition rather than wizard background alone. With programming backgrounds balanced across conditions, the design-phase timing difference cannot be attributed to prior programming experience. \section{Comparison to Prior Work} -The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. 
The pattern of help requests observed, in which W-01 understood the task but struggled with the tool's interface mechanisms, aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple. +The accessibility findings are consistent with prior characterizations of both tools. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study confirms that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards---spanning no, moderate, and extensive programming experience---achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform. -The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. 
propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice (whether it reduces deviations in practice) is what the ERS comparison will reveal. +The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data is consistent with this design intent: no speech content deviations occurred across all three HRIStudio sessions, and the condition ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial. -The SUS score of 60 for Choregraphe falls below scores reported for general-purpose visual programming tools in other HCI studies, though direct comparison is complicated by task and population differences. It is consistent with the finding that domain-specific visual programming environments carry learning curves that programming experience alone does not fully resolve~\cite{Bartneck2024}. - -The HRIStudio SUS mean of 80 (across two completed sessions) compared to the Choregraphe mean of 59.2 is consistent with this expectation.
A 20-point gap is practically significant even in a pilot sample: it places the Choregraphe condition below average usability and the HRIStudio condition well above it, across wizards with different programming backgrounds. The Choregraphe score of 42.5 from W-04 falls in a range typically characterized as poor usability, a finding that is especially notable given that W-04 had moderate programming experience and engaged with the tool actively rather than passively. +The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. With programming backgrounds balanced across conditions, the overall 17.5-point gap in condition means reflects a genuine tool-level effect rather than a sampling artifact. The gap is largest at the Moderate experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the Extensive level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference while the accessibility advantage remains pronounced for non-programmers and moderate programmers. \section{Limitations} @@ -55,10 +49,10 @@ This study has several limitations that must be considered when interpreting the \textbf{Single task.} Both conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. 
The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences. -\textbf{Condition imbalance.} Because participants were randomly assigned, the final sample may distribute programmers unevenly across conditions, confounding the comparison. With a small $N$, random assignment does not guarantee balance across programming background. +\textbf{Condition balance.} Random assignment produced a programming-background distribution that happens to be balanced: each condition contains one wizard with no programming experience, one with moderate experience, and one with extensive experience. While this balance is favorable for interpretation, it was not guaranteed by design. The small $N$ means that balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was not assessed or controlled.
Across all primary measures, the directional evidence favors HRIStudio. The Choregraphe condition produced a mean DFS of 56.7, mean ERS of 66.7, and mean SUS of 59.2, with design phase overruns in all three sessions and branching failures or absences in two. The two completed HRIStudio sessions produced mean DFS and ERS scores of 100 and 95 respectively, mean SUS of 80, both design phases within the allocation, and no speech content deviations. The specification deviation observed in W-01 illustrates the reproducibility problem concretely; its absence in the HRIStudio condition is consistent with the enforcement model's design intent. One HRIStudio session (W-06) remains; its results will complete the condition comparison. The limitations of this pilot study, including sample size, task simplicity, and condition imbalance by programming background, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}. +This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. The Choregraphe condition produced a mean DFS of 56.7, mean ERS of 66.7, and mean SUS of 59.2, with design phase overruns in all three sessions and branching failures or absences in two. The three HRIStudio sessions produced mean DFS 100, mean ERS 96.7, and mean SUS 76.7, all three design phases within the allocation, and no speech content deviations. W-06 produced the only perfect ERS in the dataset. The specification deviation observed in W-01 illustrates the reproducibility problem concretely; its absence across all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across conditions, strengthening the cross-background comparisons. 
The limitations of this pilot study, including sample size, task simplicity, and the single-session design, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}. diff --git a/thesis/chapters/09_conclusion.tex b/thesis/chapters/09_conclusion.tex index 6561de0..e8c8b93 100644 --- a/thesis/chapters/09_conclusion.tex +++ b/thesis/chapters/09_conclusion.tex @@ -17,9 +17,9 @@ This thesis makes three contributions to the field of HRI research infrastructur The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small $N$ directional study. -On accessibility, the evidence from five completed sessions is consistent and directional. The Choregraphe condition produced a mean DFS of 56.7 across three wizards, with design phases averaging 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. The two completed HRIStudio sessions each produced a DFS of 100, with design phases averaging 21 minutes, both within the allocation. The most direct demonstration comes from W-05: a Chemical Engineering faculty member with no programming background trained in 6 minutes, completed a perfect design in 18 minutes, and ran the trial to completion without tool-operation difficulty. Choregraphe's finite state machine model, with boxes connected by signals, imposed cognitive overhead that domain knowledge of the task alone could not resolve; HRIStudio's timeline-based model did not produce this friction for any wizard regardless of background. SUS scores reflect the same pattern: Choregraphe mean 59.2 (below average), HRIStudio mean 80 (above average). 
+On accessibility, the evidence from all six sessions is consistent and directional. The Choregraphe condition produced a mean DFS of 56.7 across three wizards, with design phases averaging 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. All three HRIStudio sessions produced a DFS of 100, with design phases averaging 21 minutes, all within the allocation. The most direct demonstration comes from W-05: a Chemical Engineering faculty member with no programming background trained in 6 minutes, completed a perfect design in 18 minutes, and ran the trial to completion without tool-operation difficulty. Choregraphe's finite state machine model, with boxes connected by signals, imposed cognitive overhead that domain knowledge of the task alone could not resolve; HRIStudio's timeline-based model did not produce this friction for any wizard regardless of background. SUS scores reflect the same pattern: Choregraphe mean 59.2 (below average), HRIStudio mean 76.7 (above average). -On reproducibility, the specification deviation observed in W-01's Choregraphe session, a substituted rock color in the robot's speech that was undetected until execution, illustrates the failure mode the reproducibility problem predicts. No equivalent speech content deviation occurred in either HRIStudio session. Branching, the other primary reliability measure, was present in the design and executed in both HRIStudio sessions. W-05's branch fired cleanly via programmed conditional logic; W-02's session experienced a brief platform-side misfire corrected immediately by manual step selection, logged as an H-type (platform behavior) rather than a wizard error. In neither HRIStudio session was branching absent from the design or dependent on tool-operation guidance from the researcher. 
By contrast, branching was absent from two Choregraphe designs entirely (W-03, W-04) and resolved by manual re-routing in a third (W-01). ERS condition means reflect the outcome: 66.7 for Choregraphe, 95 for HRIStudio (two sessions complete). The enforcement model's design intent, locking speech at design time and presenting it during execution rather than requiring re-entry, appears to produce the reliability difference the architecture was designed to achieve. One HRIStudio session (W-06) remains; its inclusion will complete the condition comparison and may refine these means, but is unlikely to reverse the direction of the evidence. +On reproducibility, the specification deviation observed in W-01's Choregraphe session, a substituted rock color in the robot's speech that was undetected until execution, illustrates the failure mode the reproducibility problem predicts. No equivalent speech content deviation occurred in any of the three HRIStudio sessions. Branching, the other primary reliability measure, was present in the design and executed in all three HRIStudio sessions. W-05's branch fired cleanly via programmed conditional logic; W-02's session experienced a brief platform-side misfire corrected immediately by manual step selection, logged as an H-type (platform behavior) rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session was branching absent from the design or dependent on tool-operation guidance from the researcher. By contrast, branching was absent from two Choregraphe designs entirely (W-03, W-04) and resolved by manual re-routing in a third (W-01). ERS condition means reflect the outcome: 66.7 for Choregraphe, 96.7 for HRIStudio. W-06 produced the only perfect ERS in the dataset (100), with a three-minute trial run entirely without researcher intervention. 
The enforcement model's design intent, locking speech at design time and presenting it during execution rather than requiring re-entry, appears to produce the reliability difference the architecture was designed to achieve. \section{Future Directions} diff --git a/thesis/chapters/app_materials.tex b/thesis/chapters/app_materials.tex index 18c3fdd..161fc44 100644 --- a/thesis/chapters/app_materials.tex +++ b/thesis/chapters/app_materials.tex @@ -1,7 +1,7 @@ \chapter{Completed Study Materials} \label{app:completed_materials} -This appendix contains the completed study instruments for each of the five sessions conducted prior to the submission of this thesis (W-01 through W-05). The DFS and ERS were scored during and immediately after each session using live observation and the Observer Data Sheet; the SUS was completed by the wizard during the debrief phase. +This appendix contains the completed study instruments for each of the six sessions conducted prior to the submission of this thesis (W-01 through W-06). The DFS and ERS were scored during and immediately after each session using live observation and the Observer Data Sheet; the SUS was completed by the wizard during the debrief phase. 
\medskip \noindent\textbf{Contents of this appendix, in order:} @@ -11,34 +11,41 @@ This appendix contains the completed study instruments for each of the five sess \item \textbf{W-03 (Choregraphe):} ODS, DFS, ERS, SUS \item \textbf{W-04 (Choregraphe):} ODS, DFS, ERS, SUS \item \textbf{W-05 (HRIStudio):} ODS, DFS, ERS, SUS + \item \textbf{W-06 (HRIStudio):} ODS, DFS, ERS, SUS \end{itemize} % --- W-01 ------------------------------------------------------------------- \includepdf[pages=-,pagecommand={}]{pdfs/completed/01/ODS-01.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/01/DFS-01.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/01/ERS-01.pdf} -\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/SUS-01C.pdf} +\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/SUS-01.pdf} % --- W-02 ------------------------------------------------------------------- \includepdf[pages=-,pagecommand={}]{pdfs/completed/02/ODS-02.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/02/DFS-02.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/02/ERS-02.pdf} -\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/SUS-02H.pdf} +\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/SUS-02.pdf} % --- W-03 ------------------------------------------------------------------- \includepdf[pages=-,pagecommand={}]{pdfs/completed/03/ODS-03.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/03/DFS-03.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/03/ERS-03.pdf} -\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/SUS-03C.pdf} +\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/SUS-03.pdf} % --- W-04 ------------------------------------------------------------------- \includepdf[pages=-,pagecommand={}]{pdfs/completed/04/ODS-04.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/04/DFS-04.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/04/ERS-04.pdf} -\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/SRS-04C.pdf} 
+\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/SUS-04.pdf} % --- W-05 ------------------------------------------------------------------- \includepdf[pages=-,pagecommand={}]{pdfs/completed/05/ODS-05.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/05/DFS-05.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/05/ERS-05.pdf} \includepdf[pages=-,pagecommand={}]{pdfs/completed/05/SUS-05.pdf} + +% --- W-06 ------------------------------------------------------------------- +\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/ODS-06.pdf} +\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/DFS-06.pdf} +\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/ERS-06.pdf} +\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/SUS-06.pdf} diff --git a/thesis/chapters/app_tech_docs.tex b/thesis/chapters/app_tech_docs.tex index 15e051f..7536152 100644 --- a/thesis/chapters/app_tech_docs.tex +++ b/thesis/chapters/app_tech_docs.tex @@ -1,49 +1,258 @@ \chapter{Technical Documentation} \label{app:tech_docs} -This appendix documents the specific technologies and libraries used to build HRIStudio, organized by the three architectural layers described in Chapter~\ref{ch:design}. The goal here is reference, not justification; Chapter~\ref{ch:implementation} explains the reasoning behind the major architectural choices. +This appendix documents the specific technologies, infrastructure, and integration mechanisms used to build HRIStudio, organized by the three architectural layers described in Chapter~\ref{ch:design}. The goal here is reference, not justification; Chapter~\ref{ch:implementation} explains the reasoning behind the major architectural choices. \section{Technology Stack} +Table~\ref{tbl:tech-stack} lists the principal dependencies and their roles. The entire codebase is written in TypeScript, so type inconsistencies between layers are caught at compile time rather than appearing as runtime failures during a trial. 
+ +\begin{table}[htbp] +\centering +\footnotesize +\begin{tabular}{|l|l|l|} +\hline +\textbf{Component} & \textbf{Version} & \textbf{Role} \\ +\hline +Next.js (App Router) & 16.2 & Full-stack React framework \\ +\hline +React & 19.2 & User interface rendering \\ +\hline +TypeScript & --- & Static typing across the full stack \\ +\hline +tRPC & 11.10 & Type-safe API between client and server \\ +\hline +Better Auth & 1.5 & Authentication and session management \\ +\hline +Drizzle ORM & 0.41 & Type-safe database access and migrations \\ +\hline +PostgreSQL & 15 & Primary relational database \\ +\hline +MinIO & latest & S3-compatible object storage (video/audio) \\ +\hline +Bun & runtime & WebSocket server for real-time trial communication \\ +\hline +Tailwind CSS + shadcn/ui & 4.1 / 0.0.4 & Styling and UI component library \\ +\hline +\texttt{@dnd-kit} & --- & Drag-and-drop for experiment designer \\ +\hline +ROS~2 Humble & --- & Robot middleware (NAO6 integration stack) \\ +\hline +Docker Compose & --- & Multi-container orchestration \\ +\hline +\end{tabular} +\caption{Principal dependencies in the HRIStudio technology stack.} +\label{tbl:tech-stack} +\end{table} + \subsection{User Interface Layer} -The frontend is built on Next.js (App Router) using React and TypeScript. TypeScript is used throughout the entire codebase, including the server and data access layers, so that type inconsistencies between layers are caught at compile time rather than at runtime. Styling is handled with Tailwind CSS and the shadcn/ui component library. The drag-and-drop canvas in the Design interface uses the \texttt{@dnd-kit} library (\texttt{@dnd-kit/core} and \texttt{@dnd-kit/sortable}) to manage nested drag operations for arranging steps and action blocks. +The frontend is built on Next.js using React and TypeScript. Styling is handled with Tailwind CSS and the shadcn/ui component library, which provides accessible, pre-built UI primitives built on Radix UI. 
The drag-and-drop canvas in the Design interface uses the \texttt{@dnd-kit} library (\texttt{@dnd-kit/core} and \texttt{@dnd-kit/sortable}) to manage nested drag operations for arranging steps and action blocks. \subsection{Application Logic Layer} -The server runs as a Next.js Node.js process. API routes use tRPC over HTTP for typed request/response calls; real-time communication during live trials uses a persistent WebSocket connection via the \texttt{ws} package. Authentication and session management are handled by NextAuth.js (v5 beta) with the \texttt{@auth/drizzle-adapter} and bcryptjs for password hashing. Currently, credential-based (username and password) authentication is supported. +The server runs as a Next.js process. API routes use tRPC over HTTP for typed request/response calls; real-time communication during live trials uses a separate WebSocket server running on the Bun runtime (described in Section~\ref{sec:ws-arch}). Authentication and session management are handled by Better Auth with the Drizzle adapter for database-backed sessions. Passwords are hashed with bcrypt (cost factor~12). Currently, credential-based (username and password) authentication is supported; the architecture allows adding OAuth providers without changes to the session model. \subsection{Data and Robot Control Layer} -Experiment protocols, trial records, and user data are stored in PostgreSQL. The schema and all database queries are managed through Drizzle ORM, which provides compile-time type safety for database interactions. Action configuration parameters and plugin-specific fields are stored as JSONB columns, which allows the same schema to accommodate any robot's action types. +Experiment protocols, trial records, and user data are stored in PostgreSQL. The schema and all database queries are managed through Drizzle ORM, which provides compile-time type safety for database interactions. 
Action configuration parameters and plugin-specific fields are stored as JSONB columns, which allows the same schema to accommodate any robot's action types without schema migrations. -Video and audio recordings captured during trials are stored in a self-hosted MinIO instance, an S3-compatible object storage service. Recordings are captured in the browser using the native MediaRecorder API (assisted by \texttt{react-webcam}) and uploaded to MinIO as a chunked transfer when the trial concludes. +Video and audio recordings captured during trials are stored in a self-hosted MinIO instance, an S3-compatible object storage service. Recordings are captured in the browser using the native MediaRecorder API and uploaded to MinIO when the trial concludes. Structured data (experiment specifications, trial event logs) and media files are stored separately: the database handles queryable records, and MinIO handles large binary files that the system never queries by content. -Robot communication is handled through a ROS Bridge (\texttt{rosbridge\_suite} or \texttt{ros2-web-bridge}) running on the robot's local network. The server connects to the bridge over a WebSocket and exchanges JSON-encoded ROS messages; it does not run as a ROS node itself. The bridge address is configured per robot in the plugin file (for example, \texttt{"rosbridgeUrl": "ws://localhost:9090"} in the NAO6 plugin). +Robot communication is handled through a ROS~2 WebSocket bridge running on the robot's local network. The HRIStudio server connects to the bridge over a WebSocket and exchanges JSON-encoded ROS messages; it does not run as a ROS node itself. The bridge address is configured per robot in the plugin file. For actions that do not require ROS message passing, the system can also execute commands directly on the robot via SSH (see Section~\ref{sec:nao6-integration}). 
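To make the bridge interaction concrete, the following sketch shows the shape of the JSON frames exchanged over a rosbridge WebSocket connection. The envelope with an \texttt{op} field is the standard rosbridge v2 protocol; the \texttt{/speech} topic name and the message payload are illustrative assumptions, not HRIStudio's actual topic layout.

```typescript
// Sketch of JSON frames sent to a rosbridge WebSocket endpoint.
// The "op" envelope is standard rosbridge protocol; the topic name and
// payload below are hypothetical examples, not HRIStudio's real topics.

type RosBridgeMsg =
  | { op: "advertise"; topic: string; type: string }
  | { op: "subscribe"; topic: string; type: string }
  | { op: "publish"; topic: string; msg: Record<string, unknown> };

// Advertise a (hypothetical) speech topic, then publish one utterance.
const advertise: RosBridgeMsg = {
  op: "advertise",
  topic: "/speech",
  type: "std_msgs/msg/String",
};

const publish: RosBridgeMsg = {
  op: "publish",
  topic: "/speech",
  msg: { data: "Hello, I am NAO." },
};

// Over a live connection these would be sent as JSON text frames, e.g.:
//   const ws = new WebSocket("ws://localhost:9090");
//   ws.send(JSON.stringify(advertise));
//   ws.send(JSON.stringify(publish));
console.log(JSON.stringify(publish));
```

Because every frame is plain JSON, the server can speak to any robot behind the bridge without linking against ROS client libraries, which is what allows it to avoid running as a ROS node itself.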
-\section{Deployment} +\section{Deployment Infrastructure} +\label{sec:deployment} -The full stack is orchestrated using Docker Compose. The \texttt{docker-compose.yml} file defines three services: the PostgreSQL database (\texttt{postgres:15}), the MinIO storage instance, and the Next.js application server. Starting the entire system on any machine with Docker installed is a single \texttt{docker compose up} command. This configuration is intended for on-premises deployment, which is important for studies involving participant data that cannot leave the institution's network. +HRIStudio is deployed as two Docker Compose stacks: one runs the application and its backing services, and a second runs the robot integration layer. This separation allows the application to run on any host while the robot stack runs on a machine with network access to the physical robot. Both stacks can run on the same machine for single-lab deployments. -\section{Plugin Specification} -Robot capabilities are defined in JSON plugin files. Each file describes a robot platform and the actions it supports. The structure of a plugin file is as follows: +\subsection{Application Stack} +The application stack is defined in \texttt{hristudio/docker-compose.yml} and provides three services: + +\begin{description} +\item[db.] PostgreSQL~15 with a persistent named volume. Exposes port~5432. +\item[minio.] MinIO object storage with a persistent named volume. Exposes port~9000 (S3 API) and port~9001 (web console). +\item[createbuckets.] An initialization container that runs once at startup using the MinIO client to create the default storage bucket. +\end{description} + +The Next.js application server and the Bun WebSocket server run outside Docker on the host, connecting to the containerized database and object store. Starting the backing services requires a single \texttt{docker compose up} command.
This configuration is intended for on-premises deployment, which is important for studies involving participant data that cannot leave the institution's network. + +\subsection{NAO6 Integration Stack} +\label{sec:nao6-integration} + +The NAO6 integration stack is defined in a separate repository and provides three ROS~2 services that collectively bridge HRIStudio to the physical robot. + +\begin{enumerate} +\item The \textbf{nao\_driver} service runs the NaoQi driver ROS~2 node, which connects to the NAO's proprietary framework over the local network and publishes sensor data (joint states, camera feeds) as standard ROS~2 topics. +\item The \textbf{ros\_bridge} service runs the rosbridge WebSocket server, which exposes all ROS~2 topics over a WebSocket interface on a configurable port (default~9090). This is the endpoint that the HRIStudio server connects to. +\item The \textbf{ros\_api} service provides runtime introspection of available ROS~2 topics, services, and parameters. +\end{enumerate} + +All three services are built from a single Dockerfile based on the ROS~2 Humble base image (Ubuntu~22.04). The image installs the NaoQi driver and rosbridge server packages along with their dependencies (NaoQi libraries, bridge message types, OpenCV bridge, and TF2) and builds them with colcon. All services use host networking so that ROS~2 discovery and the NaoQi connection operate without port forwarding. + +Before starting the driver, an initialization script connects to the NAO via SSH and prepares it for external control: + +\begin{enumerate} +\item Disables Autonomous Life, which would otherwise cause the robot to move unpredictably. +\item Calls \texttt{ALMotion.wakeUp} to energize the motors. +\item Commands the robot to assume a standing posture via the ALRobotPosture service. +\end{enumerate} + +Environment variables for the robot IP address, credentials, and bridge port are read from a \texttt{.env} file shared across all three services. 
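The three initialization steps above can be sketched as a short script. This is a dry-run sketch, not the repository's actual script: the robot IP, the \texttt{nao} SSH user, and the wrapper function are assumptions for illustration, while \texttt{ALAutonomousLife}, \texttt{ALMotion}, and \texttt{ALRobotPosture} are standard NaoQi services invokable through \texttt{qicli call}.

```shell
#!/bin/sh
# Sketch of the NAO6 pre-flight sequence issued over SSH before the driver
# starts. NAO_IP and the 'nao' user are hypothetical defaults; the qicli
# calls target standard NaoQi services.
NAO_IP="${NAO_IP:-192.168.1.10}"

nao_call() {
  # Dry run: print the command instead of executing it. On a live robot,
  # replace 'echo' with:  ssh "nao@${NAO_IP}" qicli call "$@"
  echo "nao@${NAO_IP}: qicli call $*"
}

nao_call ALAutonomousLife.setState disabled    # 1. stop autonomous behaviors
nao_call ALMotion.wakeUp                       # 2. energize the motors
nao_call ALRobotPosture.goToPosture Stand 0.8  # 3. standing posture
```

Running the sequence idempotently at container startup means the robot reaches a known motor state before any ROS~2 topic traffic begins, regardless of what state a previous session left it in.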
+ +\subsection{Communication Between Stacks} + +Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker stacks and the components that run on the host. The HRIStudio server communicates with the robot integration stack over a single WebSocket connection to the \texttt{rosbridge\_websocket} endpoint. For actions that bypass ROS entirely (posture changes, animation playback), the server connects directly to the NAO via SSH and invokes NaoQi commands through the \texttt{qicli} command-line tool. Both communication paths are configured per-robot in the plugin file. + +\begin{figure}[htbp] +\centering +\begin{tikzpicture}[ + box/.style={rectangle, draw=black, thick, rounded corners=2pt, align=center, + font=\footnotesize, inner sep=4pt, minimum height=0.9cm}, + container/.style={rectangle, draw=black!60, thick, dashed, rounded corners=4pt, + inner sep=8pt}, + arrow/.style={->, thick}, + lbl/.style={font=\scriptsize\itshape, fill=white, inner sep=1pt}] + + %% ---- Browser ---- + \node[box, fill=gray!10, minimum width=3.5cm] (browser) at (0, 7.2) + {Browser Client\\[-1pt]{\scriptsize React, tRPC, WebSocket}}; + + %% ---- Host processes ---- + \node[box, fill=gray!20, minimum width=2.6cm] (nextjs) at (-1.8, 5.4) + {Next.js Server\\[-1pt]{\scriptsize port 3000}}; + \node[box, fill=gray!20, minimum width=2.6cm] (wsserver) at (1.8, 5.4) + {Bun WS Server\\[-1pt]{\scriptsize port 3001}}; + + \begin{scope}[on background layer] + \node[container, fill=blue!4, + fit=(nextjs)(wsserver), + label={[font=\scriptsize\bfseries, anchor=south]above:Host}] {}; + \end{scope} + + %% ---- Docker App Stack ---- + \node[box, fill=gray!15, minimum width=2.2cm] (pg) at (-1.8, 3.3) + {PostgreSQL\\[-1pt]{\scriptsize port 5432}}; + \node[box, fill=gray!15, minimum width=2.2cm] (minio) at (1.8, 3.3) + {MinIO\\[-1pt]{\scriptsize port 9000}}; + + \begin{scope}[on background layer] + \node[container, fill=green!4, + fit=(pg)(minio), + label={[font=\scriptsize\bfseries, 
anchor=south]above:Application Stack}] {}; + \end{scope} + + %% ---- NAO6 Docker Stack ---- + \node[box, fill=gray!30, minimum width=1.7cm] (driver) at (-2.4, 1.2) + {nao\_driver}; + \node[box, fill=gray!30, minimum width=1.7cm] (bridge) at (0, 1.2) + {ros\_bridge\\[-1pt]{\scriptsize port 9090}}; + \node[box, fill=gray!30, minimum width=1.7cm] (rosapi) at (2.4, 1.2) + {ros\_api}; + + \begin{scope}[on background layer] + \node[container, fill=orange!6, + fit=(driver)(bridge)(rosapi), + label={[font=\scriptsize\bfseries, anchor=south]above:NAO6 Integration Stack}] {}; + \end{scope} + + %% ---- NAO Robot ---- + \node[box, fill=gray!40, minimum width=2.8cm] (nao) at (0, -0.8) + {NAO6 Robot\\[-1pt]{\scriptsize NaoQi}}; + + %% ---- Arrows: browser to host ---- + \draw[arrow] (browser.south west) -- node[lbl, left] {HTTP} (nextjs.north); + \draw[arrow] (browser.south east) -- node[lbl, right] {WS} (wsserver.north); + + %% ---- Host internal ---- + \draw[arrow, dashed] (nextjs.east) -- node[lbl, above] {broadcast} (wsserver.west); + + %% ---- Host to app stack (straight down) ---- + \draw[arrow] (nextjs.south) -- (pg.north); + \draw[arrow] ([xshift=4pt]nextjs.south east) -- (minio.north west); + + %% ---- Next.js to ros_bridge: route down the LEFT outside ---- + \draw[arrow] (nextjs.west) -- ++(-1.2, 0) |- node[lbl, pos=0.22, left] {WS} (bridge.west); + + %% ---- Next.js to NAO via SSH: route down the RIGHT outside ---- + \draw[arrow, dashed] ([yshift=-2pt]nextjs.west) -- ++(-1.6, 0) |- node[lbl, pos=0.18, left] {SSH} (nao.west); + + %% ---- ROS containers to robot ---- + \draw[arrow] (driver.south) -- ([xshift=-8pt]nao.north); + \draw[arrow] (bridge.south) -- ([xshift=8pt]nao.north); + +\end{tikzpicture} +\caption{Deployment architecture: two Docker stacks and their communication paths.} +\label{fig:deployment-arch} +\end{figure} + +\section{WebSocket Architecture} +\label{sec:ws-arch} + +Real-time communication during trials is handled by a dedicated WebSocket server 
that runs as a separate process alongside the Next.js application server. The WebSocket server is implemented in TypeScript and runs on the Bun runtime, listening on port~3001. + +When a wizard or observer opens the Execution interface for a trial, the browser establishes a WebSocket connection to the server, passing the trial identifier and an authentication token as query parameters. The server registers the connection in an in-memory map keyed by client identifier and also records it in the database (\texttt{hs\_ws\_connection} table) for persistence across restarts. + +The server handles four message types from connected clients: + +\begin{description} +\item[Heartbeat.] Keeps the connection alive; the server responds with a timestamped acknowledgment. +\item[Request trial status.] Returns the current trial state (status, current step index) by querying the database. +\item[Request trial events.] Returns the most recent trial events from the trial event log table. +\item[Ping.] Returns a pong response with a timestamp for latency measurement. +\end{description} + +When the Next.js server needs to push an update to all clients observing a trial (for example, after a step completes), it sends an HTTP POST to the WebSocket server's internal \texttt{/internal/broadcast} endpoint. The WebSocket server then forwards the message to every client registered for that trial. This architecture separates the stateful WebSocket connections from the stateless HTTP request handling of the Next.js server. + +\section{Plugin System} + +Robot capabilities are defined in JSON plugin files hosted in a plugin repository. A plugin repository is a static file server (served by an nginx container on port~8080 in the default configuration) that exposes three resources: + +\begin{description} +\item[\texttt{repository.json}.] Repository metadata including name, maintainers, trust level, supported ROS~2 distributions, and compatibility constraints. +\item[\texttt{plugins/index.json}.] 
An array of plugin filenames available in the repository. +\item[\texttt{plugins/\{name\}.json}.] Individual plugin files, one per robot platform. +\end{description} + +When an administrator triggers a repository sync in the HRIStudio admin interface, the server fetches the repository metadata, retrieves the plugin index, and then fetches each plugin file. The action definitions from each plugin are stored as JSONB in the \texttt{hs\_robot\_plugin} database table, making them available to the experiment designer and the execution engine without further network requests. + +\subsection{Plugin File Structure} + +Each plugin file is a self-contained description of a robot platform. The top-level fields include robot metadata (name, manufacturer, version, capabilities, physical specifications), a ROS~2 configuration block (namespace, default topics), and an array of action definitions. The official repository currently contains three plugins: \texttt{nao6-ros2.json}, \texttt{turtlebot3-burger.json}, and \texttt{turtlebot3-waffle.json}. + +Each action definition specifies: \begin{itemize} - \item \textbf{Metadata}: name, version, and a human-readable description of the platform. - \item \textbf{ROS configuration} (\texttt{ros2Config}): the bridge URL and any global connection parameters. - \item \textbf{Actions}: an array of action definitions. Each action specifies: - \begin{itemize} - \item A unique action type identifier (e.g., \texttt{speak}, \texttt{raise\_arm}) - \item A human-readable label shown in the Design interface - \item A parameter schema defining the input fields the researcher configures - \item The target ROS topic and message type - \item A mapping from parameter names to message fields - \end{itemize} +\item A unique identifier (e.g., \texttt{say\_text}, \texttt{walk\_forward}, \texttt{play\_animation\_bow}). +\item A human-readable name and icon for display in the Design interface. 
+\item A parameter schema (JSON Schema format) defining the input fields the researcher configures. +\item A timeout and retry policy. +\item A ROS~2 dispatch block containing the target topic, message type, and a payload mapping. \end{itemize} -When the server dispatches a robot command, it loads the active plugin, locates the matching action definition, constructs the ROS message by applying the parameter mapping, and sends it to the bridge. Adding a new robot means writing a new plugin file; no server code changes are required. +The payload mapping supports two modes. In \emph{static} mode, the plugin defines a fixed message template with placeholder tokens (e.g., \texttt{\{\{text\}\}}) that the execution engine fills from the researcher's parameters. In \emph{SSH} mode, the action bypasses ROS entirely and executes a shell command on the robot via SSH; this is used for NaoQi-native operations such as posture changes and animation playback that are not exposed as ROS~2 topics. + +The NAO6 plugin defines 20 actions across three categories: speech (say text, say with emotion), movement (walk forward/backward, turn, stop, wake up, rest, stand, sit, crouch), and animation (bow, wave, nod, head shake, shrug, enthusiastic gesture, and others). Movement actions publish ROS~2 Twist messages to the velocity command topic. Animation actions publish animation path strings to the animation topic. Posture and lifecycle commands use SSH mode to call NaoQi services directly via the \texttt{qicli} command-line tool. + +\subsection{Adding a New Robot} + +Adding support for a new robot platform requires writing a single JSON plugin file and placing it in the repository. No changes to the HRIStudio server code are required. The plugin author defines the robot's capabilities, maps each action to a ROS~2 topic or SSH command, and specifies the parameter schema for each action. 
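To make the static payload mapping concrete, the sketch below pairs a hypothetical \texttt{say\_text} action definition with a template-filling routine of the kind the execution engine would apply. The field names, the \texttt{/speech} topic, and the \texttt{fillTemplate} function are illustrative assumptions, not HRIStudio's actual plugin schema.

```typescript
// Hypothetical action definition in the spirit of the plugin format
// described above (field names are assumptions, not the real schema).
const sayText = {
  id: "say_text",
  name: "Say Text",
  parameterSchema: { type: "object", properties: { text: { type: "string" } } },
  ros2: {
    topic: "/speech",
    messageType: "std_msgs/msg/String",
    payload: { data: "{{text}}" }, // static-mode template
  },
};

// Static-mode mapping: recursively fill {{param}} tokens in the
// plugin-defined template with the researcher-supplied parameters.
function fillTemplate(template: unknown, params: Record<string, string>): unknown {
  if (typeof template === "string") {
    return template.replace(/\{\{(\w+)\}\}/g, (_, key: string) => params[key] ?? "");
  }
  if (Array.isArray(template)) return template.map((t) => fillTemplate(t, params));
  if (template !== null && typeof template === "object") {
    return Object.fromEntries(
      Object.entries(template).map(([k, v]) => [k, fillTemplate(v, params)]),
    );
  }
  return template;
}
```

Dispatching this action with the parameter value \texttt{text = "Hello"} fills the template to \texttt{\{ data: "Hello" \}}, which the engine would then publish to the configured topic; non-string template values pass through unchanged.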
After the repository is synced, the new robot's actions appear in the experiment designer and can be used in any study. + +\section{Database Schema} + +The database schema is managed through Drizzle ORM and is organized into five groups: + +\begin{description} +\item[Authentication.] User accounts, sessions, and system role assignments. +\item[Study management.] Studies with status tracking, study membership with per-study roles, and participant records with consent tracking. +\item[Experimental design.] Experiments, steps, and actions. Each action stores its transport type, configuration, parameter schema, and retry policy as JSONB columns. +\item[Trial execution.] Trials with status and duration tracking, and a trial event log that records every action, step transition, and deviation with a timestamp. +\item[Robot integration.] Robot definitions and installed plugins with cached action definitions. A block registry maps visual blocks in the experiment designer to their underlying action types, parameter schemas, and display properties. +\end{description} + +All tables use a consistent \texttt{hs\_} prefix (e.g., \texttt{hs\_study}, \texttt{hs\_trial}, \texttt{hs\_action}). \section{Role-Based Access Control} -HRIStudio uses a two-layer role system. System roles (\texttt{systemRoleEnum}) govern what a user can do across the platform: \emph{administrator}, \emph{researcher}, \emph{wizard}, and \emph{observer}. Study roles (\texttt{studyMemberRoleEnum}) govern what a user can see and do within a specific study: \emph{owner}, \emph{researcher}, \emph{wizard}, and \emph{observer}. A user's system role and study role are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. +As described in Chapter~\ref{ch:implementation}, HRIStudio uses a two-layer role system.
System roles are stored as values of the \texttt{systemRoleEnum} type: \emph{administrator}, \emph{researcher}, \emph{wizard}, and \emph{observer}. Study roles are stored as values of the \texttt{studyMemberRoleEnum} type: \emph{owner}, \emph{researcher}, \emph{wizard}, and \emph{observer}. The two layers are represented independently in the database and checked separately. On the server, tRPC middleware enforces access control: public procedures require no authentication, protected procedures require a valid session, and admin procedures additionally verify the user's system role. Study-level permissions are checked per-request by querying the \texttt{hs\_study\_member} table.
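The two-layer check can be sketched as two independent lookups. The following is a simplified model with plain functions standing in for the tRPC middleware and the \texttt{hs\_study\_member} query; the role names mirror the enums above, but the function names and data shapes are illustrative assumptions.

```typescript
// Simplified model of the two-layer role check described above.
type SystemRole = "administrator" | "researcher" | "wizard" | "observer";
type StudyRole = "owner" | "researcher" | "wizard" | "observer";

interface Session { userId: string; systemRole: SystemRole; }

// Stand-in for the hs_study_member table: studyId -> userId -> role.
type StudyMembers = Map<string, Map<string, StudyRole>>;

// Admin procedures require both a valid session and the administrator
// system role (the platform-wide layer).
function canCallAdminProcedure(session: Session | null): boolean {
  return session !== null && session.systemRole === "administrator";
}

// Study-level permissions are resolved per request by a membership
// lookup (the per-study layer); null means "not a member".
function studyRoleFor(members: StudyMembers, studyId: string, userId: string): StudyRole | null {
  return members.get(studyId)?.get(userId) ?? null;
}
```

Because each lookup is independent, the same user can hold the wizard role on one study and the observer role on another with no extra configuration: each request simply resolves the membership row for the study it touches.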
Felipe Perrone}, year = {2025}, - booktitle = {Proceedings of the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)}, + booktitle = {2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)}, abstract = {Human-robot interaction (HRI) research plays a pivotal role in shaping how robots communicate and collaborate with humans. However, conducting HRI studies can be challenging, particularly those employing the Wizard-of-Oz (WoZ) technique. WoZ user studies can have technical and methodological complexities that may render the results irreproducible. We propose to address these challenges with HRIStudio, a modular web-based platform designed to streamline the design, the execution, and the analysis of WoZ experiments. HRIStudio offers an intuitive interface for experiment creation, real-time control and monitoring during experimental runs, and comprehensive data logging and playback tools for analysis and reproducibility. By lowering technical barriers, promoting collaboration, and offering methodological guidelines, HRIStudio aims to make human-centered robotics research easier and empower researchers to develop scientifically rigorous user studies.}, } @@ -156,7 +156,7 @@ series = {OzCHI '15} @inproceedings{Steinfeld2009, author = {Steinfeld, Aaron and Jenkins, Odest Chadwicke and Scassellati, Brian}, - title = {{The oz of wizard: simulating the human for interaction research}}, + title = {{The Oz of Wizard: Simulating the Human for Interaction Research}}, year = {2009}, isbn = {9781605582934}, publisher = {Association for Computing Machinery}, @@ -167,7 +167,7 @@ series = {OzCHI '15} @inproceedings{Gibert2013, author = {Gibert, Guillaume and Petit, Morgan and Lance, Frederic and Pointeau, Gregoire and Dominey, Peter F.}, - title = {{What makes human so different? Analysis of human-humanoid robot interaction with a super wizard of oz platform}}, + title = {{What Makes Humans So Different? 
Analysis of Human-Humanoid Robot Interaction with a Super Wizard of Oz Platform}}, year = {2013}, booktitle = {Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, pages = {931--938}, @@ -176,7 +176,7 @@ series = {OzCHI '15} @article{Strazdas2020, author = {Strazdas, Daniel and Hintz, Jonathan and Felßberg, Anna Maria and Al-Hamadi, Ayoub}, - title = {{Robots and wizards: An investigation into natural human–robot interaction}}, + title = {{Robots and Wizards: An Investigation into Natural Human–Robot Interaction}}, journal = {IEEE Access}, volume = {8}, pages = {218808--218821}, @@ -186,7 +186,7 @@ series = {OzCHI '15} @inproceedings{Helgert2024, author = {Helgert, Anna and Straßmann, Christopher and Eimler, Sabine C.}, - title = {{Unlocking potentials of virtual reality as a research tool in human-robot interaction: A wizard-of-oz approach}}, + title = {{Unlocking Potentials of Virtual Reality as a Research Tool in Human-Robot Interaction: A Wizard-of-Oz Approach}}, year = {2024}, booktitle = {Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction}, pages = {123--132}, @@ -208,14 +208,14 @@ abstract="TypeScript is an extension of JavaScript intended to enable easier dev isbn="978-3-662-44202-9" } -@article{Brooke1996, +@incollection{Brooke1996, author = {Brooke, John}, +booktitle = {Usability Evaluation in Industry}, -year = {1995}, -month = {11}, -pages = {}, -title = {SUS: A quick and dirty usability scale}, -volume = {189}, -journal = {Usability Eval. 
Ind.} +year = {1996}, +pages = {207--212}, +title = {{SUS: A Quick and Dirty Usability Scale}}, +publisher = {CRC Press}, +doi = {10.1201/9781498710411-35} } @article{HoffmanZhao2021, diff --git a/thesis/thesis.tex b/thesis/thesis.tex index f5a097a..4bd3e8d 100644 --- a/thesis/thesis.tex +++ b/thesis/thesis.tex @@ -26,6 +26,8 @@ \chair{Alan Marchiori} \maketitle +\frontmatter + \acknowledgments{ (Draft Acknowledgments) } @@ -37,7 +39,7 @@ \listoffigures \abstract{ - [Abstract goes here] + The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction (HRI) research to prototype and evaluate robot interaction designs before autonomous capabilities are fully developed. However, two persistent problems limit the technique's effectiveness. First, existing WoZ tools impose high technical barriers that prevent domain experts outside engineering from conducting independent studies (the Accessibility Problem). Second, the fragmented landscape of custom, robot-specific tools makes it difficult to verify or replicate experimental results across labs (the Reproducibility Problem). This thesis formalizes a set of design principles for WoZ infrastructure that address both problems simultaneously: a hierarchical specification model that organizes experiments as studies, experiments, steps, and actions; an event-driven execution model that separates protocol design from live trial control; and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are realized in HRIStudio, an open-source, web-based platform that provides a visual experiment designer, a guided wizard execution interface, automated timestamped logging with explicit deviation tracking, and role-based access control for research teams. A pilot between-subjects study compared HRIStudio against Choregraphe, a representative baseline tool, using six faculty participants who each designed and executed an interactive storytelling task on a NAO robot. 
Across all six sessions, HRIStudio participants achieved higher design fidelity (mean 100 vs. 56.7), higher execution reliability (mean 96.7 vs. 66.7), and higher perceived usability (mean SUS 76.7 vs. 59.2) than Choregraphe participants. The only unprompted specification deviation in the dataset occurred in the Choregraphe condition, illustrating the reproducibility failure mode HRIStudio's enforcement model is designed to prevent. While the pilot scale precludes inferential claims, the directional evidence across all measures suggests that the right software architecture can make WoZ experiments more accessible to non-programmers and more reproducible across executions. } \mainmatter