draft1 revisions complete

2026-05-08 07:08:55 -04:00 · 2026-04-12 21:14:20 -04:00
parent e1af7b1f8f
commit 6ccf32ee4d
5 changed files with 77 additions and 74 deletions
@@ -40,7 +40,7 @@ This table also presents numerical data representing the study's results, which

 The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification, the experiment they received. Scores range from 0 to 100, with full points awarded only when a component — a rubric criterion representing a required speech action, gesture, or control-flow element — is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)

-Across the six participants, DFS scores divided sharply by condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.
+Across the six participants, DFS scores divided sharply by study condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.

 W-01 received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.

@@ -60,7 +60,7 @@ Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100

 The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial.

-Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with no session requiring tool-operation guidance to reach the interaction's conclusion. Choregraphe trials averaged 66.7, with branching failures or absences in two of three sessions and the study's only unprompted content deviation occurring in the third. The per-session details are as follows.
+Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with no session requiring tool-operation guidance to reach the interaction's conclusion. Choregraphe trials averaged 66.7, with branching failures in two of three sessions and a speech content deviation in the third (see Section~\ref{sec:results-qualitative} for details). The per-session details are as follows.

 W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.

@@ -74,23 +74,23 @@ W-05 received an ERS of 95. The trial ran for approximately four minutes and rea

 W-06 received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.

-Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7). In the HRIStudio condition, branching was present in every design and executed correctly in every trial; no trial required tool-operation guidance from the researcher to complete. In the Choregraphe condition, branching was absent from two of three designs (W-03, W-04) and was resolved by manual redesign during the trial in the third (W-01).
+Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7).

 \subsection{System Usability Scale}

-W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS scores places 68 as the average; scores below 68 are generally considered below average usability~\cite{Brooke1996}. A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.
+The System Usability Scale (SUS) uses a 0--100 scale with a conventional average of 68; scores above 68 indicate above-average perceived usability~\cite{Brooke1996}. W-01 rated Choregraphe with a SUS score of 60. A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.

-W-02 rated HRIStudio with a SUS score of 90, well above the average benchmark of 68 and the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.
+W-02 rated HRIStudio with a SUS score of 90, the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.

-W-03 rated Choregraphe with a SUS score of 75, above the average benchmark of 68. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.
+W-03 rated Choregraphe with a SUS score of 75. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.

-W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study and well below the average benchmark of 68. Researcher notes recorded that W-04 attempted the task with evident self-driven engagement but that the platform appeared to get in the way. The gap between effort and outcome in W-04's session, a motivated wizard who exceeded the time allocation without completing the design and required four T-type interventions, is directly reflected in this rating.
+W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study. Researcher notes recorded that W-04 attempted the task with evident self-driven engagement but that the platform appeared to get in the way. The gap between effort and outcome in W-04's session, a motivated wizard who exceeded the time allocation without completing the design and required four T-type interventions, is directly reflected in this rating.

-W-05 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.
+W-05 rated HRIStudio with a SUS score of 70. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.

-W-06 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds.
+W-06 rated HRIStudio with a SUS score of 70. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds.

-HRIStudio condition SUS scores were 90, 70, and 70 (mean 76.7), all above the average benchmark of 68. Choregraphe condition SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the benchmark.
+HRIStudio study condition SUS scores were 90, 70, and 70 (mean 76.7). Choregraphe study condition SUS scores were 60, 75, and 42.5 (mean 59.2).

 \section{Supplementary Measures}

@@ -131,7 +131,7 @@ W-03's design phase extended to 37 minutes, the longest design phase in the stud

 W-04's design phase ran 35 minutes without completion, the only session in which the wizard did not finish before the cutoff. Training took 17 minutes, the longest training phase in the study; W-04 entered the design phase with questions about concurrent block execution that presaged later difficulties with branching.

-W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering, both with no robotics background, yet the difference in tool condition produced a 17-minute gap in design completion time and a qualitatively different session experience.
+W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering, both with no robotics background, yet the difference in assigned tool produced a 17-minute gap in design completion time and a qualitatively different session experience.

 W-06's training phase concluded in 8 minutes and the design phase completed in 21 minutes, both within their allocations. The overall session lasted 37 minutes. The trial ran for approximately three minutes, the shortest trial phase in the study, reflecting a clean execution without errors or researcher interventions.

@@ -145,19 +145,20 @@ W-02 generated minimal interventions. No T-type tool-operation assistance was re

 W-03 generated one C-type intervention during the design phase: a clarification that control-flow logic dependent on onboard speech recognition was outside the study's scope. No T-type interventions were required; W-03 navigated Choregraphe independently throughout the design phase. The absence of T-type interventions for W-03, compared to W-01's high T-type volume, suggests that programming background moderates the interface accessibility problem in Choregraphe: the tool does not block programmers the way it blocked a non-programmer, though it still produced a lower DFS than HRIStudio.

-W-04 generated the highest T-type count in the Choregraphe condition: five total design-phase interventions (4 T-type, 1 C-type), plus one T-type intervention during the trial. The design-phase T marks covered speech content punctuation ($\times$3, items 1--3) and the failed choice block attempt (item 8). The pattern echoes W-01's volume of tool-level friction, concentrated in a wizard with moderate rather than no programming experience.
+W-04 generated the highest T-type count in the Choregraphe study condition: five total design-phase interventions (4 T-type, 1 C-type), plus one T-type intervention during the trial. The design-phase T marks covered speech content punctuation ($\times$3, items 1--3) and the failed choice block attempt (item 8). The pattern echoes W-01's volume of tool-level friction, concentrated in a wizard with moderate rather than no programming experience.

 W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two trial interventions (1 T-type, 1 G-type). The design-phase T marks concerned interface orientation (right-pane editing, branch block configuration); the C-type clarifications concerned conceptual mappings between the written specification and HRIStudio's structural model. Importantly, none of the clarifications blocked design completion, and the final DFS was unaffected. The C-type pattern for W-05 reflects a different kind of engagement from Choregraphe's T-type pattern: questions about what the tool means rather than how to operate it.

 W-06 generated two T-type interventions during the design phase, both pertaining to item 6 (narrative gesture): one for an attempted use of parallel action execution, and one for difficulty resetting the robot's posture, for which specific recommended blocks were suggested. W-06 resolved both issues independently after the initial prompts. No interventions of any type were logged during the trial phase, making W-06 the only wizard in the study to complete the trial with zero interventions.

 \section{Qualitative Findings}
+\label{sec:results-qualitative}

 \subsection{Observed Specification Deviation}

 A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when it surfaced during execution. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.

-No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe condition.
+No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe study condition.

 \subsection{Wizard Experience}

@@ -169,10 +170,10 @@ W-03 approached the task as a programming challenge, applying Choregraphe's full

 W-04 approached the session with clear engagement and self-driven exploration: independently attempting Choregraphe features (concurrent blocks, choice node) that went beyond what prior instructions had covered. The researcher noted ``Great attempt. Self-driven to explore.'' The SUS score of 42.5 reflects a session where ambition consistently exceeded what the tool's interface could support without additional guidance. W-04's post-session comment that quality was attempted but the platform got in the way is arguably the most direct characterization of the accessibility problem in the dataset.

-W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe condition.
+W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe study condition.

 W-06 approached the design with a programmer's instinct for thoroughness, initially exploring parallel execution structures for gesture actions and adding posture-reset steps beyond what the specification called for. The two T-type design-phase interventions reflected this exploratory behavior rather than confusion about the task. The extra posture-reset actions in the final design were redundant in practice since the robot was already in the correct starting position, but they did not interfere with the required items and the design achieved a perfect DFS. W-06's trial ran entirely without researcher intervention, producing the only perfect ERS in the study. The session illustrates a different accessibility profile from W-05: where W-05 encountered no interface friction at all, W-06's programming background produced brief exploratory detours that the platform absorbed without compromising the final design or execution.

 \section{Chapter Summary}

-Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. On accessibility, every HRIStudio wizard produced a perfect design without requiring tool-operation assistance, while all three Choregraphe wizards scored below 70 and the only wizard who did not complete the design before the session cutoff was in the Choregraphe condition. On execution consistency, HRIStudio trials reached their conclusion without researcher guidance in every case; Choregraphe produced branching failures or absences in two of three sessions and the study's only unprompted content deviation from the written specification in the third. Perceived usability followed the same split: all HRIStudio ratings exceeded the SUS benchmark of 68, while all Choregraphe ratings fell at or below it. Supplementary measures reinforced this pattern — HRIStudio design phases completed faster, generated fewer tool-operation interventions, and produced no incomplete designs, while Choregraphe consistently required more time and guidance to reach the same outcome. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
+Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. Every HRIStudio wizard produced a perfect design without tool-operation assistance, while all three Choregraphe wizards scored below perfect and the only wizard who did not finish before the session cutoff was in the Choregraphe study condition. On execution consistency, HRIStudio trials reached their conclusion without researcher guidance in every case; Choregraphe produced branching failures in two of three sessions and a content deviation in the third (see Section~\ref{sec:results-qualitative}). Perceived usability followed the same split, with all HRIStudio ratings above the SUS average and all Choregraphe ratings below it. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.