\chapter{Results}
\label{ch:results}
This chapter presents the results of the pilot validation study described in Chapter~\ref{ch:evaluation}. Because this is a small pilot, I report descriptive statistics and qualitative observations rather than inferential tests. The goal is directional evidence: the chapter reports whether patterns in the data consistently favor HRIStudio across the primary and supplementary measures.
\section{Participant Overview}
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University faculty members drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment.
\begin{table}[htbp]
\centering
\begin{tabular}{llllrrr}
\hline
Wizard & Condition & Department & Programming & DFS & ERS & SUS \\
\hline
W-01 & Choregraphe & Digital Humanities & None & 42.5 & 65 & 60 \\
W-02 & HRIStudio & Logic and Philosophy of Science & Moderate & 100 & 95 & 90 \\
W-03 & Choregraphe & Computer Science & Extensive & 65 & 60 & 75 \\
W-04 & Choregraphe & Chemical Engineering & Moderate & 62.5 & 75 & 42.5 \\
W-05 & HRIStudio & Chemical Engineering & None & 100 & 95 & 70 \\
W-06 & HRIStudio & Computer Science & Extensive & 100 & 100 & 70 \\
\hline
\end{tabular}
\caption{Participants, assigned conditions, programming backgrounds, and summary scores.}
\label{tbl:sessions}
\end{table}
The table also lists each wizard's summary scores (DFS, ERS, and SUS), which the following sections discuss in detail.
\section{Primary Measures}
\subsection{Design Fidelity Score (DFS)}

The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification of the experiment they received. Scores range from 0 to 100, with full points awarded only when a component (a rubric criterion representing a required speech action, gesture, or control-flow element) is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)
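The present-and-correct rule can be sketched as a simple roll-up. In this sketch the item weights and the example item list are hypothetical (the actual rubric items and weights are defined in Section~\ref{sec:measures}); only the scoring rule itself (full credit when present and correct, half credit when present but incorrect, zero when absent) is taken from the rubric:

```python
def dfs(items):
    """Roll rubric items up into a Design Fidelity Score.

    Each item is (weight, present, correct). Full points are awarded
    only when the component is both present and correct; half points
    when present but incorrect; zero when absent. Weights are assumed
    to sum to 100 for a complete rubric.
    """
    total = 0.0
    for weight, present, correct in items:
        if present and correct:
            total += weight
        elif present:
            total += weight / 2
    return total

# Hypothetical three-item rubric fragment: one item fully correct,
# one present but incorrect, one absent.
print(dfs([(10, True, True), (10, True, False), (10, False, False)]))  # 15.0
```

The half-credit branch mirrors how present-but-mismatched items were scored in the sessions below (for example, W-01's speech items 2 and 3), and the zero branch mirrors items absent from a design (W-01's item 7).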
Across the six participants, DFS scores divided sharply by condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.

W-01 (Choregraphe, Digital Humanities, no programming experience) received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
W-02 (HRIStudio, Logic and Philosophy of Science, moderate programming) received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation.
W-03 (Choregraphe, Computer Science, extensive programming) received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification (see Section~\ref{sec:measures}) was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, though not all were delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation.
W-04 (Choregraphe, Chemical Engineering, moderate programming experience) received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15).
W-05 (HRIStudio, Chemical Engineering, no programming experience) received a DFS of 100. The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification.
W-06 (HRIStudio, Computer Science, extensive programming) received a DFS of 100. Two T-type interventions were logged during the design phase, both pertaining to item 6 (narrative gesture): at 15:21, W-06 attempted to use parallel execution for a gesture action and was unable to edit the action node; at 15:24, W-06 encountered difficulty resetting the robot's posture and was directed to recommended posture blocks. In both cases, W-06 resolved the issue independently after the initial prompt. W-06's programming background led to a more elaborate design than the specification required, including extra posture-reset actions that were ultimately redundant since the robot was already in the correct starting position; these additions did not affect scoring since all required actions were present and correct in the exported project file. The conditional branch was wired correctly, and all speech and gesture items matched the specification. W-06 completed the design in 21 minutes, within the 30-minute allocation.
Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100). Across the three Choregraphe sessions, DFS scores were 42.5, 65, and 62.5 (mean 56.7).
\subsection{Execution Reliability Score (ERS)}

The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial.

Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with no session requiring tool-operation guidance to reach the interaction's conclusion. Choregraphe trials averaged 66.7, with branching failures or absences in two of three sessions and the study's only unprompted content deviation occurring in the third. The per-session details are as follows.

W-01 (Choregraphe) received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.

W-02 (HRIStudio) received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. This incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item scored 5 out of 10 on its own merits: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred.

W-03 (Choregraphe) received an ERS of 60. The trial ran for approximately five minutes. Speech execution was partial: two of three items were present but not all delivered correctly. Gesture and speech synchronization was poor throughout the interaction; motion cues were present but did not coordinate reliably with corresponding speech actions. The conditional branch, absent from W-03's design, was not executed during the trial; the interaction proceeded without a branch resolution step. No system disconnections or crashes occurred.

W-04 (Choregraphe) received an ERS of 75. The trial ran for approximately four minutes. Introduction and narrative speech executed correctly. The comprehension question, absent from the design, was not delivered; the interaction proceeded directly to the branch step. A T-type trial intervention was required to remind W-04 how to trigger the branch; the yes-branch response was delivered following that prompt, capping item 4 at 5/10 (T-assisted). Gesture execution was strong: introduction wave, narrative gesture, and nod or head shake all executed correctly. Speech and gesture synchronization scored full points. The pause before the comprehension question scored zero, as no question was delivered. No system errors occurred.

W-05 (HRIStudio) received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. One trial intervention pair was logged: the researcher briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute on the NAO6 robot-side; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed.

W-06 (HRIStudio) received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.
Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7). In the HRIStudio condition, branching was present in every design and executed correctly in every trial; no trial required tool-operation guidance from the researcher to complete. In the Choregraphe condition, branching was absent from two of three designs (W-03, W-04) and was resolved by manual redesign during the trial in the third (W-01).
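The condition-level means above can be recomputed directly from the per-session DFS and ERS values reported in this chapter; a minimal sketch, with the scores copied from the text:

```python
# Per-session (DFS, ERS) scores as reported in this chapter.
scores = {
    "Choregraphe": {"W-01": (42.5, 65), "W-03": (65, 60), "W-04": (62.5, 75)},
    "HRIStudio":   {"W-02": (100, 95), "W-05": (100, 95), "W-06": (100, 100)},
}

for condition, sessions in scores.items():
    dfs_mean = round(sum(d for d, _ in sessions.values()) / len(sessions), 1)
    ers_mean = round(sum(e for _, e in sessions.values()) / len(sessions), 1)
    print(condition, dfs_mean, ers_mean)
# Choregraphe 56.7 66.7
# HRIStudio 100.0 96.7
```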
\section{Chapter Summary}
This chapter presented results from all six sessions of the pilot validation study. Across the three Choregraphe sessions (W-01, W-03, W-04), DFS scores were 42.5, 65, and 62.5 (mean 56.7); ERS scores were 65, 60, and 75 (mean 66.7); and SUS scores were 60, 75, and 42.5 (mean 59.2). Design phases in the Choregraphe condition averaged 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. Across the three HRIStudio sessions (W-02, W-05, W-06), DFS scores were 100, 100, and 100 (mean 100); ERS scores were 95, 95, and 100 (mean 96.7); and SUS scores were 90, 70, and 70 (mean 76.7). HRIStudio design phases averaged 21 minutes, all within the allocation.

Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. On accessibility, every HRIStudio wizard produced a perfect design without requiring tool-operation assistance, while all three Choregraphe wizards scored below 70 and the only wizard who did not complete the design before the session cutoff was in the Choregraphe condition. On execution consistency, HRIStudio trials reached their conclusion without researcher guidance in every case; Choregraphe produced branching failures or absences in two of three sessions and the study's only unprompted content deviation from the written specification in the third. Perceived usability followed a similar split: every HRIStudio rating exceeded the SUS benchmark of 68 (mean 76.7), while the Choregraphe mean of 59.2 fell below it. Supplementary measures reinforced this pattern: HRIStudio design phases completed faster, generated fewer tool-operation interventions, and produced no incomplete designs, while Choregraphe consistently required more time and guidance to reach the same outcome. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
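For reference, the benchmark of 68 applies to scores produced by the standard ten-item SUS procedure: odd-numbered items contribute $r-1$, even-numbered items contribute $5-r$, and the sum is scaled by 2.5 to yield a 0--100 score. A minimal sketch of that scoring, with a hypothetical response vector rather than any participant's data:

```python
def sus_score(responses):
    """Standard SUS scoring for ten raw 1-5 Likert responses.

    Odd-numbered items (positively worded) contribute (r - 1);
    even-numbered items (negatively worded) contribute (5 - r);
    the sum of contributions is multiplied by 2.5.
    """
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

# Hypothetical response vector for illustration only.
print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))  # 80.0
```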