This chapter presents the results of the pilot validation study described in the preceding chapter.
\section{Participant Overview}
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University faculty members drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|c|c|c|}
\hline
Wizard & Tool & Department & Programming & DFS & ERS & SUS \\
\hline
W-01 & Choregraphe & Digital Humanities & None & 42.5 & 65 & 60 \\
\hline
W-02 & HRIStudio & Logic and Philosophy of Science & Moderate & 100 & 95 & 90 \\
\hline
W-03 & Choregraphe & Computer Science & Extensive & 65 & 60 & 75 \\
\hline
W-04 & Choregraphe & Chemical Engineering & Moderate & 62.5 & 75 & 42.5 \\
\hline
W-05 & HRIStudio & Chemical Engineering & None & 100 & 95 & 70 \\
\hline
W-06 & HRIStudio & Computer Science & Extensive & 100 & 100 & 70 \\
\hline
\end{tabular}
\caption{Summary of wizard participants, assigned conditions, and scores.}
\label{tbl:sessions}
\end{table}

\subsection{Design Fidelity Score}
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification. Scores range from 0 to 100, with full points awarded only when a component is both present and correct.
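The rubric logic can be sketched in a few lines of Python. The item names and point weights below are illustrative placeholders, not the study's actual rubric values; only the present/correct scoring rule is taken from the text above.

```python
def score_item(points: float, present: bool, correct: bool) -> float:
    """Full points only when an item is both present and correct;
    half points when present but incorrect; zero when absent."""
    if not present:
        return 0.0
    return float(points) if correct else points / 2

# Hypothetical item weights summing to 100 (illustrative only).
rubric = {
    "intro_speech": 10, "narrative_speech": 10,
    "comprehension_question": 10, "branch_responses": 10,
    "intro_wave": 10, "narrative_gesture": 10, "nod_or_shake": 10,
    "conditional_branch": 15, "step_sequence": 15,
}

def dfs(marks: dict) -> float:
    """marks maps each item name to a (present, correct) pair."""
    return sum(score_item(rubric[k], *marks[k]) for k in rubric)

# A design with every item present and correct scores the maximum.
perfect = {k: (True, True) for k in rubric}
print(dfs(perfect))  # 100.0
```

A design missing one item, or implementing it incorrectly, loses that item's full or half weight, which is how mixed scores such as 62.5 arise under a 0--100 scale.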
W-01 (Choregraphe, Digital Humanities, no programming experience) received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
W-02 (HRIStudio, Logic and Philosophy of Science, moderate programming) received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation.
W-03 (Choregraphe, Computer Science, extensive programming) received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, though not all were delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation.
W-04 (Choregraphe, Chemical Engineering, moderate programming experience) received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15).
W-05 (HRIStudio, Chemical Engineering, no programming experience) received a DFS of 100. The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification.
W-06 (HRIStudio, Computer Science, extensive programming) received a DFS of 100. Two T-type interventions were logged during the design phase, both pertaining to item 6 (narrative gesture): at 15:21, W-06 attempted to use parallel execution for a gesture action and was unable to edit the action node; at 15:24, W-06 encountered difficulty resetting the robot's posture and was directed to recommended posture blocks. In both cases, W-06 resolved the issue independently after the initial prompt. W-06's programming background led to a more elaborate design than the specification required, including extra posture-reset actions that were ultimately redundant since the robot was already in the correct starting position; these additions did not affect scoring since all required actions were present and correct in the exported project file. The conditional branch was wired correctly, and all speech and gesture items matched the specification. W-06 completed the design in 21 minutes, within the 30-minute allocation.
Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100). Across the three Choregraphe sessions, DFS scores were 42.5, 65, and 62.5 (mean 56.7).
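As a quick arithmetic check, the DFS condition means follow directly from the per-wizard scores reported above:

```python
from statistics import mean

# DFS scores per wizard, grouped by tool condition.
dfs_scores = {
    "Choregraphe": [42.5, 65, 62.5],   # W-01, W-03, W-04
    "HRIStudio":   [100, 100, 100],    # W-02, W-05, W-06
}
for tool, vals in dfs_scores.items():
    print(f"{tool}: mean DFS {mean(vals):.1f}")
# Choregraphe: mean DFS 56.7
# HRIStudio: mean DFS 100.0
```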
\subsection{Execution Reliability Score}
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
W-02 (HRIStudio) received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. This incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item scored 5 out of 10 on its own merits: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred.
W-04 (Choregraphe) received an ERS of 75. The trial ran for approximately four minutes.
W-05 (HRIStudio) received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. One trial intervention pair was logged: the wizard briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute on the NAO6 robot; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed.
W-06 (HRIStudio) received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.
Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7). In the HRIStudio condition, branching was present in every design and executed correctly in every trial; no trial required tool-operation guidance from the researcher to complete. In the Choregraphe condition, branching was absent from two of three designs (W-03, W-04) and was resolved by manual redesign during the trial in the third (W-01).
\subsection{System Usability Scale}
W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS scores places 68 as the average; scores below 68 are generally considered below average usability~\cite{Brooke1996}. A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.
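For reference, a SUS score is computed from ten five-point Likert items using the standard procedure~\cite{Brooke1996}: odd-numbered items contribute their response minus one, even-numbered items contribute five minus their response, and the sum is scaled by 2.5 to yield a 0--100 score. A minimal sketch (the function name is illustrative):

```python
def sus_score(responses):
    """Standard SUS scoring (Brooke, 1996) for ten 1-5 Likert items.
    Odd-numbered items (0-based even indices) contribute (r - 1);
    even-numbered items contribute (5 - r); the sum is scaled by 2.5."""
    assert len(responses) == 10
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))
    return total * 2.5

# All-neutral responses (3 everywhere) land at the scale midpoint.
print(sus_score([3] * 10))  # 50.0
```

Note that the midpoint of the scale (50) is not the empirical average; the benchmark of 68 cited above comes from observed distributions of SUS scores.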
W-02 rated HRIStudio with a SUS score of 90, well above the average benchmark of 68 and the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.
W-03 rated Choregraphe with a SUS score of 75, above the average benchmark of 68. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.
W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study.
W-05 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.
W-06 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds.
HRIStudio condition SUS scores were 90, 70, and 70 (mean 76.7), all above the average benchmark of 68. Choregraphe condition SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the benchmark.
\section{Supplementary Measures}
\subsection{Session Timing}
Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|c|c|c|c|c|}
\hline
Wizard & Training & Design & Trial & Debrief & Total \\
\hline
W-04 & 17 min & 35 min & 4 min & 4 min & 60 min \\
\hline
W-05 & 6 min & 18 min & 4 min & 4 min & 32 min \\
\hline
W-06 & 8 min & 21 min & 3 min & 5 min & 37 min \\
\hline
\end{tabular}
\caption{Time spent in each session phase per wizard participant.}
\label{tbl:timing}
\end{table}

W-01's design phase extended to 35 minutes, five minutes over the 30-minute allocation.
W-02's training phase concluded in 7 minutes, roughly half the standard 15-minute allocation. This reflects HRIStudio's more intuitive onboarding rather than simply W-02's technical background: the platform's guided workflow and timeline-based model required less explanation before the wizard was ready to begin the design phase. W-02's design phase then concluded in 24 minutes, within the allocation, and the trial ran for approximately five minutes.
W-03's design phase extended to 37 minutes, the longest design phase in the study, despite W-03's programming background. The overrun reflects not conventional interface friction but the time spent constructing and then revising an over-engineered design; beginning sessions from W-02 onward enforced the 30-minute transition, so W-03's overrun constitutes a procedural exception noted in the observer log.
W-04's design phase ran 35 minutes without completion, the only session in which the wizard did not finish before the cutoff. Training took 17 minutes, the longest training phase in the study; W-04 entered the design phase with questions about concurrent block execution that presaged later difficulties with branching.
W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering, both with no robotics background, yet the difference in tool condition produced a 17-minute gap in design completion time and a qualitatively different session experience.
W-06's training phase concluded in 8 minutes and the design phase completed in 21 minutes, both within their allocations. The overall session lasted 37 minutes. The trial ran for approximately three minutes, the shortest trial phase in the study, reflecting a clean execution without errors or researcher interventions.
Across all six sessions, Choregraphe design phases averaged approximately 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs before the session time limit, while W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across three sessions, all within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 7 minutes.
\subsection{Intervention Log}
W-04 generated the highest T-type count in the Choregraphe condition: five in total.
W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two trial interventions (1 T-type, 1 G-type). The design-phase T marks concerned interface orientation (right-pane editing, branch block configuration); the C-type clarifications concerned conceptual mappings between the written specification and HRIStudio's structural model. Importantly, none of the clarifications blocked design completion, and the final DFS was unaffected. The C-type pattern for W-05 reflects a different kind of engagement from Choregraphe's T-type pattern: questions about what the tool means rather than how to operate it.
W-06 generated two T-type interventions during the design phase, both pertaining to item 6 (narrative gesture): one for an attempted use of parallel action execution, and one for difficulty resetting the robot's posture, for which specific recommended blocks were suggested. W-06 resolved both issues independently after the initial prompts. No interventions of any type were logged during the trial phase, making W-06 the only wizard in the study to complete the trial with zero interventions.
\section{Qualitative Findings}
A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when it surfaced during execution. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.
No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe condition.
\subsection{Wizard Experience}
W-01 expressed that the training was comprehensible and that the underlying logic of the task was clear. The primary source of frustration was Choregraphe's interface for handling conditional branches and managing the timing of parallel behaviors. Post-session comments suggested that the wizard would not use Choregraphe independently for future HRI work without technical support.
W-02 engaged with HRIStudio's timeline-based interface without requiring tool-operation guidance. The session proceeded efficiently, and W-02's Logic and Philosophy of Science background, combined with moderate programming experience, appeared to support both the technical implementation and the contextual understanding of the interaction scenario. No notable sources of friction were observed during design or trial phases.
W-03 approached the task as a programming challenge, applying Choregraphe's full feature set beyond what the specification required. When the WoZ framing was clarified (specifically that branching should reflect wizard decisions rather than onboard robot logic), W-03 revised the design but the over-engineered structure introduced earlier persisted in the final export and was reflected in the DFS score. W-03 described Choregraphe as a powerful block programming environment, but noted that applying it to this task was harder than its general capability implied, a characterization consistent with the tool-task mismatch the study is designed to surface.
W-04 approached the session with clear engagement and self-driven exploration.
W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe condition.
W-06 approached the design with a programmer's instinct for thoroughness, initially exploring parallel execution structures for gesture actions and adding posture-reset steps beyond what the specification called for. The two T-type design-phase interventions reflected this exploratory behavior rather than confusion about the task. The extra posture-reset actions in the final design were redundant in practice since the robot was already in the correct starting position, but they did not interfere with the required items and the design achieved a perfect DFS. W-06's trial ran entirely without researcher intervention, producing the only perfect ERS in the study. The session illustrates a different accessibility profile from W-05: where W-05 encountered no interface friction at all, W-06's programming background produced brief exploratory detours that the platform absorbed without compromising the final design or execution.
\section{Chapter Summary}
This chapter presented results from all six sessions of the pilot validation study. Across the three Choregraphe sessions (W-01, W-03, W-04), DFS scores were 42.5, 65, and 62.5 (mean 56.7); ERS scores were 65, 60, and 75 (mean 66.7); and SUS scores were 60, 75, and 42.5 (mean 59.2). Design phases in the Choregraphe condition averaged 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. Across the three HRIStudio sessions (W-02, W-05, W-06), DFS scores were 100, 100, and 100 (mean 100); ERS scores were 95, 95, and 100 (mean 96.7); and SUS scores were 90, 70, and 70 (mean 76.7). HRIStudio design phases averaged 21 minutes, all within the allocation. The only unprompted speech content deviation observed in the dataset occurred in the Choregraphe condition (W-01). Branching failures or absences appeared in two of three Choregraphe sessions (W-03, W-04) and in none of the three HRIStudio sessions. The direction of the evidence across all measures consistently favors HRIStudio. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.