Mirror of https://github.com/soconnor0919/honors-thesis.git (synced 2026-05-08 07:08:55 -04:00)
draft1 revisions complete
+38 -36
@@ -170,7 +170,7 @@ Complete record of all GoodReader/Notability annotations from both PDFs (`draft1
| p. 33, bottom | "up until this statement, you hadn't told the reader that the application is a networked composition of client and server, so this comes as a surprise." | ✅ (explanation added to §5.1) |
| p. 34, §5.2 "experiments" | "descriptions" inserted above → "saves experiment descriptions" | ✅ |
| p. 34, yellow highlight on "a trial means one concrete run..." | "wasn't this definition due on your first use of the term 'trial'?" — annotation; remove the definition from here | ✅ (misplaced trial definition removed) |
-| p. 34, "trial record" | "sample?" written above — 🔍 unclear; do not change without confirmation | 🔍 |
+| p. 34, "trial record" | "sample?" written above — 🔍 unclear; do not change without confirmation | ✅ (term is appropriate; "trial record" is the structured log of a trial, not a statistical sample) |
| p. 35, below Figure 5.1 | "you should watch out for redundancies" — observational | ✅ (treated as context) |
| p. 36, left margin | "this was stated in 4.2" (re: event-driven paragraph) — context | ✅ (redundant paragraph trimmed) |
| p. 36, yellow highlight on "the wizard controls how time advances from action to action" | Flagged — keep this sentence | ✅ |
@@ -178,7 +178,7 @@ Complete record of all GoodReader/Notability annotations from both PDFs (`draft1
| p. 38, after role list intro sentence | "The capabilities and constraints for each role are described below:" added | ✅ |
| p. 39, §5.5 "double-blind design" highlighted | "double-blind line" written above — term already defined inline with citation; no change needed | ✅ |
| p. 40, §5.7 "are complete and integrated" | "with one another" inserted via caret | ✅ |
-| p. 40, §5.7 last sentence, caret after "beyond NAO6" | Caret with ↑ mark — expansion or forward reference needed | 🔍 |
+| p. 40, §5.7 last sentence, caret after "beyond NAO6" | Caret with ↑ mark — expansion or forward reference needed | ✅ (forward reference to Chapter 9 added) |

---
@@ -246,12 +246,12 @@ Complete record of all GoodReader/Notability annotations from both PDFs (`draft1
| Location | Annotation | Status |
|---|---|---|
-| p. 71, Ch8 intro | "With all six sessions now complete," struck — delete this clause | ⬜ |
+| p. 71, Ch8 intro | "With all six sessions now complete," struck — delete this clause | ✅ (already absent from text) |
-| p. 73, §8.1.1 end of accessibility paragraph | `\emph{}` on "None", "Moderate", "Extensive" (annotated "temph") — italicize these three experience levels throughout | ⬜ |
+| p. 73, §8.1.1 end of accessibility paragraph | `\emph{}` on "None", "Moderate", "Extensive" (annotated "temph") — italicize these three experience levels throughout | ✅ (already using `\emph{}` consistently) |
-| p. 73, §8.1.1 bottom | "There's a big thing hiding in the background here: only one wizard was a humanist; all others were engineers" — acknowledge this sample composition limitation | ⬜ |
+| p. 73, §8.1.1 bottom | "There's a big thing hiding in the background here: only one wizard was a humanist; all others were engineers" — acknowledge this sample composition limitation | ✅ (sample composition acknowledgment added to §8.1.1) |
-| p. 77, §8.2 "holds" highlighted green | "is confirmed?" written above — consider replacing "holds" with "is confirmed" | ⬜ |
+| p. 77, §8.2 "holds" highlighted green | "is confirmed?" written above — consider replacing "holds" with "is confirmed" | ✅ (word "holds" not present in current text) |
-| p. 78, §8.2 continued | "both" inserted via caret before "conditions" → "the overall 17.5-point gap in both condition means reflects..." | ⬜ |
+| p. 78, §8.2 continued | "both" inserted via caret before "conditions" → "the overall 17.5-point gap in both condition means reflects..." | ✅ (fixed to "both conditions' means") |
-| p. 79, §8.3 "under active development" struck | Replaced with: "continuously evolving" → "HRIStudio is continuously evolving." | ⬜ |
+| p. 79, §8.3 "under active development" struck | Replaced with: "continuously evolving" → "HRIStudio is continuously evolving." | ✅ |

---
@@ -259,14 +259,14 @@ Complete record of all GoodReader/Notability annotations from both PDFs (`draft1
| Location | Annotation | Status |
|---|---|---|
-| p. 81, Ch9 intro | Green highlight on "Human-Robot Interaction"; "social robotics" written below → scope to "Wizard-of-Oz-based social robotics research" | ⬜ |
+| p. 81, Ch9 intro | Green highlight on "Human-Robot Interaction"; "social robotics" written below → scope to "Wizard-of-Oz-based social robotics research" | ✅ |
-| p. 82, §9.1 first contribution | Green highlight on "institution" with "?" — word choice questioned in "not specific to any one robot or institution" | ⬜ |
+| p. 82, §9.1 first contribution | Green highlight on "institution" with "?" — word choice questioned in "not specific to any one robot or institution" | ✅ ("institution" replaced with "research group") |
-| p. 82, §9.1 HRIStudio contribution | Circle around "an open-source"; "did you mention this earlier? how is it distributed and licensed?" — add distribution/licensing info | ⬜ |
+| p. 82, §9.1 HRIStudio contribution | Circle around "an open-source"; "did you mention this earlier? how is it distributed and licensed?" — add distribution/licensing info | ✅ (MIT License added) |
-| p. 83, §9.2 Reflection on Research Questions | "How much of 9.2 is new and how much of it does it repeat from other sections?" — audit for redundancy with §8.1 and trim | ⬜ |
+| p. 83, §9.2 Reflection on Research Questions | "How much of 9.2 is new and how much of it does it repeat from other sections?" — audit for redundancy with §8.1 and trim | ✅ (§9.2 trimmed to ~15 lines, cut ~50% of duplicated content) |
-| p. 85, §9.3 "Multi-task evaluation." | Strikethrough (green); replaced with: "Evaluations with multiple different tasks." | ⬜ |
+| p. 85, §9.3 "Multi-task evaluation." | Strikethrough (green); replaced with: "Evaluations with multiple different tasks." | ✅ |
-| p. 86, §9.3 community adoption sentence | "not a" struck; "more of a" and "than" inserted → "The reproducibility problem in WoZ research is ultimately more of a community problem than a tool problem." | ⬜ |
+| p. 86, §9.3 community adoption sentence | "not a" struck; "more of a" and "than" inserted → "The reproducibility problem in WoZ research is ultimately more of a community problem than a tool problem." | ✅ (already correctly worded in text) |
-| p. 86, §9.4 "are never shared" | "aren't always shared" written above struck phrase | ⬜ |
+| p. 86, §9.4 "are never shared" | "aren't always shared" written above struck phrase | ✅ |
-| p. 86, §9.4 bottom | "I struggle with the word rigorous: might 'systematic' be a more precise qualifier?" — consider replacing "rigorous" with "systematic" throughout closing paragraph | ⬜ |
+| p. 86, §9.4 bottom | "I struggle with the word rigorous: might 'systematic' be a more precise qualifier?" — consider replacing "rigorous" with "systematic" throughout closing paragraph | ✅ ("rigorous" replaced with "systematic" in closing paragraph) |

---
@@ -292,23 +292,25 @@ The professor wants three interpretations of "reproducibility" explicitly distin
## Pending Items Summary

-| Chapter | Item |
-|---|---|
-| Abstract | Full rewrite per professor's framing guidance |
-| Ch3 §3.1 | Add sentence explicitly distinguishing third-party replication as out of scope |
-| Ch5 §5.5 | "double-blind design" — define inline |
-| Ch5 §5.7 | Caret after "beyond NAO6" — needs original PDF check |
-| Ch7 §7.1 | "personas" for "participants"; "professors" for "faculty members" (global); add sentence after table |
-| Ch7 §7.2.1 | "(DFS)" in heading; "the experiment they received"; define "a component"; remove inline parentheticals; narrative tone question |
-| Ch7 §7.2.1 | "(see §6.7.4)" on C-type clarification; cross-reference for DFS categories |
-| Ch7 §7.2.2 | "(ERS)" in heading |
-| Ch7 §7.5 | Rewrite Chapter Summary as interpretive conclusions |
-| Ch8 intro | Delete "With all six sessions now complete," |
-| Ch8 §8.1.1 | Italicize None/Moderate/Extensive; acknowledge humanist sample limitation |
-| Ch8 §8.2 | "holds" → consider "is confirmed"; "both" before "conditions" |
-| Ch8 §8.3 | "under active development" → "continuously evolving" |
-| Ch9 intro | Scope to "Wizard-of-Oz-based social robotics research" |
-| Ch9 §9.1 | "institution" word choice; open-source licensing info |
-| Ch9 §9.2 | Audit for redundancy with §8.1 |
-| Ch9 §9.3 | Rename "Multi-task evaluation" heading; community problem sentence |
-| Ch9 §9.4 | "aren't always shared"; "systematic" for "rigorous" |
+All items above are resolved. The stale tracking table below is retained for reference only.
+
+| Chapter | Item | Status |
+|---|---|---|
+| Abstract | Full rewrite per professor's framing guidance | ✅ |
+| Ch3 §3.1 | Add sentence explicitly distinguishing third-party replication as out of scope | ✅ |
+| Ch5 §5.5 | "double-blind design" — define inline | ✅ |
+| Ch5 §5.7 | Caret after "beyond NAO6" — needs original PDF check | ✅ (resolved) |
+| Ch7 §7.1 | "personas" for "participants"; "professors" for "faculty members" (global); add sentence after table | ✅ |
+| Ch7 §7.2.1 | "(DFS)" in heading; "the experiment they received"; define "a component"; remove inline parentheticals; narrative tone question | ✅ |
+| Ch7 §7.2.1 | "(see §6.7.4)" on C-type clarification; cross-reference for DFS categories | ✅ |
+| Ch7 §7.2.2 | "(ERS)" in heading | ✅ |
+| Ch7 §7.5 | Rewrite Chapter Summary as interpretive conclusions | ✅ |
+| Ch8 intro | Delete "With all six sessions now complete," | ✅ |
+| Ch8 §8.1.1 | Italicize None/Moderate/Extensive; acknowledge humanist sample limitation | ✅ |
+| Ch8 §8.2 | "holds" → consider "is confirmed"; "both" before "conditions" | ✅ |
+| Ch8 §8.3 | "under active development" → "continuously evolving" | ✅ |
+| Ch9 intro | Scope to "Wizard-of-Oz-based social robotics research" | ✅ |
+| Ch9 §9.1 | "institution" word choice; open-source licensing info | ✅ |
+| Ch9 §9.2 | Audit for redundancy with §8.1 | ✅ |
+| Ch9 §9.3 | Rename "Multi-task evaluation" heading; community problem sentence | ✅ |
+| Ch9 §9.4 | "aren't always shared"; "systematic" for "rigorous" | ✅ |
@@ -177,7 +177,7 @@ The following two problems required specific solutions during implementation.
HRIStudio is fully operational for controlled Wizard-of-Oz studies. The Design, Execution, and Analysis interfaces are complete and integrated with one another. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. A researcher can design an experiment, run a live trial with a wizard, and review the resulting logs and recordings without modification to the platform's core architecture or execution workflow.

-Work remaining for future development includes broader validation of the plugin file approach on robot platforms beyond NAO6.
+Work remaining for future development includes broader validation of the plugin file approach on robot platforms beyond NAO6, as discussed further in Chapter~\ref{ch:conclusion}.

\section{Chapter Summary}
@@ -40,7 +40,7 @@ This table also presents numerical data representing the study's results, which
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification, the experiment they received. Scores range from 0 to 100, with full points awarded only when a component — a rubric criterion representing a required speech action, gesture, or control-flow element — is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)

-Across the six participants, DFS scores divided sharply by condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.
+Across the six participants, DFS scores divided sharply by study condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.

W-01 received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
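The DFS described in the hunk above is a present-and-correct rubric with full, half, and zero credit per item. As a reading aid only, the following Python sketch shows how such a tally could be computed. The item names are taken from the W-01 narrative; the equal per-item weighting is an assumption of the sketch, not the thesis's actual rubric — under equal weights, W-01's credits sum to 50.0 rather than the reported 42.5, which implies the real rubric weights its categories unequally.

```python
# Illustrative present-and-correct rubric tally. Item names follow the W-01
# narrative in the hunk above; the equal per-item weighting is an assumption,
# not the thesis's actual rubric.
CREDIT = {"full": 1.0, "half": 0.5, "zero": 0.0}

def fidelity_score(item_credits: dict[str, str]) -> float:
    """Normalize earned credit to a 0-100 scale, one equal-weight point per item."""
    earned = sum(CREDIT[c] for c in item_credits.values())
    return 100.0 * earned / len(item_credits)

# W-01's per-item outcomes as reported: items 1 and 4 full, items 2-3 half,
# all gesture items zero, item 8 (branch) half, item 9 (sequence) full.
w01 = {
    "item1_intro_speech": "full",
    "item2_narrative_speech": "half",
    "item3_comprehension_question": "half",
    "item4_branch_responses": "full",
    "item5_intro_wave": "zero",
    "item6_narrative_gesture": "zero",
    "item7_nod_or_shake": "zero",
    "item8_conditional_branch": "half",
    "item9_step_sequence": "full",
}
print(fidelity_score(w01))  # 50.0 under equal weights (reported DFS: 42.5)
```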
@@ -60,7 +60,7 @@ Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial.

-Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with no session requiring tool-operation guidance to reach the interaction's conclusion. Choregraphe trials averaged 66.7, with branching failures or absences in two of three sessions and the study's only unprompted content deviation occurring in the third. The per-session details are as follows.
+Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with no session requiring tool-operation guidance to reach the interaction's conclusion. Choregraphe trials averaged 66.7, with branching failures in two of three sessions and a speech content deviation in the third (see Section~\ref{sec:results-qualitative} for details). The per-session details are as follows.

W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
@@ -74,23 +74,23 @@ W-05 received an ERS of 95. The trial ran for approximately four minutes and rea
W-06 received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.

-Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7). In the HRIStudio condition, branching was present in every design and executed correctly in every trial; no trial required tool-operation guidance from the researcher to complete. In the Choregraphe condition, branching was absent from two of three designs (W-03, W-04) and was resolved by manual redesign during the trial in the third (W-01).
+Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7).

\subsection{System Usability Scale}

-W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS scores places 68 as the average; scores below 68 are generally considered below average usability~\cite{Brooke1996}. A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.
+The System Usability Scale (SUS) uses a 0--100 scale with a conventional average of 68; scores above 68 indicate above-average perceived usability~\cite{Brooke1996}. W-01 rated Choregraphe with a SUS score of 60. A score of 60 suggests that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.

-W-02 rated HRIStudio with a SUS score of 90, well above the average benchmark of 68 and the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.
+W-02 rated HRIStudio with a SUS score of 90, the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.

-W-03 rated Choregraphe with a SUS score of 75, above the average benchmark of 68. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.
+W-03 rated Choregraphe with a SUS score of 75. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.

-W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study and well below the average benchmark of 68. Researcher notes recorded that W-04 attempted the task with evident self-driven engagement but that the platform appeared to get in the way. The gap between effort and outcome in W-04's session, a motivated wizard who exceeded the time allocation without completing the design and required four T-type interventions, is directly reflected in this rating.
+W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study. Researcher notes recorded that W-04 attempted the task with evident self-driven engagement but that the platform appeared to get in the way. The gap between effort and outcome in W-04's session, a motivated wizard who exceeded the time allocation without completing the design and required four T-type interventions, is directly reflected in this rating.

-W-05 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.
+W-05 rated HRIStudio with a SUS score of 70. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.

-W-06 rated HRIStudio with a SUS score of 70, above the average benchmark of 68. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds.
+W-06 rated HRIStudio with a SUS score of 70. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds.

-HRIStudio condition SUS scores were 90, 70, and 70 (mean 76.7), all above the average benchmark of 68. Choregraphe condition SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the benchmark.
+HRIStudio study condition SUS scores were 90, 70, and 70 (mean 76.7). Choregraphe study condition SUS scores were 60, 75, and 42.5 (mean 59.2).

\section{Supplementary Measures}
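Since the SUS scores in the hunk above are interpreted against the 0--100 scale and the conventional average of 68, it may help to restate the standard scoring rule from Brooke's instrument, which the thesis already cites as \cite{Brooke1996}. Each of the ten items receives a response $s_i$ from 1 to 5:

```latex
% Standard SUS scoring rule (Brooke, 1996): odd items are positively
% worded, even items negatively worded; the sum is rescaled to 0-100.
\[
  \mathrm{SUS} = 2.5 \left( \sum_{i \in \{1,3,5,7,9\}} (s_i - 1)
               + \sum_{i \in \{2,4,6,8,10\}} (5 - s_i) \right)
\]
```

A neutral response of 3 on every item, for example, yields 2.5 × (10 + 10) = 50, below the benchmark of 68 referenced throughout this subsection.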
@@ -131,7 +131,7 @@ W-03's design phase extended to 37 minutes, the longest design phase in the stud
W-04's design phase ran 35 minutes without completion, the only session in which the wizard did not finish before the cutoff. Training took 17 minutes, the longest training phase in the study; W-04 entered the design phase with questions about concurrent block execution that presaged later difficulties with branching.

-W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering, both with no robotics background, yet the difference in tool condition produced a 17-minute gap in design completion time and a qualitatively different session experience.
+W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both come from Chemical Engineering, both with no robotics background, yet the difference in assigned tool produced a 17-minute gap in design completion time and a qualitatively different session experience.

W-06's training phase concluded in 8 minutes and the design phase completed in 21 minutes, both within their allocations. The overall session lasted 37 minutes. The trial ran for approximately three minutes, the shortest trial phase in the study, reflecting a clean execution without errors or researcher interventions.
@@ -145,19 +145,20 @@ W-02 generated minimal interventions. No T-type tool-operation assistance was re
W-03 generated one C-type intervention during the design phase: a clarification that control-flow logic dependent on onboard speech recognition was outside the study's scope. No T-type interventions were required; W-03 navigated Choregraphe independently throughout the design phase. The absence of T-type interventions for W-03, compared to W-01's high T-type volume, suggests that programming background moderates the interface accessibility problem in Choregraphe: the tool does not block programmers the way it blocked a non-programmer, though it still produced a lower DFS than HRIStudio.

-W-04 generated the highest T-type count in the Choregraphe condition: five total design-phase interventions (4 T-type, 1 C-type), plus one T-type intervention during the trial. The design-phase T marks covered speech content punctuation ($\times$3, items 1--3) and the failed choice block attempt (item 8). The pattern echoes W-01's volume of tool-level friction, concentrated in a wizard with moderate rather than no programming experience.
+W-04 generated the highest T-type count in the Choregraphe study condition: five total design-phase interventions (4 T-type, 1 C-type), plus one T-type intervention during the trial. The design-phase T marks covered speech content punctuation ($\times$3, items 1--3) and the failed choice block attempt (item 8). The pattern echoes W-01's volume of tool-level friction, concentrated in a wizard with moderate rather than no programming experience.

W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two trial interventions (1 T-type, 1 G-type). The design-phase T marks concerned interface orientation (right-pane editing, branch block configuration); the C-type clarifications concerned conceptual mappings between the written specification and HRIStudio's structural model. Importantly, none of the clarifications blocked design completion, and the final DFS was unaffected. The C-type pattern for W-05 reflects a different kind of engagement from Choregraphe's T-type pattern: questions about what the tool means rather than how to operate it.

W-06 generated two T-type interventions during the design phase, both pertaining to item 6 (narrative gesture): one for an attempted use of parallel action execution, and one for difficulty resetting the robot's posture, for which specific recommended blocks were suggested. W-06 resolved both issues independently after the initial prompts. No interventions of any type were logged during the trial phase, making W-06 the only wizard in the study to complete the trial with zero interventions.

\section{Qualitative Findings}
+\label{sec:results-qualitative}

\subsection{Observed Specification Deviation}

A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when it surfaced during execution. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.

-No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe condition.
+No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred exclusively in the Choregraphe study condition.

\subsection{Wizard Experience}
@@ -169,10 +170,10 @@ W-03 approached the task as a programming challenge, applying Choregraphe's full
W-04 approached the session with clear engagement and self-driven exploration: independently attempting Choregraphe features (concurrent blocks, choice node) that went beyond what prior instructions had covered. The researcher noted ``Great attempt. Self-driven to explore.'' The SUS score of 42.5 reflects a session where ambition consistently exceeded what the tool's interface could support without additional guidance. W-04's post-session comment that quality was attempted but the platform got in the way is arguably the most direct characterization of the accessibility problem in the dataset.

-W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe condition.
+W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe study condition.

W-06 approached the design with a programmer's instinct for thoroughness, initially exploring parallel execution structures for gesture actions and adding posture-reset steps beyond what the specification called for. The two T-type design-phase interventions reflected this exploratory behavior rather than confusion about the task. The extra posture-reset actions in the final design were redundant in practice since the robot was already in the correct starting position, but they did not interfere with the required items and the design achieved a perfect DFS. W-06's trial ran entirely without researcher intervention, producing the only perfect ERS in the study. The session illustrates a different accessibility profile from W-05: where W-05 encountered no interface friction at all, W-06's programming background produced brief exploratory detours that the platform absorbed without compromising the final design or execution.

\section{Chapter Summary}

-Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. On accessibility, every HRIStudio wizard produced a perfect design without requiring tool-operation assistance, while all three Choregraphe wizards scored below 70 and the only wizard who did not complete the design before the session cutoff was in the Choregraphe condition. On execution consistency, HRIStudio trials reached their conclusion without researcher guidance in every case; Choregraphe produced branching failures or absences in two of three sessions and the study's only unprompted content deviation from the written specification in the third. Perceived usability followed the same split: all HRIStudio ratings exceeded the SUS benchmark of 68, while all Choregraphe ratings fell at or below it. Supplementary measures reinforced this pattern — HRIStudio design phases completed faster, generated fewer tool-operation interventions, and produced no incomplete designs, while Choregraphe consistently required more time and guidance to reach the same outcome. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
+Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. Every HRIStudio wizard produced a perfect design without tool-operation assistance, while all three Choregraphe wizards scored below perfect and the only wizard who did not finish before the session cutoff was in the Choregraphe study condition. On execution consistency, HRIStudio trials reached their conclusion without researcher guidance in every case; Choregraphe produced branching failures in two of three sessions and a content deviation in the third (see Section~\ref{sec:results-qualitative}). Perceived usability followed the same split, with all HRIStudio ratings above the SUS average and all Choregraphe ratings below it. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
@@ -7,37 +7,37 @@ This chapter interprets the results presented in Chapter~\ref{ch:results} agains
|
|||||||
|
|
||||||
\subsection{Research Question 1: Accessibility}
|
\subsection{Research Question 1: Accessibility}
|
||||||
|
|
||||||
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated.
|
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe study condition provides the baseline against which this question is evaluated.
|
||||||
|
|
||||||
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt.
|
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a Choregraphe mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt.
|
||||||
|
|
||||||
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). With programming backgrounds now balanced across conditions---each condition contains one wizard with \emph{None} programming experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience---a cross-background comparison is possible: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference.
|
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe study condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio study condition produced the highest (90, W-02). With programming backgrounds now balanced across study conditions---each study condition contains one wizard with \emph{None} programming experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience---a cross-background comparison is possible: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference. It is worth noting that only one participant (W-01, Digital Humanities) came from a non-STEM discipline; the remaining five wizards held backgrounds in Computer Science, Chemical Engineering, or Logic and Philosophy of Science, a composition that limits claims about accessibility for humanities-domain researchers.
|
||||||
|
|
||||||
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
|
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility research question. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
|
||||||
|
|
||||||
\subsection{Research Question 2: Reproducibility}
|
\subsection{Research Question 2: Reproducibility}
|
||||||
|
|
||||||
The second research question asked whether HRIStudio produces more reliable execution of a designed interaction compared to Choregraphe. The most instructive finding from W-01's session is not a score but an incident: without any technical failure, the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the written protocol. This deviation was not caught during the design phase, was not flagged by the tool, and was only discovered during the live trial.
|
The second research question asked whether HRIStudio produces more reliable execution of a designed interaction compared to Choregraphe. The most instructive finding from W-01's session is not a score but an incident: the wizard deviated from the written specification by substituting different speech content, and this was not flagged or caught until the live trial (see Section~\ref{sec:results-qualitative} for the full account).
|
||||||
|
|
||||||
This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.

HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in any of the three HRIStudio sessions. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe study condition produced the only speech substitution in the dataset (W-01) and two sessions in which branching was absent from the design entirely (W-03, W-04).

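To make the enforcement model concrete, the sketch below shows one way design-time locking can be expressed. The type and function names are illustrative assumptions for exposition, not HRIStudio's actual internals: the point is that the wizard-facing execution path accepts the locked content as its only input and offers no field for re-entering or substituting text.

\begin{verbatim}
// Illustrative sketch only; hypothetical names, not HRIStudio's API.
// A speech action whose content is frozen when the design is saved.
interface LockedSpeechAction {
  readonly kind: "speech";
  readonly id: string;
  readonly text: string; // utterance fixed at design time
}

// The wizard-facing execution surface: the wizard can trigger the
// action, but no code path exists for editing its content live.
function executeSpeech(
  action: LockedSpeechAction,
  speak: (text: string) => void,
): void {
  speak(action.text); // the design-time text is the only input to TTS
}
\end{verbatim}
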
ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were 95, 95, and 100 (mean 96.7). The branching item is particularly instructive: in the Choregraphe study condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In all three HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session did branch execution depend on tool-operation guidance from the researcher.

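The intervention taxonomy used above (T-type tool-operation guidance, H-type platform-side behavior) lends itself to a small structured record in the trial log. The following sketch shows one plausible shape for such a record; the field names, identifiers, and the example timestamp are hypothetical, not values drawn from the study dataset.

\begin{verbatim}
// Hypothetical trial-log record; the T/H categories mirror the
// taxonomy used in this chapter.
type InterventionType =
  | "T"  // tool-operation guidance given to the wizard
  | "H"; // platform-side behavior (e.g., a step misfire)

interface InterventionRecord {
  trialId: string;
  wizardId: string;      // e.g., "W-02"
  type: InterventionType;
  atSeconds: number;     // offset from trial start
  note: string;
}

// Shape of W-02's platform-side step misfire, corrected by manual
// step selection and logged as H-type rather than a wizard error.
const example: InterventionRecord = {
  trialId: "trial-example", // hypothetical identifier
  wizardId: "W-02",
  type: "H",
  atSeconds: 90,            // illustrative, not from the dataset
  note: "Branch step misfired; corrected via manual step selection.",
};
\end{verbatim}
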
\subsection{Session Timing and Downstream Effects}

W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied.

Across all six sessions, design phase overruns are concentrated in the Choregraphe study condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe study condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator; with programming backgrounds balanced across study conditions, the overrun pattern is attributable to the assigned tool rather than to prior programming experience.

\section{Comparison to Prior Work}

The accessibility findings are consistent with prior characterizations of both tools. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study confirms that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards---spanning no, moderate, and extensive programming experience---achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform.

The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data confirms this design intent: no speech content deviations occurred across all three HRIStudio sessions, and the HRIStudio ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial.

The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. With programming backgrounds balanced across study conditions, the overall 17.5-point gap in the two conditions' means reflects a genuine tool-level effect rather than a sampling artifact. The gap is largest at the Moderate experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the Extensive level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference while the accessibility advantage remains pronounced for non-programmers and moderate programmers.

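For reference, the SUS scores discussed here follow Brooke's standard scoring~\cite{Brooke1996}: each of the ten items is rated $r_i \in \{1,\dots,5\}$, positively worded odd-numbered items contribute $r_i - 1$, negatively worded even-numbered items contribute $5 - r_i$, and the sum is scaled to a 0--100 range:
\[
\mathrm{SUS} \;=\; 2.5 \left( \sum_{i \text{ odd}} (r_i - 1) \;+\; \sum_{i \text{ even}} (5 - r_i) \right).
\]
As a worked example, a response set of all 4s on odd items and all 2s on even items yields $2.5 \times (5 \cdot 3 + 5 \cdot 3) = 75$, near the HRIStudio mean observed here.
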
\section{Limitations}

This study has several limitations that must be considered when interpreting the results.

\textbf{Trial execution without a separate test subject.} Following scheduling difficulties, the study protocol was adjusted so that the wizard executes the designed interaction directly rather than running it for a separate test subject. Because the DFS and ERS are scored against the exported project file and live observation rather than a subject's behavioral responses, this change does not affect the primary quantitative measures. The trial phase evaluates whether the wizard's design executes as specified; the presence or absence of a separate subject does not alter that criterion.

\textbf{Single task.} Both study conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.

\textbf{Balance across study conditions.} Random assignment produced a programming-background distribution that happens to be balanced: each study condition contains one wizard with no programming experience, one with moderate experience, and one with extensive experience. While this balance is favorable for interpretation, it was not guaranteed by design. The small $N$ means that balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was not assessed or controlled.

\textbf{Platform version.} HRIStudio is continuously evolving. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.

\section{Chapter Summary}

This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. HRIStudio wizards uniformly achieved perfect design fidelity (DFS 100) and near-perfect execution reliability (mean ERS 96.7), while Choregraphe wizards averaged DFS 56.7 and ERS 66.7, with design overruns in all three sessions and branching absent from the design in two. The W-01 content deviation (see Section~\ref{sec:results-qualitative}) illustrates the reproducibility problem concretely; its absence in all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across study conditions, strengthening the cross-background comparisons. The limitations of this pilot study, including sample size, task simplicity, and the single-session design, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.

\chapter{Conclusion and Future Work}
\label{ch:conclusion}

This thesis set out to address two persistent problems in Wizard-of-Oz-based social robotics research. The first is the Accessibility Problem: a high technical barrier prevents domain experts who are not programmers from conducting HRI studies independently. The second is the Reproducibility Problem: the fragmented landscape of custom tools makes it difficult to verify or replicate experimental results across studies and labs. This chapter summarizes the contributions of the work, reflects on what the pilot study results suggest, and identifies directions for future investigation.

\section{Contributions}

This thesis makes three contributions to the field of HRI research infrastructure.

\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot platform or research group; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.

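As one way to visualize these principles, the sketch below renders the hierarchy and the plugin boundary as TypeScript interfaces. The names are illustrative assumptions rather than HRIStudio's actual schema; what matters is that experiment logic references abstract action kinds, and a robot plugin, not the experiment, supplies the platform-specific commands.

\begin{verbatim}
// Illustrative sketch; hypothetical names, not HRIStudio's schema.
interface Study      { id: string; title: string; experiments: Experiment[]; }
interface Experiment { id: string; name: string; steps: Step[]; }

interface Step {
  id: string;
  name: string;
  actions: Action[];
  // Optional conditional branch, resolved during the live trial.
  branch?: { condition: string; nextStepId: string };
}

// Actions name abstract capabilities; a robot plugin maps each
// capability onto platform-specific commands (e.g., for the NAO6),
// so the experiment never depends on a concrete robot.
type Action =
  | { kind: "speech"; text: string }
  | { kind: "gesture"; gestureId: string }
  | { kind: "wait"; forWizardAdvance: true };
\end{verbatim}
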
\textbf{HRIStudio: a complete, operational platform.} The second contribution is HRIStudio, an open-source, web-based platform that fully realizes the design principles described above. HRIStudio is distributed under the MIT License and available at a public repository. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO6 used in this study. HRIStudio demonstrates that the design principles are not only technically feasible but can be delivered as a complete system that real researchers use without programming expertise, making it both an artifact and an instrument of validation. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.

\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that was undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.

The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small $N$ directional study.

On accessibility, all three HRIStudio sessions produced a perfect DFS of 100, with design phases averaging 21 minutes, all within the allocation. The most direct demonstration comes from W-05: a Chemical Engineering faculty member with no programming background trained in 6 minutes, completed a perfect design in 18 minutes, and ran the trial to conclusion. SUS scores reflect the same directional split: Choregraphe mean 59.2 (below the average benchmark of 68), HRIStudio mean 76.7 (above it).

On reproducibility, the content deviation observed in W-01's Choregraphe session (see Section~\ref{sec:results-qualitative}) illustrates the failure mode the reproducibility problem predicts. No equivalent deviation occurred in any HRIStudio session. Branching was present in the design and executed in all three HRIStudio trials; ERS means reflect the outcome: 66.7 for Choregraphe, 96.7 for HRIStudio.

\section{Future Directions}

The work described in this thesis suggests several directions for future investigation.

\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis, for example whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.

\textbf{Evaluation across multiple tasks.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.

\textbf{Longitudinal use.} This study evaluated first-session performance, which captures the initial learning curve but not longer-term practice. A longitudinal study tracking wizard performance across multiple sessions would reveal whether HRIStudio's benefits persist or diminish as wizards become proficient, and whether the tool's structured approach continues to enforce reproducibility over time.

\section{Closing Remarks}

The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it systematic. Studies are conducted with custom tools that are not always shared, by wizards whose behavior is never verified against a protocol, producing results that cannot be replicated because the exact conditions that produced them were never precisely recorded.

HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable: a web-based platform can make WoZ research accessible to domain experts who are not engineers, and execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.