mirror of
https://github.com/soconnor0919/honors-thesis.git
synced 2026-05-08 07:08:55 -04:00
refactor: update thesis protocol to remove test subjects and screen recordings, add tracking documentation, and refine bibliography entries.
@@ -9,14 +9,13 @@ This chapter interprets the results presented in Chapter~\ref{ch:results} agains
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated.
W-01's session offers preliminary evidence consistent with the accessibility problem described in Chapter~\ref{ch:background}. W-01 was a Digital Humanities faculty member with no programming background --- precisely the intended user population for tools like Choregraphe. Despite this framing, W-01 required significantly more time than allocated and generated a high volume of help requests, the majority of which concerned the tool's interface rather than the task itself. This distinction matters: W-01 understood what the specification required but could not efficiently translate that understanding into Choregraphe's behavior model. The finite state machine paradigm --- boxes, signals, and explicit connection routing --- imposed cognitive overhead on a domain expert who had no prior exposure to this abstraction.
The five completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the two completed HRIStudio sessions, both wizards achieved a DFS of 100. No HRIStudio wizard required a T-type tool-operation intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation (where to find property editing, how the branch block is configured) rather than fundamental operational barriers. By contrast, two of the three Choregraphe wizards required T-type assistance for core design tasks: W-01 for connection routing and branch wiring, and W-04 for speech content punctuation and a failed choice block attempt; W-03 required no T-type assistance but over-engineered the design beyond the specification.
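As a quick arithmetic check, the condition means above follow directly from the per-wizard scores. The sketch below transcribes the DFS values from this paragraph; the list order does not map individual wizards to scores, which this paragraph does not specify.

```python
# Check of the condition-level DFS means reported above.
# Scores transcribed from this section; ordering is not per-wizard.
from statistics import mean

choregraphe_dfs = [42.5, 65.0, 62.5]  # three Choregraphe wizards
hristudio_dfs = [100.0, 100.0]        # two completed HRIStudio wizards

print(round(mean(choregraphe_dfs), 1))  # 56.7
print(round(mean(hristudio_dfs), 1))    # 100.0
```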
W-01's SUS score of 60, below the average benchmark of 68~\cite{Brooke1996}, corroborates this observation. Post-session comments indicated that the wizard would not use Choregraphe for future HRI work without technical support, despite completing the design challenge. Together these observations establish a concrete baseline: a tool nominally designed for non-programmers nonetheless required substantial researcher support, produced a high volume of interface-level help requests, and was rated below average in usability by a domain expert with no programming background.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90 and 70 (mean 80), both above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). Across programming backgrounds, the gap is consistent: W-01 (non-programmer, Choregraphe, SUS 60) versus W-05 (non-programmer, HRIStudio, SUS 70); W-04 (moderate programmer, Choregraphe, SUS 42.5) versus W-02 (programmer, HRIStudio, SUS 90).
The HRIStudio sessions are evaluated against this baseline. The central comparison is whether wizards using HRIStudio produce higher DFS scores with fewer tool-operation interventions and higher SUS ratings. If HRIStudio's timeline-based interaction model reduces the interface friction observed with Choregraphe, those differences should appear across all three measures simultaneously; a pattern limited to one measure would call for a more qualified interpretation.
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience finished training in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (both within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than toward elaborating structures the task does not require. Choregraphe's general-purpose programming model leaves that option open, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. Neither HRIStudio wizard had that option, and both scored 100 on the DFS.
% TODO: Replace the forward-looking framing above with the actual condition-level comparison once HRIStudio sessions are complete.
% TODO: Report mean DFS, SUS, and T-type intervention counts per condition. Discuss what any gap implies for the accessibility claim.
% TODO: Add W-06 data to condition means once session is complete.
\subsection{Research Question 2: Reproducibility}
@@ -24,26 +23,27 @@ The second research question asked whether HRIStudio produces more reliable exec
This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation. By locking speech content at design time and presenting it to the wizard during execution rather than requiring re-entry, HRIStudio eliminates the structural opportunity for this substitution. Whether enforcement translates into measurably higher ERS scores is the empirical question the full dataset addresses. Complementing the ERS, the intervention log records whether any branch during the trial was resolved through programmed conditional logic or by manual re-routing, providing a parallel measure of execution reliability that is independent of the test subject's responses.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data support this design intent. No speech content deviations occurred in either completed HRIStudio session. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe condition produced the only speech substitution in the dataset (W-01), one design from which branching was absent entirely (W-03), and two in which branching was present but not implemented as conditional logic (W-01, W-04).
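The enforcement mechanism described above, locking speech content at design time and presenting it for confirmation at execution time, can be sketched in a few lines. This is a hypothetical illustration only: the class and function names below are invented and do not reflect HRIStudio's actual implementation.

```python
# Hypothetical sketch of design-time speech locking. All names are
# invented for illustration; this is not HRIStudio's actual code.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after design time
class SpeechAction:
    step_id: str
    text: str  # the locked utterance, authored during the design phase

def execute_speech(action: SpeechAction) -> str:
    # During the trial, the wizard confirms the presented text rather
    # than re-typing it, removing the structural opportunity to
    # substitute content mid-trial.
    return action.text

greeting = SpeechAction("step-1", "Would you like to hear a story?")
assert execute_speech(greeting) == "Would you like to hear a story?"
```

The design point is that substitution becomes a type error rather than a wizard discipline issue: the frozen action admits no edit path at execution time.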
% TODO: Replace the forward-looking framing above with actual ERS condition means once HRIStudio sessions are complete.
% TODO: Report whether any HRIStudio sessions produced specification deviations or required trial-phase T interventions.
ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were both 95 (mean 95). The branching item is particularly instructive: in the Choregraphe condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In both completed HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error. In neither HRIStudio session did branch execution depend on tool-operation guidance from the researcher.
% TODO: Add W-06 ERS data once session is complete.
\subsection{Session Timing and Downstream Effects}
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and compressing the trial window to approximately five minutes, well short of the intended ten. This timing pattern is itself evidence for the accessibility claim. If a tool reliably causes design phases to overrun their allocation, the downstream quality of the trial is compromised: a shorter trial produces a less complete ERS and a less representative interaction for the test subject. The difficulty of a tool does not only affect the design experience; it degrades the quality of the data that follow from it. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score.
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforce the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score.
% TODO: Report mean design phase duration per condition and note whether overruns cluster in the Choregraphe condition.
Across the five completed sessions, design phase overruns are concentrated in the Choregraphe condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (programmer) both overran in the Choregraphe condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator, and supports the conclusion that the overrun pattern is attributable to tool condition rather than wizard background alone.
\section{Comparison to Prior Work}
The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. The help request pattern observed --- conceptual understanding blocked by interface friction --- aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple.
The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. The pattern of help requests observed, in which W-01 understood the task but struggled with the tool's interface mechanisms, aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple.
The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice --- whether it reduces deviations in practice --- is what the ERS comparison will reveal.
The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice (whether it reduces deviations in practice) is what the ERS comparison will reveal.
The SUS score of 60 for Choregraphe falls below scores reported for general-purpose visual programming tools in other HCI studies, though direct comparison is complicated by task and population differences. It is consistent with the finding that domain-specific visual programming environments carry learning curves that programming experience alone does not fully resolve~\cite{Bartneck2024}.
% TODO: Add HRIStudio condition SUS mean to this section and compare to the Choregraphe baseline once sessions are complete.
The HRIStudio SUS mean of 80 (across two completed sessions) compared to the Choregraphe mean of 59.2 is consistent with this expectation. A 20-point gap is practically significant even in a pilot sample: it places the Choregraphe condition below average usability and the HRIStudio condition well above it, across wizards with different programming backgrounds. The Choregraphe score of 42.5 from W-04 falls in a range typically characterized as poor usability, a finding that is especially notable given that W-04 had moderate programming experience and engaged with the tool actively rather than passively.
\section{Limitations}
@@ -51,16 +51,14 @@ This study has several limitations that must be considered when interpreting the
\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.
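Because only descriptive statistics are warranted at $N = 6$, the quantitative summary of the five completed sessions reduces to per-condition means over the scores reported earlier in this chapter. A minimal sketch of that summary, with values transcribed from the preceding sections:

```python
# Per-condition descriptive means over the five completed sessions,
# with scores transcribed from this chapter (W-06 still pending).
from statistics import mean

scores = {
    "Choregraphe": {"DFS": [42.5, 65.0, 62.5],
                    "ERS": [65.0, 60.0, 75.0],
                    "SUS": [60.0, 75.0, 42.5]},
    "HRIStudio":   {"DFS": [100.0, 100.0],
                    "ERS": [95.0, 95.0],
                    "SUS": [90.0, 70.0]},
}

for condition, measures in scores.items():
    summary = {name: round(mean(values), 1) for name, values in measures.items()}
    print(condition, summary)
```

Running this reproduces the condition means reported above (56.7/66.7/59.2 for Choregraphe; 100/95/80 for HRIStudio); nothing beyond these descriptive comparisons is claimed.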
\textbf{Researcher as test subject.} In W-01's session, the researcher served as the test subject due to participant unavailability. The researcher had foreknowledge of the specification and the study design, which may have introduced familiarity bias into the interaction. Because the DFS and ERS are scored against recordings and exported files rather than the test subject's behavior, this limitation primarily affects the qualitative character of the trial rather than the quantitative scores.
\textbf{Compressed trial window.} W-01's trial lasted approximately five minutes rather than the intended ten. This limits the completeness of the ERS for that session, since several interaction steps were abbreviated under time pressure. Future sessions should enforce the transition to the trial phase at the 30-minute design mark regardless of completion status, consistent with the observer's role defined in the study protocol.
\textbf{Trial execution without a separate test subject.} Following scheduling difficulties, the study protocol was adjusted so that the wizard executes the designed interaction directly rather than running it for a separate test subject. Because the DFS and ERS are scored against the exported project file and live observation rather than a subject's behavioral responses, this change does not affect the primary quantitative measures. The trial phase evaluates whether the wizard's design executes as specified; the presence or absence of a separate subject does not alter that criterion.
\textbf{Single task.} Both conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Condition imbalance.} Because participants were randomly assigned, the final sample may distribute programmers unevenly across conditions, confounding the comparison. With a small $N$, random assignment does not guarantee balance across programming background.
\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time; future iterations may behave differently.
\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of the pilot study in the context of the two research questions and connected the findings to prior work. The W-01 session provides preliminary evidence for both the accessibility problem and the reproducibility problem: Choregraphe produced significant interface friction for a Digital Humanities faculty member with no programming background, and permitted a specification deviation that was undetected until the live trial. These observations are consistent with the motivating analysis in Chapter~\ref{ch:background} and anchor the comparisons that the full dataset will resolve. The limitations of this pilot study --- sample size, researcher as test subject, compressed trial window, and single task --- are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
This chapter interpreted the results of five completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. The Choregraphe condition produced a mean DFS of 56.7, mean ERS of 66.7, and mean SUS of 59.2, with design phase overruns in all three sessions and branching failures or absences in two. The two completed HRIStudio sessions produced mean DFS and ERS scores of 100 and 95 respectively, mean SUS of 80, both design phases within the allocation, and no speech content deviations. The specification deviation observed in W-01 illustrates the reproducibility problem concretely; its absence in the HRIStudio condition is consistent with the enforcement model's design intent. One HRIStudio session (W-06) remains; its results will complete the condition comparison. The limitations of this pilot study, including sample size, task simplicity, and condition imbalance by programming background, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.