feat: draft discussion chapter and update thesis structure with preliminary results and placeholder sections.
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m24s

2026-04-01 17:22:53 -04:00
parent 96057e1bf8
commit ab48109f64
6 changed files with 240 additions and 334 deletions
+35 -39
@@ -19,66 +19,56 @@ In this study, I defined two types of participants with distinct roles. Wizards
\section{Participants}
\textbf{Wizards.} I recruited six Bucknell University faculty members drawn from across departments to serve as wizards. I deliberately recruited from both ends of the programming experience spectrum, targeting participants with substantial programming backgrounds as well as those who described themselves as non-programmers or having minimal coding experience. This cross-departmental recruitment was intentional. A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email. Participation was framed as a voluntary software evaluation unrelated to any professional obligations.
\textbf{Test subjects.} I recruited one undergraduate student per wizard session to serve as a test subject, for a total matching the wizard sample. Test subjects took part in the experimental protocol coded by each wizard. To eliminate any risk of coercion, I screened participants to ensure that no test subject was enrolled in a course taught by the wizard they were paired with. Recruitment used campus flyers inviting volunteers to interact with a robot for approximately 15 minutes. There was no compensation for participation.
\textbf{Sample size rationale.} With six wizard participants ($N = 6$) and a matched number of test subjects, this sample size is appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. The size matches the scope and constraints of this honors thesis: two academic semesters, one undergraduate researcher, and no funded research assistant support. It also reflects the target population and recruitment context: faculty domain experts outside computer science with no prior NAO or Choregraphe experience are a limited pool at a small liberal arts university and have high competing time demands. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{Steinfeld2009}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof.
\section{Task}
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a glowing rock on Mars, asks the human subject a comprehension question about the story, and delivers one of two responses depending on whether the answer is correct. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:materials}.
The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching based on human-subject input, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
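As an illustrative sketch (the state and action names are mine, simplified from the full specification in Appendix~\ref{app:materials}), the same protocol takes a different shape in each tool:
\begin{itemize}
\item \emph{Choregraphe (finite state machine):} states \textsc{Intro} $\to$ \textsc{Story} $\to$ \textsc{Question}, with two transitions leaving \textsc{Question}, one fired by a correct answer into \textsc{CorrectResponse} and one fired by an incorrect answer into \textsc{IncorrectResponse}, and both response states routed onward to \textsc{End}.
\item \emph{HRIStudio (sequential timeline):} an ordered timeline of actions [\textsc{Intro}, \textsc{Story}, \textsc{Question}] followed by a single conditional branch whose two arms each hold one response action before the timeline concludes at \textsc{End}.
\end{itemize}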
\section{Robot Platform and Software Apparatus}
Both conditions used the same NAO humanoid robot (Figure~\ref{fig:nao6-photo}), a platform approximately 0.58 meters tall capable of speech synthesis, animated gestures, and head movement. Using the same hardware ensured that any differences in execution quality were attributable to the software, not the robot.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.45\textwidth]{images/nao6.jpg}
\caption{The NAO V6 humanoid robot used in both conditions of the pilot study.}
\label{fig:nao6-photo}
\end{figure}
The control condition used Choregraphe \cite{Pot2009}, a proprietary visual programming tool developed by Aldebaran Robotics and the standard software for NAO programming. Choregraphe organizes behavior as a finite state machine: nodes represent states and edges represent transitions triggered by conditions or timers.
The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through configuration files, though for this study both tools controlled the same NAO platform.
\section{Procedure}
Each wizard completed a single 60-minute session structured in four phases. Each session was run by one wizard and included one test subject during the trial phase.
\subsection{Phase 1: Training (15 minutes)}
I opened each session with a standardized tutorial tailored to the wizard's assigned tool. The tutorial covered how to create speech actions, specify gestures, define conditional branches, and save the completed design. Training was intentionally allocated 15 minutes to give wizards enough time to ask clarifying questions about the tool before the design challenge began, while still simulating a first encounter with a new tool without extensive onboarding. I answered clarification questions during this phase but did not offer hints about the design challenge.
\subsection{Phase 2: Design Challenge (30 minutes)}
The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. I observed and recorded a screen capture of the wizard's workflow throughout. Using a structured observer data sheet, I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
\subsection{Phase 3: Live Trial (10 minutes)}
After the design phase, a test subject entered the room and the wizard ran their completed program to control the robot during an actual interaction. I video-recorded the full trial to capture robot behavior and timing. I told the test subject they were helping evaluate the robot's performance, not being evaluated themselves. I continued logging any researcher interventions during the trial using the same type categories, noting the relevant ERS rubric item for any tool-operation intervention.
\subsection{Phase 4: Debrief (5 minutes)}
Following the trial, the wizard exported their completed project file and completed the System Usability Scale survey. The exported project file, screen recording, and trial video served as the primary artifacts for post-session scoring.
\section{Measures}
\label{sec:measures}
@@ -87,42 +77,48 @@ The study collected four measures, two primary and two supplementary.
\subsection{Design Fidelity Score}
The Design Fidelity Score (DFS) measures how completely and correctly the wizard implemented the paper specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored on whether the component is present, correct, and independently achieved.
The DFS rubric includes an \emph{Assisted} column. For each rubric item, the researcher marks T if a tool-operation intervention was given specifically for that item during the design phase --- for example, if the researcher explained how to add a gesture node or how to wire a conditional branch. T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions --- task clarification, hardware issues, or momentary forgetfulness --- are not marked T, because those categories of difficulty are independent of the tool under evaluation.
This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses accessibility: did the tool allow a wizard to independently produce a correct design?
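To make the scoring arithmetic concrete, the DFS can be written as a normalized weighted sum (a sketch of the computation; the individual item weights are those listed in the rubric in Appendix~\ref{app:materials}). Assuming criterion $i$ carries weight $w_i$ and earns $s_i \in [0, w_i]$ points from its present/correct checks,
\begin{equation*}
\mathrm{DFS} = \frac{\sum_{i=1}^{9} s_i}{\sum_{i=1}^{9} w_i} \times 100,
\end{equation*}
so, for example, a design earning 14 of 20 available points would score 70. Assisted (T) marks annotate individual items but, as noted above, do not enter this sum.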
\subsection{Execution Reliability Score}
The Execution Reliability Score (ERS) measures whether the designed interaction executed as intended during the live trial. I reviewed the video recording against the specification and the wizard's design. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch resolved correctly based on the test subject's answer, and whether any errors, disconnections, or hangs occurred.
The ERS rubric applies the same \emph{Assisted} modifier as the DFS, extended to the trial phase. Any tool-operation intervention I provided during the trial --- for example, explaining to the wizard how to launch or advance their program --- caps the affected ERS item at half points. This is scored separately from design-phase interventions: a wizard who needed help only during design can still achieve a full ERS score if the trial runs without assistance, and vice versa. The rubric also records whether the trial reached its conclusion step and whether the test subject was a recruited participant or the researcher, since foreknowledge of the specification on the part of the test subject represents a qualitatively different trial condition. I additionally note whether any branch resolved through programmed conditional logic or through manual intervention by the wizard during execution.
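As a sketch of how the cap interacts with scoring (the item weights themselves are defined in the rubric in Appendix~\ref{app:materials}), if ERS item $i$ has weight $w_i$ and would earn $s_i$ points from the video review, the awarded points are
\begin{equation*}
s_i' =
\begin{cases}
\min\left(s_i, \frac{w_i}{2}\right) & \text{if a trial-phase tool-operation intervention affected item } i,\\[2pt]
s_i & \text{otherwise,}
\end{cases}
\end{equation*}
so an item that executed perfectly, but only after I explained how to launch or advance the program, earns at most half its weight.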
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent; detecting and logging exactly this kind of deviation is a core design goal of HRIStudio~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses reproducibility: did the design translate reliably into execution without researcher support?
\subsection{System Usability Scale}
The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability \cite{Brooke1996}. Wizards completed the SUS during the debrief phase. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:materials}.
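For reference, the standard SUS scoring procedure~\cite{Brooke1996} maps each five-point item response $s_i \in \{1, \dots, 5\}$ to a 0--4 contribution and rescales the total to 0--100:
\begin{equation*}
\mathrm{SUS} = 2.5 \left( \sum_{i \, \text{odd}} (s_i - 1) + \sum_{i \, \text{even}} (5 - s_i) \right),
\end{equation*}
where odd-numbered items are positively worded and even-numbered items are negatively worded.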
\subsection{Intervention Log and Session Timing}
During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. Intervention type codes are: T (tool-operation), C (task or specification clarification), H (hardware or technical issue), and G (general or forgetfulness). Only T-type interventions affect rubric scoring; the others are recorded to provide context for interpreting session flow and wizard experience. I also recorded the actual duration of each session phase and the time at which the wizard completed or abandoned the design, providing supplementary evidence about tool accessibility beyond the DFS score itself.
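The excerpt below illustrates the log format; these entries are hypothetical examples, not data from any session.
\begin{center}
\footnotesize
\begin{tabular}{|l|l|l|p{6.2cm}|}
\hline
\textbf{Time} & \textbf{Type} & \textbf{Item} & \textbf{Description} \\
\hline
00:04:10 & T & 3 & Explained how to attach a gesture to a speech action \\
\hline
00:11:32 & C & --- & Clarified that the comprehension question has exactly two outcomes \\
\hline
00:18:05 & H & --- & Robot lost its network connection; reconnected and resumed \\
\hline
\end{tabular}
\end{center}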
\section{Measurement Instruments}
Table~\ref{tbl:measurement_instruments} summarizes the five instruments, when they were collected, and which research question each addresses.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|p{3.0cm}|p{4.4cm}|p{2.4cm}|p{2.8cm}|}
\hline
\textbf{Instrument} & \textbf{What it captures} & \textbf{When collected} & \textbf{Research question} \\
\hline
Design Fidelity Score (DFS) & Completeness and correctness of the wizard's implementation; tool-operation assistance marked per item and reported separately & Post-session file review & Accessibility \\
\hline
Execution Reliability Score (ERS) & Whether the interaction executed as designed during the trial; caps items where trial-phase tool assistance occurred & Post-trial video review & Reproducibility \\
\hline
System Usability Scale (SUS) & Wizard's perceived usability of the assigned tool & Debrief phase & User experience \\
\hline
Intervention Log & Timestamped record of all researcher assistance by type (T/C/H/G) and affected rubric item & Throughout session & Supplementary \\
\hline
Session Timing & Actual duration of each phase; time to design completion & Throughout session & Supplementary \\
\hline
\end{tabular}
\caption{Measurement instruments used in the pilot validation study.}
@@ -131,4 +127,4 @@ Time-to-Completion \& Help Requests & Task duration and support requests during
\section{Chapter Summary}
This chapter described a pilot between-subjects study I designed to test whether the design principles formalized in Chapters~\ref{ch:design} and~\ref{ch:implementation} produce measurably different outcomes from existing practice. Six wizard participants ($N = 6$), drawn from across departments and spanning the programming experience spectrum, each designed and ran the Interactive Storyteller task on a NAO robot using either HRIStudio or Choregraphe. Each 60-minute session was structured in four phases: a 15-minute standardized tutorial, a 30-minute design challenge, a 10-minute live trial, and a 5-minute debrief. I measured design fidelity (DFS) and execution reliability (ERS) against the written specification, recording tool-operation assistance per rubric item and capping any ERS criterion for which trial-phase assistance was given. I also collected perceived usability via the SUS, a structured intervention log categorizing all researcher assistance by type, and session phase timings. Chapter~\ref{ch:results} presents the results.
+104 -3
@@ -1,8 +1,109 @@
\chapter{Results}
\label{ch:results}
This chapter presents the results of the pilot validation study described in Chapter~\ref{ch:evaluation}. Because this is a small pilot, I report descriptive statistics and qualitative observations rather than inferential tests. The goal is directional evidence: do the patterns in the data suggest that HRIStudio changes what wizards can produce and how reliably they can produce it?
\section{Participant Overview}
% TODO: Update session counts when all sessions are complete.
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. Demographic information (programming background: programmer or non-programmer) was collected during recruitment.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Condition} & \textbf{Background} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} & \textbf{Design Time} \\
\hline
W-01 & Choregraphe & Programmer & 70 & 65 & 60 & 35 min \\
\hline
W-02 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
\hline
W-03 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
\hline
W-04 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
\hline
W-05 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
\hline
W-06 & \textit{[PLACEHOLDER]} & \textit{[PLACEHOLDER]} & --- & --- & --- & --- \\
\hline
\end{tabular}
\caption{Summary of wizard participants, conditions, and scores. Rows marked PLACEHOLDER are pending completion.}
\label{tbl:sessions}
\end{table}
\section{Primary Measures}
\subsection{Design Fidelity Score}
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification. Scores range from 0 to 100, with full points awarded only when a component is both present and correct.
W-01 (Choregraphe) received a DFS of 70. Analysis of the exported project file indicated that all four interaction steps were present and correctly sequenced, and the conditional branch was implemented and functional. However, W-01 deviated from the specification by changing the rock's color from the specified red to a different color, so the narrative speech and comprehension question no longer matched the written protocol. This reduced the ``Correct'' scores for speech items 2 and 3. The open-hand introduction gesture was present and correctly executed; at least one narrative gesture was included; and both branch responses were implemented, though the correct-branch response speech was also modified to reflect the changed rock color.
% TODO: Add DFS scores for remaining participants and compute condition means when data collection is complete.
% TODO: Add a bar chart or table comparing DFS by condition.
\textit{[PLACEHOLDER: DFS results for W-02 through W-0X will be reported here. Condition means and ranges will be summarized in a table.]}
\subsection{Execution Reliability Score}
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes, which was shorter than anticipated due to the design phase overrunning the scheduled window. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color, as described above. The comprehension question was delivered, the branching logic resolved correctly based on the test subject's response, and the appropriate branch response was given. Gesture synchronization was partial: the pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
% TODO: Add ERS scores for remaining participants and compute condition means.
% TODO: Note any systematic patterns in execution failures across conditions.
\textit{[PLACEHOLDER: ERS results for W-02 through W-0X will be reported here.]}
\subsection{System Usability Scale}
W-01 rated Choregraphe with a SUS score of 60. The standard benchmark for SUS scores places 68 as the average; scores below 68 are generally considered below average usability~\cite{Brooke1996}. A score of 60 suggests that W-01 found Choregraphe marginal in usability despite having a programming background, which is consistent with the large number of help requests observed during the design phase.
% TODO: Add SUS scores for remaining participants. Report condition means.
\textit{[PLACEHOLDER: SUS scores for W-02 through W-0X will be reported here.]}
\section{Supplementary Measures}
\subsection{Session Timing}
Table~\ref{tbl:timing} summarizes the time spent in each phase per session.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Training} & \textbf{Design} & \textbf{Trial} & \textbf{Debrief} & \textbf{Total} \\
\hline
W-01 & 15 min & 35 min & 5 min & 5 min & 60 min \\
\hline
W-02 & --- & --- & --- & --- & --- \\
\hline
W-03 & --- & --- & --- & --- & --- \\
\hline
W-04 & --- & --- & --- & --- & --- \\
\hline
W-05 & --- & --- & --- & --- & --- \\
\hline
W-06 & --- & --- & --- & --- & --- \\
\hline
\end{tabular}
\caption{Time spent in each session phase per wizard participant.}
\label{tbl:timing}
\end{table}
W-01's design phase ran to 35 minutes, five minutes past the 30-minute allocation, compressing the trial to 5 minutes, half its intended length. Despite this, W-01 declared the design complete rather than abandoning it, and the robot did execute a recognizable version of the specification during the trial.
\subsection{Researcher Interventions}
% TODO: Report intervention counts and types (T/C/H/G) for all sessions.
W-01's design phase required a substantial number of researcher interventions, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern --- understanding the goal but struggling with the mechanism --- is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
\textit{[PLACEHOLDER: Intervention counts and categories for all sessions will be reported here.]}
\section{Qualitative Findings}
\subsection{Observed Specification Deviation}
A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation was undetected until the live trial, when the researcher --- serving as test subject --- did not correctly identify the rock color and triggered the incorrect-answer branch. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.
\subsection{Wizard Experience}
% TODO: Add qualitative observations from remaining sessions.
W-01 expressed that the training was comprehensible and that the underlying logic of the task was clear. The primary source of frustration was Choregraphe's interface for handling conditional branches and managing the timing of parallel behaviors. Post-session comments suggested that the wizard would not use Choregraphe independently for future HRI work without technical support.
\textit{[PLACEHOLDER: Qualitative observations from remaining sessions will be reported here.]}
\section{Chapter Summary}
% TODO: Update summary when all sessions are complete.
This chapter presented the results from the pilot validation study. To date, one Choregraphe condition session has been completed (W-01), yielding a DFS of 70, ERS of 65, and SUS of 60. Qualitative observations from this session provide preliminary evidence for both the accessibility problem (substantial researcher intervention and a design-phase overrun) and the reproducibility problem (an unprompted specification deviation undetected until the live trial). Remaining sessions will add data for both conditions; Chapter~\ref{ch:discussion} interprets the available findings in the context of the research questions.
+58 -3
@@ -1,11 +1,66 @@
\chapter{Discussion}
\label{ch:discussion}
This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study. Where the pilot data derives from an initial subset of sessions, I treat those observations as preliminary evidence and establish the analytical framework that governs interpretation of the full dataset.
\section{Interpretation of Findings}
\subsection{Research Question 1: Accessibility}
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe condition provides the baseline against which this question is evaluated.
W-01's session offers preliminary evidence consistent with the accessibility problem described in Chapter~\ref{ch:background}. W-01 was a Digital Humanities faculty member with a programming background but no prior exposure to the NAO robot or Choregraphe. Even so, W-01 required significantly more time than allocated and generated a high volume of help requests, the majority of which concerned the tool's interface rather than the task itself. This distinction matters: W-01 understood what the specification required but could not efficiently translate that understanding into Choregraphe's behavior model, and the barrier for the non-programmers who are the tool's nominal audience is likely higher still. The finite state machine paradigm --- boxes, signals, and explicit connection routing --- imposed cognitive overhead on a domain expert who had no prior exposure to this abstraction.
W-01's SUS score of 60, below the average benchmark of 68~\cite{Brooke1996}, corroborates this observation. Post-session comments indicated that the wizard would not use Choregraphe for future HRI work without technical support, despite completing the design challenge. Together these observations establish a concrete baseline: a tool nominally designed for non-programmers nonetheless required substantial researcher support, produced a high volume of interface-level help requests, and was rated below average in usability even by a domain expert with programming experience.
The HRIStudio sessions are evaluated against this baseline. The central comparison is whether wizards using HRIStudio produce higher DFS scores with fewer tool-operation interventions and higher SUS ratings. If HRIStudio's timeline-based interaction model reduces the interface friction observed with Choregraphe, those differences should appear across all three measures simultaneously; a pattern limited to one measure would call for a more qualified interpretation.
% TODO: Replace the forward-looking framing above with the actual condition-level comparison once HRIStudio sessions are complete.
% TODO: Report mean DFS, SUS, and T-type intervention counts per condition. Discuss what any gap implies for the accessibility claim.
\subsection{Research Question 2: Reproducibility}
The second research question asked whether HRIStudio produces more reliable execution of a designed interaction compared to Choregraphe. The most instructive finding from W-01's session is not a score but an incident: without any technical failure, the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the written protocol. This deviation was not caught during the design phase, was not flagged by the tool, and was only discovered during the live trial.
This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation. By locking speech content at design time and presenting it to the wizard during execution rather than requiring re-entry, HRIStudio eliminates the structural opportunity for this substitution. Whether enforcement translates into measurably higher ERS scores is the empirical question the full dataset addresses. Complementing the ERS, the intervention log records whether any branch during the trial was resolved through programmed conditional logic or by manual re-routing, providing a parallel measure of execution reliability that is independent of the test subject's responses.
% TODO: Replace the forward-looking framing above with actual ERS condition means once HRIStudio sessions are complete.
% TODO: Report whether any HRIStudio sessions produced specification deviations or required trial-phase T interventions.
\subsection{Session Timing and Downstream Effects}
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and compressing the trial window to approximately five minutes, well short of the intended ten. This timing pattern is itself evidence for the accessibility claim. If a tool reliably causes design phases to overrun their allocation, the downstream quality of the trial is compromised: a shorter trial produces a less complete ERS and a less representative interaction for the test subject. The difficulty of a tool does not only affect the design experience; it degrades the quality of the data that follow from it. Phase-by-phase timing data collected across all sessions will reveal whether design phase overruns are characteristic of one condition rather than the other, constituting a supplementary indicator of tool accessibility independent of the DFS score.
% TODO: Report mean design phase duration per condition and note whether overruns cluster in the Choregraphe condition.
\section{Comparison to Prior Work}
The findings from W-01's session are broadly consistent with prior characterizations of Choregraphe's usability profile. Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. The help request pattern observed --- conceptual understanding blocked by interface friction --- aligns with Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple.
The specification deviation observed in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment itself, making deviation structurally harder rather than formally detectable after the fact. The practical consequence of this design choice --- whether it reduces deviations in practice --- is what the ERS comparison will reveal.
The SUS score of 60 for Choregraphe falls below scores reported for general-purpose visual programming tools in other HCI studies, though direct comparison is complicated by task and population differences. It is consistent with the finding that domain-specific visual programming environments carry learning curves that programming experience alone does not fully resolve~\cite{Bartneck2024}.
% TODO: Add HRIStudio condition SUS mean to this section and compare to the Choregraphe baseline once sessions are complete.
\section{Limitations}
This study has several limitations that must be considered when interpreting the findings.
\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.
\textbf{Researcher as test subject.} In W-01's session, the researcher served as the test subject due to participant unavailability. The researcher had foreknowledge of the specification and the study design, which may have introduced familiarity bias into the interaction. Because the DFS and ERS are scored against recordings and exported files rather than the test subject's behavior, this limitation primarily affects the qualitative character of the trial rather than the quantitative scores.
\textbf{Compressed trial window.} W-01's trial lasted approximately five minutes rather than the intended ten. This limits the completeness of the ERS for that session, since several interaction steps were abbreviated under time pressure. Future sessions should enforce the transition to the trial phase at the 30-minute design mark regardless of completion status, consistent with the observer's role defined in the study protocol.
\textbf{Single task.} Both conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Condition imbalance.} Because participants were randomly assigned to conditions, programming backgrounds may end up distributed unevenly across conditions, which would confound the comparison. With a small $N$, random assignment does not guarantee balance across programming background.
\textbf{Platform version.} HRIStudio is under active development. The version used in this study represents the system at a specific point in time; future iterations may behave differently.
\section{Chapter Summary}
This chapter interpreted the results of the pilot study in the context of the two research questions and connected the findings to prior work. The W-01 session provides preliminary evidence for both the accessibility problem and the reproducibility problem: Choregraphe produced significant interface friction even for a faculty member with a programming background, and permitted a specification deviation that was undetected until the live trial. These observations are consistent with the motivating analysis in Chapter~\ref{ch:background} and anchor the comparisons that the full dataset will resolve. The limitations of this pilot study --- sample size, researcher as test subject, compressed trial window, and single task --- are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
+42 -2
@@ -1,8 +1,48 @@
\chapter{Conclusion and Future Work}
\label{ch:conclusion}
This thesis set out to address two persistent problems in Wizard-of-Oz-based Human-Robot Interaction research. The first is the Accessibility Problem: a high technical barrier prevents domain experts who are not programmers from conducting HRI studies independently. The second is the Reproducibility Problem: the fragmented landscape of custom tools makes it difficult to verify or replicate experimental results across studies and labs. This chapter summarizes the contributions of the work, reflects on what the pilot study results suggest, and identifies directions for future investigation.
\section{Contributions}
This thesis makes three contributions to the field of Human-Robot Interaction research infrastructure.
\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot or institution; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.
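As an illustrative decomposition (the labels are descriptive, not identifiers from the platform), the pilot task maps onto this hierarchy as follows:
\begin{itemize}
\item \emph{Study}: the pilot evaluation described in Chapter~\ref{ch:evaluation};
\item \emph{Experiment}: the Interactive Storyteller scenario;
\item \emph{Steps}: the introduction, the narration, the comprehension question, and the two response branches;
\item \emph{Actions}: the individual speech utterances, gesture commands, and branch condition within each step.
\end{itemize}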
\textbf{HRIStudio: a reference implementation.} The second contribution is HRIStudio, an open-source, web-based platform that realizes the design principles described above. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO V6 used in this study. As a reference implementation, HRIStudio demonstrates that the design principles are technically feasible and can be delivered in a form that real researchers can use without programming expertise. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.
\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that was undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.
\section{Reflection on Research Questions}
The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small-N directional study.
On accessibility, the Choregraphe condition demonstrates that even a tool described as suitable for non-programmers creates significant interface friction in practice. A wizard with programming experience required more time than allocated, generated a high volume of tool-level help requests, and rated the tool below the average SUS benchmark. The finite state machine model --- boxes connected by signals --- imposed cognitive overhead that domain knowledge of the task alone could not resolve. If HRIStudio's timeline-based model and guided workflow reduce that overhead, the difference should appear as higher DFS scores, fewer tool-operation interventions, and higher SUS ratings across the full sample.
On reproducibility, the specification deviation observed in the Choregraphe session illustrates why enforcement matters. A tool that allows wizards to freely edit speech content at any point in the design process creates opportunities for drift that are invisible until they surface during execution. HRIStudio's protocol enforcement forecloses this class of deviation by construction --- speech is locked at design time and surfaced during execution rather than re-entered. Whether this architectural choice translates into measurably higher execution reliability scores, and whether the proportion of tool-assisted branching resolution differs between conditions, are the questions the full dataset answers.
% TODO: Once all sessions are complete, rewrite the Reflection section with actual condition means for DFS, ERS, and SUS.
% TODO: Replace the forward-looking framing in both RQ paragraphs with concrete comparative analysis.
% TODO: Update the chapter intro sentence ("The evidence suggests yes...") to reflect the actual direction of the findings.
\section{Future Directions}
The work described in this thesis suggests several directions for future investigation.
\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis --- for example, whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
\textbf{Multi-task evaluation.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.
\textbf{Longitudinal use.} This study evaluated first-session performance, which captures the initial learning curve but not longer-term practice. A longitudinal study tracking wizard performance across multiple sessions would reveal whether HRIStudio's benefits persist or diminish as wizards become proficient, and whether the tool's structured approach continues to enforce reproducibility over time.
\textbf{Observer and researcher roles.} HRIStudio's role-based architecture includes Observer and Researcher roles that were not formally evaluated in this study. Future work should investigate how these roles support team coordination in multi-experimenter studies, and whether the annotation and logging capabilities they enable produce analysis workflows that are meaningfully more efficient than manual video coding.
\textbf{Platform expansion.} The NAO integration used in this study is one instance of HRIStudio's plugin architecture. Extending the plugin ecosystem to include mobile robots, socially assistive robots, and non-humanoid platforms would broaden the system's applicability and test whether the plugin abstraction is sufficiently general to accommodate the range of robot capabilities used in published HRI research.
\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs --- including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.
\section{Closing Remarks}
The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it rigorous. Studies are conducted with custom tools that are never shared, by wizards whose behavior is never verified against a protocol, producing results that cannot be replicated because the conditions that produced them were never precisely recorded.
HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable --- that a web-based platform can make WoZ research accessible to domain experts who are not engineers, and that execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.
+1 -287
@@ -1,290 +1,4 @@
\chapter{Study Materials}
\label{app:materials}
\textit{[PLACEHOLDER: Study materials will be inserted here. Content includes recruitment materials, paper specification, consent forms, SUS questionnaire, Design Fidelity Score rubric, Execution Reliability Score rubric, observer data sheet, and training protocol.]}
\section{Recruitment Materials}
\subsection*{Email Invitation (Wizard Participants)}
\textit{Subject: Invitation to evaluate Human-Robot Interaction software (International Snacks provided!)}
Dear [Professor Name],
I am conducting an honors thesis study to evaluate ``HRIStudio'', a new platform for designing human-robot interactions. I am seeking faculty members from across the university to act as evaluators by participating in a 75-minute Wizard-of-Oz design session.
You will be asked to spend 30 minutes programming a simple behavior on the NAO robot using either HRIStudio or Choregraphe, and then run it live with a student volunteer. No prior experience with the NAO robot is required.
International snacks and refreshments will be provided during the session. If you are willing to participate, please reply to schedule a time.
\hfill Sean O'Connor (\texttt{sso005@bucknell.edu})
\subsection*{Campus Flyer (Test Subject Participants)}
\begin{center}
\textbf{\large VOLUNTEERS NEEDED: INTERACT WITH A ROBOT!}
\vspace{0.4cm}
Participate in a short 15-minute session with a NAO humanoid robot.
\vspace{0.4cm}
\textbf{Snacks from around the world will be provided!}
\vspace{0.2cm}
Contact: \texttt{sso005@bucknell.edu}
\end{center}
\section{Informed Consent Forms}
\subsection*{Wizard Participant Consent Form}
\textbf{HRIStudio User Study --- Informed Consent (Faculty/Wizard Participant)}
\textbf{Introduction:} You are invited to participate in a research study evaluating a new software platform for the NAO robot. This study is conducted by Sean O'Connor (Student PI) and Dr.~L.~Felipe Perrone (Advisor) in the Department of Computer Science at Bucknell University.
\textbf{Purpose:} The purpose of this study is to compare the usability and reproducibility of a new visual programming tool (HRIStudio) against the standard software (Choregraphe).
\textbf{Procedures:} If you agree to participate, you will complete the following in a single 75-minute session:
\begin{enumerate}
\item \textbf{Training (15 min):} A brief tutorial on your assigned software interface covering speech, gesture, and branching.
\item \textbf{Design Challenge (30 min):} You will receive a written storyboard and program it on the NAO robot using your assigned tool.
\item \textbf{Live Trial (15 min):} A student volunteer will enter the room and you will run your program to deliver the story to them.
\item \textbf{Debrief (15 min):} You will complete a short usability survey.
\end{enumerate}
\textbf{Data Collection:} Your workflow will be screen-recorded during the design phase. The live trial will be video recorded to verify robot behavior. All data will be stored on encrypted drives and your identity replaced with a numerical code (e.g., W-01).
\textbf{Risks and Benefits:} There are no known risks beyond those of normal computer use. You will receive international snacks and refreshments during the session. Your participation contributes to research on accessible tools for HRI.
\textbf{Voluntary Participation:} Participation is entirely voluntary and unrelated to any departmental obligations. You may withdraw at any time without penalty.
\textbf{Questions:} Contact Sean O'Connor (\texttt{sso005@bucknell.edu}) or the Bucknell IRB (\texttt{irb@bucknell.edu}).
\vspace{0.8cm}
\noindent\rule{0.55\textwidth}{0.4pt}\\
Signature of Participant \hspace{4cm} Date
\vspace{1.2cm}
\subsection*{Test Subject Consent Form}
\textbf{HRIStudio User Study --- Informed Consent (Student/Test Subject)}
\textbf{Introduction:} You are invited to participate in a 15-minute robot interaction session as part of a research study conducted in the Bucknell Computer Science Department.
\textbf{Procedure:} You will enter a lab room and listen to a short story told by a NAO humanoid robot. The robot will then ask you a comprehension question. The interaction takes approximately 5--10 minutes.
\textbf{Data Collection:} The session will be video recorded to analyze the robot's timing and behavior. Your responses are not being graded; we are evaluating the robot's performance, not yours.
\textbf{Risks and Benefits:} Minimal risk. You will receive international snacks and refreshments for your time.
\textbf{Voluntary Participation:} You may stop the interaction and leave at any time without penalty.
\vspace{0.8cm}
\noindent\rule{0.55\textwidth}{0.4pt}\\
Signature of Participant \hspace{4cm} Date
\section{Paper Specification: The Interactive Storyteller}
\textit{This document was given to each wizard participant at the start of the Design Phase.}
\textbf{Goal:} Program the robot to tell a short interactive story to a participant. The robot must introduce the story, deliver the narrative with appropriate gestures, ask a comprehension question, and respond to the participant's answer.
\textbf{Script and Logic Flow:} \textit{(A code sketch of this branching logic appears after the numbered steps.)}
\begin{enumerate}
\item \textbf{Start State}
\begin{itemize}
\item Robot is standing and looking at the participant.
\end{itemize}
\item \textbf{Step 1 --- The Hook}
\begin{itemize}
\item \textbf{Speech:} ``Hello. I want to tell you about someone named Dara ---
an astronaut who made a decision that changed what we thought we knew about Mars.
Are you ready?''
\item \textbf{Gesture:} Perform a slow open-hand gesture toward the participant, then lower both arms and stand still before continuing.
\end{itemize}
\item \textbf{Step 2 --- The Narrative}
\begin{itemize}
\item \textbf{Speech:} ``It was 2147. Dara's crew had been on the Martian surface for six days.
Mission protocol said to collect samples, document the terrain, and stay on schedule.
But on the sixth morning, while the rest of the crew ran diagnostics,
Dara wandered off course.
About forty meters from camp, she stopped.
Half-buried in the dust was a rock she almost stepped on ---
smooth, the size of a fist, and glowing a deep, steady red.
Not reflecting sunlight. Glowing.
She knelt down, picked it up, and said nothing to anyone.''
\item \textbf{Gesture 1:} As the robot says ``stay on schedule,'' make a precise, dismissive hand wave.
\item \textbf{Gesture 2:} As the robot says ``she stopped,'' pause all motion for one full second.
\item \textbf{Gesture 3:} As the robot says ``glowing a deep, steady red,'' look slowly downward.
\item \textbf{Gesture 4:} As the robot says ``said nothing to anyone,'' lean slightly forward and lower the voice.
\end{itemize}
\item \textbf{Step 3 --- Comprehension Check (Branching)}
\begin{itemize}
\item \textbf{Speech:} ``She brought it home.
The mission report listed it as an anomalous geological sample.
NASA has been running tests on it ever since.
No one has published anything yet.''
\item \textbf{Gesture:} Stand upright, look directly at the participant, and pause for one full second.
\item \textbf{Question:} ``What color was the rock Dara found?''
\item \textbf{Branch A (Correct answer: ``Red'' or ``red''):}
\begin{itemize}
\item \textbf{Speech:} ``Red. And it was still glowing when she landed.''
\item \textbf{Gesture:} Robot nods once, slowly.
\end{itemize}
\item \textbf{Branch B (Any other answer):}
\begin{itemize}
\item \textbf{Speech:} ``Actually, red. Not reflecting light --- emitting it.''
\item \textbf{Gesture:} Robot shakes head once.
\end{itemize}
\end{itemize}
\item \textbf{Step 4 --- Conclusion}
\begin{itemize}
\item \textbf{Speech:} ``That was six years ago.
The rock is in a lab in Houston.
Dara still hasn't told anyone exactly where she found it.
That's the end of the story.''
\item \textbf{Gesture:} Stand still, lower arms to sides, and bow.
\end{itemize}
\end{enumerate}
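The branching structure above is the heart of the task: a single comprehension check that routes to one of two scripted responses. As a minimal sketch, the control flow can be written as follows, assuming invented placeholder primitives (\texttt{say}, \texttt{gesture}, \texttt{listen}); these names are hypothetical and do not correspond to the HRIStudio, Choregraphe, or NAO APIs.
\begin{verbatim}
// A minimal sketch of the Interactive Storyteller control flow.
// say(), gesture(), and listen() are invented placeholders for
// exposition only; they are not HRIStudio, Choregraphe, or NAO APIs.
declare function say(text: string): Promise<void>;
declare function gesture(name: string): Promise<void>;
declare function listen(): Promise<string>;

async function interactiveStoryteller(): Promise<void> {
  // Step 1: The Hook
  await say("Hello. I want to tell you about someone named Dara ...");
  await gesture("open hand toward participant, then lower arms");

  // Step 2: The Narrative (gestures cued to phrases in the script)
  await say("It was 2147. Dara's crew had been on the Martian surface ...");

  // Step 3: Comprehension Check (Branching)
  await say("What color was the rock Dara found?");
  const answer = await listen();
  if (answer.toLowerCase().includes("red")) {
    // Branch A: correct answer ("Red" or "red")
    await say("Red. And it was still glowing when she landed.");
    await gesture("nod once, slowly");
  } else {
    // Branch B: any other answer
    await say("Actually, red. Not reflecting light --- emitting it.");
    await gesture("shake head once");
  }

  // Step 4: Conclusion
  await say("That was six years ago. ... That's the end of the story.");
  await gesture("lower arms to sides and bow");
}
\end{verbatim}
Note that Branch A accepts ``Red'' or ``red''; the sketch normalizes case and matches the substring, which is one reasonable reading of that rule but not the only one.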
\section{Post-Study Questionnaire (System Usability Scale)}
\textit{Completed by wizard participants after the live trial. Circle the number that best reflects your agreement with each statement.}
\vspace{0.4cm}
\noindent
\renewcommand{\arraystretch}{2.2}
\begin{tabularx}{\linewidth}{X *{5}{>{\centering\arraybackslash}p{0.85cm}}}
\textbf{Statement} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} \\
\textit{\footnotesize (Circle one per row)}
& \textit{\footnotesize SD} & & & & \textit{\footnotesize SA} \\
\hline
1.\enspace I think that I would like to use this system frequently.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
2.\enspace I found the system unnecessarily complex.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
3.\enspace I thought the system was easy to use.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
4.\enspace I think that I would need the support of a technical person to be able to use this system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
5.\enspace I found the various functions in this system were well integrated.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
6.\enspace I thought there was too much inconsistency in this system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
7.\enspace I would imagine that most people would learn to use this system very quickly.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
8.\enspace I found the system very cumbersome to use.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
9.\enspace I felt very confident using the system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
10.\enspace I needed to learn a lot of things before I could get going with this system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
\hline
\end{tabularx}
\renewcommand{\arraystretch}{1}
\vspace{0.4cm}
\noindent\textit{\footnotesize SD = Strongly Disagree \quad SA = Strongly Agree}
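\vspace{0.2cm}
\noindent For reference, assuming the standard Brooke scoring for the SUS: odd-numbered items contribute their rating minus one, even-numbered items contribute five minus their rating, and the sum is scaled to the 0--100 range,
\[
\mathrm{SUS} = 2.5 \left[\, \sum_{i \in \{1,3,5,7,9\}} (r_i - 1) \;+\; \sum_{i \in \{2,4,6,8,10\}} (5 - r_i) \right],
\]
where $r_i \in \{1, \dots, 5\}$ is the circled rating for statement $i$.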
\section{Design Fidelity Score Rubric}
\textit{To be completed by the researcher after analyzing the exported project file.}
\vspace{0.3cm}
\noindent\textbf{Participant ID:} \underline{\hspace{3cm}} \hspace{1cm} \textbf{Condition:} \underline{\hspace{3cm}}
\vspace{0.4cm}
\renewcommand{\arraystretch}{1.6}
\begin{tabularx}{\linewidth}{X >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.4cm}}
\hline
\textbf{Component} & \textbf{Present} & \textbf{Correct} & \textbf{Points} \\
\hline
\multicolumn{4}{l}{\textbf{Speech Actions (40 points total)}} \\
\hline
1.\enspace Introduction speech (``Hello. I want to tell you about someone named Dara\ldots'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
2.\enspace Narrative speech (``It was 2147. Dara's crew\ldots'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
3.\enspace Question speech (``What color was the rock Dara found?'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
4.\enspace Response speeches (correct and incorrect branches) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Gestures and Actions (30 points total)}} \\
\hline
5.\enspace Open-hand gesture during introduction & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
6.\enspace At least two narrative gestures (pause, lean, gaze) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
7.\enspace Nod (correct branch) or head shake (incorrect branch) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Control Flow and Logic (30 points total)}} \\
\hline
8.\enspace Conditional branch triggers on participant's answer & Y~~/~~N & Y~~/~~N & ~~~~~/15 \\
9.\enspace Correct sequencing of all four steps & Y~~/~~N & Y~~/~~N & ~~~~~/15 \\
\hline
\end{tabularx}
\renewcommand{\arraystretch}{1}
\vspace{0.4cm}
\noindent\textbf{Scoring:} Award full points if both Present \emph{and} Correct; 50\% if Present but not Correct; 0 if not Present.
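\noindent Equivalently, writing $w_i$ for the point value of item $i$ and letting $P_i, C_i \in \{0,1\}$ indicate Present and Correct,
\[
\mathrm{points}_i = w_i \, P_i \left(\tfrac{1}{2} + \tfrac{1}{2} C_i\right),
\qquad
\mathrm{DFS} = \sum_{i=1}^{9} \mathrm{points}_i,
\]
and because the nine weights sum to 100, the total already reads as a percentage.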
\vspace{0.2cm}
\noindent\textbf{Total:} \underline{\hspace{2cm}} / 100 \hspace{1.5cm} \textbf{Design Fidelity Score:} \underline{\hspace{2cm}}\%
\vspace{0.3cm}
\noindent\textbf{Notes:}
\vspace{2.5cm}
\section{Execution Reliability Score Rubric}
\textit{To be completed by the researcher after reviewing the video recording of the live trial.}
\vspace{0.3cm}
\noindent\textbf{Participant ID:} \underline{\hspace{3cm}} \hspace{0.5cm} \textbf{Condition:} \underline{\hspace{3cm}}
\vspace{0.2cm}
\noindent\textbf{Video File:} \underline{\hspace{6cm}}
\vspace{0.4cm}
\renewcommand{\arraystretch}{1.6}
\begin{tabularx}{\linewidth}{X >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.6cm} >{\centering\arraybackslash}p{1.4cm}}
\hline
\textbf{Behavior} & \textbf{Executed?} & \textbf{Correctly?} & \textbf{Points} \\
\hline
\multicolumn{4}{l}{\textbf{Speech Execution (40 points total)}} \\
\hline
1.\enspace Introduction speech delivered without errors & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
2.\enspace Narrative speech delivered without errors & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
3.\enspace Comprehension question delivered correctly & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
4.\enspace Appropriate branch response given & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Gesture and Movement Execution (30 points total)}} \\
\hline
5.\enspace Introduction gesture executed completely & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
6.\enspace At least two narrative gestures executed & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
7.\enspace Nod or head shake executed correctly & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Timing and Synchronization (20 points total)}} \\
\hline
8.\enspace Speech and gestures synchronized & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
9.\enspace Pause held before comprehension question & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{System Reliability (10 points --- deduct if problems occur)}} \\
\hline
10.\enspace No disconnections, crashes, or hangs occurred & Y~~/~~N & N/A & ~~~~~/10 \\
\hline
\end{tabularx}
\renewcommand{\arraystretch}{1}
\vspace{0.4cm}
\noindent\textbf{Scoring:} Award full points if both Executed \emph{and} Correct; 50\% if Executed but not Correct; 0 if not Executed. For item 10, award full points only if no errors occurred.
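\noindent In the same notation as the Design Fidelity rubric, with $E_i, C_i \in \{0,1\}$ indicating Executed and Correct,
\[
\mathrm{points}_i = w_i \, E_i \left(\tfrac{1}{2} + \tfrac{1}{2} C_i\right) \quad (1 \le i \le 9),
\qquad
\mathrm{points}_{10} = 10 \cdot [\text{no errors}],
\]
and $\mathrm{ERS} = \sum_{i=1}^{10} \mathrm{points}_i$ out of 100.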
\vspace{0.2cm}
\noindent\textbf{Total:} \underline{\hspace{2cm}} / 100 \hspace{1.5cm} \textbf{Execution Reliability Score:} \underline{\hspace{2cm}}\%
\vspace{0.3cm}
\noindent\textbf{Notes:}
\vspace{2.5cm}
Binary file not shown (new image, 80 KiB)