diff --git a/thesis/chapters/06_evaluation.tex b/thesis/chapters/06_evaluation.tex index acc98e7..def5d05 100644 --- a/thesis/chapters/06_evaluation.tex +++ b/thesis/chapters/06_evaluation.tex @@ -1,11 +1,123 @@ -\chapter{Experimental Evaluation of HRIStudio} +\chapter{Evaluation Study} \label{ch:evaluation} -\section{Evaluation Goals} -% TODO +Chapters~\ref{ch:design} and~\ref{ch:implementation} described a platform designed to address two specific problems in WoZ-based HRI research: the high technical barrier that limits who can design robot interactions, and the methodological inconsistency that limits how reproducible those interactions are once designed. This chapter describes a user study conducted to test whether HRIStudio actually improves on both dimensions compared to the current standard tool. The study design, participant selection, task, procedure, and measures are documented here. + +\section{Research Questions} + +The evaluation targets the two problems established in Chapter~\ref{ch:background}. The first is the \emph{Accessibility Problem}: existing tools require substantial programming expertise, which prevents domain experts from conducting independent HRI studies. The second is the \emph{Reproducibility Problem}: without structured logging and protocol enforcement, experiment execution varies across participants and wizards in ways that are difficult to detect or control after the fact. + +These problems give rise to two research questions. The first asks whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second asks whether HRIStudio produces more reliable execution of that interaction compared to standard practice. 
+ +I hypothesized that wizards using HRIStudio would more completely and correctly implement the written specification, and that their designs would execute more reliably during the live trial, compared to wizards using Choregraphe. \section{Study Design} -% TODO + +The study used a between-subjects design~\cite{Bartneck2024}. Each wizard participant was randomly assigned to one of two conditions: HRIStudio or Choregraphe. Both groups received the same task, the same time allocation, and the same training structure. Measuring each participant in only one condition prevents carryover effects that would arise from using the same tool twice in sequence. + +Two types of participants took part, in distinct roles. Wizards were faculty members who designed and ran the robot interaction. Test subjects were undergraduate students who interacted with the robot during the live trial. This separation ensures that the evaluation captures both the design experience and the quality of the resulting interaction. + +\section{Participants} + +\subsection{Wizards} + +Eight Bucknell University faculty members drawn from across departments were recruited to serve as wizards. Participants were deliberately drawn from both ends of the programming experience spectrum: four had substantial programming backgrounds, and four described themselves as non-programmers or having minimal coding experience. This cross-departmental recruitment was intentional. A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population. + +The key inclusion criterion for all wizards was no prior experience with either the NAO robot or the Choregraphe software. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. Wizards were recruited through direct email.
Participation was framed as a voluntary software evaluation unrelated to any professional obligations. + +\subsection{Test Subjects} + +Eight undergraduate students from Bucknell University were recruited to serve as test subjects. Their role was to interact with the robot during the live trial portion of each wizard's session. Participants were screened to ensure that no test subject was enrolled in a course taught by the wizard they were paired with, to eliminate any risk of coercion. Test subjects were recruited through campus flyers inviting volunteers to interact with a robot for approximately 15 minutes. All participants received international snacks and refreshments upon arrival, regardless of whether they completed the full session. + +\subsection{Sample Size Rationale} + +With $N = 16$ total participants, this study is small by the standards of a mature research program. That is intentional and appropriate given three constraints. First, this is an honors thesis project conducted over two academic semesters by a single undergraduate researcher with no funded research assistant support. The total person-hours available for participant recruitment, scheduling, session facilitation, and data processing are genuinely bounded. Second, the scope of the study is validation rather than definitive evaluation: the goal is to determine whether HRIStudio produces measurably different outcomes from Choregraphe and to identify failure modes, not to establish effect sizes for a broad population. Third, recruiting faculty from outside computer science for a 75-minute technology evaluation at a small liberal arts university is practically difficult. The target population --- domain experts with no prior robotics tool exposure --- is limited in size and has high competing time demands. Eight participants span the available pool without resorting to participants who do not meet the inclusion criteria.
+ +This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{Steinfeld2009}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof. + +\section{Task} + +Both wizard groups were given the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Dara, narrates her discovery of an anomalous glowing rock on Mars, asks the participant a comprehension question about the story, and delivers one of two responses depending on whether the answer is correct. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:materials}. + +The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching based on participant input, and a defined conclusion. This exercises the core features of both tools and produces an artifact that can be evaluated against a clear specification. + +\section{Apparatus} + +Both conditions used the same NAO humanoid robot, a platform approximately 0.58 meters tall capable of speech synthesis, animated gestures, and head movement. Using the same hardware ensured that any differences in execution quality were attributable to the software, not the robot. + +The control condition used Choregraphe \cite{Pot2009}, a proprietary visual programming tool developed by Aldebaran Robotics and the standard software for NAO programming. Choregraphe organizes behavior as a flow diagram of connected boxes: each box encapsulates a behavior such as a speech act or an animation, and links between boxes determine execution order and branching. + +The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches.
Unlike Choregraphe, it abstracts robot-specific commands through configuration files, though for this study both tools controlled the same NAO platform. \section{Procedure} -% TODO + +Each wizard completed a single 75-minute session structured in four phases. Test subjects participated in the live trial phase only, for approximately 15 minutes. + +\subsection{Phase 1: Training (15 minutes)} + +The session opened with a standardized tutorial tailored to the wizard's assigned tool. The tutorial covered how to create speech actions, specify gestures, define conditional branches, and save the completed design. Training was intentionally brief to simulate a domain expert encountering a new tool without dedicated onboarding. The researcher answered clarification questions but did not offer hints about the design challenge. + +\subsection{Phase 2: Design Challenge (30 minutes)} + +The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. The researcher observed silently and recorded a screen capture of the wizard's workflow throughout. The researcher noted time to completion, help requests, and any observable errors or misconceptions. If the wizard declared completion before the 30-minute limit, the remaining time was used to review and refine the design. + +\subsection{Phase 3: Live Trial (15 minutes)} + +After the design phase, a test subject entered the room and the wizard ran their completed program to control the robot during an actual interaction. The researcher video-recorded the full trial to capture robot behavior and timing. The test subject was told they were helping evaluate the robot's performance, not being evaluated themselves. + +\subsection{Phase 4: Debrief (15 minutes)} + +Following the live trial, the wizard exported their completed project file and completed the System Usability Scale survey. The exported file and video recording served as the primary artifacts for scoring. 
+ +\section{Measures} +\label{sec:measures} + +Four measures were collected, two primary and two supplementary. + +\subsection{Design Fidelity Score} + +The Design Fidelity Score measures how completely and correctly the wizard implemented the paper specification. The exported project file was evaluated against a weighted rubric spanning three categories: speech actions (whether each of the four specified utterances was present and matched the specification word-for-word), gestures (whether the required gestures were assigned to the correct steps), and control flow (whether the conditional branch triggered on the correct condition and whether all four interaction steps were correctly sequenced). Each rubric item was assessed for presence and correctness, earning full credit when both held, half credit when present but incorrect, and no credit when absent; the score is the percentage of the 100 available points. + +This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained what the wizard could recognize and fewer than 6\% described wizard training procedures --- meaning the vast majority of WoZ studies never verified whether the wizard's design matched any formal specification. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI, and the preliminary design of HRIStudio identified specification adherence as a primary evaluation target~\cite{OConnor2024}. The DFS operationalizes these recommendations as a weighted rubric applied to the exported project file. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses accessibility: did the tool allow a non-expert to produce a correct design? + +\subsection{Execution Reliability Score} + +The Execution Reliability Score measures whether the designed interaction executed as intended during the live trial. The video recording was reviewed against the specification and the wizard's design.
Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and were synchronized with speech, whether the conditional branch resolved correctly based on the test subject's answer, and whether any errors, disconnections, or hangs occurred. As with the DFS, each rubric item earned full credit if the behavior executed correctly, half credit if it executed incorrectly, and no credit if it did not execute (with the system-reliability item scored all-or-nothing); the score is the percentage of the 100 available points. + +This measure responds directly to Riek's finding~\cite{Riek2012} that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent. Without an execution-level metric, a study could report a technically correct design that nonetheless failed during the live trial due to timing errors, disconnections, or mishandled branches --- precisely the class of problem HRIStudio was designed to detect and log~\cite{OConnor2024, OConnor2025}. The ERS captures those deviations quantitatively. The complete rubric is reproduced in Appendix~\ref{app:materials}. This measure addresses reproducibility: did the design translate reliably into execution? + +\subsection{System Usability Scale} + +The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability \cite{Brooke1996}. Wizards completed the SUS during the debrief phase. Raw responses are converted to a 0--100 score using Brooke's standard scoring rule, $\mathrm{SUS} = 2.5\left[\sum_{i \in \{1,3,5,7,9\}} (r_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - r_i)\right]$, where $r_i$ is the five-point response to item $i$; higher scores indicate better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:materials}. + +\subsection{Time-to-Completion and Help Requests} + +Time to completion measures how long the wizard took to declare the design finished within the 30-minute window. Help request count and type capture where participants encountered difficulty. These supplementary measures provide context for interpreting the primary scores. + +\section{Measurement Instruments} + +Table~\ref{tbl:measurement_instruments} summarizes the four instruments, when they were collected, and which research question each addresses.
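Both primary measures apply the same full/half/zero credit rule to their rubric items. The following sketch is illustrative only (the helper `rubric_score` is hypothetical and not part of HRIStudio or the study tooling); the item weights shown follow the DFS rubric reproduced in Appendix~\ref{app:materials}:

```python
# Illustrative sketch of the full/half/zero rubric rule used for the DFS and
# ERS: each item is judged (present, correct) and carries a point weight.
# The function name and tuple encoding are assumptions for this example.

def rubric_score(items):
    """items: list of (present, correct, weight); returns points out of 100."""
    earned = 0.0
    for present, correct, weight in items:
        if present and correct:
            earned += weight          # full credit: present and correct
        elif present:
            earned += 0.5 * weight    # half credit: present but incorrect
        # absent items earn no credit
    return earned

# Example: all nine DFS items implemented, but the conditional branch
# (a 15-point control-flow item) triggers on the wrong condition.
dfs_items = [(True, True, 10)] * 7 + [(True, False, 15), (True, True, 15)]
print(rubric_score(dfs_items))  # 92.5
```

A design missing an item outright loses its full weight, while a present-but-flawed item loses only half, which is why the rubric distinguishes presence from correctness.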
+ +\begin{table}[htbp] +\centering +\footnotesize +\begin{tabular}{|p{3.2cm}|p{4.2cm}|p{2.4cm}|p{3cm}|} +\hline +\textbf{Instrument} & \textbf{What it captures} & \textbf{When collected} & \textbf{Research question} \\ +\hline +Design Fidelity Score & Completeness and correctness of the wizard's implementation against the written specification & End of design phase & Accessibility \\ +\hline +Execution Reliability Score & Whether the interaction executed as designed during the live trial & Post-trial video review & Reproducibility \\ +\hline +System Usability Scale & Wizard's perceived usability of the assigned tool & Debrief phase & User experience \\ +\hline +Time-to-Completion \& Help Requests & Task duration and support requests during design & Throughout design phase & Supplementary \\ +\hline +\end{tabular} +\caption{Measurement instruments used in the evaluation study.} +\label{tbl:measurement_instruments} +\end{table} + +\section{Chapter Summary} + +This chapter described a between-subjects study comparing HRIStudio and Choregraphe across eight wizard participants --- four with programming backgrounds and four without --- each of whom designed and ran the Interactive Storyteller task on a NAO robot. Performance was measured through design fidelity against the written specification, execution reliability during the live trial, perceived usability via the SUS, and supplementary timing and help data. Chapter~\ref{ch:results} presents the results. diff --git a/thesis/chapters/app_materials.tex b/thesis/chapters/app_materials.tex index a899067..ae65224 100644 --- a/thesis/chapters/app_materials.tex +++ b/thesis/chapters/app_materials.tex @@ -1,11 +1,290 @@ \chapter{Study Materials} \label{app:materials} -\section{Protocols} -% TODO +This appendix contains the study materials used in the evaluation described in Chapter~\ref{ch:evaluation}, in the order they were presented to participants. 
-\section{IRB Materials} -% TODO +\section{Recruitment Materials} -\section{Questionnaires} -% TODO +\subsection*{Email Invitation (Wizard Participants)} + +\textit{Subject: Invitation to evaluate Human-Robot Interaction software (International Snacks provided!)} + +Dear [Professor Name], + +I am conducting an honors thesis study to evaluate ``HRIStudio'', a new platform for designing human-robot interactions. I am seeking faculty members from across the university to participate in a 75-minute Wizard-of-Oz design session. + +You will be asked to spend 30 minutes programming a simple behavior on the NAO robot using either HRIStudio or Choregraphe, and then run it live with a student volunteer. No prior experience with the NAO robot is required. + +International snacks and refreshments will be provided during the session. If you are willing to participate, please reply to schedule a time. + +\hfill Sean O'Connor (\texttt{sso005@bucknell.edu}) + +\subsection*{Campus Flyer (Test Subject Participants)} + +\begin{center} +\textbf{\large VOLUNTEERS NEEDED: INTERACT WITH A ROBOT!} + +\vspace{0.4cm} +Participate in a short 15-minute session with a NAO humanoid robot. + +\vspace{0.4cm} +\textbf{Snacks from around the world will be provided!} + +\vspace{0.2cm} +Contact: \texttt{sso005@bucknell.edu} +\end{center} + +\section{Informed Consent Forms} + +\subsection*{Wizard Participant Consent Form} + +\textbf{HRIStudio User Study --- Informed Consent (Faculty/Wizard Participant)} + +\textbf{Introduction:} You are invited to participate in a research study evaluating a new software platform for the NAO robot. This study is conducted by Sean O'Connor (Student PI) and Dr.~L.~Felipe Perrone (Advisor) in the Department of Computer Science at Bucknell University. + +\textbf{Purpose:} The purpose of this study is to compare the usability and reproducibility of a new visual programming tool (HRIStudio) against the standard software (Choregraphe).
+ +\textbf{Procedures:} If you agree to participate, you will complete the following in a single 75-minute session: +\begin{enumerate} + \item \textbf{Training (15 min):} A brief tutorial on your assigned software interface covering speech, gesture, and branching. + \item \textbf{Design Challenge (30 min):} You will receive a written storyboard and program it on the NAO robot using your assigned tool. + \item \textbf{Live Trial (15 min):} A student volunteer will enter the room and you will run your program to deliver the story to them. + \item \textbf{Debrief (15 min):} You will complete a short usability survey. +\end{enumerate} + +\textbf{Data Collection:} Your workflow will be screen-recorded during the design phase. The live trial will be video recorded to verify robot behavior. All data will be stored on encrypted drives and your identity replaced with a numerical code (e.g., W-01). + +\textbf{Risks and Benefits:} There are no known risks beyond those of normal computer use. You will receive international snacks and refreshments during the session. Your participation contributes to research on accessible tools for HRI. + +\textbf{Voluntary Participation:} Participation is entirely voluntary and unrelated to any departmental obligations. You may withdraw at any time without penalty. + +\textbf{Questions:} Contact Sean O'Connor (\texttt{sso005@bucknell.edu}) or the Bucknell IRB (\texttt{irb@bucknell.edu}). + +\vspace{0.8cm} +\noindent\rule{0.55\textwidth}{0.4pt}\\ +Signature of Participant \hspace{4cm} Date + +\vspace{1.2cm} + +\subsection*{Test Subject Consent Form} + +\textbf{HRIStudio User Study --- Informed Consent (Student/Test Subject)} + +\textbf{Introduction:} You are invited to participate in a 15-minute robot interaction session as part of a research study conducted in the Bucknell Computer Science Department. + +\textbf{Procedure:} You will enter a lab room and listen to a short story told by a NAO humanoid robot. 
The robot will then ask you a comprehension question. The interaction takes approximately 5--10 minutes. + +\textbf{Data Collection:} The session will be video recorded to analyze the robot's timing and behavior. Your responses are not being graded; we are evaluating the robot's performance, not yours. + +\textbf{Risks and Benefits:} Minimal risk. You will receive international snacks and refreshments for your time. + +\textbf{Voluntary Participation:} You may stop the interaction and leave at any time without penalty. + +\vspace{0.8cm} +\noindent\rule{0.55\textwidth}{0.4pt}\\ +Signature of Participant \hspace{4cm} Date + +\section{Paper Specification: The Interactive Storyteller} + +\textit{This document was given to each wizard participant at the start of the Design Phase.} + +\textbf{Goal:} Program the robot to tell a short interactive story to a participant. The robot must introduce the story, deliver the narrative with appropriate gestures, ask a comprehension question, and respond to the participant's answer. + +\textbf{Script and Logic Flow:} + +\begin{enumerate} + \item \textbf{Start State} + \begin{itemize} + \item Robot is standing and looking at the participant. + \end{itemize} + + \item \textbf{Step 1 --- The Hook} + \begin{itemize} + \item \textbf{Speech:} ``Hello. I want to tell you about someone named Dara --- + an astronaut who made a decision that changed what we thought we knew about Mars. + Are you ready?'' + \item \textbf{Gesture:} Perform a slow open-hand gesture toward the participant, then lower both arms and stand still before continuing. + \end{itemize} + + \item \textbf{Step 2 --- The Narrative} + \begin{itemize} + \item \textbf{Speech:} ``It was 2147. Dara's crew had been on the Martian surface for six days. + Mission protocol said to collect samples, document the terrain, and stay on schedule. + But on the sixth morning, while the rest of the crew ran diagnostics, + Dara wandered off course. + About forty meters from camp, she stopped. 
+ Half-buried in the dust was a rock she almost stepped on --- + smooth, the size of a fist, and glowing a deep, steady red. + Not reflecting sunlight. Glowing. + She knelt down, picked it up, and said nothing to anyone.'' + \item \textbf{Gesture 1:} As the robot says ``stay on schedule,'' make a precise, dismissive hand wave. + \item \textbf{Gesture 2:} As the robot says ``she stopped,'' pause all motion for one full second. + \item \textbf{Gesture 3:} As the robot says ``glowing a deep, steady red,'' look slowly downward. + \item \textbf{Gesture 4:} As the robot says ``said nothing to anyone,'' lean slightly forward and lower the voice. + \end{itemize} + + \item \textbf{Step 3 --- Comprehension Check (Branching)} + \begin{itemize} + \item \textbf{Speech:} ``She brought it home. + The mission report listed it as an anomalous geological sample. + NASA has been running tests on it ever since. + No one has published anything yet.'' + \item \textbf{Gesture:} Stand upright, look directly at the participant, and pause for one full second. + \item \textbf{Question:} ``What color was the rock Dara found?'' + \item \textbf{Branch A (Correct answer: ``Red'' or ``red''):} + \begin{itemize} + \item \textbf{Speech:} ``Red. And it was still glowing when she landed.'' + \item \textbf{Gesture:} Robot nods once, slowly. + \end{itemize} + \item \textbf{Branch B (Any other answer):} + \begin{itemize} + \item \textbf{Speech:} ``Actually, red. Not reflecting light --- emitting it.'' + \item \textbf{Gesture:} Robot shakes head once. + \end{itemize} + \end{itemize} + + \item \textbf{Step 4 --- Conclusion} + \begin{itemize} + \item \textbf{Speech:} ``That was six years ago. + The rock is in a lab in Houston. + Dara still hasn't told anyone exactly where she found it. + That's the end of the story.'' + \item \textbf{Gesture:} Stand still, lower arms to sides, and bow.
+ \end{itemize} +\end{enumerate} + +\section{Post-Study Questionnaire (System Usability Scale)} + +\textit{Completed by wizard participants after the live trial. Circle the number that best reflects your agreement with each statement.} + +\vspace{0.4cm} +\noindent +\renewcommand{\arraystretch}{2.2} +\begin{tabularx}{\linewidth}{X *{5}{>{\centering\arraybackslash}p{0.85cm}}} +\textbf{Statement} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} \\ +\textit{\footnotesize (Circle one per row)} + & \textit{\footnotesize SD} & & & & \textit{\footnotesize SA} \\ +\hline +1.\enspace I think that I would like to use this system frequently. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +2.\enspace I found the system unnecessarily complex. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +3.\enspace I thought the system was easy to use. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +4.\enspace I think that I would need the support of a technical person to be able to use this system. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +5.\enspace I found the various functions in this system were well integrated. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +6.\enspace I thought there was too much inconsistency in this system. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +7.\enspace I would imagine that most people would learn to use this system very quickly. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +8.\enspace I found the system very cumbersome to use. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +9.\enspace I felt very confident using the system. + & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +10.\enspace I needed to learn a lot of things before I could get going with this system.
+ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\ +\hline +\end{tabularx} +\renewcommand{\arraystretch}{1} + +\vspace{0.4cm} +\noindent\textit{\footnotesize SD = Strongly Disagree \quad SA = Strongly Agree} + +\section{Design Fidelity Score Rubric} + +\textit{To be completed by the researcher after analyzing the exported project file.} + +\vspace{0.3cm} +\noindent\textbf{Participant ID:} \underline{\hspace{3cm}} \hspace{1cm} \textbf{Condition:} \underline{\hspace{3cm}} +\vspace{0.4cm} + +\renewcommand{\arraystretch}{1.6} +\begin{tabularx}{\linewidth}{X >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.4cm}} +\hline +\textbf{Component} & \textbf{Present} & \textbf{Correct} & \textbf{Points} \\ +\hline +\multicolumn{4}{l}{\textbf{Speech Actions (40 points total)}} \\ +\hline +1.\enspace Introduction speech (``Hello. I want to tell you about someone named Dara\ldots'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +2.\enspace Narrative speech (``It was 2147. 
Dara's crew\ldots'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +3.\enspace Question speech (``What color was the rock Dara found?'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +4.\enspace Response speeches (correct and incorrect branches) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +\hline +\multicolumn{4}{l}{\textbf{Gestures and Actions (30 points total)}} \\ +\hline +5.\enspace Open-hand gesture during introduction & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +6.\enspace At least two narrative gestures (pause, lean, gaze) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +7.\enspace Nod (correct branch) or head shake (incorrect branch) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +\hline +\multicolumn{4}{l}{\textbf{Control Flow and Logic (30 points total)}} \\ +\hline +8.\enspace Conditional branch triggers on participant's answer & Y~~/~~N & Y~~/~~N & ~~~~~/15 \\ +9.\enspace Correct sequencing of all four steps & Y~~/~~N & Y~~/~~N & ~~~~~/15 \\ +\hline +\end{tabularx} +\renewcommand{\arraystretch}{1} + +\vspace{0.4cm} +\noindent\textbf{Scoring:} Award full points if both Present \emph{and} Correct; 50\% if Present but not Correct; 0 if not Present. 
+ +\vspace{0.2cm} +\noindent\textbf{Total:} \underline{\hspace{2cm}} / 100 \hspace{1.5cm} \textbf{Design Fidelity Score:} \underline{\hspace{2cm}}\% + +\vspace{0.3cm} +\noindent\textbf{Notes:} + +\vspace{2.5cm} + +\section{Execution Reliability Score Rubric} + +\textit{To be completed by the researcher after reviewing the video recording of the live trial.} + +\vspace{0.3cm} +\noindent\textbf{Participant ID:} \underline{\hspace{3cm}} \hspace{0.5cm} \textbf{Condition:} \underline{\hspace{3cm}} + +\vspace{0.2cm} +\noindent\textbf{Video File:} \underline{\hspace{6cm}} +\vspace{0.4cm} + +\renewcommand{\arraystretch}{1.6} +\begin{tabularx}{\linewidth}{X >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.6cm} >{\centering\arraybackslash}p{1.4cm}} +\hline +\textbf{Behavior} & \textbf{Executed?} & \textbf{Correctly?} & \textbf{Points} \\ +\hline +\multicolumn{4}{l}{\textbf{Speech Execution (40 points total)}} \\ +\hline +1.\enspace Introduction speech delivered without errors & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +2.\enspace Narrative speech delivered without errors & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +3.\enspace Comprehension question delivered correctly & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +4.\enspace Appropriate branch response given & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +\hline +\multicolumn{4}{l}{\textbf{Gesture and Movement Execution (30 points total)}} \\ +\hline +5.\enspace Introduction gesture executed completely & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +6.\enspace At least two narrative gestures executed & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +7.\enspace Nod or head shake executed correctly & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +\hline +\multicolumn{4}{l}{\textbf{Timing and Synchronization (20 points total)}} \\ +\hline +8.\enspace Speech and gestures synchronized & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +9.\enspace Pause held before comprehension question & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\ +\hline +\multicolumn{4}{l}{\textbf{System Reliability (10 points --- deduct if problems 
occur)}} \\ +\hline +10.\enspace No disconnections, crashes, or hangs occurred & Y~~/~~N & N/A & ~~~~~/10 \\ +\hline +\end{tabularx} +\renewcommand{\arraystretch}{1} + +\vspace{0.4cm} +\noindent\textbf{Scoring:} Award full points if both Executed \emph{and} Correct; 50\% if Executed but not Correct; 0 if not Executed. For item 10, award full points only if no errors occurred. + +\vspace{0.2cm} +\noindent\textbf{Total:} \underline{\hspace{2cm}} / 100 \hspace{1.5cm} \textbf{Execution Reliability Score:} \underline{\hspace{2cm}}\% + +\vspace{0.3cm} +\noindent\textbf{Notes:} + +\vspace{2.5cm}