Mirror of https://github.com/soconnor0919/honors-thesis.git (synced 2026-05-08 07:08:55 -04:00)

Commit: final submissions update part 1
@@ -0,0 +1,54 @@

# AGENTS.md

## Purpose

This file defines repository-specific guidance for AI coding/writing agents working on this thesis.

## Repository Layout

- Thesis source: `thesis/`
- Main file: `thesis/thesis.tex`
- Chapters: `thesis/chapters/*.tex`
- Output PDF: `thesis/out/thesis.pdf`
- Context/reference docs: `context/`

## Build and Verify

From `thesis/`:

- Build: `make`
- Output should be generated at `build/thesis.pdf` and copied to `out/thesis.pdf`.

If edits touch LaTeX content, run a build before finishing.
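As a quick reference, a minimal sketch of the build-and-verify sequence (assuming the default `make` target described above):

```sh
# From the repository root: build the thesis and check that the PDF was produced.
cd thesis
make                   # builds build/thesis.pdf and copies it to out/thesis.pdf
ls -l out/thesis.pdf   # verify the output PDF exists and was just regenerated
```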
## Writing and Editing Priorities

1. Preserve technical accuracy over stylistic flourish.
2. Prefer plain, direct language.
3. Keep paragraph flow tight: avoid repeating claims already made in nearby paragraphs.
4. Minimize unnecessary chapter recap in chapter introductions.
5. Do not introduce unsupported claims.

## Terminology Canon (Use Consistently)

- `experiment`: reusable protocol specification
- `trial`: one concrete run of an experiment protocol
- `wizard`: human operator controlling execution
- `test subject` / `human subject`: person interacting with robot during a trial
- `session`: scheduled study block that can include training, design challenge, trial, and debrief

When revising text, normalize wording to these definitions unless quoting study materials verbatim.

## Chapter-Specific Conventions

- Chapter 4 (Design): focus on principles and architecture, not implementation specifics.
- Chapter 5 (Implementation): focus on how the principles are realized.
- Chapter 6 (Evaluation): focus on study design, procedure, and measures; avoid re-explaining earlier chapters.

## Figures and Media

- If requested images are unavailable, add clearly labeled placeholders in the relevant section.
- Prefer placing robot imagery in evaluation/apparatus context unless explicitly requested elsewhere.

## Scope Discipline

- Do not edit generated files under `thesis/build/`.
- Do not add new dependencies/packages unless necessary.
- Keep edits minimal and localized to the user request.

## Review Checklist Before Finishing

- Build succeeds (`make` from `thesis/`).
- Terminology remains consistent across edited sections.
- No obvious redundancy introduced.
- References/labels compile without new undefined-reference warnings.
@@ -0,0 +1,314 @@

# Professor Annotations — Thesis Draft

Complete record of all GoodReader/Notability annotations from both PDFs (`draft1 - flattened.pdf` covering Abstract + Ch1–5, and `6-9.pdf` covering Ch6–9). The Status column tracks what has been applied.

**Key rule:** Only apply explicitly marked text changes (strikethroughs, word replacements, caret insertions). Treat observational margin notes as context only.

---

## Professor's Overall Email Feedback

> "Well, this was an Odyssey of a day. You have something very good here, but as every text, it can always be improved. I am not sure how much you really need to do for Monday. If there's anything that cannot be addressed in this time, it will spill over for later. By later, I mean throughout this coming week that proceeds the defense, possibly as modifications that others may require for you to do after the defense."

> "What you have here is likely to not raise a whole lot of concerns from your reviewers. **The point that I find most needing of attention is chapter 7.** It reads very dry, and as you will see in my embedded comments, I was left wondering if the results could be presented more synthetically with the full anecdotes relegated to an appendix."

> "Throughout the text from beginning to the end, I see material that appears repeatedly. Ideally, one would strive to eliminate these redundancies so that the text is more punchy, more direct, more effective in communicating the message that it wants to get across. Redundancies tend to distract the reader as well as to overwhelm them. If you have time, at some point, I would recommend going through this with a fine tooth comb to identify these redundancies and eliminate them as much as possible and as much as time allows."

**Priority order for remaining work (professor's implied ranking):**

1. Chapter 7 overhaul — highest priority before defense
2. Abstract rewrite
3. Redundancy pass (can spill to post-defense week)
4. Everything else (can spill to post-defense week if needed)

---

## Status Key

- ✅ Applied
- ⬜ Pending
- 🔍 Needs clarification

---
## Abstract (pp. xv–xvi)

| Location | Annotation | Status |
|---|---|---|
| p. xv, top margin | "make it tighter and fully self-contained" — restructured into three focused paragraphs | ✅ |
| p. xv, next to "Abstract" heading | Rephrased to: "Through a thorough literature review, I identified a set of design principles... I implemented HRIStudio... I then evaluated HRIStudio" | ✅ |
| p. xv, "high" in "impose high technical barriers" | Deleted "high" | ✅ |
| p. xv, "faculty" in "six faculty participants" | Deleted "faculty" | ✅ |
| p. xvi, "HRIStudio participants" and "Choregraphe participants" | → "HRIStudio wizards" / "Choregraphe wizards" | ✅ |
| p. xvi, three "(mean X vs. Y)" parentheticals | Deleted all three | ✅ |
| p. xvi, bottom | "The pilot study confirms the thesis: HRIStudio wizards achieved higher design fidelity, higher execution reliability, and higher perceived usability..." | ✅ |
| p. xvi | Third-party replication carve-out added: "Note that reproducibility here concerns execution consistency within a study and the portability of interaction scripts across robot platforms; it does not refer to independent replication of a published study by third-party researchers." | ✅ |
| p. xvi | Reproducibility defined as: "run the same social interaction script on a different robot platform without rebuilding the implementation from scratch" | ✅ |

---
## Chapter 1 — Introduction

| Location | Annotation | Status |
|---|---|---|
| p. 1, "limit HRI research" (highlighted) | "not all research in HRI is about social robotics; be careful to not get out of your scope" — observational | ✅ (applied in session 1) |
| p. 2, arrow to "Social robotics" | "social robotics, a subfield within HRI" — confirmed phrasing | ✅ |
| p. 2, "interaction is inherently unpredictable" | Strikethrough; replaced with: "reactions to robot behaviors are not always" | ✅ |
| p. 2, "uses" in "the researcher uses a WoZ setup" | Circle; "may use" written above — replace "uses" with "may use" | ✅ |
| p. 2, "In HRI, the wizard..." | "WoZ experiments" replacing "HRI" | ✅ |
| p. 3, "a high technical barrier prevents" | Strikethrough; replaced with: "may find it challenging to" | ✅ |
| p. 3, "from conducting" | Strikethrough — implied rewording | ✅ |
| p. 3, §1.2 Proposed Approach margin | "you are saying that there are many different robots out there and one may want to have the same interaction script execute on different robots → here is where you must define clearly what you mean by reproducibility" — context | ✅ |
| p. 3, bracket on first paragraph of §1.2 (three circled sentences) | "BARRIERS 1ST", "follows 2ND", "REPRODUCIBILITY 2ND" — ordering directive | ✅ |
| p. 3, "the HRI research community" | Strikethrough; replaced with: "WoZ experiments" | ✅ |
| p. 4, ✓ on "the design principles" | Checkmark — keep as is | ✅ |
| p. 4, ✗ on "plugin architecture that decouples experiment logic from robot-specific implementations" | Strikethrough; annotation: "a reference implementation, HRIStudio, used to evaluate the impact of the design principles" | ✅ |
| p. 4, §1.3 "the" before "vision" | Strikethrough; replaced with: "our" | ✅ |
| p. 4, §1.3 "the prototype" | Strikethrough; replaced with: "a first" | ✅ |
| p. 4, §1.3 bottom | "extends and" written above "formalizes"; "peer" written above "contributions" | ✅ |
| p. 5, circled "s" letters on "separates", "enforces", "implements", "evaluates" | Flagging subject/verb agreement — make all present-tense verbs consistent | ✅ |
| p. 5, end of research question paragraph | "& guides them to be consistent in the pursuit of their experimental goals" — add to final sentence | ✅ |
| p. 5, italicized research question in Chapter Summary | Strikethrough — delete whole sentence | ✅ |

---
## Chapter 2 — Background and Related Work

| Location | Annotation | Status |
|---|---|---|
| p. 6, top (Ch2 opening) | Venn diagram: HRI → Social Robotics → WoZ Experiments → (hatched inner area); "your work is situated in a subset of possible activities in HRI" — observational | ✅ |
| p. 6, "relative to" in "position this thesis relative to prior work" | Strikethrough; replaced with: "within the context of" | ✅ |
| p. 8, "WoZ4U is unusable with other robot platforms" | "unusable" struck; replaced with: "does not work for various robots"; "and" before "manufacturer" struck; caret suggests restructure | ✅ |
| p. 9, "best practices like standardized protocols" | "consistently following" written above "like"; "experimental" written above "standardized" | ✅ |
| p. 9, circle around citation "[14]" next to Riek | Observational — noting citation | ✅ |
| p. 9, underline on "internal validity" | "define this for the reader" | ✅ |
| p. 10, underline on "minimize context switching and tool fragmentation" (R1) | "explain to the reader what this means" | ✅ |
| p. 10, R2 "Creating interaction protocols" | "social" inserted before "interaction"; "between robot & human" inserted before "protocols" | ✅ |
| p. 10, R3 "across a variety of robotic platforms" | "→ this addresses reproducibility" — observational reading note | ✅ (treated as context) |
| p. 10, R5 "implementations" highlighted | "actions and behaviors?" written above; "→ this also addresses reproducibility" — observational | ✅ (treated as context) |
| p. 11, R5 "the platform" in last line | "WoZ" inserted above "platform" | ✅ |
| p. 11, R6 "review" + "execution" | Caret inserting "of" between "review" and "execution" | ✅ |
| p. 11, R6 paragraph margin | "you have been calling it reproducibility so stick with your terminology" | ✅ |
| p. 11, "flexibility" highlighted | Strikethrough — flagged | ✅ |
| p. 12, "tests" | Strikethrough; replaced with: "evaluates" | ✅ |

---
## Chapter 3 — Reproducibility Challenges

| Location | Annotation | Status |
|---|---|---|
| p. 13, "difficult to reproduce" | "consistently" inserted above "reproduce" | ✅ |
| p. 13, "sources of variability in WoZ studies" | "identified in the literature" inserted above | ✅ |
| p. 13, §3.1 opening (highlighted "Reproducibility in experimental research...") | Bottom annotation: "This reproducibility definition is about consistent application of the experimental script across multiple trials → different from reproducing the same experiment with different robots → how do you want to state the distinction?" | ✅ (§3.1 reframed with execution consistency + cross-platform reproducibility as two sub-aspects) |
| p. 14, "replicating" highlighted | "it may seem pedantic, but I think you need to establish the difference between reproducibility and repeatability across replications (trials)" — "replications" changed to "trials"; repeatability/reproducibility distinction + third-party carve-out added to §3.1 | ✅ |
| p. 15–16 | Clean — no annotations | — |

---
## Chapter 4 — Architectural Design

| Location | Annotation | Status |
|---|---|---|
| p. 19, §4.1 heading | "you use the word 'condition' with different meanings across the thesis — be sure to resolve ambiguities!" | ✅ ("condition" replaced with "experiment" in hierarchy contexts) |
| p. 19, "multiple reusable conditions" | Circled; arrow to annotation above | ✅ → "multiple experiments" |
| p. 19, "To organize these components" | "why do you call them 'components'?" | ✅ → "To organize these elements" |
| p. 20, top | "is 'condition' the same as a structural element...?" — context | ✅ |
| p. 20, "The terms in this hierarchy are used in a strict way" | Replaced with: "I define the elements in this hierarchy as follows." | ✅ |
| p. 20, "research" in "top-level research container" | Strikethrough — delete "research" | ✅ |
| p. 20, "conditions" in "groups related protocol conditions" | Circled — replace | ✅ → "groups related experiments" |
| p. 20, "condition" in "one reusable condition" | Circled — replace | ✅ → "one independently runnable protocol" |
| p. 20, "recall" in "testing recall" | "information" written above | ✅ → "testing information recall" |
| p. 20, Fig. 4.2 paragraph | "new paragraph" marked; "define" above "designed once"; replacement clause about instantiation | ✅ |
| p. 20, left margin bracket | "this needs to be tightened up and edited for clarity" | ✅ |
| p. 21, top | Research question rephrasing: "Does how the robot tell a story affect how a human will remember the story?" | ✅ |
| p. 21, "conditions" highlighted | Circled — replace | ✅ → "experiments" |
| p. 21, left margin | "this comes at the reader very abruptly: have you introduced these robots and their different morphologies/features before?" | ✅ (added brief characterizations) |
| p. 21, "that study" highlighted | "the study presented above" written below | ✅ |
| p. 21, "same hierarchy" struck | Replaced with: "hierarchical elements defined in Figure 4.1" | ✅ |
| p. 21, step sentence | "sequence of" inserted above "ordered"; "defines the specific" above "contains"; "the robot will perform" added | ✅ |
| p. 21, figures annotation | "These three figures are interrelated as follows: Figure 4.1 defines the experimental structure as an abstraction; Figure 4.2 shows concrete instances of the abstract experimental structure; and Figure 4.3 shows the expansions of each element of the experimental structure." | ✅ |
| p. 21, "Together, these three figures..." | Circle with arrow: "place this before the suggested paragraph" | ✅ |
| p. 21, "lets" struck | Replaced with: "compels" | ✅ |
| p. 21, "any" struck | Replaced with: "multiple" | ✅ |
| p. 23, "keeps the tool accessible to non-programmers" | "creates a process that is" inserted above | ✅ |
| p. 23, "levels" struck | Replaced with: "elements" | ✅ |
| p. 23, "trial flow" struck | Replaced with: "the sequence of events in a trial" | ✅ |
| p. 23, "timing" struck | Replaced with: "timing of each event" | ✅ |
| p. 23, R3/R4/R1/R6 annotations | Add parenthetical references | ✅ |
| p. 23, bottom | "be consistent with the terminology you use" | ✅ |
| p. 23, §4.2 bottom paragraph | "what really matters is that the order of actions is the same across multiple trials; it would be unnatural to demand that all actions should happen at the same points in time" — context for rewrite | ✅ (action ordering foregrounded) |
| p. 24, "shows up" circled | Replaced with: "can be evident" | ✅ |
| p. 25, §4.3.1 "in plain language" | "natural?" and "even having to" — replaced with: "naturally, without even having to write code" | ✅ |
| p. 25, "research" struck | Replaced with: "experimental" | ✅ |
| p. 25, stored-format sentence | "This enables third parties to reproduce one's experiments faithfully. The importance of this feature cannot be overstated since it is central to the scientific method." — add reproducibility sentence | ✅ |
| p. 26, §4.3.2 "shows" struck | Replaced with: "keeps the wizard informed of" | ✅ |
| p. 26, "are" caret | Inserted into "The current step...all" | ✅ |
| p. 26, "Execution" highlighted | `\emph{Execution}` for consistency | ✅ |
| p. 26, "simply" struck before "ignore these moments" | Delete "simply"; replace "these moments" with "these deviations from the script" | ✅ |
| p. 26, "participant" highlighted | Replaced with: "human subject" | ✅ |
| p. 26, left margin brackets | "a little polish is needed here" / "needs polish for clarity and directness" — context | ✅ |
| p. 26, "the" in "access the same live view" struck | Replaced with: "of a trial" | ✅ |
| p. 26, bottom | "Define a 'deviation' as a spontaneous action introduced by the wizard in response to a reaction of the human subject that was not expected when the script was created" | ✅ (deviation definition added) |
| p. 27, §4.3.3 "can" struck | "the need for" inserted | ✅ |
| p. 27, §4.4.1 annotation above heading | "Like the ISO/OSI RM for networking software, HRIStudio communicates layers, as shown in Fig. 4.5." | ✅ |
| p. 27, "The system" struck | Replaced with: "More specifically, the system is organized as" | ✅ |
| p. 28, Fig. 4.5 "different components w/ same name?" | Rename "Execution" in App Logic layer to "Trial Engine" | ✅ |
| p. 28, Fig. 4.5 "should these arrows be bidirectional?" | Changed arrows to bidirectional | ✅ |
| p. 29, §4.4.2 "begins stepping" | "allows the wizard to" inserted above | ✅ |
| p. 29, left margin | "I would have used a 'begin enumerable' here" | ✅ (prose converted to enumerate list) |
| p. 29, "unexpected events" struck | Replaced with: "deviations" | ✅ |
| p. 29, "ensures" struck | Replaced with: "creates automatically a" | ✅ |

---
## Chapter 5 — Implementation

| Location | Annotation | Status |
|---|---|---|
| p. 33, §5.1 "shown in Figure 4.5" | "presented in Chapter 4 and" inserted above | ✅ |
| p. 33, yellow highlight on local network sentence | Flagged — keep, add explanation before it | ✅ (client/server/local-network explanation added) |
| p. 33, TypeScript paragraph | "this is more about implementation than about architecture" — context | ✅ (treated as context) |
| p. 33, bottom | "up until this statement, you hadn't told the reader that the application is a networked composition of client and server, so this comes as a surprise." | ✅ (explanation added to §5.1) |
| p. 34, §5.2 "experiments" | "descriptions" inserted above → "saves experiment descriptions" | ✅ |
| p. 34, yellow highlight on "a trial means one concrete run..." | "wasn't this definition due on your first use of the term 'trial'?" — annotation; remove the definition from here | ✅ (misplaced trial definition removed) |
| p. 34, "trial record" | "sample?" written above — 🔍 unclear; do not change without confirmation | 🔍 |
| p. 35, below Figure 5.1 | "you should watch out for redundancies" — observational | ✅ (treated as context) |
| p. 36, left margin | "this was stated in 4.2" (re: event-driven paragraph) — context | ✅ (redundant paragraph trimmed) |
| p. 36, yellow highlight on "the wizard controls how time advances from action to action" | Flagged — keep this sentence | ✅ |
| p. 36, strikethrough on "A fully programmed robot..." passage | Replaced with: "Unscripted actions give the wizard the tools to record how these interactions unfold when deviations from the script are required." | ✅ |
| p. 38, after role list intro sentence | "The capabilities and constraints for each role are described below:" added | ✅ |
| p. 39, §5.5 "double-blind design" highlighted | "double-blind line" written above — term already defined inline with citation; no change needed | ✅ |
| p. 40, §5.7 "are complete and integrated" | "with one another" inserted via caret | ✅ |
| p. 40, §5.7 last sentence, caret after "beyond NAO6" | Caret with ↑ mark — expansion or forward reference needed | 🔍 |

---
## Chapter 6 — Pilot Validation Study

| Location | Annotation | Status |
|---|---|---|
| p. 43, §6.1 hypothesis paragraph | "reproducibility" written above "written specification"; "accessibility?" below — both named in hypothesis | ✅ |
| p. 43, §6.1 second RQ sentence | Yellow highlight — rephrase away from rhetorical framing | ✅ → "The first is whether..." / "The second is whether..." |
| p. 43, §6.2 "the same training structure" | "similar" written above — replace | ✅ → "a similar training structure" |
| p. 44, §6.3 "This cross-departmental recruitment was intentional." | "redundant" above — delete sentence | ✅ |
| p. 44, §6.3 justification paragraph | "you could first state this as a goal, then talk about how you met this goal" | ✅ (restructured: goal stated first) |
| p. 44, §6.3 "direct email" | "and" inserted via caret after | ✅ → "through direct email, and participation was..." |
| p. 44, §6.3 sample size "With" struck | "I chose to recruit"; "believing that" inserted | ✅ |
| p. 44, §6.3 yellow strikethrough block | "two academic semesters...high competing time demands." — delete | ✅ |
| p. 44, §6.3 left margin | "looks like you're making excuses" — context for deletion | ✅ |
| p. 44, §6.3 "proof" struck | "substantiation"; "of any claims." added | ✅ |
| p. 45, §6.4 top margin | "start with this to set up the scenario:" + professor's suggested opening sentence | ✅ |
| p. 45, §6.4 "glowing" | "red" written above | ✅ → "a red rock on Mars" |
| p. 45, §6.4 "comprehension" | "recall" written above | ✅ → "a recall question" |
| p. 45, §6.4 asterisk near Appendix ref | "...one might measure whether a robot or human storyteller would produce better recall." | ✅ (added as sentence) |
| p. 45, §6.4 "The task was chosen" | Circle — reframe | ✅ → "This scenario was chosen" |
| p. 45, §6.4 Choregraphe FSM sentence | Yellow highlight — flagged | ✅ (treated as context) |
| p. 45, §6.5 left margin | "Important to address" + "you talked about the NAO and Choregraphe earlier but only present them here" | ✅ (treated as context; §6.5 is the formal introduction) |
| p. 46, star annotation | "you used this nugget of info earlier and unveil it here" — context | ✅ (treated as context) |
| p. 46, yellow highlights on NAO + Choregraphe sentences | Flagged — keep | ✅ |
| p. 46, circle on "Choregraphe organizes behavior as a finite state machine" | Flagged | ✅ (treated as context) |
| p. 47, §6.6.2 "found in Appendix X" | Add appendix reference to observer data sheet | ✅ |
| p. 47, §6.6.2 "paper" in "paper specification" | Strikethrough — delete "paper" | ✅ |
| p. 47, §6.6.2 yellow highlight on "structured observer data sheet" | Flagged | ✅ |
| p. 48, §6.6.4 "found in Appendix Y" | Add appendix reference to SUS questionnaire | ✅ |
| p. 48, §6.6.4 yellow highlight on "System Usability Scale" | Flagged | ✅ |
| p. 48, §6.7 after "five instruments." | "They are described as follows." written in red | ✅ |
| p. 48, §6.7.1 DFS heading | "important to state if this is something you defined or if it was someone else's definition" | ✅ → "I define the Design Fidelity Score (DFS) as..." |
| p. 49, §6.7.1 "This measure" | "DFS" written above → "DFS is motivated by..." | ✅ |
| p. 49, §6.7.1 accessibility sentence | "the question:" inserted → "This measure addresses the question: did the tool allow a wizard to independently produce a correct design?" | ✅ |
| p. 49, §6.7.2 ERS heading | "same comment I made for the DFS" — author-defined statement needed | ✅ → "I define the Execution Reliability Score (ERS) as..." |
| p. 50, §6.7.2 reproducibility sentence | "the" and "question" inserted → "This measure addresses the question: did the design translate reliably into execution without researcher support?" | ✅ |
| p. 50, §6.7.3 "created by" | Caret before [19] → "perceived usability, created by Brooke [19]" | ✅ |
| p. 52, §6.9 "This chapter described" | "the structure of" inserted | ✅ |

---
## Chapter 7 — Results

| Location | Annotation | Status |
|---|---|---|
| p. 54, Ch7 intro yellow highlight | "Rhetoric is unusual in technical writing → better rephrase this" — rephrased to declarative statement | ✅ |
| p. 54, §7.1 "participants" | "personas" → "Table 7.1 summarizes the personas and their assigned conditions" | ✅ |
| p. 54, §7.1 "faculty members" highlighted | "professors" — replaced in §7.1 opening | ✅ |
| p. 54, §7.1 after table introduction | "This table also presents numerical data representing the study's results, which is discussed next." | ✅ |
| p. 55, §7.2.1 DFS heading | "(DFS)" added → "Design Fidelity Score (DFS)" | ✅ |
| p. 55, §7.2.1 "implemented the written specification" | "the experiment they received" inserted | ✅ |
| p. 55, §7.2.1 "a component" highlighted | Inline definition added: "a rubric criterion representing a required speech action, gesture, or control-flow element" | ✅ |
| p. 55, §7.2.1 inline parentheticals | All removed from W-01 through W-06 in DFS and ERS sections | ✅ |
| p. 55, §7.2.1 bottom | Dry/anecdotal tone — added synthetic overview paragraph before per-wizard detail in both DFS and ERS sections | ✅ |
| p. 56, W-03 paragraph | "(see §6.7.4)" → "One C-type clarification (see Section~\ref{sec:measures}) was required" | ✅ |
| p. 56, "for that category" highlighted | Cross-reference added: "(For a full description of rubric categories, see Section~\ref{sec:measures}.)" | ✅ |
| p. 58, §7.2.2 ERS heading | "(ERS)" added → "Execution Reliability Score (ERS)" | ✅ |
| p. 70, §7.5 Chapter Summary | Rewritten as interpretive conclusions — no score repetition | ✅ |

---
## Chapter 8 — Discussion

| Location | Annotation | Status |
|---|---|---|
| p. 71, Ch8 intro | "With all six sessions now complete," struck — delete this clause | ⬜ |
| p. 73, §8.1.1 end of accessibility paragraph | `\emph{}` on "None", "Moderate", "Extensive" (annotated "temph") — italicize these three experience levels throughout | ⬜ |
| p. 73, §8.1.1 bottom | "There's a big thing hiding in the background here: only one wizard was a humanist; all others were engineers" — acknowledge this sample composition limitation | ⬜ |
| p. 77, §8.2 "holds" highlighted green | "is confirmed?" written above — consider replacing "holds" with "is confirmed" | ⬜ |
| p. 78, §8.2 continued | "both" inserted via caret before "conditions" → "the overall 17.5-point gap in both condition means reflects..." | ⬜ |
| p. 79, §8.3 "under active development" struck | Replaced with: "continuously evolving" → "HRIStudio is continuously evolving." | ⬜ |

---
## Chapter 9 — Conclusion and Future Work

| Location | Annotation | Status |
|---|---|---|
| p. 81, Ch9 intro | Green highlight on "Human-Robot Interaction"; "social robotics" written below → scope to "Wizard-of-Oz-based social robotics research" | ⬜ |
| p. 82, §9.1 first contribution | Green highlight on "institution" with "?" — word choice questioned in "not specific to any one robot or institution" | ⬜ |
| p. 82, §9.1 HRIStudio contribution | Circle around "an open-source"; "did you mention this earlier? how is it distributed and licensed?" — add distribution/licensing info | ⬜ |
| p. 83, §9.2 Reflection on Research Questions | "How much of 9.2 is new and how much of it does it repeat from other sections?" — audit for redundancy with §8.1 and trim | ⬜ |
| p. 85, §9.3 "Multi-task evaluation." | Strikethrough (green); replaced with: "Evaluations with multiple different tasks." | ⬜ |
| p. 86, §9.3 community adoption sentence | "not a" struck; "more of a" and "than" inserted → "The reproducibility problem in WoZ research is ultimately more of a community problem than a tool problem." | ⬜ |
| p. 86, §9.4 "are never shared" | "aren't always shared" written above struck phrase | ⬜ |
| p. 86, §9.4 bottom | "I struggle with the word rigorous: might 'systematic' be a more precise qualifier?" — consider replacing "rigorous" with "systematic" throughout closing paragraph | ⬜ |

---
## GoodReader Notes Page (p. i)

### Note 33-1 (April 11, 2026) — **MAJOR STRUCTURAL NOTE**

> "Maybe the way to address the different possible interpretations of the word reproducibility is to state outright that HRIStudio was designed to meet two distinct meanings of the term:
> — reproducibility across trials in the same experiment, with the same or with different wizards running them
> — create documentation to guide the reproduction of the experiment by third parties, which would mean reproducibility as in https://dl.acm.org/doi/pdf/10.4108/ICST.SIMUTOOLS2009.5684"

**What this means for the thesis:**
Taken together with the cross-platform sense the thesis already uses, this note calls for three interpretations of "reproducibility" to be explicitly distinguished somewhere in the thesis (Ch3 §3.1 is the natural home):

1. **Execution consistency** — wizard reliably follows the same script across multiple trials in the same experiment (within-study). This IS what the thesis evaluates.
2. **Cross-platform reproducibility** — the same experiment script runs on a different robot with minimal change. This IS what HRIStudio is designed to support.
3. **Third-party replication** — another lab reproduces your published experiment from documentation. This is NOT what the thesis evaluates, and the abstract/conclusion must be careful not to claim it.

**Current state:** Ch3 §3.1 already names sub-aspects 1 and 2. The explicit carve-out for sub-aspect 3 ("third-party replication is not what we evaluated") is still **pending** — it belongs in the Abstract and/or a sentence in Ch3 §3.1.

---
## Pending Items Summary

| Chapter | Item |
|---|---|
| Abstract | Full rewrite per professor's framing guidance |
| Ch3 §3.1 | Add sentence explicitly distinguishing third-party replication as out of scope |
| Ch5 §5.5 | "double-blind design" — define inline |
| Ch5 §5.7 | Caret after "beyond NAO6" — needs original PDF check |
| Ch7 §7.1 | "personas" for "participants"; "professors" for "faculty members" (global); add sentence after table |
| Ch7 §7.2.1 | "(DFS)" in heading; "the experiment they received"; define "a component"; remove inline parentheticals; narrative tone question |
| Ch7 §7.2.1 | "(see §6.7.4)" on C-type clarification; cross-reference for DFS categories |
| Ch7 §7.2.2 | "(ERS)" in heading |
| Ch7 §7.5 | Rewrite Chapter Summary as interpretive conclusions |
| Ch8 intro | Delete "With all six sessions now complete," |
| Ch8 §8.1.1 | Italicize None/Moderate/Extensive; acknowledge humanist sample limitation |
| Ch8 §8.2 | "holds" → consider "is confirmed"; "both" before "conditions" |
| Ch8 §8.3 | "under active development" → "continuously evolving" |
| Ch9 intro | Scope to "Wizard-of-Oz-based social robotics research" |
| Ch9 §9.1 | "institution" word choice; open-source licensing info |
| Ch9 §9.2 | Audit for redundancy with §8.1 |
| Ch9 §9.3 | Rename "Multi-task evaluation" heading; community problem sentence |
| Ch9 §9.4 | "aren't always shared"; "systematic" for "rigorous" |
@@ -1,32 +1,32 @@
\chapter{Introduction}
\label{ch:intro}

-Human-Robot Interaction (HRI) is an essential field of study for understanding how robots should communicate, collaborate, and coexist with people. As researchers work to develop social robots capable of natural interaction, they face a fundamental challenge: how to prototype and evaluate interaction designs before the underlying autonomous systems are fully developed. This chapter introduces the technical and methodological barriers that currently limit HRI research, describes a generalized approach to address these challenges, and establishes the research objectives and thesis statement for this work.
+Human-Robot Interaction (HRI) is an essential field of study for understanding how robots should communicate, collaborate, and coexist with people. As researchers work to develop social robots capable of natural interaction, they face a fundamental challenge: how to prototype and evaluate interaction designs before the underlying autonomous systems are fully developed. This chapter introduces the technical and methodological barriers that currently limit Wizard-of-Oz (WoZ) based HRI research, describes a generalized approach to address these challenges, and establishes the research objectives and thesis statement for this work.

\section{Motivation}

To build the social robots of tomorrow, researchers must study how people respond to robot behavior today. That requires interactions that feel real even when autonomy is incomplete. The process of designing and optimizing interactions between human and robot is essential to HRI, a discipline dedicated to ensuring these technologies are safe, effective, and accepted by the public \cite{Bartneck2024}. However, current practices for prototyping these interactions are often hindered by complex technical requirements and inconsistent methodologies.

-Social robotics focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human interaction is inherently unpredictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.
+Social robotics, a subfield of HRI, focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human reactions to robot behaviors are not always predictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.

-To overcome this limitation, researchers use the Wizard-of-Oz (WoZ) technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the "great and powerful" Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In HRI, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher uses a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.
+To overcome this limitation, researchers use the WoZ technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the "great and powerful" Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In WoZ experiments, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher may use a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.

-Despite its versatility, WoZ research faces two critical challenges. The first is \emph{The Accessibility Problem}: a high technical barrier prevents many non-programmers, such as experts in psychology or sociology, from conducting their own studies without engineering support. The second is \emph{The Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. These tools are rarely shared, making it difficult for the scientific community to replicate results or compare findings across labs.
+Despite its versatility, WoZ research faces two critical challenges. The first is \emph{The Accessibility Problem}: many non-programmers, such as experts in psychology or sociology, may find it challenging to conduct their own studies without engineering support. The second is \emph{The Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. Because these tools are tightly coupled to particular hardware, running the same social interaction script on a different robot platform typically requires rebuilding the implementation from scratch. These tools are rarely shared, making it difficult for a researcher to reproduce the same study across different robot platforms or for other labs to replicate results.

\section{Proposed Approach}

To address the accessibility and reproducibility problems in WoZ-based HRI research, I propose a web-based software framework that integrates three key capabilities. First, the framework must provide an intuitive interface for experiment design that does not require programming expertise, enabling domain experts from psychology, sociology, or other fields to create interaction protocols independently. Second, it must enforce methodological rigor during experiment execution by guiding the wizard through standardized procedures and preventing deviations from the experimental script that could compromise validity. Third, it must be platform-agnostic, meaning the same experiment design can be reused across different robot hardware as technology evolves.

-This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for the HRI research community. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.
+This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for WoZ-based HRI research. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.

-The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is a complete realization of this architecture: an open-source, web-based platform that serves as both the primary artifact of this thesis and the instrument for empirical validation.
+The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. To evaluate the impact of these design principles, I developed a reference implementation called HRIStudio: an open-source, web-based platform built on this architecture and used as the instrument for empirical validation.

\section{Research Objectives}

-This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established the vision and the prototype. This thesis formalizes the contribution: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a complete platform that realizes those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.
+This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established our vision and a first prototype. This thesis extends and formalizes our contributions: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a reference implementation that realizes those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.

-The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and how consistently they can do it.
+The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and guides them to be consistent in the pursuit of their experimental goals.

\section{Chapter Summary}

-This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question (can a hierarchical, event-driven specification model with explicit deviation logging lower the technical barrier and improve reproducibility of WoZ experiments?) and described how this thesis addresses it through formal design, a complete platform, and a pilot validation study. The next chapters establish the technical and methodological foundations.
+This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations.
@@ -1,9 +1,9 @@
\chapter{Background and Related Work}
\label{ch:background}

-This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis relative to prior work on this topic.
+This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis within the context of prior work on this topic.

-As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers. First, each research group builds custom software for specific robots, creating tool fragmentation across the field. Second, high technical barriers prevent many domain experts from conducting independent studies.
+As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. This thesis is situated within a specific subset of HRI activity: social robotics, a subfield concerned with robots designed for direct social interaction with humans, and more narrowly within that, WoZ experiments used to prototype and evaluate social robot behaviors. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers in WoZ-based HRI studies. First, each research group builds custom software for specific robots, creating tool fragmentation across the field. Second, high technical barriers prevent many domain experts from conducting independent studies.

\section{Existing WoZ Platforms and Tools}
@@ -11,36 +11,36 @@ Over the last two decades, multiple frameworks to support and automate the WoZ p
|
||||
|
||||
Early platform-agnostic tools focused on providing robust, flexible interfaces for technically sophisticated users. These systems were designed to work with multiple robot types rather than a single hardware platform. Polonius \cite{Lu2011}, built on the Robot Operating System (ROS) \cite{Quigley2009}, exemplifies this generation. It provides a graphical interface for defining finite state machine scripts that control robot behaviors, with integrated logging capabilities to streamline post-experiment analysis. The system was explicitly designed to enable robotics engineers to create experiments that their non-technical collaborators could then execute. However, the initial setup and configuration still required substantial programming expertise. Similarly, OpenWoZ \cite{Hoffman2016} introduced a cloud-based, runtime-configurable architecture using web protocols. Its design allows multiple operators or observers to connect simultaneously, and its plugin system enables researchers to extend functionality such as adding new robot behaviors or sensor integrations. Most importantly, OpenWoZ allows runtime modification of robot behaviors, enabling wizards to deviate from scripts when unexpected situations arise. While architecturally sophisticated and highly flexible, OpenWoZ requires programming knowledge to create custom behaviors and configure experiments, creating the \emph{Accessibility Problem} for non-technical researchers.
|
||||
|
||||
A second wave of tools shifted focus toward usability, often achieving accessibility by coupling tightly with specific hardware platforms. WoZ4U \cite{Rietz2021} was explicitly designed as an ``easy-to-use'' tool for conducting experiments with Aldebaran's Pepper robot. It provides an intuitive graphical interface that allows non-programmers to design interaction flows, and it successfully lowers the technical barrier. However, this usability comes at the cost of generalizability. WoZ4U is unusable with other robot platforms, and manufacturer-provided software follows a similar pattern.
|
||||
A second wave of tools shifted focus toward usability, often achieving accessibility by coupling tightly with specific hardware platforms. WoZ4U \cite{Rietz2021} was explicitly designed as an ``easy-to-use'' tool for conducting experiments with Aldebaran's Pepper robot. It provides an intuitive graphical interface that allows non-programmers to design interaction flows, and it successfully lowers the technical barrier. However, this usability comes at the cost of generalizability. WoZ4U does not work for other robot platforms. Manufacturer-provided software for various robots follow a similar pattern.
|
||||
|
||||
Choregraphe \cite{Pot2009}, developed by Aldebaran Robotics for the NAO and Pepper robots, offers a visual programming environment based on connected behavior boxes. Researchers can create complex interaction flows using drag-and-drop blocks without writing code in traditional programming languages. However, when new robot platforms emerge or when hardware becomes obsolete, tools like Choregraphe and WoZ4U lose their utility. Pettersson and Wik, in their review of WoZ tools \cite{Pettersson2015}, note that platform-specific systems often fall out of use as technology evolves, forcing researchers to constantly rebuild their experimental infrastructure.
|
||||
|
||||
Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high-fidelity data collection.
|
||||
|
||||
This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor. By methodological rigor, I refer to systematic features that guide experimenters toward best practices like standardized protocols, comprehensive logging, and reproducible experimental designs.
|
||||
This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor. By methodological rigor, I refer to systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs.
|
||||
|
||||
Moreover, few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
Moreover, few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity---that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
\section{Requirements for Modern WoZ Infrastructure}
This thesis is the latest step in a multi-year effort to build infrastructure that addresses the challenges identified in the WoZ platform landscape. Based on the analysis of existing platforms and identified methodological gaps, I derived requirements for a modern WoZ research infrastructure. Through our preliminary work \cite{OConnor2024}, we identified six critical capabilities that a comprehensive platform should provide:
\begin{description}
\item[R1: Integrated workflow.] All phases of the experimental workflow (design, execution, and analysis) should be integrated within a single unified environment to minimize context switching and tool fragmentation.
\item[R2: Low technical barrier.] Creating interaction protocols should require minimal to no programming expertise, enabling domain experts from psychology, education, or other fields to work independently \cite{Bartneck2024}.
\item[R3: Real-time control.] The system must support fine-grained, responsive real-time control during live experiment sessions across a variety of robotic platforms.
\item[R1: Integrated workflow.] All phases of the experimental workflow (design, execution, and analysis) should be integrated within a single unified environment, so that researchers do not need to move between separate tools to design, run, and analyze their experiments.
\item[R2: Low technical barrier.] Creating social interaction protocols between robot and human should require minimal to no programming expertise, enabling domain experts from psychology, education, or other fields to work independently \cite{Bartneck2024}.
\item[R3: Real-time control.] The system must support fine-grained, responsive real-time control during live experiment sessions across a variety of robotic platforms. Consistent real-time control across platforms also directly supports reproducibility: the same script should execute with equivalent responsiveness regardless of which robot is used.
\item[R4: Automated logging.] All actions, timings, and sensor data should be automatically logged with synchronized timestamps to facilitate analysis.
\item[R5: Platform agnosticism.] The architecture should decouple experimental logic from robot-specific implementations. This allows experiments designed for one robot type to be adapted to others, ensuring the platform remains viable as hardware evolves.
\item[R6: Collaborative support.] Multiple team members should be able to contribute to experiment design and review execution data, supporting truly interdisciplinary research.
\item[R5: Platform agnosticism.] The architecture should decouple experimental logic from robot-specific actions and behaviors. This allows experiments designed for one robot platform to be adapted to others, ensuring the WoZ platform remains viable as hardware evolves. This requirement also directly addresses the Reproducibility Problem: a platform-agnostic design makes it possible to run the same interaction script on different robots with minimal change to the implementing program, as the sketch following this list illustrates.
\item[R6: Collaborative support.] Multiple team members should be able to contribute to experiment design and review of execution data, supporting truly interdisciplinary research.
\end{description}
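To make R5 concrete, the sketch below shows how one abstract action vocabulary could map onto two different robot platforms. It is a minimal illustration under assumed names; the interface, plugin objects, and functions shown are not the platform's actual plugin API.

\begin{verbatim}
// Illustrative sketch only; all names are assumptions for this example.
interface RobotPlugin {
  speak(text: string): Promise<void>;
  gesture(name: string): Promise<void>;
}

// A NAO6-style plugin can forward gestures to named animations.
const nao6: RobotPlugin = {
  speak: async (text) => { console.log(`[nao6] say: ${text}`); },
  gesture: async (name) => { console.log(`[nao6] animation: ${name}`); },
};

// A TurtleBot-style plugin has no arms, so it substitutes a motion cue.
const turtlebot: RobotPlugin = {
  speak: async (text) => { console.log(`[turtlebot] tts: ${text}`); },
  gesture: async () => { console.log(`[turtlebot] brief rotation in place`); },
};

// The experiment script depends only on the abstract interface,
// so the same script runs unchanged against either plugin.
async function runIntro(robot: RobotPlugin): Promise<void> {
  await robot.speak("Hello, I am going to tell you a story.");
  await robot.gesture("wave");
}
\end{verbatim}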
To the best of my knowledge, no existing platform satisfies all six requirements. Most critically, the trade-off between accessibility and flexibility remains unresolved. Few tools embed methodological best practices directly into their design to guide experimenters toward sound methodology by default.
To the best of my knowledge, no existing platform satisfies all six requirements. Most critically, the trade-off between accessibility and reproducibility remains unresolved. Few tools embed methodological best practices directly into their design to guide experimenters toward sound methodology by default.
This work builds on two prior peer-reviewed publications. We first introduced the concept for HRIStudio as a Late-Breaking Report at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}. In that position paper, we identified the lack of accessible tooling as a primary barrier to entry in HRI and proposed the high-level vision of a web-based, collaborative platform. We established the core requirements listed above and argued for a web-based approach to achieve them.
Following the initial proposal, we published the detailed system architecture and preliminary prototype as a full paper at RO-MAN 2025 \cite{OConnor2025}. That publication validated the technical feasibility of our approach, detailing the communication protocols, data models, and plugin architecture necessary to support real-time robot control using standard web technologies while maintaining platform independence.
While those prior publications established the conceptual framework and technical architecture, this thesis formalizes those design principles, realizes them in a complete implementation, and tests whether they produce measurably different outcomes in a pilot validation study. The pilot study compares design fidelity and execution reliability between HRIStudio and a representative baseline tool, showing whether these principles translate into better outcomes for real researchers.
While those prior publications established the conceptual framework and technical architecture, this thesis formalizes those design principles, realizes them in a complete implementation, and evaluates whether they produce measurably different outcomes in a pilot validation study. The pilot study compares design fidelity and execution reliability between HRIStudio and a representative baseline tool, showing whether these principles translate into better outcomes for real researchers.
\section{Chapter Summary}
@@ -1,15 +1,17 @@
\chapter{Reproducibility Challenges}
\label{ch:reproducibility}
Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce and how software infrastructure can address them. This chapter analyzes the sources of variability in WoZ studies and examines how current practices in infrastructure and reporting contribute to reproducibility problems. Understanding these challenges is essential for designing a system that supports reproducible, rigorous experimentation.
Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce consistently and how software infrastructure can address them. This chapter analyzes the sources of variability identified in the WoZ literature and examines how current practices in infrastructure and reporting contribute to \emph{the Reproducibility Problem}. Understanding these challenges is essential for designing a system that supports reproducible, rigorous experimentation.
\section{Sources of Variability}
Reproducibility in experimental research requires that independent investigators can obtain consistent results when following the same procedures. In WoZ-based HRI studies, however, multiple sources of variability can compromise this goal. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. A third interpretation of the term---independent replication of a published study by researchers at other institutions---is distinct from both and is not what this thesis evaluates. It is also worth noting that execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study.
In WoZ-based HRI studies, multiple sources of variability can compromise execution consistency. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants. Even with a detailed script, the wizard may vary in timing, with delays between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.
Beyond wizard behavior, the custom nature of many WoZ control systems introduces technical variability. When each research group builds custom software for each study, several problems arise. Custom interfaces may have undocumented capabilities, hidden features, default behaviors, or timing characteristics researchers never formally describe. Software tightly coupled to specific robot models or operating system versions may become unusable when hardware or software is upgraded or replaced. Each system logs data differently, with different file formats, different levels of granularity, and different choices about what to record. This fragmentation means that replicating a study often requires not just following an experimental protocol but also reverse-engineering or rebuilding the original software and hardware infrastructure.
Beyond wizard behavior, the custom nature of many WoZ control systems introduces technical variability. When each research group builds custom software for each study, several problems arise. Custom interfaces may have undocumented capabilities, hidden features, default behaviors, or timing characteristics researchers never formally describe. Software tightly coupled to specific robot models or operating system versions may become unusable when hardware or software is upgraded or replaced. Each system logs data differently, with different file formats, different levels of granularity, and different choices about what to record. This fragmentation undermines both execution consistency and reproducibility. Rebuilding custom infrastructure for each study makes it nearly impossible to guarantee that wizard behavior is controlled the same way across trials. More broadly, reproducing the same experiment on a different robot platform typically requires reverse-engineering or rebuilding the original software from scratch.
Even when researchers intend for their work to be reproducible, practical constraints on publication length lead to incomplete documentation. Papers often omit exact timing parameters. Authors leave decision rules for wizard actions unspecified and fail to report details of the wizard interface. Specifications of data collection, including which sensor streams were recorded and at what sampling rate, frequently go missing. Without this information, other researchers cannot faithfully recreate the experimental conditions, limiting both direct replication and conceptual extensions of prior work.
@@ -35,4 +37,4 @@ The reproducibility challenges identified above directly motivate the infrastruc
\section{Chapter Summary}
This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. Rather than treating these challenges as inherent to the WoZ paradigm, I showed how each stems from gaps in current infrastructure. Software design can systematically mitigate these challenges through enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. These design goals directly address the six infrastructure requirements identified in Chapter~\ref{ch:background}. The following chapters describe the design, implementation, and pilot validation of a system that prioritizes reproducibility as a foundational design principle from inception.
This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. Rather than treating these challenges as inherent to the WoZ paradigm, I showed how each stems from gaps in current infrastructure. Software design can systematically mitigate them through enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. These design goals directly address the six infrastructure requirements identified in Chapter~\ref{ch:background}. The following chapters describe the design, implementation, and pilot validation of a system that prioritizes reproducibility as a foundational design principle from inception.
@@ -5,15 +5,19 @@ Chapter~\ref{ch:background} established six requirements for modern WoZ infrastr
\section{Hierarchical Organization of Experiments}
WoZ studies involve multiple reusable conditions, shared protocol phases, and platform-specific behaviors that span the full research lifecycle. To organize these components without requiring researchers to write code, the system structures every study as a four-level hierarchy: \emph{study} $\rightarrow$ \emph{experiment} $\rightarrow$ \emph{step} $\rightarrow$ \emph{action}. This structure separates high-level protocol design from low-level execution behavior, keeping the authoring process code-free while integrating design, execution, and analysis into a single unified workflow.
WoZ studies involve multiple experiments, shared protocol phases, and platform-specific behaviors that span the full research lifecycle. To organize these elements without requiring researchers to write code, the system structures every study as a four-level hierarchy: \emph{study} $\rightarrow$ \emph{experiment} $\rightarrow$ \emph{step} $\rightarrow$ \emph{action}. This structure separates high-level protocol design from low-level execution behavior, keeping the authoring process code-free while integrating design, execution, and analysis into a single unified workflow.
The terms in this hierarchy are used in a strict way. A \emph{study} is the top-level research container that groups related protocol conditions. An \emph{experiment} is one reusable condition within that study (for example, a control versus experimental condition). A \emph{step} is one phase of the protocol timeline (for example, an introduction, telling a story, or testing recall). An \emph{action} is the smallest executable unit inside a step (for example, trigger a gesture, play audio, or speak a prompt).
I define the elements in this hierarchy as follows. A \emph{study} is the top-level container that groups related experiments. An \emph{experiment} is one independently runnable protocol within that study (for example, a control or experimental condition). A \emph{step} is one phase of the protocol timeline (for example, an introduction, telling a story, or testing information recall). An \emph{action} is the smallest executable unit inside a step (for example, trigger a gesture, play audio, or speak a prompt).
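As one illustration of how this hierarchy could be represented as data, the sketch below gives a possible set of type definitions. The field and type names are assumptions chosen for exposition, not the platform's actual schema.

\begin{verbatim}
// One possible data representation of the four-level hierarchy.
// Field names are assumptions for exposition, not the actual schema.
type Action =
  | { kind: "speak"; text: string }
  | { kind: "gesture"; name: string }
  | { kind: "wizardPrompt"; instruction: string };

interface Step {
  name: string;      // e.g., "Intro", "Story Telling", "Recall Test"
  actions: Action[]; // smallest executable units within the step
}

interface Experiment {
  name: string;      // one independently runnable protocol
  steps: Step[];     // ordered phases of the protocol timeline
}

interface Study {
  title: string;     // top-level container for related experiments
  experiments: Experiment[];
}
\end{verbatim}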
Figure~\ref{fig:experiment-hierarchy} shows a representation of this hierarchical structure for social robotics studies. Reading top-down, one study contains one or more experiments, each experiment contains one or more steps, and each step contains one or more actions. Figure~\ref{fig:trial-instantiation} shows the protocol-versus-instance separation in isolation. The left column holds the protocol designed once before the study begins; the right column shows the separate trial records produced each time a participant runs it. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed.
Figure~\ref{fig:experiment-hierarchy} shows this hierarchical structure. Reading top-down, one study contains one or more experiments, each experiment contains one or more steps, and each step contains one or more actions.
To illustrate how the schema can be used with a concrete example, consider an interactive storytelling study with the research question: \emph{Does robot interaction modality influence participant recall performance?} The two conditions differ in how the robot looks and behaves: NAO6 has a human-like form and uses expressive gestures, while TurtleBot is visibly machine-like with no social movement cues. This keeps the narrative task the same across both conditions while changing only how the robot delivers it.
Figure~\ref{fig:trial-instantiation} illustrates how a protocol definition relates to its instantiation. The left column holds the protocol, defined before the study begins; the right column shows how that protocol is instantiated as independent trial records. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed.
Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same ordered steps (Intro, Story Telling, Recall Test), and each step contains actions. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure. Figures~\ref{fig:experiment-hierarchy}, \ref{fig:trial-instantiation}, and~\ref{fig:example-hierarchy} together progress from abstract schema, to protocol-versus-instance separation, to a concrete instantiation.
To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how well a human remembers it?} The two experiments use different robots: the NAO6, a humanoid robot with expressive gestures and a human-like form, and the TurtleBot, a wheeled mobile robot that is visibly machine-like with no social movement cues. The narrative task remains the same across both experiments; only how the robot delivers it changes.
Figure~\ref{fig:example-hierarchy} maps the study presented above onto the hierarchical elements defined in Figure~\ref{fig:experiment-hierarchy}. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same sequence of ordered steps (Intro, Story Telling, Recall Test), and each step defines the specific actions the robot will perform. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure.
Together, these three figures motivate why the hierarchy is useful in practice and are interrelated as follows: Figure~\ref{fig:experiment-hierarchy} defines the experimental structure as an abstraction; Figure~\ref{fig:trial-instantiation} shows how the abstract experimental structure is instantiated as concrete trial records; and Figure~\ref{fig:example-hierarchy} shows the expansion of each element of the experimental structure.
\begin{figure}[htbp]
\centering
@@ -134,7 +138,7 @@ Figure~\ref{fig:example-hierarchy} maps that study onto the same hierarchy. The
\label{fig:example-hierarchy}
\end{figure}
Together, these three figures motivate why the hierarchy is useful in practice. The layered structure lets researchers define protocols at any level of granularity without writing code, which keeps the tool accessible to non-programmers. The step and action levels also align naturally with trial flow, so the wizard stays guided by the protocol while retaining control over timing, which supports the real-time control requirement. Action-level execution provides a natural unit for timestamped logging and post-trial analysis, satisfying the automated logging requirement. Finally, keeping experiment definitions separate from trial instances means the same protocol can be reproduced across participants and conditions, supporting both the integrated workflow and collaborative support requirements.
The layered structure compels researchers to define experimental protocols at multiple levels of granularity without writing code, which creates a process that is accessible to non-programmers. The step and action elements also align naturally with the sequence of events in a trial, so the wizard stays guided by the protocol while retaining control over the timing of each event, which supports the real-time control requirement (R3). Action-level execution provides a natural unit for timestamped logging and post-trial analysis, satisfying the automated logging requirement (R4). Finally, keeping experiment definitions separate from trial instances means the same protocol can be reproduced across participants and experiments, supporting both the integrated workflow (R1) and collaborative support (R6) requirements.
\section{Event-Driven Execution Model}
@@ -217,9 +221,9 @@ To achieve real-time responsiveness while maintaining methodological rigor (R3,
\label{fig:event-driven-timeline}
\end{figure}
This approach has several implications. First, not all trials of the same experiment will have identical timing or duration; the length of a learning task, for example, depends on the participant's progress. The system records the actual timing of actions, permitting researchers to capture these natural variations in their data. Second, the event-driven model enables the wizard to respond contextually without departing from the protocol; the wizard remains guided by the sequence of available actions while having control over when to advance based on participant cues.
This approach has several implications. What the event-driven model guarantees is not identical timing across trials, but consistent action ordering: every participant experiences the same sequence of protocol steps, even if the pace varies. Timing is recorded accurately, permitting researchers to analyze natural variation across participants. The wizard responds contextually without departing from the protocol; the wizard remains guided by the sequence of available actions while retaining control over when to advance based on participant cues.
The system guides the wizard through the protocol step-by-step, ensuring the intended sequence is followed. Every action is logged with a timestamp whether it was scripted or not, and anything outside the protocol is flagged as a deviation. This means inconsistent wizard behavior shows up in the data rather than disappearing into it.
The system guides the wizard through the protocol step-by-step, ensuring the intended sequence is followed. Every action is logged with a timestamp whether it was scripted or not, and anything outside the protocol is flagged as a deviation. This means inconsistent wizard behavior can be evident in the data rather than disappearing into it.
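The sketch below illustrates this execution model in miniature, assuming a wizard-triggered advance method, a separate method for unscripted actions, and a simple in-memory log; the names are illustrative rather than the execution engine's actual API.

\begin{verbatim}
// Minimal sketch of wizard-paced execution with deviation-flagged logging.
// Names and structures are illustrative, not the engine's actual API.
interface LogEntry {
  timestamp: number;  // milliseconds since trial start
  stepIndex: number;
  action: string;
  deviation: boolean; // true if the action was outside the script
}

class TrialRunner {
  private log: LogEntry[] = [];
  private stepIndex = 0;
  private readonly startedAt = Date.now();

  constructor(private readonly script: string[][]) {}

  // The wizard, not a timer, decides when the next scripted action runs.
  advance(): void {
    const step = this.script[this.stepIndex];
    if (!step) return;
    const action = step.shift();
    if (action !== undefined) this.record(action, false);
    if (step.length === 0) this.stepIndex += 1;
  }

  // Anything outside the script is still executed and logged, but flagged.
  improvise(action: string): void {
    this.record(action, true);
  }

  private record(action: string, deviation: boolean): void {
    this.log.push({
      timestamp: Date.now() - this.startedAt,
      stepIndex: this.stepIndex,
      action,
      deviation,
    });
  }
}
\end{verbatim}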
\section{Modular Interface Architecture}
@@ -229,19 +233,19 @@ Researchers interact with the system through three interfaces, each one encapsul
The \emph{Design} interface gives researchers a drag-and-drop canvas for building experiment protocols, creating a visual programming environment. Researchers drag pre-built action components, including robot movements, speech, wizard instructions, and conditional logic, onto the canvas and drop them into sequence. Clicking a component opens a side panel where its parameters can be set, such as the text for a speech action or the gesture name for a movement.
By treating experiment design as a visual specification task, the interface lowers technical barriers (R2). Researchers can assemble interaction logic by dragging components into sequence and setting parameters in plain language, without writing code. The resulting protocol specification is also human-readable and shareable alongside research results. The specification is stored in a structured format that can be displayed as a timeline for analysis and executed directly by the platform's runtime.
By treating experiment design as a visual specification task, the interface lowers technical barriers (R2). Researchers can assemble interaction logic by dragging components into sequence and setting parameters naturally, without even having to write code. The resulting protocol specification is also human-readable and shareable alongside experimental results. The specification is stored in a structured format that can be displayed as a timeline for analysis and executed directly by the platform's runtime. This property is central to reproducibility: a third party with access to the specification can run the experiment faithfully without reverse-engineering the original system.
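For illustration, a single designed step might serialize to something like the structured form below; the exact field names and storage format are assumptions for this example, not the platform's actual schema.

\begin{verbatim}
// Illustrative serialization of one designed step; the exact storage
// format and field names are assumptions for this example.
const storyTellingStep = {
  name: "Story Telling",
  actions: [
    { kind: "speak", text: "Kai found a red rock on Mars." },
    { kind: "gesture", name: "point" },
    { kind: "wizardPrompt", instruction: "Wait for the human subject to react." },
  ],
} as const;
\end{verbatim}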
\subsection{Execution Interface}
During trials, the Execution interface shows the wizard exactly where they are in the protocol: the current step, the available actions, and the robot's current state, all updated in real time as the trial progresses.
During trials, the \emph{Execution} interface keeps the wizard informed of exactly where they are in the protocol. The current step, the available actions, and the robot's current state are all updated in real time as the trial progresses.
The Execution interface also exposes a set of manual controls for actions that fall outside the scripted protocol. Consider a participant who asks an unexpected question mid-trial: the wizard can trigger an unscripted speech response on the spot rather than leaving the interaction to stall. This keeps the interaction feeling natural for the participant. Critically, the system does not simply ignore these moments. Every unscripted action is timestamped and written to the trial log as an explicit deviation, giving researchers a complete picture of what actually happened versus what was planned. This makes unscripted actions a feature rather than a source of noise: the wizard retains real-time control over the interaction, and the logging infrastructure captures everything needed for post-trial analysis.
The \emph{Execution} interface also exposes a set of manual controls for actions that fall outside the scripted protocol. A \emph{deviation} is a spontaneous action introduced by the wizard in response to a reaction of the human subject that was not anticipated when the script was created. Consider a human subject who asks an unexpected question mid-trial: the wizard can trigger an unscripted speech response on the spot rather than leaving the interaction to stall, keeping the interaction feeling natural for the human subject. Critically, the system does not ignore these deviations from the script. Every deviation is timestamped and written to the trial log, giving researchers a complete picture of what actually happened versus what was planned. This makes unscripted actions a feature rather than a source of noise: the wizard retains real-time control over the interaction, and the logging infrastructure captures everything needed for post-trial analysis.
Additional researchers can simultaneously access this same live view through the platform's Dashboard by selecting a trial to ``spectate.'' Multiple researchers observing the same trial view the identical synchronized display of the wizard's controls, participant interactions, and robot state, supporting real-time collaboration and interdisciplinary observation (R6). Observers can take notes and mark significant moments without interfering with the wizard's control or the participant's experience.
Additional researchers can simultaneously access a live view of a trial through the platform's Dashboard by selecting a trial to ``spectate.'' Multiple researchers observing the same trial view an identical synchronized display of the wizard's controls, human subject interactions, and robot state, supporting real-time collaboration and interdisciplinary observation (R6). Observers can take notes and mark significant moments without interfering with the wizard's control or the human subject's experience.
\subsection{Analysis Interface}
After a trial concludes, the \emph{Analysis} interface lets researchers review everything that was recorded: video of the interaction, audio, timestamped action logs, and robot sensor data, all scrubable from a single timeline. Researchers can annotate significant moments and export segments for further analysis. Because the same platform produced both the protocol and the recording, the interface can show exactly where the execution matched the design and where it deviated, without any manual cross-referencing.
After a trial concludes, the \emph{Analysis} interface lets researchers review everything that was recorded: video of the interaction, audio, timestamped action logs, and robot sensor data, all scrubable from a single timeline. Researchers can annotate significant moments and export segments for further analysis. Because the same platform produced both the protocol and the recording, the interface eliminates the need for manual cross-referencing by showing exactly where the execution matched the design and where it deviated.
\section{Data Flow and Infrastructure Implementation}
@@ -249,7 +253,7 @@ To ensure that data from every experimental phase remains traceable, the system
\subsection{Architectural Layers}
The system is structured as a three-layer architecture, each with a specific responsibility:
Like the ISO/OSI reference model for networking software, HRIStudio separates its communicative and functional responsibilities into distinct layers, as shown in Figure~\ref{fig:three-tier}. More specifically, the system is organized as a three-layer architecture, each layer with a specific responsibility:
\begin{description}
\item[User Interface layer.] Runs in researchers' web browsers and exposes the three interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging and dropping experiment components, and reviewing experimental results.
@@ -274,18 +278,18 @@ This separation of concerns provides two concrete benefits. First, each layer ca
% Layer 2: Logic
\node[layer, fill=gray!30] (logic) at (0, 1.8) {
\textbf{Application Logic}\\[0.1cm]
{\small Execution, Authentication, Logger}
{\small Trial Engine, Authentication, Logger}
};
% Layer 3: Data
\node[layer, fill=gray!45] (data) at (0, 0.1) {
\textbf{Data \& Robot Control}\\[0.1cm]
{\small Database, File Storage, ROS}
};
% Arrows
\draw[arrow] (ui.south) -- (logic.north);
\draw[arrow] (logic.south) -- (data.north);
% Arrows (bidirectional)
\draw[<->, thick, line width=1.5pt] (ui.south) -- (logic.north);
\draw[<->, thick, line width=1.5pt] (logic.south) -- (data.north);
\end{tikzpicture}
\caption{Three-layer architecture separates user interface, application logic, and data/robot control.}
@@ -296,9 +300,18 @@ This separation of concerns provides two concrete benefits. First, each layer ca
During the design phase, researchers create experiment specifications that are stored in the system database. During a trial, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete records. After the trial, researchers can inspect these records through the Analysis interface.
The flow of data during a trial proceeds through six distinct phases, as shown in Figure~\ref{fig:trial-dataflow}. First, a researcher creates an experiment protocol using the Design interface. Second, when a trial begins, the application server loads the protocol and begins stepping through it, sending commands to the robot and waiting for events such as wizard inputs, sensor readings, or timeouts. Third, every action, both planned protocol steps and unexpected events, is immediately written to the trial log with precise timing information. Fourth, the Execution interface continuously displays the current state, allowing the wizard and observers to monitor the progress of a trial in real-time. Fifth, when the trial concludes, all recorded media (video and audio) is transferred from the browser to the server and persisted in a database as part of the trial record. Sixth, the Analysis interface retrieves the stored trial data and reconstructs exactly what happened, synchronizing notable events with the video and audio recordings.
The flow of data during a trial proceeds through six distinct phases, as shown in Figure~\ref{fig:trial-dataflow}:
This design ensures comprehensive documentation of every trial, supporting both fine-grained analysis and reproducibility. Researchers can review not just what they intended to happen, but what actually did happen, including timing variations and unexpected events.
\begin{enumerate}
\item A researcher creates an experiment protocol using the Design interface.
\item When a trial begins, the application server loads the protocol and allows the wizard to step through it, sending commands to the robot and waiting for events such as wizard inputs, sensor readings, or timeouts.
\item Every action, both planned protocol steps and deviations, is immediately written to the trial log with precise timing information.
\item The \emph{Execution} interface continuously displays the current state, allowing the wizard and observers to monitor the progress of a trial in real-time.
\item When the trial concludes, all recorded media (video and audio) is transferred from the browser to the server and persisted in a database as part of the trial record.
\item The \emph{Analysis} interface retrieves the stored trial data and reconstructs exactly what happened, synchronizing notable events with the video and audio recordings.
\end{enumerate}
This design automatically creates comprehensive documentation of every trial, supporting both fine-grained analysis and reproducibility. Researchers can review not just what they intended to happen, but what actually did happen, including timing variations and deviations.
\begin{figure}[htbp]
\centering
@@ -7,13 +7,13 @@ HRIStudio is a complete, operational platform that realizes the design principle
HRIStudio follows the model of a web application. Users access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connect to the same shared system. This eliminates the need for a local installation and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables easy collaboration (R6): multiple team members can access experiment data and observe trials simultaneously from different machines without any additional configuration.
I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is shown in Figure~\ref{fig:three-tier}. In the implementation of this architecture, it is essential that the application server and the robot control hardware run on the same local network. This keeps communication latency low during trials: a noticeable delay between the wizard's input and the robot's response would break the interaction.
I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is presented in Chapter~\ref{ch:design} and shown in Figure~\ref{fig:three-tier}. In practice, the User Interface layer runs in each researcher's browser (the client), while the Application Logic and Data \& Robot Control layers run on a shared application server. It is essential that this server and the robot control hardware run on the same local network. This keeps communication latency low during trials: a noticeable delay between the wizard's input and the robot's response would break the interaction.
I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
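As a small illustration of that benefit, consider a single type shared by the browser client and the server. The names below are assumed for this sketch and do not reflect the project's actual definitions.

\begin{verbatim}
// Sketch of a shared type; names are assumed, not the project's actual code.
// Both the browser client and the server import this one definition.
export interface TrialEvent {
  trialId: string;
  timestamp: number;
  action: string;
  deviation: boolean;
}

// If a field is renamed or its type changes, every use site on both the
// client and the server fails to compile instead of failing mid-trial.
export function describeEvent(event: TrialEvent): string {
  const flag = event.deviation ? " (deviation)" : "";
  return `[${event.timestamp} ms] ${event.action}${flag}`;
}
\end{verbatim}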
\section{Experiment Storage and Trial Logging}
The system saves experiments to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification. In this chapter, a trial means one concrete run of an experiment protocol with one human subject; this is where spontaneous wizard deviations can occur.
The system saves experiment descriptions to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification.
When a trial begins, the system creates a new trial record linked to that experiment. The system writes every action the wizard triggers to that record with a precise timestamp, whether scripted or not, and flags any action triggered outside the protocol as a deviation. The Execution interface records video, audio, and robot sensor data alongside the action log for the duration of the trial. The Analysis interface can directly compare what was planned against what was executed for any trial, without any manual work by the researcher, because the trial record and the experiment reference the same underlying specification. Figure~\ref{fig:trial-record} shows the structure of a completed trial record: action log entries, video, audio, and robot sensor data all share a common timestamp reference so the Analysis interface can align them without manual synchronization; dashed lines mark step boundaries; and the system flags any deviation from the experiment specification at the appropriate position in the timeline.
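As an illustration of what the shared timestamp reference enables, the sketch below shows how an analysis view could translate a log entry's trial-relative timestamp into a video offset. The names are assumptions for this example, not the actual implementation.

\begin{verbatim}
// Sketch: align a logged action with the trial video using the shared
// timestamp reference. Names are assumed for illustration only.
interface TrialMeta {
  trialStart: number; // epoch milliseconds when the trial began
  videoStart: number; // epoch milliseconds when video recording began
}

// Convert a log entry's trial-relative timestamp into a video offset
// (in seconds) that a player can seek to directly.
function videoOffsetSeconds(meta: TrialMeta, logTimestampMs: number): number {
  const epochMs = meta.trialStart + logTimestampMs;
  return Math.max(0, (epochMs - meta.videoStart) / 1000);
}
\end{verbatim}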
@@ -93,9 +93,9 @@ The system stores structured and media data separately. Experiment specification
The execution engine is the component that runs a trial: it loads the experiment, manages the wizard's connection, sends robot commands, and keeps all connected clients in sync.
When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The execution engine does not advance through the actions of an experiment on a timer; instead, the wizard controls how time advances from action to action. This preserves the natural pacing of the interaction: the wizard advances only when the participant is ready, while the experiment structure ensures the protocol is followed. When the wizard triggers an action, the server sends the related command to the robot, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation, keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The wizard controls how time advances from action to action. When the wizard triggers an action, the server sends the related command to the robot, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation, keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
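A sketch of that combined operation appears below, assuming a generic broadcast channel to connected clients; all names are illustrative rather than the engine's actual code.

\begin{verbatim}
// Sketch of handling one wizard-triggered action as a single operation:
// command the robot, write the log entry, and push the new state to all
// connected clients. All names here are assumptions for illustration.
interface Client { send(stateJson: string): void; }

async function handleWizardAction(
  action: string,
  robot: { execute(a: string): Promise<void> },
  log: { append(entry: object): Promise<void> },
  clients: Client[],
): Promise<void> {
  await robot.execute(action);                         // command the robot
  const entry = { action, timestamp: Date.now() };
  await log.append(entry);                             // write the log entry
  const state = JSON.stringify({ lastAction: entry }); // updated trial state
  for (const c of clients) c.send(state);              // sync every viewer
}
\end{verbatim}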
No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. A fully programmed robot has no answer for that third subject: the interaction stalls, or immersion breaks. The wizard exists to fill that gap: where the program runs out of instructions, the wizard draws on their knowledge of human social interaction to keep the exchange coherent. Unscripted actions give the wizard the tools to exercise that judgment in the moment. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. Unscripted actions give the wizard the tools to respond when deviations from the script are required, while the system records how these interactions unfold. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
\section{Robot Integration}
@@ -152,7 +152,7 @@ Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and Tur
\section{Access Control}
I implemented access control using a role-based access control (RBAC) model with two layers. System-level roles govern what a user can do across the platform (administrator, researcher, wizard, observer), while study-level roles govern what a user can see and do within a specific study (owner, researcher, wizard, observer). The two layers are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. Within a study, the four study-level roles define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires.
I implemented access control using a role-based access control (RBAC) model with two layers. System-level roles govern what a user can do across the platform (administrator, researcher, wizard, observer), while study-level roles govern what a user can see and do within a specific study (owner, researcher, wizard, observer). The two layers are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. Within a study, the four study-level roles define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires. The capabilities and constraints for each role are described below:
\begin{description}
\item[Owner.] Full control over the study: can invite or remove members, configure the study settings, and access all data.
@@ -175,7 +175,7 @@ The following two problems required specific solutions during implementation.
\section{Implementation Status}
HRIStudio is fully operational for controlled Wizard-of-Oz studies. The Design, Execution, and Analysis interfaces are complete and integrated. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. A researcher can design an experiment, run a live trial with a wizard, and review the resulting logs and recordings without modification to the platform's core architecture or execution workflow.
HRIStudio is fully operational for controlled Wizard-of-Oz studies. The Design, Execution, and Analysis interfaces are complete and integrated with one another. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. A researcher can design an experiment, run a live trial with a wizard, and review the resulting logs and recordings without modification to the platform's core architecture or execution workflow.
Work remaining for future development includes broader validation of the plugin file approach on robot platforms beyond NAO6.
@@ -7,27 +7,29 @@ This chapter presents the pilot validation study used to evaluate whether HRIStu
The validation study targets the two problems established in Chapter~\ref{ch:background}. The first is the \emph{Accessibility Problem}: existing tools require substantial programming expertise, which prevents domain experts from conducting independent HRI studies. The second is the \emph{Reproducibility Problem}: without structured logging and protocol enforcement, experiment execution varies across participants and wizards in ways that are difficult to detect or control after the fact.
These problems give rise to two research questions. The first asks whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second asks whether HRIStudio produces more reliable execution of that interaction compared to standard practice.
These problems give rise to two research questions. The first is whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second is whether HRIStudio produces more reliable execution of that interaction compared to standard practice.
I hypothesized that wizards using HRIStudio would more completely and correctly implement the written specification, and that their designs would execute more reliably during the trial, compared to wizards using ad-hoc programs created for specific social robotics experiments, with Choregraphe as the baseline tool in this study.
I hypothesized that HRIStudio would improve both accessibility and reproducibility compared to Choregraphe: wizards using HRIStudio would more completely and correctly implement the written specification, and their designs would execute more reliably during the trial.
\section{Study Design}
I used what Bartneck et al.~\cite{Bartneck2024} call a between-subjects design, in which each participant is assigned to only one condition. I randomly assigned each wizard participant to one of two conditions: HRIStudio or Choregraphe. Both groups received the same task, the same time allocation, and the same training structure. Measuring each participant in only one condition prevents carryover effects, meaning performance changes caused by prior exposure to another condition rather than by the assigned condition itself.
I used what Bartneck et al.~\cite{Bartneck2024} call a between-subjects design, in which each participant is assigned to only one condition. I randomly assigned each wizard participant to one of two conditions: HRIStudio or Choregraphe. Both groups received the same task, the same time allocation, and a similar training structure. Measuring each participant in only one condition prevents carryover effects, meaning performance changes caused by prior exposure to another condition rather than by the assigned condition itself.
\section{Participants}
\textbf{Wizards.} I recruited six Bucknell University faculty members drawn from across departments to serve as wizards. I deliberately recruited from both ends of the programming experience spectrum, targeting participants with substantial programming backgrounds as well as those who described themselves as non-programmers or having minimal coding experience. This cross-departmental recruitment was intentional. A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
\textbf{Wizards.} A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; testing this claim with its intended user population was therefore a primary goal of participant recruitment. I recruited six Bucknell University faculty members drawn from across departments to serve as wizards, deliberately targeting both ends of the programming experience spectrum: those with substantial programming backgrounds as well as those who described themselves as non-programmers or having minimal coding experience. Drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email. Participation was framed as a voluntary software evaluation unrelated to any professional obligations.
The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email, and participation was framed as a voluntary software evaluation unrelated to any professional obligations.
\textbf{Sample size rationale.} With six wizard participants ($N = 6$), this sample size is appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. The size matches the scope and constraints of this honors thesis: two academic semesters, one undergraduate researcher, and no funded research assistant support. It also reflects the target population and recruitment context. Faculty domain experts outside computer science with no prior NAO or Choregraphe experience are a limited pool at a small liberal arts university and have high competing time demands. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{HoffmanZhao2021}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive proof.
\textbf{Sample size rationale.} I chose to recruit six wizard participants ($N = 6$), believing that this sample size is appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{HoffmanZhao2021}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive substantiation of any claims.
\section{Task}
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a glowing rock on Mars, asks a comprehension question, and delivers a response according to the answer given. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:blank_templates}.
The task chosen was to have a robot tell a story to a human subject and later evaluate whether that subject could recall a specific detail.
The task was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a red rock on Mars, asks a recall question, and delivers a response according to the answer given. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:blank_templates}. This scenario is representative of HRI tasks in which a robot conveys information to a human subject; one might, for example, measure whether a robot or human storyteller produces better recall in subjects.
This scenario was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
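To make the required structure concrete, the sketch below encodes the Interactive Storyteller protocol as a minimal sequential script with a single conditional branch. This is an illustrative representation only, not the project format of HRIStudio or Choregraphe; the step names, action labels, and speech placeholders are invented for readability, and the authoritative wording is in the specification reproduced in Appendix~\ref{app:blank_templates}.

\begin{verbatim}
# Illustrative Python encoding of the Interactive Storyteller protocol.
# Not the HRIStudio or Choregraphe project format; all names are placeholders.
storyteller = [
    {"step": "introduction",
     "actions": [("say", "<introduce the astronaut Kai>"), ("gesture", "wave")]},
    {"step": "narrative",
     "actions": [("say", "<Kai discovers a red rock on Mars>"),
                 ("gesture", "narrative gesture")]},
    {"step": "question",
     "actions": [("pause", 2.0), ("say", "<comprehension question>")]},
    {"step": "branch",                       # resolved by the wizard at run time
     "condition": "answer_is_correct",
     "if_true":  [("say", "<correct-answer response>"), ("gesture", "nod")],
     "if_false": [("say", "<incorrect-answer response>"),
                  ("gesture", "head shake")]},
    {"step": "conclusion",
     "actions": [("say", "<closing line>")]},
]
\end{verbatim}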
\section{Robot Platform and Software Apparatus}
@@ -56,7 +58,7 @@ I opened each session with a standardized tutorial tailored to the wizard's assi
\subsection{Phase 2: Design Challenge (30 minutes)}
The wizard received the paper specification and had 30 minutes to implement it using their assigned tool. Using a structured observer data sheet, I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
The wizard received the specification and had 30 minutes to implement it using their assigned tool. Using a structured observer data sheet (found in Appendix~\ref{app:blank_templates}), I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
\subsection{Phase 3: Live Trial (10 minutes)}
@@ -64,32 +66,32 @@ After the design phase, the wizard ran their completed program to execute the de
\subsection{Phase 4: Debrief (5 minutes)}
Following the trial, the wizard completed the System Usability Scale survey. The DFS and ERS were scored during and immediately after the session using live observation and the Observer Data Sheet.
Following the trial, the wizard completed the System Usability Scale survey (found in Appendix~\ref{app:blank_templates}). The DFS and ERS were scored during and immediately after the session using live observation and the Observer Data Sheet.
\section{Measures}
\label{sec:measures}
The study collected five measures, two primary and three supplementary, operationalized through five instruments.
The study collected five measures, two primary and three supplementary, operationalized through five instruments. They are described as follows.
\subsection{Design Fidelity Score}
The Design Fidelity Score (DFS) measures how completely and correctly the wizard implemented the paper specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored as present, correct, and independently achieved.
I define the Design Fidelity Score (DFS) as a measure of how completely and correctly the wizard implemented the specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored as present, correct, and independently achieved.
The DFS rubric includes an \emph{Assisted} column. For each rubric item, the researcher marks T if a tool-operation intervention was given specifically for that item during the design phase (for example, if the researcher explained how to add a gesture node or how to wire a conditional branch). T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions (task clarification, hardware issues, or momentary forgetfulness) are not marked T, because those categories of difficulty are independent of the tool under evaluation.
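As a concrete illustration of this rule, the minimal sketch below tallies rubric items into a DFS total while keeping T marks as a separate report. The field names and the half-credit treatment of present-but-incorrect items are assumptions chosen for readability; the authoritative weights and wording are in the rubric in Appendix~\ref{app:blank_templates}.

\begin{verbatim}
# Illustrative DFS tally (Python). Item weights are placeholders, not the
# thesis rubric; only the separation of T marks from points is the point.
def score_dfs(items):
    # items: dicts with keys "name", "weight", "present", "correct", "assisted_T"
    points, t_marks = 0.0, []
    for item in items:
        if item["present"] and item["correct"]:
            points += item["weight"]          # full credit
        elif item["present"]:
            points += item["weight"] / 2      # partial credit: present, not correct
        if item["assisted_T"]:
            t_marks.append(item["name"])      # reported separately, never deducted
    return points, t_marks
\end{verbatim}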
This measure is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses accessibility: did the tool allow a wizard to independently produce a correct design?
DFS is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses the question: did the tool allow a wizard to independently produce a correct design?
\subsection{Execution Reliability Score}
The Execution Reliability Score (ERS) measures whether the designed interaction executed as intended during the live trial. I scored the ERS live and immediately after the session, using the Observer Data Sheet and the wizard's exported project file. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch was present in the design and executed during the trial, and whether any errors, disconnections, or hangs occurred.
I define the Execution Reliability Score (ERS) as a measure of whether the designed interaction executed as intended during the live trial. I scored the ERS live and immediately after the session, using the Observer Data Sheet and the wizard's exported project file. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch was present in the design and executed during the trial, and whether any errors, disconnections, or hangs occurred.
The ERS rubric applies the same \emph{Assisted} modifier as the DFS, extended to the trial phase. Any tool-operation intervention I provided during the trial (for example, explaining to the wizard how to launch or advance their program) caps the affected ERS item at half points. This is scored separately from design-phase interventions: a wizard who needed help only during design can still achieve a full ERS score if the trial runs without assistance, and vice versa. The rubric records whether the trial reached its conclusion step. I additionally note whether any branch resolved through programmed conditional logic or through manual intervention by the wizard during execution.
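A minimal sketch of the trial-phase modifier, under the same placeholder assumptions as above: unlike the DFS, where T marks are only reported, a T-type intervention during the trial limits the affected ERS item to half of its weight.

\begin{verbatim}
# Illustrative ERS item scoring (Python). "earned" is the score the item
# would receive on its own merits; a trial-phase T intervention caps it.
def score_ers_item(weight, earned, assisted_T_in_trial):
    return min(earned, weight / 2) if assisted_T_in_trial else earned
\end{verbatim}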
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses reproducibility: did the design translate reliably into execution without researcher support?
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses the question: did the design translate reliably into execution without researcher support?
\subsection{System Usability Scale}
The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability \cite{Brooke1996}. Wizards completed the SUS after the debrief phase. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:blank_templates}.
The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability, created by Brooke~\cite{Brooke1996}. Wizards completed the SUS after the debrief phase. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:blank_templates}.
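For reference, standard SUS scoring converts the ten 1-to-5 responses into a 0-to-100 score: each odd-numbered (positively worded) item contributes its response minus one, each even-numbered item contributes five minus its response, and the summed contributions are multiplied by 2.5. A short sketch of that calculation:

\begin{verbatim}
# Standard SUS scoring (Brooke, 1996); responses are integers 1-5 per item.
def sus_score(responses):
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5
\end{verbatim}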
\subsection{Intervention Log and Session Timing}
@@ -134,4 +136,4 @@ Session Timing & Actual duration of each phase; time to design completion & Thro
\section{Chapter Summary}
This chapter described a pilot between-subjects study I designed to test whether the design principles formalized in Chapters~\ref{ch:design} and~\ref{ch:implementation} produce measurably different outcomes from existing practice. Six wizard participants ($N = 6$), drawn from across departments and spanning the programming experience spectrum, each designed and ran the Interactive Storyteller task on a NAO robot using either HRIStudio or Choregraphe. Each 60-minute session was structured in four phases: a 15-minute standardized tutorial, a 30-minute design challenge, a 10-minute live trial, and a 5-minute debrief. I measured design fidelity (DFS) and execution reliability (ERS) against the written specification, applying a per-item scoring modifier that caps any rubric criterion for which tool-operation assistance was given. I also collected perceived usability via the SUS, a structured intervention log categorizing all researcher assistance by type, and session phase timings. Chapter~\ref{ch:results} presents the results.
This chapter described the structure of a pilot between-subjects study I designed to test whether the design principles formalized in Chapters~\ref{ch:design} and~\ref{ch:implementation} produce measurably different outcomes from existing practice. Six wizard participants ($N = 6$), drawn from across departments and spanning the programming experience spectrum, each designed and ran the Interactive Storyteller task on a NAO robot using either HRIStudio or Choregraphe. Each 60-minute session was structured in four phases: a 15-minute standardized tutorial, a 30-minute design challenge, a 10-minute live trial, and a 5-minute debrief. I measured design fidelity (DFS) and execution reliability (ERS) against the written specification, applying a per-item scoring modifier that caps any rubric criterion for which tool-operation assistance was given. I also collected perceived usability via the SUS, a structured intervention log categorizing all researcher assistance by type, and session phase timings. Chapter~\ref{ch:results} presents the results.
@@ -1,11 +1,11 @@
\chapter{Results}
\label{ch:results}
This chapter presents the results of the pilot validation study described in Chapter~\ref{ch:evaluation}. Because this is a small pilot, I report descriptive statistics and qualitative observations rather than inferential tests. The goal is directional evidence: do the patterns in the data suggest that HRIStudio changes what wizards can produce and how reliably they can produce it?
This chapter presents the results of the pilot validation study described in Chapter~\ref{ch:evaluation}. Because this is a small pilot, I report descriptive statistics and qualitative observations rather than inferential tests. The goal is directional evidence: the chapter reports whether patterns in the data consistently favor HRIStudio across the primary and supplementary measures.
\section{Participant Overview}
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University faculty members drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment.
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University professors drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment.
\begin{table}[htbp]
@@ -32,39 +32,47 @@ W-06 & HRIStudio & Computer Science & Extensive & 100 & 100 & 70 \\
\label{tbl:sessions}
\end{table}
The table also presents the numerical results of the study, which are discussed next.
\section{Primary Measures}
\subsection{Design Fidelity Score}
\subsection{Design Fidelity Score (DFS)}
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification. Scores range from 0 to 100, with full points awarded only when a component is both present and correct.
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification of the experiment they received. Scores range from 0 to 100, with full points awarded only when a component — a rubric criterion representing a required speech action, gesture, or control-flow element — is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)
W-01 (Choregraphe, Digital Humanities, no programming experience) received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
Across the six participants, DFS scores divided sharply by condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.
W-02 (HRIStudio, Logic and Philosophy of Science, moderate programming) received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation.
W-01 received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
W-03 (Choregraphe, Computer Science, extensive programming) received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, with not all delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation.
W-02 received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation.
W-04 (Choregraphe, Chemical Engineering, moderate programming experience) received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15).
W-03 received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification (see Section~\ref{sec:measures}) was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, with not all delivered correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation.
W-05 (HRIStudio, Chemical Engineering, no programming experience) received a DFS of 100. The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification.
W-04 received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15).
W-06 (HRIStudio, Computer Science, extensive programming) received a DFS of 100. Two T-type interventions were logged during the design phase, both pertaining to item 6 (narrative gesture): at 15:21, W-06 attempted to use parallel execution for a gesture action and was unable to edit the action node; at 15:24, W-06 encountered difficulty resetting the robot's posture and was directed to recommended posture blocks. In both cases, W-06 resolved the issue independently after the initial prompt. W-06's programming background led to a more elaborate design than the specification required, including extra posture-reset actions that were ultimately redundant since the robot was already in the correct starting position; these additions did not affect scoring since all required actions were present and correct in the exported project file. The conditional branch was wired correctly, and all speech and gesture items matched the specification. W-06 completed the design in 21 minutes, within the 30-minute allocation.
W-05 received a DFS of 100. The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification.
W-06 received a DFS of 100. Two T-type interventions were logged during the design phase, both pertaining to item 6 (narrative gesture): at 15:21, W-06 attempted to use parallel execution for a gesture action and was unable to edit the action node; at 15:24, W-06 encountered difficulty resetting the robot's posture and was directed to recommended posture blocks. In both cases, W-06 resolved the issue independently after the initial prompt. W-06's programming background led to a more elaborate design than the specification required, including extra posture-reset actions that were ultimately redundant since the robot was already in the correct starting position; these additions did not affect scoring since all required actions were present and correct in the exported project file. The conditional branch was wired correctly, and all speech and gesture items matched the specification. W-06 completed the design in 21 minutes, within the 30-minute allocation.
Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100). Across the three Choregraphe sessions, DFS scores were 42.5, 65, and 62.5 (mean 56.7).
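Written out, the condition means are simple arithmetic means of the per-wizard scores:
\[
\overline{\mathrm{DFS}}_{\mathrm{HRIStudio}} = \frac{100 + 100 + 100}{3} = 100,
\qquad
\overline{\mathrm{DFS}}_{\mathrm{Choregraphe}} = \frac{42.5 + 65 + 62.5}{3} \approx 56.7.
\]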
\subsection{Execution Reliability Score}
\subsection{Execution Reliability Score (ERS)}
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial. W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial.
W-02 (HRIStudio) received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. This incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item scored 5 out of 10 on its own merits: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred.
Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with no session requiring tool-operation guidance to reach the interaction's conclusion. Choregraphe trials averaged 66.7, with branching failures or absences in two of three sessions and the study's only unprompted content deviation occurring in the third. The per-session details are as follows.
W-03 (Choregraphe) received an ERS of 60. The trial ran for approximately five minutes. Speech execution was partial: two of three items were present but not all delivered correctly. Gesture and speech synchronization was poor throughout the interaction; motion cues were present but did not coordinate reliably with corresponding speech actions. The conditional branch, absent from W-03's design, was not executed during the trial; the interaction proceeded without a branch resolution step. No system disconnections or crashes occurred.
W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
W-04 (Choregraphe) received an ERS of 75. The trial ran for approximately four minutes. Introduction and narrative speech executed correctly. The comprehension question, absent from the design, was not delivered; the interaction proceeded directly to the branch step. A T-type trial intervention was required to remind W-04 how to trigger the branch; the yes-branch response was delivered following that prompt, capping item 4 at 5/10 (T-assisted). Gesture execution was strong: introduction wave, narrative gesture, and nod or head shake all executed correctly. Speech and gesture synchronization scored full points. The pause before the comprehension question scored zero, as no question was delivered. No system errors occurred.
W-02 received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. This incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item scored 5 out of 10 on its own merits: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred.
W-05 (HRIStudio) received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. One trial intervention pair was logged: the researcher briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute on the NAO6 robot-side; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed.
W-03 received an ERS of 60. The trial ran for approximately five minutes. Speech execution was partial: two of three items were present but not all delivered correctly. Gesture and speech synchronization was poor throughout the interaction; motion cues were present but did not coordinate reliably with corresponding speech actions. The conditional branch, absent from W-03's design, was not executed during the trial; the interaction proceeded without a branch resolution step. No system disconnections or crashes occurred.
W-06 (HRIStudio) received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.
W-04 received an ERS of 75. The trial ran for approximately four minutes. Introduction and narrative speech executed correctly. The comprehension question, absent from the design, was not delivered; the interaction proceeded directly to the branch step. A T-type trial intervention was required to remind W-04 how to trigger the branch; the yes-branch response was delivered following that prompt, capping item 4 at 5/10 (T-assisted). Gesture execution was strong: introduction wave, narrative gesture, and nod or head shake all executed correctly. Speech and gesture synchronization scored full points. The pause before the comprehension question scored zero, as no question was delivered. No system errors occurred.
W-05 received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. One trial intervention pair was logged: the researcher briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute on the NAO6 robot-side; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed.
W-06 received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.
Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7). In the HRIStudio condition, branching was present in every design and executed correctly in every trial; no trial required tool-operation guidance from the researcher to complete. In the Choregraphe condition, branching was absent from two of three designs (W-03, W-04) and was resolved by manual redesign during the trial in the third (W-01).
@@ -167,4 +175,4 @@ W-06 approached the design with a programmer's instinct for thoroughness, initia
\section{Chapter Summary}
This chapter presented results from all six sessions of the pilot validation study. Across the three Choregraphe sessions (W-01, W-03, W-04), DFS scores were 42.5, 65, and 62.5 (mean 56.7); ERS scores were 65, 60, and 75 (mean 66.7); and SUS scores were 60, 75, and 42.5 (mean 59.2). Design phases in the Choregraphe condition averaged 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs, while W-04 was the only wizard cut off by the session time limit without finishing. Across the three HRIStudio sessions (W-02, W-05, W-06), DFS scores were 100, 100, and 100 (mean 100); ERS scores were 95, 95, and 100 (mean 96.7); and SUS scores were 90, 70, and 70 (mean 76.7). HRIStudio design phases averaged 21 minutes, all within the allocation. The only unprompted speech content deviation observed in the dataset occurred in the Choregraphe condition (W-01). Branching failures or absences appeared in two of three Choregraphe sessions (W-03, W-04) and in none of the three HRIStudio sessions. The direction of the evidence across all measures consistently favors HRIStudio. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. On accessibility, every HRIStudio wizard produced a perfect design without requiring tool-operation assistance, while all three Choregraphe wizards scored below 70 and the only wizard who did not complete the design before the session cutoff was in the Choregraphe condition. On execution consistency, HRIStudio trials reached their conclusion without researcher guidance in every case; Choregraphe produced branching failures or absences in two of three sessions and the study's only unprompted content deviation from the written specification in the third. Perceived usability followed the same split: all HRIStudio ratings exceeded the SUS benchmark of 68, while all Choregraphe ratings fell at or below it. Supplementary measures reinforced this pattern — HRIStudio design phases completed faster, generated fewer tool-operation interventions, and produced no incomplete designs, while Choregraphe consistently required more time and guidance to reach the same outcome. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
@@ -1,7 +1,7 @@
\chapter{Discussion}
\label{ch:discussion}
This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study. With all six sessions now complete, this chapter presents the full dataset and draws conclusions across the complete sample.
This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study.
\section{Interpretation of Findings}
@@ -11,7 +11,7 @@ The first research question asked whether HRIStudio enables domain experts witho
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a condition mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). With programming backgrounds now balanced across conditions---each condition contains one wizard with no programming experience, one with moderate experience, and one with extensive experience---a cross-background comparison is possible: W-01 (non-programmer, Choregraphe, SUS 60) versus W-05 (non-programmer, HRIStudio, SUS 70); W-04 (moderate programmer, Choregraphe, SUS 42.5) versus W-02 (moderate programmer, HRIStudio, SUS 90); W-03 (extensive programmer, Choregraphe, SUS 75) versus W-06 (extensive programmer, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the None and Moderate levels; at the Extensive level the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), all at or below the average usability benchmark of 68~\cite{Brooke1996}. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio condition produced the highest (90, W-02). With programming backgrounds now balanced across conditions---each condition contains one wizard with \emph{None}, one with \emph{Moderate}, and one with \emph{Extensive} programming experience---a cross-background comparison is possible: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference.
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility claim. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
@@ -39,7 +39,11 @@
\listoffigures
\abstract{
The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction (HRI) research to prototype and evaluate robot interaction designs before autonomous capabilities are fully developed. However, two persistent problems limit the technique's effectiveness. First, existing WoZ tools impose high technical barriers that prevent domain experts outside engineering from conducting independent studies (the Accessibility Problem). Second, the fragmented landscape of custom, robot-specific tools makes it difficult to verify or replicate experimental results across labs (the Reproducibility Problem). This thesis formalizes a set of design principles for WoZ infrastructure that address both problems simultaneously: a hierarchical specification model that organizes experiments as studies, experiments, steps, and actions; an event-driven execution model that separates protocol design from live trial control; and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are realized in HRIStudio, an open-source, web-based platform that provides a visual experiment designer, a guided wizard execution interface, automated timestamped logging with explicit deviation tracking, and role-based access control for research teams. A pilot between-subjects study compared HRIStudio against Choregraphe, a representative baseline tool, using six faculty participants who each designed and executed an interactive storytelling task on a NAO robot. Across all six sessions, HRIStudio participants achieved higher design fidelity (mean 100 vs. 56.7), higher execution reliability (mean 96.7 vs. 66.7), and higher perceived usability (mean SUS 76.7 vs. 59.2) than Choregraphe participants. The only unprompted specification deviation in the dataset occurred in the Choregraphe condition, illustrating the reproducibility failure mode HRIStudio's enforcement model is designed to prevent. While the pilot scale precludes inferential claims, the directional evidence across all measures suggests that the right software architecture can make WoZ experiments more accessible to non-programmers and more reproducible across executions.
The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction research to prototype and evaluate robot interaction designs before autonomous capabilities are fully developed. However, two persistent problems limit the technique's effectiveness. First, existing WoZ tools impose technical barriers that prevent domain experts outside engineering from conducting independent studies --- the Accessibility Problem. Second, the fragmented landscape of custom, robot-specific tools makes it difficult to run the same social interaction script on a different robot platform without rebuilding the implementation from scratch --- the Reproducibility Problem, as the term is used in this thesis. Note that reproducibility here concerns execution consistency within a study and the portability of interaction scripts across robot platforms; it does not refer to independent replication of a published study by third-party researchers.
Through a thorough literature review, I identified a set of design principles to guide the development of WoZ support tools: a hierarchical specification model that organizes experiments as studies, experiments, steps, and actions; an event-driven execution model that separates protocol design from live trial control; and a plugin architecture that decouples experiment logic from robot-specific implementations. I implemented HRIStudio, an open-source, web-based platform that follows these design principles, providing a visual experiment designer, a guided wizard execution interface, automated timestamped logging with explicit deviation tracking, and role-based access control for research teams. I then evaluated HRIStudio in a pilot between-subjects study comparing it against Choregraphe, the standard NAO programming tool, using six participants who each designed and executed an interactive storytelling task on a NAO robot.
The pilot study confirms the thesis: HRIStudio wizards achieved higher design fidelity, higher execution reliability, and higher perceived usability than Choregraphe wizards across all six sessions. The only unprompted specification deviation in the dataset occurred in the Choregraphe condition, illustrating the execution-consistency failure mode that HRIStudio's enforcement model is designed to prevent. While the pilot scale precludes inferential claims, the directional evidence across all measures supports the position that a tool built to realize the identified design principles can have significant impact on accessibility and reproducibility in WoZ-based HRI research.
}
\mainmatter