Compare commits

...

12 Commits

Author SHA1 Message Date
soconnor 5d8ef0ce76 revisions of the revisions
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m3s
2026-04-30 00:19:02 -04:00
soconnor 51009cd1ce add signed cover page
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m42s
2026-04-29 12:42:41 -04:00
soconnor 28c852a867 feat: add honors council representative and update department name in thesis
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m4s
2026-04-21 11:00:20 -04:00
soconnor 1404945756 post-defense revisions complete
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m3s
2026-04-21 00:25:54 -04:00
soconnor 5017133cfb add embedded PDFs to git
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m43s
2026-04-20 23:19:10 -04:00
soconnor a7508c5698 Add appendix on AI-assisted development workflow for HRIStudio
This commit introduces a new appendix detailing the role of AI coding assistants in the development of HRIStudio. It covers the context of the project, tools used, division of responsibility, interaction patterns, and reflections on research integrity. The workflow is documented to provide transparency and insight into the development process, emphasizing the collaboration between human decisions and AI assistance.
2026-04-20 23:15:23 -04:00
soconnor 086b53880f refactor: update thesis acknowledgments and abstract for clarity and detail
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Failing after 35s
2026-04-13 16:39:49 -04:00
soconnor 6ccf32ee4d draft1 revisions complete
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Failing after 31s
2026-04-12 21:14:20 -04:00
soconnor e1af7b1f8f final submissions update part 1
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Failing after 32s
2026-04-12 17:07:09 -04:00
soconnor c28408bd9f draft1
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Failing after 32s
2026-04-11 16:41:52 -04:00
soconnor 659a4b0683 refactor: update thesis protocol to remove test subjects and screen recordings, add tracking documentation, and refine bibliography entries. 2026-04-08 22:43:20 -04:00
soconnor ab48109f64 feat: draft discussion chapter and update thesis structure with preliminary results and placeholder sections.
Build Proposal and Thesis / build-github (push) Has been skipped
Build Proposal and Thesis / build-gitea (push) Successful in 1m24s
2026-04-01 17:22:53 -04:00
63 changed files with 1633 additions and 490 deletions
+2
@@ -12,6 +12,8 @@
*.synctex.gz
*.dvi
*.pdf
!pdfs/**
!thesis/pdfs/**
# Build directory
build/
+54
@@ -0,0 +1,54 @@
# AGENTS.md
## Purpose
This file defines repository-specific guidance for AI coding/writing agents working on this thesis.
## Repository Layout
- Thesis source: `thesis/`
- Main file: `thesis/thesis.tex`
- Chapters: `thesis/chapters/*.tex`
- Output PDF: `thesis/out/thesis.pdf`
- Context/reference docs: `context/`
## Build and Verify
From `thesis/`:
- Build: `make`
- Output should be generated at `build/thesis.pdf` and copied to `out/thesis.pdf`.
If edits touch LaTeX content, run a build before finishing.
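A minimal sketch of this build-and-verify loop, assuming a POSIX shell and the `make` target described above (the final check is illustrative, not part of the Makefile):

```sh
cd thesis                                  # all build commands run from thesis/
make                                       # compiles build/thesis.pdf and copies it to out/thesis.pdf
test -f out/thesis.pdf && echo "build OK"  # illustrative sanity check on the copied output
```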
## Writing and Editing Priorities
1. Preserve technical accuracy over stylistic flourish.
2. Prefer plain, direct language.
3. Keep paragraph flow tight: avoid repeating claims already made in nearby paragraphs.
4. Minimize unnecessary chapter recap in chapter introductions.
5. Do not introduce unsupported claims.
## Terminology Canon (Use Consistently)
- `experiment`: reusable protocol specification
- `trial`: one concrete run of an experiment protocol
- `wizard`: human operator controlling execution
- `test subject` / `human subject`: person interacting with robot during a trial
- `session`: scheduled study block that can include training, design challenge, trial, and debrief
When revising text, normalize wording to these definitions unless quoting study materials verbatim.
## Chapter-Specific Conventions
- Chapter 4 (Design): focus on principles and architecture, not implementation specifics.
- Chapter 5 (Implementation): focus on how the principles are realized.
- Chapter 6 (Evaluation): focus on study design, procedure, and measures; avoid re-explaining earlier chapters.
## Figures and Media
- If requested images are unavailable, add clearly labeled placeholders in the relevant section.
- Prefer placing robot imagery in evaluation/apparatus context unless explicitly requested elsewhere.
## Scope Discipline
- Do not edit generated files under `thesis/build/`.
- Do not add new dependencies/packages unless necessary.
- Keep edits minimal and localized to the user request.
## Review Checklist Before Finishing
- Build succeeds (`make` from `thesis/`).
- Terminology remains consistent across edited sections.
- No obvious redundancy introduced.
- References/labels compile without new undefined-reference warnings.
+316
@@ -0,0 +1,316 @@
# Professor Annotations — Thesis Draft
Complete record of all GoodReader/Notability annotations from both PDFs (`draft1 - flattened.pdf` covering Abstract + Ch1–5, and `6-9.pdf` covering Ch6–9). Status column tracks what has been applied.
**Key rule:** Only apply explicitly marked text changes (strikethroughs, word replacements, caret insertions). Treat observational margin notes as context only.
---
## Professor's Overall Email Feedback
> "Well, this was an Odyssey of a day. You have something very good here, but as every text, it can always be improved. I am not sure how much you really need to do for Monday. If there's anything that cannot be addressed in this time, it will spill over for later. By later, I mean throughout this coming week that proceeds the defense, possibly as modifications that others may require for you to do after the defense."
> "What you have here is likely to not raise a whole lot of concerns from your reviewers. **The point that I find most needing of attention is chapter 7.** It reads very dry, and as you will see in my embedded comments, I was left wondering if the results could be presented more synthetically with the full anecdotes relegated to an appendix."
> "Throughout the text from beginning to the end, I see material that appears repeatedly. Ideally, one would strive to eliminate these redundancies so that the text is more punchy, more direct, more effective in communicating the message that it wants to get across. Redundancies tend to distract the reader as well as to overwhelm them. If you have time, at some point, I would recommend going through this with a fine tooth comb to identify these redundancies and eliminate them as much as possible and as much as time allows."
**Priority order for remaining work (professor's implied ranking):**
1. Chapter 7 overhaul — highest priority before defense
2. Abstract rewrite
3. Redundancy pass (can spill to post-defense week)
4. Everything else (can spill to post-defense week if needed)
---
## Status Key
- ✅ Applied
- ⬜ Pending
- 🔍 Needs clarification
---
## Abstract (pp. xv–xvi)
| Location | Annotation | Status |
|---|---|---|
| p. xv, top margin | "make it tighter and fully self-contained" — restructured into three focused paragraphs | ✅ |
| p. xv, next to "Abstract" heading | Rephrased to: "Through a thorough literature review, I identified a set of design principles... I implemented HRIStudio... I then evaluated HRIStudio" | ✅ |
| p. xv, "high" in "impose high technical barriers" | Deleted "high" | ✅ |
| p. xv, "faculty" in "six faculty participants" | Deleted "faculty" | ✅ |
| p. xvi, "HRIStudio participants" and "Choregraphe participants" | → "HRIStudio wizards" / "Choregraphe wizards" | ✅ |
| p. xvi, three "(mean X vs. Y)" parentheticals | Deleted all three | ✅ |
| p. xvi, bottom | "The pilot study confirms the thesis: HRIStudio wizards achieved higher design fidelity, higher execution reliability, and higher perceived usability..." | ✅ |
| p. xvi | Third-party replication carve-out added: "Note that reproducibility here concerns execution consistency within a study and the portability of interaction scripts across robot platforms; it does not refer to independent replication of a published study by third-party researchers." | ✅ |
| p. xvi | Reproducibility defined as: "run the same social interaction script on a different robot platform without rebuilding the implementation from scratch" | ✅ |
---
## Chapter 1 — Introduction
| Location | Annotation | Status |
|---|---|---|
| p. 1, "limit HRI research" (highlighted) | "not all research in HRI is about social robotics; be careful to not get out of your scope" — observational | ✅ (applied in session 1) |
| p. 2, arrow to "Social robotics" | "social robotics, a subfield within HRI" — confirmed phrasing | ✅ |
| p. 2, "interaction is inherently unpredictable" | Strikethrough; replaced with: "reactions to robot behaviors are not always" | ✅ |
| p. 2, "uses" in "the researcher uses a WoZ setup" | Circle; "may use" written above — replace "uses" with "may use" | ✅ |
| p. 2, "In HRI, the wizard..." | "WoZ experiments" replacing "HRI" | ✅ |
| p. 3, "a high technical barrier prevents" | Strikethrough; replaced with: "may find it challenging to" | ✅ |
| p. 3, "from conducting" | Strikethrough — implied rewording | ✅ |
| p. 3, §1.2 Proposed Approach margin | "you are saying that there are many different robots out there and one may want to have the same interaction script execute on different robots → here is where you must define clearly what you mean by reproducibility" — context | ✅ |
| p. 3, bracket on first paragraph of §1.2 (three circled sentences) | "BARRIERS 1ST", "follows 2ND", "REPRODUCIBILITY 2ND" — ordering directive | ✅ |
| p. 3, "the HRI research community" | Strikethrough; replaced with: "WoZ experiments" | ✅ |
| p. 4, ✓ on "the design principles" | Checkmark — keep as is | ✅ |
| p. 4, ✗ on "plugin architecture that decouples experiment logic from robot-specific implementations" | Strikethrough; annotation: "a reference implementation, HRIStudio, used to evaluate the impact of the design principles" | ✅ |
| p. 4, §1.3 "the" before "vision" | Strikethrough; replaced with: "our" | ✅ |
| p. 4, §1.3 "the prototype" | Strikethrough; replaced with: "a first" | ✅ |
| p. 4, §1.3 bottom | "extends and" written above "formalizes"; "peer" written above "contributions" | ✅ |
| p. 5, circled "s" letters on "separates", "enforces", "implements", "evaluates" | Flagging subject/verb agreement — make all present-tense verbs consistent | ✅ |
| p. 5, end of research question paragraph | "& guides them to be consistent in the pursuit of their experimental goals" — add to final sentence | ✅ |
| p. 5, italicized research question in Chapter Summary | Strikethrough — delete whole sentence | ✅ |
---
## Chapter 2 — Background and Related Work
| Location | Annotation | Status |
|---|---|---|
| p. 6, top (Ch2 opening) | Venn diagram: HRI → Social Robotics → WoZ Experiments → (hatched inner area); "your work is situated in a subset of possible activities in HRI" — observational | ✅ |
| p. 6, "relative to" in "position this thesis relative to prior work" | Strikethrough; replaced with: "within the context of" | ✅ |
| p. 8, "WoZ4U is unusable with other robot platforms" | "unusable" struck; replaced with: "does not work for various robots"; "and" before "manufacturer" struck; caret suggests restructure | ✅ |
| p. 9, "best practices like standardized protocols" | "consistently following" written above "like"; "experimental" written above "standardized" | ✅ |
| p. 9, circle around citation "[14]" next to Riek | Observational — noting citation | ✅ |
| p. 9, underline on "internal validity" | "define this for the reader" | ✅ |
| p. 10, underline on "minimize context switching and tool fragmentation" (R1) | "explain to the reader what this means" | ✅ |
| p. 10, R2 "Creating interaction protocols" | "social" inserted before "interaction"; "between robot & human" inserted before "protocols" | ✅ |
| p. 10, R3 "across a variety of robotic platforms" | "→ this addresses reproducibility" — observational reading note | ✅ (treated as context) |
| p. 10, R5 "implementations" highlighted | "actions and behaviors?" written above; "→ this also addresses reproducibility" — observational | ✅ (treated as context) |
| p. 11, R5 "the platform" in last line | "WoZ" inserted above "platform" | ✅ |
| p. 11, R6 "review" + "execution" | Caret inserting "of" between "review" and "execution" | ✅ |
| p. 11, R6 paragraph margin | "you have been calling it reproducibility so stick with your terminology" | ✅ |
| p. 11, "flexibility" highlighted | Strikethrough — flagged | ✅ |
| p. 12, "tests" | Strikethrough; replaced with: "evaluates" | ✅ |
---
## Chapter 3 — Reproducibility Challenges
| Location | Annotation | Status |
|---|---|---|
| p. 13, "difficult to reproduce" | "consistently" inserted above "reproduce" | ✅ |
| p. 13, "sources of variability in WoZ studies" | "identified in the literature" inserted above | ✅ |
| p. 13, §3.1 opening (highlighted "Reproducibility in experimental research...") | Bottom annotation: "This reproducibility definition is about consistent application of the experimental script across multiple trials → different from reproducing the same experiment with different robots → how do you want to state the distinction?" | ✅ (§3.1 reframed with execution consistency + cross-platform reproducibility as two sub-aspects) |
| p. 14, "replicating" highlighted | "it may seem pedantic, but I think you need to establish the difference between reproducibility and repeatability across replications (trials)" — "replications" changed to "trials"; repeatability/reproducibility distinction + third-party carve-out added to §3.1 | ✅ |
| pp. 15–16 | Clean — no annotations | — |
---
## Chapter 4 — Architectural Design
| Location | Annotation | Status |
|---|---|---|
| p. 19, §4.1 heading | "you use the word 'condition' with different meanings across the thesis — be sure to resolve ambiguities!" | ✅ ("condition" replaced with "experiment" in hierarchy contexts) |
| p. 19, "multiple reusable conditions" | Circled; arrow to annotation above | ✅ → "multiple experiments" |
| p. 19, "To organize these components" | "why do you call them 'components'?" | ✅ → "To organize these elements" |
| p. 20, top | "is 'condition' the same as a structural element...?" — context | ✅ |
| p. 20, "The terms in this hierarchy are used in a strict way" | Replaced with: "I define the elements in this hierarchy as follows." | ✅ |
| p. 20, "research" in "top-level research container" | Strikethrough — delete "research" | ✅ |
| p. 20, "conditions" in "groups related protocol conditions" | Circled — replace | ✅ → "groups related experiments" |
| p. 20, "condition" in "one reusable condition" | Circled — replace | ✅ → "one independently runnable protocol" |
| p. 20, "recall" in "testing recall" | "information" written above | ✅ → "testing information recall" |
| p. 20, Fig. 4.2 paragraph | "new paragraph" marked; "define" above "designed once"; replacement clause about instantiation | ✅ |
| p. 20, left margin bracket | "this needs to be tightened up and edited for clarity" | ✅ |
| p. 21, top | Research question rephrasing: "Does how the robot tell a story affect how a human will remember the story?" | ✅ |
| p. 21, "conditions" highlighted | Circled — replace | ✅ → "experiments" |
| p. 21, left margin | "this comes at the reader very abruptly: have you introduced these robots and their different morphologies/features before?" | ✅ (added brief characterizations) |
| p. 21, "that study" highlighted | "the study presented above" written below | ✅ |
| p. 21, "same hierarchy" struck | Replaced with: "hierarchical elements defined in Figure 4.1" | ✅ |
| p. 21, step sentence | "sequence of" inserted above "ordered"; "defines the specific" above "contains"; "the robot will perform" added | ✅ |
| p. 21, figures annotation | "These three figures are interrelated as follows: Figure 4.1 defines the experimental structure as an abstraction; Figure 4.2 shows concrete instances of the abstract experimental structure; and Figure 4.3 shows the expansions of each element of the experimental structure." | ✅ |
| p. 21, "Together, these three figures..." | Circle with arrow: "place this before the suggested paragraph" | ✅ |
| p. 21, "lets" struck | Replaced with: "compels" | ✅ |
| p. 21, "any" struck | Replaced with: "multiple" | ✅ |
| p. 23, "keeps the tool accessible to non-programmers" | "creates a process that is" inserted above | ✅ |
| p. 23, "levels" struck | Replaced with: "elements" | ✅ |
| p. 23, "trial flow" struck | Replaced with: "the sequence of events in a trial" | ✅ |
| p. 23, "timing" struck | Replaced with: "timing of each event" | ✅ |
| p. 23, R3/R4/R1/R6 annotations | Add parenthetical references | ✅ |
| p. 23, bottom | "be consistent with the terminology you use" | ✅ |
| p. 23, §4.2 bottom paragraph | "what really matters is that the order of actions is the same across multiple trials; it would be unnatural to demand that all actions should happen at the same points in time" — context for rewrite | ✅ (action ordering foregrounded) |
| p. 24, "shows up" circled | Replaced with: "can be evident" | ✅ |
| p. 25, §4.3.1 "in plain language" | "natural?" and "even having to" — replaced with: "naturally, without even having to write code" | ✅ |
| p. 25, "research" struck | Replaced with: "experimental" | ✅ |
| p. 25, stored-format sentence | "This enables third parties to reproduce one's experiments faithfully. The importance of this feature cannot be overstated since it is central to the scientific method." — add reproducibility sentence | ✅ |
| p. 26, §4.3.2 "shows" struck | Replaced with: "keeps the wizard informed of" | ✅ |
| p. 26, "are" caret | Inserted into "The current step...all" | ✅ |
| p. 26, "Execution" highlighted | `\emph{Execution}` for consistency | ✅ |
| p. 26, "simply" struck before "ignore these moments" | Delete "simply"; replace "these moments" with "these deviations from the script" | ✅ |
| p. 26, "participant" highlighted | Replaced with: "human subject" | ✅ |
| p. 26, left margin brackets | "a little polish is needed here" / "needs polish for clarity and directness" — context | ✅ |
| p. 26, "the" in "access the same live view" struck | Replaced with: "of a trial" | ✅ |
| p. 26, bottom | "Define a 'deviation' as a spontaneous action introduced by the wizard in response to a reaction of the human subject that was not expected when the script was created" | ✅ (deviation definition added) |
| p. 27, §4.3.3 "can" struck | "the need for" inserted | ✅ |
| p. 27, §4.4.1 annotation above heading | "Like the ISO/OSI RM for networking software, HRIStudio communicates layers, as shown in Fig. 4.5." | ✅ |
| p. 27, "The system" struck | Replaced with: "More specifically, the system is organized as" | ✅ |
| p. 28, Fig. 4.5 "different components w/ same name?" | Rename "Execution" in App Logic layer to "Trial Engine" | ✅ |
| p. 28, Fig. 4.5 "should these arrows be bidirectional?" | Changed arrows to bidirectional | ✅ |
| p. 29, §4.4.2 "begins stepping" | "allows the wizard to" inserted above | ✅ |
| p. 29, left margin | "I would have used a 'begin enumerable' here" | ✅ (prose converted to enumerate list) |
| p. 29, "unexpected events" struck | Replaced with: "deviations" | ✅ |
| p. 29, "ensures" struck | Replaced with: "creates automatically a" | ✅ |
---
## Chapter 5 — Implementation
| Location | Annotation | Status |
|---|---|---|
| p. 33, §5.1 "shown in Figure 4.5" | "presented in Chapter 4 and" inserted above | ✅ |
| p. 33, yellow highlight on local network sentence | Flagged — keep, add explanation before it | ✅ (client/server/local-network explanation added) |
| p. 33, TypeScript paragraph | "this is more about implementation than about architecture" — context | ✅ (treated as context) |
| p. 33, bottom | "up until this statement, you hadn't told the reader that the application is a networked composition of client and server, so this comes as a surprise." | ✅ (explanation added to §5.1) |
| p. 34, §5.2 "experiments" | "descriptions" inserted above → "saves experiment descriptions" | ✅ |
| p. 34, yellow highlight on "a trial means one concrete run..." | "wasn't this definition due on your first use of the term 'trial'?" — annotation; remove the definition from here | ✅ (misplaced trial definition removed) |
| p. 34, "trial record" | "sample?" written above — 🔍 unclear; do not change without confirmation | ✅ (term is appropriate; "trial record" is the structured log of a trial, not a statistical sample) |
| p. 35, below Figure 5.1 | "you should watch out for redundancies" — observational | ✅ (treated as context) |
| p. 36, left margin | "this was stated in 4.2" (re: event-driven paragraph) — context | ✅ (redundant paragraph trimmed) |
| p. 36, yellow highlight on "the wizard controls how time advances from action to action" | Flagged — keep this sentence | ✅ |
| p. 36, strikethrough on "A fully programmed robot..." passage | Replaced with: "Unscripted actions give the wizard the tools to record how these interactions unfold when deviations from the script are required." | ✅ |
| p. 38, after role list intro sentence | "The capabilities and constraints for each role are described below:" added | ✅ |
| p. 39, §5.5 "double-blind design" highlighted | "double-blind line" written above — term already defined inline with citation; no change needed | ✅ |
| p. 40, §5.7 "are complete and integrated" | "with one another" inserted via caret | ✅ |
| p. 40, §5.7 last sentence, caret after "beyond NAO6" | Caret with ↑ mark — expansion or forward reference needed | ✅ (forward reference to Chapter 9 added) |
---
## Chapter 6 — Pilot Validation Study
| Location | Annotation | Status |
|---|---|---|
| p. 43, §6.1 hypothesis paragraph | "reproducibility" written above "written specification"; "accessibility?" below — both named in hypothesis | ✅ |
| p. 43, §6.1 second RQ sentence | Yellow highlight — rephrase away from rhetorical framing | ✅ → "The first is whether..." / "The second is whether..." |
| p. 43, §6.2 "the same training structure" | "similar" written above — replace | ✅ → "a similar training structure" |
| p. 44, §6.3 "This cross-departmental recruitment was intentional." | "redundant" above — delete sentence | ✅ |
| p. 44, §6.3 justification paragraph | "you could first state this as a goal, then talk about how you met this goal" | ✅ (restructured: goal stated first) |
| p. 44, §6.3 "direct email" | "and" inserted via caret after | ✅ → "through direct email, and participation was..." |
| p. 44, §6.3 sample size "With" struck | "I chose to recruit"; "believing that" inserted | ✅ |
| p. 44, §6.3 yellow strikethrough block | "two academic semesters...high competing time demands." — delete | ✅ |
| p. 44, §6.3 left margin | "looks like you're making excuses" — context for deletion | ✅ |
| p. 44, §6.3 "proof" struck | "substantiation"; "of any claims." added | ✅ |
| p. 45, §6.4 top margin | "start with this to set up the scenario:" + professor's suggested opening sentence | ✅ |
| p. 45, §6.4 "glowing" | "red" written above | ✅ → "a red rock on Mars" |
| p. 45, §6.4 "comprehension" | "recall" written above | ✅ → "a recall question" |
| p. 45, §6.4 asterisk near Appendix ref | "...one might measure whether a robot or human storyteller would produce better recall." | ✅ (added as sentence) |
| p. 45, §6.4 "The task was chosen" | Circle — reframe | ✅ → "This scenario was chosen" |
| p. 45, §6.4 Choregraphe FSM sentence | Yellow highlight — flagged | ✅ (treated as context) |
| p. 45, §6.5 left margin | "Important to address" + "you talked about the NAO and Choregraphe earlier but only present them here" | ✅ (treated as context; §6.5 is the formal introduction) |
| p. 46, star annotation | "you used this nugget of info earlier and unveil it here" — context | ✅ (treated as context) |
| p. 46, yellow highlights on NAO + Choregraphe sentences | Flagged — keep | ✅ |
| p. 46, circle on "Choregraphe organizes behavior as a finite state machine" | Flagged | ✅ (treated as context) |
| p. 47, §6.6.2 "found in Appendix X" | Add appendix reference to observer data sheet | ✅ |
| p. 47, §6.6.2 "paper" in "paper specification" | Strikethrough — delete "paper" | ✅ |
| p. 47, §6.6.2 yellow highlight on "structured observer data sheet" | Flagged | ✅ |
| p. 48, §6.6.4 "found in Appendix Y" | Add appendix reference to SUS questionnaire | ✅ |
| p. 48, §6.6.4 yellow highlight on "System Usability Scale" | Flagged | ✅ |
| p. 48, §6.7 after "five instruments." | "They are described as follows." written in red | ✅ |
| p. 48, §6.7.1 DFS heading | "important to state if this is something you defined or if it was someone else's definition" | ✅ → "I define the Design Fidelity Score (DFS) as..." |
| p. 49, §6.7.1 "This measure" | "DFS" written above → "DFS is motivated by..." | ✅ |
| p. 49, §6.7.1 accessibility sentence | "the question:" inserted → "This measure addresses the question: did the tool allow a wizard to independently produce a correct design?" | ✅ |
| p. 49, §6.7.2 ERS heading | "same comment I made for the DFS" — author-defined statement needed | ✅ → "I define the Execution Reliability Score (ERS) as..." |
| p. 50, §6.7.2 reproducibility sentence | "the" and "question" inserted → "This measure addresses the question: did the design translate reliably into execution without researcher support?" | ✅ |
| p. 50, §6.7.3 "created by" | Caret before [19] → "perceived usability, created by Brooke [19]" | ✅ |
| p. 52, §6.9 "This chapter described" | "the structure of" inserted | ✅ |
---
## Chapter 7 — Results
| Location | Annotation | Status |
|---|---|---|
| p. 54, Ch7 intro yellow highlight | "Rhetoric is unusual in technical writing → better rephrase this" — rephrased to declarative statement | ✅ |
| p. 54, §7.1 "participants" | "personas" → "Table 7.1 summarizes the personas and their assigned conditions" | ✅ |
| p. 54, §7.1 "faculty members" highlighted | "professors" — replaced in §7.1 opening | ✅ |
| p. 54, §7.1 after table introduction | "This table also presents numerical data representing the study's results, which is discussed next." | ✅ |
| p. 55, §7.2.1 DFS heading | "(DFS)" added → "Design Fidelity Score (DFS)" | ✅ |
| p. 55, §7.2.1 "implemented the written specification" | "the experiment they received" inserted | ✅ |
| p. 55, §7.2.1 "a component" highlighted | Inline definition added: "a rubric criterion representing a required speech action, gesture, or control-flow element" | ✅ |
| p. 55, §7.2.1 inline parentheticals | All removed from W-01 through W-06 in DFS and ERS sections | ✅ |
| p. 55, §7.2.1 bottom | Dry/anecdotal tone — added synthetic overview paragraph before per-wizard detail in both DFS and ERS sections | ✅ |
| p. 56, W-03 paragraph | "(see §6.7.4)" → "One C-type clarification (see Section~\ref{sec:measures}) was required" | ✅ |
| p. 56, "for that category" highlighted | Cross-reference added: "(For a full description of rubric categories, see Section~\ref{sec:measures}.)" | ✅ |
| p. 58, §7.2.2 ERS heading | "(ERS)" added → "Execution Reliability Score (ERS)" | ✅ |
| p. 70, §7.5 Chapter Summary | Rewritten as interpretive conclusions — no score repetition | ✅ |
---
## Chapter 8 — Discussion
| Location | Annotation | Status |
|---|---|---|
| p. 71, Ch8 intro | "With all six sessions now complete," struck — delete this clause | ✅ (already absent from text) |
| p. 73, §8.1.1 end of accessibility paragraph | `\emph{}` on "None", "Moderate", "Extensive" (annotated "temph") — italicize these three experience levels throughout | ✅ (already using `\emph{}` consistently) |
| p. 73, §8.1.1 bottom | "There's a big thing hiding in the background here: only one wizard was a humanist; all others were engineers" — acknowledge this sample composition limitation | ✅ (sample composition acknowledgment added to §8.1.1) |
| p. 77, §8.2 "holds" highlighted green | "is confirmed?" written above — consider replacing "holds" with "is confirmed" | ✅ (word "holds" not present in current text) |
| p. 78, §8.2 continued | "both" inserted via caret before "conditions" → "the overall 17.5-point gap in both condition means reflects..." | ✅ (fixed to "both conditions' means") |
| p. 79, §8.3 "under active development" struck | Replaced with: "continuously evolving" → "HRIStudio is continuously evolving." | ✅ |
---
## Chapter 9 — Conclusion and Future Work
| Location | Annotation | Status |
|---|---|---|
| p. 81, Ch9 intro | Green highlight on "Human-Robot Interaction"; "social robotics" written below → scope to "Wizard-of-Oz-based social robotics research" | ✅ |
| p. 82, §9.1 first contribution | Green highlight on "institution" with "?" — word choice questioned in "not specific to any one robot or institution" | ✅ ("institution" replaced with "research group") |
| p. 82, §9.1 HRIStudio contribution | Circle around "an open-source"; "did you mention this earlier? how is it distributed and licensed?" — add distribution/licensing info | ✅ (MIT License added) |
| p. 83, §9.2 Reflection on Research Questions | "How much of 9.2 is new and how much of it does it repeat from other sections?" — audit for redundancy with §8.1 and trim | ✅ (§9.2 trimmed to ~15 lines, cut ~50% of duplicated content) |
| p. 85, §9.3 "Multi-task evaluation." | Strikethrough (green); replaced with: "Evaluations with multiple different tasks." | ✅ |
| p. 86, §9.3 community adoption sentence | "not a" struck; "more of a" and "than" inserted → "The reproducibility problem in WoZ research is ultimately more of a community problem than a tool problem." | ✅ (already correctly worded in text) |
| p. 86, §9.4 "are never shared" | "aren't always shared" written above struck phrase | ✅ |
| p. 86, §9.4 bottom | "I struggle with the word rigorous: might 'systematic' be a more precise qualifier?" — consider replacing "rigorous" with "systematic" throughout closing paragraph | ✅ ("rigorous" replaced with "systematic" in closing paragraph) |
---
## GoodReader Notes Page (p. i)
### Note 33-1 (April 11, 2026) — **MAJOR STRUCTURAL NOTE**
> "Maybe the way to address the different possible interpretations of the word reproducibility is to state outright that HRIStudio was designed to meet two distinct meanings of the term:
> — reproducibility across trials in the same experiment, with the same or with different wizards running them
> — create documentation to guide the reproduction of the experiment by third parties, which would mean reproducibility as in https://dl.acm.org/doi/pdf/10.4108/ICST.SIMUTOOLS2009.5684"
**What this means for the thesis:**
The professor wants three interpretations of "reproducibility" explicitly distinguished somewhere in the thesis (Ch3 §3.1 is the natural home):
1. **Execution consistency** — wizard reliably follows the same script across multiple trials in the same experiment (within-study). This IS what the thesis evaluates.
2. **Cross-platform reproducibility** — the same experiment script runs on a different robot with minimal change. This IS what HRIStudio is designed to support.
3. **Third-party replication** — another lab reproduces your published experiment from documentation. This is NOT what the thesis evaluates, and the abstract/conclusion must be careful not to claim it.
**Current state:** Ch3 §3.1 already names sub-aspects 1 and 2. The explicit carve-out for sub-aspect 3 ("third-party replication is not what we evaluated") was the last open item; per the Abstract table above and the summary below, it has since been added to the Abstract and to Ch3 §3.1.
---
## Pending Items Summary
All items above are resolved. The tracking table below is retained for reference only.
| Chapter | Item | Status |
|---|---|---|
| Abstract | Full rewrite per professor's framing guidance | ✅ |
| Ch3 §3.1 | Add sentence explicitly distinguishing third-party replication as out of scope | ✅ |
| Ch5 §5.5 | "double-blind design" — define inline | ✅ |
| Ch5 §5.7 | Caret after "beyond NAO6" — needs original PDF check | ✅ (resolved) |
| Ch7 §7.1 | "personas" for "participants"; "professors" for "faculty members" (global); add sentence after table | ✅ |
| Ch7 §7.2.1 | "(DFS)" in heading; "the experiment they received"; define "a component"; remove inline parentheticals; narrative tone question | ✅ |
| Ch7 §7.2.1 | "(see §6.7.4)" on C-type clarification; cross-reference for DFS categories | ✅ |
| Ch7 §7.2.2 | "(ERS)" in heading | ✅ |
| Ch7 §7.5 | Rewrite Chapter Summary as interpretive conclusions | ✅ |
| Ch8 intro | Delete "With all six sessions now complete," | ✅ |
| Ch8 §8.1.1 | Italicize None/Moderate/Extensive; acknowledge humanist sample limitation | ✅ |
| Ch8 §8.2 | "holds" → consider "is confirmed"; "both" before "conditions" | ✅ |
| Ch8 §8.3 | "under active development" → "continuously evolving" | ✅ |
| Ch9 intro | Scope to "Wizard-of-Oz-based social robotics research" | ✅ |
| Ch9 §9.1 | "institution" word choice; open-source licensing info | ✅ |
| Ch9 §9.2 | Audit for redundancy with §8.1 | ✅ |
| Ch9 §9.3 | Rename "Multi-task evaluation" heading; community problem sentence | ✅ |
| Ch9 §9.4 | "aren't always shared"; "systematic" for "rigorous" | ✅ |
-20
@@ -1,20 +0,0 @@
.DS_Store
# LaTeX build artifacts (if they leak into root)
*.aux
*.bbl
*.blg
*.log
*.out
*.toc
*.lof
*.lot
*.fls
*.fdb_latexmk
*.synctex.gz
*.dvi
*.pdf
# Directories
build/
out/
Four binary image files added (previews not shown): 230 KiB, 254 KiB, 260 KiB, and 297 KiB.

+25 -11
@@ -36,6 +36,7 @@
\setlength{\parskip}{0.2in}
\newcommand{\advisor}[1]{\newcommand{\advisorname}{#1}}
\newcommand{\advisorb}[1]{\newcommand{\advisornameb}{#1}}
\newcommand{\honorscouncilrep}[1]{\newcommand{\honorscouncilrepname}{#1}}
\newcommand{\chair}[1]{\newcommand{\chairname}{#1}}
\newcommand{\department}[1]{\newcommand{\departmentname}{#1}}
\newcommand{\butitle}[1]{\newcommand{\titletext}{#1}}
@@ -114,33 +115,46 @@ in Partial Fulfillment of the Requirements for the Degree of\\
\today
\end{center}
\vspace{0.03in}
{\small
\ifthenelse{\boolean{@twoadv}}{
\vspace{0.25in}
Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\advisorname\\
\mbox{\hspace{1.3in}}Thesis Advisor
\vspace{0.25in}
\vspace{0.03in}
\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\advisornameb\\
\mbox{\hspace{1.3in}}Second Reader
\vspace{0.25in}
\mbox{\hspace{1.3in}}Reader
\vspace{0.03in}
\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\chairname\\
\mbox{\hspace{1.3in}}Chair of the Department of \departmentname}
{\vspace{1.0in}
\mbox{\hspace{1.3in}}Chair of the Department of \departmentname
\vspace{0.03in}
Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\honorscouncilrepname\\
\mbox{\hspace{1.3in}}Honors Council Representative}
{Approved: \hspace{0.2in}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\advisorname \\
\mbox{\hspace{1.3in}}Thesis Advisor
\vspace{0.5in}
\vspace{0.03in}
\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\advisornameb\\
\mbox{\hspace{1.3in}}Reader
\vspace{0.03in}
\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\chairname\\
\mbox{\hspace{1.3in}}Chair of the Department of \departmentname}
\mbox{\hspace{1.3in}}Chair of the Department of \departmentname
\vspace{0.03in}
\mbox{\hspace{1.0in}}\underline{\hspace{2.5in}}\\
\mbox{\hspace{1.3in}}\honorscouncilrepname\\
\mbox{\hspace{1.3in}}Honors Council Representative}
}
\end{singlespace}
\vfill
\end{titlepage}}
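For context, a hypothetical usage sketch of the signature-line setter macros this diff touches; the argument values are placeholders, not names taken from the thesis:

```latex
% Hypothetical preamble usage: each setter stores its argument in a
% corresponding ...name macro that the title page typesets above a signature line.
\advisor{Advisor Name}                  % defines \advisorname
\advisorb{Reader Name}                  % defines \advisornameb
\honorscouncilrep{Representative Name}  % defines \honorscouncilrepname (new in this diff)
\chair{Chair Name}                      % defines \chairname
\department{Computer Science}           % defines \departmentname (placeholder value)
```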
+9 -9
@@ -1,32 +1,32 @@
\chapter{Introduction}
\label{ch:intro}
Human-Robot Interaction (HRI) is an essential field of study for understanding how robots should communicate, collaborate, and coexist with people. As researchers work to develop social robots capable of natural interaction, they face a fundamental challenge: how to prototype and evaluate interaction designs before the underlying autonomous systems are fully developed. This chapter introduces the technical and methodological barriers that currently limit HRI research, describes a generalized approach to address these challenges, and establishes the research objectives and thesis statement for this work.
Human-Robot Interaction (HRI) is an essential field of study for understanding how robots should communicate, collaborate, and coexist with people. As researchers work to develop social robots capable of natural interaction, they face a fundamental challenge: how to prototype and evaluate interaction designs before the underlying autonomous systems are fully developed. This chapter introduces the technical and methodological barriers that currently limit Wizard-of-Oz (WoZ) based HRI research, describes a generalized approach to address these challenges, and establishes the research objectives and thesis statement for this work.
\section{Motivation}
To build the social robots of tomorrow, researchers must study how people respond to robot behavior today. That requires interactions that feel real even when autonomy is incomplete. The process of designing and optimizing interactions between human and robot is essential to HRI, a discipline dedicated to ensuring these technologies are safe, effective, and accepted by the public \cite{Bartneck2024}. However, current practices for prototyping these interactions are often hindered by complex technical requirements and inconsistent methodologies.
Social robotics focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human interaction is inherently unpredictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.
Social robotics, a subfield of HRI, focuses on robots designed for social interaction with humans, and it poses unique challenges for autonomy. In a typical social robotics interaction, a robot operates autonomously based on pre-programmed behaviors. Because human reactions to robot behaviors are not always predictable, pre-programmed autonomy often fails to respond appropriately to subtle social cues, causing the interaction to degrade.
To overcome this limitation, researchers use the Wizard-of-Oz (WoZ) technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the "great and powerful" Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In HRI, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher uses a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.
To overcome this limitation, researchers use the WoZ technique. The name references L. Frank Baum's story \cite{Baum1900}, in which the ``great and powerful'' Oz is revealed to be an ordinary person operating machinery behind a curtain, creating an illusion of magic. In WoZ experiments, the wizard similarly creates an illusion of robot intelligence from behind the scenes. Consider a scenario where a researcher wants to test whether a robot tutor can effectively encourage student subjects during a learning task. Rather than building a complete autonomous system with speech recognition, natural language understanding, and emotion detection, the researcher may use a WoZ setup: a human operator (the ``wizard'') sits in a separate room, observing the interaction through cameras and microphones. When the subject appears frustrated, the wizard makes the robot say an encouraging phrase and perform a supportive gesture. To the subject, the robot appears to be acting autonomously, responding naturally to the subject's emotional state. This methodology allows researchers to rapidly prototype and test interaction designs, gathering valuable data about human responses before investing in the development of complex autonomous capabilities.
Despite its versatility, WoZ research faces two critical challenges. The first is \emph{The Accessibility Problem}: a high technical barrier prevents many non-programmers, such as experts in psychology or sociology, from conducting their own studies without engineering support. The second is \emph{The Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. These tools are rarely shared, making it difficult for the scientific community to replicate results or compare findings across labs.
Despite its versatility, WoZ research faces two critical challenges. The first is \emph{The Accessibility Problem}: many non-programmers, such as experts in psychology or sociology, may find it challenging to conduct their own studies without engineering support. The second is \emph{The Reproducibility Problem}: the hardware landscape is highly fragmented, and researchers frequently build custom control interfaces for specific robots and experiments. Because these tools are tightly coupled to particular hardware, running the same social interaction script on a different robot platform typically requires rebuilding the implementation from scratch. These tools are rarely shared, making it difficult for a researcher to reproduce the same study across different robot platforms or for other labs to replicate results.
\section{Proposed Approach}
To address the accessibility and reproducibility problems in WoZ-based HRI research, I propose a web-based software framework that integrates three key capabilities. First, the framework must provide an intuitive interface for experiment design that does not require programming expertise, enabling domain experts from psychology, sociology, or other fields to create interaction protocols independently. Second, it must enforce methodological rigor during experiment execution by guiding the wizard through standardized procedures and preventing deviations from the experimental script that could compromise validity. Third, it must be platform-agnostic, meaning the same experiment design can be reused across different robot hardware as technology evolves.
This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for the HRI research community. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.
This approach represents a shift from the current paradigm of custom, robot-specific tools toward a unified platform that can serve as shared infrastructure for WoZ-based HRI research. By treating experiment design, execution, and analysis as distinct but integrated phases of a study, such a framework can systematically address both technical barriers and sources of variability that currently limit research quality and reproducibility.
The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a protocol/trial separation with explicit deviation logging. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. The platform I developed, HRIStudio, is one implementation of this architecture: an open-source reference system that realizes those principles and serves as the instrument for empirical validation.
The contributions of this thesis are the design principles of this approach, namely: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. Together they form a coherent architecture for WoZ infrastructure that any implementation could adopt. To evaluate the impact of these design principles, I developed a reference implementation called HRIStudio: an open-source, web-based platform built on this architecture and used as the instrument for empirical validation.
\section{Research Objectives}
This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established the vision and the prototype. This thesis formalizes the contribution: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a reference implementation of those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.
This thesis builds upon foundational work presented in two prior peer-reviewed publications. Prof. Perrone and I first introduced the conceptual framework for HRIStudio at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}, establishing the vision for a collaborative, web-based platform. Subsequently, we published the detailed system architecture and a first prototype at RO-MAN 2025 \cite{OConnor2025}, validating the technical feasibility of web-based robot control. Those publications established our vision and a first prototype. This thesis extends and formalizes our contributions: a set of design principles for WoZ infrastructure that simultaneously address the \textit{Accessibility} and \textit{Reproducibility} Problems, a reference implementation that realizes those principles, and pilot empirical evidence that they produce measurably different outcomes in practice.
The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and how consistently they can do it.
The central question this thesis addresses is: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} To answer it, I propose a hierarchical, event-driven specification model that separates protocol design from trial execution, enforces action sequences, and logs deviations automatically; implement it as HRIStudio; and evaluate it in a pilot study comparing design fidelity and execution reliability against a representative baseline tool. The goal is not to prove a statistical effect at scale, but to establish directional evidence that the architecture changes what researchers can do and guides them to be consistent in the pursuit of their experimental goals.
\section{Chapter Summary}
This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question (can a hierarchical, event-driven specification model with explicit deviation logging lower the technical barrier and improve reproducibility of WoZ experiments?) and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study. The next chapters establish the technical and methodological foundations.
This chapter has established the context and objectives for this thesis. I identified two critical challenges facing WoZ-based HRI research. The first is the \emph{Accessibility Problem}: high technical barriers limit participation by non-programmers. The second is the \emph{Reproducibility Problem}: fragmented tooling makes results difficult to replicate across labs. I proposed a web-based framework approach that addresses these challenges through intuitive design interfaces, enforced experimental protocols, and platform-agnostic architecture. Finally, I posed the central research question and described how this thesis addresses it through formal design, a reference implementation, and a pilot validation study.
+59 -13
@@ -1,9 +1,9 @@
\chapter{Background and Related Work}
\label{ch:background}
This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis relative to prior work on this topic.
This chapter provides the necessary context for understanding the challenges addressed by this thesis. I survey the landscape of existing WoZ platforms, analyze their capabilities and limitations, and establish requirements that a modern infrastructure should satisfy. Finally, I position this thesis within the context of prior work on this topic.
As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers. First, each research group builds custom software for specific robots, creating tool fragmentation across the field. Second, high technical barriers prevent many domain experts from conducting independent studies.
As established in Chapter~\ref{ch:intro}, the WoZ technique enables researchers to prototype and test robot interaction designs before autonomous capabilities are developed. This thesis is situated within a specific subset of HRI activity: social robotics, a subfield concerned with robots designed for direct social interaction with humans, and more narrowly within that, WoZ experiments used to prototype and evaluate social robot behaviors. To understand how the proposed framework advances this research paradigm, I review the existing landscape of WoZ platforms, identify their limitations relative to disciplinary needs, and establish requirements for a more comprehensive approach. HRI is fundamentally a multidisciplinary field which brings together engineers, psychologists, designers, and domain experts from various application areas \cite{Bartneck2024}. Yet two challenges have historically limited participation from non-technical researchers in WoZ-based HRI studies. First, high technical barriers prevent many domain experts from conducting independent studies. Second, each research group builds custom software for specific robots, creating tool fragmentation across the field.
\section{Existing WoZ Platforms and Tools}
@@ -11,36 +11,82 @@ Over the last two decades, multiple frameworks to support and automate the WoZ p
Early platform-agnostic tools focused on providing robust, flexible interfaces for technically sophisticated users. These systems were designed to work with multiple robot types rather than a single hardware platform. Polonius \cite{Lu2011}, built on the Robot Operating System (ROS) \cite{Quigley2009}, exemplifies this generation. It provides a graphical interface for defining finite state machine scripts that control robot behaviors, with integrated logging capabilities to streamline post-experiment analysis. The system was explicitly designed to enable robotics engineers to create experiments that their non-technical collaborators could then execute. However, the initial setup and configuration still required substantial programming expertise. Similarly, OpenWoZ \cite{Hoffman2016} introduced a cloud-based, runtime-configurable architecture using web protocols. Its design allows multiple operators or observers to connect simultaneously, and its plugin system enables researchers to extend functionality such as adding new robot behaviors or sensor integrations. Most importantly, OpenWoZ allows runtime modification of robot behaviors, enabling wizards to deviate from scripts when unexpected situations arise. While architecturally sophisticated and highly flexible, OpenWoZ requires programming knowledge to create custom behaviors and configure experiments, creating the \emph{Accessibility Problem} for non-technical researchers.
A second wave of tools shifted focus toward usability, often achieving accessibility by coupling tightly with specific hardware platforms. WoZ4U \cite{Rietz2021} was explicitly designed as an ``easy-to-use'' tool for conducting experiments with Aldebaran's Pepper robot. It provides an intuitive graphical interface that allows non-programmers to design interaction flows, and it successfully lowers the technical barrier. However, this usability comes at the cost of generalizability. WoZ4U is unusable with other robot platforms, and manufacturer-provided software follows a similar pattern.
A second wave of tools shifted focus toward usability, often achieving accessibility by coupling tightly with specific hardware platforms. WoZ4U \cite{Rietz2021} was explicitly designed as an ``easy-to-use'' tool for conducting experiments with Aldebaran's Pepper robot. It provides an intuitive graphical interface that allows non-programmers to design interaction flows, and it successfully lowers the technical barrier. However, this usability comes at the cost of generalizability. WoZ4U does not work for other robot platforms. Manufacturer-provided software for various robots follow a similar pattern.
Choregraphe \cite{Pot2009}, developed by Aldebaran Robotics for the NAO and Pepper robots, offers a visual programming environment based on connected behavior boxes. Researchers can create complex interaction flows using drag-and-drop blocks without writing code in traditional programming languages. However, when new robot platforms emerge or when hardware becomes obsolete, tools like Choregraphe and WoZ4U lose their utility. Pettersson and Wik, in their review of WoZ tools \cite{Pettersson2015}, note that platform-specific systems often fall out of use as technology evolves, forcing researchers to constantly rebuild their experimental infrastructure.
Recent years have seen renewed interest in comprehensive WoZ frameworks. Gibert et al. \cite{Gibert2013} developed the Super Wizard of Oz (SWoOZ) platform. This system integrates facial tracking, gesture recognition, and real-time control capabilities to enable naturalistic human-robot interaction studies. Virtual and augmented reality have also emerged as complementary approaches to WoZ. Helgert et al. \cite{Helgert2024} demonstrated how VR-based WoZ environments can simplify experimental setup while providing researchers with precise control over environmental conditions and high-fidelity data collection.
This expanding landscape reveals a persistent fundamental gap in the design space of WoZ tools. Flexible, general-purpose platforms like Polonius and OpenWoZ offer powerful capabilities but present high technical barriers. Accessible, user-friendly tools like WoZ4U and Choregraphe lower those barriers but sacrifice cross-platform compatibility and longevity. Newer approaches such as VR-based frameworks attempt to bridge this gap, yet no existing tool successfully combines accessibility, flexibility, deployment portability, and built-in methodological rigor.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
scale=1.0,
quadbox/.style={rectangle, draw=white, ultra thick, minimum width=5.5cm, minimum height=4.5cm, align=center},
title/.style={font=\small\bfseries, align=center},
desc/.style={font=\footnotesize, text=gray!60, align=center},
axislabel/.style={font=\small\bfseries, align=center}
]
% Quadrant Backgrounds
\fill[gray!20] (0, 4.5) rectangle (5.5, 9.0); % Top Left (HRIStudio)
\fill[gray!15] (5.5, 4.5) rectangle (11.0, 9.0); % Top Right (Polonius)
\fill[gray!10] (0, 0) rectangle (5.5, 4.5); % Bottom Left (WoZ4U)
\fill[gray!5] (5.5, 0) rectangle (11.0, 4.5); % Bottom Right (Choregraphe)
% Quadrant Lines
\draw[white, ultra thick] (5.5, 0) -- (5.5, 9.0);
\draw[white, ultra thick] (0, 4.5) -- (11.0, 4.5);
% Axis Labels
\node[axislabel, above] at (2.75, 9.2) {Low technical barrier};
\node[axislabel, above] at (8.25, 9.2) {High technical barrier};
\node[axislabel, left] at (-0.2, 6.75) {More rigorous};
\node[axislabel, left] at (-0.2, 2.25) {Less rigorous};
% Top Left: The Gap
\node[axislabel] at (2.75, 6.75) {\Huge ?};
% Top Right: Polonius, OpenWoZ, SWoOZ
\node[title] at (8.25, 7.4) {Polonius, OpenWoZ\\SWoOZ, VR Environments};
\node[desc] at (8.25, 6.0) {Flexible and powerful,\\but requires significant\\programming expertise};
% Bottom Left: WoZ4U
\node[title] at (2.75, 2.7) {WoZ4U};
\node[desc] at (2.75, 1.7) {Accessible, but\\platform-specific\\No methodological rigor};
% Bottom Right: Choregraphe
\node[title] at (8.25, 2.7) {Choregraphe};
\node[desc] at (8.25, 1.7) {Requires specialized\\training\\No methodological rigor};
\end{tikzpicture}
\caption{WoZ tool design space by technical barrier and methodological rigor.}
\label{fig:tool-matrix}
\end{figure}
The missing quadrant in Figure~\ref{fig:tool-matrix} matters because methodological rigor requires systematic features that guide experimenters toward best practices: consistently following experimental protocols, maintaining comprehensive logging, and producing reproducible experimental designs. Few platforms directly address the methodological concerns raised by systematic reviews of WoZ research. Riek's influential analysis \cite{Riek2012} of 54 HRI studies uncovered widespread inconsistencies in how wizard behaviors were controlled and reported. Very few studies documented standardized wizard training procedures or measured wizard error rates, raising questions about internal validity---that is, whether observed outcomes can be attributed to the intended experimental manipulation rather than to uncontrolled variation in wizard behavior. The tools themselves often exacerbate this problem: poorly designed interfaces increase cognitive load on wizards, leading to timing errors and behavioral inconsistencies that can confound experimental results. Recent work by Strazdas et al. \cite{Strazdas2020} further demonstrates the importance of careful interface design in WoZ systems, showing that intuitive wizard interfaces directly improve both the quality of robot behavior and the reliability of collected data.
\section{Requirements for Modern WoZ Infrastructure}
This thesis is the latest step in a multi-year effort to build infrastructure that addresses the challenges identified in the WoZ platform landscape. Based on the analysis of existing platforms and identified methodological gaps, I derived requirements for a modern WoZ research infrastructure. Through our preliminary work \cite{OConnor2024}, we identified six critical capabilities that a comprehensive platform should provide:
\begin{description}
\item[R1: Integrated workflow.] All phases of the experimental workflow (design, execution, and analysis) should be integrated within a single unified environment, so that researchers do not need to move between separate tools to design, run, and analyze their experiments.
\item[R2: Low technical barrier.] Creating social interaction protocols between a robot and a human should require minimal to no programming expertise, enabling domain experts from psychology, education, or other fields to work independently \cite{Bartneck2024}.
\item[R3: Real-time control.] The system must support fine-grained, responsive real-time control during live experiment sessions across a variety of robotic platforms. Consistent real-time control across platforms also directly supports reproducibility: the same script should execute with equivalent responsiveness regardless of which robot is used.
\item[R4: Automated logging.] All actions, timings, and sensor data should be automatically logged with synchronized timestamps to facilitate analysis.
\item[R5: Platform agnosticism.] The architecture should decouple experimental logic from robot-specific actions and behaviors. This allows experiments designed for one robot platform to be adapted to others, ensuring the WoZ platform remains viable as hardware evolves. This requirement also directly addresses the Reproducibility Problem: a platform-agnostic design makes it possible to run the same interaction script on different robots with minimal change to the implementing program.
\item[R6: Collaborative support.] Multiple team members should be able to contribute to experiment design and review of execution data, supporting truly interdisciplinary research.
\end{description}
To the best of my knowledge, no existing platform satisfies all six requirements. Most critically, the trade-off between accessibility and reproducibility remains unresolved. Few tools embed methodological best practices directly into their design to guide experimenters toward sound methodology by default.
This work builds on two prior peer-reviewed publications. We first introduced the concept for HRIStudio as a Late-Breaking Report at the 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) \cite{OConnor2024}. In that position paper, we identified the lack of accessible tooling as a primary barrier to entry in HRI and proposed the high-level vision of a web-based, collaborative platform. We established the core requirements listed above and argued for a web-based approach to achieve them.
Following the initial proposal, we published the detailed system architecture and preliminary prototype as a full paper at RO-MAN 2025 \cite{OConnor2025}. That publication validated the technical feasibility of our approach, detailing the communication protocols, data models, and plugin architecture necessary to support real-time robot control using standard web technologies while maintaining platform independence.
While those prior publications established the conceptual framework and technical architecture, this thesis formalizes those design principles, realizes them in a complete implementation, and evaluates whether they produce measurably different outcomes in a pilot validation study. The pilot study compares design fidelity and execution reliability between HRIStudio and a representative baseline tool, showing whether these principles translate into better outcomes for real researchers.
\section{Chapter Summary}
\chapter{Reproducibility Challenges}
\label{ch:reproducibility}
Having established the landscape of existing WoZ platforms and their limitations, I now examine the factors that make WoZ experiments difficult to reproduce consistently and how software infrastructure can address them. This chapter analyzes the sources of variability identified in the WoZ literature and examines how current practices in infrastructure and reporting contribute to \emph{the Reproducibility Problem}. Understanding these challenges is essential for designing a system that supports reproducible, rigorous experimentation.
\section{Sources of Variability}
\emph{The Reproducibility Problem}, as introduced in Chapter~\ref{ch:intro}, encompasses two related challenges. The first concerns \emph{execution consistency}: whether a wizard reliably follows the same experimental script across multiple trials with different participants, producing comparable robot behavior in each. The second concerns \emph{cross-platform reproducibility}: whether the same experiment can be transferred to a different robot platform with minimal change to the implementing program. Both stem from gaps in current WoZ infrastructure and are examined in this chapter. It is important to note that the term reproducibility may also refer to \emph{allowing independent replications of published studies}; this is not what this thesis evaluates. Execution consistency, as defined here, corresponds to what the measurement literature sometimes calls \emph{repeatability}: the degree to which the same procedure produces consistent results when repeated across multiple trials of the same study.
In WoZ-based HRI studies, multiple sources of variability can compromise execution consistency. The wizard is simultaneously the strength and weakness of the WoZ paradigm. While human control enables sophisticated, adaptive interactions, it also introduces inconsistency. Consider a wizard conducting multiple trials of the same experiment with different participants.
Even with a detailed script, the wizard may vary in timing, with the delay between a participant's action and the robot's response fluctuating based on the wizard's attention, fatigue, or interpretation of when to act. When a script allows for choices, different wizards may make different selections, or the same wizard may act differently across trials. Furthermore, a wizard may accidentally skip steps, trigger actions in the wrong order, or misinterpret experimental protocols.
Riek's systematic review \cite{Riek2012} found that very few published studies reported measuring wizard error rates or providing standardized wizard training. Without such measures, it becomes impossible to determine whether experimental results reflect the intended interaction design or inadvertent variations in wizard behavior.
Beyond wizard behavior, the custom nature of many WoZ control systems introduces technical variability. When each research group builds custom software for each study, several problems arise. Custom interfaces may have undocumented capabilities, hidden features, default behaviors, or timing characteristics researchers never formally describe. Software tightly coupled to specific robot models or operating system versions may become unusable when hardware or software is upgraded or replaced. Each system logs data differently, with different file formats, different levels of granularity, and different choices about what to record. This fragmentation undermines both execution consistency and reproducibility. Rebuilding custom infrastructure for each study makes it nearly impossible to guarantee that wizard behavior is controlled the same way across trials. More broadly, reproducing the same experiment on a different robot platform typically requires reverse-engineering or rebuilding the original software from scratch.
Even when researchers intend for their work to be reproducible, practical constraints on publication length lead to incomplete documentation. Papers often omit exact timing parameters. Authors leave decision rules for wizard actions unspecified and fail to report details of the wizard interface. Specifications of data collection, including which sensor streams were recorded and at what sampling rate, frequently go missing. Without this information, other researchers cannot faithfully recreate the experimental conditions, limiting both direct replication and conceptual extensions of prior work.
\section{Connecting Reproducibility Challenges to Infrastructure Requirements}
The reproducibility challenges identified above directly motivate the infrastructure requirements (R1--R6) established in Chapter~\ref{ch:background}. Inconsistent wizard behavior creates the need for real-time control mechanisms (R3) that guide wizards step by step, and for automatic logging (R4) that captures any deviations that occur. Timing errors further motivate responsive, fine-grained real-time control (R3): a wizard working with a sluggish interface introduces latency that disrupts the interaction and confounds timing analysis. Technical fragmentation forces each lab to rebuild infrastructure as hardware changes, violating platform agnosticism (R5). Incomplete documentation reflects the need for self-documenting, code-free protocol specifications (R2) that are simultaneously executable and shareable, integrated into a single workflow (R1) so that the specification and the execution environment are never separated. Finally, the isolation of individual research groups motivates collaborative support (R6): allowing multiple team members to observe and review trials enables the shared scrutiny that reproducibility requires. As Chapter~\ref{ch:background} demonstrated, no existing platform simultaneously satisfies all six requirements. Addressing this gap requires rethinking how WoZ infrastructure is designed, prioritizing reproducibility and methodological rigor as first-class design goals rather than afterthoughts.
\section{Chapter Summary}
This chapter has analyzed the reproducibility challenges inherent in WoZ-based HRI research, identifying three primary sources of variability: inconsistent wizard behavior, fragmented technical infrastructure, and incomplete documentation. Rather than treating these challenges as inherent to the WoZ paradigm, I showed how each stems from gaps in current infrastructure. Software design can systematically mitigate them through enforced experimental protocols, comprehensive automatic logging, self-documenting experiment designs, and platform-independent abstractions. These design goals directly address the six infrastructure requirements identified in Chapter~\ref{ch:background}. The following chapters describe the design, implementation, and pilot validation of a system that prioritizes reproducibility as a foundational design principle from inception.
\section{Hierarchical Organization of Experiments}
WoZ studies involve multiple experiments, shared protocol phases, and platform-specific behaviors that span the full research lifecycle. To organize these elements without requiring researchers to write code, the system structures every study as a four-level hierarchy: \emph{study} $\rightarrow$ \emph{experiment} $\rightarrow$ \emph{step} $\rightarrow$ \emph{action}. This structure separates high-level protocol design from low-level execution behavior, keeping the authoring process code-free while integrating design, execution, and analysis into a single unified workflow.
I define the elements in this hierarchy as follows. A \emph{study} is the top-level container that groups related experiments. An \emph{experiment} is one independently runnable protocol within that study (for example, a control or experimental condition). A \emph{step} is one phase of the protocol timeline (for example, an introduction, telling a story, or testing information recall). An \emph{action} is the smallest executable unit inside a step (for example, trigger a gesture, play audio, or speak a prompt).
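To make this hierarchy concrete for readers who think in code, the sketch below models the four levels as TypeScript interfaces, TypeScript being the platform's implementation language. This is an illustrative simplification, not the actual HRIStudio schema; every type and field name here is hypothetical.

\begin{verbatim}
// Illustrative sketch of the four-level hierarchy; all names
// are hypothetical and simplified relative to the real schema.
interface Study {
  id: string;
  title: string;
  experiments: Experiment[]; // independently runnable protocols
}
interface Experiment {
  id: string;
  name: string;              // e.g., "NAO6 with gestures"
  steps: Step[];             // ordered phases of the protocol timeline
}
interface Step {
  id: string;
  name: string;              // e.g., "Story Telling"
  actions: Action[];         // smallest executable units
}
interface Action {
  id: string;
  type: "speak" | "gesture" | "play_audio" | "wizard_prompt";
  params: Record<string, unknown>; // e.g., { text: "Hello!" }
}
\end{verbatim}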
Figure~\ref{fig:experiment-hierarchy} shows this hierarchical structure. Reading top-down, one study contains one or more experiments, each experiment contains one or more steps, and each step contains one or more actions.
Figure~\ref{fig:trial-instantiation} illustrates how a protocol definition relates to its instantiation. The left column holds the protocol, defined before the study begins; the right column shows how that abstract protocol is instantiated as independent trials. A dashed line marks the protocol/trial boundary: everything to its left was authored by the researcher before any participant arrived; everything to its right was generated during a live session. The \textit{instantiates} arrows from the experiment node fan out to each trial record, making the relationship explicit. This separation is central to reproducibility: the same experiment specification generates a distinct, timestamped record per participant, so researchers can compare across participants without conflating what was designed with what was executed.
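The same separation can be sketched in code. The fragment below is a minimal illustration under hypothetical names (\texttt{Trial}, \texttt{instantiateTrial}), not HRIStudio's actual data model: each participant yields a fresh, timestamped record that references, rather than copies, the experiment specification.

\begin{verbatim}
// Minimal sketch (hypothetical names): one experiment specification
// instantiates a distinct, timestamped trial record per participant.
interface Trial {
  id: string;
  experimentId: string;  // the protocol is referenced, not copied
  participantId: string; // e.g., "P01"
  startedAt: string;     // ISO 8601 timestamp
  log: string[];         // filled in only during the live session
}

function instantiateTrial(
  experimentId: string,
  participantId: string,
): Trial {
  return {
    id: crypto.randomUUID(), // fresh record for this participant
    experimentId,
    participantId,
    startedAt: new Date().toISOString(),
    log: [],
  };
}
\end{verbatim}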
To illustrate the hierarchy with a concrete example, consider an interactive storytelling study with the research question: \emph{Does how the robot tells a story affect how a human will remember the story?} The experiment might use different robots, for instance Pepper, NAO6, and TurtleBot. Figure~\ref{fig:robot-morphologies} shows the morphology of these three different robots: Pepper and NAO6 are humanoid social robots with expressive gestures and human-like forms, while TurtleBot is a wheeled mobile robot with a visibly machine-like form and no social movement cues. In the example below, the narrative task remains the same across two robot-specific experiments; only how the robot delivers it changes.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{images/nao6.jpg}
\caption{NAO6 (Humanoid)}
\label{fig:robot-nao}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{images/pepper.png}
\caption{Pepper (Social)}
\label{fig:robot-pepper}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=\textwidth]{images/turtlebot.png}
\caption{TurtleBot (Mechanical)}
\label{fig:robot-turtlebot}
\end{subfigure}
\caption{Three robot morphologies supported by the HRIStudio architecture.}
\label{fig:robot-morphologies}
\end{figure}
Figure~\ref{fig:example-hierarchy} maps the study presented above onto the hierarchical elements defined in Figure~\ref{fig:experiment-hierarchy}. The study branches into two experiments (TurtleBot with only voice, NAO6 with added gestures), each experiment uses the same sequence of ordered steps (Intro, Story Telling, Recall Test), and each step defines the specific actions the robot will perform. The figure expands only the Story Telling step to keep the diagram readable, but Intro and Recall Test follow the same structure.
Together, these three figures motivate why the hierarchy is useful in practice: Figure~\ref{fig:experiment-hierarchy} defines the experimental structure as an abstraction; Figure~\ref{fig:trial-instantiation} shows how that abstract structure is instantiated as concrete trial records; and Figure~\ref{fig:example-hierarchy} expands each element of the structure for a concrete study.
\begin{figure}[htbp]
\centering
\draw[arrow] (exp.south) -- (step.north);
%% ---- Trial column ----
\node[trial] (t1) at (7.9, 5.5) {Trial: P01\\{\footnotesize timestamped log}};
\node[trial] (t2) at (7.9, 4.2) {Trial: P02\\{\footnotesize timestamped log}};
\node[trial] (t3) at (7.9, 2.9) {Trial: P03\\{\footnotesize timestamped log}};
%% ---- Separator ----
\draw[gray!60, thick, dashed] (4.85, 1.8) -- (4.85, 6.6);
\label{fig:example-hierarchy}
\end{figure}
The layered structure compels researchers to define experimental protocols at multiple levels of granularity without writing code, keeping the authoring process accessible to non-programmers. The step and action elements also align naturally with the sequence of events in a trial, so the wizard stays guided by the protocol while retaining control over the timing of each event, which supports the real-time control requirement (R3). Action-level execution provides a natural unit for timestamped logging and post-trial analysis, satisfying the automated logging requirement (R4). Finally, keeping experiment definitions separate from trial instances means the same protocol can be reproduced across participants and experiments, supporting both the integrated workflow (R1) and collaborative support (R6) requirements.
\section{Event-Driven Execution Model}
To achieve real-time responsiveness while maintaining methodological rigor (R3, R5), the system uses an event-driven execution model rather than a time-driven one. In a time-driven approach, the system advances through actions on a fixed schedule regardless of what the participant is doing, so the robot might speak over a participant who is still talking, or move on before a response has been given. The event-driven model avoids this by letting the wizard trigger each action when the interaction is ready for it. Figure~\ref{fig:event-driven-timeline} contrasts the two approaches using the same four-action sequence: Greet (G), Begin Story (BS), Ask Question (AQ), and End (E). In the time-driven row, fixed intervals $t_0$ through $t_2$ define when each event fires, and dashed vertical lines show where those moments fall relative to the event-driven rows below. In both event-driven rows, the wizard fires the same four labeled events at different real-time positions (T1, a faster participant, finishes well before T2, a slower one), while both preserve the same action order.
\begin{figure}[htbp]
\centering
\label{fig:event-driven-timeline}
\end{figure}
This approach has several implications. What the event-driven model guarantees is not identical timing across trials, but consistent action ordering: every participant experiences the same sequence of protocol steps, even if the pace varies. Timing is recorded accurately, permitting researchers to analyze natural variation across participants. The wizard responds contextually without departing from the protocol; the wizard remains guided by the sequence of available actions while retaining control over when to advance based on participant cues.
The system guides the wizard through the protocol step-by-step, ensuring the intended sequence is followed. Every action is logged with a timestamp whether it was scripted or not, and anything outside the protocol is flagged as a deviation. This means inconsistent wizard behavior becomes visible in the data rather than disappearing into it.
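The sketch below illustrates one way such an engine could be structured; the names (\texttt{TrialEngine}, \texttt{advance}, \texttt{deviate}) are assumptions for exposition, and the code is not taken from the HRIStudio codebase. The essential behavior is that nothing advances without a wizard event, and unscripted actions are flagged rather than discarded.

\begin{verbatim}
// Hypothetical sketch of event-driven advancement: nothing happens
// until the wizard fires an event, and every action, scripted or
// not, is written to the log with a timestamp.
type LogEntry = {
  t: string;               // ISO 8601 timestamp
  actionId: string | null; // null for unscripted actions
  deviation: boolean;
  detail?: string;
};

class TrialEngine {
  private cursor = 0;
  readonly log: LogEntry[] = [];

  constructor(private script: string[]) {} // ordered action ids

  // Wizard triggers the next scripted action when ready.
  advance(): void {
    if (this.cursor >= this.script.length) return; // protocol done
    const actionId = this.script[this.cursor++];
    this.log.push({
      t: new Date().toISOString(),
      actionId,
      deviation: false,
    });
    // ...dispatch the corresponding command to the robot bridge...
  }

  // Wizard performs an unscripted action; it is flagged, not hidden.
  deviate(detail: string): void {
    this.log.push({
      t: new Date().toISOString(),
      actionId: null,
      deviation: true,
      detail,
    });
  }
}
\end{verbatim}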
\section{Modular Interface Architecture}
The \emph{Design} interface gives researchers a drag-and-drop canvas for building experiment protocols, creating a visual programming environment. Researchers drag pre-built action components, including robot movements, speech, wizard instructions, and conditional logic, onto the canvas and drop them into sequence. Clicking a component opens a side panel where its parameters can be set, such as the text for a speech action or the gesture name for a movement.
By treating experiment design as a visual specification task, the interface lowers technical barriers (R2). Researchers can assemble interaction logic by dragging components into sequence and setting parameters in plain language, without writing any code. The resulting protocol specification is also human-readable and shareable alongside experimental results. The specification is stored in a structured format that can be displayed as a timeline for analysis and executed directly by the platform's runtime. This property is central to reproducibility: a third party with access to the specification can run the experiment faithfully without reverse-engineering the original system.
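As a hypothetical illustration of what such a stored specification might look like, the fragment below expresses one protocol step as a plain data structure; the field names are assumptions for exposition, not HRIStudio's actual storage format. The key property is that the same object is readable by a human reviewer and interpretable by the runtime.

\begin{verbatim}
// Illustrative protocol fragment (hypothetical field names, not
// HRIStudio's storage format): readable by humans, executable by
// the runtime, and shareable alongside published results.
const storyTellingStep = {
  name: "Story Telling",
  actions: [
    { type: "speak",   params: { text: "Once upon a time..." } },
    { type: "gesture", params: { name: "wave" } },
    { type: "speak",   params: { text: "What happens next?" } },
  ],
} as const;
\end{verbatim}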
\subsection{Execution Interface}
During trials, the \emph{Execution} interface keeps the wizard informed of exactly where they are in the protocol. The current step, the available actions, and the robot's current state are all updated in real time as the trial progresses.
The \emph{Execution} interface also exposes a set of manual controls for actions that fall outside the scripted protocol. A \emph{deviation} is a spontaneous action introduced by the wizard in response to a reaction of the human subject that was not anticipated when the script was created. Consider a human subject who asks an unexpected question mid-trial: the wizard can trigger an unscripted speech response on the spot rather than leaving the interaction to stall, keeping the interaction feeling natural for the human subject. Critically, the system does not ignore these deviations from the script. Every deviation is timestamped and written to the trial log, giving researchers a complete picture of what actually happened versus what was planned. This makes unscripted actions a feature rather than a source of noise: the wizard retains real-time control over the interaction, and the logging infrastructure captures everything needed for post-trial analysis.
Additional researchers can simultaneously access a live view of a trial through the platform's Dashboard by selecting a trial to ``spectate.'' Multiple researchers observing the same trial view an identical synchronized display of the wizard's controls, human subject interactions, and robot state, supporting real-time collaboration and interdisciplinary observation (R6). Observers can take notes and mark significant moments without interfering with the wizard's control or the human subject's experience.
\subsection{Analysis Interface}
After a trial concludes, the \emph{Analysis} interface lets researchers review everything that was recorded: video of the interaction, audio, timestamped action logs, and robot sensor data, all scrubable from a single timeline. Researchers can annotate significant moments and export segments for further analysis. Because the same platform produced both the protocol and the recording, the interface eliminates the need for manual cross-referencing by showing exactly where the execution matched the design and where it deviated.
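Assuming the logs and the recorded media share a common clock, aligning an event with a recording reduces to timestamp arithmetic. The helper below is an illustrative sketch with hypothetical names, not HRIStudio's actual playback code.

\begin{verbatim}
// Illustrative sketch (hypothetical names): align a logged event
// with the recording by converting its absolute timestamp into an
// offset from the start of the media file.
function mediaOffsetSeconds(
  eventTime: string,
  recordingStart: string,
): number {
  return (Date.parse(eventTime) - Date.parse(recordingStart)) / 1000;
}

// e.g., seek a video player to a flagged deviation:
//   player.currentTime = mediaOffsetSeconds(entry.t, trial.startedAt);
\end{verbatim}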
\section{Data Flow and Infrastructure Implementation}
\subsection{Architectural Layers}
HRIStudio separates its communicative and functional responsibilities into distinct layers, in a manner analogous to the layered reference models used in networking software. More specifically, the system is organized as a three-layer architecture, as shown in Figure~\ref{fig:three-tier}, each layer with a specific responsibility:
\begin{description}
\item[User Interface layer.] Runs in researchers' web browsers and exposes the three interfaces (Design, Execution, Analysis), managing user interactions such as clicking buttons, dragging and dropping experiment components, and reviewing experimental results.
\centering
\begin{tikzpicture}[
layer/.style={rectangle, draw=black, thick, fill, minimum width=6.5cm, minimum height=1cm, align=center, text width=6.2cm},
arrow/.style={-, thick, line width=1.5pt}]
% Layer 1: UI
\node[layer, fill=gray!15] (ui) at (0, 3.5) {
% Layer 2: Logic
\node[layer, fill=gray!30] (logic) at (0, 1.8) {
\textbf{Application Logic}\\[0.1cm]
{\small Trial Engine, Authentication, Logger}
};
% Layer 3: Data
{\small Database, File Storage, ROS}
};
% Arrows (bidirectional)
\draw[-, thick, line width=1.5pt] (ui.south) -- (logic.north);
\draw[-, thick, line width=1.5pt] (logic.south) -- (data.north);
\end{tikzpicture}
\caption{Three-layer architecture separates user interface, application logic, and data/robot control.}
During the design phase, researchers create experiment specifications that are stored in the system database. During a trial, the system manages bidirectional communication between the wizard's interface and the robot control layer. All actions, sensor data, and events are streamed to a data logging service that stores complete records. After the trial, researchers can inspect these records through the Analysis interface.
The flow of data during a trial proceeds through six distinct phases as discussed below; these phases are summarized in Figure~\ref{fig:trial-dataflow}:
\begin{enumerate}
\item A researcher creates an experiment protocol using the Design interface.
\item When a trial begins, the application server loads the protocol and allows the wizard to step through it, sending commands to the robot and waiting for events such as wizard inputs, sensor readings, or timeouts.
\item Every action, both planned protocol steps and deviations, is immediately written to the trial log with precise timing information.
\item The \emph{Execution} interface continuously displays the current state, allowing the wizard and observers to monitor the progress of a trial in real-time.
\item When the trial concludes, all recorded media (video and audio) is transferred from the browser to the server and persisted in a database as part of the trial record.
\item The \emph{Analysis} interface retrieves the stored trial data and reconstructs exactly what happened, synchronizing notable events with the video and audio recordings.
\end{enumerate}
This design automatically creates comprehensive documentation of every trial, supporting both fine-grained analysis and reproducibility. Researchers can review not just what they intended to happen, but what actually did happen, including timing variations and deviations.
\begin{figure}[htbp]
\centering
\draw[arrow] (s5.south) -- (s6.north);
\end{tikzpicture}
\caption{Six-phase trial data flow.}
\label{fig:trial-dataflow}
\end{figure}
\section{Chapter Summary}
This chapter described the architectural design with emphasis on how each design choice directly implements the infrastructure requirements identified in Chapter~\ref{ch:background}. The hierarchical organization of experiment specifications enables intuitive, executable design. The event-driven execution model balances protocol consistency with realistic interaction dynamics. The modular interface architecture separates concerns across design, execution, and analysis phases while maintaining data coherence. The integrated data flow ensures that reproducibility is supported by design rather than by afterthought. The following chapter presents HRIStudio, the platform built on these design principles, describing the specific technologies and architectural components that bring them to life.
\chapter{Implementation}
\label{ch:implementation}
HRIStudio is a complete, operational platform that realizes the design principles established in Chapter~\ref{ch:design}. As the primary artifact of this thesis, it demonstrates that those principles are not merely theoretical: the hierarchical specification model, the event-driven execution model, and the integrated data flow can be built into a system that real researchers use without programming expertise. Any system built on those principles could satisfy the same requirements; HRIStudio is the implementation that proves they work in practice. This chapter explains how HRIStudio realizes those principles, covering the architectural choices and mechanisms behind how the platform stores experiments, executes trials, integrates robot hardware, and controls access. The specific technologies used are presented in Appendix~\ref{app:tech_docs}.
\section{Platform Architecture}
HRIStudio follows the model of a web application. Users access it through a standard browser without installing specialized software, and the entire study team, including researchers, wizards, and observers, connect to the same shared system. This eliminates the need for a local installation and ensures the platform works identically on any operating system, directly addressing the low-technical-barrier requirement (R2, from Chapter~\ref{ch:background}). It also enables easy collaboration (R6): multiple team members can access experiment data and observe trials simultaneously from different machines without any additional configuration.
I organized the system into three layers: User Interface, Application Logic, and Data \& Robot Control. This layered structure is presented in Chapter~\ref{ch:design} and shown in Figure~\ref{fig:three-tier}. In practice, the User Interface layer runs in each researcher's browser (the client), while the Application Logic and Data \& Robot Control layers run on a shared application server.
While the system can run entirely on a single machine for local testing, this architecture allows the components to be distributed across different systems. The application server can be hosted centrally or even in a remote data center, enabling observers to connect to a live trial from any location with internet access. In such a configuration, it is essential that the robot control hardware and the client computer running the wizard's Execution interface stay on the same local network as the robot. This ensures that the WebSocket-based communication between the wizard and the robot bridge maintains low latency, as a noticeable delay between the wizard's input and the robot's response would break the interaction.
This flexibility of deployment also addresses the varying data security and compliance needs of different research institutions. A lab may choose to host HRIStudio on a public-facing server to prioritize collaborative ease and accessibility for remote team members. Alternatively, a lab with strict data privacy requirements or institutional review board (IRB) constraints can deploy the entire stack on a private, air-gapped network. Because the platform is self-contained and does not rely on external cloud services for its core execution logic, researchers have full control over where their experimental data is stored and who can access it.
I implemented all three layers in the same language: TypeScript~\cite{TypeScript2014}, a statically-typed superset of JavaScript. The single-language decision keeps the type system consistent across the full stack. When the structure of experiment data changes, the type checker surfaces inconsistencies across the entire codebase at compile time rather than allowing them to appear as runtime failures during a trial.
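To illustrate the benefit of a single type system, consider a shared definition of a logged action; the identifiers below are illustrative only and are not taken from the HRIStudio codebase. Because both the browser client and the server import the same definition, a change to the experiment data structure surfaces everywhere at compile time:
\begin{verbatim}
// Illustrative only: a shared type consumed by both client and server.
// Changing a field here flags every stale usage at compile time.
interface TrialAction {
  id: string;
  type: "speech" | "gesture" | "branch";
  offsetMs: number;    // timestamp relative to trial start
  deviation: boolean;  // true if triggered outside the protocol
}

// Server-side logging and client-side rendering consume the same
// definition, so the two sides cannot drift apart silently.
function describe(action: TrialAction): string {
  return `${action.offsetMs}ms ${action.type}` +
         (action.deviation ? " (deviation)" : "");
}
\end{verbatim}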
HRIStudio is released as open-source software under the MIT License, with the application hosted in a public repository~\cite{HRIStudioRepo}. The companion robot plugin repository~\cite{RobotPluginsRepo} is maintained as a git submodule and updated whenever HRIStudio's schema or protocol changes. Both repositories are available for inspection, extension, and deployment by other research groups.
HRIStudio is implemented as a set of containerized services that work together to provide the platform's functionality. This modular architecture ensures that each component can be scaled or replaced independently as requirements change.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
node distance=0.8cm and 1.8cm,
servicebox/.style={rectangle, draw=black, thick, fill=gray!15, align=center, font=\small, inner sep=5pt, minimum width=2.2cm},
containerbox/.style={rectangle, draw=black, thick, dashed, fill=gray!5, align=center, font=\small\bfseries, inner sep=12pt},
wsbox/.style={rectangle, draw=black, ultra thick, fill=white, align=center, font=\scriptsize\bfseries, inner sep=3pt},
arrow/.style={->, thick, >=stealth},
darrow/.style={<->, thick, >=stealth, dashed},
labelstyle/.style={font=\scriptsize\itshape, align=center}
]
% HRIStudio System Container Services
\node[servicebox] (nextjs) {Next.js\\Server};
\node[servicebox, below=of nextjs] (postgres) {PostgreSQL\\Database};
\node[servicebox, below=of postgres] (minio) {MinIO\\Object Storage};
\draw[arrow] (nextjs) -- (postgres);
\draw[arrow] (nextjs) -- (minio);
% HRIStudio Container Boundary
\begin{scope}[on background layer]
\node[containerbox, fit=(nextjs) (postgres) (minio), inner sep=15pt] (hri_cont) {};
\node[anchor=south, font=\small\bfseries, yshift=2pt] at (hri_cont.north) {HRIStudio System};
\end{scope}
% NAO6 Integration Bridge Container Services
\node[servicebox, right=4.5cm of nextjs] (driver) {NAOqi\\Driver};
\node[servicebox, below=of driver] (ros) {ROS 2\\Core};
\node[servicebox, below=of ros] (adapter) {HRIStudio\\Adapter};
\draw[darrow] (driver) -- (ros);
\draw[darrow] (ros) -- (adapter);
% Bridge Container Boundary
\begin{scope}[on background layer]
\node[containerbox, fit=(driver) (ros) (adapter), inner sep=15pt] (bridge_cont) {};
\node[anchor=south, font=\small\bfseries, yshift=2pt] at (bridge_cont.north) {NAO6 Bridge};
\end{scope}
% Client/Wizard
\node[servicebox] (client) at ($(hri_cont.north)!0.5!(bridge_cont.north) + (0, 2.2)$) {Wizard Browser};
% WebSocket Connections
\node[wsbox] (sys_ws) at ($(client.south)!0.5!(hri_cont.north)$) {System WebSocket};
\node[wsbox] (robot_ws) at ($(client.south)!0.5!(bridge_cont.north)$) {Robot WebSocket};
\draw[darrow] (client.south) -- (sys_ws.north);
\draw[darrow] (sys_ws.south) -- (hri_cont.north);
\draw[darrow] (client.south) -- (robot_ws.north);
\draw[darrow] (robot_ws.south) -- (bridge_cont.north);
% Hardware
\node[servicebox, right=1.5cm of bridge_cont] (robot) {NAO6\\Robot};
\draw[arrow] (bridge_cont.east) -- node[above, font=\scriptsize, align=center] {NAOqi\\API} (robot.west);
\end{tikzpicture}
\caption{Containerized HRIStudio and NAO6 integration architecture.}
\label{fig:system-architecture}
\end{figure}
The HRIStudio system consists of three primary services: a Next.js application server that handles the user interface and business logic, a PostgreSQL database for persistent storage of experiment and trial data, and a MinIO object storage service for managing large media files like video and audio recordings. For robot integration, the \texttt{nao6-hristudio-integration} bridge also employs a containerized structure consisting of the NAOqi driver, a ROS 2 core for message routing, and a specialized adapter that communicates with HRIStudio.
During a live trial, the wizard's browser establishes two independent WebSocket connections. The System WebSocket connects to the HRIStudio server to manage trial state, protocol progression, and logging. The Robot WebSocket connects directly to the integration bridge to provide low-latency control of the robot platform. This split-connection model ensures that system-level management does not introduce latency into the robot's physical responses.
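The split-connection model can be sketched in a few lines of TypeScript; the endpoint URLs and message shapes below are hypothetical placeholders rather than the platform's actual protocol:
\begin{verbatim}
// Hypothetical sketch: one socket for trial state and logging,
// one for low-latency robot control on the local network.
const systemWs = new WebSocket("wss://hristudio.example/trials/42");
const robotWs = new WebSocket("ws://192.168.1.50:9090/bridge");

systemWs.onmessage = (event) => {
  // Trial state updates: protocol progression, log confirmations.
  console.log("system:", event.data);
};

function triggerAction(command: { action: string }): void {
  // Robot commands bypass the application server entirely, so
  // server load never delays the robot's physical response.
  robotWs.send(JSON.stringify(command));
}
\end{verbatim}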
\subsection{Working with AI Coding Assistants}
\label{sec:ai-ws}
The scale of the implementation described in this chapter, a full-stack TypeScript application spanning user interface, application logic, persistent storage, and real-time robot control, would not have been possible within the timeframe of this thesis without the use of AI coding assistants. I distinguish clearly between the engineering and implementation roles in this work: I architected the system, made the design decisions documented in Chapter~\ref{ch:design} and this chapter, specified the behavior and constraints of each component, and reviewed and integrated all code before it entered the codebase. AI agents acted as software developers working under that direction, producing TypeScript code in response to the specifications I provided and the feedback I gave as the implementation evolved. The division of labor was consistent throughout: I engineered, they implemented.
The tools I used in this capacity spanned several vendors and interaction paradigms, and the set evolved as the AI landscape changed over the course of the project. Claude~\cite{Anthropic2024Claude} was the conversational model I relied on most consistently for design discussions and code review. I used Claude Code~\cite{AnthropicClaudeCode}, OpenCode~\cite{OpenCode}, the Gemini CLI~\cite{GeminiCLI}, and Google Antigravity~\cite{GoogleAntigravity} as terminal- and editor-integrated coding agents for implementing the features I specified; the Zed editor~\cite{ZedEditor} served as the surrounding development environment and provided its own AI-assisted editing features. These tools overlapped in places, but I generally used one at a time and switched between them as new capabilities became available and as I learned which tool suited which kind of work. Appendix~\ref{app:ai_workflow} documents this workflow in more detail: the division of responsibility between me and the agents, the kinds of tasks each category of tool handled well, and the limits I ran into.
\section{Experiment Storage and Trial Logging}
The system saves experiment descriptions to persistent storage when a researcher completes them in the Design interface. A saved experiment is a complete, reusable specification that a researcher can run across any number of trials without modification. In this chapter, a trial means one concrete run of an experiment protocol with one human subject; trials are where spontaneous wizard deviations can occur.
When a trial begins, the system creates a new trial record linked to that experiment. Every action the wizard triggers is written to that record with a precise timestamp, whether scripted or unscripted, and the system flags unscripted actions as deviations. The Execution interface records video, audio, and robot sensor data alongside the action log for the duration of the trial. Because the trial record and the experiment reference the same underlying specification, the Analysis interface can directly compare what was planned against what was executed for any trial, without any manual work by the researcher. Figure~\ref{fig:trial-record} shows the structure of a completed trial record: action log entries, video, audio, and robot sensor data all share a common timestamp reference so the Analysis interface can align them without manual synchronization; dashed lines mark step boundaries; and the system flags any deviation from the experiment specification at the appropriate position in the timeline.
\label{fig:trial-record}
\end{figure}
Video and audio are recorded locally in the wizard's browser during the trial rather than streamed to the server in real time. The wizard's browser is the canonical recording client because the wizard is the only role required for a trial to run; observer and researcher roles connect in read-only capacities and do not capture media. Recording locally prevents network delays or server load from dropping frames or degrading audio quality during the interaction. When the trial concludes, the wizard's browser transfers the complete recordings to the server and associates them with the trial record. The Analysis interface can align video and audio with the logged actions without any manual synchronization, because the timestamp when recording starts is logged alongside the action log.
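A minimal sketch of this local-first recording flow, using the standard browser media APIs; the upload endpoint and its parameters are hypothetical:
\begin{verbatim}
// Record locally during the trial; upload only after it concludes.
async function recordTrial(trialId: string): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  const chunks: Blob[] = [];
  const recorder = new MediaRecorder(stream);
  const startedAt = Date.now(); // logged alongside the action log

  recorder.ondataavailable = (event) => chunks.push(event.data);
  recorder.onstop = async () => {
    // The complete recording is transferred once, after the trial.
    const media = new Blob(chunks, { type: "video/webm" });
    await fetch(`/api/trials/${trialId}/media?t0=${startedAt}`, {
      method: "POST",
      body: media,
    });
  };
  recorder.start();
}
\end{verbatim}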
The system stores structured and media data separately. Experiment specifications and trial records are stored in the same structured database, which makes it efficient to query across trials (for example, retrieving all trials for a specific participant or comparing action timing across conditions). Video and audio files are stored in a dedicated file store, since their size makes them unsuitable for a database and the system never queries their content directly.
Figure~\ref{fig:trial-report} shows the Analysis interface reconstructing a completed trial. The recorded video is presented alongside a synchronized action log, with each logged event linked to its moment in the recording so researchers can jump directly to the corresponding interaction without manual cross-referencing.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.95\textwidth]{assets/trial-report.png}
\caption{The HRIStudio Analysis interface showing a completed trial with video and a synchronized, timestamped action log.}
\label{fig:trial-report}
\end{figure}
\section{The Execution Engine}
The execution engine is the component that runs a trial: it loads the experiment, manages the wizard's connection, sends robot commands, and keeps all connected clients in sync.
When a trial begins, the server loads the experiment and maintains a live connection to the wizard's browser and any observer connections. The execution engine does not advance through actions on a timer; the wizard controls how time advances from action to action, preserving the natural pacing of the interaction while the experiment structure ensures the protocol is followed. When the wizard triggers an action, the server sends the related command to the robot, writes the log entry, and pushes the updated experiment state to all connected clients in the same operation, keeping the wizard's view, the observer view, and the actual robot state synchronized in real time.
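The following sketch illustrates this single-operation dispatch path; all names and structures are illustrative rather than drawn from the implementation:
\begin{verbatim}
// Illustrative dispatch: one wizard trigger yields a robot command,
// a timestamped log entry, and a broadcast to all connected clients.
type LogEntry = { actionId: string; offsetMs: number; deviation: boolean };

function dispatch(
  actionId: string,
  scripted: boolean,
  t0: number,
  robot: { send: (id: string) => void },
  clients: { send: (msg: string) => void }[],
  log: LogEntry[],
): void {
  robot.send(actionId);                              // robot command
  log.push({
    actionId,
    offsetMs: Date.now() - t0,                       // shared clock
    deviation: !scripted,                            // audit trail
  });
  const update = JSON.stringify({ type: "state", actionId });
  for (const client of clients) client.send(update); // live sync
}
\end{verbatim}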
No two human subjects respond identically to an experimental protocol. One subject gives a one-word answer; another offers a paragraph; a third asks the robot a question the script never anticipated. Where the scripted protocol runs out of instructions, the wizard draws on their knowledge of human social interaction to keep the exchange coherent, and unscripted actions give the wizard the tools to exercise that judgment in the moment. The wizard triggers them via the manual controls in the Execution interface, the robot command runs, and the system logs the action with a deviation flag. This design preserves research value: the interaction gains the flexibility only a human can provide, and that flexibility appears explicitly in the record rather than disappearing into it.
Figure~\ref{fig:execution-view} shows the Execution interface as it appears to a wizard during a live trial. The current step is highlighted in the protocol sidebar, the available actions for that step are surfaced as triggerable buttons, and the wizard has manual-control affordances for introducing unscripted actions that the system will flag as deviations in the trial log.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.95\textwidth]{assets/execution-view.png}
\caption{The HRIStudio Execution interface during a live trial, showing the current step, available actions, and manual deviation controls.}
\label{fig:execution-view}
\end{figure}
\section{Robot Integration}
A plugin file describes each robot platform, listing the actions it supports and specifying how each one maps to a command the robot understands. The execution engine reads this file at startup and uses it whenever it needs to dispatch a command: it looks up the action type, assembles the appropriate message, and sends it to the robot over a bridge process running on the local network. For the NAO6 platform, I developed a specialized ROS-based bridge called \texttt{nao6-hristudio-integration}~\cite{NaoIntegrationRepo} that translates HRIStudio commands into the NAOqi API calls required by the robot. The web server itself has no knowledge of any specific robot; all hardware-specific logic lives in the plugin file.
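The shape of such a plugin file can be sketched as follows. The field names and command formats are illustrative; the actual schema is defined in the plugin repository:
\begin{verbatim}
// Illustrative plugin entries: the same abstract actions map to
// different platform commands, and unsupported actions are declared
// explicitly rather than failing silently.
const nao6 = {
  platform: "nao6",
  actions: {
    say: { channel: "speech", template: { text: "{text}" } },
    raise_arm: { channel: "animation", template: { name: "raise" } },
  },
  unsupported: [] as string[],
};

const turtlebot = {
  platform: "turtlebot",
  actions: {
    say: { channel: "tts", template: { text: "{text}" } },
  },
  unsupported: ["raise_arm"], // declared, never silently dropped
};
\end{verbatim}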
The execution engine treats control-flow elements such as branches and conditionals, which function like constructs in a conventional program, the same way as robot actions. These elements appear as action groups in the experiment and are evaluated during the trial, so researchers can freely mix logical decisions and physical robot behaviors when designing an experiment without any special handling.
Figure~\ref{fig:plugin-architecture} illustrates this mapping using NAO6 and TurtleBot as an example. Actions a platform does not support (such as \texttt{raise\_arm} on TurtleBot) appear as explicitly unsupported in the plugin file rather than silently failing. Because all hardware-specific logic lives in the plugin file, the experiment itself does not change between platforms.
\begin{figure}[htbp]
\centering
\draw[arrow] (cfg.east) -- (tb.west);
\end{tikzpicture}
\caption{Abstract experiment actions translated to platform-specific robot commands through per-platform plugin files.}
\label{fig:plugin-architecture}
\end{figure}
\subsection{Containerized Development Environment}
To support development and testing for the NAO platform, I also developed \texttt{nao-workspace}, a containerized workspace~\cite{NaoWorkspaceRepo}. This was motivated by the technical constraints of Choregraphe and its related libraries, which only supported x86-64 systems running Ubuntu 22.04. The containerized structure was the only way I could run the proprietary NAO development tools on modern hardware. While I developed this stack primarily to enable technical testing and material preparation during the project, the resulting tooling may be useful to other HRI researchers facing similar platform constraints.
\section{Access Control}
I implemented access control using a role-based access control (RBAC) model with two layers. System-level roles govern what a user can do across the platform (administrator, researcher, wizard, observer), while study-level roles govern what a user can see and do within a specific study (owner, researcher, wizard, observer). The two layers are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration. Within a study, the four study-level roles define a clear separation of capabilities: those who own the study, those who design it, those who run it, and those who observe it. This enforces need-to-know access at the study level so that each team member sees or is able to modify only what their role requires. The capabilities and constraints for each role are described below:
\begin{description}
\item[Owner.] Full control over the study: can invite or remove members, configure the study settings, and access all data.
\item[Researcher.] Designs the experiment protocol and can edit study materials.
\item[Wizard.] Runs live trials through the Execution interface.
\item[Observer.] Watches trials and views study data in a read-only capacity.
\end{description}
The role definitions above determine who can view and change data during normal operations.
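A two-layer permission check can be sketched as follows; the role names match the model above, but the function and its logic are illustrative only:
\begin{verbatim}
// Illustrative two-layer check: the system-level role and the
// study-level role are consulted independently.
type SystemRole = "administrator" | "researcher" | "wizard" | "observer";
type StudyRole = "owner" | "researcher" | "wizard" | "observer";

const canEditDesign: Record<StudyRole, boolean> = {
  owner: true,
  researcher: true,
  wizard: false,
  observer: false,
};

function mayEditDesign(
  systemRole: SystemRole,
  memberships: Map<string, StudyRole>, // study id -> this user's role
  studyId: string,
): boolean {
  const studyRole = memberships.get(studyId);
  // A user can hold different study-level roles on different studies.
  return systemRole !== "observer" &&
         studyRole !== undefined &&
         canEditDesign[studyRole];
}
\end{verbatim}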
The following two problems required specific solutions during implementation.
\begin{description}
\item[Execution latency.] During a trial, the execution engine must respond quickly to wizard input, as a noticeable delay between the button press and the robot's action can disrupt the interaction. I addressed this by maintaining a persistent network connection to the robot bridge for the duration of each trial. The connection is established once at trial start and kept open, eliminating per-action setup overhead.
\item[Multi-source synchronization.] The Analysis interface requires aligning data streams captured at different sampling rates by different components: video, audio, action logs, and sensor data. The solution is a shared time reference: every data source records its timestamps relative to the same trial start time, $t_0$, so the Analysis interface can align all tracks without requiring manual calibration; a minimal sketch of this alignment follows the list.
\end{description}
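The alignment logic itself is simple. The following sketch, with sample data invented purely for illustration, shows how offsets recorded against a shared $t_0$ line up without per-source calibration:
\begin{verbatim}
// All tracks share one reference: offsets from trial start t0.
interface Sample { source: string; offsetMs: number }

function align(t0: number, samples: Sample[]): void {
  for (const s of samples) {
    // Converting every offset against the same t0 aligns the tracks
    // without per-source calibration.
    console.log(new Date(t0 + s.offsetMs).toISOString(), s.source);
  }
}

align(Date.parse("2026-04-01T15:00:00Z"), [
  { source: "video-start", offsetMs: 0 },
  { source: "action:say", offsetMs: 1250 },
  { source: "sensor:head-touch", offsetMs: 3400 },
]);
\end{verbatim}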
\section{Implementation Status}
HRIStudio is fully operational for controlled Wizard-of-Oz studies. The Design, Execution, and Analysis interfaces are complete and integrated with one another. The execution engine handles scripted and unscripted actions with full timestamped logging, and I validated robot communication on the NAO6 platform during development. A researcher can design an experiment, run a live trial with a wizard, and review the resulting logs and recordings without modification to the platform's core architecture or execution workflow.
Work remaining for future development includes broader validation of the plugin file approach on robot platforms beyond NAO6, as discussed further in Chapter~\ref{ch:conclusion}.
\section{Chapter Summary}
This chapter described how HRIStudio realizes the design principles from Chapter~\ref{ch:design} in practice. Experiments are persistent, reusable specifications that produce complete, comparable trial records. The execution engine is event-driven rather than timer-driven, keeping the wizard in control of pacing while logging every action automatically. Per-platform plugin files keep the execution engine hardware-agnostic. The role system enforces access control at the study level. The platform is fully operational for controlled WoZ studies today, demonstrated through the pilot validation study presented in Chapter~\ref{ch:evaluation}. The design principles are general; HRIStudio shows they are workable.
\chapter{Pilot Validation Study}
\label{ch:evaluation}
This chapter presents the pilot validation study used to evaluate whether HRIStudio improves accessibility and reproducibility in WoZ-based HRI research. It defines the research questions, study design, task, apparatus, procedure, and measurement instruments.
\section{Research Questions}
The validation study targets the two problems established in Chapter~\ref{ch:background}. The first is the \emph{Accessibility Problem}: existing tools require substantial programming expertise, which prevents domain experts from conducting independent HRI studies. The second is the \emph{Reproducibility Problem}: without structured logging and protocol enforcement, experiment execution varies across participants and wizards in ways that are difficult to detect or control after the fact.
These problems give rise to two research questions. The first is whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The second is whether HRIStudio produces more reliable execution of that interaction compared to standard practice.
I hypothesized that HRIStudio would improve both accessibility and reproducibility compared to Choregraphe: wizards using HRIStudio would more completely and correctly implement the written specification, and their designs would execute more reliably during the trial.
\section{Study Design}
I used what Bartneck et al.~\cite{Bartneck2024} call a \emph{between-subjects design}, in which each participant is assigned to only one condition. To ensure that programming experience was balanced across conditions, I stratified assignment by self-reported programming background: each wizard was first classified as having \emph{None}, \emph{Moderate}, or \emph{Extensive} programming experience, and then randomly assigned within that stratum to HRIStudio or Choregraphe. This produced a design in which each condition contained exactly one wizard at each experience level, reducing the risk that tool effects would be confused with differences in programming experience. Both groups received the same task, the same time allocation, and a similar training structure. Because each wizard used only one tool, the design also avoided carryover effects from prior exposure to the other condition.
\section{Participants}
\textbf{Wizards.} A primary claim of HRIStudio is that it lowers the technical barrier for domain experts who are not programmers; testing this claim with its intended user population was therefore a primary goal of participant recruitment. I recruited six Bucknell University faculty members drawn from across departments to serve as wizards, deliberately targeting both ends of the programming experience spectrum: those with substantial programming backgrounds as well as those who described themselves as non-programmers or having minimal coding experience. Drawing wizards from outside computer science allows the data to speak to whether that claim holds for the intended user population.
The key inclusion criterion for all wizards was no prior experience with either the NAO robot or Choregraphe software specifically. This controls for tool familiarity so that performance differences reflect the tools themselves rather than prior exposure. I recruited wizards through direct email, and participation was framed as a voluntary software evaluation unrelated to any professional obligations.
\textbf{Sample size rationale.} I recruited six wizard participants ($N = 6$), a sample size I judged appropriate for a pilot validation study whose goal is directional evidence and failure-mode identification rather than effect-size estimation for a broad population. This scale is consistent with pilot and feasibility studies in HRI, where small $N$ designs are common in early-stage tool validation~\cite{HoffmanZhao2021}. Findings should be interpreted as preliminary evidence and directional indicators rather than as conclusive substantiation of any claims.
\section{Task}
Both wizard groups received the same written task specification: the \emph{Interactive Storyteller} scenario. The specification described a robot that introduces an astronaut named Kai, narrates her discovery of a red rock on Mars, asks a recall question, and delivers a response according to the answer given. The full specification, including exact robot speech, required gestures, and branching logic, is reproduced in Appendix~\ref{app:blank_templates}. This scenario is representative of HRI tasks in which a robot conveys information to a human subject; one might, for example, measure whether a robot or human storyteller produces better recall in subjects.
This scenario was chosen because it requires several distinct capabilities: speech actions, gesture coordination, conditional branching, and a defined conclusion. In both conditions, wizards had to translate the same written protocol into an executable interaction script, including action ordering, branching logic, and timing decisions. In Choregraphe, that meant assembling and connecting behavior nodes in a finite state machine. In HRIStudio, it meant building a sequential action timeline with conditional branches. This makes the task a direct comparison of how each tool supports coding the robot behavior required by the same protocol.
\section{Robot Platform and Software Apparatus}
Both conditions used the same NAO humanoid robot (Figure~\ref{fig:nao6-photo}), a platform approximately 0.58 meters tall capable of speech synthesis, animated gestures, and head movement. Using the same hardware ensured that any differences in execution quality were attributable to the software, not the robot.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.45\textwidth]{images/nao6.jpg}
\caption{The NAO6 humanoid robot used in both conditions of the pilot study.}
\label{fig:nao6-photo}
\end{figure}
The control condition used Choregraphe \cite{Pot2009}, a proprietary visual programming tool developed by Aldebaran Robotics and the standard software for NAO programming. Choregraphe organizes behavior as a finite state machine: nodes represent states and edges represent transitions triggered by conditions or timers.
The experimental condition used HRIStudio, described in Chapter~\ref{ch:implementation}. HRIStudio organizes behavior as a sequential action timeline with support for conditional branches. Unlike Choregraphe, it abstracts robot-specific commands through plugin files, though for this study both tools controlled the same NAO platform.
Figure~\ref{fig:design-tool-compare} places the two design environments side by side. On the left, Choregraphe's behavior-box canvas (Figure~\ref{fig:choregraphe-ui}) lets the wizard wire nodes and transitions in a finite-state-machine layout. On the right, HRIStudio's experiment designer (Figure~\ref{fig:hristudio-designer}) presents the same protocol as a vertical action timeline with dedicated blocks for speech, gesture, and conditional branching.
\begin{figure}[htbp]
\centering
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/choregraphe.png}
\subcaption{Choregraphe: behavior-box canvas with nodes and transitions.}
\label{fig:choregraphe-ui}
\end{minipage}\hfill
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/experiment-designer.png}
\subcaption{HRIStudio: vertical action timeline with structured step and action blocks.}
\label{fig:hristudio-designer}
\end{minipage}
\caption{The two design environments compared. Each wizard used one of these tools to implement the Interactive Storyteller specification.}
\label{fig:design-tool-compare}
\end{figure}
\section{Procedure}
Each wizard completed a single 60-minute session structured in four phases.
\subsection{Phase 1: Training (15 minutes)}
I opened each session with a standardized tutorial tailored to the wizard's assigned tool. The tutorial covered how to create speech actions, specify gestures, define conditional branches, and save the completed design. Training was intentionally limited to 15 minutes: long enough for wizards to ask clarifying questions about the tool before the design challenge began, while still simulating a first encounter with a new tool without extensive onboarding. I answered clarification questions during this phase but did not offer hints about the design challenge.
\subsection{Phase 2: Design Challenge (30 minutes)}
The wizard received the specification and had 30 minutes to implement it using their assigned tool. Using a structured observer data sheet (found in Appendix~\ref{app:blank_templates}), I logged every instance in which I provided assistance to the wizard, categorizing each by type: \emph{tool-operation} (T), \emph{task clarification} (C), \emph{hardware or technical} (H), or \emph{general} (G). For each tool-operation intervention, I also recorded which rubric item it pertained to. If the wizard declared completion before the time limit, the remaining time was used to review and refine the design.
\subsection{Phase 3: Live Trial (10 minutes)}
After the design phase, the wizard ran their completed program to execute the designed interaction on the robot. I continued logging any researcher interventions during the trial using the same type categories, noting the relevant ERS rubric item for any tool-operation intervention.
\subsection{Phase 4: Debrief (5 minutes)}
Following the trial, the wizard completed the System Usability Scale survey (found in Appendix~\ref{app:blank_templates}). The DFS and ERS were scored during and immediately after the session using live observation and the Observer Data Sheet.
\section{Measures}
\label{sec:measures}
The study collected five measures, two primary and three supplementary, operationalized through five instruments, described in the subsections that follow.
\subsection{Design Fidelity Score}
I define the Design Fidelity Score (DFS) as a measure of how completely and correctly the wizard implemented the specification. I evaluated the exported project file against nine weighted criteria grouped into three categories: speech actions, gestures and actions, and control flow and logic. Each criterion is scored on whether the element is present, correct, and achieved independently.
The DFS rubric includes an \emph{Assisted} column. For each rubric item, I marked a T if I provided a tool-operation intervention specifically for that item during the design phase (for example, if I explained how to add a gesture node or how to wire a conditional branch). T marks are recorded and reported separately alongside the DFS score; they do not affect the Points total. This preserves the DFS as a clean measure of design fidelity while providing a parallel record of where tool-specific assistance was needed. General interventions (task clarification, hardware issues, or momentary forgetfulness) are not marked T, because those categories of difficulty are independent of the tool under evaluation.
DFS is motivated by a gap identified by Riek~\cite{Riek2012}, whose systematic review of 54 published WoZ studies found that only 11\% constrained wizard behavior and fewer than 6\% described wizard training procedures. Porfirio et al.~\cite{Porfirio2023} similarly argued that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. The DFS applies these recommendations as a weighted rubric scored against the exported project file. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses the question: did the tool allow a wizard to independently produce a correct design?
\subsection{Execution Reliability Score}
I define the Execution Reliability Score (ERS) as a measure of whether the designed interaction executed as intended during the live trial. I scored the ERS live and immediately after the session, using the Observer Data Sheet and the wizard's exported project file. Evaluation criteria included whether the robot delivered the correct speech at each step, whether gestures executed and synchronized with speech, whether the conditional branch was present in the design and executed during the trial, and whether any errors, disconnections, or hangs occurred.
The ERS rubric applies the same \emph{Assisted} modifier as the DFS, extended to the trial phase. Any tool-operation intervention I provided during the trial (for example, explaining to the wizard how to launch or advance their program) caps the affected ERS item at half points. This is scored separately from design-phase interventions: a wizard who needed help only during design can still achieve a full ERS score if the trial runs without assistance, and vice versa. The rubric records whether the trial reached its conclusion step. I additionally note whether any branch resolved through programmed conditional logic or through manual intervention by the wizard during execution.
This measure responds directly to Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error, making it nearly impossible to determine whether execution matched design intent~\cite{OConnor2024, OConnor2025}. The complete rubric is reproduced in Appendix~\ref{app:blank_templates}. This measure addresses the question: did the design translate reliably into execution without researcher support?
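The mechanics of the two scores can be summarized in a short sketch. The weights and values below are illustrative of the scoring rules described above, not the exact rubric used in the study:
\begin{verbatim}
// Illustrative scoring mechanics for DFS and ERS.
type RubricItem = {
  weight: number;    // points available for the criterion
  earned: number;    // points awarded (0 .. weight)
  assisted: boolean; // a T-type intervention touched this item
};

const total = (xs: number[]) => xs.reduce((s, x) => s + x, 0);

// DFS: T marks are reported separately and never change the points.
const dfs = (items: RubricItem[]) =>
  100 * total(items.map((i) => i.earned)) /
        total(items.map((i) => i.weight));

// ERS: a trial-phase tool intervention caps the item at half points.
const ers = (items: RubricItem[]) =>
  100 * total(items.map((i) =>
          Math.min(i.earned, i.assisted ? i.weight / 2 : i.weight))) /
        total(items.map((i) => i.weight));
\end{verbatim}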
\subsection{System Usability Scale}
The System Usability Scale (SUS) is a validated 10-item questionnaire measuring perceived usability, created by Brooke~\cite{Brooke1996}. Wizards completed the SUS after the debrief phase. Scores range from 0 to 100, with higher scores indicating better perceived usability. The full questionnaire is reproduced in Appendix~\ref{app:blank_templates}.
\subsection{Intervention Log and Session Timing}
During each session, I maintained a structured intervention log on the observer data sheet, recording the timestamp, type code, affected rubric item number, and a brief description for every instance in which I assisted the wizard. The four intervention type codes are:
\begin{description}
\item[T (tool-operation).] I explained how to operate a specific feature of the assigned software tool.
\item[C (task clarification).] I clarified the written specification or an aspect of the task design.
\item[H (hardware or technical).] I addressed a robot connection issue or other technical problem outside the wizard's control.
\item[G (general).] Brief assistance not attributable to the tool or the task, such as momentary forgetfulness.
\end{description}
Only T-type interventions affect rubric scoring; the others are recorded to provide context for interpreting session flow and wizard experience. I also recorded the actual duration of each session phase and the time at which the wizard completed or abandoned the design, providing supplementary evidence about tool accessibility beyond the DFS score itself.
\section{Measurement Instruments}
The five measures are designed to work together. The DFS and ERS address separate phases of the session: DFS captures what was designed, and ERS captures whether that design translated faithfully into execution. Taken together, they make it possible to distinguish a wizard who implemented the specification correctly but whose design failed during the trial from one whose design was incomplete but executed without researcher assistance. The SUS grounds both scores in the wizard's subjective experience of the tool. The intervention log and session timing are supplementary: they do not directly answer the research questions but provide context for interpreting the primary scores, particularly for understanding whether help requests concerned the tool itself or the task.
Table~\ref{tbl:measurement_instruments} summarizes the five instruments, when they were collected, and which research question each addresses.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|p{3.0cm}|p{4.4cm}|p{2.4cm}|p{2.8cm}|}
\hline
\textbf{Instrument} & \textbf{What it captures} & \textbf{When collected} & \textbf{Research question} \\
\hline
Design Fidelity Score (DFS) & Completeness and correctness of the wizard's implementation; caps items where tool-operation assistance was given & Post-session file review & Accessibility \\
\hline
Execution Reliability Score (ERS) & Whether the interaction executed as designed during the trial; caps items where trial-phase tool assistance occurred & Live and post-trial (ODS) & Reproducibility \\
\hline
System Usability Scale (SUS) & Wizard's perceived usability of the assigned tool & Debrief phase & User experience \\
\hline
Intervention Log & Timestamped record of all researcher assistance by type (T/C/H/G) and affected rubric item & Throughout session & Supplementary \\
\hline
Session Timing & Actual duration of each phase; time to design completion & Throughout session & Supplementary \\
\hline
\end{tabular}
\caption{Measurement instruments used in the pilot validation study.}
\label{tbl:measurement_instruments}
\end{table}
\section{Chapter Summary}
This chapter described the structure of a pilot between-subjects study I designed to test whether the design principles formalized in Chapters~\ref{ch:design} and~\ref{ch:implementation} produce measurably different outcomes from existing practice. Six wizard participants ($N = 6$), drawn from across departments and spanning the programming experience spectrum, each designed and ran the Interactive Storyteller task on a NAO robot using either HRIStudio or Choregraphe. Each 60-minute session was structured in four phases: a 15-minute standardized tutorial, a 30-minute design challenge, a 10-minute live trial, and a 5-minute debrief. I measured design fidelity (DFS) and execution reliability (ERS) against the written specification, applying a per-item scoring modifier that caps any rubric criterion for which tool-operation assistance was given. I also collected perceived usability via the SUS, a structured intervention log categorizing all researcher assistance by type, and session phase timings. Chapter~\ref{ch:results} presents the results.
\chapter{Results}
\label{ch:results}
This chapter presents the results of the pilot validation study described in Chapter~\ref{ch:evaluation}. Because this is a small pilot, I report descriptive statistics and qualitative observations rather than inferential tests. The goal is directional evidence: the chapter reports whether patterns in the data consistently favor HRIStudio across the primary and supplementary measures.
\section{Participant Overview}
Table~\ref{tbl:sessions} summarizes the participants and their assigned conditions. Wizards are identified by code to protect confidentiality. All six participants were Bucknell University professors drawn from Computer Science, Chemical Engineering, Digital Humanities, and Logic and Philosophy of Science. Demographic information (programming background) was collected during recruitment.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|l|}
\hline
\textbf{ID} & \textbf{Condition} & \textbf{Background} & \makecell[l]{\textbf{Programming}\\\textbf{Experience}} \\
\hline
W-01 & Choregraphe & Digital Humanities & None \\
\hline
W-02 & HRIStudio & Logic and Philosophy of Science & Moderate \\
\hline
W-03 & Choregraphe & Computer Science & Extensive \\
\hline
W-04 & Choregraphe & Chemical Engineering & Moderate \\
\hline
W-05 & HRIStudio & Chemical Engineering & None \\
\hline
W-06 & HRIStudio & Computer Science & Extensive \\
\hline
\end{tabular}
\caption{Summary of wizard participants and assigned conditions.}
\label{tbl:sessions}
\end{table}
Table~\ref{tbl:primary-outcomes} presents the primary outcome scores, which are discussed next.
\section{Primary Measures}
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|r|r|r|}
\hline
\textbf{ID} & \textbf{Condition} & \textbf{DFS} & \textbf{ERS} & \textbf{SUS} \\
\hline
W-01 & Choregraphe & 42.5 & 65 & 60 \\
\hline
W-02 & HRIStudio & 100 & 95 & 90 \\
\hline
W-03 & Choregraphe & 65 & 60 & 75 \\
\hline
W-04 & Choregraphe & 62.5 & 75 & 42.5 \\
\hline
W-05 & HRIStudio & 100 & 95 & 70 \\
\hline
W-06 & HRIStudio & 100 & 100 & 70 \\
\hline
\end{tabular}
\caption{Primary outcome scores by wizard and condition.}
\label{tbl:primary-outcomes}
\end{table}
\subsection{Design Fidelity Score (DFS)}
The Design Fidelity Score measures how completely and correctly each wizard implemented the written specification of their assigned experiment. Scores range from 0 to 100, with full points awarded only when a component---a rubric criterion representing a required speech action, gesture, or control-flow element---is both present and correct. (For a full description of rubric categories, see Section~\ref{sec:measures}.)
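The assistance cap described in Section~\ref{sec:measures} can be stated compactly. As a sketch, write $w_i$ for the weight of rubric item $i$ (with $\sum_i w_i = 100$), $s_i \in [0, w_i]$ for the raw item score, and $a_i = 1$ when tool-operation assistance was logged against item $i$ (and $a_i = 0$ otherwise). Assuming the half-weight cap evident in the 5/10 item caps reported in the session narratives below generalizes across item weights, the capped total is
\begin{equation*}
\mathrm{DFS} \;=\; \sum_{i} \min\!\left(s_i,\; \Bigl(1 - \frac{a_i}{2}\Bigr) w_i\right),
\end{equation*}
so an assisted item can earn at most half its weight regardless of the quality of the final artifact. The same modifier applies to the ERS, with trial-phase assistance in place of design-phase assistance.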
Across the six participants, DFS scores divided sharply by study condition: all three HRIStudio wizards achieved a perfect score of 100, while the three Choregraphe wizards scored 42.5, 65, and 62.5. The following paragraphs describe the key findings from each session.
W-01 received a DFS of 42.5. Analysis of the exported project file found all four interaction steps present and correctly sequenced; the conditional branch was wired and functional. Speech fidelity was partial: W-01 deviated from the specification by substituting a different rock color in the narrative and comprehension question, departing from the ``red'' specified in the paper protocol. Items 1 and 4 (introduction and branch responses) received full points; items 2 and 3 received half points due to the content mismatch. The gesture category scored zero. Both the introduction wave and the narrative gesture were implemented via the tool's \emph{Animated Say} function, which generates motion non-deterministically from a library rather than placing a specific gesture node; under the rubric's clarifying rule, this does not satisfy the Correct criterion. Item 7 (nod or head shake) was not explicitly programmed. The control-flow category was split: item 9 (correct step sequence) received full points; item 8 (conditional branch) received half points because the branch was resolved by manually deleting and re-routing connections during the trial rather than through a dedicated conditional node wired at design time.
W-02 received a DFS of 100. The exported project file confirmed all four interaction steps present and correctly sequenced, speech content matching the written specification verbatim, gestures placed using dedicated action nodes, and the conditional branch wired through HRIStudio's branch component. No tool-operation interventions were logged during the design phase. W-02 completed the design in 24 minutes, within the 30-minute allocation.
W-03 received a DFS of 65. W-03 approached the design as a block programming exercise, constructing extra nodes and attempting a concurrent execution structure not called for by the specification. One C-type clarification (see Section~\ref{sec:measures}) was required: I noted that control-flow logic relying on onboard speech recognition was outside the scope of this study, since Wizard-of-Oz execution routes all speech decisions through the wizard rather than the robot. Speech fidelity was partial: two of the three scorable speech items were present, and not all of those were implemented correctly. No conditional branch was implemented in the final design, resulting in zero points for that category. The design phase extended to 37 minutes, seven minutes over the 30-minute allocation.
W-04 received a DFS of 62.5. The design phase ran 35 minutes without reaching completion, making W-04 the only wizard in the study who did not finish the design before the cutoff. Four T-type tool-operation interventions and one C-type clarification were logged. During training, W-04 asked about running two behavior blocks simultaneously and how to edit a block, reflecting early engagement with Choregraphe's concurrent flow model. During the design phase, W-04 asked about interpretation of punctuation in speech content, generating three simultaneous T-type marks across items 1--3. W-04 also independently attempted to use Choregraphe's choice block for conditional branching; the block did not execute correctly. The researcher re-explained the WoZ execution model and how to branch by manual step selection. Speech items 1, 2, and 4 received full points; item 3 (the comprehension question) was absent from the final design. Gesture items 5 and 6 received full points; item 7 (nod or head shake) was present but not marked correct (5/10). The conditional branch received zero points; no functional branch was wired at export. Step sequencing received partial credit (7.5/15).
W-05 received a DFS of 100. The design phase completed in 18 minutes, the shortest design phase in the study. Training concluded in 6 minutes with no questions asked; the wizard described the platform as ``pretty straightforward.'' Two T-type interventions and three C-type clarifications were logged during the design phase. The T-type interventions concerned editing properties in the right pane of the experiment designer and understanding that the branch block requires predefined steps; both were addressed without affecting the final design. The C-type clarifications concerned what ``steps'' represent as structural containers, the relationship between the written specification's speech and platform speech actions, and a related conceptual question. The wizard added a creative narrative gesture not specified in the protocol (a crouch animation); this was present and correct under the rubric. The DFS assessment noted that the wizard's design mapped well from the specification.
W-06 received a DFS of 100. Two T-type interventions were logged during the design phase, both pertaining to item 6 (narrative gesture): at 15:21, W-06 attempted to use parallel execution for a gesture action and was unable to edit the action node; at 15:24, W-06 encountered difficulty resetting the robot's posture and was directed to recommended posture blocks. In both cases, W-06 resolved the issue independently after the initial prompt. W-06's programming background led to a more elaborate design than the specification required, including extra posture-reset actions that were ultimately redundant since the robot was already in the correct starting position; these additions did not affect scoring since all required actions were present and correct in the exported project file. The conditional branch was wired correctly, and all speech and gesture items matched the specification. W-06 completed the design in 21 minutes, within the 30-minute allocation.
Across the three HRIStudio sessions, DFS scores were 100, 100, and 100 (mean 100). Across the three Choregraphe sessions, DFS scores were 42.5, 65, and 62.5 (mean 56.7).
\subsection{Execution Reliability Score (ERS)}
The Execution Reliability Score measures how faithfully the designed interaction executed during the live trial.
Execution results followed the same pattern as design fidelity. HRIStudio trials produced ERS scores of 95, 95, and 100, with researcher involvement during trials limited to one platform-side misfire correction and one skipped non-functional action. Choregraphe trials averaged 66.7, with branching failures in two of three sessions and a speech content deviation in the third (see Section~\ref{sec:results-qualitative}). The per-session results are as follows.
W-01 received an ERS of 65. The trial ran for approximately five minutes. In this session, I served as the test subject during the live trial. Through that experience I confirmed that a separately recruited participant is not required: the DFS and ERS both evaluate the wizard's implementation and execution fidelity rather than a subject's behavioral responses. Subsequent sessions therefore ran the trial phase with the wizard executing the designed interaction directly, without a separate test subject. The introduction speech and gesture executed correctly. The narrative speech executed but deviated from the specification due to the modified rock color described above. The comprehension question was delivered, a branch response was triggered, and the interaction proceeded to its conclusion. Gesture synchronization was partial: a pause gesture executed, but coordination between speech and movement was inconsistent at several points. No system disconnections or crashes occurred.
W-02 received an ERS of 95. The trial ran for approximately five minutes. Introduction speech and gesture, narrative speech, comprehension question, and branch response content all executed correctly and matched the specification. During the trial, the interaction briefly advanced to an incorrect step when a branch transition misfired; this was immediately corrected by manually selecting the correct step in the execution interface. This incident was logged as an H-type intervention (platform behavior, not wizard error). The branching item scored 5 out of 10 on its own merits: the branch was present in the design and execution reached the branch step, but the initial misfire meant the transition was not fully correct before manual correction. No other deviations or system failures occurred.
W-03 received an ERS of 60. The trial ran for approximately five minutes. Speech execution was partial: two of three items were present but not all delivered correctly. Gesture and speech synchronization was poor throughout the interaction; motion cues were present but did not coordinate reliably with corresponding speech actions. The conditional branch, absent from W-03's design, was not executed during the trial; the interaction proceeded without a branch resolution step. No system disconnections or crashes occurred.
W-04 received an ERS of 75. The trial ran for approximately four minutes. Introduction and narrative speech executed correctly. The comprehension question, absent from the design, was not delivered; the interaction proceeded directly to the branch step. A T-type trial intervention was required to remind W-04 how to trigger the branch; the yes-branch response was delivered following that prompt, capping item 4 at 5/10 (T-assisted). Gesture execution was strong: introduction wave, narrative gesture, and nod or head shake all executed correctly. Speech and gesture synchronization scored full points. The pause before the comprehension question scored zero, as no question was delivered. No system errors occurred.
W-05 received an ERS of 95. The trial ran for approximately four minutes and reached step 4. The researcher's answer was ``Red'' (the correct answer), and branch A fired via programmed conditional logic. All speech items executed correctly. Introduction gesture, nod or head shake, speech synchronization, and the pre-question pause all scored full points. One trial intervention pair was logged: the wizard briefly forgot they were in live execution (G-type), then was reminded and manually skipped a non-functional crouch action (T-type, capping item 6 at 5/10). The crouch animation exists in HRIStudio's action library but does not execute on the NAO6 robot side; skipping it was the correct recovery. All other items scored full points and no system errors occurred. The overall ERS assessment recorded that the interaction executed as designed.
W-06 received a perfect ERS of 100. The trial ran for approximately three minutes. No interventions of any type were logged during the trial phase. All speech items executed correctly and matched the specification. Gestures, speech synchronization, and the pre-question pause all scored full points. The conditional branch was present in the design and fired correctly during execution via programmed conditional logic. The interaction reached its conclusion without errors, disconnections, or researcher involvement.
Across the three HRIStudio sessions, ERS scores were 95, 95, and 100 (mean 96.7). Across the three Choregraphe sessions, ERS scores were 65, 60, and 75 (mean 66.7).
\subsection{System Usability Scale}
The System Usability Scale (SUS) uses a 0--100 scale with a conventional average of 68; scores above 68 indicate above-average perceived usability~\cite{Brooke1996}. W-01 rated Choregraphe at 60, suggesting that W-01, a Digital Humanities faculty member with no programming background, found Choregraphe marginal in usability; this outcome is consistent with the high volume of interface-level help requests observed during the design phase.
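For reference, each per-wizard SUS score reported in this subsection is computed from the ten questionnaire items in the standard way~\cite{Brooke1996}: with $r_i \in \{1, \dots, 5\}$ the response to item $i$, odd-numbered (positively worded) items contribute $r_i - 1$, even-numbered (negatively worded) items contribute $5 - r_i$, and the sum is scaled to the 0--100 range:
\begin{equation*}
\mathrm{SUS} \;=\; 2.5 \left( \sum_{i \in \{1,3,5,7,9\}} (r_i - 1) \;+\; \sum_{i \in \{2,4,6,8,10\}} (5 - r_i) \right).
\end{equation*}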
W-02 rated HRIStudio with a SUS score of 90, the highest score in the study. W-02, a Logic and Philosophy of Science faculty member with moderate programming experience, completed the design phase without tool-operation assistance and rated the platform favorably across usability dimensions.
W-03 rated Choregraphe with a SUS score of 75. W-03, a programmer with prior experience in block programming environments, perceived the tool positively in general terms, framing it as a capable system for its category. Post-session comments indicated that W-03 found the tool harder to apply to this specific task than its general capability suggested, particularly given the WoZ framing's constraint against onboard control-flow logic. W-03 had no prior knowledge of HRIStudio, providing no comparative baseline for their usability rating.
W-04 rated Choregraphe with a SUS score of 42.5, the lowest score in the study. Researcher notes recorded that W-04 attempted the task with evident self-driven engagement but that the platform appeared to get in the way. The gap between effort and outcome in W-04's session, a motivated wizard who exceeded the time allocation without completing the design and required four T-type interventions, is directly reflected in this rating.
W-05 rated HRIStudio with a SUS score of 70. Post-session comments recorded no issues. W-05, a Chemical Engineering faculty member with no programming background, completed the design well within the allocation and ran the trial to its conclusion without tool-operation difficulty during execution.
W-06 rated HRIStudio with a SUS score of 70. W-06, a Computer Science faculty member with extensive programming experience, completed the design within the allocation and ran a perfect trial without researcher intervention. The score matches W-05's rating exactly; both wizards found the platform above-average in usability despite approaching the task from very different programming backgrounds.
HRIStudio study condition SUS scores were 90, 70, and 70 (mean 76.7). Choregraphe study condition SUS scores were 60, 75, and 42.5 (mean 59.2).
Figure~\ref{fig:results-chart} summarizes the three primary measures side-by-side. In each group, the left bar represents the Choregraphe mean and the right bar represents the HRIStudio mean. HRIStudio exceeds Choregraphe on every measure, with the largest gap on DFS (43.3 points) and the smallest on SUS (17.5 points).
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
% Axes
\draw[thick] (0,0) -- (0,6.3);
\draw[thick] (0,0) -- (11.2,0);
% Y-axis ticks and labels (0--100, with 1 unit = 0.06 cm)
\foreach \tick/\val in {0/0, 1.2/20, 2.4/40, 3.6/60, 4.8/80, 6.0/100} {
\draw (-0.08, \tick) -- (0, \tick);
\node[left, font=\footnotesize] at (-0.05, \tick) {\val};
}
\node[rotate=90, font=\small] at (-1.05, 3.0) {Mean Score (0--100)};
% Horizontal gridlines
\foreach \tick in {1.2, 2.4, 3.6, 4.8, 6.0} {
\draw[gray!25, thin] (0.02, \tick) -- (11.2, \tick);
}
% DFS group
\fill[gray!40, draw=black] (1.0, 0) rectangle (2.3, 3.402);
\fill[gray!75, draw=black] (2.4, 0) rectangle (3.7, 6.000);
\node[font=\footnotesize] at (1.65, 3.60) {56.7};
\node[font=\footnotesize] at (3.05, 6.20) {100};
\node[font=\small] at (2.35, -0.38) {DFS};
% ERS group
\fill[gray!40, draw=black] (4.5, 0) rectangle (5.8, 4.002);
\fill[gray!75, draw=black] (5.9, 0) rectangle (7.2, 5.802);
\node[font=\footnotesize] at (5.15, 4.20) {66.7};
\node[font=\footnotesize] at (6.55, 6.00) {96.7};
\node[font=\small] at (5.85, -0.38) {ERS};
% SUS group
\fill[gray!40, draw=black] (8.0, 0) rectangle (9.3, 3.552);
\fill[gray!75, draw=black] (9.4, 0) rectangle (10.7, 4.602);
\node[font=\footnotesize] at (8.65, 3.75) {59.2};
\node[font=\footnotesize] at (10.05, 4.80) {76.7};
\node[font=\small] at (9.35, -0.38) {SUS};
% Legend
\fill[gray!40, draw=black] (2.6, -1.25) rectangle (3.0, -1.00);
\node[anchor=west, font=\footnotesize] at (3.1, -1.125) {Choregraphe};
\fill[gray!75, draw=black] (7.0, -1.25) rectangle (7.4, -1.00);
\node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio};
\end{tikzpicture}
\caption{Mean scores by condition across the three primary outcome measures.}
\label{fig:results-chart}
\end{figure}
\section{Supplementary Measures}
\subsection{Session Timing}
W-01's design phase extended to 35 minutes, five minutes over the 30-minute allocation, compressing the trial and debrief to 5 minutes each. Despite this, W-01 declared the design complete rather than abandoning it, and the robot executed a recognizable version of the specification during the trial.
W-02's training phase concluded in 7 minutes, roughly half the standard 15-minute allocation. This reflects HRIStudio's more intuitive onboarding rather than simply W-02's technical background: the platform's guided workflow and timeline-based model required less explanation before the wizard was ready to begin the design phase. W-02's design phase then concluded in 24 minutes, within the allocation, and the trial ran for approximately five minutes.
W-03's design phase extended to 37 minutes, the longest design phase in the study, despite W-03's programming background. The overrun reflects not conventional interface friction but time spent constructing and then revising an over-engineered design; sessions from W-02 onward were to enforce the 30-minute transition, so W-03's overrun is recorded as a procedural exception in the observer log.
W-04's design phase ran 35 minutes without completion, the only session in which the wizard did not finish before the cutoff. Training took 17 minutes, the longest training phase in the study; W-04 entered the design phase with questions about concurrent block execution that presaged later difficulties with branching.
W-05's design phase completed in 18 minutes, the shortest in the study. The overall session lasted 32 minutes, also the shortest. Training took 6 minutes with no questions asked. The contrast between W-04 and W-05 is striking: both came from Chemical Engineering and neither had a robotics background, yet the difference in assigned tool produced a 17-minute gap in design completion time and a qualitatively different session experience.
W-06's training phase concluded in 8 minutes and the design phase completed in 21 minutes, both within their allocations. The overall session lasted 37 minutes. The trial ran for approximately three minutes, the shortest trial phase in the study, reflecting a clean execution without errors or researcher interventions.
Across all six sessions, Choregraphe design phases averaged approximately 35.7 minutes; W-01 and W-03 exceeded the 30-minute target but completed their designs before the session time limit, while W-04 was the only wizard cut off by the limit without finishing. HRIStudio design phases averaged 21 minutes across three sessions, all within the allocation. Training phases similarly diverged: Choregraphe training averaged approximately 14.7 minutes, while HRIStudio training averaged 7 minutes.
Figure~\ref{fig:timing-chart} compares the per-condition means for training, design, and total session duration. The gap is concentrated in the design phase and carries through to the total session length; training duration also diverges, with Choregraphe wizards requiring roughly twice as long to reach readiness.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
% Axes (1 minute = 0.1 cm, so 60 min = 6 cm)
\draw[thick] (0,0) -- (0,6.3);
\draw[thick] (0,0) -- (11.2,0);
% Y-axis ticks and labels (0--60 minutes)
\foreach \tick/\val in {0/0, 1/10, 2/20, 3/30, 4/40, 5/50, 6/60} {
\draw (-0.08, \tick) -- (0, \tick);
\node[left, font=\footnotesize] at (-0.05, \tick) {\val};
}
\node[rotate=90, font=\small] at (-1.05, 3.0) {Mean Duration (minutes)};
% Horizontal gridlines
\foreach \tick in {1,2,3,4,5,6} {
\draw[gray!25, thin] (0.02, \tick) -- (11.2, \tick);
}
% Training group — Choregraphe 14.7, HRIStudio 7.0
\fill[gray!40, draw=black] (1.0, 0) rectangle (2.3, 1.47);
\fill[gray!75, draw=black] (2.4, 0) rectangle (3.7, 0.70);
\node[font=\footnotesize] at (1.65, 1.67) {14.7};
\node[font=\footnotesize] at (3.05, 0.90) {7.0};
\node[font=\small] at (2.35, -0.38) {Training};
% Design group — Choregraphe 35.7, HRIStudio 21.0
\fill[gray!40, draw=black] (4.5, 0) rectangle (5.8, 3.57);
\fill[gray!75, draw=black] (5.9, 0) rectangle (7.2, 2.10);
\node[font=\footnotesize] at (5.15, 3.77) {35.7};
\node[font=\footnotesize] at (6.55, 2.30) {21.0};
\node[font=\small] at (5.85, -0.38) {Design};
% Total group — Choregraphe 59.7, HRIStudio 36.7
\fill[gray!40, draw=black] (8.0, 0) rectangle (9.3, 5.97);
\fill[gray!75, draw=black] (9.4, 0) rectangle (10.7, 3.67);
\node[font=\footnotesize] at (8.65, 6.17) {59.7};
\node[font=\footnotesize] at (10.05, 3.87) {36.7};
\node[font=\small] at (9.35, -0.38) {Total Session};
% Legend
\fill[gray!40, draw=black] (2.6, -1.25) rectangle (3.0, -1.00);
\node[anchor=west, font=\footnotesize] at (3.1, -1.125) {Choregraphe};
\fill[gray!75, draw=black] (7.0, -1.25) rectangle (7.4, -1.00);
\node[anchor=west, font=\footnotesize] at (7.5, -1.125) {HRIStudio};
\end{tikzpicture}
\caption{Mean phase durations by condition.}
\label{fig:timing-chart}
\end{figure}
\subsection{Intervention Log}
W-01 generated a high volume of help requests during the design phase, primarily concerning Choregraphe's interface rather than the specification itself. The wizard demonstrated understanding of the task but encountered repeated friction with the tool's connection model, behavior box configuration, and branch routing. This pattern, understanding the goal but struggling with the mechanism, is characteristic of the accessibility problem described in Chapter~\ref{ch:background}.
W-02 generated minimal interventions. No T-type tool-operation assistance was required during the design phase; the wizard navigated HRIStudio's interface without guidance. One H-type intervention was logged during the trial phase, corresponding to the branch step misfire described in the ERS section above.
W-03 generated one C-type intervention during the design phase: a clarification that control-flow logic dependent on onboard speech recognition was outside the study's scope. No T-type interventions were required; W-03 navigated Choregraphe independently throughout the design phase. The absence of T-type interventions for W-03, compared to W-01's high T-type volume, suggests that programming background moderates the interface accessibility problem in Choregraphe: the tool does not block programmers the way it blocked a non-programmer, though it still produced a lower DFS than HRIStudio.
W-04 generated the highest T-type count in the Choregraphe study condition: five total design-phase interventions (4 T-type, 1 C-type), plus one T-type intervention during the trial. The design-phase T marks covered speech content punctuation ($\times$3, items 1--3) and the failed choice block attempt (item 8). The pattern echoes W-01's volume of tool-level friction, concentrated in a wizard with moderate rather than no programming experience.
W-05 generated five design-phase interventions (2 T-type, 3 C-type) and two trial interventions (1 T-type, 1 G-type). The design-phase T marks concerned interface orientation (right-pane editing, branch block configuration); the C-type clarifications concerned conceptual mappings between the written specification and HRIStudio's structural model. Importantly, none of the clarifications blocked design completion, and the final DFS was unaffected. The C-type pattern for W-05 reflects a different kind of engagement from Choregraphe's T-type pattern: questions about what the tool means rather than how to operate it.
W-06 generated two T-type interventions during the design phase, both pertaining to item 6 (narrative gesture): one for an attempted use of parallel action execution, and one for difficulty resetting the robot's posture, for which specific recommended blocks were suggested. W-06 resolved both issues independently after the initial prompts. No interventions of any type were logged during the trial phase, making W-06 the only wizard in the study to complete the trial with zero interventions.
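For reference, every count above derives from log entries sharing the same few fields. A minimal sketch of one entry follows; the type and field names are hypothetical, not drawn from the study materials.
\begin{verbatim}
// Illustrative shape of one intervention-log entry; field names
// are hypothetical, not taken from the study materials.
interface InterventionEntry {
  timestamp: string;            // e.g., "15:21"
  type: "T" | "C" | "H" | "G";  // assistance category (T/C/H/G)
  phase: "design" | "trial";
  rubricItem?: number;          // item whose score is capped, if any
  note: string;                 // free-text description
}
\end{verbatim}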
\section{Qualitative Findings}
\label{sec:results-qualitative}
\subsection{Observed Specification Deviation}
A notable qualitative finding from W-01's session was an unprompted deviation from the written specification: the wizard substituted a different rock color in the robot's speech and comprehension question, departing from the ``red'' specified in the paper protocol. This was not a tool failure; the wizard made a deliberate creative choice that the tool did not prevent or flag. The deviation went undetected until it surfaced during the live trial. This incident illustrates the reproducibility problem concretely: without automated protocol enforcement, wizard behavior can drift from the specification in ways that are invisible until execution, affecting the validity of the resulting interaction data.
No specification deviations from the written protocol were observed in W-02, W-04, W-05, or W-06. W-03 introduced extra nodes beyond the specification's scope, which was addressed by a C-type clarification during design. W-05 added a creative gesture not required by the specification (crouch), which was not a deviation from the protocol's content but an elaboration of the gesture category; it scored within the rubric and was noted for completeness. The speech substitution incident in W-01 remains the only case of content drift from the written specification, and it occurred in the Choregraphe study condition.
\subsection{Wizard Experience}
W-01 expressed that the training was comprehensible and that the underlying logic of the task was clear. The primary source of frustration was Choregraphe's interface for handling conditional branches and managing the timing of parallel behaviors. Post-session comments suggested that the wizard would not use Choregraphe independently for future HRI work without technical support.
W-02 engaged with HRIStudio's timeline-based interface without requiring tool-operation guidance. The session proceeded efficiently, and W-02's Logic and Philosophy of Science background, combined with moderate programming experience, appeared to support both the technical implementation and the contextual understanding of the interaction scenario. No notable sources of friction were observed during design or trial phases.
W-03 approached the task as a programming challenge, applying Choregraphe's full feature set beyond what the specification required. When the WoZ framing was clarified (specifically that branching should reflect wizard decisions rather than onboard robot logic), W-03 revised the design but the over-engineered structure introduced earlier persisted in the final export and was reflected in the DFS score. W-03 described Choregraphe as a powerful block programming environment, but noted that applying it to this task was harder than its general capability implied, a characterization consistent with the tool-task mismatch the study is designed to surface.
W-04 approached the session with clear engagement and self-driven exploration: independently attempting Choregraphe features (concurrent blocks, choice node) that went beyond what prior instructions had covered. The researcher noted ``Great attempt. Self-driven to explore.'' The SUS score of 42.5 reflects a session where ambition consistently exceeded what the tool's interface could support without additional guidance. W-04's post-session comment that quality was attempted but the platform got in the way is arguably the most direct characterization of the accessibility problem in the dataset.
W-05 presented the clearest demonstration of HRIStudio's accessibility case. With no programming background, W-05 trained in 6 minutes, asked no questions, completed the design in 18 minutes with a creative addition, and ran the trial to completion. The researcher's session notes observed: ``Overall good session. Learning: different backgrounds determine tool curiosity and drive to self-explore.'' W-05's willingness to add a crouch gesture beyond the specification, and their straightforward navigation of the platform without tool-operation confusion, suggests that HRIStudio's design model successfully supports exploratory use by non-programmers without producing the friction pattern observed in the Choregraphe study condition.
W-06 approached the design with a programmer's instinct for thoroughness, initially exploring parallel execution structures for gesture actions and adding posture-reset steps beyond what the specification called for. The two T-type design-phase interventions reflected this exploratory behavior rather than confusion about the task. The extra posture-reset actions in the final design were redundant in practice since the robot was already in the correct starting position, but they did not interfere with the required items and the design achieved a perfect DFS. W-06's trial ran entirely without researcher intervention, producing the only perfect ERS in the study. The session illustrates a different accessibility profile from W-05: where W-05 encountered no interface friction at all, W-06's programming background produced brief exploratory detours that the platform absorbed without compromising the final design or execution.
\section{Chapter Summary}
Across all six sessions, the evidence consistently favored HRIStudio on every primary and supplementary measure. Every HRIStudio wizard produced a perfect design, with no tool-operation assistance that altered the final artifact, while all three Choregraphe wizards scored below perfect and the only wizard who did not finish before the session cutoff was in the Choregraphe study condition. On execution consistency, every HRIStudio trial reached its conclusion, with researcher involvement limited to correcting one platform-side branch misfire (W-02) and skipping one non-functional action (W-05); Choregraphe produced branching failures in two of three sessions and a content deviation in the third (see Section~\ref{sec:results-qualitative}). Perceived usability followed the same split, with all HRIStudio ratings above the SUS average and two of the three Choregraphe ratings below it. Taken together, these results suggest that HRIStudio's design principles produce measurable gains in both accessibility and execution consistency compared to standard practice. Chapter~\ref{ch:discussion} interprets these findings in the context of the research questions.
+50 -3
View File
@@ -1,11 +1,58 @@
\chapter{Discussion}
\label{ch:discussion}
This chapter interprets the results presented in Chapter~\ref{ch:results} against the two research questions established in Chapter~\ref{ch:evaluation}, situates the findings within the broader literature on WoZ methodology, and identifies the limitations of this study.
\section{Interpretation of Findings}
\subsection{Research Question 1: Accessibility}
The first research question asked whether HRIStudio enables domain experts without prior robotics experience to successfully implement a robot interaction from a written specification. The Choregraphe study condition provides the baseline against which this question is evaluated.
The six completed sessions provide directional evidence on the accessibility question. Across the three Choregraphe wizards, design fidelity scores were 42.5, 65, and 62.5, yielding a Choregraphe mean of 56.7. Across the three HRIStudio sessions, all three wizards achieved a DFS of 100. No HRIStudio wizard required a T-type intervention that reflected an inability to operate the platform; the T-type marks logged for W-05 concerned interface orientation, and those logged for W-06 concerned gesture execution details (parallel execution and posture-reset blocks), neither of which constituted fundamental operational barriers. By contrast, Choregraphe produced design difficulties across all three sessions. W-01 required T-type assistance for connection routing and branch wiring. W-03 required no T-type interventions but over-engineered the design, adding concurrent execution nodes and attempting onboard speech-recognition logic that falls outside the WoZ paradigm. W-04 required T-type assistance for speech content punctuation and a failed choice block attempt.
The SUS scores reinforce this pattern. Choregraphe SUS scores were 60, 75, and 42.5 (mean 59.2), a mean below the average usability benchmark of 68~\cite{Brooke1996}, with two of the three individual ratings below the benchmark. HRIStudio SUS scores were 90, 70, and 70 (mean 76.7), all above the benchmark. The Choregraphe study condition produced the lowest single SUS score in the study (42.5, W-04), a wizard who described the platform as getting in the way of their attempt. The HRIStudio study condition produced the highest (90, W-02). Because assignment was stratified by programming background, each condition contains exactly one wizard with \emph{None} experience, one with \emph{Moderate} experience, and one with \emph{Extensive} experience, enabling a direct cross-background comparison: W-01 (\emph{None}, Choregraphe, SUS 60) versus W-05 (\emph{None}, HRIStudio, SUS 70); W-04 (\emph{Moderate}, Choregraphe, SUS 42.5) versus W-02 (\emph{Moderate}, HRIStudio, SUS 90); W-03 (\emph{Extensive}, Choregraphe, SUS 75) versus W-06 (\emph{Extensive}, HRIStudio, SUS 70). HRIStudio scores exceed Choregraphe scores at the \emph{None} and \emph{Moderate} levels; at the \emph{Extensive} level the scores reverse by five points, suggesting that extensive programming experience largely attenuates the tool-level usability difference. It is worth noting that only one participant (W-01, Digital Humanities) came from a non-STEM discipline; the remaining five wizards held backgrounds in Computer Science, Chemical Engineering, or Logic and Philosophy of Science, a composition that limits claims about accessibility for humanities-domain researchers.
The most striking accessibility finding comes from W-05: a Chemical Engineering faculty member with no programming experience trained in 6 minutes, completed a perfect design in 18 minutes with no operational confusion, and ran the trial to conclusion. This outcome directly addresses the accessibility research question. HRIStudio's timeline-based model and guided workflow allowed a domain novice to implement the written specification correctly on their first attempt, without the interface friction that blocked or slowed all three Choregraphe wizards. Session timing data underscores the difference: Choregraphe design phases averaged 35.7 minutes (two overruns, one incomplete), while HRIStudio design phases averaged 21 minutes (all three within the allocation). Underlying this difference is a structural property of the two tools: HRIStudio's model is domain-specific to Wizard-of-Oz execution, so wizard effort is channeled toward implementing the specification more completely rather than elaborating the tool's architecture. Choregraphe's general-purpose programming model makes the opposite available, and both W-03 and W-04 took it, spending time on concurrent execution structures and a speech-recognition-driven choice block that the WoZ context does not support. No HRIStudio wizard had that option, and all three scored 100 on the DFS.
\subsection{Research Question 2: Reproducibility}
The second research question asked whether HRIStudio produces more reliable execution of a designed interaction compared to Choregraphe. The most instructive finding from W-01's session is not a score but an incident: the wizard deviated from the written specification by substituting different speech content, and this was not flagged or caught until the live trial (see Section~\ref{sec:results-qualitative} for the full account).
This is precisely the failure mode the reproducibility problem predicts. Riek's~\cite{Riek2012} review found that fewer than 4\% of published WoZ studies reported any measure of wizard error, meaning most studies have no mechanism to detect whether execution matched design intent. W-01's session demonstrates that such deviations occur even in controlled conditions with a single, simple specification and an engaged wizard. The deviation was not negligence; it was creative drift made possible by a tool that places no structural constraint on what the wizard types into a speech action.
HRIStudio's protocol enforcement model is designed to prevent this class of deviation by locking speech content at design time. The available data supports this design intent. No speech content deviations occurred in any of the three HRIStudio sessions. W-05 added an action beyond the specification (a crouch gesture), but this was an elaboration of the gesture category rather than a substitution of specified content, and it was scored within the rubric. The Choregraphe study condition produced the only speech substitution in the dataset (W-01) and two sessions in which branching was absent from the design entirely (W-03, W-04).
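A minimal sketch makes the structural difference concrete. The types below are illustrative only, not HRIStudio's actual API: a design-time-locked speech action carries its exact text as immutable data that the execution interface can only trigger, whereas a free-text path lets the wizard compose content at trial time.
\begin{verbatim}
// Illustrative sketch only; not HRIStudio's actual API.
// A speech action whose content is fixed when the design is saved.
interface LockedSpeechAction {
  readonly kind: "speech";
  readonly text: string; // locked at design time
}

// At trial time the wizard can only trigger the action; there is
// no parameter through which the spoken content could drift.
function execute(action: LockedSpeechAction,
                 say: (text: string) => void): void {
  say(action.text); // always the specified content
}

// By contrast, a free-text path admits the W-01 failure mode:
function executeFreeText(say: (text: string) => void,
                         typed: string): void {
  say(typed); // whatever the wizard typed, on- or off-specification
}
\end{verbatim}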
ERS scores reflect the downstream effect of these design differences. Choregraphe ERS scores were 65, 60, and 75 (mean 66.7). HRIStudio ERS scores were 95, 95, and 100 (mean 96.7). The branching item is particularly instructive: in the Choregraphe study condition, branch execution was either absent from the design entirely (W-03) or present but not implemented as conditional logic (W-01, W-04). W-01 resolved the branch by manually re-routing connections during the trial; W-04 required a T-type trial intervention to be reminded how to trigger the branch step. In all three HRIStudio sessions, the conditional branch was present in the design and executed during the trial. W-05's branch fired cleanly via programmed conditional logic; W-02's session saw a brief platform-side step misfire immediately corrected by manual step selection, logged as an H-type (platform behavior) intervention rather than a wizard error; W-06's branch fired cleanly with no intervention of any kind. In no HRIStudio session did branch execution depend on tool-operation guidance from the researcher.
\subsection{Session Timing and Downstream Effects}
W-01's design phase extended to 35 minutes, overrunning the 30-minute allocation by five minutes and leaving approximately five minutes for the trial phase. It is worth distinguishing between the two factors at play here: the overrun reflected both the tool's demands on the wizard and a procedural decision not to interrupt W-01 at the 30-minute mark. Subsequent sessions enforced the transition to the trial phase at 30 minutes regardless of design completion status, consistent with the observer protocol. That said, if a tool's demands make design completion within the allocation genuinely difficult, the risk of an overrun is real regardless of enforcement: a wizard who has not finished at 30 minutes faces a reduced trial window no matter when the cutoff is applied.
Across all six sessions, design phase overruns are concentrated in the Choregraphe study condition. W-01 and W-03 each exceeded the 30-minute design target but completed their designs before the session time limit; W-04 was the only wizard cut off by the limit without finishing. No HRIStudio wizard exceeded the target. This pattern holds across programming backgrounds: W-01 (non-programmer) and W-03 (extensive programmer) both overran in the Choregraphe study condition, while W-05 (non-programmer, HRIStudio) completed in 18 minutes and W-06 (extensive programmer, HRIStudio) completed in 21 minutes. The timing data thus corroborates the DFS and SUS findings as a supplementary accessibility indicator; because programming experience was balanced across conditions by stratified assignment, the overrun pattern is attributable to the assigned tool rather than to prior programming experience.
\section{Comparison to Prior Work}
Pot et al.~\cite{Pot2009} introduced Choregraphe as a tool for enabling non-programmers to create NAO behaviors, but subsequent HRI research has treated it primarily as a programmer's tool in practice. This study confirms that characterization: W-01 (no programming experience) and W-04 (moderate experience) both required substantial T-type assistance and produced incomplete or deviation-prone designs, while W-03 (extensive experience) navigated the interface without T-type support yet still over-engineered the design and scored below every HRIStudio participant on both DFS and ERS. Riek's~\cite{Riek2012} observation that WoZ tools tend to require substantial technical investment even when the underlying experiment is conceptually simple holds across all three Choregraphe sessions regardless of background. In contrast, the HRIStudio results support the claim advanced in prior work~\cite{OConnor2024, OConnor2025} that a domain-specific, web-based platform can decouple task complexity from interface complexity: all three HRIStudio wizards---spanning no, moderate, and extensive programming experience---achieved a perfect DFS, and none encountered a fundamental barrier to operating the platform.
The specification deviation in W-01's session connects directly to Porfirio et al.'s~\cite{Porfirio2023} argument that formal, verifiable behavior specifications are a prerequisite for reproducible HRI. Porfirio et al. propose specification languages as the solution; HRIStudio takes a complementary approach by embedding the specification into the execution environment, making deviation structurally harder rather than formally detectable after the fact. The ERS data confirms this design intent: no speech content deviations occurred across all three HRIStudio sessions, and the HRIStudio ERS mean of 96.7 versus 66.7 for Choregraphe supports the conclusion that structural enforcement produces more reliable execution in practice. Riek's~\cite{Riek2012} finding that only 3.7\% of published WoZ studies reported any measure of wizard error makes this comparison particularly significant: the ERS operationalizes exactly the kind of execution measurement the literature has consistently omitted, and the difference it surfaces here is substantial.
The SUS scores are consistent with prior tool evaluations in HCI. The Choregraphe mean of 59.2 falls below the average benchmark of 68~\cite{Brooke1996} and below scores reported for general-purpose visual programming environments in comparable studies, consistent with Bartneck et al.'s~\cite{Bartneck2024} finding that domain-specific design is necessary to make tools genuinely accessible to non-programmers. The HRIStudio mean of 76.7 places the platform above the benchmark across all three sessions. Because programming experience is balanced across conditions by design, the overall 17.5-point gap in the two conditions' means is unlikely to be an artifact of the sample's background composition. The gap is largest at the \emph{Moderate} experience level (W-02 HRIStudio 90 vs.\ W-04 Choregraphe 42.5) and smallest at the \emph{Extensive} level, where the scores reverse by five points (W-03 Choregraphe 75 vs.\ W-06 HRIStudio 70), suggesting that extensive programming experience largely attenuates the tool-level usability difference while the accessibility advantage remains pronounced for non-programmers and moderate programmers.
\section{Limitations}
This study has several limitations that must be considered when interpreting the findings.
\textbf{Sample size.} With six wizard participants ($N = 6$), the study is too small for inferential statistics. The reported scores are descriptive. Patterns in the data can suggest directions for future work but cannot establish causal claims about the effect of the tool on design fidelity or execution reliability.
\textbf{Trial execution without a separate test subject.} Following scheduling difficulties, the study protocol was adjusted so that the wizard executes the designed interaction directly rather than running it for a separate test subject. Because the DFS and ERS are scored against the exported project file and live observation rather than a subject's behavioral responses, this change does not affect the primary quantitative measures. The trial phase evaluates whether the wizard's design executes as specified; the presence or absence of a separate subject does not alter that criterion.
\textbf{Single task.} Both study conditions used the same Interactive Storyteller specification. While this controls for task difficulty, it limits generalizability. The task is simple relative to real HRI experiments; the gap between conditions may be larger or smaller with a more complex protocol involving multiple branches or longer interaction sequences.
\textbf{Uncontrolled dimensions.} Programming experience was balanced across conditions by stratified assignment (see Section~\ref{sec:measures} and Chapter~\ref{ch:evaluation}): each condition contains one wizard at each of the three experience levels (\emph{None}, \emph{Moderate}, \emph{Extensive}). This controls for programming background as a potential confounder but does not extend to other dimensions. The small $N$ means that balance on other potentially relevant dimensions (disciplinary background, prior experience with visual programming tools, or familiarity with robots more broadly) was not assessed or controlled and remains a source of variability not addressed in this pilot.
\textbf{Platform version.} HRIStudio is continuously evolving. The version used in this study represents the system at a specific point in time. Future iterations may change how the wizard interface presents protocol steps, how branch conditions are constructed during the design phase, or how protocol enforcement is applied during execution. Any of these changes could affect how easily a non-programmer completes the design challenge or how reliably the tool enforces the specification during the trial, potentially altering the DFS and ERS scores observed under otherwise identical conditions. Results from this study therefore describe the system as it existed at the time of data collection and may not generalize to later releases.
\section{Chapter Summary}
This chapter interpreted the results of all six completed pilot sessions against the two research questions and connected the findings to prior work. Across all primary measures, the directional evidence favors HRIStudio. HRIStudio wizards uniformly achieved perfect design fidelity (DFS 100) and near-perfect execution reliability (mean ERS 96.7), while Choregraphe wizards averaged DFS 56.7 and ERS 66.7, with design overruns in all three sessions and no session completing without researcher guidance. The W-01 content deviation (see Section~\ref{sec:results-qualitative}) illustrates the reproducibility problem concretely; its absence in all three HRIStudio sessions is consistent with the enforcement model's design intent. Programming backgrounds are balanced across study conditions by stratified assignment, strengthening the cross-background comparisons. The limitations of this pilot study, including sample size, task simplicity, and the single-session design, are acknowledged and inform the future directions described in Chapter~\ref{ch:conclusion}.
+38 -2
View File
@@ -1,8 +1,44 @@
\chapter{Conclusion and Future Work}
\label{ch:conclusion}
This thesis set out to address two persistent problems in Wizard-of-Oz-based social robotics research. The first is the Accessibility Problem: a high technical barrier prevents domain experts who are not programmers from conducting HRI studies independently. The second is the Reproducibility Problem: the fragmented landscape of custom tools makes it difficult to verify or replicate experimental results across studies and labs. This chapter summarizes the contributions of the work, reflects on what the pilot study results suggest, and identifies directions for future investigation.
\section{Contributions}
This thesis makes three contributions to the field of HRI research infrastructure.
\textbf{A principled architecture for WoZ platforms.} The primary contribution is a set of design principles for Wizard-of-Oz infrastructure: a hierarchical specification model (Study $\to$ Experiment $\to$ Step $\to$ Action), an event-driven execution model that separates protocol design from live trial control, and a plugin architecture that decouples experiment logic from robot-specific implementations. These principles are not specific to any one robot platform or research group; they describe a general approach to building WoZ tools that are simultaneously accessible to non-programmers and reproducible across executions. The principles were derived from a systematic analysis of reproducibility failures in published WoZ literature, grounded in the prior work of Riek~\cite{Riek2012} and Porfirio et al.~\cite{Porfirio2023}, and refined through the design and implementation process described in Chapters~\ref{ch:design} and~\ref{ch:implementation}.
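As a deliberately simplified illustration of the first principle, the hierarchy can be expressed as plain nested data types; the names and fields below are hypothetical rather than HRIStudio's actual schema.
\begin{verbatim}
// Illustrative sketch of the Study -> Experiment -> Step -> Action
// hierarchy; field names are hypothetical, not HRIStudio's schema.
type Action =
  | { kind: "speech"; text: string }        // content locked at design time
  | { kind: "gesture"; name: string }
  | { kind: "branch"; options: string[] };  // resolved by the wizard live

interface Step       { name: string; actions: Action[]; }
interface Experiment { name: string; steps: Step[]; }
interface Study      { title: string; experiments: Experiment[]; }
\end{verbatim}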
\textbf{HRIStudio: a complete, operational platform.} The second contribution is HRIStudio, an open-source, web-based platform that fully realizes the design principles described above. HRIStudio is distributed under the MIT License and available at a public repository. HRIStudio provides a visual experiment designer, a consolidated wizard execution interface, role-based access control for research teams, and a repository-based plugin system for integrating robot platforms including the NAO6 used in this study. HRIStudio demonstrates that the design principles are not only technically feasible but can be delivered as a complete system that real researchers use without programming expertise, making it both an artifact and an instrument of validation. The platform's architecture is documented in detail in Chapter~\ref{ch:implementation} and the accompanying technical appendix.
\textbf{Pilot empirical evidence.} The third contribution is a pilot between-subjects study comparing HRIStudio against Choregraphe as a representative baseline tool. While the pilot scale precludes inferential claims, the study provides directional evidence on both research questions and produces a concrete demonstration of the reproducibility problem in a controlled setting: a wizard using Choregraphe deviated from the written specification in a way that was undetected until the live trial. This incident motivates the enforcement model at the core of HRIStudio's design and illustrates why the reproducibility problem is difficult to solve through training or norms alone.
\section{Reflection on Research Questions}
The central question this thesis addressed was: \emph{can the right software architecture make Wizard-of-Oz experiments more accessible to non-programmers and more reproducible across participants?} The evidence from the pilot study suggests the answer is yes, with the qualifications appropriate to a small $N$ directional study.
On accessibility, all three HRIStudio sessions produced a perfect DFS of 100, with design phases averaging 21 minutes, all within the allocation. The most direct demonstration comes from W-05: a Chemical Engineering faculty member with no programming background trained in 6 minutes, completed a perfect design in 18 minutes, and ran the trial to conclusion. SUS scores reflect the same directional split: Choregraphe mean 59.2 (below the average benchmark of 68), HRIStudio mean 76.7 (above it).
On reproducibility, the content deviation observed in W-01's Choregraphe session (see Section~\ref{sec:results-qualitative}) illustrates the failure mode the reproducibility problem predicts. No equivalent deviation occurred in any HRIStudio session. Branching was present and executed correctly in all three HRIStudio trials; ERS means reflect the outcome: 66.7 for Choregraphe, 96.7 for HRIStudio.
\section{Future Directions}
The work described in this thesis suggests several directions for future investigation.
\textbf{Larger validation study.} The most immediate next step is a full-scale study with sufficient participants to support inferential analysis. A sample of 20 or more wizard participants, balanced across programming backgrounds and conditions, would allow the DFS and ERS comparisons to be evaluated for statistical significance. A larger study would also enable subgroup analysis, for example whether the accessibility benefit of HRIStudio is concentrated among non-programmers or extends equally to programmers.
\textbf{Evaluations with multiple different tasks.} The Interactive Storyteller is a simple single-interaction task with one conditional branch. Real HRI experiments are more complex: they involve multiple conditions, longer interactions, and more elaborate branching logic. Evaluating HRIStudio on richer specifications would test whether the accessibility and reproducibility benefits scale with task complexity, and whether any new limitations emerge at that scale.
\textbf{Longitudinal use.} This study evaluated first-session performance, which captures the initial learning curve but not longer-term practice. A longitudinal study tracking wizard performance across multiple sessions would reveal whether HRIStudio's benefits persist or diminish as wizards become proficient, and whether the tool's structured approach continues to enforce reproducibility over time.
\textbf{Observer and researcher roles.} HRIStudio's role-based architecture includes Observer and Researcher roles that were not formally evaluated in this study. Future work should investigate how these roles support team coordination in multi-experimenter studies, and whether the annotation and logging capabilities they enable produce analysis workflows that are meaningfully more efficient than manual video coding.
\textbf{Platform expansion.} The NAO integration used in this study is one instance of HRIStudio's plugin architecture. Extending the plugin ecosystem to include mobile robots, socially assistive robots, and non-humanoid platforms would broaden the system's applicability and test whether the plugin abstraction is sufficiently general to accommodate the range of robot capabilities used in published HRI research.
\textbf{Community adoption.} The reproducibility problem in WoZ research is ultimately a community problem, not a tool problem. Future work should investigate what it would take for HRIStudio to be adopted as shared infrastructure across multiple labs, including documentation standards, experiment sharing mechanisms, and incentive structures that make reproducibility a norm rather than an exception.
\section{Closing Remarks}
The Wizard-of-Oz technique is one of the most powerful tools available to HRI researchers: it allows the study of interaction designs that do not yet exist as autonomous systems, accelerating the feedback loop between design intuition and empirical evidence. But the technique has been practiced for decades without the infrastructure needed to make it systematic. Studies are conducted with custom tools that are not always shared, by wizards whose behavior is rarely verified against a protocol, producing results that cannot be replicated because the exact conditions that produced them were not precisely recorded.
HRIStudio is an attempt to build that infrastructure. It will not solve the reproducibility problem by itself; that requires community norms, institutional incentives, and continued investment in open, shared tooling. But it demonstrates that the technical barriers are not insurmountable: a web-based platform can make WoZ research accessible to domain experts who are not engineers, and execution enforcement can prevent the kinds of specification drift that silently degrade research quality. That is, at minimum, where the work begins.
+97
@@ -0,0 +1,97 @@
\chapter{AI-Assisted Development Workflow}
\label{app:ai_workflow}
This appendix documents the role that AI coding assistants played in the construction of HRIStudio. It is included both for transparency about how the system was built and because the workflow itself is, in my view, one of the more interesting artifacts produced by the project. Section~\ref{sec:ai-ws} in Chapter~\ref{ch:implementation} introduces the topic briefly; here I describe the specific responsibilities I kept for myself, the tasks I delegated to coding agents, the tools I used, the limits I encountered, and the integrity controls I maintained between implementation work and the evaluation reported in Chapter~\ref{ch:results}.
\section{Context}
\label{sec:ai-context}
I built HRIStudio while also carrying a full course load, writing this thesis, and running the pilot validation study described in Chapter~\ref{ch:evaluation}. The feature surface described in Chapters~\ref{ch:design} and~\ref{ch:implementation} is larger than what I could reasonably have produced on that schedule without assistance, given the scope and ambition of the work. AI coding assistants made that scope tractable. They did not replace design judgment; they reduced the cost of the mechanical work that sits between a well-specified design and a working feature: scaffolding new modules, implementing well-defined create/read/update/delete (CRUD) and validation code, applying consistent patterns across files, and producing the many small edits that a project of this size accumulates.
The set of tools available to me as a solo developer changed substantially during the project's timeline. When I began, agentic coding tools were still early and most of my AI use was conversational, primarily through Cursor~\cite{CursorEditor} and Zed~\cite{ZedEditor}. By the end of the project, multiple mature terminal- and editor-integrated agents were available. I changed tools as the landscape evolved, eventually settling into a mixed workflow across Visual Studio Code, Antigravity~\cite{GoogleAntigravity}, Claude Code~\cite{AnthropicClaudeCode}, and OpenCode~\cite{OpenCode}.
\section{Tools and Hardware}
\label{sec:ai-tools}
Table~\ref{tbl:ai-tools} lists the tools I used during development and the capacity in which I used each. The split between them was determined partly by capability and partly by availability over time.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|p{3.4in}|}
\hline
\textbf{Tool} & \textbf{Category} & \textbf{Primary use} \\
\hline
Claude~\cite{Anthropic2024Claude} & Chat model & Design discussions, architectural review, debugging assistance, and refactoring proposals. \\
\hline
Claude Code~\cite{AnthropicClaudeCode} & Terminal agent & Multi-file feature implementation against a written spec; codemod-style refactors; and test scaffolding. \\
\hline
OpenCode~\cite{OpenCode} & Terminal agent & Same class of task as Claude Code, used when I preferred an open-source workflow or a different backing model. \\
\hline
Gemini CLI~\cite{GeminiCLI} & Terminal agent & Occasional cross-check on changes produced by a different agent, and work against Google's models when I wanted a second reading of a larger diff. \\
\hline
Antigravity~\cite{GoogleAntigravity} & IDE agent & Editor-integrated agentic coding work, primarily late in the project as the tool became available. \\
\hline
Cursor~\cite{CursorEditor} & Editor & Early development; AI-native editing and indexing. \\
\hline
Zed~\cite{ZedEditor} & Editor & High-performance editing; transition phase before moving to specialized agents. \\
\hline
\end{tabular}
\caption{AI tools used during HRIStudio development.}
\label{tbl:ai-tools}
\end{table}
Beyond cloud-hosted models, I experimented with local execution using \texttt{llama.cpp} to run various open-weights models on my own hardware (Apple M4 Pro, 14-core CPU, 48GB RAM). While the hardware was capable of running 7B and 14B parameter models at high throughput, the reasoning performance of the local models frequently lagged behind that of the frontier models. The additional cognitive overhead of correcting errors in local model output outweighed the benefits of offline execution, so I relied primarily on the cloud-hosted agents for complex implementation tasks.
\section{Division of Responsibility}
\label{sec:ai-division}
My working rule throughout the project was that I handled the engineering decisions and the agents fleshed out the implementation. In practice, this meant that I was responsible for every decision with downstream consequences for the shape of the system, and the agents were responsible for producing code that realized those decisions. Concretely, I did the following work directly, without delegating it to an agent:
\begin{itemize}
\item \textbf{Architecture.} The three-tier structure described in Chapter~\ref{ch:design}, the separation between experiment specifications and trial records, the choice to route all robot communication through plugin files, and the overall shape of the event-driven execution model were mine. I wrote these decisions as prose before any code was written.
\item \textbf{Data model.} The PostgreSQL schema and the tRPC procedure boundaries were designed by me. Because downstream type safety depends on the shape of the schema and the API, I was unwilling to let an agent make those choices.
\item \textbf{Research design.} The pilot validation study in Chapter~\ref{ch:evaluation} was designed and analyzed entirely by me. The Observer Data Sheet, Design Fidelity Score rubric, and Execution Reliability Score rubric were written by hand. No AI tool was used to score sessions, compute results, or draft claims about what the data showed.
\item \textbf{The prose of this thesis.} Every chapter was written by me. The structure of the argument and the specific claims I make are my own. While AI assisted with the nuances of \LaTeX{} formatting (particularly the generation of TikZ diagrams and complex chart syntax), the content is mine.
\end{itemize}
\section{Evolution of the Workflow}
\label{sec:ai-pattern}
My use of these tools evolved over the course of the project as the models improved. Early on, I treated the agent's output as a draft that required line-by-line review. The typical loop followed five steps: writing a specification, generating a diff, reading the diff, running the code, and then accepting or rejecting the change.
As the models improved and the agents became more reliable, the focus of my effort shifted. By the final stages of development, I spent far less time on manual line-by-line reviews and more time on empirical testing. I moved from being a ``code reviewer'' to a ``test-driven supervisor.'' If the agent produced a feature that passed my manual acceptance tests and integrated correctly with the existing system, I was more likely to accept the implementation without auditing every line. This shift increased development velocity considerably in the weeks leading up to the evaluation.
\section{What Worked and What Did Not}
\label{sec:ai-limits}
The tasks that agents handled well were those with a narrow and well-specified interface. Implementing a tRPC procedure from a signature, writing a Drizzle migration that matched a schema diff, adding a new field through an existing form, or applying a consistent rename across files: these were cheap to specify and the agent's output was usually accepted on the first or second iteration. Agents were also good at scaffolding: producing the initial shape of a component, test file, or API route that I then edited to completion.
The tasks that agents handled poorly were those that required reasoning across more of the system than the context window could hold, or that depended on a piece of context I had not written down. Cross-cutting changes to the experiment and trial data models, for example, required careful coordination across the schema, the tRPC procedures, the execution runtime, and the analysis interface. When I tried to delegate changes of this shape to an agent, the diffs were often locally plausible but globally inconsistent; I ended up doing that work myself. Subtle concurrency and timing questions in the execution layer were another category the agents did not handle well. The event-driven execution model in Chapter~\ref{ch:design} has enough non-obvious ordering constraints that an agent without the full picture tended to introduce races; those parts of the codebase I wrote by hand.
Across the full set of tools I used, the differences in capability for the work I asked of them were smaller than I expected. Any of the agents could, in principle, produce a correct diff for a well-scoped task, and when one tool failed it was usually because the task was underspecified rather than because of a difference in model capability. The practical differences between tools showed up at the workflow level rather than the capability level: which shell integration I preferred, how the tool handled long diffs, and how it behaved when it needed to ask for clarification.
\section{Research Integrity}
\label{sec:ai-integrity}
Because this thesis reports an empirical evaluation, I treat the boundary between AI-assisted development and the evaluation itself as a matter of research integrity rather than a matter of preference. The following statements reflect the actual workflow I followed:
\begin{itemize}
\item No AI tool generated, modified, or interpreted any of the evaluation data reported in Chapter~\ref{ch:results}. Every Design Fidelity Score, Execution Reliability Score, and System Usability Scale rating was recorded by me during or immediately after each session from direct observation, using the rubrics in Appendix~\ref{app:blank_templates}.
\item No AI tool produced the tables, means, or comparative claims in Chapter~\ref{ch:results}. The numbers were tabulated by hand from the completed Observer Data Sheets reproduced in Appendix~\ref{app:completed_materials}, and the claims about what those numbers support or do not support are mine.
\item No AI tool drafted the prose of this thesis. The chapters were written by me, in my own voice, and I am responsible for every claim they make and every argument they advance. AI tools were occasionally used as a proofreading aid to catch typos, flag awkward phrasing, or suggest an alternative word; however, the sentences are mine.
\item The code that implements HRIStudio and that was the subject of the evaluation was written under the workflow described in Sections~\ref{sec:ai-division} and~\ref{sec:ai-pattern}. Agents produced drafts; I read, tested, and accepted or rejected every one. The final state of the code is the product of my engineering decisions, regardless of who wrote any particular line.
\end{itemize}
\section{A Note on the Workflow as a Contribution}
\label{sec:ai-reflection}
The workflow described in this appendix is not a contribution of the thesis, and I do not claim that it is generalizable or optimal. I describe it because it is the actual workflow under which the system was built, and because a reader evaluating the claims in Chapter~\ref{ch:results} is entitled to know how the system being evaluated came into existence.
The more interesting observation, at least to me, is about where the boundary between human and agent naturally fell in practice. It fell at the point where a task required a decision with downstream consequences for the shape of the system. Tasks that realized a decision were inexpensive to delegate and inexpensive to verify; tasks that made a decision were neither, and delegating them produced diffs that were locally plausible and globally wrong. Whether that boundary will move as tools improve is a question I cannot answer from the evidence of a single project, but the boundary was stable across every tool I used during this one.
+15
@@ -0,0 +1,15 @@
\chapter{Blank Study Templates}
\label{app:blank_templates}
This appendix contains the blank versions of all study instruments used in the pilot validation study. These templates were used to produce the completed materials in Appendix~\ref{app:completed_materials}.
A note on the Informed Consent Form (ICF): the ICF was submitted with the original protocol to the Bucknell University Institutional Review Board (Protocol \#2526-025) and reflects the study design as initially proposed. The protocol was refined before data collection began; the key differences between the ICF and the executed protocol are as follows. First, phase durations were adjusted: Training was planned at 15 minutes, the Design Challenge at 30 minutes, the Live Trial at 10 minutes, and the Debrief at 5 minutes, rather than the 10/20/15/15-minute allocations stated in the ICF. Second, screen recording during the design phase was not implemented, as the DFS is scored from the exported project file rather than from screen footage. Third, the live trial was conducted with the researcher serving as the test subject rather than a recruited student volunteer, as discussed in Chapter~\ref{ch:evaluation}. The ODS, DFS, ERS, and SUS templates reflect the protocol as executed.
\medskip
\noindent\textbf{Contents of this appendix, in order:} ODS, DFS, ERS, SUS, ICF
\includepdf[pages=-,pagecommand={}]{pdfs/templates/ODS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/DFS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/ERS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/SUS-Template.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/templates/ICF-Template.pdf}
+43 -282
@@ -1,290 +1,51 @@
\chapter{Study Materials}
\label{app:materials}
\chapter{Completed Study Materials}
\label{app:completed_materials}
This appendix contains the study materials used in the evaluation described in Chapter~\ref{ch:evaluation}, in the order they were presented to participants.
This appendix contains the completed study instruments for each of the six sessions conducted prior to the submission of this thesis (W-01 through W-06). The DFS and ERS were scored during and immediately after each session using live observation and the Observer Data Sheet; the SUS was completed by the wizard during the debrief phase.
\section{Recruitment Materials}
\medskip
\noindent\textbf{Contents of this appendix, in order:}
\begin{itemize}
\item \textbf{W-01 (Choregraphe):} ODS, DFS, ERS, SUS
\item \textbf{W-02 (HRIStudio):} ODS, DFS, ERS, SUS
\item \textbf{W-03 (Choregraphe):} ODS, DFS, ERS, SUS
\item \textbf{W-04 (Choregraphe):} ODS, DFS, ERS, SUS
\item \textbf{W-05 (HRIStudio):} ODS, DFS, ERS, SUS
\item \textbf{W-06 (HRIStudio):} ODS, DFS, ERS, SUS
\end{itemize}
\subsection*{Email Invitation (Wizard Participants)}
% --- W-01 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/ODS-01.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/DFS-01.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/ERS-01.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/01/SUS-01.pdf}
\textit{Subject: Invitation to evaluate Human-Robot Interaction software (International Snacks provided!)}
% --- W-02 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/ODS-02.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/DFS-02.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/ERS-02.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/02/SUS-02.pdf}
Dear [Professor Name],
% --- W-03 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/ODS-03.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/DFS-03.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/ERS-03.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/03/SUS-03.pdf}
I am conducting an honors thesis study to evaluate ``HRIStudio'', a new platform for designing human-robot interactions. I am seeking Computer Science faculty to act as expert reviewers by participating in a 75-minute Wizard-of-Oz design session.
% --- W-04 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/ODS-04.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/DFS-04.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/ERS-04.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/04/SUS-04.pdf}
You will be asked to spend 30 minutes programming a simple behavior on the NAO robot using either HRIStudio or Choregraphe, and then run it live with a student volunteer. No prior experience with the NAO robot is required.
% --- W-05 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/ODS-05.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/DFS-05.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/ERS-05.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/05/SUS-05.pdf}
International snacks and refreshments will be provided during the session. If you are willing to participate, please reply to schedule a time.
\hfill Sean O'Connor (\texttt{sso005@bucknell.edu})
\subsection*{Campus Flyer (Test Subject Participants)}
\begin{center}
\textbf{\large VOLUNTEERS NEEDED: INTERACT WITH A ROBOT!}
\vspace{0.4cm}
Participate in a short 15-minute session with a NAO humanoid robot.
\vspace{0.4cm}
\textbf{Snacks from around the world will be provided!}
\vspace{0.2cm}
Contact: \texttt{sso005@bucknell.edu}
\end{center}
\section{Informed Consent Forms}
\subsection*{Wizard Participant Consent Form}
\textbf{HRIStudio User Study --- Informed Consent (Faculty/Wizard Participant)}
\textbf{Introduction:} You are invited to participate in a research study evaluating a new software platform for the NAO robot. This study is conducted by Sean O'Connor (Student PI) and Dr.~L.~Felipe Perrone (Advisor) in the Department of Computer Science at Bucknell University.
\textbf{Purpose:} The purpose of this study is to compare the usability and reproducibility of a new visual programming tool (HRIStudio) against the standard software (Choregraphe).
\textbf{Procedures:} If you agree to participate, you will complete the following in a single 75-minute session:
\begin{enumerate}
\item \textbf{Training (15 min):} A brief tutorial on your assigned software interface covering speech, gesture, and branching.
\item \textbf{Design Challenge (30 min):} You will receive a written storyboard and program it on the NAO robot using your assigned tool.
\item \textbf{Live Trial (15 min):} A student volunteer will enter the room and you will run your program to deliver the story to them.
\item \textbf{Debrief (15 min):} You will complete a short usability survey.
\end{enumerate}
\textbf{Data Collection:} Your workflow will be screen-recorded during the design phase. The live trial will be video recorded to verify robot behavior. All data will be stored on encrypted drives and your identity replaced with a numerical code (e.g., W-01).
\textbf{Risks and Benefits:} There are no known risks beyond those of normal computer use. You will receive international snacks and refreshments during the session. Your participation contributes to research on accessible tools for HRI.
\textbf{Voluntary Participation:} Participation is entirely voluntary and unrelated to any departmental obligations. You may withdraw at any time without penalty.
\textbf{Questions:} Contact Sean O'Connor (\texttt{sso005@bucknell.edu}) or the Bucknell IRB (\texttt{irb@bucknell.edu}).
\vspace{0.8cm}
\noindent\rule{0.55\textwidth}{0.4pt}\\
Signature of Participant \hspace{4cm} Date
\vspace{1.2cm}
\subsection*{Test Subject Consent Form}
\textbf{HRIStudio User Study --- Informed Consent (Student/Test Subject)}
\textbf{Introduction:} You are invited to participate in a 15-minute robot interaction session as part of a research study conducted in the Bucknell Computer Science Department.
\textbf{Procedure:} You will enter a lab room and listen to a short story told by a NAO humanoid robot. The robot will then ask you a comprehension question. The interaction takes approximately 5--10 minutes.
\textbf{Data Collection:} The session will be video recorded to analyze the robot's timing and behavior. Your responses are not being graded; we are evaluating the robot's performance, not yours.
\textbf{Risks and Benefits:} Minimal risk. You will receive international snacks and refreshments for your time.
\textbf{Voluntary Participation:} You may stop the interaction and leave at any time without penalty.
\vspace{0.8cm}
\noindent\rule{0.55\textwidth}{0.4pt}\\
Signature of Participant \hspace{4cm} Date
\section{Paper Specification: The Interactive Storyteller}
\textit{This document was given to each wizard participant at the start of the Design Phase.}
\textbf{Goal:} Program the robot to tell a short interactive story to a participant. The robot must introduce the story, deliver the narrative with appropriate gestures, ask a comprehension question, and respond to the participant's answer.
\textbf{Script and Logic Flow:}
\begin{enumerate}
\item \textbf{Start State}
\begin{itemize}
\item Robot is standing and looking at the participant.
\end{itemize}
\item \textbf{Step 1 --- The Hook}
\begin{itemize}
\item \textbf{Speech:} ``Hello. I want to tell you about someone named Dara ---
an astronaut who made a decision that changed what we thought we knew about Mars.
Are you ready?''
\item \textbf{Gesture:} Perform a slow open-hand gesture toward the participant, then lower both arms and stand still before continuing.
\end{itemize}
\item \textbf{Step 2 --- The Narrative}
\begin{itemize}
\item \textbf{Speech:} ``It was 2147. Dara's crew had been on the Martian surface for six days.
Mission protocol said to collect samples, document the terrain, and stay on schedule.
But on the sixth morning, while the rest of the crew ran diagnostics,
Dara wandered off course.
About forty meters from camp, she stopped.
Half-buried in the dust was a rock she almost stepped on ---
smooth, the size of a fist, and glowing a deep, steady red.
Not reflecting sunlight. Glowing.
She knelt down, picked it up, and said nothing to anyone.''
\item \textbf{Gesture 1:} As the robot says ``stay on schedule,'' make a precise, dismissive hand wave.
\item \textbf{Gesture 2:} As the robot says ``she stopped,'' pause all motion for one full second.
\item \textbf{Gesture 3:} As the robot says ``glowing a deep, steady red,'' look slowly downward.
\item \textbf{Gesture 4:} As the robot says ``said nothing to anyone,'' lean slightly forward and lower the voice.
\end{itemize}
\item \textbf{Step 3 --- Comprehension Check (Branching)}
\begin{itemize}
\item \textbf{Speech:} ``She brought it home.
The mission report listed it as an anomalous geological sample.
NASA has been running tests on it ever since.
No one has published anything yet.''
\item \textbf{Gesture:} Stand upright, look directly at the participant, and pause for one full second.
\item \textbf{Question:} ``What color was the rock Dara found?''
\item \textbf{Branch A (Correct answer: ``Red'' or ``red''):}
\begin{itemize}
\item \textbf{Speech:} ``Red. And it was still glowing when she landed.''
\item \textbf{Gesture:} Robot nods once, slowly.
\end{itemize}
\item \textbf{Branch B (Any other answer):}
\begin{itemize}
\item \textbf{Speech:} ``Actually, red. Not reflecting light --- emitting it.''
\item \textbf{Gesture:} Robot shakes head once.
\end{itemize}
\end{itemize}
\item \textbf{Step 4 --- Conclusion}
\begin{itemize}
\item \textbf{Speech:} ``That was six years ago.
The rock is in a lab in Houston.
Dara still hasn't told anyone exactly where she found it.
That's the end of the story.''
\item \textbf{Gesture:} Stand still, lower arms to sides, and bow.
\end{itemize}
\end{enumerate}
\section{Post-Study Questionnaire (System Usability Scale)}
\textit{Completed by wizard participants after the live trial. Circle the number that best reflects your agreement with each statement.}
\vspace{0.4cm}
\noindent
\renewcommand{\arraystretch}{2.2}
\begin{tabularx}{\linewidth}{X *{5}{>{\centering\arraybackslash}p{0.85cm}}}
\textbf{Statement} & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} \\
\textit{\footnotesize (Circle one per row)}
& \textit{\footnotesize SD} & & & & \textit{\footnotesize SA} \\
\hline
1.\enspace I think that I would like to use this system frequently.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
2.\enspace I found the system unnecessarily complex.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
3.\enspace I thought the system was easy to use.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
4.\enspace I think that I would need the support of a technical person to be able to use this system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
5.\enspace I found the various functions in this system were well integrated.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
6.\enspace I thought there was too much inconsistency in this system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
7.\enspace I would imagine that most people would learn to use this system very quickly.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
8.\enspace I found the system very cumbersome to use.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
9.\enspace I felt very confident using the system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
10.\enspace I needed to learn a lot of things before I could get going with this system.
& $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ & $\bigcirc$ \\
\hline
\end{tabularx}
\renewcommand{\arraystretch}{1}
\vspace{0.4cm}
\noindent\textit{\footnotesize SD = Strongly Disagree \quad SA = Strongly Agree}
\section{Design Fidelity Score Rubric}
\textit{To be completed by the researcher after analyzing the exported project file.}
\vspace{0.3cm}
\noindent\textbf{Participant ID:} \underline{\hspace{3cm}} \hspace{1cm} \textbf{Condition:} \underline{\hspace{3cm}}
\vspace{0.4cm}
\renewcommand{\arraystretch}{1.6}
\begin{tabularx}{\linewidth}{X >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.4cm}}
\hline
\textbf{Component} & \textbf{Present} & \textbf{Correct} & \textbf{Points} \\
\hline
\multicolumn{4}{l}{\textbf{Speech Actions (40 points total)}} \\
\hline
1.\enspace Introduction speech (``Hello. I want to tell you about someone named Dara\ldots'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
2.\enspace Narrative speech (``It was 2147. Dara's crew\ldots'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
3.\enspace Question speech (``What color was the rock Dara found?'') & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
4.\enspace Response speeches (correct and incorrect branches) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Gestures and Actions (30 points total)}} \\
\hline
5.\enspace Open-hand gesture during introduction & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
6.\enspace At least two narrative gestures (pause, lean, gaze) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
7.\enspace Nod (correct branch) or head shake (incorrect branch) & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Control Flow and Logic (30 points total)}} \\
\hline
8.\enspace Conditional branch triggers on participant's answer & Y~~/~~N & Y~~/~~N & ~~~~~/15 \\
9.\enspace Correct sequencing of all four steps & Y~~/~~N & Y~~/~~N & ~~~~~/15 \\
\hline
\end{tabularx}
\renewcommand{\arraystretch}{1}
\vspace{0.4cm}
\noindent\textbf{Scoring:} Award full points if both Present \emph{and} Correct; 50\% if Present but not Correct; 0 if not Present.
\vspace{0.2cm}
\noindent\textbf{Total:} \underline{\hspace{2cm}} / 100 \hspace{1.5cm} \textbf{Design Fidelity Score:} \underline{\hspace{2cm}}\%
\vspace{0.3cm}
\noindent\textbf{Notes:}
\vspace{2.5cm}
\section{Execution Reliability Score Rubric}
\textit{To be completed by the researcher after reviewing the video recording of the live trial.}
\vspace{0.3cm}
\noindent\textbf{Participant ID:} \underline{\hspace{3cm}} \hspace{0.5cm} \textbf{Condition:} \underline{\hspace{3cm}}
\vspace{0.2cm}
\noindent\textbf{Video File:} \underline{\hspace{6cm}}
\vspace{0.4cm}
\renewcommand{\arraystretch}{1.6}
\begin{tabularx}{\linewidth}{X >{\centering\arraybackslash}p{1.4cm} >{\centering\arraybackslash}p{1.6cm} >{\centering\arraybackslash}p{1.4cm}}
\hline
\textbf{Behavior} & \textbf{Executed?} & \textbf{Correctly?} & \textbf{Points} \\
\hline
\multicolumn{4}{l}{\textbf{Speech Execution (40 points total)}} \\
\hline
1.\enspace Introduction speech delivered without errors & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
2.\enspace Narrative speech delivered without errors & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
3.\enspace Comprehension question delivered correctly & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
4.\enspace Appropriate branch response given & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Gesture and Movement Execution (30 points total)}} \\
\hline
5.\enspace Introduction gesture executed completely & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
6.\enspace At least two narrative gestures executed & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
7.\enspace Nod or head shake executed correctly & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{Timing and Synchronization (20 points total)}} \\
\hline
8.\enspace Speech and gestures synchronized & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
9.\enspace Pause held before comprehension question & Y~~/~~N & Y~~/~~N & ~~~~~/10 \\
\hline
\multicolumn{4}{l}{\textbf{System Reliability (10 points --- deduct if problems occur)}} \\
\hline
10.\enspace No disconnections, crashes, or hangs occurred & Y~~/~~N & N/A & ~~~~~/10 \\
\hline
\end{tabularx}
\renewcommand{\arraystretch}{1}
\vspace{0.4cm}
\noindent\textbf{Scoring:} Award full points if both Executed \emph{and} Correct; 50\% if Executed but not Correct; 0 if not Executed. For item 10, award full points only if no errors occurred.
\vspace{0.2cm}
\noindent\textbf{Total:} \underline{\hspace{2cm}} / 100 \hspace{1.5cm} \textbf{Execution Reliability Score:} \underline{\hspace{2cm}}\%
\vspace{0.3cm}
\noindent\textbf{Notes:}
\vspace{2.5cm}
% --- W-06 -------------------------------------------------------------------
\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/ODS-06.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/DFS-06.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/ERS-06.pdf}
\includepdf[pages=-,pagecommand={}]{pdfs/completed/06/SUS-06.pdf}
+231 -22
@@ -1,49 +1,258 @@
\chapter{Technical Documentation}
\label{app:tech_docs}
This appendix documents the specific technologies and libraries used to build HRIStudio, organized by the three architectural layers described in Chapter~\ref{ch:design}. The goal here is reference, not justification; Chapter~\ref{ch:implementation} explains the reasoning behind the major architectural choices.
This appendix documents the specific technologies, infrastructure, and integration mechanisms used to build HRIStudio, organized by the three architectural layers described in Chapter~\ref{ch:design}. The goal here is reference, not justification; Chapter~\ref{ch:implementation} explains the reasoning behind the major architectural choices.
\section{Technology Stack}
Table~\ref{tbl:tech-stack} lists the principal dependencies and their roles. The entire codebase is written in TypeScript, so type inconsistencies between layers are caught at compile time rather than appearing as runtime failures during a trial.
\begin{table}[htbp]
\centering
\footnotesize
\begin{tabular}{|l|l|l|}
\hline
\textbf{Component} & \textbf{Version} & \textbf{Role} \\
\hline
Next.js (App Router) & 16.2 & Full-stack React framework \\
\hline
React & 19.2 & User interface rendering \\
\hline
TypeScript & --- & Static typing across the full stack \\
\hline
tRPC & 11.10 & Type-safe API between client and server \\
\hline
Better Auth & 1.5 & Authentication and session management \\
\hline
Drizzle ORM & 0.41 & Type-safe database access and migrations \\
\hline
PostgreSQL & 15 & Primary relational database \\
\hline
MinIO & latest & S3-compatible object storage (video/audio) \\
\hline
Bun & runtime & WebSocket server for real-time trial communication \\
\hline
Tailwind CSS + shadcn/ui & 4.1 / 0.0.4 & Styling and UI component library \\
\hline
\texttt{@dnd-kit} & --- & Drag-and-drop for experiment designer \\
\hline
ROS~2 Humble & --- & Robot middleware (NAO6 integration stack) \\
\hline
Docker Compose & --- & Multi-container orchestration \\
\hline
\end{tabular}
\caption{Principal dependencies in the HRIStudio technology stack.}
\label{tbl:tech-stack}
\end{table}
\subsection{User Interface Layer}
The frontend is built on Next.js (App Router) using React and TypeScript. TypeScript is used throughout the entire codebase, including the server and data access layers, so that type inconsistencies between layers are caught at compile time rather than at runtime. Styling is handled with Tailwind CSS and the shadcn/ui component library. The drag-and-drop canvas in the Design interface uses the \texttt{@dnd-kit} library (\texttt{@dnd-kit/core} and \texttt{@dnd-kit/sortable}) to manage nested drag operations for arranging steps and action blocks.
The frontend is built on Next.js using React and TypeScript. Styling is handled with Tailwind CSS and the shadcn/ui component library, which provides accessible, pre-built UI primitives built on Radix UI. The drag-and-drop canvas in the Design interface uses the \texttt{@dnd-kit} library (\texttt{@dnd-kit/core} and \texttt{@dnd-kit/sortable}) to manage nested drag operations for arranging steps and action blocks.
\subsection{Application Logic Layer}
The server runs as a Next.js Node.js process. API routes use tRPC over HTTP for typed request/response calls; real-time communication during live trials uses a persistent WebSocket connection via the \texttt{ws} package. Authentication and session management are handled by NextAuth.js (v5 beta) with the \texttt{@auth/drizzle-adapter} and bcryptjs for password hashing. Currently, credential-based (username and password) authentication is supported.
The server runs as a Next.js process. API routes use tRPC over HTTP for typed request/response calls; real-time communication during live trials uses a separate WebSocket server running on the Bun runtime (described in Section~\ref{sec:ws-arch}). Authentication and session management are handled by Better Auth with the Drizzle adapter for database-backed sessions. Passwords are hashed with bcrypt (cost factor~12). Currently, credential-based (username and password) authentication is supported; the architecture allows adding OAuth providers without changes to the session model.
\subsection{Data and Robot Control Layer}
Experiment protocols, trial records, and user data are stored in PostgreSQL. The schema and all database queries are managed through Drizzle ORM, which provides compile-time type safety for database interactions. Action configuration parameters and plugin-specific fields are stored as JSONB columns, which allows the same schema to accommodate any robot's action types.
Experiment protocols, trial records, and user data are stored in PostgreSQL. The schema and all database queries are managed through Drizzle ORM, which provides compile-time type safety for database interactions. Action configuration parameters and plugin-specific fields are stored as JSONB columns, which allows the same schema to accommodate any robot's action types without schema migrations.
Video and audio recordings captured during trials are stored in a self-hosted MinIO instance, an S3-compatible object storage service. Recordings are captured in the browser using the native MediaRecorder API (assisted by \texttt{react-webcam}) and uploaded to MinIO as a chunked transfer when the trial concludes.
Video and audio recordings captured during trials are stored in a self-hosted MinIO instance, an S3-compatible object storage service. Recordings are captured in the browser using the native MediaRecorder API and uploaded to MinIO when the trial concludes. Structured data (experiment specifications, trial event logs) and media files are stored separately: the database handles queryable records, and MinIO handles large binary files that the system never queries by content.
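As a rough sketch of the capture path, the following TypeScript illustrates the browser-side flow; the upload endpoint and the one-second chunk interval are illustrative assumptions rather than the actual implementation.
\begin{verbatim}
// Sketch of browser-side trial recording. The endpoint path and the
// one-second chunk interval are illustrative assumptions.
async function recordTrial(trialId: string) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  const recorder = new MediaRecorder(stream, { mimeType: "video/webm" });
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => {
    if (e.data.size > 0) chunks.push(e.data);
  };
  recorder.onstop = async () => {
    // Assemble the chunks and upload once the trial concludes; the
    // server side streams the body into MinIO.
    const blob = new Blob(chunks, { type: "video/webm" });
    await fetch(`/api/trials/${trialId}/recording`, {
      method: "POST",
      body: blob,
    });
  };
  recorder.start(1000); // gather data in one-second chunks
  return () => recorder.stop(); // invoke when the trial ends
}
\end{verbatim}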
Robot communication is handled through a ROS Bridge (\texttt{rosbridge\_suite} or \texttt{ros2-web-bridge}) running on the robot's local network. The server connects to the bridge over a WebSocket and exchanges JSON-encoded ROS messages; it does not run as a ROS node itself. The bridge address is configured per robot in the plugin file (for example, \texttt{"rosbridgeUrl": "ws://localhost:9090"} in the NAO6 plugin).
Robot communication is handled through a ROS~2 WebSocket bridge running on the robot's local network. The HRIStudio server connects to the bridge over a WebSocket and exchanges JSON-encoded ROS messages; it does not run as a ROS node itself. The bridge address is configured per robot in the plugin file. For actions that do not require ROS message passing, the system can also execute commands directly on the robot via SSH (see Section~\ref{sec:nao6-integration}).
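For concreteness, the exchange with the bridge can be sketched as follows; the topic, message type, and values are illustrative examples of the rosbridge JSON protocol, not code taken from the NAO6 plugin.
\begin{verbatim}
// Sketch: publishing a JSON-encoded ROS message through the bridge,
// using the ws package. Topic and values are examples only.
import WebSocket from "ws";

const bridge = new WebSocket("ws://localhost:9090"); // rosbridgeUrl

bridge.on("open", () => {
  // The rosbridge protocol wraps every operation in a JSON envelope.
  bridge.send(JSON.stringify({
    op: "advertise",
    topic: "/cmd_vel",
    type: "geometry_msgs/msg/Twist",
  }));
  bridge.send(JSON.stringify({
    op: "publish",
    topic: "/cmd_vel",
    msg: {
      linear: { x: 0.1, y: 0.0, z: 0.0 },
      angular: { x: 0.0, y: 0.0, z: 0.0 },
    },
  }));
});
\end{verbatim}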
\section{Deployment}
\section{Deployment Infrastructure}
\label{sec:deployment}
The full stack is orchestrated using Docker Compose. The \texttt{docker-compose.yml} file defines three services: the PostgreSQL database (\texttt{postgres:15}), the MinIO storage instance, and the Next.js application server. Starting the entire system on any machine with Docker installed is a single \texttt{docker compose up} command. This configuration is intended for on-premises deployment, which is important for studies involving participant data that cannot leave the institution's network.
HRIStudio uses a double Docker Compose stack: one stack runs the application and its backing services, and a second stack runs the robot integration layer. This separation allows the application to run on any host while the robot stack runs on a machine with network access to the physical robot. Both stacks can run on the same machine for single-lab deployments.
\section{Plugin Specification}
\subsection{Application Stack}
Robot capabilities are defined in JSON plugin files. Each file describes a robot platform and the actions it supports. The structure of a plugin file is as follows:
The application stack is defined in \texttt{hristudio/docker-compose.yml} and provides three services:
\begin{description}
\item[db.] PostgreSQL~15 with a persistent named volume. Exposes port~5432.
\item[minio.] MinIO object storage with a persistent named volume. Exposes port~9000 (S3 API) and port~9001 (web console).
\item[createbuckets.] An initialization container that runs once at startup using the MinIO client to create the default storage bucket.
\end{description}
The Next.js application server and the Bun WebSocket server run outside Docker on the host, connecting to the containerized database and object store. Starting the backing services requires a single \texttt{docker compose up} command. This configuration is intended for on-premises deployment, which is important for studies involving participant data that cannot leave the institution's network.
\subsection{NAO6 Integration Stack}
\label{sec:nao6-integration}
The NAO6 integration stack is defined in a separate repository and provides three ROS~2 services that collectively bridge HRIStudio to the physical robot.
\begin{enumerate}
\item The \textbf{nao\_driver} service runs the NAOqi driver ROS~2 node, which connects to the NAO's proprietary framework over the local network and publishes sensor data (joint states, camera feeds) as standard ROS~2 topics.
\item The \textbf{ros\_bridge} service runs the \texttt{rosbridge} WebSocket server, which exposes all ROS~2 topics over a WebSocket interface on a configurable port (default~9090). This is the endpoint that the HRIStudio server connects to.
\item The \textbf{ros\_api} service provides runtime introspection of available ROS~2 topics, services, and parameters.
\end{enumerate}
All three services are built from a single Dockerfile based on the ROS~2 Humble base image (Ubuntu~22.04). The image installs the NAOqi driver and \texttt{rosbridge} server packages along with their dependencies (NAOqi libraries, bridge message types, OpenCV bridge, and TF2) and builds them with colcon. All services use host networking so that ROS~2 discovery and the NAOqi connection operate without port forwarding.
Before starting the driver, an initialization script connects to the NAO via SSH and prepares it for external control:
\begin{enumerate}
\item Disables Autonomous Life, which would otherwise cause the robot to move unpredictably.
\item Calls \texttt{ALMotion.wakeUp} to energize the motors.
\item Commands the robot to assume a standing posture via the ALRobotPosture service.
\end{enumerate}
Environment variables for the robot IP address, credentials, and bridge port are read from a \texttt{.env} file shared across all three services.
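A minimal sketch of that preparation sequence, assuming the \texttt{ssh2} package (the wiring and environment-variable names are illustrative; the \texttt{qicli} calls mirror the three steps above):
\begin{verbatim}
// Sketch of the NAO preparation sequence over SSH. The qicli calls
// follow the steps above; the ssh2 wiring is illustrative.
import { Client } from "ssh2";

const commands = [
  "qicli call ALAutonomousLife.setState disabled",   // 1. stop autonomy
  "qicli call ALMotion.wakeUp",                      // 2. energize motors
  "qicli call ALRobotPosture.goToPosture Stand 0.5", // 3. stand up
];

const conn = new Client();
conn
  .on("ready", () => {
    const runNext = (i: number): void => {
      if (i >= commands.length) return void conn.end();
      conn.exec(commands[i], (err, channel) => {
        if (err) throw err;
        channel.on("close", () => runNext(i + 1)).resume();
      });
    };
    runNext(0);
  })
  .connect({
    host: process.env.NAO_IP ?? "nao.local",
    username: "nao",
    password: process.env.NAO_PASSWORD,
  });
\end{verbatim}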
\subsection{Communication Between Stacks}
Figure~\ref{fig:deployment-arch} shows the relationship between the two Docker stacks and the components that run on the host. The HRIStudio server communicates with the robot integration stack over a single WebSocket connection to the \texttt{rosbridge\_websocket} endpoint. For actions that bypass ROS entirely (posture changes, animation playback), the server connects directly to the NAO via SSH and invokes NAOqi commands through the \texttt{qicli} command-line tool. Both communication paths are configured per-robot in the plugin file.
\begin{figure}[htbp]
\centering
\begin{tikzpicture}[
box/.style={rectangle, draw=black, thick, rounded corners=2pt, align=center,
font=\footnotesize, inner sep=4pt, minimum height=0.9cm},
container/.style={rectangle, draw=black!60, thick, dashed, rounded corners=4pt,
inner sep=8pt},
arrow/.style={->, thick},
lbl/.style={font=\scriptsize\itshape, fill=white, inner sep=1pt}]
%% ---- Browser ----
\node[box, fill=gray!10, minimum width=3.5cm] (browser) at (0, 7.2)
{Browser Client\\[-1pt]{\scriptsize React, tRPC, WebSocket}};
%% ---- Host processes ----
\node[box, fill=gray!20, minimum width=2.6cm] (nextjs) at (-1.8, 5.4)
{Next.js Server\\[-1pt]{\scriptsize port 3000}};
\node[box, fill=gray!20, minimum width=2.6cm] (wsserver) at (1.8, 5.4)
{Bun WS Server\\[-1pt]{\scriptsize port 3001}};
\begin{scope}[on background layer]
\node[container, fill=blue!4,
fit=(nextjs)(wsserver),
label={[font=\scriptsize\bfseries, anchor=south]above:Host}] {};
\end{scope}
%% ---- Docker App Stack ----
\node[box, fill=gray!15, minimum width=2.2cm] (pg) at (-1.8, 3.3)
{PostgreSQL\\[-1pt]{\scriptsize port 5432}};
\node[box, fill=gray!15, minimum width=2.2cm] (minio) at (1.8, 3.3)
{MinIO\\[-1pt]{\scriptsize port 9000}};
\begin{scope}[on background layer]
\node[container, fill=green!4,
fit=(pg)(minio),
label={[font=\scriptsize\bfseries, anchor=south]above:Application Stack}] {};
\end{scope}
%% ---- NAO6 Docker Stack ----
\node[box, fill=gray!30, minimum width=1.7cm] (driver) at (-2.4, 1.2)
{nao\_driver};
\node[box, fill=gray!30, minimum width=1.7cm] (bridge) at (0, 1.2)
{ros\_bridge\\[-1pt]{\scriptsize port 9090}};
\node[box, fill=gray!30, minimum width=1.7cm] (rosapi) at (2.4, 1.2)
{ros\_api};
\begin{scope}[on background layer]
\node[container, fill=orange!6,
fit=(driver)(bridge)(rosapi),
label={[font=\scriptsize\bfseries, anchor=south]above:NAO6 Integration Stack}] {};
\end{scope}
%% ---- NAO Robot ----
\node[box, fill=gray!40, minimum width=2.8cm] (nao) at (0, -0.8)
{NAO6 Robot\\[-1pt]{\scriptsize NAOqi}};
%% ---- Arrows: browser to host ----
\draw[arrow] (browser.south west) -- node[lbl, left] {HTTP} (nextjs.north);
\draw[arrow] (browser.south east) -- node[lbl, right] {WS} (wsserver.north);
%% ---- Host internal ----
\draw[arrow, dashed] (nextjs.east) -- node[lbl, above] {broadcast} (wsserver.west);
%% ---- Host to app stack (straight down) ----
\draw[arrow] (nextjs.south) -- (pg.north);
\draw[arrow] ([xshift=4pt]nextjs.south east) -- (minio.north west);
%% ---- Next.js to ros_bridge: route down the LEFT outside ----
\draw[arrow] (nextjs.west) -- ++(-1.2, 0) |- node[lbl, pos=0.22, left] {WS} (bridge.west);
%% ---- Next.js to NAO via SSH: route farther down the LEFT outside ----
\draw[arrow, dashed] ([yshift=-2pt]nextjs.west) -- ++(-1.6, 0) |- node[lbl, pos=0.18, left] {SSH} (nao.west);
%% ---- ROS containers to robot ----
\draw[arrow] (driver.south) -- ([xshift=-8pt]nao.north);
\draw[arrow] (bridge.south) -- ([xshift=8pt]nao.north);
\end{tikzpicture}
\caption{Deployment architecture: two Docker stacks and their communication paths.}
\label{fig:deployment-arch}
\end{figure}
\section{WebSocket Architecture}
\label{sec:ws-arch}
Real-time communication during trials is handled by a dedicated WebSocket server that runs as a separate process alongside the Next.js application server. The WebSocket server is implemented in TypeScript and runs on the Bun runtime, listening on port~3001.
When a wizard or observer opens the Execution interface for a trial, the browser establishes a WebSocket connection to the server, passing the trial identifier and an authentication token as query parameters. The server registers the connection in an in-memory map keyed by client identifier and also records it in the database (\texttt{hs\_ws\_connection} table) for persistence across restarts.
The server handles four message types from connected clients:
\begin{description}
\item[Heartbeat.] Keeps the connection alive; the server responds with a timestamped acknowledgment.
\item[Request trial status.] Returns the current trial state (status, current step index) by querying the database.
\item[Request trial events.] Returns the most recent trial events from the trial event log table.
\item[Ping.] Returns a pong response with a timestamp for latency measurement.
\end{description}
When the Next.js server needs to push an update to all clients observing a trial (for example, after a step completes), it sends an HTTP POST to the WebSocket server's internal \texttt{/internal/broadcast} endpoint. The WebSocket server then forwards the message to every client registered for that trial. This architecture separates the stateful WebSocket connections from the stateless HTTP request handling of the Next.js server.
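The essential shape of that server can be sketched as follows, assuming Bun's built-in WebSocket upgrade API; authentication, database persistence, and the four client message types are omitted.
\begin{verbatim}
// Sketch of the trial broadcast pattern on Bun. Message shapes and
// the endpoint body are illustrative, not the production code.
import type { ServerWebSocket } from "bun";

type ClientData = { trialId: string };
const clients = new Map<string, Set<ServerWebSocket<ClientData>>>();

Bun.serve<ClientData>({
  port: 3001,
  async fetch(req, server) {
    const url = new URL(req.url);
    // Internal endpoint the Next.js server POSTs to after a change.
    if (url.pathname === "/internal/broadcast" && req.method === "POST") {
      const msg = await req.json();
      for (const ws of clients.get(msg.trialId) ?? []) {
        ws.send(JSON.stringify(msg));
      }
      return new Response("ok");
    }
    // Everything else is a client upgrade carrying the trial id.
    const trialId = url.searchParams.get("trialId") ?? "";
    if (server.upgrade(req, { data: { trialId } })) return;
    return new Response("expected WebSocket upgrade", { status: 400 });
  },
  websocket: {
    open(ws) {
      const set = clients.get(ws.data.trialId) ?? new Set();
      set.add(ws);
      clients.set(ws.data.trialId, set);
    },
    close(ws) {
      clients.get(ws.data.trialId)?.delete(ws);
    },
    message() {
      // Heartbeat, trial status, trial events, and ping land here.
    },
  },
});
\end{verbatim}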
\section{Plugin System}
Robot capabilities are defined in JSON plugin files hosted in a plugin repository. A plugin repository is a static file server (served by an nginx container on port~8080 in the default configuration) that exposes three resources:
\begin{description}
\item[\texttt{repository.json}.] Repository metadata including name, maintainers, trust level, supported ROS~2 distributions, and compatibility constraints.
\item[\texttt{plugins/index.json}.] An array of plugin filenames available in the repository.
\item[\texttt{plugins/\{name\}.json}.] Individual plugin files, one per robot platform.
\end{description}
When an administrator triggers a repository sync in the HRIStudio admin interface, the server fetches the repository metadata, retrieves the plugin index, and then fetches each plugin file. The action definitions from each plugin are stored as JSONB in the \texttt{hs\_robot\_plugin} database table, making them available to the experiment designer and the execution engine without further network requests.
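Conceptually, a sync reduces to a few fetches, as in this sketch; persistence into the database is elided, and the field names on the fetched objects are assumptions.
\begin{verbatim}
// Sketch of a plugin repository sync. The resource paths follow the
// layout above; the plugin field names are assumptions.
async function syncRepository(baseUrl: string) {
  const repo = await (await fetch(`${baseUrl}/repository.json`)).json();
  const index: string[] = await (
    await fetch(`${baseUrl}/plugins/index.json`)
  ).json();
  for (const file of index) {
    const plugin = await (await fetch(`${baseUrl}/plugins/${file}`)).json();
    // The real server upserts the action definitions into the
    // hs_robot_plugin table; here we only report what was found.
    console.log(`${repo.name}: ${file} (${plugin.actions?.length ?? 0} actions)`);
  }
}

await syncRepository("http://localhost:8080");
\end{verbatim}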
\subsection{Plugin File Structure}
Each plugin file is a self-contained description of a robot platform. The top-level fields include robot metadata (name, manufacturer, version, capabilities, physical specifications), a ROS~2 configuration block (namespace, default topics), and an array of action definitions. The official repository currently contains three plugins: \texttt{nao6-ros2.json}, \texttt{turtlebot3-burger.json}, and \texttt{turtlebot3-waffle.json}.
Each action definition specifies:
\begin{itemize}
\item \textbf{Metadata}: name, version, and a human-readable description of the platform.
\item \textbf{ROS configuration} (\texttt{ros2Config}): the bridge URL and any global connection parameters.
\item \textbf{Actions}: an array of action definitions. Each action specifies:
\begin{itemize}
\item A unique action type identifier (e.g., \texttt{speak}, \texttt{raise\_arm})
\item A human-readable label shown in the Design interface
\item A parameter schema defining the input fields the researcher configures
\item The target ROS topic and message type
\item A mapping from parameter names to message fields
\end{itemize}
\item A unique identifier (e.g., \texttt{say\_text}, \texttt{walk\_forward}, \texttt{play\_animation\_bow}).
\item A human-readable name and icon for display in the Design interface.
\item A parameter schema (JSON Schema format) defining the input fields the researcher configures.
\item A timeout and retry policy.
\item A ROS~2 dispatch block containing the target topic, message type, and a payload mapping.
\end{itemize}
When the server dispatches a robot command, it loads the active plugin, locates the matching action definition, constructs the ROS message by applying the parameter mapping, and sends it to the bridge. Adding a new robot means writing a new plugin file; no server code changes are required.
The payload mapping supports two modes. In \emph{static} mode, the plugin defines a fixed message template with placeholder tokens (e.g., \texttt{\{\{text\}\}}) that the execution engine fills from the researcher's parameters. In \emph{SSH} mode, the action bypasses ROS entirely and executes a shell command on the robot via SSH; this is used for NAOqi-native operations such as posture changes and animation playback that are not exposed as ROS~2 topics.
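In static mode, the substitution amounts to a recursive walk over the message template, roughly as sketched below; the template shape is illustrative.
\begin{verbatim}
// Sketch of static-mode payload mapping: {{param}} tokens in the
// plugin's message template are replaced with researcher values.
function fillTemplate(
  template: unknown,
  params: Record<string, string>,
): unknown {
  if (typeof template === "string") {
    return template.replace(
      /\{\{(\w+)\}\}/g,
      (_m: string, key: string) => params[key] ?? "",
    );
  }
  if (Array.isArray(template)) {
    return template.map((t) => fillTemplate(t, params));
  }
  if (template !== null && typeof template === "object") {
    return Object.fromEntries(
      Object.entries(template).map(([k, v]) => [k, fillTemplate(v, params)]),
    );
  }
  return template;
}

// Example: a speech action whose template targets a string topic.
fillTemplate({ data: "{{text}}" }, { text: "Hello." });
// -> { data: "Hello." }
\end{verbatim}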
The NAO6 plugin defines 20 actions across three categories: speech (say text, say with emotion), movement (walk forward/backward, turn, stop, wake up, rest, stand, sit, crouch), and animation (bow, wave, nod, head shake, shrug, enthusiastic gesture, and others). Movement actions publish ROS~2 Twist messages to the velocity command topic. Animation actions publish animation path strings to the animation topic. Posture and lifecycle commands use SSH mode to call NAOqi services directly via the \texttt{qicli} command-line tool.
\subsection{Adding a New Robot}
Adding support for a new robot platform requires writing a single JSON plugin file and placing it in the plugin repository. No changes to the HRIStudio server code are required. The plugin author defines the robot's capabilities, maps each action to a ROS~2 topic or SSH command, and specifies the parameter schema for each action. After the repository is synced, the new robot's actions appear in the experiment designer and can be used in any study.
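To make the structure concrete, the following is a hypothetical action definition written as a TypeScript object standing in for the JSON; every field name here is an assumption based on the description above, not the exact plugin schema.
\begin{verbatim}
// Hypothetical plugin action definition (TypeScript object standing
// in for the JSON file). Field names are illustrative.
const sayText = {
  id: "say_text",
  name: "Say Text",
  icon: "message-square",
  parameterSchema: {
    type: "object",
    properties: {
      text: { type: "string", description: "What the robot says" },
    },
    required: ["text"],
  },
  timeoutMs: 10000,
  retry: { attempts: 1 },
  dispatch: {
    mode: "static", // "ssh" would run a qicli command instead
    topic: "/speech",
    messageType: "std_msgs/msg/String",
    payload: { data: "{{text}}" },
  },
} as const;
\end{verbatim}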
\section{Database Schema}
The database schema is managed through Drizzle ORM and uses a consistent \texttt{hs\_} prefix for all tables. The schema is organized into five groups:
\begin{description}
\item[Authentication.] User accounts, sessions, and system role assignments.
\item[Study management.] Studies with status tracking, study membership with per-study roles, and participant records with consent tracking.
\item[Experimental design.] Experiments, steps, and actions. Each action stores its transport type, configuration, parameter schema, and retry policy as JSONB columns.
\item[Trial execution.] Trials with status and duration tracking, and a trial event log that records every action, step transition, and deviation with a timestamp.
\item[Robot integration.] Robot definitions and installed plugins with cached action definitions. A block registry maps visual blocks in the experiment designer to their underlying action types, parameter schemas, and display properties.
\end{description}
All tables use a consistent \texttt{hs\_} prefix (e.g., \texttt{hs\_study}, \texttt{hs\_trial}, \texttt{hs\_action}).
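As a sketch of the style, a table definition under this convention looks roughly like the following; the column names are illustrative rather than the actual schema.
\begin{verbatim}
// Sketch of one table in the described style (Drizzle ORM). Column
// names are assumptions; only the hs_ prefix and the JSONB usage
// follow the text.
import { jsonb, pgTable, text, timestamp, uuid } from "drizzle-orm/pg-core";

export const hsAction = pgTable("hs_action", {
  id: uuid("id").primaryKey().defaultRandom(),
  stepId: uuid("step_id").notNull(),
  type: text("type").notNull(), // e.g., "say_text"
  config: jsonb("config"), // plugin-specific parameters
  parameterSchema: jsonb("parameter_schema"),
  createdAt: timestamp("created_at").defaultNow().notNull(),
});
\end{verbatim}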
\section{Role-Based Access Control}
HRIStudio uses a two-layer role system. System roles (\texttt{systemRoleEnum}) govern what a user can do across the platform: \emph{administrator}, \emph{researcher}, \emph{wizard}, and \emph{observer}. Study roles (\texttt{studyMemberRoleEnum}) govern what a user can see and do within a specific study: \emph{owner}, \emph{researcher}, \emph{wizard}, and \emph{observer}. A user's system role and study role are checked independently, so a user who is a wizard on one study can be an observer on another without any additional configuration.
As described in Chapter~\ref{ch:implementation}, HRIStudio uses a two-layer role system. System roles are stored in the \texttt{systemRoleEnum} column: \emph{administrator}, \emph{researcher}, \emph{wizard}, and \emph{observer}. Study roles are stored in the \texttt{studyMemberRoleEnum} column: \emph{owner}, \emph{researcher}, \emph{wizard}, and \emph{observer}. The two layers are checked independently at the database level. On the server, tRPC middleware enforces access control: public procedures require no authentication, protected procedures require a valid session, and admin procedures additionally verify the user's system role. Study-level permissions are checked per-request by querying the \texttt{hs\_study\_member} table.
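The layering can be sketched as follows; the context shape and error choices are assumptions, while the three procedure tiers follow the text.
\begin{verbatim}
// Sketch of the access-control layering (tRPC). The context shape and
// role strings follow the text; the rest is assumed.
import { initTRPC, TRPCError } from "@trpc/server";

type Context = {
  session: { userId: string; systemRole: string } | null;
};

const t = initTRPC.context<Context>().create();

export const publicProcedure = t.procedure;

export const protectedProcedure = t.procedure.use(({ ctx, next }) => {
  if (!ctx.session) throw new TRPCError({ code: "UNAUTHORIZED" });
  // Narrow the session type for downstream resolvers.
  return next({ ctx: { ...ctx, session: ctx.session } });
});

export const adminProcedure = protectedProcedure.use(({ ctx, next }) => {
  if (ctx.session.systemRole !== "administrator") {
    throw new TRPCError({ code: "FORBIDDEN" });
  }
  return next();
});
\end{verbatim}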
[Binary files not shown: three images (80 KiB, 153 KiB, and 165 KiB) and the embedded PDF documents referenced in the appendices.]
+115 -16
@@ -33,17 +33,17 @@
publisher={ACM}
}
@article{Rietz2021,
@inproceedings{Rietz2021,
title={{WoZ4U: An Open-Source Wizard-of-Oz Interface for Human-Robot Interaction}},
author={Rietz, Frank and Bennewitz, Maren},
journal={Proceedings of the 16th ACM/IEEE International Conference on Human-Robot Interaction},
booktitle={Proceedings of the 16th ACM/IEEE International Conference on Human-Robot Interaction},
pages={95--103},
year={2021},
publisher={IEEE}
}
@inproceedings{Quigley2009,
title={{ROS: an open-source Robot Operating System}},
title={{ROS: An Open-Source Robot Operating System}},
author={Quigley, Morgan and Conley, Ken and Gerkey, Brian and Faust, Josh and Foote, Tully and Leibs, Jeremy and Wheeler, Rob and Ng, Andrew Y},
booktitle={IEEE International Conference on Robotics and Automation},
year={2009},
@@ -69,7 +69,7 @@ keywords = {systematic review, reporting guidelines, methodology, human-robot in
}
@inproceedings{Pettersson2015,
author = {{Pettersson, John S\"{o}ren and Wik, Malin}},
author = {Pettersson, John S\"{o}ren and Wik, Malin},
title = {{The longevity of general purpose Wizard-of-Oz tools}},
year = {2015},
isbn = {9781450336734},
@@ -79,7 +79,7 @@ url = {https://doi.org/10.1145/2838739.2838825},
doi = {10.1145/2838739.2838825},
abstract = {The Wizard-of-Oz method has been around for decades, allowing researchers and practitioners to conduct prototyping without programming. An extensive literature review conducted by the authors revealed, however, that the re-usable tools supporting the method did not seem to last more than a few years. While generic systems start to appear around the turn of the millennium, most have already fallen out of use.Our interest in undertaking this review was inspired by the ongoing re-development of our own Wizard-of-Oz tool, the Ozlab, into a system based on web technology.We found three factors that arguably explain why Ozlab is still in use after 15 years instead of the two-three years lifetime of other generic systems: the general approach used from its inception; its inclusion in introductory HCI curricula, and the flexible and situation-dependent design of the wizard's user interface.},
booktitle = {Proceedings of the Annual Meeting of the Australian Special Interest Group for Computer Human Interaction},
pages = {422426},
pages = {422--426},
numpages = {5},
keywords = {Wizard user interface, Wizard of Oz, Software Sustainability, Non-functional requirements, GUI articulation},
location = {Parkville, VIC, Australia},
@@ -136,7 +136,7 @@ series = {OzCHI '15}
@INPROCEEDINGS{Pot2009,
author={Pot, E. and Monceaux, J. and Gelin, R. and Maisonnier, B.},
booktitle={RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication},
booktitle={Proceedings of the 18th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2009)},
title={Choregraphe: a graphical tool for humanoid robot programming},
year={2009},
volume={},
@@ -156,7 +156,7 @@ series = {OzCHI '15}
@inproceedings{Steinfeld2009,
author = {Steinfeld, Aaron and Jenkins, Odest Chadwicke and Scassellati, Brian},
title = {{The oz of wizard: simulating the human for interaction research}},
title = {{The Oz of Wizard: Simulating the Human for Interaction Research}},
year = {2009},
isbn = {9781605582934},
publisher = {Association for Computing Machinery},
@@ -167,7 +167,7 @@ series = {OzCHI '15}
@inproceedings{Gibert2013,
author = {Gibert, Guillaume and Petit, Morgan and Lance, Frederic and Pointeau, Gregoire and Dominey, Peter F.},
title = {{What makes human so different? Analysis of human-humanoid robot interaction with a super wizard of oz platform}},
title = {{What Makes Humans So Different? Analysis of Human-Humanoid Robot Interaction with a Super Wizard of Oz Platform}},
year = {2013},
booktitle = {Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
pages = {931--938},
@@ -176,7 +176,7 @@ series = {OzCHI '15}
@article{Strazdas2020,
author = {Strazdas, Daniel and Hintz, Jonathan and Felßberg, Anna Maria and Al-Hamadi, Ayoub},
title = {{Robots and wizards: An investigation into natural humanrobot interaction}},
title = {{Robots and Wizards: An Investigation into Natural Human--Robot Interaction}},
journal = {IEEE Access},
volume = {8},
pages = {218808--218821},
@@ -186,7 +186,7 @@ series = {OzCHI '15}
@inproceedings{Helgert2024,
author = {Helgert, Anna and Straßmann, Christopher and Eimler, Sabine C.},
title = {{Unlocking potentials of virtual reality as a research tool in human-robot interaction: A wizard-of-oz approach}},
title = {{Unlocking Potentials of Virtual Reality as a Research Tool in Human-Robot Interaction: A Wizard-of-Oz Approach}},
year = {2024},
booktitle = {Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction},
pages = {123--132},
@@ -208,12 +208,111 @@ abstract="TypeScript is an extension of JavaScript intended to enable easier dev
isbn="978-3-662-44202-9"
}
% fix below to read: J. Brooke, “SUS: A Quick and Dirty Usability Scale,” CRC Press, 1996, pp. 207--212. doi: 10.1201/9781498710411-35
@article{Brooke1996,
author = {Brooke, John},
year = {1995},
month = {11},
pages = {},
title = {SUS: A quick and dirty usability scale},
volume = {189},
journal = {Usability Eval. Ind.}
year = {1996},
pages = {207--212},
title = {{SUS: A Quick and Dirty Usability Scale}},
publisher = {CRC Press},
doi = {10.1201/9781498710411-35}
}
@article{HoffmanZhao2021,
author = {Hoffman, Guy and Zhao, Xuan},
title = {A Primer for Conducting Experiments in Human--Robot Interaction},
journal = {ACM Transactions on Human-Robot Interaction},
volume = {10},
number = {3},
articleno = {14},
year = {2021},
doi = {10.1145/3412374}
}
@misc{HRIStudioRepo,
author = {O'Connor, Sean},
title = {{HRIStudio: A Web-Based Wizard-of-Oz Platform for Human-Robot Interaction Research}},
howpublished = {GitHub repository},
year = {2026},
url = {https://github.com/soconnor0919/hristudio}
}
@misc{RobotPluginsRepo,
author = {O'Connor, Sean},
title = {{HRIStudio Robot Plugins Repository}},
howpublished = {GitHub repository, maintained as a submodule of HRIStudio},
year = {2026},
url = {https://github.com/soconnor0919/robot-plugins}
}
@misc{NaoWorkspaceRepo,
author = {O'Connor, Sean},
title = {{nao-workspace: A Containerized Choregraphe Development Environment}},
howpublished = {GitHub repository},
year = {2026},
url = {https://github.com/soconnor0919/nao-workspace}
}
@misc{NaoIntegrationRepo,
author = {O'Connor, Sean},
title = {{nao6-hristudio-integration: ROS/NAOqi Bridge for HRIStudio}},
howpublished = {GitHub repository},
year = {2026},
url = {https://github.com/soconnor0919/nao6-hristudio-integration}
}
@misc{Anthropic2024Claude,
author = {{Anthropic}},
title = {{Claude 3.5 Sonnet}},
howpublished = {Large Language Model},
year = {2024},
url = {https://www.anthropic.com/claude}
}
@misc{AnthropicClaudeCode,
author = {{Anthropic}},
title = {{Claude Code}},
howpublished = {Agentic CLI Developer Tool},
year = {2025},
url = {https://www.anthropic.com/claude-code}
}
@misc{OpenCode,
author = {{SST}},
title = {{OpenCode}},
howpublished = {Open-source AI Coding Agent},
year = {2024},
url = {https://opencode.ai}
}
@misc{GeminiCLI,
author = {{Google}},
title = {{Gemini CLI}},
howpublished = {Agentic CLI Developer Tool},
year = {2025},
url = {https://github.com/google-gemini/gemini-cli}
}
@misc{GoogleAntigravity,
author = {{Google}},
title = {{Antigravity}},
howpublished = {Integrated Agentic Development Environment},
year = {2025},
url = {https://antigravity.google}
}
@misc{CursorEditor,
author = {{Anysphere}},
title = {{Cursor Code Editor}},
howpublished = {AI-Native Code Editor},
year = {2023},
url = {https://cursor.com}
}
@misc{ZedEditor,
author = {{Zed Industries}},
title = {{Zed Code Editor}},
howpublished = {High-performance Code Editor with AI Integration},
year = {2023},
url = {https://zed.dev}
}
+31 -6
View File
@@ -1,13 +1,16 @@
%\documentclass{buthesis} %Gives author-year citation style
\documentclass[numbib, twoadv]{buthesis} %Gives numerical citation style
\documentclass[numbib]{buthesis} %Gives numerical citation style
%\documentclass[twoadv]{buthesis} %Allows entry of second advisor
%\usepackage{graphics} %Select graphics package
\usepackage{graphicx} %
\usepackage{pdfpages} %For including PDF pages in appendices
\usepackage{subcaption} %For sub-figures with captions
%\usepackage{amsthm} %Add other packages as necessary
\usepackage{array} %Extended column types and \arraybackslash
\usepackage{makecell} %Multi-line table header cells
\usepackage{tabularx} %Auto-width table columns
\usepackage{tikz} %For programmatic diagrams
\usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing}
\usetikzlibrary{shapes,arrows,positioning,fit,backgrounds,decorations.pathreplacing,calc}
\usepackage[
hidelinks,
linktoc=all,
@@ -18,14 +21,28 @@
\butitle{A Web-Based Wizard-of-Oz Platform for Collaborative and Reproducible Human-Robot Interaction Research}
\author{Sean O'Connor}
\degree{Bachelor of Science}
\department{Computer Science}
\department{Computer Science and Engineering}
\advisor{L. Felipe Perrone}
\advisorb{Brian King}
\honorscouncilrep{Abigail Kopec}
\chair{Alan Marchiori}
\maketitle
% \maketitle
\includepdf[pages=-,pagecommand={}]{pdfs/CoverPage-Signed.pdf}
\frontmatter
\acknowledgments{
(Draft Acknowledgments)
\begin{spacing}{1.3}
{\setlength{\parskip}{0.1in}
I owe a great deal to my advisor, Felipe Perrone. From my first year at Bucknell,
he gave me the resources and the space to learn on my own terms, and has supported me at every turn --- inside the classroom and far beyond it. The many credits I spent in his courses; the countless hours spent meeting, reading, and rereading drafts together; the two papers we wrote and the trips we took to present them; all of it built the skills and the judgment that this thesis required, and much more besides. I could not have done this without him.
I also thank Professor Brian King, whose encouragement of curiosity and discovery has stayed with me, and whose courses have given me the technical foundation to take on projects with confidence and lead them well. I am glad to have him on my committee.
I thank Professor Kopec for serving as the Honors Council representative on my thesis committee, and I thank the six Bucknell faculty members who volunteered their time to participate in the pilot study that made this evaluation possible.
My parents have supported me throughout my time at Bucknell and in everything that came before it. This is for them, and for my grandfather, whose journey here made mine possible.
}
\end{spacing}
}
\tableofcontents
@@ -35,7 +52,13 @@
\listoffigures
\abstract{
[Abstract goes here]
\begin{spacing}{1.3}
{\setlength{\parskip}{0.1in}
The Wizard-of-Oz (WoZ) technique is widely used in Human-Robot Interaction (HRI) research, but two persistent problems limit its effectiveness: existing tools impose technical barriers that exclude non-engineering domain experts (the Accessibility Problem), and the fragmented landscape of robot-specific implementations makes interaction scripts difficult to port across platforms (the Reproducibility Problem --- concerning execution consistency and portability, not third-party replication). Through a literature review, I identified three design principles to address both: a hierarchical specification model, an event-driven execution model, and a plugin architecture that decouples experiment logic from robot-specific implementations. I realized these principles in HRIStudio, an open-source, web-based platform providing a visual experiment designer, a guided wizard execution interface, automated timestamped logging with deviation tracking, and role-based access control.
I evaluated HRIStudio in a pilot between-subjects study (N=6) against Choregraphe, the standard programming tool for the NAO robot. HRIStudio wizards achieved higher design fidelity, execution reliability, and perceived usability across all six sessions; the only unprompted specification deviation in the dataset occurred in the Choregraphe condition. While the pilot scale precludes inferential claims, the directional evidence across all measures supports the position that a tool built to realize the identified design principles can have a significant impact on accessibility and reproducibility in WoZ-based HRI research.
}
\end{spacing}
}
\mainmatter
@@ -70,7 +93,9 @@
\makeatletter\@mainmattertrue\makeatother
\appendix
\include{chapters/app_blank_templates}
\include{chapters/app_materials}
\include{chapters/app_tech_docs}
\include{chapters/app_ai_development}
\end{document}