How to Build an AI Assistant That Understands Wearable Health Data

Author: Bartosz Michalak
Published: May 15, 2026
Last update: May 15, 2026

Key Takeaways

  • The AI health assistants buyers ask for almost always do three things: surface patterns the user cannot see, give concrete recommendations based on current data, and forecast where the trajectory is heading. All three depend on normalized longitudinal health data and a reasoning layer that can read it.
  • The data layer is the hardest part to get right. Without per-user, multi-source, normalized wearable data, the assistant either invents patterns or hedges every recommendation. Open Wearables solves the data layer; the assistant sits on top.
  • The reasoning layer is a multi-agent system, not a single LLM call. A router classifies intent, specialized tools fetch what the agent needs, guardrails moderate output, and an optional translator handles language. The architecture matters more than the model choice.
  • Cloud LLMs are the default for most healthtech products. On-prem open-source models are the right call when the regulator does not allow PHI to leave the perimeter, with a meaningful quality trade-off that should be scoped explicitly.

Introduction

The phrase "AI health assistant" covers a wide range of products, from a chatbot that paraphrases yesterday's sleep data to a clinical decision-support system that flags atrial fibrillation patterns. The buyers we talk to are usually somewhere in the middle: a product layer that turns raw biometric signals into something a member, patient, or coach can act on.

Across recent enterprise conversations, the brief that buyers articulate most clearly is built on three functions. The assistant should produce meaning by connecting data points the user cannot see. It should guide by giving concrete recommendations based on current data and individual profile. It should predict by forecasting the long-term impact of daily trends. The exact wording varies; the underlying ask does not.

This post walks through what it takes to build that kind of assistant on top of a wearable data stack. It covers the three-function value framing buyers describe, the architectural layers that make it work, and the production trade-offs that distinguish a useful assistant from a demo that survives one focus group. It is written for engineering leads scoping a build and product owners deciding what is realistic to ship.

Open Wearables handles the wearable data layer. Momentum builds the reasoning layer, the surface layer, and the operational infrastructure around it, end to end.

What an AI Health Assistant Should Actually Do

Strip the marketing language away and the buyer brief almost always lands on three functions.

Producing meaning. Make patterns visible that the user cannot reasonably notice themselves. "Your resting heart rate has been three to four beats higher than your fourteen-day average all week, and it tracks with the late-evening workouts on your calendar." The assistant is not telling the user something new. It is telling them something they already had in their data but could not see from a chart.

Guiding. Translate the pattern into a concrete next action calibrated to who the user is. "Given your training load this week and your recovery score trend, consider a Zone 2 session today instead of the strength block you have planned, and aim for ten o'clock lights out." The recommendation is grounded in the user's own data and the user's own goals, not a generic playbook.

Predicting. Forecast where the current trajectory leads. "If sleep efficiency stays in this range for another two weeks, your resilience score is projected to drop into the at-risk band, with implications for performance and immune function." This is the function that turns an analytics product into something members keep coming back to.

The three functions are not separate features. They are the same reasoning pipeline applied to different time horizons: meaning is about the last few days, guidance is about today, prediction is about the next few weeks. The same data and the same architecture serve all three. What changes is the prompt, the tools the agent calls, and the surface the result lands on.
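
One way to picture that shared pipeline is as a single run configuration that varies only by horizon, prompt, and tool budget. A minimal sketch — every name here is hypothetical, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class FunctionConfig:
    """Hypothetical per-function configuration for the shared pipeline."""
    horizon_days: int      # how far back (or forward) the reasoning looks
    prompt_template: str   # the prompt shape for this function
    tools: list[str]       # tool names the agent may call on this run

# Illustrative only: the three functions as variations of one pipeline.
PIPELINE = {
    "producing_meaning": FunctionConfig(
        horizon_days=14,
        prompt_template="Surface the most notable deviation from baseline...",
        tools=["get_recent_sleep_summary", "compare_to_baseline"],
    ),
    "guiding": FunctionConfig(
        horizon_days=1,
        prompt_template="Given today's data and the user's goals, recommend one action...",
        tools=["get_recent_sleep_summary", "get_user_profile"],
    ),
    "predicting": FunctionConfig(
        horizon_days=14,  # projecting forward rather than looking back
        prompt_template="Project where the current trend leads over two weeks...",
        tools=["get_recent_sleep_summary", "project_trend"],
    ),
}
```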

This is the level at which the buyer conversation usually happens. The architecture conversation comes next.

Why This Is Hard Without a Normalized Data Layer

The reasoning layer can only be as smart as the data it can read.

Most attempts at an AI health assistant that we see in the wild fail at the data layer first. A team integrates Apple Health for iOS users and Garmin for the power-user cohort, builds a chatbot prompt around the raw vendor data shapes, and ships a demo that works for one provider on one platform. The first member who switches devices, syncs late, or runs both an Oura ring and a Garmin watch breaks the reasoning. The assistant either fabricates an answer or refuses to commit to one.

The data layer problems compound. Different providers expose different sleep stage definitions and different heart-rate variability metrics. The same user can have overlapping signals from multiple sources. Apple HealthKit on iOS has a minimum sync cadence that means "real-time" is fifteen minutes to an hour. Devices go offline. Users uninstall and reinstall apps. The data the assistant needs to be smart is structurally messy.

Open Wearables addresses this layer by normalizing data from supported providers into one unified schema. The assistant reads from one query path regardless of which provider the data came from. Open Wearables also ships an MCP server that exposes user health data to LLM agents in a structured form, so the assistant does not need to know provider mechanics, only the user it is reasoning about.

That normalization is the precondition for the assistant being useful at all. Without it, every step downstream gets harder.
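
To make the normalization concrete: the reasoning layer sees something shaped like the record below, whichever provider produced it. The field names are illustrative, not the actual Open Wearables schema.

```python
# Illustrative only — field names are hypothetical, not the actual
# Open Wearables schema. The point is one shape per signal, regardless
# of which provider the data came from.
normalized_sleep_record = {
    "user_id": "usr_123",
    "source_provider": "garmin",   # or "oura", "apple_health", ...
    "date": "2026-05-14",
    "duration_minutes": 432,
    "efficiency_pct": 88.5,        # one efficiency definition across vendors
    "awakenings": 3,
    "hrv_rmssd_ms": 52.0,          # one HRV convention (RMSSD) across vendors
    "synced_at": "2026-05-15T06:41:00Z",
}
```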

The Architecture in Three Layers

A production AI health assistant has three layers. Most teams under-invest in one of them, usually the middle.

Layer 1: Data

The data layer is the wearable ingestion plus normalization plus longitudinal storage that turns raw vendor payloads into something an LLM can reason about. Open Wearables covers this layer: OAuth across providers, webhook ingestion, schema normalization, longitudinal storage, and the MCP server that exposes the result to agents.

The output of this layer is a structured per-user view: their sleep, their HRV, their training load, their activity, their score outputs, in a consistent shape across time and across providers. The assistant reads from this view, not from raw vendor APIs.

Layer 2: Reasoning

The reasoning layer is where most teams reach for "let me just call GPT-4" and discover that a single LLM call is not enough. Production AI health assistants are multi-agent systems, with each component doing one thing well.

A typical reasoning pipeline looks like this:

  1. Router. A small classifier model that reads the incoming user message (or trigger event, in proactive surfaces) and decides what kind of question it is. Simple lookups go to one path. Complex multi-step reasoning goes to the main agent. Out-of-scope or unsafe content gets refused with a templated message before any reasoning runs (a minimal routing sketch follows this list).
  2. Main agent. A reasoning model in a ReAct loop (or equivalent agent framework) that handles the actual question. The agent has access to a curated set of tools and can call them multiple times in one query.
  3. Specialist tools. Each tool is a domain-bounded sub-component the main agent can invoke:
    • Biometric query tools that fetch the user's recent sleep, HRV, training load, recovery score, and other normalized signals from Open Wearables.
    • Retrieval tools backed by a vector database, holding the medical literature, care guidelines, product content, or domain knowledge the assistant grounds its answers in.
    • Function tools for deterministic operations, including date arithmetic, threshold comparisons, and percentile lookups, so the LLM does not have to do math in natural language.
    • Specialist sub-agents for high-stakes domains, each prompted and toolset-restricted to its specialty (sleep, training, nutrition, recovery, mental wellness).
  4. Guardrails. Output moderation that checks the final answer for safety, format, tone, and language. The assistant should refuse to give medical advice it is not qualified to give, format the output to the surface it is rendering on, and stay in the voice the brand wants.
  5. Translator (optional). A small model that handles language localization when the assistant operates in a non-English market. Reasoning happens in the agent's strongest language. The output gets translated to the user's language at the surface boundary.
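
A minimal sketch of the routing step — the intent labels and refusal copy are illustrative, and the three callables stand in for a small classifier model, the cheap lookup path, and the full agent:

```python
from enum import Enum
from typing import Callable

class Intent(Enum):
    SIMPLE_LOOKUP = "simple_lookup"          # "how did I sleep last night?"
    COMPLEX_REASONING = "complex_reasoning"  # "why is my recovery trending down?"
    OUT_OF_SCOPE = "out_of_scope"            # unsafe or off-topic content

REFUSAL = ("I can't help with that, but I can answer questions about "
           "your sleep, recovery, and training data.")

def route(message: str,
          classify: Callable[[str], Intent],
          lookup: Callable[[str], str],
          main_agent: Callable[[str], str]) -> str:
    """Dispatch before any expensive reasoning runs."""
    intent = classify(message)
    if intent is Intent.OUT_OF_SCOPE:
        return REFUSAL             # templated refusal; no reasoning spent
    if intent is Intent.SIMPLE_LOOKUP:
        return lookup(message)     # cheap path: direct data fetch
    return main_agent(message)     # full multi-tool reasoning loop
```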

A simplified tool definition for the assistant might look like this (Python, pseudo-code, agent-framework-agnostic):

```python
# Pseudo-code: agent_tool, ow_client, SleepSummary, and Comparison stand in
# for your agent framework's tool decorator, the Open Wearables API client,
# and the structured return types the agent reasons over.
@agent_tool
def get_recent_sleep_summary(user_id: str, days: int = 14) -> SleepSummary:
    """
    Returns the user's sleep summary over the last N days from Open Wearables,
    including average duration, efficiency, awakenings, and HRV-during-sleep.
    Calls the OW REST API and returns a structured object the agent can reason over.
    """
    return ow_client.get_sleep_summary(user_id=user_id, days=days)

@agent_tool
def compare_to_baseline(value: float, baseline: float, metric: str) -> Comparison:
    """
    Compares a current value to a personal baseline and returns a structured
    comparison (delta as percentage, direction, magnitude category).
    Deterministic; ensures the agent does not do math in natural language.
    """
    return Comparison.from_values(value, baseline, metric)
```

The main agent reads the user message, decides which tools to call in what order, observes the results, and reasons over them to produce the answer. This is the ReAct pattern: reason, act, observe, repeat, until the agent can answer.
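
Stripped to its skeleton, the loop might look like this — a minimal sketch in which llm_step stands in for the reasoning-model call; real agent frameworks layer planning, retries, and token budgets on top:

```python
from typing import Callable

def react_loop(question: str,
               tools: dict[str, Callable],
               llm_step: Callable,
               max_steps: int = 8) -> str:
    """Minimal reason-act-observe loop. llm_step is a stand-in that returns
    either ("final", answer) or ("call", tool_name, kwargs)."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        decision = llm_step("\n".join(transcript))  # reason over the transcript
        if decision[0] == "final":
            return decision[1]                      # agent can answer; stop
        _, tool_name, kwargs = decision             # act: agent picked a tool
        observation = tools[tool_name](**kwargs)    # observe the tool result
        transcript.append(f"{tool_name}({kwargs}) -> {observation}")
    return "I don't have enough data to answer that confidently."
```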

Layer 3: Surface

The surface layer is how the user actually encounters the assistant in the product. Most healthtech buyers we talk to are not asking for a chatbot first. They are asking for:

  • Insight cards the member sees when they open the app, surfaced from the assistant's last reasoning run.
  • Daily or weekly summaries delivered as in-app content or email, written by the assistant against the user's data.
  • Notifications that fire when the assistant detects a pattern worth surfacing (overnight HRV dropped meaningfully, recovery trending toward at-risk band).

A conversational chatbot is the fourth surface, often added later. Done well, it is a power-user feature for members who want to ask follow-up questions about the insights they have already seen. Done badly, it is the only surface, leaving 90% of members who never tap chat with nothing.

The decision of which surfaces to support shapes the reasoning layer: a proactive insight card needs the assistant to run on a schedule, against all active users, deciding which ones have something worth surfacing. A reactive chatbot only runs when a user types. The two patterns have different cost profiles and different observability needs.
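
The proactive pattern, sketched under assumptions — the cheap pre-filter and the helper names are hypothetical, but the shape is the point: only spend an agent run on users whose data has earned one:

```python
def morning_insight_run(user_ids,
                        fetch_summary,    # normalized per-user view, one query path
                        worth_surfacing,  # cheap, deterministic pre-filter
                        run_agent,        # full reasoning pipeline
                        publish_card):    # renders the result as an insight card
    """Scheduled pass (e.g. a daily cron) over all active users."""
    for user_id in user_ids:
        summary = fetch_summary(user_id)
        if not worth_surfacing(summary):
            continue                  # nothing notable today; no tokens spent
        insight = run_agent(user_id, summary)
        publish_card(user_id, insight)
```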

A Worked Example: Producing Meaning, Guiding, Predicting

Take one member, one day, and walk the three functions through.

The data layer has the last fourteen days of this member's data: sleep duration, sleep efficiency, awakenings, RMSSD HRV during sleep, daily steps, active calories, training sessions logged, resting heart rate, and a Sleep Score plus Resilience Score computed on top by the scoring layer.

The reasoning layer runs the assistant against this user every morning at 7:30 AM local time.

Producing meaning. The agent reads the recent data, compares to the user's fourteen-day baseline, and identifies the pattern that stands out. Output: "Your overnight HRV has trended below baseline for three nights running. It tracks with the high-intensity sessions you logged on Monday and Wednesday."

Guiding. Given the pattern, the agent generates a concrete recommendation grounded in the user's profile (age, fitness level, training goal). Output: "Consider a Zone 2 session today, target heart rate one hundred fifteen to one hundred thirty beats per minute, sixty to ninety minutes. Skip the strength block. Aim to be in bed by ten."

Predicting. The agent projects the trajectory based on the current trend. Output: "If sleep efficiency stays in this range and training load does not drop, your Resilience Score is projected to fall from seventy-eight to the at-risk band (below sixty-five) by next Monday."

All three outputs render as one insight card in the app at 7:30 AM. The member opens the app, sees the card, and acts on it (or does not). The assistant logs the reasoning and the recommendation against the user's record for the audit trail and for evaluation later.

That is the full loop. The data layer (Open Wearables), the scoring layer (cohort-calibrated by the data science team), the reasoning layer (multi-agent system with router, tools, guardrails), and the surface layer (insight card, daily summary, optional notification).

Where This Gets Harder Than the Demo

Five places where AI health assistants accumulate failure modes in production.

Stale or missing data. Users miss syncs. Devices go offline. The assistant has to decide between giving a recommendation based on stale data, refusing to comment, or asking the user to sync. The right behavior depends on the use case and the freshness threshold. Hard-code this decision into the agent's tool layer, not into the prompt.
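
In code, that decision can live in the tool wrapper itself. A minimal sketch — the threshold and the status shape are illustrative, and fetch stands in for the normalized-data query:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=24)  # illustrative; tune per use case and surface

def get_sleep_summary_guarded(user_id: str, fetch, max_staleness=MAX_STALENESS):
    """Tool-layer freshness guard: the agent receives an explicit staleness
    signal instead of silently reasoning over outdated numbers."""
    summary = fetch(user_id)  # assumes summary["synced_at"] is a tz-aware datetime
    age = datetime.now(timezone.utc) - summary["synced_at"]
    if age > max_staleness:
        return {"status": "stale",
                "last_sync_hours_ago": round(age.total_seconds() / 3600, 1)}
    return {"status": "fresh", "data": summary}
```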

Hallucination on edge cases. When the data is unusual (a marathon runner, a frequent traveler, a member with chronic sleep apnea), generic prompt language produces generic answers that miss the edge case. Cohort-specific prompting and cohort-specific thresholds in the scoring layer reduce this, but never to zero. Guardrails should catch the most dangerous classes (medication advice the assistant is not qualified to give, diagnostic claims, anything that crosses into regulated medical practice) before the output ships.

Latency on real-time surfaces. A morning insight card can take ten seconds to generate without anyone noticing. A push notification triggered by a webhook event has a tighter budget. Multi-agent reasoning pipelines accumulate latency at every step (router, main agent, multiple tool calls, guardrails, optional translator). Profile this end to end and decide which surfaces tolerate which latencies.

Multilingual tone drift. Reasoning in one language and rendering in another is the right architecture, but the translator layer is where brand voice degrades if it is not engineered with care. Generic translation produces stilted output. Brand-tuned prompt scaffolding for the translator keeps the assistant sounding like the product, not like Google Translate.

Cohort calibration. A twenty-percent HRV drop means something different for a twenty-five-year-old athlete than for a sixty-year-old cardiac rehab patient. Generic thresholds produce alert fatigue. Cohort-calibrated thresholds produce alerts members trust. This is the data science team's work, not the agent's.
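
A toy version of what cohort calibration looks like in the scoring layer — the numbers below are invented for illustration, not clinical thresholds:

```python
# Invented numbers, for illustration only — not clinical values.
HRV_DROP_ALERT_PCT = {
    ("athlete", "18-35"): 25.0,        # athletes swing more; alert later
    ("general", "18-35"): 20.0,
    ("cardiac_rehab", "55+"): 10.0,    # tighter band for a sensitive cohort
}

def should_alert(hrv_drop_pct: float, cohort: str, age_band: str) -> bool:
    """Same signal, cohort-specific threshold; falls back conservatively
    when a cohort has not been calibrated yet."""
    return hrv_drop_pct >= HRV_DROP_ALERT_PCT.get((cohort, age_band), 15.0)
```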

The five failure modes share a pattern. They are all places where the demo version works because the demo's data is curated and the demo's user is the team. Production data is messy and production users are unpredictable. The architecture absorbs that gap if it is designed for it.

Deployment Posture: Cloud LLM by Default

For most healthtech products, the cloud LLM providers (OpenAI, Anthropic, Google) are the right starting point. The quality of the reasoning, the speed of model iteration, and the maturity of the tooling around them are well ahead of self-hosted alternatives.

The compliance question is rarely "cloud vs on-prem" outright. It is a ladder of safety measures, escalated only when each tier proves insufficient for the customer's regulator and risk posture.

Tier 1: Cloud LLM with controlled PHI exposure. The user health data flows to the LLM provider through a tightly controlled tool layer (the agent's biometric query tools), under a BAA or its regional equivalent. The agent only ever reads the data points it needs for the reasoning step, never an open dump of the user's record. For many healthtech products, this tier is enough.

Tier 2: Anonymization and tokenization before the LLM call. When the regulator pushes back on raw PHI flowing to a cloud provider, the next move is to anonymize or tokenize the data inside the tool layer before it hits the model. The LLM reasons over pseudonymous identifiers and structured biometric values without seeing the user's name, date of birth, or anything else that maps the values back to a real person. The outputs come back into your product, get re-associated with the user, and render in the surface layer. Done well, this satisfies most regional regulators while preserving the quality of the cloud model.
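
Before moving up the ladder, a minimal sketch of the tier-2 pattern — the token scheme and the in-memory mapping are illustrative; in production the mapping lives in an encrypted store inside your perimeter:

```python
import secrets

_token_to_user: dict[str, str] = {}  # stand-in for an encrypted mapping store

def pseudonymize(user_id: str) -> str:
    """Issue an opaque token the LLM can see; the mapping never leaves you."""
    token = f"member_{secrets.token_hex(8)}"
    _token_to_user[token] = user_id
    return token

def build_llm_payload(user_id: str, biometrics: dict) -> dict:
    """Only structured biometric values and a pseudonymous ID reach the model;
    name, date of birth, and anything re-identifying stay inside the perimeter."""
    return {"subject": pseudonymize(user_id), "biometrics": biometrics}

def reassociate(token: str) -> str:
    """Back inside your product, map the model's output to the real user."""
    return _token_to_user[token]
```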

Tier 3: Private-cloud LLM under BAA. When even tokenized cloud LLM calls fall outside the customer's risk appetite, the next tier is a private-cloud LLM service with a healthcare-grade contractual envelope (AWS Bedrock, Azure OpenAI, GCP Vertex AI). The reasoning model still runs in a hyperscaler environment, but in a tenant isolated for healthcare use, with logging, retention, and data-handling commitments suitable for PHI workloads.

Tier 4: On-prem open-source LLM. Only when tiers 1 through 3 have been tested and deemed insufficient. The open-source models that run on commodity GPU hardware (Llama-class, Mistral-class, Qwen-class) are good and improving fast, but they are not yet at parity with the best cloud models for healthcare-grade reasoning. The trade-off is significant: weaker reasoning, more prompt-engineering work to compensate, GPU infrastructure to procure and operate, slower iteration as the open-source ecosystem catches up.

The right answer almost always lives in tiers 1 through 3. Jumping straight to on-prem because the brief reads "no cloud" without walking the ladder usually means committing to expensive GPU infrastructure to solve a problem that tokenization could have solved at a fraction of the cost.

When on-prem is genuinely required, the architecture above stays the same. Only the model deployment changes. The router, the main agent, the tools, the guardrails, and the translator all work on either deployment shape.

What Momentum Adds On Top of Open Wearables

Open Wearables gives the assistant a normalized data layer to read from. The assistant itself is product-engineering work that sits above the platform.

When you bring Momentum in to build an AI health assistant on top of Open Wearables, the engagement covers:

The reasoning layer, end to end. Router design, main agent prompting and tool definition, specialist sub-agents for the domains your product needs, guardrails calibrated to your tone and compliance constraints, optional translator for non-English markets.

The scoring layer the assistant interprets. Anna Zych's data science team builds the score models (sleep quality, recovery readiness, resilience, condition-specific composites) the assistant reads as features. Without cohort-calibrated scores, the assistant has nothing useful to comment on.

Prompt engineering for the three functions. Producing meaning, guiding, and predicting each take a different prompt shape, a different tool budget, and different guardrails. We build all three against your buyer's actual use cases, not generic templates.

The surface layer. Insight cards, daily summaries, notifications, optional chatbot. The product engineering inside your app to render and refresh the assistant's outputs.

Production observability. Every agent run is logged with the tools it called, the data it read, the prompt context, the output, and the user's reaction. This is what makes iteration possible. Without it, the assistant is a black box and improvements are guesswork.

Compliance posture. HIPAA, GDPR, KVKK, BAA signing, audit log retention to the level your regulator expects. Standard for our healthcare client work.

Ongoing iteration. AI assistants are not ship-and-forget. The prompts, the tool definitions, the guardrails, and the scoring all evolve as the cohort data grows and the product use cases sharpen. We run that iteration as part of the managed engagement.

Momentum delivers this as a managed engagement: Open Wearables and the assistant maintained on our side, the product engineering inside your app delivered to your team, under SLA.

Talk to Momentum

If you are building a wearable-driven product where an AI health assistant is part of the value, Momentum runs this for client teams in health, wellness, longevity, and clinical software.

Two engagement shapes:

Managed Open Wearables plus AI assistant. We deploy and operate Open Wearables on your infrastructure, build the reasoning and surface layers in your product, and run the iteration loop. Fixed-cost setup plus a maintenance retainer. No per-user fees.

Custom wearable software development. When the AI assistant is part of a larger product build (clinical workflow, corporate wellness platform, longevity product), we build the whole thing. Open Wearables and the assistant infrastructure are included in scope.

Both start with a Discovery Workshop, and you leave with a scope and a timeline.

Frequently Asked Questions

What does an AI health assistant actually do with wearable data?
A production AI health assistant does three things: it surfaces patterns the user cannot spot on their own (producing meaning), translates those patterns into concrete recommendations calibrated to that user's profile (guiding), and projects where the current trend leads over the next weeks (predicting). All three functions run on the same normalized longitudinal data and the same multi-agent reasoning pipeline. What changes is the time horizon addressed and the prompt shaping the output.
Why can't I just pass raw wearable API data directly to an LLM?
Raw wearable API data is structurally inconsistent across providers. Different vendors use different sleep stage definitions, different HRV metrics, and different sync cadences. A user running both an Oura ring and a Garmin watch has overlapping signals in incompatible formats. Without normalization, the LLM either fabricates answers to fill gaps or refuses to commit to any recommendation. Open Wearables normalizes data from all supported providers into one unified schema before the reasoning layer ever sees it.
What is the multi-agent architecture for an AI health assistant?
A production AI health assistant is not a single LLM call. It is a pipeline with distinct components: a router that classifies the incoming message and directs it to the right path; a main agent running a ReAct loop (reason, act, observe, repeat) with access to a curated toolset; and specialist tools, including biometric query tools, retrieval tools backed by a vector database, deterministic function tools, and optional specialist sub-agents for domains like sleep or nutrition. A guardrails layer checks every output before it reaches the user. An optional translator handles non-English markets. The architecture matters more than the model choice.
What surfaces can an AI health assistant render on?
Most healthtech products lead with insight cards (shown when the user opens the app), daily or weekly summaries (delivered in-app or by email), and proactive notifications triggered when the assistant detects a pattern worth surfacing. A conversational chatbot is typically a fourth surface added later as a power-user feature. Each surface has different latency budgets and different scheduling requirements, which shape the reasoning layer architecture.
Can an AI health assistant run on-premises to keep PHI out of the cloud?
Yes, but it should not be the first choice. The right approach is a compliance ladder: start with cloud LLMs under a BAA with controlled PHI exposure through the tool layer. If the regulator requires more, move to anonymization and tokenization before the LLM call. If that is still insufficient, use a private-cloud LLM (AWS Bedrock, Azure OpenAI, GCP Vertex AI) under a healthcare-grade contract. On-prem open-source models (Llama, Mistral, Qwen) are tier four — the right answer only when the first three tiers are genuinely ruled out. The reasoning architecture stays the same regardless of deployment shape.
What is Open Wearables and what problem does it solve?
Open Wearables is an open-source wearable data platform that normalizes health data from supported providers (Garmin, Polar, Suunto, Whoop, Strava, Apple Health, Samsung Health Connect, and others) into a single unified schema. It handles OAuth connections, webhook ingestion, schema normalization, and longitudinal storage. It also ships an MCP server that exposes user health data to LLM agents in structured form, so the reasoning layer reads from one consistent query path regardless of which device or provider the data came from.
What are the most common failure modes of AI health assistants in production?
Five failure modes recur in production. Stale or missing data forces the assistant to choose between recommending on outdated signals or refusing to answer. Hallucination on edge cases (athletes, frequent travelers, users with chronic conditions) occurs when generic prompt language misses cohort-specific thresholds. Latency accumulates across multi-agent pipelines and can miss the budget for real-time surfaces like push notifications. Multilingual tone drift degrades brand voice when the translation layer is not tuned carefully. Cohort calibration errors cause alert fatigue when thresholds are not calibrated to the specific user population. All five are addressable through architecture, not prompt tweaks alone.
How does Momentum handle compliance for AI health assistants?
Momentum covers HIPAA, GDPR, KVKK, BAA signing, and audit log retention as standard for healthcare client engagements. The deployment posture (cloud LLM tier, tokenization, private cloud, or on-prem) is determined by the client's regulator and risk appetite, starting from the least restrictive tier that satisfies the requirement. Audit logging covers every agent run: tools called, data read, prompt context, output, and user reaction. This is required for both compliance and iterative improvement.

Written by Bartosz Michalak

Director of Engineering
He drives healthcare open-source development at the company, translating strategic vision into practical solutions. With hands-on experience in EHR integrations, FHIR standards, and wearable data ecosystems, he builds bridges between healthcare systems and emerging technologies.
