Key Takeaways
- Device vendors (Garmin, Oura, Whoop) publish their own HRV, sleep, and recovery scores. These are marketing features optimized for consumer engagement, not clinical precision. They are not interoperable and should not be used as health metrics in a product where accuracy matters.
- Designing a reliable health score from wearable data requires: a normalized data model across devices, statistical calibration against a reference population, validation against peer-reviewed methodology, and explicit handling of data gaps and outliers.
- HRV, sleep quality, recovery readiness, training load, and strain can all be computed reliably from wearable sensor data. The challenge is doing it correctly. Most teams underestimate the difference between a number that looks plausible and a number that is defensible.
- Adding an AI layer on top of unreliable health scores produces unreliable AI recommendations. The model is only as good as the data it reasons over.
- Health scoring is distinct from software engineering. Getting it right requires a person who understands the physiology, not just someone who can write the algorithm.
The Problem With Vendor-Supplied Scores
Every major wearable device vendor ships their own health scores. Oura has its Readiness Score. Whoop has its Recovery Score. Garmin has its Body Battery. These scores are different numbers for the same underlying physiological signals.
A user with a Garmin and an Oura who syncs both devices to your app gets two different recovery numbers from the same night of sleep. Which one does your product display? Which one does your AI recommendation engine use?
For any product where accuracy matters, neither vendor score is a usable input.
Vendor scores are proprietary black boxes. Oura, Whoop, and Garmin do not publish their scoring algorithms. They do not disclose their reference populations. They do not expose the raw inputs their scores depend on. You cannot validate their methodology, reproduce their results, or audit their outputs.
Beyond opacity, they solve different problems. Garmin's Body Battery is designed to encourage movement throughout the day. Oura's Readiness Score is designed to encourage sleep hygiene. Whoop's Recovery Score is designed to inform training load decisions. These are product experiences, not clinical measurements. They serve the device vendor's engagement goals.
If you're building a product where health scores drive clinical decisions, insurance underwriting, coaching recommendations, or any output your users take seriously, vendor scores are not a foundation you can build on.
What a Reliable Health Score Requires
Building a health score from wearable data that holds up under scrutiny requires four things working together.
A normalized data model across providers
A health score built on Oura's HRV fields is an Oura score. For it to work when a user connects a Garmin, the score needs to be built on normalized sensor readings, not vendor-specific processed outputs.
This means starting at the raw signal level: rMSSD (the time-domain HRV metric all major devices provide), sleep stage durations, resting heart rate, activity intensity classifications, and similar device-agnostic values. These map to a unified schema regardless of which device produces them.
The normalization layer handles this, whether that's Open Wearables or another integration platform. But normalization solves the input problem, not the scoring problem. You still need to design what the score means and how it's calculated.
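As a sketch of what the unified schema might look like, the snippet below maps a vendor payload into a device-agnostic reading. The field names, payload keys, and `from_oura` helper are illustrative assumptions, not the actual Open Wearables schema or any vendor's real API shape:

```python
from dataclasses import dataclass

@dataclass
class NormalizedSleepReading:
    """Device-agnostic nightly reading; field names are illustrative."""
    date: str
    rmssd_ms: float          # time-domain HRV
    resting_hr_bpm: float
    deep_sleep_min: int
    rem_sleep_min: int
    source_device: str       # provenance kept for auditing

def from_oura(payload: dict) -> NormalizedSleepReading:
    # Hypothetical Oura-shaped payload mapped into the unified schema.
    # Vendor durations are assumed to arrive in seconds here.
    return NormalizedSleepReading(
        date=payload["day"],
        rmssd_ms=payload["average_hrv"],
        resting_hr_bpm=payload["lowest_heart_rate"],
        deep_sleep_min=payload["deep_sleep_duration"] // 60,
        rem_sleep_min=payload["rem_sleep_duration"] // 60,
        source_device="oura",
    )
```

A Garmin adapter would map its own payload shape into the same `NormalizedSleepReading`, so everything downstream of this layer is device-agnostic.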
Statistical calibration
A recovery score of 74 means something only in relation to context: what is this person's typical recovery range, how variable is it, how does today compare to their last 30 days?
Raw HRV values are not comparable across users. An rMSSD of 45 ms is excellent for a 55-year-old and unremarkable for a trained 30-year-old athlete. Without calibration to each user's baseline, a score is a number without meaning.
Proper calibration requires enough historical data per user to establish a personal baseline, statistical methods that handle outliers and data gaps without corrupting the baseline, and time-aware weighting that gives more influence to recent data than historical data.
The minimum observation window for a reliable baseline is typically 14 to 21 days of regular device sync. Scores generated before this window are estimates with wide uncertainty bounds. Products that display confidence intervals or "still calibrating" states are handling this correctly.
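The requirements above can be sketched in a few lines: a robust clip for outliers, exponential time-weighting so recent days dominate, and an explicit "still calibrating" flag below the minimum window. The half-life and clipping constants are illustrative assumptions, not a validated methodology:

```python
import statistics

MIN_BASELINE_DAYS = 14  # below this, output is "still calibrating"

def personal_baseline(values, half_life_days=7.0, mad_clip=3.0):
    """Time-weighted personal baseline from daily values (oldest first).

    Returns (baseline, calibrated): outliers are clipped using the median
    absolute deviation, then recent days get exponentially more weight.
    """
    if not values:
        return None, False
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    # Clip outliers to +/- mad_clip robust deviations before weighting
    clipped = [min(max(v, med - mad_clip * mad), med + mad_clip * mad)
               for v in values]
    # Exponential weights: the last element is the most recent day
    n = len(clipped)
    weights = [0.5 ** ((n - 1 - i) / half_life_days) for i in range(n)]
    baseline = sum(w * v for w, v in zip(weights, clipped)) / sum(weights)
    return baseline, n >= MIN_BASELINE_DAYS
```

The `calibrated` flag is what lets the product layer decide whether to show a score, a confidence interval, or a "still calibrating" state.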
Validation against peer-reviewed methodology
HRV research is extensive. The relationship between rMSSD, sleep stages, cortisol patterns, and recovery readiness has been studied in athletic populations, clinical populations, and occupational health contexts. This literature exists and is accessible.
A health score that ignores it is starting from scratch on a problem that has already been solved to varying degrees. A health score that incorporates it has a defensible basis for its claims.
An engineer and a neuroscientist ask different design questions. The engineer asks "What algorithm can we write?" The neuroscientist asks "What does the evidence say about the relationship between this signal and the outcome we're trying to predict?" Both questions need answers; only one produces a score that holds up under scrutiny.
Momentum's health scoring work is developed with Anna Zych, neuroscientist and PhD, who brings the evidence-based methodology that shapes how each score is designed and validated. The result is a scoring model grounded in peer-reviewed research on HRV, sleep physiology, and recovery readiness, not a plausible-looking algorithm built without reference to how these signals actually behave.
Explicit handling of data quality
Wearable devices produce incomplete data. Users forget to wear their device. Garmin watches fail to sync overnight. Batteries die. Bluetooth disconnects during sleep. The resulting dataset has gaps, and any scoring model needs a defined behavior for what happens when expected data is missing.
Options include: score suppression (don't show a score when confidence is below a threshold), confidence-weighted output (show the score with a reliability indicator), partial scoring (calculate sub-scores from available data and combine them with appropriate uncertainty handling), and interpolation from adjacent periods.
There is no universally correct answer. The right choice depends on what the score is used for and what the cost of a wrong answer is. Clinical decision support has different requirements than a general wellness app.
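As an illustration of making that choice explicit rather than implicit, a minimal policy function might combine two of the options above (suppression and confidence-weighted output). The coverage thresholds are hypothetical product decisions, not recommended values:

```python
def score_output(score, data_coverage, suppress_below=0.5, flag_below=0.8):
    """Illustrative missing-data policy.

    data_coverage is the fraction of expected sensor data actually present
    for the scoring window. Below suppress_below the score is withheld;
    between the thresholds it ships with a reliability indicator.
    """
    if data_coverage < suppress_below:
        return {"score": None, "state": "insufficient_data"}
    if data_coverage < flag_below:
        return {"score": score, "state": "low_confidence"}
    return {"score": score, "state": "ok"}
```

The point is that the policy is a named, testable function rather than an accident of whatever the pipeline happens to do when a field is null.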
The Scores That Can Be Built Reliably
Given a normalized multi-device data layer and proper calibration methodology, these scores can be built with defensible accuracy:
HRV-based recovery readiness. Measures how recovered the autonomic nervous system is from prior stress. Based on rMSSD from overnight sleep data. Calibrated to personal baseline, time-weighted, with outlier handling. This is the most data-rich signal wearables produce and the most directly related to readiness for physical or cognitive stress.
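A minimal sketch of the readiness calculation: express today's rMSSD as a deviation from the personal baseline in standard-deviation units, then map that onto a 0-100 scale centered at 50. The linear mapping and clipping bounds are simplifying assumptions; a production model would also weight sleep and recent load:

```python
def readiness_score(today_rmssd, baseline_rmssd, baseline_sd,
                    floor=-3.0, ceil=3.0):
    """Map today's deviation from personal baseline onto 0-100.

    50 means "at baseline"; deviations are clipped to +/- 3 SD so a
    single extreme night cannot push the score off-scale.
    """
    z = (today_rmssd - baseline_rmssd) / max(baseline_sd, 1e-9)
    z = min(max(z, floor), ceil)
    return round(50 + 50 * z / ceil)
```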
Sleep quality score. Composite score from sleep stage distribution (REM, deep, light, awake), continuity (number of awakenings, longest continuous sleep period), timing (alignment with the user's sleep schedule), and duration deviation from personal baseline. Each component has separate research support and can be individually weighted based on the product's use case.
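The composite structure can be sketched as independently normalized components combined with explicit weights. The reference durations and weights below are illustrative placeholders, not validated targets:

```python
def sleep_quality(deep_min, rem_min, awakenings, total_min,
                  target_min=450, weights=(0.3, 0.3, 0.2, 0.2)):
    """Illustrative composite: stage, continuity, and duration components,
    each normalized to 0-1, then weighted. Weights are a product decision."""
    deep_c = min(deep_min / 90, 1.0)         # ~90 min deep as a reference
    rem_c = min(rem_min / 105, 1.0)          # ~105 min REM as a reference
    continuity_c = max(1.0 - awakenings / 10, 0.0)
    duration_c = max(1.0 - abs(total_min - target_min) / target_min, 0.0)
    components = (deep_c, rem_c, continuity_c, duration_c)
    return round(100 * sum(w * c for w, c in zip(weights, components)))
```

Keeping components separate is what allows per-product weighting: a clinical sleep product and a fitness app can share the components and differ only in the weight vector.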
Training load and strain. Cumulative cardiovascular and muscular stress from activity data. Based on heart rate zone distribution, session duration, and session count over a rolling window. Used in coaching and performance contexts to manage overtraining risk.
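A TRIMP-style simplification of this: weight minutes in each heart rate zone by intensity, then sum over a rolling window of sessions. The linear zone weights are an assumption; established load models use more sophisticated weighting:

```python
ZONE_WEIGHTS = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}  # simple linear zone weighting

def session_load(minutes_per_zone: dict) -> float:
    """Zone-minutes weighted by intensity for one session."""
    return sum(ZONE_WEIGHTS[z] * m for z, m in minutes_per_zone.items())

def rolling_load(session_loads, window=7):
    """Cumulative load over the most recent `window` sessions."""
    return sum(session_loads[-window:])
```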
Recovery trend. Directional assessment of whether the user's recovery is improving, stable, or declining over a multi-week window. Less precise than a single-day score but more useful for behavioral pattern recognition.
Anomaly signals. Deviations from personal baseline that exceed statistical thresholds and may indicate physiological stress, illness onset, or training maladaptation. These are not diagnoses. They are flags that something is outside the user's normal range and worth paying attention to.
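The statistical-threshold idea can be sketched directly; the 2.5 SD threshold and 14-day minimum history are illustrative choices, and a flag here is deliberately a boolean, not a diagnosis:

```python
import statistics

def anomaly_flag(today, history, threshold_sd=2.5):
    """Flag values outside the user's normal range.

    Not a diagnosis: just a statistical deviation worth surfacing.
    Requires enough history to have a stable baseline first.
    """
    if len(history) < 14:
        return False
    mean = statistics.fmean(history)
    sd = statistics.stdev(history) or 1e-9
    return abs(today - mean) / sd > threshold_sd
```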
Custom scores specific to a product's user population are also feasible. A remote patient monitoring product for elderly users has different relevant signals than a performance coaching app for endurance athletes. Designing scores for a specific population means calibrating the reference methodology and validation criteria to that population, not using a generic consumer fitness framework.
The AI Layer
Once a reliable health score layer exists, it can anchor an AI recommendation engine.
The use case: a user opens the app and receives a recommendation grounded in their current health state. "Your recovery is lower than your average for Tuesday. Your last three sessions involved high training load. Consider a lighter session today or prioritize sleep tonight." The recommendation is specific to this person's data.
Building this requires a few components beyond the scoring layer:
An MCP server or equivalent API interface. The AI model needs to query health data in natural language or structured queries. Open Wearables includes an MCP server for this purpose: you can ask "what is this user's HRV trend over the last 14 days" or "how does today's sleep quality compare to their 30-day baseline" and get back a structured response the model can reason over. Other wearable infrastructure setups can achieve the same through equivalent query interfaces.
A retrieval layer. The AI model doesn't reason well over raw time-series data. It reasons well over summarized, structured facts. The retrieval layer pulls the relevant health facts (current scores, recent trends, anomalies, historical context) and formats them as context for the model's response.
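A minimal sketch of that formatting step, assuming the fact schema (score names, trend directions, anomaly labels) is defined by the product rather than by any particular platform:

```python
def build_health_context(scores: dict, trends: dict, anomalies: list) -> str:
    """Format structured health facts as compact text context for the model.

    Keys and labels here are illustrative; real products define their
    own fact schema and retrieval logic.
    """
    lines = [f"{name}: {value}" for name, value in scores.items()]
    lines += [f"{name} trend (30d): {direction}"
              for name, direction in trends.items()]
    if anomalies:
        lines.append("anomalies: " + ", ".join(anomalies))
    else:
        lines.append("anomalies: none")
    return "\n".join(lines)
```

The model then reasons over a handful of labeled facts instead of thousands of raw time-series points, which is both cheaper and far less error-prone.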
Guardrails on health claims. AI models are willing to make confident clinical claims they have no basis for. A recommendation engine for a health product needs explicit constraints on what kinds of claims it can make, what language it uses when uncertainty is high, and when it should defer to a clinician rather than making a recommendation.
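One simple layer of such guardrails can be rule-based, checked before any model output reaches the user. The blocked patterns below are illustrative; a real product would pair this with prompt-level constraints and clinician-defined escalation rules:

```python
# Illustrative patterns that should never appear in an automated
# recommendation; matches trigger escalation instead of display.
BLOCKED_PATTERNS = ("diagnos", "cure", "guarantee")

def check_claim(text: str) -> str:
    """Return 'ok' if the text passes, else 'escalate_to_clinician'."""
    lowered = text.lower()
    if any(p in lowered for p in BLOCKED_PATTERNS):
        return "escalate_to_clinician"
    return "ok"
```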
The AI layer built on an unreliable scoring foundation produces confident-sounding recommendations with no real grounding. This is worse than no AI layer because it adds apparent authority to uncertain information.
Starting Point for a Health Scoring Project
A health scoring engagement starts with the product context, not the algorithm. What decisions will users make based on this score? What population is the product designed for? What devices will the user base connect? What is the acceptable error rate and how is error defined?
The answers to these questions shape the scoring methodology. The methodology shapes the implementation. Building the implementation first and asking these questions later produces scores that need to be redesigned.
If you have a wearable data layer and want to add health scores or AI features, the wearables team at Momentum can scope what this involves for your specific product. For more complex scoring work or clinical validation questions, our health science advisor Anna Zych, neuroscientist and PhD, is involved in scoping from the start.