Key Takeaways
- A 2025 systematic review published in npj Digital Medicine analyzed 82 studies covering 430,052 participants and 14 Apple Watch health metrics
- Apple Watch heart rate accuracy is strong under resting conditions (mean bias: -0.27 bpm) but variability widens during high-intensity exercise and in populations with darker skin tones
- AFib detection showed 91% specificity and 79% sensitivity, sufficient for consumer screening but not for standalone clinical diagnosis
- SpO2 measurement has a near-zero mean bias (-0.04%) but limits of agreement that span roughly 8 percentage points
- Energy expenditure produced the largest and most inconsistent errors across all 14 metrics, with mean errors exceeding 20-30% during exercise in several included studies
- The study authors call for longitudinal validation of clinical metrics before broader adoption in healthcare settings
- For health app developers, the practical implication is that accuracy thresholds differ significantly between wellness applications and clinical tools
This article is part of The Science Behind Wearables series. To stay up to date with new posts, subscribe on Substack.
Apple Watch is on the wrists of an estimated 100 million people. Its measurement accuracy matters more than most device marketing acknowledges: it affects the people making decisions based on the data and the developers building products on top of it.
A 2025 systematic review published in npj Digital Medicine provides the most comprehensive analysis to date. Researchers at University College Dublin pooled data from 82 independent studies covering 430,052 participants across 14 health metrics.
This article walks through the review's findings, metric by metric, and examines what those numbers mean for teams building wearable-connected health products.
About the Study
The review was published in March 2025 under the title "The accuracy of Apple Watch measurements: a living systematic review and meta-analysis."
The "living" designation matters. Traditional systematic reviews are published once and become progressively outdated as new research accumulates. A living review incorporates incoming studies on a rolling basis, which is especially relevant in wearable health technology, where hardware updates annually and research lags device releases by two to three years.
The 82 included studies covered participants with a mean age of 41.3 years. The analysis relied on mean bias and limits of agreement as its primary accuracy measures. These statistics are more informative than simple percentage-accuracy figures: mean bias reveals whether errors are systematic (the device consistently over- or underestimates), and limits of agreement, conventionally the range expected to contain about 95% of device-minus-reference differences, show how far individual readings can fall from the reference standard in either direction. Notably, participant demographics across these studies skewed disproportionately male. This skew highlights a persistent blind spot in wearable validation: the reported accuracy figures may not translate fully to female users because of physiological and biomechanical differences.
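The two statistics described above can be computed for a single validation dataset with a few lines of Python. This is a minimal sketch of the conventional Bland-Altman calculation (bias plus or minus 1.96 standard deviations of the differences); the review's method for pooling limits of agreement across studies is more involved and is not reproduced here. The function name and the sample readings are hypothetical.

```python
import statistics


def bland_altman(device, reference):
    """Return (mean_bias, lower_loa, upper_loa) for paired readings.

    Differences are taken as device minus reference, so a negative
    bias means the device reads low on average.
    """
    if len(device) != len(reference) or len(device) < 2:
        raise ValueError("need at least two paired readings")
    diffs = [d - r for d, r in zip(device, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    half_width = 1.96 * sd        # conventional 95% limits of agreement
    return bias, bias - half_width, bias + half_width


# Hypothetical paired heart rate readings in bpm (not data from the review).
watch = [62, 71, 88, 95, 120, 134]
ecg = [63, 70, 90, 94, 123, 133]

bias, lower, upper = bland_altman(watch, ecg)
print(f"mean bias = {bias:.2f} bpm, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

A bias near zero with wide limits, as in the SpO2 results later in this article, means the device is right on average but individual readings can still be far off.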
Apple Watch Heart Rate Accuracy
Heart rate is the most studied metric in the review, with the deepest study pool. The meta-analysis found a mean bias of -0.27 bpm. On average, Apple Watch very slightly underestimates heart rate. The limits of agreement ranged from -7.19 to 6.64 bpm, meaning most individual readings fall within roughly seven beats per minute of a clinical reference measurement.
The situation shifts during high-intensity exercise. Photoplethysmography, the optical sensor technology Apple Watch uses to detect heart rate, is susceptible to motion artifacts when the wrist moves rapidly. Several included studies documented wider error ranges during running and cycling compared to walking or seated rest. This is a consistent limitation of wrist-based PPG sensors across all manufacturers.
Skin tone introduces additional variability. The review references multiple studies showing that darker skin tones correlate with reduced PPG signal quality, producing larger errors. This is not unique to the Apple Watch: all consumer wrist-based PPG devices share the limitation, and it is directly relevant for products targeting demographically diverse populations.
Apple Watch heart rate accuracy at rest meets the requirements for wellness applications and fitness tracking. For remote patient monitoring or clinical cardiology applications, the exercise variability and demographic moderators require additional validation layers before using Apple Watch data as the primary measurement input.
Heart rate accuracy also sets a ceiling on heart rate variability data quality. Since HRV is calculated from inter-beat intervals, any error in the heart rate signal propagates into HRV estimates. Teams building HRV features should read the review's heart rate findings alongside the SDNN accuracy data on wrist-worn wearables before deciding what level of precision they can claim.
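To illustrate how beat-timing error carries into HRV, the sketch below computes SDNN (the standard deviation of inter-beat intervals) for a clean series and for the same series with an artificial timing error added. The interval values and the 20 ms error size are assumptions chosen for illustration, not figures from the review.

```python
import statistics


def sdnn_ms(ibi_ms):
    """SDNN: sample standard deviation of inter-beat intervals, in ms."""
    return statistics.stdev(ibi_ms)


# Hypothetical inter-beat intervals in milliseconds (not data from the review).
true_ibi = [800, 820, 790, 810, 805, 795, 815, 800]

# The same beats with an assumed alternating +20 / -20 ms detection error.
noisy_ibi = [x + (20 if i % 2 == 0 else -20) for i, x in enumerate(true_ibi)]

print(f"mean interval, reference: {statistics.mean(true_ibi):.1f} ms")
print(f"mean interval, with error: {statistics.mean(noisy_ibi):.1f} ms")
print(f"SDNN, reference: {sdnn_ms(true_ibi):.1f} ms")
print(f"SDNN, with error: {sdnn_ms(noisy_ibi):.1f} ms")
```

In this example the mean interval, and therefore the average heart rate, is unchanged by the added error, while SDNN increases. Small per-beat errors that are invisible in an averaged heart rate can still distort an HRV estimate.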
Atrial Fibrillation Detection
The review evaluated AFib detection using sensitivity and specificity, the standard measures for any binary classification task in clinical medicine.
Apple Watch achieved 91% specificity and 79% sensitivity across the included studies. In practical terms: of the people who do not have AFib, the device correctly classifies 91% as normal, and of the people who do have AFib, it detects 79%. The result is a device that produces relatively few false positives and misses roughly one in five true cases.
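The arithmetic behind these two measures can be sketched as follows. The cohort of 1,000 people and the 10% AFib prevalence are assumptions picked to make the numbers round; they do not come from the review. The positive predictive value calculation is included because it shows how the share of alerts that are true AFib depends on prevalence as well as on sensitivity and specificity.

```python
def sensitivity(tp, fn):
    """Share of true AFib cases that the device flags."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of non-AFib cases that the device correctly leaves unflagged."""
    return tn / (tn + fp)


def positive_predictive_value(sens, spec, prevalence):
    """Share of flagged cases that are true AFib at a given prevalence."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical cohort: 1,000 people, 100 of whom have AFib (assumed prevalence).
tp, fn = 79, 21    # of 100 with AFib: 79 flagged, 21 missed
tn, fp = 819, 81   # of 900 without AFib: 819 cleared, 81 false alarms

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
ppv = positive_predictive_value(sens, spec, prevalence=0.10)

print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}, PPV = {ppv:.2f}")
```

Under these assumed numbers, just under half of all alerts correspond to true AFib, and the share falls as prevalence falls. That fits a screening feature that prompts clinical follow-up rather than one that delivers a diagnosis.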
Those figures support a screening role. High specificity reduces unnecessary clinical referrals, which matters in a consumer health context. The sensitivity figure means some AFib episodes will not be flagged, which excludes standalone diagnostic use.
For product teams, AFib detection via Apple Watch is well-suited as a screening feature that prompts clinical follow-up. Positioning it as diagnostic requires clinical evidence beyond what the consumer device literature currently provides.
Blood Oxygen (SpO2) Accuracy
SpO2 measurement on Apple Watch uses the same optical PPG principle as heart rate but at different wavelengths, calculating the ratio of oxygenated to deoxygenated hemoglobin.
The review found a mean bias of -0.04%, which is as close to zero as any consumer device study has reported. The limits of agreement were -4.00% to 3.94%, a range that spans nearly eight percentage points from bottom to top. A reading of 93% SpO2 could, within that range, represent a true value anywhere from 89% to 97%.
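The range in the example above follows directly from the limits of agreement. A minimal sketch, assuming the differences are defined as device minus reference (consistent with the negative bias being described as underestimation earlier in this article); the function name is hypothetical:

```python
# Pooled SpO2 limits of agreement quoted above, in percentage points.
LOA_LOWER = -4.00
LOA_UPPER = 3.94


def plausible_true_range(reading_pct):
    """Return the (low, high) reference values consistent with a reading.

    Assumes difference = device - reference, so
    reference = device - difference. Results are clamped to 0-100%.
    """
    low = reading_pct - LOA_UPPER
    high = reading_pct - LOA_LOWER
    return max(low, 0.0), min(high, 100.0)


low, high = plausible_true_range(93.0)
print(f"A 93% reading is consistent with a true value of {low:.2f}% to {high:.2f}%")
```

For a reading of 93%, this gives roughly 89% to 97%, matching the range quoted above.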
This is consistent with outcomes from other consumer-grade wrist PPG SpO2 devices. Apple positions the Blood Oxygen app as a general wellness feature rather than an FDA-cleared medical device, which is consistent with this measurement limitation.
For health applications, SpO2 from Apple Watch contributes a useful directional signal for respiratory and cardiovascular trend analysis over time. As part of a wearable data integration layer that surfaces trends across days or weeks, it provides meaningful information. For clinical oxygen monitoring, a medical-grade pulse oximeter remains the reference standard.
Sleep Tracking Accuracy
The review characterized sleep tracking accuracy as moderate. Total sleep time estimates were generally close to polysomnography, the clinical gold standard. Sleep stage classification showed more divergence, particularly at the boundary between light NREM and REM sleep.
Wrist-based actigraphy, supplemented by PPG, reliably distinguishes sleep from wakefulness. Distinguishing fine-grained stages requires methods closer to electroencephalography, which wrist sensors cannot replicate. The included studies showed that accuracy varied by sleep profile: participants with sleep disorders or irregular sleep patterns showed wider error distributions than healthy sleepers with consistent schedules.
Sleep data integration is one of the most common use cases for wearable APIs in consumer health apps. Given the accuracy profile, the most valuable data to present to users is the long-term trend in total sleep duration, rather than precision claims about a single night's sleep stage distribution.
Step Count Accuracy
Step count accuracy was moderate across the included studies, consistent with results from independent validation work over the past several years.
Apple Watch performs well on flat surfaces at typical walking speeds. Accuracy decreases in non-standard movement scenarios: pushing a stroller, using walking poles, carrying heavy loads, or moving at unusually slow or fast cadences. Short walking bouts are more susceptible to error than continuous activity periods.
For general population wellness applications, this accuracy level is sufficient. For clinical research requiring precise step counts, such as gait analysis, dedicated clinical-grade pedometry devices can provide reference data.
Energy Expenditure Accuracy
Energy expenditure was the lowest-performing metric in the review. Errors were inconsistent across studies, frequently large, and showed no stable direction of bias.
Apple Watch estimates calorie burn from a combination of heart rate data, accelerometer readings, and user-provided demographic inputs. The multi-variable algorithm compounds errors: inaccuracies in the heart rate signal propagate directly into the calorie estimate, and the physiological assumptions embedded in the algorithm do not generalize evenly across all users.
For teams building AI health coaching features or nutrition products, this is a significant constraint. Energy expenditure data from Apple Watch can serve as a relative indicator of activity intensity. Presenting absolute calorie figures as precise measurements requires clear user-facing caveats about estimation uncertainty.
What Health App Developers Should Do With This Data
The review's core finding is that accuracy is not a single number. It varies by metric, by individual physiology, and by the conditions under which measurements are taken. That has direct product implications.
Defining the required accuracy tier at the design stage prevents expensive scope changes later. Wellness and lifestyle applications can tolerate wider variability than clinical tools: the heart rate accuracy reported in the review is adequate for a fitness app displaying exercise zones, whereas remote patient monitoring calls for additional validation.
For metrics with wide limits of agreement, particularly SpO2, displaying aggregate trends over time rather than point-in-time precision readings is a more defensible design pattern. A declining trend in overnight SpO2 over several weeks communicates more reliably than a single reading of 94%.
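One simple way to surface a trend rather than a point reading is a least-squares slope over nightly averages. The sketch below is one possible approach, not a method from the review; the function name and the 14 nightly values are hypothetical.

```python
def slope_per_night(nightly_means):
    """Least-squares slope of nightly averages, in units per night."""
    n = len(nightly_means)
    if n < 2:
        raise ValueError("need at least two nights of data")
    x_mean = (n - 1) / 2
    y_mean = sum(nightly_means) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(nightly_means))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical overnight SpO2 averages for 14 nights, in percent.
nightly_spo2 = [96.1, 95.8, 96.0, 95.5, 95.6, 95.2, 95.3,
                94.9, 95.0, 94.7, 94.8, 94.4, 94.5, 94.2]

slope = slope_per_night(nightly_spo2)
print(f"trend: {slope:+.3f} percentage points per night")
```

Averaging many readings per night and fitting across nights reduces the influence of random single-reading error. It does not remove a systematic bias, so the slope describes direction of change rather than an absolute level.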
Products targeting specific populations, such as older adults, users with chronic conditions, or diverse skin tones, may need population-specific validation beyond the general consumer device accuracy literature. The review documents demographic variability clearly enough to warrant this consideration at the scoping stage.
Teams building across multiple wearable providers face an additional layer of complexity: each device has its own accuracy profile per metric. A unified data layer that normalizes for device-specific variability is more robust than passing raw values from individual providers directly to users.
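A unified data layer of this kind could attach a per-device, per-metric accuracy profile to every reading before it reaches the UI. The sketch below is one hypothetical design: all class, function, and key names are illustrative. The only real numbers are the Apple Watch limits of agreement quoted in this article; profiles for other providers would have to come from their own validation literature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccuracyProfile:
    mean_bias: float
    loa_lower: float  # lower limit of agreement (device - reference)
    loa_upper: float  # upper limit of agreement (device - reference)


# Keys are (provider, metric). Values are the pooled figures quoted in this
# article; key names are hypothetical.
PROFILES = {
    ("apple_watch", "heart_rate_bpm"): AccuracyProfile(-0.27, -7.19, 6.64),
    ("apple_watch", "spo2_pct"): AccuracyProfile(-0.04, -4.00, 3.94),
}


@dataclass(frozen=True)
class NormalizedReading:
    provider: str
    metric: str
    value: float
    low: Optional[float]   # None when no accuracy profile is known
    high: Optional[float]


def normalize(provider, metric, value):
    """Attach the plausible reference range to a raw provider reading."""
    profile = PROFILES.get((provider, metric))
    if profile is None:
        # Unknown accuracy: pass the value through, flagged by missing bounds.
        return NormalizedReading(provider, metric, value, None, None)
    return NormalizedReading(
        provider, metric, value,
        low=value - profile.loa_upper,
        high=value - profile.loa_lower,
    )


reading = normalize("apple_watch", "heart_rate_bpm", 70.0)
print(reading)
```

Carrying the bounds alongside the value lets downstream features decide whether a metric is precise enough for their accuracy tier, and makes an unvalidated provider visibly different from a validated one.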
The top challenges teams face with wearables in healthcare map closely to these accuracy considerations: inconsistent data quality, demographic variability, and the gap between consumer device specifications and clinical requirements.