Accuracy of 11 Wearable, Nearable, and Airable Consumer Sleep Trackers: Prospective Multicenter Validation Study


Introduction

With the growing recognition of the importance of sleep for overall health [], public interest in monitoring sleep patterns using consumer sleep trackers (CSTs) has risen significantly [-]. While laboratory sleep monitoring with polysomnography, the traditional sleep analysis tool, is limited by the need for cumbersome sensors [], CSTs enable individuals to monitor their sleep at home with minimal equipment. Recently, many big tech companies, including Apple, Samsung, and Google, as well as health care startups such as Withings and Oura, have released their own CSTs. These companies have contributed substantially to improving the performance of CSTs by integrating deep learning algorithms and biosignal sensing technologies [,]. As a result, CSTs have emerged as accessible solutions for home sleep monitoring [,,-]. CSTs are widely used not only by individuals interested in wellness who wish to understand and proactively manage their own sleep, but also by those who want to self-check and screen for sleep disorders.

This study classified CSTs into 3 types: wearables, nearables, and airables. Wearable devices or wearables, such as smartwatches and ring-shaped devices, are generally worn by users to track sleep using sensors like photoplethysmography sensors and accelerometers [,-]. Nearable devices or nearables, placed near the body without direct contact, have radar or mattress pads to detect subtle movements during sleep [,]. Airable devices or airables use mobile phones to analyze sleep via built-in microphones or environmental sensors [,,]. This classification is based on the measurement methods and biological signals used in each category.

Given the surge of diverse CSTs, comprehensive and objective evaluations of the performance of the CSTs available in the market are necessary [,,-]. Some studies compared CSTs with alternative sleep analysis tools, such as electroencephalography headbands [] or subjective sleep diaries [], without employing the gold standard polysomnography, and therefore could not validate the consistency between CSTs and polysomnography. Chinoy et al [] compared the performance of 7 CSTs (Fatigue Science Readiband, Fitbit Alta HR, Garmin Fenix 5S, Garmin Vivosmart 3, EarlySense Live, ResMed S+, and SleepScore Max) with polysomnography. However, that study was limited by recruitment from a single institution and by the exclusion of several widely used commercially available CSTs.

To address these limitations, we conducted a multicenter study comparing widely used or newly released CSTs with in-lab polysomnography in a hospital setting. By simultaneously assessing multiple CSTs, we aimed to minimize bias and evaluate their performance across various metrics. Subgroup analysis was also performed to assess the impact of demographic factors on performance, including sex assigned at birth, apnea-hypopnea index (AHI), and BMI. By performing the most extensive simultaneous comparison of widely used CSTs and conducting a multicenter study with diverse demographic groups, this study offers comprehensive insights into the performance and applicability of these CSTs for sleep monitoring.


Methods

Participants

The demographic information of the study participants is presented in . A total of 75 individuals were recruited from Seoul National University Bundang Hospital (SNUBH) and Clionic Lifecare Clinic (CLC). Of these 75 individuals, 37 (27 males and 10 females) with scheduled polysomnography for sleep disorders were recruited from SNUBH and 38 (12 males and 26 females) were recruited through an online platform from CLC. Both institutions used the same inclusion criteria, including age between 19 and 70 years and presence of subjective sleep discomfort. Individuals with uncontrolled acute respiratory conditions were excluded. Participant demographics revealed that the sample consisted of 52% (39/75) males, with a mean age of 43.59 (SD 14.10) years and a mean BMI of 23.90 (SD 4.07) kg/m2. Significant differences in sleep measures were observed between the 2 institutions, including time in bed, total sleep time, wake after sleep onset (WASO), and AHI. presents the number of measurements and data collection success rate for each CST.

Table 1. Comparative analysis of participant demographics.

Characteristic | Total (N=75) | SNUBHa (n=37) | CLCb (n=38) | P valuec
Male, n (%) | 39 (52) | 27 (73) | 12 (32) | <.001d
Age (years), mean (SD) | 43.59 (14.10) | 53.49 (11.96) | 33.95 (8.07) | <.001d
BMI (kg/m2), mean (SD) | 23.90 (4.07) | 24.64 (4.07) | 23.18 (3.98) | .12
Time in bed (hours), mean (SD) | 7.24 (0.92) | 8.01 (0.31) | 6.49 (0.66) | <.001d
Total sleep time (hours), mean (SD) | 5.82 (1.33) | 6.26 (1.2) | 5.40 (1.33) | .005d
Sleep latency (hours), mean (SD) | 0.27 (0.37) | 0.27 (0.38) | 0.26 (0.38) | .93
Wake after sleep onset (hours), mean (SD) | 1.15 (1.2) | 1.48 (1.22) | 0.83 (1.23) | .02d
Sleep efficiency (%), mean (SD) | 81.00 (17.5) | 78.40 (15.54) | 83.54 (19.1) | .20
Apnea-hypopnea index, mean (SD) | 18.18 (20.39) | 26.56 (24.25) | 10.02 (10.99) | <.001d

aSNUBH: Seoul National University Bundang Hospital.

bCLC: Clionic Lifecare Clinic.

cAll P values were obtained using 2-sample independent t tests. For the male category, Fisher exact test was applied.

dStatistical significance (P<.05).

Evaluation of CSTs

We evaluated 11 different CSTs in this study. Wearables included ring-type devices (Oura Ring 3, Oura) and watch-type devices (Apple Watch 8, Apple Inc; Galaxy Watch 5, Samsung Electronics Co, Ltd; Fitbit Sense 2, Fitbit Inc; and Google Pixel Watch, Google LLC). Nearables included pad-type devices (Withings Sleep Tracking Mat, Withings) and motion sensor devices (Amazon Halo Rise, Amazon Inc; and Google Nest Hub 2, Google LLC). Airables included mobile apps (SleepRoutine, Asleep; SleepScore App, SleepScore Labs; and Pillow, Neybox Digital Ltd) with iPhone 12s and Galaxy S21s. The selection of these devices was based on their popularity and availability in the market at the time of the study. The methods of usage and application for each sleep tracker were based on user instructions provided by the respective manufacturers. To mitigate a possible learning curve for each device, the researchers educated participants on how to use each device before measurements, and in the case of wearable CSTs, they ensured that the devices were properly fitted. During the study, software updates of all devices were performed on March 1, 2023, to ensure that they were up-to-date, and automatic updates were disabled.

Study Design

This was a prospective cross-sectional study conducted to investigate the accuracy of various CSTs and polysomnography in analyzing sleep stages. It was conducted at 2 independent medical institutions in South Korea, namely SNUBH, a tertiary care hospital, and CLC, a primary care clinic.

All participants were contacted by phone at least 2 days prior to participating in the polysomnography study and were provided with instructions. On the day before and the day of the test, they were advised to abstain from alcohol and caffeine consumption and refrain from engaging in strenuous exercise, and were informed of the designated test time. These measures were taken to standardize participant behaviors and minimize the influence of potential confounding factors. On the designated test days, participants visited the hospitals and received detailed explanations about the study. They provided written informed consent and underwent polysomnography at each institution. Polysomnography recordings were conducted in a controlled sleep laboratory environment in accordance with the guidelines recommended by the American Academy of Sleep Medicine (AASM) []. Two technicians independently interpreted the results, followed by a review by sleep physicians.

To address the issue of interference due to multiple CSTs sharing the same biosignals, the participants were divided into 2 groups in both medical institutions: multi-tracker group A and multi-tracker group B, as illustrated in . The configurations of the CSTs are presented in . Specifically, at SNUBH, multi-tracker group A consisted of 18 individuals and multi-tracker group B consisted of 19 individuals. Similarly, at CLC, each group included 19 individuals. Across both institutions, the demographic statistics for participants in multi-tracker groups A and B demonstrated no significant differences across all metrics, as presented in . Each group included a combination of noninterfering CSTs. Specifically, the nearables Google Nest Hub 2 and Amazon Halo Rise, which use similar radar sensors to detect motion, were allocated to different groups. In the case of wearables, participants were allowed to wear a maximum of 2 watch devices simultaneously, 1 on each wrist. Consequently, Fitbit Sense 2 and Pixel Watch were assigned to multi-tracker group A, while Galaxy Watch 5 and Apple Watch 8 were assigned to multi-tracker group B. As a result, these devices were expected to yield approximately half of the intended measurements. The airables available on both iOS and Android (SleepRoutine and SleepScore) were measured with half of the participants on iOS and the other half on Android. We used Pillow on iOS only, as it is not available on Android. The polysomnography and CST results were then compared and analyzed.

Figure 1. Flowchart outlining the experimental design. Experimental procedures involving subject enrollment, CST assignment, and experimental settings for simultaneous measurement involving both CSTs and PSG. CLC: Clionic Lifecare Clinic; CST: consumer sleep tracker; PSG: polysomnography; SNUBH: Seoul National University Bundang Hospital.

Figure 2. Configuration of consumer sleep trackers used in the experiment.

Ethics Approval

Ethics approval was obtained from the respective Institutional Review Board (IRB) of each institution (IRB number B-2302-908-301 from SNUBH and number P01-202302-01-048 from CLC).

Statistical Methods and Evaluation Metrics

Two-sample independent t tests were employed to compare demographic information () and sleep measures, and significance was determined based on a P value of <.05. The average sleep measurements were compared, and any proportional bias was assessed using Pearson correlation and a Bland-Altman plot. The study used sensitivity, specificity, and F1 scores as evaluation metrics for sleep stage classification. Macro F1 scores, weighted F1 scores, and kappa values were used to summarize the results of the evaluation, considering the imbalance in data classes, such as sleep stages. All statistical analyses and visualizations were conducted using Python 3 (version 3.9.16) and used the scikit-learn, matplotlib, and scipy libraries.
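As a concrete illustration of these metrics, the following sketch (not the authors' exact pipeline; the label arrays are fabricated) computes accuracy, macro F1, weighted F1, and Cohen κ with scikit-learn, using the 0-3 stage coding described under Data Preprocessing:

```python
# Illustrative sketch of the epoch-by-epoch agreement metrics.
# Labels: 0=wake, 1=light, 2=deep, 3=REM (as in Data Preprocessing).
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

psg = [0, 0, 1, 1, 2, 2, 3, 3, 1, 1]  # polysomnography labels (reference)
cst = [0, 1, 1, 1, 2, 1, 3, 3, 1, 0]  # consumer sleep tracker labels

acc = accuracy_score(psg, cst)
macro_f1 = f1_score(psg, cst, average="macro")        # unweighted mean over stages
weighted_f1 = f1_score(psg, cst, average="weighted")  # weighted by stage prevalence
kappa = cohen_kappa_score(psg, cst)                   # chance-corrected agreement
print(acc, macro_f1, weighted_f1, kappa)
```

Macro F1 treats each sleep stage equally regardless of how rare it is, which is why the study reports it alongside weighted F1 and κ to account for class imbalance.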

Data Preprocessing

Three main steps were followed in the data processing stage. First, raw sleep score data were extracted from each CST device, either through direct download via the manufacturer’s app or the web portal, or by requesting raw data from the SleepRoutine device manufacturer. The sleep score codes were standardized across devices, with the wake stage assigned 0, the light stage assigned 1, the deep stage assigned 2, and the rapid eye movement (REM) stage assigned 3. Apple Watch 8 used alternative expressions, such as “core sleep” instead of light sleep.
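The code standardization step can be sketched as follows; the vendor label strings and the alias table are hypothetical illustrations, with only the numeric coding (0=wake, 1=light, 2=deep, 3=REM) taken from the study:

```python
# Sketch of stage-code standardization across devices.
STAGE_CODES = {"wake": 0, "light": 1, "deep": 2, "rem": 3}

# Hypothetical vendor-specific vocabulary mapped onto the common labels
# (eg, Apple's "core" sleep corresponds to light sleep).
VENDOR_ALIASES = {"core": "light", "awake": "wake"}

def standardize(raw_label: str) -> int:
    label = raw_label.strip().lower()
    label = VENDOR_ALIASES.get(label, label)
    return STAGE_CODES[label]

print([standardize(s) for s in ["Awake", "Core", "Deep", "REM"]])  # [0, 1, 2, 3]
```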

The extracted data were synchronized in time to compare CST results of sleep tracking with polysomnography accurately. The sleep stages measured by devices earlier than polysomnography scores were discarded. Conversely, for devices that started measuring after polysomnography scoring, the sleep stages were marked as the wake stage until the measurement began. The end point of all device measurements was aligned with the end of polysomnography, resulting in consistent measurement of total time in bed across the devices. 
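A minimal sketch of this alignment rule, under stated assumptions (per-second stage labels and integer-second timestamps; the function and variable names are illustrative, not the authors' code):

```python
WAKE = 0  # stage code for wake, per the standardized coding

def align(cst_stages, cst_start, psg_start, psg_end):
    """Align a CST's per-second stage labels to the PSG recording window."""
    if cst_start < psg_start:
        # Stages measured before PSG scoring began are discarded
        cst_stages = list(cst_stages)[psg_start - cst_start:]
    else:
        # Devices that started measuring late are front-filled with wake
        cst_stages = [WAKE] * (cst_start - psg_start) + list(cst_stages)
    # The end point is aligned with the end of PSG
    return cst_stages[:psg_end - psg_start]

print(align([1, 1, 2, 2, 3], cst_start=2, psg_start=0, psg_end=6))  # [0, 0, 1, 1, 2, 2]
```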

Because sleep stages can change at every epoch, agreement results may be inaccurate if the 30-second epoch grids of the devices start at different times. Therefore, the sleep stage results of all devices, including polysomnography, were segmented into 1-second intervals and compared every second. This approach enabled a more precise comparison of sleep stage results between polysomnography and CST measurements and eliminated potential bias.
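The 1-second segmentation can be illustrated as follows (toy epoch sequences, not the study data):

```python
EPOCH_SEC = 30  # standard polysomnography epoch length in seconds

def to_seconds(epoch_labels, epoch_len=EPOCH_SEC):
    # Repeat each 30-second epoch label once per second
    return [label for label in epoch_labels for _ in range(epoch_len)]

psg_sec = to_seconds([0, 1, 2])                  # 3 PSG epochs -> 90 second labels
dev_sec = [0] * 15 + to_seconds([0, 1, 2])[:75]  # device epoch grid shifted by 15 s

# Second-by-second agreement is well defined despite the offset epoch grids
agreement = sum(a == b for a, b in zip(psg_sec, dev_sec)) / len(psg_sec)
print(round(agreement, 3))
```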


Results

Epoch-by-Epoch Analysis: Overall Performance

presents the results of epoch-by-epoch agreements between polysomnography and each of the 11 CSTs under the sleep stage classification. SleepRoutine (airable) demonstrated the highest macro F1 score of 0.6863, which was closely followed by Amazon Halo Rise (nearable), with a macro F1 score of 0.6242. In terms of Cohen κ, a measure of interrater agreement, 3 wearables (Google Pixel Watch, Galaxy Watch 5, and Fitbit Sense 2), 1 nearable (Amazon Halo Rise), and 1 airable (SleepRoutine) demonstrated moderate agreement with sleep stage classification (κ=0.4-0.6). On the other hand, 2 wearables (Apple Watch 8 and Oura Ring 3), 1 nearable (Withings Sleep Tracking Mat), and 1 airable (SleepScore) showed a fair level of agreement (κ=0.2-0.4). Finally, Google Nest Hub 2 (nearable) and Pillow (airable) exhibited only a slight level of agreement across sleep stage classifications. The performance of CSTs was assessed in 2 distinct institutions, where the macro F1 scores, averaged over all devices in each institution, were 0.4973 and 0.4876 at SNUBH and CLC, respectively. There was no significant difference in performance between these 2 locations. Among the 11 CSTs evaluated, 5 (Galaxy Watch 5, Apple Watch 8, Amazon Halo Rise, Pillow, and SleepRoutine) exhibited better performance at SNUBH, while the remaining CSTs demonstrated superior performance at CLC.

Table 2. Epoch-by-epoch agreement: classification of 4 sleep stages. The first 4 columns report overall results; the last 2 report macro F1 by institution.

Variable | Accuracy | Weighted F1 | Cohen κ | Macro F1 | SNUBHa macro F1 | CLCb macro F1
Airable
SleepRoutine (n=67)c | 0.7106d | 0.7166d | 0.5565d | 0.6863d | 0.7188d | 0.6551d
SleepScore (n=38) | 0.4329 | 0.4472 | 0.2065 | 0.4049 | 0.4408 | 0.3094
Pillow (n=74) | 0.2830 | 0.2906 | 0.0741 | 0.2588 | 0.2604 | 0.2564
Nearable
Withings Sleep Tracking Mat (n=75) | 0.4921 | 0.5007 | 0.2455 | 0.4496 | 0.4205 | 0.4837
Google Nest Hub 2 (n=33) | 0.4121 | 0.4089 | 0.0644 | 0.3009 | 0.2676 | 0.3299
Amazon Halo Rise (n=28) | 0.6634 | 0.6706 | 0.4807 | 0.6242 | 0.6231 | 0.6031
Wearable
Google Pixel Watch (n=30) | 0.6355 | 0.6143 | 0.4044 | 0.5669 | 0.5381 | 0.5925
Galaxy Watch 5 (n=22) | 0.6494 | 0.6499 | 0.4177 | 0.5761 | 0.6261 | 0.5651
Fitbit Sense 2 (n=26) | 0.6464 | 0.6296 | 0.4185 | 0.5814 | 0.5130 | 0.6268
Apple Watch 8 (n=26) | 0.5640 | 0.5731 | 0.2976 | 0.4910 | 0.5436 | 0.4203
Oura Ring 3 (n=53) | 0.5427 | 0.5518 | 0.3492 | 0.5186 | 0.5187 | 0.5211

aSNUBH: Seoul National University Bundang Hospital.

bCLC: Clionic Lifecare Clinic.

cThe number in parenthesis indicates the number of participants tested with each device.

dTop-performing consumer sleep tracker.

Epoch-by-Epoch Analysis: Performance According to Sleep Stages

The performance of various sleep trackers across different sleep stages is presented in . For the wake and REM stages, SleepRoutine (airable) achieved the highest F1 scores of 0.7065 and 0.7596, respectively. These scores substantially surpassed those of the second-best tracker, Amazon Halo Rise (nearable), by margins of 0.1098 for the wake stage and 0.0313 for the REM stage. For the deep stage, Google Pixel Watch and Fitbit Sense 2, both wearables, exhibited superior performance with F1 scores of 0.5933 and 0.5564, respectively. Google Pixel Watch achieved the highest performance by a substantial margin, surpassing Fitbit Sense 2 by 0.0368 and the third-ranked SleepRoutine by an even larger margin of 0.0567. For the light stage, several sleep trackers, including 3 wearables (Google Pixel Watch, Galaxy Watch 5, and Fitbit Sense 2), 1 nearable (Amazon Halo Rise), and 1 airable (SleepRoutine), demonstrated similarly high performance, with F1 scores ranging from 0.7142 to 0.7436. Additional detailed assessments of sleep stage performance, including accuracy, weighted F1, and area under the receiver operating characteristic curve metrics, are presented in -.

presents the confusion matrices for the sleep stages of the 11 CSTs, providing a clear visual representation of prediction biases and misclassification. presents the mean and variance of predicted values across participants. Analysis of the average tendencies across all devices revealed a prediction bias toward the light sleep stage. Google Nest Hub 2 (nearable) showed the largest bias toward the light stage among all the devices. Unlike other devices, Pillow (airable) was highly biased toward the deep stage, predicting 59% of epochs as deep, whereas only 10.8% of epochs were deep based on the results of polysomnography. The confusion matrices also revealed distinct patterns of misclassification in sleep stage prediction for device types. Wearables primarily misclassified wake as light, while nearables strongly misclassified REM as light. Airables, on the other hand, demonstrated a relatively higher frequency of confusion between the light and deep stages. presents a comparison of hypnograms illustrating the epoch-by-epoch agreement at the individual level, which facilitated the evaluation of agreement between CSTs and polysomnography in a time-series format. Additional hypnograms are presented in .
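For reference, a row-normalized confusion matrix of this kind can be computed as in the following sketch (toy labels, not the study data); each row sums to 1 so that cells read as the fraction of true polysomnography epochs assigned to each CST stage:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 0=wake, 1=light, 2=deep, 3=REM (coding from Data Preprocessing)
psg = [0, 0, 1, 1, 1, 2, 3, 3]
cst = [0, 1, 1, 1, 2, 2, 3, 1]

cm = confusion_matrix(psg, cst, labels=[0, 1, 2, 3])  # rows: PSG, columns: CST
cm_norm = cm / cm.sum(axis=1, keepdims=True)          # normalize each row to 1
print(cm_norm)
```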

Regarding , the division of groups was necessary owing to the limited number of watches worn simultaneously, as explained in the Methods section. As 9 devices were used simultaneously for each subject, the hypnograms for each device are presented, with the polysomnography result displayed at the top. As shown in , SleepRoutine, Amazon Halo Rise, and Galaxy Watch 5 exhibited more frequent transition of stages and predicted wake in the middle of sleep more frequently, resulting in better estimation of WASO, as shown in the analysis of sleep parameters.

Table 3. Epoch-by-epoch agreement: classification for detecting individual sleep stage. Each cell reports F1 / sensitivity / specificity.

Variable | Wake stagea | Light stagea | Deep stagea | REMb stagea
Airable
SleepRoutine (n=67)c | 0.7065d / 0.7246d / 0.9269 | 0.7436d / 0.7054 / 0.7665d | 0.5355 / 0.6712 / 0.8973 | 0.7596d / 0.7394 / 0.9609d
SleepScore (n=38) | 0.4057 / 0.3665 / 0.8696 | 0.5147 / 0.4355 / 0.7272 | 0.3574 / 0.5247 / 0.8264 | 0.3418 / 0.4587 / 0.7895
Pillow (n=74) | 0.2828 / 0.1934 / 0.9572 | 0.3409 / 0.2490 / 0.7534 | 0.2673 / 0.8594d / 0.4449 | 0.1440 / 0.1140 / 0.9126
Nearable
Withings Sleep Tracking Mat (n=75) | 0.4419 / 0.4172 / 0.8854 | 0.5764 / 0.5328 / 0.6336 | 0.3800 / 0.5633 / 0.8270 | 0.4001 / 0.3964 / 0.8906
Google Nest Hub 2 (n=33) | 0.3296 / 0.3068 / 0.8649 | 0.5619 / 0.5772 / 0.4518 | 0.1245 / 0.1308 / 0.8883 | 0.1876 / 0.1805 / 0.8514
Amazon Halo Rise (n=28) | 0.5967 / 0.6612 / 0.8921 | 0.7142 / 0.6609 / 0.7484 | 0.4575 / 0.5467 / 0.9018 | 0.7283 / 0.7490d / 0.9401
Wearable
Google Pixel Watch (n=30) | 0.3456 / 0.2277 / 0.9784d | 0.7150 / 0.7657 / 0.5620 | 0.5922d / 0.6937 / 0.9290 | 0.6146 / 0.6548 / 0.9029
Galaxy Watch 5 (n=22) | 0.4755 / 0.4814 / 0.9104 | 0.7346 / 0.7280 / 0.6412 | 0.4963 / 0.4752 / 0.9481d | 0.5982 / 0.6265 / 0.9058
Fitbit Sense 2 (n=26) | 0.3807 / 0.2714 / 0.9602 | 0.7262 / 0.7734d / 0.5727 | 0.5564 / 0.6710 / 0.9247 | 0.6623 / 0.6812 / 0.9297
Apple Watch 8 (n=26) | 0.5493 / 0.4481 / 0.9624 | 0.6680 / 0.6649 / 0.5737 | 0.3073 / 0.4130 / 0.8412 | 0.4394 / 0.4276 / 0.9070
Oura Ring 3 (n=53) | 0.4527 / 0.3822 / 0.9264 | 0.5953 / 0.5072 / 0.7630 | 0.4272 / 0.7784 / 0.7974 | 0.5993 / 0.7118 / 0.8716

aIndividual sleep stage classification was used to categorize each class and the remaining classes.

bREM: rapid eye movement.

cThe number in parenthesis indicates the number of participants tested with each device.

dTop-performing consumer sleep tracker.

Figure 3. Normalized confusion matrices for 11 consumer sleep trackers (CSTs). Four-stage sleep classification confusion matrices comparing CSTs. Each row in the confusion matrix is the sleep stage annotated by polysomnography, while each column represents the sleep stage annotated by the CST. REM: rapid eye movement.

Figure 4. Sample hypnograms of 11 consumer sleep trackers (CSTs) involving 2 subjects in different groups. Hypnogram samples for each CST were selected based on the last measured subjects in multi-tracker group A (female; age, 35 years; BMI, 30.1; apnea-hypopnea index [AHI], 2.9) and multi-tracker group B (female; age, 26 years; BMI, 20; AHI, 3.5). PSG: polysomnography.

Sleep Measure Analysis

presents the Bland-Altman plots of CSTs, illustrating the performance of sleep measurements, including sleep efficiency, sleep latency, and REM latency, compared with polysomnography. The average polysomnography sleep efficiency ranged from 77.57% to 86.05%, while the bias for each CST varied from −3.4909 percentage points (Amazon Halo Rise) to 12.8035 percentage points (Google Pixel Watch). Polysomnography values for sleep latency ranged from 10.80 minutes to 19.80 minutes, with CST biases ranging from −0.81 minutes (Apple Watch 8) to 39.42 minutes (Google Nest Hub 2). Polysomnography values for REM latency ranged from 87.00 minutes to 112.20 minutes, with CST biases ranging from −49.89 minutes (Amazon Halo Rise) to 65.29 minutes (Google Pixel Watch). Different devices performed best on each sleep measure. For sleep efficiency, Galaxy Watch 5 (wearable) achieved a minimal bias of −0.4 percentage points. For sleep latency, Apple Watch 8 (wearable) exhibited the smallest bias at −0.81 minutes. Lastly, SleepRoutine (airable) demonstrated the best performance for REM latency with a bias of 1.85 minutes. The proportional bias, presented as “r” in , indicates whether the bias varied with the magnitude of the measurement. Oura Ring and SleepRoutine showed no proportional bias (ie, no significant correlation in the Bland-Altman plot) for any sleep measure. The difference in mean values between polysomnography and each CST for each sleep measure is described in . Additional information is provided in .
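The Bland-Altman quantities reported above (mean bias “b,” limits of agreement, and the proportional-bias correlation “r”) can be computed as in this sketch; the sleep-efficiency values are fabricated for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Fabricated sleep-efficiency values (%) for 5 participants
psg = np.array([82.0, 75.5, 90.1, 68.0, 85.2])
cst = np.array([85.0, 80.2, 91.0, 75.5, 86.0])

diff = cst - psg                 # device minus reference
mean = (cst + psg) / 2
bias = diff.mean()               # "b" in the plots
sd = diff.std(ddof=1)
loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # limits of agreement
r, p = pearsonr(mean, diff)      # significant r indicates proportional bias
print(bias, loa, r, p)
```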

Figure 5. Bland-Altman plots of consumer sleep trackers (CSTs) and polysomnography (PSG) for sleep efficiency, sleep latency, and rapid eye movement (REM) latency measurements. The plots present the mean bias (middle horizontal black solid line), and upper (upper horizontal black dashed line) and lower (lower horizontal black dashed line) limits of agreement. In the figure, “b” represents bias and “r” denotes the Pearson correlation coefficient between the mean of measurements and the difference between the CSTs and PSG. The correlation coefficient is displayed along with its corresponding P value. The red line indicates the estimated linear regression line.

Subgroup Analysis

Subgroup analyses were conducted for all devices, considering factors including sex assigned at birth, AHI, sleep efficiency, and BMI. The macro F1 scores for each subgroup are presented in . presents the subgroup analysis results of epoch-by-epoch agreement for the AHI. The average performance of CSTs varied only modestly with these parameters. In terms of BMI, the average macro F1 score was 0.5043 for individuals with a BMI of ≤25 kg/m2, whereas it dropped to 0.4790 for those with a BMI of >25 kg/m2, a gap of 0.0253. Similarly, for sleep efficiency, the scores were 0.4757 for individuals with a sleep efficiency of ≤85% and 0.4902 for those with a sleep efficiency of >85%, a difference of 0.0145. For the AHI, the scores were 0.4905 for an AHI of ≤15 and 0.5024 for an AHI of >15, a difference of 0.0119. In contrast, the difference between male and female subgroups was minimal, with macro F1 scores of 0.4926 for males and 0.4932 for females, a negligible difference of 0.0006. Within each subgroup, the largest variations were observed with the airable SleepScore for AHI (difference: 0.0929), the wearable Google Pixel Watch for sleep efficiency (difference: 0.1067), the wearable Galaxy Watch 5 for BMI (difference: 0.0785), and the airable SleepScore for sex assigned at birth (difference: 0.0872). and present the subgroup analysis results of epoch-by-epoch agreement in the institutions. Additionally, and provide an overview of the average macro F1 scores individually calculated for each participant.

Table 4. Epoch-by-epoch agreement: subgroup analysis of the apnea-hypopnea index (AHI) and demographic characteristics. Values are macro F1 scores.

Variable | AHI ≤15 | AHI >15 | Sleep efficiency ≤85% | Sleep efficiency >85% | BMI ≤25 | BMI >25 | Male | Female
Airable
SleepRoutine (n=67)a | 0.6536b | 0.7320b | 0.6971b | 0.6490b | 0.6840b | 0.6889b | 0.7137b | 0.6568b
SleepScore (n=38) | 0.3636 | 0.4565 | 0.4107 | 0.3808 | 0.4118 | 0.3937 | 0.4431 | 0.3559
Pillow (n=74) | 0.2602 | 0.2567 | 0.2472 | 0.2567 | 0.2601 | 0.2548 | 0.2670 | 0.2446
Nearable
Withings Sleep Tracking Mat (n=75) | 0.4644 | 0.4225 | 0.3766 | 0.4653 | 0.4760 | 0.3964 | 0.4587 | 0.4355
Google Nest Hub 2 (n=33) | 0.3000 | 0.3059 | 0.3115 | 0.2762 | 0.3209 | 0.2517 | 0.2889 | 0.3059
Amazon Halo Rise (n=28) | 0.6160 | 0.6389 | 0.6297 | 0.5857 | 0.6414 | 0.5801 | 0.6075 | 0.6491
Wearable
Google Pixel Watch (n=30) | 0.5670 | 0.5626 | 0.5035 | 0.6102 | 0.5653 | 0.5791 | 0.5235 | 0.5956
Galaxy Watch 5 (n=22) | 0.5701 | 0.5790 | 0.6029 | 0.5547 | 0.5521 | 0.6306 | 0.5655 | 0.5867
Fitbit Sense 2 (n=26) | 0.5839 | 0.5753 | 0.5325 | 0.6090 | 0.5910 | 0.5541 | 0.5320 | 0.6129
Apple Watch 8 (n=26) | 0.4861 | 0.4950 | 0.4326 | 0.4804 | 0.5093 | 0.4561 | 0.5263 | 0.4414
Oura Ring 3 (n=53) | 0.5302 | 0.5021 | 0.4882 | 0.5245 | 0.5354 | 0.4830 | 0.4926 | 0.5405

aThe number in parenthesis indicates the number of participants tested with each device.

bTop-performing consumer sleep trackers.


Discussion

Principal Findings

We conducted an extensive analysis of 11 CSTs involving 75 subjects, which, to the best of our knowledge, represents the largest number of devices simultaneously evaluated in the literature [,,,,,]. The findings are illustrated in , which presents the relative performances of the 11 CSTs in estimating sleep stages and sleep measures. Our findings revealed that Google Pixel Watch, Galaxy Watch 5, and Fitbit Sense 2 demonstrated competitive performance among wearables, while Amazon Halo Rise and SleepRoutine stood out among nearables and airables, respectively.

Figure 6. Relative performance rank heatmap of 11 consumer sleep trackers (CSTs). Heatmap of the relative performance of sleep stage and classification of sleep measures normalized to the highest and lowest macro F1 values in each CST. D: deep; EBE: epoch-by-epoch agreement; L: light; R: rapid eye movement; RL: rapid eye movement latency; SE: sleep efficiency; SL: sleep latency; W: wake.

Wearables

Wearables, including watch- and ring-type sleep trackers, represent the most prevalent CSTs in the market [,]. They employ photoplethysmography sensors and accelerometers to measure cardiac activity (eg, heart rate variability) and body movements. Given their reliance on similar biosignals for sleep tracking, wearables exhibit consistent patterns in estimating sleep stages. First, most wearables overestimate sleep by misclassifying wake stages, leading to a substantial negative proportional bias in estimating sleep efficiency and thus worse performance for individuals with low sleep efficiency (). This bias has also been observed when actigraphy was used to measure sleep efficiency and WASO [,], and can be attributed to the dependence of both actigraphy and wearables on body movement to determine sleep-wake states. Individuals with insomnia often lie still in bed while trying to sleep, even though they are actually awake []; as a result, these periods of wakefulness can be misinterpreted as sleep. Nevertheless, Oura Ring showed negligible proportional bias, potentially owing to its use of additional features, such as body temperature and circadian rhythm, for sleep staging []. Second, the 3 top-performing wearables demonstrated substantial alignment in the classification of the deep stage. In particular, the results from Oura Ring 3 and Fitbit Sense 2 in this study showed improved accuracy in sleep stage detection compared with previous studies that assessed earlier versions of Oura Ring and Fitbit [,]. Wearables may thus be well suited to detecting deep sleep, given its unique association with autonomic nervous system stabilization: heart rate variability, a key indicator of autonomic nervous system activity, can be directly measured by photoplethysmography sensors [].

Nearables

Nearables, encompassing pad- and motion sensor-type sleep trackers, use overall body movements and respiratory efforts (thoracic and abdominal) for sleep monitoring. Similar to wearables, nearables also exhibit aligned tendencies. First, all nearables tend to overestimate sleep onset latency, resulting in a significant mean bias (29.02 minutes for nearables vs −2.71 minutes for wearables and 4.34 minutes for airables) and a significant positive proportional bias in sleep latency measurement. This indicates that nearables may overestimate sleep latency, particularly in individuals with prolonged sleep latency. During extended periods of attempting to fall asleep in bed, users may experience increased restlessness and movement, which makes it challenging for nearables to estimate the sleep stage using radar-like sensors []. Second, unlike wearables, nearables demonstrated the least sensitivity in deep stage classification (as shown in ). Distinguishing deep sleep from light sleep based on variations in respiratory patterns requires precise monitoring of respiratory activity. However, the radar-like sensors employed by nearables, while efficient at detecting larger body movements, have difficulty capturing smaller fidgeting movements, which makes it challenging to accurately identify the deep sleep stage.

Airables

Airables excel in terms of accessibility by not requiring the purchase of additional hardware. However, their clinical validation is less well established; previous studies up to 2022 highlighted the limited agreement between airables and polysomnography as a notable limitation [,]. In this study, our aim was to validate the latest airable CSTs. One distinguishing feature of airable CSTs is their use of diverse sensor types, and their performance varies substantially with the specific sensor type and accompanying algorithm. Thus, we chose 3 types of airable CSTs for diversity (microphone-, ultrasound-, and accelerometer-based applications). These methodological distinctions contribute to pronounced variations in sleep stage determination. Pillow requires placement on the mattress and uses the smartphone’s accelerometer to detect user movements through the mattress. Notably, Pillow showed a prediction bias toward the deep stage, suggesting that movement information during sleep was insufficient for accurate sleep stage determination. SleepScore uses a sonar biomotion sensor: the smartphone’s speaker is directed toward the chest area to emit ultrasonic signals above 18 kHz, tracking thoracic respiratory effort. Owing to the biosignal it uses, SleepScore shows tendencies similar to those of nearables, demonstrating a substantial mean bias and positive proportional bias in estimating sleep onset latency. SleepRoutine analyzes the sound recorded during sleep []. Sleep sounds provide a wealth of sleep-related information, including changes in breathing regularity linked to autonomic nervous system stabilization, changes in breathing sound characteristics (such as tone, pitch, and amplitude) due to altered respiratory muscle tension, and noise from body movements. Among all CSTs, SleepRoutine exhibited the highest accuracy in predicting the wake and REM stages.

REM Sleep Stage Estimation Performance

REM was the stage where most CSTs demonstrated relatively higher agreement with polysomnography compared with other stages. Among the top 5 CSTs with the highest macro F1 scores (SleepRoutine, Amazon Halo Rise, Fitbit Sense 2, Galaxy Watch 5, and Google Pixel Watch), the REM stage showed a substantially higher average F1 score of 0.672, compared with 0.501 for wake and 0.528 for deep sleep. This can be attributed to the unique characteristics of REM sleep, which include increased irregularity in heart rate and breathing, minimal muscle movement, and rapid variations in blood pressure and body temperature []. These features allow easy detection of different types of biosignals and accurate classification of REM sleep.

Cost-Effectiveness

We evaluated the costs of the 11 sleep tracking technologies based on the costs analyzed in . Wearables, with an average price of US $386, offer a wide range of functions beyond sleep tracking, including messaging and various apps. The Oura Ring, while lacking these supplementary features, provides a broad spectrum of health tracking functions. Nearables, with an average price of US $123, vary in features across models: Google Nest Hub and Amazon Halo Rise offer extras such as an IoT hub and a wake-up light, whereas Withings Sleep Mat is designed exclusively for sleep tracking. Airables, as app-based technologies, harness smartphone sensors for sleep tracking and require only a subscription fee of US $53 with no additional hardware. This economical and flexible option, which can be easily canceled, represents a cost-effective solution for sleep tracking.

Standardized Validation and Data Transparency

Standardized methods of validation and data transparency are crucial for comparing sleep trackers [], particularly given the increasing use of deep learning algorithms whose inner workings are often opaque. In our study, we adhered to established frameworks for standardized validation [,], while also conducting multicenter evaluations across diverse demographic factors. Regarding data transparency, we provided comprehensive details of our validation; however, obtaining access to the training data of each CST was challenging. Transparency in both training and validation data is essential for building trustworthy artificial intelligence models and can also contribute to a better understanding of CSTs [].

Limitations

It is important to note the limitations of our study. First, data collection rates varied significantly between the 2 institutions because the study was implemented independently at each site; issues such as battery management, account management, and human error resulted in data omissions. Second, demographic differences were detected between the institutions, including disparities in time spent in bed and total sleep time; operational issues led to slightly earlier waking of participants at CLC. Third, this study focused solely on the Korean population, limiting our ability to analyze performance differences among races. Future studies should incorporate multiracial comparisons and evaluate CST performance across diverse home environments for more realistic assessments.

Conclusions

Our study provides a comprehensive comparative analysis of the accuracy of 11 CSTs in tracking sleep in a sleep laboratory setting. The objective of this study was to gain insights into the performance and capabilities of these CSTs. Personalized sleep health management requires that individuals be able to make informed choices for monitoring and improving sleep quality. Our findings therefore emphasize the importance of understanding the characteristics and limitations of these devices, and they lay the foundation for guiding the future development of sleep trackers. Accordingly, future studies should focus on developing accurate sleep stage classification systems by integrating different types of biosignals in the home environment.

Conflicts of Interest

SleepRoutine, one of the consumer sleep trackers included in this study, was developed and is operated by Asleep, and the deep learning algorithm used in SleepRoutine was trained using data from Seoul National University Bundang Hospital, with which some of the authors are affiliated. These entities played no role in study design, data collection, analysis, and interpretation.
