Sleep Tracking of Two Smartwatches Against Self-Reported Logs for Circadian Rhythm and Sleep Quality Assessment in Healthy Adults
Abstract
Objectives
Although many wearable devices are used to assess sleep, their accuracy remains controversial. This study aimed to investigate the accuracy of the Actiwatch, a research-grade device, and the Fitbit, a consumer-grade device, against sleep diaries to assess sleep patterns.
Methods
Twenty participants wore Fitbit and Actiwatch for two weeks and tracked their sleep patterns using sleep diaries. Total sleep time (TST), time-in-bed (TIB), sleep efficiency (SE), sleep onset latency (SOL), and wake after sleep onset (WASO) from the two devices and sleep diaries were analyzed using analysis of variance and Bland-Altman analysis.
Results
The TIB measured by the sleep log, Fitbit, and Actiwatch were 420.9 minutes, 417.3 minutes, and 567.4 minutes, respectively. Compared to the sleep log, the Fitbit underestimated TST, TIB, and SE, with significant differences observed for TST (p<0.001) and SE (p<0.001), but not for TIB. The Actiwatch overestimated TIB (p<0.001) and TST (p=0.02) and underestimated SE (p<0.001) compared to the sleep log. The difference between the Fitbit and Actiwatch was significant for TST, TIB, and SE (all p<0.001).
Conclusions
The Fitbit showed a smaller difference than the Actiwatch when compared with the sleep logs. The Fitbit could be used as a tool to assess sleep patterns in the clinic as well as in daily life.
INTRODUCTION
Many recent studies have reported that appropriate sleep duration could decrease the risk of various diseases, including cardiovascular disease [1], cancer [2], and cognitive disorders [3]. In addition, insufficient sleep, defined as a curtailed sleep pattern lasting for at least 3 months, is prevalent across most age groups and is increasing globally [4]. Sufficient sleep is essential for maintaining good health and improving quality of life. Therefore, efforts to assess sleep duration using various methods, such as smartwatches and sleep diaries, have increased.
Although polysomnography (PSG) is considered the gold standard for diagnosing sleep disorders, it can cause sleep disturbances and anxiety, which affect habitual sleep duration or sleep efficiency (SE) [5]. Various types of wearable and non-wearable devices have been developed and validated to assess sleep duration [6,7]. These devices are divided into research-grade devices, which are based on accelerometers and require expertise in data collection and analysis, and consumer-grade devices, which provide immediate feedback and include associated free applications [8].
Previous studies validating consumer-grade smartwatches against PSG have shown that smartwatches tend to overestimate total sleep time (TST), underestimate sleep onset latency (SOL), and show relatively low specificity [9,10]. Ferguson et al. [8] reported that consumer-grade devices showed a strong correlation with the research-grade accelerometers (BodyMedia SenseWear), but these devices overestimated sleep duration. Another study comparing the smartwatch (Samsung Watch) to medical-grade actigraphy (wGT3X-BT) showed longer TST and lower wake after sleep onset (WASO) in the smartwatch compared to the actigraphy, although the differences remained within satisfactory ranges [11]. While several studies have reported that the smartwatches accurately assess sleep duration [12,13], controversy about their performance persists.
To assess sleep duration in real-world settings, the performance of consumer-grade devices should be assessed and validated, rather than relying on costly research-grade devices or PSG, which require a controlled setting. In this study, we compared the performance of a research-grade device, the Actiwatch, and a consumer-grade device, the Fitbit, against sleep diaries to assess sleep patterns.
METHODS
Participants
The study was conducted in South Korea between November 2022 and February 2023. The participants were adults aged 19 years or older without severe diseases that required treatment or medication. Participants provided written informed consent after receiving a detailed explanation of the study. They were instructed to withdraw from the study if they failed to wear the devices for more than four hours a day on more than three days during the study period. This study was approved by the Institutional Review Board of Korea Institute of Oriental Medicine (IRB number: I–2210/010–003–01).
Actigraphy and Fitbit
To measure the circadian rhythm, the Actiwatch 2 (Royal Philips) and the Fitbit Inspire 2 (Fitbit Inc.) were used. The Actiwatch 2 collected data, including sleep, rest, activity, and light exposure, in 60-s epochs. For the Actiwatch 2, the sleep start time, sleep end time, sleep duration, SE, and WASO were recorded. The Fitbit Inspire 2 was used to measure sleep onset time, wake-up time, sleep hours, sleep score, awakening count, awakening duration, and hours in each sleep phase (rapid eye movement [REM], non-REM, shallow sleep, deep sleep, and awakening). The Fitbit data were recorded using the Fitbit smartphone app. Both devices were wristband-type, and the participants were asked to wear them on their non-dominant wrist. The participants were instructed to continue their normal activities and wear the devices at all times for 14 days, except when charging them or taking a shower. During this period, there were no specific restrictions regarding sleep location or light control.
Self-reported log
For comparison, all participants were asked to record their sleep onset time, wake-up time, sleep duration, and time taken to fall asleep daily. Additionally, they assessed their sleep quality daily using a 5-point Likert scale consisting of very bad, bad, intermediate, good, and very good. After 14 days of data collection, participants returned to the research center to submit their self-reported logs and smartwatches. SE was calculated based on TST and time-in-bed (TIB).
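As a minimal sketch of this calculation, assuming SE is expressed in the conventional way as the percentage of TIB spent asleep (the function name and the example night below are hypothetical and are not taken from the study data):

```python
def sleep_efficiency(tst_min: float, tib_min: float) -> float:
    """Return sleep efficiency (%) as TST / TIB * 100.

    Both arguments are durations in minutes for a single night.
    """
    if tib_min <= 0:
        raise ValueError("time in bed must be positive")
    if tst_min < 0 or tst_min > tib_min:
        raise ValueError("total sleep time must lie between 0 and time in bed")
    return 100.0 * tst_min / tib_min


# Hypothetical night: 400 minutes asleep out of 440 minutes in bed.
example_se = sleep_efficiency(400, 440)
print(f"SE = {example_se:.1f}%")
```

Note that when SE is computed per night and then averaged, the result need not equal the ratio of the average TST to the average TIB.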
Statistical analysis
The data were analyzed using the R software (version 4.2.1, R Foundation for Statistical Computing). TST, TIB, SE, SOL, and WASO were compared between the self-reported logs and two smartwatches. Analysis of variance was used to compare sleep variables between the sleep logs and the two devices. Bland-Altman analysis was used to calculate the mean difference between the devices for each comparison. Descriptive analyses of height, weight, body mass index, age, alcohol consumption, smoking status, fatigue severity, and chronic diseases were conducted. Based on the question “Did you experience memory loss or being confused is getting severe/frequent within recent one year?” used in the Korean Community Health Survey [14], a decrease in cognitive function was assessed.
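The Bland-Altman step can be illustrated with a short Python sketch. The analysis in this study was run in R, so this is not the authors' code. It assumes the conventional limits of agreement (bias ± 1.96 SD of the paired differences), and the nightly values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the paired differences a - b; the limits of
    agreement are bias +/- 1.96 * SD of those differences (an assumption:
    the conventional choice, not confirmed by the study text).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two paired series of equal length (n >= 2)")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical TST values in minutes for five nights (not study data).
log_tst = [400, 420, 390, 410, 430]
fitbit_tst = [370, 380, 365, 372, 395]

bias, lower, upper = bland_altman(log_tst, fitbit_tst)
print(f"bias = {bias:.1f} min, limits of agreement = ({lower:.1f}, {upper:.1f})")
```

A positive bias here means the first method (the sleep log) reports longer sleep than the second (the device).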
To assess the consistency of sleep quality measures, the 5-point Likert scores from the self-reported logs, the sleep scores from the Fitbit, and the SEs calculated from the sleep log, Fitbit, and Actiwatch data were compared. Correlations were analyzed using Spearman’s correlation for the Likert score and Pearson’s correlation for the other variables.
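The distinction between the two coefficients can be sketched as follows. Spearman's coefficient is Pearson's coefficient applied to ranks, which suits an ordinal Likert score. The helper names and sample values are hypothetical, and tied values are assumed to receive average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (n >= 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical nightly values (not study data).
likert = [1, 2, 3, 4, 5]          # self-rated sleep quality
sleep_score = [60, 70, 65, 80, 90]  # device sleep score
rho = spearman(likert, sleep_score)
print(f"Spearman rho = {rho:.2f}")
```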
RESULTS
All 20 participants (6 men and 14 women) were included in the analysis. Data from all 14 days were analyzed, except for 6 nights from the Fitbit and 4 nights from the Actiwatch, which were identified as outliers on boxplots of the sleep variables and excluded. The average Pittsburgh Sleep Quality Index (PSQI) score of the participants was 5.3, and 13 participants (65.0%) were classified as good sleepers. Only one participant had hypertension, and two participants had hyperlipidemia (Table 1).
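The boxplot-based exclusion mentioned above can be sketched in Python. The text does not state the exact rule, so this sketch assumes the common convention of flagging values more than 1.5 times the interquartile range beyond the quartiles. Quartile definitions differ slightly between tools (R's boxplot uses hinges), and the function name and sample values are hypothetical.

```python
from statistics import quantiles


def boxplot_outlier_mask(values, k=1.5):
    """Return a list of booleans, True where a value is a boxplot outlier.

    Assumption: outliers lie below Q1 - k*IQR or above Q3 + k*IQR,
    with quartiles from statistics.quantiles(method="inclusive").
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical nightly TIB values in minutes (not study data).
tib = [410, 420, 415, 430, 425, 405, 900]
mask = boxplot_outlier_mask(tib)
kept = [v for v, is_out in zip(tib, mask) if not is_out]
print(kept)
```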
TIB was 420.9 minutes in the sleep log and 417.3 minutes with the Fitbit, whereas the Actiwatch overestimated TIB at 567.4 minutes. The Actiwatch also overestimated TST (420.5 minutes) compared with the sleep log (402.8 minutes) and the Fitbit (369.3 minutes). SE was highest in the sleep log at 95.7%, while it was only 88.5% for the Fitbit and 75.5% for the Actiwatch. SOL was measured only in the sleep logs and by the Actiwatch and was significantly longer with the Actiwatch (65.2 minutes) than in the sleep logs (16.7 minutes) (p<0.001). The measured WASO did not differ significantly between the Fitbit (48.5 minutes) and the Actiwatch (45.9 minutes). The duration of each sleep stage, such as REM, light sleep, and deep sleep, was measured only by the Fitbit. Non-REM sleep, which involves less brain activity than REM sleep, comprises light and deep sleep. During light sleep, the body and mind slow down, but people can still wake up easily. During deep sleep, the body slows down even further and overall brain activity decreases. The Fitbit tracks heart rate variability, which helps to measure the duration of light and deep sleep [15]. The average duration of REM sleep was 87.7 minutes, light sleep was 206.9 minutes, and deep sleep was 66.0 minutes (Table 2).
In the comparison between the sleep log and the Fitbit, only TIB did not differ significantly, with a difference of 3.2 minutes (p=0.39). The Fitbit underestimated TST by 33.1 minutes (95% confidence interval [CI]: 25.9, 40.2) and SE by 7.2% (95% CI: 6.4, 7.9), and both differences were significant (both p<0.001) (Fig. 1). Compared with the sleep log, the Actiwatch overestimated TIB by 146.8 minutes (95% CI: 126.0, 167.5) and TST by 17.4 minutes (95% CI: 2.4, 32.3) while underestimating SE by 20.1% (95% CI: 18.0, 22.3). The differences between the sleep log and the Actiwatch were significant for all three variables: TIB (p<0.001), TST (p=0.02), and SE (p<0.001) (Fig. 2). TIB and TST measured using the Actiwatch were significantly longer than those measured using the Fitbit (both p<0.001). Compared with the Fitbit, the Actiwatch overestimated TIB by approximately 150 minutes (95% CI: 129.1, 171.5) and TST by 51.0 minutes (95% CI: 36.9, 65.1). Finally, SE measured by the Actiwatch was 12.9% lower than that measured by the Fitbit (p<0.001) (Fig. 3, Table 3).

Fig. 1. Bland-Altman plot comparing sleep log and Fitbit. A: Time-in-bed comparison between sleep log and Fitbit. B: Total sleep time comparison between sleep log and Fitbit. C: Sleep efficiency comparison between sleep log and Fitbit. SD, standard deviation.

Fig. 2. Bland-Altman plot comparing sleep log and Actiwatch. A: Time-in-bed comparison between sleep log and Actiwatch. B: Total sleep time comparison between sleep log and Actiwatch. C: Sleep efficiency comparison between sleep log and Actiwatch. SD, standard deviation.

Fig. 3. Bland-Altman plot comparing Fitbit and Actiwatch. A: Time-in-bed comparison between Fitbit and Actiwatch. B: Total sleep time comparison between Fitbit and Actiwatch. C: Sleep efficiency comparison between Fitbit and Actiwatch. SD, standard deviation.
While the Likert score of sleep quality in the sleep log showed a weak negative correlation with the sleep score from the Fitbit (r=-0.298, p<0.001), it was not significantly correlated with SE from the sleep log, Fitbit, or Actiwatch. The Fitbit sleep score was not significantly correlated with SE from the sleep logs, but was significantly correlated with SE from the Fitbit (r=0.418, p<0.001) and the Actiwatch (r=0.326, p<0.001). None of the correlations between the SE from the two devices and the SE from the sleep logs were significant (Table 4).
DISCUSSION
The results of this study showed that the Fitbit underestimated both TST and SE compared to the sleep log. A systematic review of Fitbit performance reported that the Fitbit overestimated sleep duration and SE by more than 10% compared to PSG or accelerometers [16]. However, the results of previous studies comparing the Fitbit with sleep logs are inconsistent. Although Brooke et al. [17] showed that the Fitbit Flex and Fitbit Charge HR overestimated TST by 8.8% and 11.5%, respectively, compared with sleep logs, many other studies reported contradictory results. One study showed that the Fitbit underestimated TST by 22 minutes and TIB by 43 minutes, while overestimating SE by 5% and WASO by 11 minutes; however, the Fitbit data remained within the 10% equivalence zone when compared to the sleep diaries [18]. Brazendale et al. [19] reported that the Fitbit underestimated TST and TIB relative to parents’ sleep logs of their children, and the correlation between the Fitbit and the sleep log was 0.71. Another study found that the Fitbit underestimated TST, SOL, and SE, while overestimating WASO, compared to sleep logs [20]. Consistent with most of these studies, the Fitbit in this study underestimated TIB, TST, and SE compared to the sleep logs. Park et al. [21] reported that the performance of the Fitbit could be influenced by various factors, including participants’ characteristics, sleep hours, analysis methods, and whether the device is worn on the dominant or non-dominant wrist. The accuracy of the Fitbit should be further investigated using various tools and considering these factors.
In this study, the Fitbit showed closer agreement with the sleep log than the Actiwatch did, and the difference between the Fitbit and the Actiwatch was significant. These results could be interpreted in two ways. First, sleep logs might not be a reliable criterion for validating the accuracy of sleep assessment devices. Second, the Fitbit may be a more accurate method for measuring sleep duration than the Actiwatch.
This study used sleep logs as the criterion for measuring sleep duration. Although PSG has been used as the criterion for sleep duration and quality assessment in many previous studies, its limitations, such as sleep disturbances and the need for a controlled environment, have been noted. While one systematic review on the accuracy of Fitbit devices used only PSG and accelerometers as criteria for validity [16], another systematic review included sleep logs as a reference [12]. In previous studies, self-reported subjective sleep duration was longer than the objective sleep duration measured using wrist actigraphy [22,23]. However, a contrasting result was observed in another study using PSG, where the difference was not statistically significant [24]. Additionally, one study reported that subjective sleep duration was longer than the objective sleep duration when the total sleep duration was less than seven hours [22]. However, among participants with self-reported insomnia, the subjective sleep duration was shorter than the objective duration [24].
Another hypothesis regarding the results of this study is that the Fitbit is more accurate than the Actiwatch. Research-grade devices, including the Actiwatch, are typically considered more accurate than consumer-grade devices, including the Fitbit, in assessing sleep. However, in a study by Ferguson et al. [8] that compared consumer-grade activity monitors and research-grade accelerometers, the Fitbit showed the strongest performance among the consumer-grade devices. Another study that assessed the accuracy of both consumer-grade and research-grade devices compared to PSG reported no significant differences among the devices in TST, even though the Fitbit differed from PSG in SE, whereas the Actiwatch did not [13]. Additionally, Cook et al. [25] compared the Fitbit and Actiwatch with PSG to estimate sleep and found that the Fitbit showed higher sensitivity, specificity, and accuracy than the Actiwatch. Therefore, the Fitbit could be an accurate and useful method for assessing sleep duration in both clinical settings and everyday life.
The Actiwatch measures bed-time and sleep time based on motion detected by its accelerometer. It interprets minimal motion as bed-time and the absence of motion as sleep time. This could explain why the Actiwatch overestimates TIB, which in turn leads to errors in SE and SOL. When using the Actiwatch to assess sleep, these characteristics should be considered.
One strength of this study is that it assessed sleep patterns over a two-week period. In contrast, previous studies investigating the validity of smartwatches have typically monitored sleep for a short duration, usually overnight [13,17,26,27] or for 48 hours [8,28]. The study verifying the number of days and weeks needed for a reliable sleep diary reported that a sleep diary is a reliable method when used for at least seven consecutive days, including weekends [29], rather than assessing sleep over a short period. Therefore, the results of this study provide more reliable comparisons.
The limitations of this study include the small sample size and the predominance of female participants, which limit the generalizability of the results. Further research should involve a larger, more diverse sample with a wider range of characteristics to enhance the validity and generalizability of the results.
In conclusion, although the difference between the Fitbit and the sleep logs was significant for TST and SE, but not for TIB, it was smaller than the difference observed between the Actiwatch and the sleep logs. Thus, the Fitbit can be considered a reliable method for assessing sleep.
Notes
The authors have no potential conflicts of interest to disclose.
Author Contributions
Conceptualization: Ji-Eun Park, Kyuhyun Yoon, Eunkyoung Ahn. Data curation: Ji-Eun Park. Formal analysis: Hoseok Kim. Funding acquisition: Kyuhyun Yoon. Investigation: Ji-Eun Park. Methodology: Ji-Eun Park, Kyuhyun Yoon, Eunkyoung Ahn. Project administration: Kyuhyun Yoon. Resources: Kyuhyun Yoon. Software: Hoseok Kim. Supervision: Kyuhyun Yoon. Validation: Hoseok Kim. Visualization: Hoseok Kim. Writing—original draft: Ji-Eun Park. Writing—review & editing: Kyuhyun Yoon, Jayeun Kim, Eunkyoung Ahn.
Funding Statement
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (RS-2024-00444922) and Korea Institute of Oriental Medicine (KSN20234113).
Acknowledgements
None