Test–retest reliability of performance variables during treadmill rollerski skating

The current study evaluated the test–retest reliability of a commonly used rollerski protocol, characterized by a constant incline and escalating speed. This procedure included one familiarization session (T0) and three trials (T1–T3) within a 14-day period for skiers who were already familiar with the rollerski treadmill, though not specifically with this protocol.

The principal findings include:

1. Speedmax showed a mean CV from T1–T3 of 1.5% [1.1, 2.6] and an ICC of 0.96 [0.87, 0.99], with a systematic familiarization bias from T0–T3 (3.4% [1.8, 5.1]) and from T1–T3 (both P < 0.05); the change was 1.2% [0.1, 2.3] from T1 to T2 and 2.2% [0.1, 4.3] from T2 to T3.

2. VO2peak showed a mean CV of 2.2% [1.6, 3.8] and an ICC of 0.93 [0.78, 0.98], with a systematic familiarization bias from T0–T3 (4.1% [1.2, 7.1], P < 0.05) but no clear systematic familiarization bias from T1 to T2 (−0.2% [−2.0, 1.6]) or from T2 to T3 (1.8% [−1.1, 4.7]).

3. Submaximal O2-cost (VO2sub) showed a mean CV of 2.1% [1.5, 3.3] and an ICC of 0.94 [0.84, 0.99], with no clear systematic familiarization bias from T0–T3; the change was −0.7% [−2.4, 1.1] from T1 to T2 and −0.1% [−2.4, 2.3] from T2 to T3.
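For readers less familiar with these reliability statistics, the sketch below shows one common way to derive a percent change in the mean and a typical error expressed as a CV from two paired trials. It uses log-transformed values and the standard deviation of the difference scores divided by the square root of two. This section does not describe the statistical procedure actually used, so the method, the function name `retest_stats`, and the Speedmax values are illustrative assumptions only.

```python
import math
from statistics import mean, stdev

def retest_stats(trial_a, trial_b):
    """Return (percent change in mean, typical error as CV %) for paired trials.

    Assumed approach (not taken from the paper): log-transform the data so
    that differences between trials can be read as percent changes.
    """
    diffs = [math.log(b) - math.log(a) for a, b in zip(trial_a, trial_b)]
    # Systematic change in the mean, back-transformed to percent.
    change_pct = 100 * (math.exp(mean(diffs)) - 1)
    # Typical error = SD of the difference scores / sqrt(2), as a CV in percent.
    cv_pct = 100 * (math.exp(stdev(diffs) / math.sqrt(2)) - 1)
    return change_pct, cv_pct

# Hypothetical Speedmax values (km/h) for five skiers in two trials.
t1 = [18.0, 19.5, 20.0, 21.0, 22.5]
t2 = [18.3, 19.6, 20.4, 21.1, 22.9]
change, cv = retest_stats(t1, t2)
print(f"change in mean: {change:.1f}%, CV: {cv:.1f}%")
```

Under this approach a uniform improvement across all skiers raises the change in the mean but leaves the CV near zero, which is why a systematic familiarization bias and a low CV can coexist.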

Graded exercise tests have been criticized for their reliability and ecological validity (McGawley 2017; Currell and Jeukendrup 2008). In the present study, a significant familiarization effect was evident, with Speedmax continuing to increase through the fourth test (Fig. 1). The test, which lasted approximately 6–7 min, showed a mean increase of about 3% in Speedmax from T1 to T3, indicating a systematic familiarization bias. This increase coincided with no changes in VO2peak, HRpeak, La−peak, or RPEpeak across the three main tests. Consequently, the skiers appeared to reach the same level of "exhaustion", with similar peak physiological responses, at the end of the MTF tests. Despite the systematic familiarization bias in Speedmax, the relatively low CV (1.5% from T1 to T3) and high ICC (0.96 from T1 to T3), similar to those reported by Losnegard et al. (2013), suggest a high degree of test–retest reliability.

McGawley (2017) found that VO2max was higher during a graded running test than during a time-to-failure test, suggesting that VO2max can vary depending on the test type. In the preliminary data collection (see Supplemental Data) we investigated differences between increased incline and increased speed and found no absolute differences in VO2peak, HRpeak or RPEpeak. Moreover, Losnegard et al. (2012a) found no difference in ski skating VO2max between a graded test and a time-to-failure test similar to the one used in the present study. In the present study, the T1–T3 CV was 2.2% for VO2peak and 0.9% for HRpeak, comparable to earlier studies in running (McGawley 2017) and in the skating technique (Losnegard et al. 2013), but slightly higher than in the diagonal style (1.5%) assessed by Bucher et al. (2023) using Douglas bags. Taken together, current knowledge on rollerski testing suggests a typical CV for VO2peak of 1.5–2.5%, and the type of protocol tested does not seem to have a major influence on this CV.

Importantly, cross-country skiing is an intermittent sport involving a variety of speeds and techniques. When selecting a specific testing protocol for skiers, the purpose of the test and the targeted qualities must be considered. Previous studies have generally opted for increased speed (Sandbakk et al. 2010; McGawley and Holmberg 2014; Losnegard et al. 2017b), increased incline (Losnegard et al. 2012b; Pellegrini et al. 2011) or a combination of both (Kvamme et al. 2005; Gløersen et al. 2020; Andersson et al. 2016). In a preliminary study (see Supplementary Material) we examined the changes in the mean and CV of two different protocols over two tests, either by increasing the incline (with constant speed) or by increasing the speed (with constant incline). We concluded that both protocols showed similar CV and changes in the mean for submaximal variables, while the CV was larger for a maximal speed test to failure and for a VO2peak test with gradually increasing speed. Combined with the clear familiarization effect from the main project, this suggests that technique is a major determinant of performance in such tests. Therefore, when planning testing, it is essential to consider the main purpose of the test (e.g., physiological assessment versus technique and high-speed qualities).

VO2sub and HRsub showed a T1–T3 CV of approximately 2%, comparable to previous findings in rollerski skating (Losnegard et al. 2013). This suggests that submaximal testing at aerobic steady-state speeds is suitable for detecting relatively minor changes in work economy/efficiency and heart rate. This is crucial not only for monitoring training-induced changes (Losnegard et al. 2013) but also for detecting technical alterations (Losnegard et al. 2017a) or changes in equipment (Losnegard et al. 2017b). However, although the learning effect in VO2sub appeared minor (e.g., % change), HRsub and La−sub were reduced between T0 and T3. Therefore, we emphasize the importance of familiarization, not only with skiing on the treadmill but also with the specific test setting (e.g., using a mouthpiece and safety harness), to ensure the most reliable data.

Methodological considerations

In the current study, we selected highly trained skiers (Tier 3), similar to the study by Bucher et al. (2023). The participants' level can potentially influence reliability, as better skiers typically show less variation in performance than their slower counterparts (Spencer et al. 2014). However, the CV between tests was similar to that found in elite skiers (Losnegard et al. 2013). Additionally, we included both female and male participants, which has likely contributed to the relatively high ICC because of the heterogeneity in the sample.
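The dependence of the ICC on sample heterogeneity can be illustrated with a simplified variance-ratio form, in which the ICC is the between-subject variance divided by the sum of between-subject and within-subject variance. The ICC model used in the study is not stated in this section, so the formula choice, the function name `icc_from_sds`, and all numbers below are hypothetical.

```python
def icc_from_sds(between_sd, within_sd):
    """Simplified ICC: between-subject variance over total variance.

    Assumed form for illustration; not necessarily the ICC model
    used in the study.
    """
    return between_sd**2 / (between_sd**2 + within_sd**2)

# Hypothetical values: identical within-subject SD (typical error),
# but a wider between-subject spread in the mixed-sex sample.
within_sd = 1.2
homogeneous = icc_from_sds(between_sd=3.0, within_sd=within_sd)
heterogeneous = icc_from_sds(between_sd=6.0, within_sd=within_sd)
print(f"homogeneous: {homogeneous:.2f}, heterogeneous: {heterogeneous:.2f}")
```

With the within-subject variation held constant, the wider sample yields the higher ICC, which is why the CV is often the more informative statistic when comparing reliability across samples of differing heterogeneity.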

Our design involved four tests within a 14-day period to avoid significant training-induced changes. This design seems appropriate for detecting reliability over a short time-period (such as familiarization before a research project), but it might not necessarily provide the “correct” reliability of testing over longer periods. Of note is the maximal test, where subjects likely “remembered” how many speed changes they performed in the previous trial and aimed to improve their performance. Over a longer period, this learning effect might diminish.

Reliability depends on several factors, including biological and psychological factors, equipment, testing staff, and environmental conditions. It is important to acknowledge that the testing was conducted in three different labs with different equipment, testing staff, and facilities (e.g., treadmill and ergospirometry). Moreover, the participants were tested at different times during the preparation phase, and training before and during the 14 days was not controlled. Although our standardized procedures were identical, these factors could have influenced the results and should be considered when interpreting the findings. However, the main purpose was to identify within-subject reliability over several tests, and we believe that the present setup is suitable for this objective.
