Assessment of the measurement properties of the Peabody Developmental Motor Scales-2 by applying the COSMIN methodology

The Peabody Developmental Motor Scales-2 (PDMS-2) has been used to assess the gross and fine motor skills of children (0–6 years); however, the evidence on the measurement properties of the PDMS-2 is inconclusive. Here, we aimed to systematically review the measurement properties of the PDMS-2 and to synthesize the quality of evidence using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology. Electronic databases, including PubMed, EMBASE, Web of Science, CINAHL and MEDLINE, were searched through January 2023 for relevant studies that used the PDMS-2. The methodological quality of each study was assessed by the COSMIN risk-of-bias checklist, and the measurement properties of the PDMS-2 were evaluated by the COSMIN quality criteria. Modified GRADE was used to evaluate the quality of the evidence. We included a total of 22 articles in the assessment. Among the assessed measurement properties, the content validity of the PDMS-2 was found to be sufficient with moderate-quality evidence. The structural validity, internal consistency, test-retest reliability and interrater reliability of the PDMS-2 were sufficient with high-quality evidence, while the intrarater reliability was sufficient with moderate-quality evidence. Sufficient high-quality evidence was also found for the measurement error of the PDMS-2. The overall construct validity of the PDMS-2 was sufficient, but the quality of evidence varied. The responsiveness of the PDMS-2 appears to be sufficient with low-quality evidence. Our findings demonstrate that the PDMS-2 has sufficient content validity, structural validity, internal consistency, reliability and measurement error with moderate to high-quality evidence. Therefore, the PDMS-2 is graded as 'A' and can be used in motor development research and clinical settings. Supplementary Information: The online version contains supplementary material available at 10.1186/s13052-024-01645-6.


Introduction
Motor development refers to the ability of children to move and interact with the environment and is very important in early childhood [1]. Proper motor development provides an opportunity for children to explore and participate in the world around them [2]. Several studies have shown that motor development is closely associated with children's cognitive ability [3], language [4], executive functioning [5], and quality of life [6]. Children with poor motor development reportedly have poor academic performance as well as depression and anxiety [7]. In addition, impaired motor development in early childhood can impact learning abilities, which may persist through adolescence or even later in life [8]. Motor disorders in children are associated with a lower quality of life in several domains, including physical, cognitive, emotional and social functioning [6]. Children with motor dyspraxia (developmental disorder) require motor intervention to promote their motor skills and to prevent postural abnormalities [9]. Therefore, early prediction of motor function is important for further intervention and education [10]. Many assessment instruments or scales have been developed to accurately and efficiently screen for motor development problems in children [11,12]. The Peabody Developmental Motor Scales-2 (PDMS-2) is widely used in paediatric practice and research studies to assess the gross and fine motor skills of children from birth to 6 years of age [13]. The PDMS-2 was improved and updated based on reviews of the PDMS, comments and queries from the testers and the authors' own experiences [14]. The key changes in the PDMS-2 include the collection of a more representative sample, the introduction of a different test structure and more specific scoring criteria [15].
The measurement properties of an instrument have been described and defined by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN). According to the COSMIN methodology, reliability, validity and responsiveness are the main domains. Reliability is categorized into test-retest, interrater and intrarater reliability, and validity is categorized into content, construct (structural, cross-cultural, hypothesis testing) and criterion validity [16]. Since the publication of the PDMS-2, many studies have examined the measurement properties of this scale. The measurement properties of the original version have been assessed in English-speaking countries [17–19], while the measurement properties of the translated versions have been assessed in non-English-speaking countries [20,21]. Although several studies have confirmed the reliability and validity of the PDMS-2 to be sufficient, there are some contradictory reports on its reliability and validity. For example, the concurrent validity of the PDMS-2 with the Bayley Scales of Infant Development II Motor Scale (BSID-II) has been reported both as a "high correlation" [22] and as a "low correlation" [19]. Despite the heterogeneity of studies on the measurement properties of the PDMS-2, no systematic review has addressed this issue. Since the PDMS-2 is widely used by clinicians, therapists, psychologists and diagnosticians [14], establishing consistent evidence on its measurement properties is highly warranted.
The COSMIN methodology is typically employed to evaluate the measurement properties of various tools/scales of a certain field [23,24]. Hulteen et al. employed the COSMIN methodology in their systematic review of the measurement properties of several motor assessment scales in children and adolescents [25]. The COSMIN methodology can also be used to review the measurement properties of a single measurement instrument, such as the Body Image Scale [26]. As the reported results on the measurement properties (reliability, validity, and responsiveness) of the PDMS-2 are inconclusive, the COSMIN methodology could help to resolve this inconsistency. Therefore, we searched for studies that determined the measurement properties of the PDMS-2 and employed the COSMIN methodology to conduct a systematic review of these properties. In this review, we summarize the state of research on the measurement properties of the PDMS-2 and synthesize the quality of evidence via the COSMIN methodology.

Literature search strategy
The PubMed, EMBASE, Web of Science, CINAHL and MEDLINE databases were searched through January 2023 for relevant studies that assessed the different measurement properties of the PDMS-2. The search terms or keywords used to identify the name of the scale/instrument (PDMS-2) were "Peabody developmental motor scales-2" OR "PDMS-2" OR "Peabody developmental motor scales-second edition" OR "Peabody developmental motor scales-2nd". The search term utilized to determine the scale measurement properties was a filter developed by the Patient Reported Outcome Measures (PROMs) Group at the University of Oxford (a high-sensitivity search filter that has been validated by Terwee et al. [27]). For the article search, we followed the latest version of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA, 2020) guidelines [28]. The full texts of the selected articles were downloaded from the journals' homepages. In addition, we contacted our university library or external collaborators for the full-text articles when necessary. The study protocol was registered in PROSPERO (https://www.crd.york.ac.uk/prospero/; CRD42022376335).

Inclusion and exclusion criteria
The included literature met the following criteria: (1) the study was conducted on children aged 0-6 years; (2) the study addressed the evaluation of the PDMS-2 measurement properties; and (3) at least one of the scale's measurement properties was evaluated in the study. The measurement properties of the PDMS-2 include content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness. Literature was excluded if it met any of the following criteria: (1) used the PDMS-2 to investigate children's motor development; (2) used the PDMS-2 to assess the effectiveness of an intervention; (3) was a review or systematic review; or (4) had only an abstract without a full-text article or was not peer reviewed.

Literature selection and data extraction
The literature search, article selection and data extraction were independently performed by two researchers (YZ and JH), and the results were compared with the help of another author (YQ). Any disagreements were resolved by discussion with other review authors (WY and MK). The literature was imported into EndNote, and duplicates were first excluded. Subsequently, the titles and abstracts of the collected articles were read, and irrelevant articles were excluded. The full texts of the remaining articles were subsequently read and screened according to our study criteria.
The following information was extracted from the literature: first author name, year of publication, studied population and source, region, sample size, age and sex of the children, language version of the PDMS-2 used, measurement properties of the PDMS-2 (content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness), and data on the measurement properties.

Evaluation of the risk of bias and quality of evidence of the included studies
We used the COSMIN risk of bias checklist [29] to assess the methodological quality of the studies. The checklist consists of ten sections, including "PROM development, content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness". Appropriate boxes were selected according to the measurement properties of the study. The methodological quality of the studies was assessed as "very good", "adequate", "doubtful" or "inadequate" on an item-by-item basis according to the standard score given in the boxes. The overall methodological quality rating of the studies was based on the "worst score principle". The worst score of the criteria in the box was regarded as the overall methodological quality rating of the study.
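The "worst score principle" described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `overall_quality` and the example item ratings are hypothetical and do not come from any included study.

```python
# Ratings ordered from best to worst, as named in the text.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def overall_quality(item_ratings):
    """Return the worst rating among the items of one checklist box."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # The rating with the highest position in RATING_ORDER is the worst one.
    return max(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one box of one study.
example_box = ["very good", "adequate", "very good", "doubtful"]
print(overall_quality(example_box))  # doubtful
```

A single "doubtful" item is enough to set the rating of the whole box, regardless of how many items were rated "very good".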
The quality of evidence was synthesized according to the modified version of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) method [24]. This version adapts the original GRADE method to the COSMIN methodology. The evidence levels could be categorized as "high", "moderate", "low" or "very low" according to the standard. The starting level of evidence for the included studies was "high", and the level was subsequently downgraded according to the characteristics of the included studies. Unlike the original GRADE method, the modified version removes the "publication bias" factor. The quality of evidence was downgraded according to the risk of bias, inconsistency, indirectness, and imprecision.
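The downgrading logic can be illustrated with a minimal Python sketch. It assumes that each factor contributes a whole number of downgrade steps and that the imprecision thresholds are the sample-size cut-offs used in the results of this review (total n < 100 lowers the level by one, n < 50 by two). The function names and the step counts for the other factors are hypothetical inputs, not values taken from the COSMIN manual.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def imprecision_steps(total_n):
    """Downgrade steps for imprecision, assuming the n < 100 and n < 50 cut-offs."""
    if total_n < 50:
        return 2
    if total_n < 100:
        return 1
    return 0


def grade_evidence(risk_of_bias=0, inconsistency=0, indirectness=0, imprecision=0):
    """Start at 'high' and move down one level per downgrade step."""
    steps = (risk_of_bias, inconsistency, indirectness, imprecision)
    if any(s < 0 for s in steps):
        raise ValueError("downgrade steps must be non-negative")
    # The level cannot fall below 'very low'.
    return LEVELS[min(sum(steps), len(LEVELS) - 1)]


print(grade_evidence(imprecision=imprecision_steps(80)))  # moderate
print(grade_evidence(imprecision=imprecision_steps(30)))  # low
```

The two printed cases mirror the pattern reported in the results: a pooled sample of 80 yields moderate-quality evidence, and a sample of 30 yields low-quality evidence, when no other factor is downgraded.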

Overall rating of the measurement properties
The overall rating of each measurement property of the PDMS-2 was assessed by the COSMIN methodology for systematic reviews of PROMs user manual (COSMIN manual) [30] and the COSMIN methodology for assessing the content validity of PROMs user manual [31]. The items included "content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness" (Table S1). The reported items for each measurement property were rated as "sufficient (+)", "insufficient (-)", or "indeterminate (?)" (Table S2). The overall rating of each measurement property was given as "sufficient (+)", "insufficient (-)", "inconsistent (±)", or "indeterminate (?)". Inconsistent results were analysed in groups to explore the reasons for this difference.
For reliability, studies were considered sufficient if the Pearson correlation coefficient [32] or Spearman's rho correlation coefficient [33] was ≥ 0.80. Hypothesis testing for construct validity requires the reviewer team to set hypotheses in advance. The hypothesis for this study was as follows: for convergent or concurrent validity, the correlation coefficient between the PDMS-2 and a comparator instrument measuring a similar construct was expected to be ≥ 0.50. Construct validity was rated as sufficient (+) if at least 75% of the results were in accordance with the hypotheses, insufficient (−) if at least 75% of the results were not, or indeterminate (?) if no hypotheses were defined.
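The 75% rule for construct validity can be sketched as follows. This is a hedged illustration: the function name `rate_construct_validity` and the example correlations are hypothetical, and returning "±" when neither 75% condition is met is an assumption based on the "inconsistent (±)" category listed above.

```python
def rate_construct_validity(correlations, threshold=0.50):
    """Rate a set of correlations against the hypothesis r >= threshold."""
    # No hypothesis defined (or no results to judge): indeterminate.
    if threshold is None or not correlations:
        return "?"
    n = len(correlations)
    confirmed = sum(1 for r in correlations if r >= threshold)
    if confirmed / n >= 0.75:
        return "+"
    if (n - confirmed) / n >= 0.75:
        return "-"
    # Assumption: mixed results are treated as inconsistent.
    return "±"


# Hypothetical correlation coefficients, not data from the included studies.
print(rate_construct_validity([0.62, 0.71, 0.55, 0.41]))  # +
print(rate_construct_validity([0.20, 0.31, 0.12, 0.64]))  # -
print(rate_construct_validity([0.62, 0.31]))              # ±
```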

Literature search results
From our database search, we identified a total of 529 articles, including 95 articles from PubMed, 103 from EMBASE, 156 from Web of Science, 48 from CINAHL, and 127 from MEDLINE. The search covered publications up to January 31, 2023, with no restriction on the earliest publication date.
All identified articles were imported to EndNote, and 424 duplicates were removed. The titles and abstracts of the remaining 105 articles were screened, and 68 irrelevant articles were excluded, leaving 37 articles. Then, two articles were excluded due to unavailability of the full text (conference abstracts), and 35 were assessed for eligibility. We further excluded 13 articles for the following reasons: three articles were reviews [15,34,35], one was a dissertation [36], one study did not investigate the measurement properties of the PDMS-2 [37], and eight studies used the PDMS-2 to assess other scales [2,38–44]. Finally, 22 articles were included in our assessment. The detailed selection process and number of articles in each step are shown in Fig. 1.
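The selection counts above can be checked with simple arithmetic. The short Python sketch below uses only the numbers reported in this section; the variable names are illustrative.

```python
# Records identified per database, as reported in the text.
records_per_database = {
    "PubMed": 95,
    "EMBASE": 103,
    "Web of Science": 156,
    "CINAHL": 48,
    "MEDLINE": 127,
}
identified = sum(records_per_database.values())

after_duplicates = identified - 424      # duplicates removed in EndNote
after_screening = after_duplicates - 68  # irrelevant titles/abstracts excluded
assessed_full_text = after_screening - 2 # full text unavailable

# Reasons for exclusion at the full-text stage.
full_text_exclusions = {
    "reviews": 3,
    "dissertation": 1,
    "did not investigate measurement properties": 1,
    "used PDMS-2 to assess other scales": 8,
}
included = assessed_full_text - sum(full_text_exclusions.values())

print(identified, after_duplicates, after_screening, assessed_full_text, included)
# 529 105 37 35 22
```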

Synthesis of evidence for the measurement properties of PDMS-2
The overall assessment of the PDMS-2 measurement properties and the corresponding quality of evidence for each measurement property are shown in Table 2. The detailed quality of evidence data are provided in the supplementary material (Table S3).

Content validity
Of the 22 included articles, only one study assessed the content validity of the PDMS-2 using the method recommended by COSMIN [59]. The study systematically assessed the content validity of the PDMS-2 by interviewing experts in the field, who judged the relevance and comprehensiveness of the scale. The overall rating of the results for content validity was found to be sufficient, and the quality of evidence was moderate. Since this study did not report comprehensibility, it was not possible to judge the overall rating of comprehensibility (Table 2).

Structural validity
Four of the 22 included articles assessed the bifactor structural validity of the PDMS-2 by classical test theory (CTT) [14,21,45,59]. The overall rating of the results for structural validity was found to be sufficient. The quality of evidence was high, and the methodological quality of all studies was judged as very good (Table 2).

Cross-cultural validity/measurement invariance
Of the 22 included articles, only one study assessed the cross-cultural validity of the PDMS-2 [46]. However, the methodology used in this study did not meet the COSMIN methodological requirements.
One study assessed the intrarater reliability of the PDMS-2 [59]. The ICC for the intrarater reliability of the PDMS-2 was more than 0.70. However, because of imprecision (total sample size 80, i.e., < 100), the quality of evidence was graded as moderate. Therefore, there was sufficient moderate-quality evidence for the intrarater reliability of the PDMS-2 (Table 2). Taken together, the high-quality evidence from our assessment demonstrated that the reliability of the PDMS-2 was sufficient.

Measurement error
One study evaluated the measurement error of the PDMS-2 [54]. The smallest detectable change (SDC) was 7.76, and the minimal important change (MIC) was 8.39, which met the criterion for a sufficient rating (+, SDC < MIC). The quality of evidence was high. Therefore, there was sufficient high-quality evidence for the measurement error of the PDMS-2 (Table 2).
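The measurement-error criterion can be expressed as a simple comparison. In the Python sketch below, only the "+" rule (SDC < MIC) and the two reported values come from the text; the function name is hypothetical, and returning "?" when no MIC is available and "-" otherwise are assumptions that follow the rating categories used in this review.

```python
def rate_measurement_error(sdc, mic):
    """Rate measurement error by comparing the SDC with the MIC."""
    if mic is None:
        # Assumption: without a MIC the result cannot be judged.
        return "?"
    if sdc < mic:
        return "+"
    # Assumption: an SDC at or above the MIC does not meet the criterion.
    return "-"


# Values reported for the PDMS-2 in the single included study [54].
print(rate_measurement_error(sdc=7.76, mic=8.39))  # +
```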

Hypothesis testing for construct validity
There is no 'gold standard' in the field of children's motor development assessment. Therefore, as recommended by COSMIN, concurrent validity, which is otherwise a part of criterion validity, was classified as evidence of construct validity [30].
One study assessed the concurrent validity of the PDMS-2 Gross Motor scale (PDMS-GM-2) with the EIDP [17]. The overall rating results showed that the concurrent validity was sufficient. Because the sample size (30 children) was less than 50, the quality of evidence was low. Overall, there was sufficient low-quality evidence for the concurrent validity of the PDMS-GM-2 with the EIDP. One study assessed the concurrent validity of the PDMS-GM-2 with the M-FUN [18]. The overall rating results showed that the concurrent validity was sufficient, but the quality of evidence was low due to the small sample size (22 children, i.e., < 50). Overall, our results showed that there was sufficient low-quality evidence for the concurrent validity of the PDMS-GM-2 with the M-FUN (Table 2).
Three studies assessed the concurrent validity of the PDMS-2 with the Bayley-III [49,51,56]. The overall rating of the results for the concurrent validity of the PDMS-2 with the Bayley-III was found to be sufficient, and the quality of the evidence was high. Three studies assessed the concurrent [19,57] and convergent [22] validity of the PDMS-2 with the BSID-II. Of these three studies, two recruited exceptional children [22,57]; the overall rating was judged as sufficient (+), and the quality of evidence was high. One study recruited normally developing children [19]; the overall rating was judged as insufficient (-), and the quality of evidence was low. Our assessment revealed that the results for the PDMS-2 with the BSID-II appeared to be sufficient, and the quality of evidence was high (Table 2).
Two studies assessed the concurrent validity of the PDMS-2 with the BOT-2 [20,54]. The overall rating of the results for the concurrent validity of the PDMS-2 with the BOT-2 was found to be sufficient, and the quality of evidence was high. Furthermore, two studies [47,48] examined the convergent validity of the PDMS-2 with the M-ABC. These two studies met the correlation requirement for the PDMS-GM-2 but not for the PDMS-FM-2. Therefore, the convergent validity of the PDMS-2 with the M-ABC was inconsistent. The quality of evidence was very low due to the small sample size (67 children, < 100) and inconsistent results. Thus, there is inconsistent very low-quality evidence for the convergent validity of the PDMS-2 with the M-ABC (Table 2).

Responsiveness
Two studies assessed the responsiveness of the PDMS-2 [53,54]. The overall rating of the results was sufficient. However, the quality of evidence was low because the studies had a serious risk of bias according to the COSMIN risk of bias checklist [29]. Thus, there was sufficient low-quality evidence for the responsiveness of the PDMS-2 (Table 2).

Discussion
To the best of our knowledge, this is the first systematic review in which the COSMIN methodology was used to assess the measurement properties of PDMS-2. In this study, we evaluated the different properties of PDMS-2, which were reported in 22 articles. According to the COSMIN manual, any measurement instrument or scale with sufficient evidence for content validity (any level quality) or internal consistency (at least low quality) can be categorized as "A" [30]. Our results showed that the content validity of the PDMS-2 had sufficient moderate-quality evidence, and the internal consistency of the PDMS-2 had sufficient high-quality evidence. These findings revealed that PDMS-2 can be graded as 'A', which can be used in motor development research and in clinical settings. The COSMIN manual further states that the results obtained from any "A" grade scale can be trusted [30].

Table 2 (continued). Responsiveness: overall rating and quality of evidence
Before and after intervention [53,54]. Qualitative summary: overall rating sufficient (+); quality of evidence low (severe bias).
[54]: ES = 0.47-0.74; SRM = 0.35-0.70; rating sufficient (+); sample size 141.
[53]: GRI-R range 1.7-2.3; rating sufficient (+); sample size 32.

According to the COSMIN manual, content validity is the most important property of a measurement instrument or scale [30]. Burns and Grove stated that content validity is obtained from three sources: literature, patient judgement (judgement of representatives of the relevant populations), and expert judgement [60]. The most commonly used source of content validity is expert judgement [61], and the COSMIN method combines patient judgement with expert judgement to assess three parts of content validity: relevance, comprehensiveness, and comprehensibility [30]. In our assessment, only one study reported the content validity of the PDMS-2 [59]. However, that study examined the content validity of the PDMS-2 by consulting experts in related fields, not patients/participants [59]. When the PDMS-2 is used, patients (children) only need to perform movements following the instructions of the evaluator and do not need to understand the meaning of the PDMS-2 items [14]. Therefore, although no studies assessing the comprehensibility of the PDMS-2 were found, we still consider the content validity of the PDMS-2 to be sufficient.
For the assessment of structural validity, the COSMIN quality criteria cover two approaches, namely, CTT and item response theory (IRT) [30,62]. All the studies addressing structural validity in our analyses used the CTT method. Although structural validity is easily assessed with CTT, the results from IRT are said to be more reliable in educational and psychometric fields [63]. Owing to its high accuracy, IRT is a well-validated method that could be applied to assess the structural validity of the PDMS-2 [63]. However, at present, no study has used IRT to evaluate the structural validity of the PDMS-2, and further studies using IRT are needed.
According to the COSMIN manual, cross-cultural validity/measurement invariance has been defined as "the degree to which the performance of the items on a translated or culturally adapted measurement instruments are an adequate reflection of the performance of the items of the original version of the measurement instruments" [30]. In our analyses, we determined that no studies have assessed the cross-cultural validity/measurement invariance of the PDMS-2 by the COSMIN-recommended method. We suggest further research on the cross-cultural validity/measurement invariance of the PDMS-2.
The results of the construct validity test demonstrated that the PDMS-2 is well correlated with most of the same-domain measurement instruments. However, the results of the three studies of the PDMS-2 with the BSID-II differed, which might be due to differences in sample type. Of these three studies, one study recruited normally developing children [19], and two studies recruited exceptional children [22,57]. The concurrent validity of the PDMS-2 with the BSID-II among normally developing children was rated insufficient, with low-quality evidence because of the small sample size (n = 15, i.e., < 50) [19]. However, the concurrent or convergent validity among exceptional children was found to be sufficient with high-quality evidence (sample size 198, > 100) [22,57]. The COSMIN manual states that high-quality studies provide stronger evidence than low-quality studies and can be considered decisive in determining the overall rating when ratings are inconsistent [30]. Overall, our findings revealed that the results of the assessment of the PDMS-2 with the BSID-II were sufficient. Next, we addressed the convergent validity of the PDMS-2 with the M-ABC in two studies [47,48]; the results were sufficient for the gross motor quotient (GMQ) and inconsistent for the fine motor quotient (FMQ). As the sample size was small and the assessment ratings were inconsistent, the quality of evidence for the convergent validity of the PDMS-2 with the M-ABC was considered very low.
The risk of bias for reliability and measurement error was not judged according to the retest interval recommended by the COSMIN risk of bias checklist (approximately two weeks) because of the rapid growth rate of children aged 0 to 6 years. Instead, we judged the risk of bias in the studies against an interval of approximately one week, following the method described by Lee et al. [32]. A suitable measurement error requires that the smallest detectable change (SDC) of the measurement instrument is less than the MIC [64]. Only one study reported the SDC and MIC [54]. The MIC is best estimated from multiple studies using multiple anchors [65]. Therefore, one study alone is not convincing, and we suggest further studies with multiple anchors to verify the MIC results.
Responsiveness is the ability of a scale to detect change over time in the construct to be measured [30]. The results of the two included studies [53,54] showed sufficient responsiveness of the PDMS-2, but the quality of evidence of these two studies was low. There are two reasons for this. First, these two studies did not describe the intervention details. Second, Wang et al. [53] used a statistical method (Guyatt's responsiveness ratio) that is not recommended by COSMIN [30]. According to the COSMIN manual, Guyatt's responsiveness ratio takes the minimal important change into account [30]. The minimal important change concerns the interpretation of the change score, not the validity of the change score [30]. Low-quality evidence therefore neither confirms nor refutes the responsiveness of the PDMS-2 before and after the intervention.
In addition to the abovementioned measurement properties in COSMIN, interpretability and feasibility are also important aspects for evaluating the PDMS-2 [30]. In our assessment, one study [54] reported no ceiling or floor effects when using the PDMS-2 to assess the motor development of children. The absence of ceiling or floor effects indicates good interpretability of the PDMS-2. According to the results of previous studies of the PDMS-2 [14], we assumed that the use of the PDMS-2 is highly feasible and that a specific environment and/or equipment are not necessary to assess motor development in children.
The synthesized evidence of the measurement properties of the PDMS-2 is comparable to that of other well-known measurement instruments in the same domain, such as the M-ABC, BOT-2, Bayley-III, and BSID-II. For instance, a previous study reported that the interrater reliability, test-retest reliability and content validity of the M-ABC were good, but mixed results were reported for internal consistency and cross-cultural validity [66]. The BOT-2 scale was reported to have excellent interrater reliability, test-retest reliability, and internal consistency [66]. Another study reported that the internal consistency and test-retest reliability of the Bayley-III were good [35]. In addition, the interrater reliability, internal consistency, and test-retest reliability of the BSID-II were reported to be sufficient [67]. Our findings demonstrate that the PDMS-2 has sufficient content validity, structural validity, internal consistency, reliability and measurement error with moderate to high-quality evidence.

Limitations and future perspectives
Our results could not establish the quality of evidence for the cross-cultural validity of the PDMS-2 because few or no studies have assessed the cross-cultural validity of the PDMS-2 via the COSMIN-recommended methodology. For the article search, Cochrane reviews use various additional sources, including dissertations, editorials, and conference proceedings. However, the probability of finding additional relevant articles for systematic reviews from these sources appears to be low [24]. Because we excluded non-peer-reviewed articles, our conclusions are unlikely to have been affected by this omission; however, we cannot completely rule out that possibility.
To date, no study has addressed the cross-cultural validity of the PDMS-2 by the COSMIN-recommended method. In addition, only one study assessed the measurement error of the PDMS-2. Therefore, further studies are necessary to assess the cross-cultural validity and measurement error of the PDMS-2. These measurement properties can then be used to determine the overall rating and quality of evidence by the COSMIN methodology. We further suggest that future studies on the responsiveness of the PDMS-2 use methods compatible with the COSMIN methodology.

Conclusions
Assessment results from the COSMIN methodology showed that the PDMS-2 has sufficient high-quality evidence for structural validity and internal consistency. The reliability and measurement error of the PDMS-2 also demonstrated sufficient high-quality evidence. However, no adequate evidence was found for the cross-cultural validity/measurement invariance of the PDMS-2, and only low-quality evidence was found for its responsiveness. In addition, very low-quality evidence for convergent validity suggested that the PDMS-FM-2 was inconsistently correlated with the M-ABC, which needs to be further investigated. Overall, our findings revealed that the PDMS-2 was graded as "A", and this scale can be used in the field of child motor development research as well as in clinical settings.

Table 1
Basic characteristics of the included articles. Note: M/F = male/female, y = years, m = months, NR = not reported