Feasibility of nonparametric item response theory in an objective structured clinical exam


G.A.M. Bouwmans
E. Denessen
A.M. Hettinga
C.T. Postma


OSCE and Standard Setting




Radboud University Medical Centre


MSA can result in a more accurate estimation of reliability than Cronbach’s alpha. The assessment of homogeneity is paramount in the interpretation of OSCE checklists, and MSA can elucidate several aspects of homogeneity that could facilitate the interpretation of OSCEs.


Given that the reliability, homogeneity and validity of OSCEs can be improved at the expense of each other, improving OSCEs is complex, laborious and requires extensive information. MSA provides a broad selection of parameters and analyses that could contribute to this process.


The use and interpretation of classical item analyses in determining the homogeneity and reliability of Likert scale checklists are not unchallenged [1,2,3].





It may be argued that item analysis based on nonparametric item response theory (NIRT) adds value to interpreting homogeneity and reliability in Objective Structured Clinical Examinations (OSCEs).



Summary of Work

A sample of three OSCE checklists, covering physical examination, history taking and communication, was analyzed with the NIRT method of Mokken Scale Analysis (MSA) [4].


For every checklist, fit with the Monotone Homogeneity Model (MHM) and the Double Monotone Model (DMM) was determined. Reliability was estimated with the MSA coefficient Rho and with Cronbach’s alpha.


Within each checklist, the MSA search procedure was used to identify possible subscales that fitted the MHM and DMM.

Take-home Messages

MSA is a promising method to supplement or substitute classical item analysis in investigating Likert-type OSCE checklists.

Summary of Results

None of the three complete checklists fitted either the MHM or the DMM. Reliability was sufficient for the physical examination checklist and the communication checklist but insufficient for the history taking checklist.


MSA's search procedure revealed two subscales in the physical examination checklist that fitted the MHM. The history taking checklist comprised two subscales that fitted the MHM, one of which fitted the DMM as well. The communication checklist comprised three subscales that fitted the MHM, two of which fitted the DMM as well.


Reliability was sufficient for subscales comprising six items or more. Smaller subscales had insufficient reliability most of the time.


















1. Schuwirth, L. W. T., & van der Vleuten, C. P. M. (2006). A plea for new psychometric models in educational assessment. Medical Education, 40(4), 296-300. doi: 10.1111/j.1365-2929.2006.02405.x

2. Sijtsma, K. (2009a). On the Use, the Misuse, and the Very Limited Usefulness of Cronbach's Alpha. Psychometrika, 74(1), 107-120. doi: 10.1007/s11336-008-9101-0

3. Sijtsma, K., & Emons, W. H. M. (2011). Advice on total-score reliability issues in psychosomatic measurement. Journal of Psychosomatic Research, 70(6), 565-572.

4. Molenaar, I. W., & Sijtsma, K. (2000a). User's Manual MSP5 for Windows (Version 5.0 ed.). Groningen: iec ProGAMMA.



With respect to reliability, MSA produces a higher and more accurate estimate (Rho) than classical item analysis (Cronbach’s alpha). However, coefficient Rho is based on the DMM, and although there is evidence that Rho is robust against minor violations of the DMM, the criteria for and consequences of these violations are unknown. In our dataset the three original scales did not fit the DMM, yet in these scales Rho was higher than Cronbach’s alpha and the differences between the two were small, suggesting that Rho might be robust here and that the violations were within a tolerable range. All other scales met the DMM; for these, Rho is therefore the more accurate estimate and is preferred to Cronbach’s alpha.



Fit with MHM and DMM

The assessment of reliability without the assessment of dimensionality is of limited value, because sum scores are difficult to interpret if several underlying factors have been measured. Fit with the MHM indicates that an item set measures one underlying skill, and the higher the value of H, the more certain it is that test scores represent the relative level of this skill. Fit with the DMM, meaning that the ordering of item difficulty is the same for all respondents, facilitates interpretation even more, since it could rule out a biased test.



Search procedure

MSA provides a manual and an automated procedure to construct scales in agreement with the MHM and DMM. In the manual procedure the entire scale is analyzed, and scalability and reliability can be improved by manually deleting poorly fitting items. Since deleting one item can drastically change the coefficients of the other items, it is important to delete items one by one until scalability, reliability, fit with the MHM or fit with the DMM reaches a satisfactory level. This confirmatory approach is recommended if it is already assumed that the items somehow constitute a scale.
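The one-by-one deletion described above can be sketched in a few lines of Python. This is a simplified illustration, not the MSP5 procedure: it assumes dichotomous (0/1) items, uses only the item coefficient H(i) with the 0.30 lower bound as the stopping rule, and ignores reliability and content considerations. The function names `item_h` and `prune_one_by_one` are hypothetical.

```python
def item_h(data, items):
    """Item scalability H(i) for each item in `items`.

    data: list of rows (one per examinee) of 0/1 item scores.
    Assumes no item is passed or failed by everyone (otherwise the
    maximum covariance is zero and H(i) is undefined).
    """
    n = len(data)
    p = {i: sum(row[i] for row in data) / n for i in items}
    result = {}
    for i in items:
        cov = cov_max = 0.0
        for j in items:
            if j == i:
                continue
            p_ij = sum(row[i] * row[j] for row in data) / n
            cov += p_ij - p[i] * p[j]                 # observed covariance
            cov_max += min(p[i], p[j]) - p[i] * p[j]  # maximum given the margins
        result[i] = cov / cov_max
    return result


def prune_one_by_one(data, threshold=0.30):
    """Delete the worst-fitting item, recompute, and repeat."""
    items = list(range(len(data[0])))
    while len(items) > 2:
        h = item_h(data, items)
        worst = min(items, key=h.get)
        if h[worst] >= threshold:
            break  # every remaining item reaches the lower bound
        items.remove(worst)  # remove exactly one item per pass
    return items
```

Recomputing all coefficients after each single deletion matters because the H(i) value of every remaining item depends on the items it is paired with.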


In the automated procedure the MSA program algorithm selects item sets within the entire scale that are in agreement with the MHM. This exploratory strategy is recommended for gathering information about a scale whose dimensionality is largely unknown. Finally, the automated procedure offers the possibility to hand-pick a starting item set. This feature can be used when content specialists are convinced that a certain collection of items is imperative for scale construction. After all, this empirical or theoretical starting set does not necessarily coincide with a statistically strong item set.



Though unidimensionality and reliability are desirable psychometric properties, this means neither that every poorly fitting item should be removed, nor that a test with relatively low reliability or scalability coefficients is a bad test, because reliability and dimensionality do not provide information about “what” is measured. It is conceivable that some skills are multidimensional in nature, and deleting items merely to improve reliability or scalability coefficients may lead to an undesirable narrowing of these constructs. Content specialists such as scale constructors or teaching doctors should determine to what degree the psychometric results can be explained and warrant adjustment.


Taking content as well as psychometrics into account, content specialists may decide to extend just the strongest MHM or DMM dimension(s), to reformulate items, to delete items, to split the skill into separate sub-skills or to accept the multidimensionality of the skill. Given that reliability, dimensionality and validity might be enhanced at the expense of each other, this process is complex, laborious and requires extensive information. MSA provides an extensive selection of analyses that could contribute to this process.



To date, most research on item analysis in OSCEs has focused on classical test theory. However, classical item analysis is not undisputed. The reliability coefficient Cronbach’s alpha is a lower bound to reliability and depends on item difficulty: broad variation in item difficulty suppresses Cronbach’s alpha. In addition, Cronbach’s alpha decreases with increasing dimensionality and is inflated by an increasing number of items.
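For reference, Cronbach’s alpha for k items equals k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the sum score. A minimal Python sketch follows; the function name `cronbach_alpha` and the toy data are illustrative assumptions, not the study's data or software.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows of item scores.

    Assumes at least two items and a sum score that varies across examinees.
    """
    k = len(scores[0])
    # Variance of each item across examinees
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the sum score
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: two items that always agree, and two unrelated items
identical_items = [[1, 1], [0, 0], [1, 1], [0, 0]]
unrelated_items = [[1, 1], [1, 0], [0, 1], [0, 0]]
print(cronbach_alpha(identical_items), cronbach_alpha(unrelated_items))
```

Because only the ratio of variances enters the formula, using population or sample variances gives the same result as long as the choice is consistent.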


Cronbach’s alpha is often erroneously considered an index of dimensionality, a misconception partly fuelled by differing interpretations and definitions of the terms dimensionality, homogeneity and internal consistency. Ascertaining unidimensionality is important in OSCE checklists, since summed item scores are usually expected to represent the level of the underlying medical skill, and the interpretation of these sum scores is facilitated when all items measure a single skill. Therefore, additional statistics such as factor analysis and item response theory models are often used to supplement Cronbach’s alpha.


The nonparametric item response method of Mokken scale analysis (MSA) is a probabilistic version of the deterministic Guttman model and comprises the assessment of dimensionality as well as reliability. MSA is frequently used in psychology, medicine, marketing and social medicine, but to our knowledge it has never been used in an OSCE context.


Two models are paramount in MSA: the Monotone Homogeneity Model (MHM) and the Double Monotone Model (DMM). If the MHM holds, the checklist sum scores can be used to order participants on a unidimensional latent scale. The MHM is based on three assumptions. The first assumption is called unidimensionality, meaning that all items measure the same single latent construct. Unidimensionality is a desirable quality, for it simplifies the interpretation of the item sum score: the higher the sum score, the better the latent construct is mastered. Conversely, if the sum score comprises items that measure various latent constructs (multidimensionality), it is unclear what a single parameter such as the item sum score represents.

The second assumption, known as “monotonicity”, states that for every item the probability of answering the item correctly increases as the ability level on the latent construct (measured as the sum score of the item set) increases. For instance, the item characteristic curves (ICCs) in figure 1 show that item a and item b agree with the monotonicity assumption, whereas item c violates it: the curves of items a and b increase continuously, while the curve of item c partly decreases.


Monotonicity is a desirable characteristic, for it is reasonable to expect that higher ability levels increase the probability of endorsing every item in the item set. Note that the ICC of item a is steeper than the ICC of item b, indicating that item a discriminates more clearly between different ability levels than item b. This discriminating power, and thus the steepness of the ICC, is reflected by coefficient H(i).
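A rough empirical check of monotonicity can be sketched as follows, assuming dichotomous (0/1) items. Examinees are grouped by their rest score (the sum score without the item under study), and the proportion endorsing the item should not drop as the rest score rises. This is a simplification: dedicated MSA software also merges small groups and compares all pairs of groups rather than only adjacent ones. The function names and the `min_drop` tolerance of 0.03 are illustrative assumptions.

```python
from collections import defaultdict


def rest_score_proportions(data, item):
    """Proportion endorsing `item` within each rest-score group."""
    groups = defaultdict(list)
    for row in data:
        rest = sum(row) - row[item]  # sum score excluding the item itself
        groups[rest].append(row[item])
    return [(rest, sum(v) / len(v)) for rest, v in sorted(groups.items())]


def monotonicity_violations(data, item, min_drop=0.03):
    """Adjacent rest-score groups where the proportion drops by more than min_drop."""
    props = rest_score_proportions(data, item)
    return [
        (r_low, r_high)
        for (r_low, p_low), (r_high, p_high) in zip(props, props[1:])
        if p_low - p_high > min_drop
    ]
```

An empty list from `monotonicity_violations` corresponds to an observed curve that never decreases, as for items a and b in figure 1; a non-empty list flags the kind of partial decrease shown by item c.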

The third assumption, called local independence, is of a more technical nature and is not necessary for understanding the Mokken concept. Local independence means that, at a given ability level, endorsing or failing an item does not affect the probability of endorsing subsequent items.


The DMM is obtained by adding a fourth assumption to the three assumptions of the MHM. This fourth assumption, named “non-intersection”, states that the ICCs of all items in a scale do not intersect. For instance, figure 2 shows that item a and item b agree with the non-intersection assumption (their graphs do not intersect), whereas item c violates it with respect to both item a and item b.


The rationale behind this feature is perhaps more easily understood by comparing the two ability levels marked X1 and X2 in figure 2. For a participant with ability level X1, item b is more difficult than item c (the probability of a correct answer is lower for item b than for item c). Conversely, for a more proficient participant with ability level X2 the situation is reversed: item c is more difficult than item b. So, even though the probability of correctly answering item b and item c increases when the ability level increases (meaning that the monotonicity criterion is met for both items), the ordering of item difficulties differs between the two participants.
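The comparison of ability levels can be made concrete with a small Python sketch. The probabilities below are hypothetical and are not read from figure 2; they are chosen only so that every curve increases while item c crosses both item a and item b. The function name `intersecting_pairs` is likewise an illustrative assumption.

```python
from itertools import combinations


def intersecting_pairs(icc):
    """Item pairs whose difficulty ordering reverses across ability levels.

    icc: dict mapping an item name to its endorsement probabilities at
    increasing ability levels (all lists of the same length).
    """
    crossing = []
    for a, b in combinations(icc, 2):
        diffs = [pa - pb for pa, pb in zip(icc[a], icc[b])]
        # A pair intersects if a is easier at one level and harder at another
        if any(d > 0 for d in diffs) and any(d < 0 for d in diffs):
            crossing.append((a, b))
    return crossing


# Hypothetical probabilities at three increasing ability levels
icc = {
    "a": [0.50, 0.70, 0.95],
    "b": [0.10, 0.30, 0.80],
    "c": [0.60, 0.65, 0.70],
}
print(intersecting_pairs(icc))
```

In this toy example all three curves are monotone, yet only the pair of items a and b keeps the same difficulty ordering at every ability level.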



If the ordering of item difficulties is different for certain subgroups in the population (e.g. men versus women), this probably indicates that the item set is measuring different latent constructs in different subgroups (differential item functioning). Conversely, if the non-intersection assumption holds, the interpretation of the sum scores is further facilitated, for it underpins an unbiased item set.













Summary of Work

-Fit with the monotone homogeneity model (MHM) indicates that the item sum score can be used to order respondents' ability on a single underlying skill. The MHM criteria are met if the scalability coefficients of all item pairs, H(ij), are positive and H(i) ≥ 0.30 for all items, resulting in a scale with H ≥ 0.30. The usually accepted criteria are: 0.30 ≤ H < 0.40 constitutes a weak scale, 0.40 ≤ H < 0.50 constitutes a moderate scale and H ≥ 0.50 constitutes a strong scale. Values smaller than 0.30 indicate that the scale is unusable for practical purposes.

-Fit with the double monotone model (DMM) indicates that, in addition to the MHM assumptions, the ordering of item difficulties is approximately the same for all respondents. It is checked with the Htrans statistic for dichotomous items (Ht ≥ 0.30 and Ht(a) ≤ 10). Note that the DMM extends the MHM, so violating the MHM automatically means that the DMM is violated, irrespective of the outcome of the Htrans statistic.

-Reliability implies that test results are repeatable under similar conditions. Reliability was measured with Cronbach’s alpha as well as Mokken’s Rho. For both coefficients a value of 0.70 was considered sufficient.

-MSA search procedure. MSA’s automated search procedure was used to identify possible monotone homogeneous and double monotone item sets within each scale. For every subscale found, fit with the MHM and DMM and reliability were assessed as described above.
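The scalability coefficients used in the MHM criteria above can be sketched in Python for the simplified case of dichotomous (0/1) items. Each coefficient is the ratio of observed covariance to the maximum covariance attainable given the item proportions. This is an illustration under stated assumptions, not the MSP5 implementation, which also handles polytomous items; the function names are hypothetical.

```python
def pair_terms(data, i, j):
    """Observed and maximum covariance of items i and j (0/1 scores)."""
    n = len(data)
    p_i = sum(row[i] for row in data) / n
    p_j = sum(row[j] for row in data) / n
    p_ij = sum(row[i] * row[j] for row in data) / n
    return p_ij - p_i * p_j, min(p_i, p_j) - p_i * p_j


def scalability(data):
    """Return (H(ij) per pair, H(i) per item, H for the scale).

    Assumes no item is passed or failed by everyone, which would make
    the maximum covariance zero.
    """
    k = len(data[0])
    pairs = {(i, j): pair_terms(data, i, j)
             for i in range(k) for j in range(i + 1, k)}
    h_ij = {ij: cov / cov_max for ij, (cov, cov_max) in pairs.items()}
    h_i = []
    for i in range(k):
        terms = [t for ij, t in pairs.items() if i in ij]
        h_i.append(sum(c for c, _ in terms) / sum(m for _, m in terms))
    h = sum(c for c, _ in pairs.values()) / sum(m for _, m in pairs.values())
    return h_ij, h_i, h


def classify(h):
    """Label a scale H value using the conventional lower bounds."""
    if h < 0.30:
        return "unscalable"
    if h < 0.40:
        return "weak"
    if h < 0.50:
        return "moderate"
    return "strong"
```

A perfect Guttman pattern, in which nobody passes a harder item while failing an easier one, yields a value of 1 for every coefficient.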


Summary of Results

On the front page of this poster the psychometric properties of the physical examination checklist were presented. The two remaining checklists, on communication skills and history taking skills, are presented here. For an explanation of the coefficients, please see the additional information under the "more detail" button of the BACKGROUND and SUMMARY OF WORK sections of the main page.






