Statistical methods used in the development of a health measurement scale
Pamela Warner
Reader in Medical Statistics, Centre for Population Health Sciences, University of Edinburgh, Edinburgh, UK
Correspondence to Dr Pamela Warner, Centre for Population Health Sciences, University of Edinburgh Medical School, Teviot Place, Edinburgh EH8 9AG, UK; p.warner{at}

Various psychometric statistical methods have been used in a paper by Simon et al.1 in this issue of the Journal. These notes are intended to provide some additional explanation of the methods employed in developing a health measurement scale. [See Box 1 for a glossary of terms used in this article.]

Box 1

Glossary of statistical terms used in this article

What statistical methods are used in scale development?

A health measurement scale is a tool designed for a particular purpose: to quantify some attitude (say ‘acceptability’ of sterilisation), or to screen for those with a high-risk status (say, practising unsafe sex) in order to offer them additional counselling, or to assess cancer knowledge (as in the Simon et al.1 article). Any such tool needs to be fit for purpose or, preferably, best for purpose.2 The main statistical methods available for application in psychometric work have subtly differing objectives, and they are couched in terms of concepts such as reliability, validity and ‘responsiveness to change’. In general the statistical methods (and designs) that are used in psychometric work are adaptations of the well-known standard statistical approaches, but with ‘bespoke’ labels reflecting their psychometric focus:

  • (1) Cronbach's alpha (α)

  • (2) Test-retest reliability (r_{t-r})

  • (3) Item-total correlation (r_{i-t}).
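As an illustration only, the three statistics listed above can be sketched in a few lines of Python. The scores are hypothetical, and the function names (`pearson`, `cronbach_alpha`, `item_total_correlations`, `retest_reliability`) are chosen here for illustration. The sketch assumes the 'corrected' form of item-total correlation (each item against the sum of the remaining items) and a Pearson coefficient for test-retest reliability; other coefficients are sometimes preferred, and nothing here is taken from the Simon et al. analysis.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # total score per respondent
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


def item_total_correlations(items):
    """Corrected item-total r: each item against the sum of the other items."""
    totals = [sum(row) for row in zip(*items)]
    return [
        pearson(col, [t - s for t, s in zip(totals, col)])
        for col in items
    ]


def retest_reliability(first, second):
    """Test-retest r: the same respondents' total scores on two occasions."""
    return pearson(first, second)


# Hypothetical scores: 3 items answered by 5 respondents.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(items)
r_it = item_total_correlations(items)
# Hypothetical total scores for the same 5 respondents on two occasions.
r_tr = retest_reliability([4, 7, 9, 13, 14], [5, 7, 8, 13, 15])

print(f"alpha = {alpha:.3f}")
print("item-total r:", [round(r, 3) for r in r_it])
print(f"test-retest r = {r_tr:.3f}")
```

All three are built from ordinary variances and correlations, which is the sense in which the psychometric methods are adaptations of standard statistical approaches.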

When/why are psychometric statistical methods useful?

Psychometric statistical methods are useful first and foremost in optimising a new health scale that is being developed. However, once the iterative development process has produced a final version, these methods should also be used to validate that scale, to demonstrate formally to others that it is of good quality for research or clinical use. Such benchmarking, showing that the scale fulfils its intended function well (together with publication of a report of this validation exercise in an academic journal), will promote more widespread use of the scale by other researchers in that field. [Advantages will then accrue, for those wishing to use the health scale in their practice (both researchers and clinicians), if all research in a field tends to use the same good-quality tool. This is because there will be scope for amalgamation and interpretation of findings across separate studies, to provide more robust and dependable evidence.] Finally, these statistical methods are invaluable when an existing health scale, presumably already validated for the originally intended research/clinical context, needs adaptation/validation for use in new contexts (e.g. community use of a health scale developed for a hospital setting, or after translation of an existing questionnaire such as SF-36 into other languages).3

What precautions are needed?

It is common for a scale to be developed initially by means of a sample of respondents, who complete a ‘development’ scale comprising a pool of possible items (typically more items than are wanted in the final scale). Iterative analysis is then undertaken to check the ‘informativeness’ and utility of each item, by applying something like item-total correlation. The least successful items would tend to be dropped at this stage.
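The iterative item analysis described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended procedure: the item names and scores are hypothetical, the 0.3 cut-off is only a commonly quoted rule of thumb, and `prune_items` and its parameters are invented for this sketch. In practice an item flagged in this way might still be retained for the sake of content validity.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either list has no variation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom else 0.0


def corrected_item_total(items, name):
    """Correlation of one item with the sum of all the other items."""
    n_resp = len(items[name])
    rest = [
        sum(scores[i] for other, scores in items.items() if other != name)
        for i in range(n_resp)
    ]
    return pearson(items[name], rest)


def prune_items(items, threshold=0.3, min_items=3):
    """Repeatedly drop the weakest item until all reach `threshold`
    (an assumed cut-off) or only `min_items` remain."""
    items = dict(items)  # work on a copy
    dropped = []
    while len(items) > min_items:
        r = {name: corrected_item_total(items, name) for name in items}
        worst = min(r, key=r.get)
        if r[worst] >= threshold:
            break
        dropped.append(worst)
        del items[worst]
    return list(items), dropped


# Hypothetical development pool: 4 items, 6 respondents.
pool = {
    "q1": [1, 2, 3, 4, 5, 6],
    "q2": [2, 2, 3, 5, 5, 6],
    "q3": [1, 3, 3, 4, 4, 6],
    "q4": [3, 1, 4, 1, 5, 2],  # deliberately unrelated to the others
}
kept, dropped = prune_items(pool)
print("kept:", kept)
print("dropped:", dropped)
```

The correlations are recomputed after each removal because dropping one item changes the total against which the remaining items are judged.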

As a general rule, once a (next) prototype scale has been decided, it needs to be validated (i.e. pertinent aspects of its validity and reliability need to be evaluated). At this stage there are likely to be further changes to the scale, and if these are more than minor then a further round of validation might be needed. It is not unusual for the various evaluations to lead to some conflicting suggestions, as to optimum action in respect of a particular item (e.g. if an item is relatively internally inconsistent with the rest of the scale, but is needed for the sake of content validity). Therefore subtle judgement is often required, and this will need to be based on an understanding of the psychometric analyses, and an awareness of the research consequences of decisions that might be taken.

In the validation stage, it is preferable to use a new sample of respondents (not the development sample if there was one), and this sample should be representative of the population in whom the scale will be used. For example, if developing a diagnostic scale, validation should be on patients presenting as possible cases, who would in future be given the scale to complete, in order to examine whether the scale can differentiate true cases from non-cases. Regrettably it is all too often the case that a validation sample comprises a group of true cases and a group of healthy ‘controls’. In such a circumstance discrimination of cases would be trivially easy, and hence validation findings would be over-optimistic regarding the performance of the scale that could be expected in real clinical use. Not all scales are intended to be discriminatory, so in other scales different aspects of reliability and validity will be the focus of the validation research.
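The over-optimism described above can be shown with a toy calculation. The sketch below uses entirely hypothetical scores and, as one assumed measure of discrimination, the probability that a randomly chosen true case scores higher than a randomly chosen non-case (equivalent to the area under the ROC curve). The function name `auc` and the three score lists are invented for illustration.

```python
def auc(case_scores, noncase_scores):
    """Probability that a random case outscores a random non-case
    (ties count as half)."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))


# Hypothetical scale scores.
cases = [7, 8, 6, 9, 7]                # true cases
healthy_controls = [1, 2, 1, 3, 2]     # healthy volunteers
presenting_noncases = [5, 6, 7, 4, 6]  # patients presenting, later found not to be cases

auc_vs_healthy = auc(cases, healthy_controls)
auc_vs_presenting = auc(cases, presenting_noncases)
print(f"cases vs healthy controls:     {auc_vs_healthy:.2f}")
print(f"cases vs presenting non-cases: {auc_vs_presenting:.2f}")
```

With these made-up numbers, discrimination against healthy controls is perfect, whereas against patients who present as possible cases it is noticeably lower, which is the comparison that matters for real clinical use.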


In the Simon et al. study1 what was to be assessed was ‘public awareness of ovarian and cervical cancer’. An initial stage (and sample) to refine a pool of items was not required, since the health aspect being measured is knowledge about ovarian and cervical cancer, and the scale being developed is an adaptation of an existing generic Cancer Awareness Measure (CAM). The cancer-specific knowledge this scale should assess was decided from the literature, and reviewed by expert opinion (content validity). Item-total correlations were calculated and internal reliability was assessed by Cronbach's α. Test-retest reliability was assessed for those items for which this was technically possible (i.e. excluding open-ended items). Responsiveness to change was assessed by having a randomly selected subgroup of respondents read an information leaflet prior to completing the scale, and then comparing their scores with those for respondents not being given the leaflet. Construct validity was ascertained by a formal comparison of scores for ‘standard’ respondents against those for a group of ‘expert’ respondents selected on the basis that they were likely to be knowledgeable.
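Both the responsiveness-to-change and construct validity assessments described above come down to comparing scale scores between two groups. A minimal sketch of such a comparison is given below, assuming (this is not stated in the source) a Welch two-sample t statistic and a standardised mean difference; the scores and the function names `welch_t` and `cohens_d` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    for two independent groups with possibly unequal variances."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled_sd


# Hypothetical knowledge scores.
leaflet = [8, 9, 7, 10, 9, 8]     # read the information leaflet first
no_leaflet = [5, 6, 7, 4, 6, 5]   # completed the scale without the leaflet

t, df = welch_t(leaflet, no_leaflet)
d = cohens_d(leaflet, no_leaflet)
print(f"t = {t:.2f} on about {df:.1f} df, d = {d:.2f}")
```

The same two functions could be applied to ‘expert’ versus ‘standard’ respondents; a p-value would additionally require the t distribution, which is not in the Python standard library.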


Validation of a health measurement scale is essential to ensure a useful and effective tool for execution of health research. The validation process involves fine-grained and detailed technical research and analysis.


  • Competing interests None.

  • Provenance and peer review Commissioned; internally peer reviewed.
