Objective: To compare two scoring systems for a verbal fluency test by means of the Rating Scale Model. Method: Data from 289 participants were analyzed, 92 of whom had been diagnosed with Parkinson's disease. Scores were calculated with two category systems: a conventional procedure and a percentile-based one. Results: The Rasch scores derived from percentiles yield adequate categories and reliable measures; their correlation with Mini-Mental State Examination (MMSE) scores is evidence of concurrent validity. After statistically controlling for the effect of age, the percentile-based Rasch measures discriminate between the two groups, which is evidence of predictive validity. Conclusions: The analysis of the two procedures supports recommending the use of the percentile-based categories.


Verbal fluency (VF) ability is usually measured as the number of words generated under stimulus constraints such as category or first letter (Lezak, Howieson, Bigler, & Tranel, 2012). It involves multiple cognitive processes related to the activation of different brain areas (Troyer & Moscovitch, 1997), including lexical selection, phonemic coding, working memory, and executive control (Paulesu et al., 1997). VF tasks are used to assess verbal production speed, ability to initiate behaviors in response to a novel task (Bryan & Luszcz, 2000), naming ability, response speed, mental organization, search strategy, and some aspects of short- and long-term memory (Light, Parker, & Levin, 1997). Spreen and Strauss (1998) consider VF tasks to be estimators of initiation capability, sustained attention, processing speed, and the ability to suppress inadequate responses. Deficits in VF are frequently found in diseases such as Parkinson’s (Azuma, Cruz, Bayles, Tomoeda, & Montgomery, 2003; Dubois et al., 2007; Henry & Crawford, 2004; Jankovic, 2008) as well as in mild cognitive impairment (Rinehardt et al., 2014).

The most common VF tasks are semantic VF (in which the participant is asked to evoke words of a certain category, e.g., animals, fruits, clothes) and phonemic VF (in which the participant is asked to evoke words starting with a given letter, e.g., P, S, F) (Bryan & Luszcz, 2000). Action VF is the ability to evoke action words (verbs). It is also considered to be an executive functioning measure in clinical populations (Burgess, Alderman, Evans, Emslie, & Wilson, 1998; Piatt, Fields, Paolo, Koller, & Tröster, 1999). In the clinical field, VF tasks are used to detect cognitive decline (Holtzer, Goldin, & Donovick, 2009; Radanovic et al., 2009) and to distinguish normal aging from mild cognitive impairment (Bertola et al., 2014). An exhaustive review of VF tasks and their assessment utility in diverse populations can be found in Lezak, Howieson, Bigler, and Tranel (2012).

Because they require no materials, VF tasks are easy to administer in any cultural context, and so they are commonly included in neuropsychological assessment protocols such as those for language or executive functions. For instance, the Frontal Assessment Battery (FAB) includes a VF task to measure mental flexibility (Dubois, Slachevsky, Litvan, & Pillon, 2000). However, the scoring of VF tests has not received the attention that it deserves. Even though the psychometric properties of VF scores have hardly been studied, parametric statistical methods are typically applied to these scores, taking interval status for granted. Counts are sometimes arbitrarily categorized, as is the case of the FAB VF item (0-2 words = 0; 3-5 words = 1; 6-9 words = 2; > 9 words = 3).

The Rasch approach to measurement can be used to compare the quality of scoring systems (Delgado, 2007; Prieto & Delgado, 2003; Prieto, Delgado, Perea, & Ladera, 2010). From a methodological perspective, the advantages of applying the Rasch family of models are already well known (Freitas, Prieto, Simões, & Santana, 2014). Of special interest is the fact that the measured attribute can be represented on a single dimension, an interval-scaled variable on which people and items are jointly located. However, these models are still underused in the neuropsychological assessment field. Our objective was therefore to compare empirically the functioning of two quantitative scoring systems for a VF test composed of three “items” (semantic, phonemic and action) by means of the Rating Scale Model, an extension of the Rasch Model for polytomous items (RRSM; Andrich, 1978), whose formulation is:

$$\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = B_n - D_i - F_k$$

$P_{nik}$: probability that person $n$'s answer to item $i$ is category $k$;

$P_{ni(k-1)}$: probability that person $n$'s answer to item $i$ is category $k-1$;

$B_n$: ability or attribute of person $n$;

$D_i$: location of item $i$;

$F_k$: transition point (step) between categories $k-1$ and $k$.
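To make the formulation concrete, the following minimal sketch derives the probability of each response category from the log-odds expression above. The function name and the numeric values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def rsm_category_probabilities(b, d, thresholds):
    """Return P(X = k) for k = 0..m under the Rating Scale Model.

    b          -- person measure B_n (logits)
    d          -- item location D_i (logits)
    thresholds -- step parameters F_1..F_m (logits)
    """
    # Category k accumulates the terms (B_n - D_i - F_j) for j = 1..k;
    # category 0 corresponds to the empty sum (0).
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (b - d - f))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_numerators)
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical person, item and step values (illustration only).
probs = rsm_category_probabilities(b=0.5, d=0.0, thresholds=[-1.0, 0.0, 1.0])
```

By construction, the log-odds of any two adjacent categories in `probs` equal B - D - F for the corresponding step, which is exactly the model equation.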



A secondary analysis of the VF scores of 289 participants (142 female; age range: 45-95; education: 2-20 years) was carried out. Of these, 92 had been diagnosed with Parkinson’s disease (P), while the remaining 197 subjects came from a community sample and served as comparison group (C). Informed consent was required. All procedures were performed in accordance with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.


Semantic, phonemic and action fluency tasks were regarded as “items” composing a VF test. In the semantic task participants were asked to evoke as many animal names as they could in one minute. In the phonemic task, participants were asked to evoke as many words starting with the letter P as they could in one minute. In the action VF task, participants were asked to evoke as many verbs as they could in one minute. Combining the tasks is justified given both their common content and large score inter-correlations (r semantic-phonemic = .49; r semantic-action = .60; r phonemic-action = .71).


Scores were calculated with two different category systems: the arbitrary one used by the FAB VF item (0-2 words = 0; 3-5 words = 1; 6-9 words = 2; > 9 words = 3), and a percentile-based procedure. A percentile rank is the percentage of scores in the distribution that fall below a given score. Using ranges of percentile ranks, we calculated the range of words corresponding to each category, as can be seen in Table 1.

Table 1: Percentile Range, Word Number Range and Percentile-based Category

| Percentile Range | Semantic | Phonemic | Action | Category |
|---|---|---|---|---|
| 0-9 | 0-10 | 0-5 | 0-5 | 0 |
| 10-24 | 11-12 | 6-8 | 6-7 | 1 |
| 25-49 | 13-15 | 9-11 | 8-10 | 2 |
| 50-74 | 16-18 | 12-13 | 11-13 | 3 |
| 75-89 | 19-20 | 14-17 | 14-16 | 4 |
| ≥ 90 | ≥ 21 | ≥ 18 | ≥ 17 | 5 |
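The two category systems can be expressed as simple lookups. The sketch below is illustrative only: the cut points are copied from the FAB rule and from Table 1, while the names `FAB_CUTS`, `PERCENTILE_CUTS` and `categorize` are hypothetical.

```python
from bisect import bisect_right

# Lowest word count of categories 1, 2, ... for each system.
FAB_CUTS = [3, 6, 10]  # 0-2 -> 0; 3-5 -> 1; 6-9 -> 2; > 9 -> 3
PERCENTILE_CUTS = {
    "semantic": [11, 13, 16, 19, 21],
    "phonemic": [6, 9, 12, 14, 18],
    "action": [6, 8, 11, 14, 17],
}

def categorize(word_count, cuts):
    """Return the category whose word-count range contains word_count."""
    return bisect_right(cuts, word_count)

# A participant who names 14 animals in one minute:
fab_category = categorize(14, FAB_CUTS)
percentile_category = categorize(14, PERCENTILE_CUTS["semantic"])
```

In this example the FAB rule already assigns its top category (3), whereas the percentile-based system places the same count in category 2 of 5, leaving room to distinguish higher performers.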

Data Analysis

Both sets of scores were then separately calibrated by means of the RRSM. As to person measures, maximum and minimum scores were imputed given that the RRSM does not allow extreme scores to be estimated. Data analysis was performed with Winsteps 3.92.0 (Linacre, 2016), and the adequacy of the response categories was analyzed with the following criteria: (a) sufficient frequency and regular distribution of the categories; (b) the average measures by category increase monotonically along the rating scale; (c) no category misfit; (d) the transition points increase monotonically (Linacre, 2002).

The model fit was evaluated with Outfit, based on the chi-square statistic, and Infit, based on the same statistic but with each observation weighted by its statistical information. Infit/Outfit values over 2 indicate severe misfit (Linacre, 2016). Principal component analysis of residuals was used to assess unidimensionality. According to Reckase (1979), the percent of variance explained should be over 20% and there should not be a second dominant factor.
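As an illustration of how these two statistics differ, the sketch below computes them as mean squares from model-based expectations: Outfit as the unweighted mean of squared standardized residuals, and Infit as the ratio of summed squared residuals to summed model variances. This is a simplified sketch under assumed parameter values, not the Winsteps implementation; all function names and data are hypothetical.

```python
import math

def category_probabilities(b, d, thresholds):
    """Rating Scale Model probabilities for categories 0..m."""
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (b - d - f))
    top = max(log_numerators)
    weights = [math.exp(v - top) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def fit_mean_squares(responses, thresholds):
    """Return (infit, outfit) for an iterable of (x, b, d) observations."""
    squared_residuals, variances = [], []
    for x, b, d in responses:
        probs = category_probabilities(b, d, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals.append((x - expected) ** 2)
        variances.append(variance)
    # Outfit: plain average of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(variances)
    # Infit: the same residuals weighted by their statistical information.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

# Hypothetical responses of one person (measure 0.5) to three items.
responses = [(2, 0.5, -0.11), (1, 0.5, 0.11), (3, 0.5, 0.00)]
infit, outfit = fit_mean_squares(responses, thresholds=[-1.0, 0.0, 1.0])
```

Values near 1 indicate that the observed variation matches what the model predicts; the cut-off of 2 used in the text flags observations that are far noisier than expected.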

After selecting the more adequate scoring system, Differential Item Functioning (DIF) for gender and for group (P and C) was tested so as to refute the hypothesis that VF scores show differential validity in these groups (Wolfe & Smith, 2007). Correlation coefficients between Rasch-modeled scores, demographic variables, and the MMSE were calculated. The difference in means between P and C groups was statistically contrasted controlling for the effect of the associated demographic variables.


It can be seen from Table 2 that the arbitrary response categories (FAB) were not functional according to Linacre's (2002) criteria. The second column shows that the observed frequency for category 0 is not enough (it should be at least 10) to properly estimate the thresholds. The sum of the observed frequencies across categories equals the number of items times the number of participants (3 × 289 = 867). The category score distribution is very asymmetrical: most of the observed frequencies (79%) are clustered in category 3, artificially reducing the variability and thus the reliability of the person scores (Model Person Separation Reliability = .33; Cronbach’s alpha = .56).

Table 2: Arbitrary (FAB) Category System Statistics

| Category | Observed (a) | Average (b) | Infit | Outfit | Threshold (c) |
|---|---|---|---|---|---|
| 0 | 2 | -2.15 | 0.83 | 0.87 | - |
| 1 | 41 | -0.38 | 0.87 | 0.89 | -4.13 |
| 2 | 137 | 2.84 | 1.01 | 1.00 | -0.05 |
| 3 | 687 | 7.23 | 1.02 | 1.04 | 4.18 |

(a) Observed category frequency = count of observations in category. (b) Average measure = sum (Bn - Di) / count of observations in category.

Conversely, the percentile-based response categories are clearly functional, as can be seen from Table 3.

Table 3: Percentile-based Category System Statistics

| Category | Observed (a) | Average (b) | Infit | Outfit | Threshold |
|---|---|---|---|---|---|
| 0 | 60 | -2.38 | .94 | .94 | -4.03 |
| 1 | 115 | -1.47 | 1.11 | 1.12 | -2.29 |
| 2 | 219 | -.52 | .90 | .88 | -.80 |
| 3 | 233 | .53 | 1.00 | 1.00 | .77 |
| 4 | 134 | 1.77 | .88 | .89 | 2.30 |
| 5 | 106 | 2.41 | 1.11 | 1.10 | 4.06 |

(a) Observed category frequency = count of observations in category. (b) Average measure = sum (Bn - Di) / count of observations in category.
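The category-functioning criteria can be checked mechanically against the values reported in Tables 2 and 3. The sketch below is only an illustration: the function name is hypothetical, the minimum count of 10 is the one mentioned in the text, and using 2 as the category misfit cut-off is an assumption borrowed from the severe-misfit rule stated in the Data Analysis section.

```python
def check_categories(observed, averages, infit, outfit, thresholds,
                     min_count=10, misfit_cutoff=2.0):
    """Apply the four category-functioning checks to one scoring system."""
    def increasing(values):
        return all(a < b for a, b in zip(values, values[1:]))
    return {
        "enough_observations": all(n >= min_count for n in observed),
        "averages_increase": increasing(averages),
        "no_category_misfit": all(max(i, o) <= misfit_cutoff
                                  for i, o in zip(infit, outfit)),
        "thresholds_increase": increasing(thresholds),
    }

# Values copied from Table 2 (FAB categories).
fab = check_categories(
    observed=[2, 41, 137, 687],
    averages=[-2.15, -0.38, 2.84, 7.23],
    infit=[0.83, 0.87, 1.01, 1.02],
    outfit=[0.87, 0.89, 1.00, 1.04],
    thresholds=[-4.13, -0.05, 4.18],
)

# Values copied from Table 3 (percentile-based categories).
percentile = check_categories(
    observed=[60, 115, 219, 233, 134, 106],
    averages=[-2.38, -1.47, -0.52, 0.53, 1.77, 2.41],
    infit=[0.94, 1.11, 0.90, 1.00, 0.88, 1.11],
    outfit=[0.94, 1.12, 0.88, 1.00, 0.89, 1.10],
    thresholds=[-4.03, -2.29, -0.80, 0.77, 2.30, 4.06],
)

# Share of all FAB observations that fall in the top category.
fab_top_share = 687 / sum([2, 41, 137, 687])
```

With these inputs the FAB system fails only the frequency check (category 0 has 2 observations), while the percentile-based system passes all four; the checks do not capture the asymmetry of the FAB distribution, which `fab_top_share` quantifies separately.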

Score reliability was much better than with the previous system (Model Person Separation Reliability = .82; Cronbach’s alpha = .79). Thus, the remaining analyses were carried out on the scores calculated with this percentile-based response category system, which, once modeled with the RRSM, will be called measures.
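One way to read the two reliabilities is to convert them to a separation index G, using the standard Rasch relation R = G² / (1 + G²), i.e., G = sqrt(R / (1 - R)). The index is not reported in this study; the sketch below simply applies that relation to the two reported values, and the function name is hypothetical.

```python
import math

def separation_index(reliability):
    """Convert a separation reliability R into the separation index G."""
    return math.sqrt(reliability / (1.0 - reliability))

fab_separation = separation_index(0.33)         # arbitrary (FAB) categories
percentile_separation = separation_index(0.82)  # percentile-based categories
```

The index expresses the spread of person measures in units of their measurement error: the FAB value falls below 1, whereas the percentile-based value is slightly above 2.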

The unidimensionality assumption was fulfilled: the variance explained by the main dimension was very large (64.5%); the eigenvalue of the residual variance first component was 1.74. It can be seen from Table 4 that the remaining fit statistics were also good.

Table 4: Item Statistics

| Item | D | SE | Infit | Outfit |
|---|---|---|---|---|
| Semantic | -.11 | .08 | 1.25 | 1.25 |
| Phonemic | .11 | .08 | .99 | .99 |
| Action | .00 | .08 | .72 | .72 |

Differential Item Functioning (DIF) occurs when an item has a different probability of being passed by persons of a certain group after controlling for the measured attribute. To test for DIF in the Rasch approach, the standardized difference between group parameter locations is calculated after adjusting for group differences and a Bonferroni correction of the significance level is then carried out (Linacre, 2016). Neither gender-related nor group item DIF was found.
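The logic of this test can be sketched as follows. The item locations and standard errors are hypothetical, and the two-sided normal approximation for the p-value is an assumption made to keep the example self-contained; it is not necessarily the exact statistic computed by Winsteps.

```python
import math

def dif_contrast(d_group1, se_group1, d_group2, se_group2, n_tests, alpha=0.05):
    """Standardized difference between two group-specific item locations."""
    contrast = d_group1 - d_group2
    joint_se = math.sqrt(se_group1 ** 2 + se_group2 ** 2)
    z = contrast / joint_se
    # Two-sided p-value under a normal approximation (assumption).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Bonferroni correction: divide the significance level by the number of tests.
    corrected_alpha = alpha / n_tests
    return {"contrast": contrast, "z": z, "p": p_value,
            "flagged": p_value < corrected_alpha}

# Hypothetical locations of one item in two groups, tested across three items.
result = dif_contrast(0.10, 0.12, -0.05, 0.10, n_tests=3)
```

With these illustrative values the contrast of 0.15 logits is less than one joint standard error, so the item would not be flagged for DIF.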

As to person measures, 15 out of 289 were imputed given that the RRSM does not allow extreme scores to be estimated (12 maximum and 3 minimum). The frequency of severe misfit (Infit and/or Outfit > 2) was 38 (13.1%). VF measures had a mean value of 0.30 (SD = 1.97), i.e., slightly over the mean item difficulty, which is conventionally located at the scale zero. The unit of the interval variable constructed by means of the RRSM is the logit.

VF measures correlated significantly with age (r = -.21, p < .001) and years of education (r = .52, p < .001), but not with gender (r = .02, p = .69). For the P and C groups, the mean (SD) was .27 (2.17) and .32 (1.87), respectively, which is a non-significant difference, t(287) = .18, p = .87. When education was statistically controlled by means of ANCOVA, the difference between P and C remained non-significant, F(1, 286) = 1.62, p = .20. However, when the effect of age was controlled, the difference between P and C became significant, F(1, 286) = 6.35, p = .01. This is evidence for predictive validity.

Finally, the correlation of VF measures with the MMSE scores is r = .57, p < .001, evidencing concurrent validity.


Two scoring systems have been evaluated with the RRSM, corroborating that the arbitrary category system was not functioning adequately. Percentile-based response categories were clearly functional, and the resulting scores showed good fit and generalized validity for both genders as well as for the P and C groups. As usual, VF measures correlated significantly with age and years of education, but not with gender. Predictive validity was also supported given the mean differences between P and C scores (after controlling for age), which evidences diagnostic utility. Concurrent validity was also supported, given the large positive correlation of VF Rasch-modeled measures with the MMSE scores.

Even though correct performance in the various VF tasks requires shared cognitive processes (Troyer & Moscovitch, 1997), including sustained attention, search strategy maintenance, lexical selection, inhibition ability, working memory and articulation, there are also some differences among the tasks.

Semantic VF is related to verbal memory and storage (especially linked to the temporal lobe: Birn et al., 2010; Hodges & Patterson, 2007), while phonemic VF is less dependent on memory and more related to initiation and shifting abilities (linked to the frontal lobes: Tröster et al., 1998; Troyer, Moscovitch, Winocur, Alexander, & Stuss, 1998; Troyer, Moscovitch, Winocur, Leach, & Freedman, 1998). Action VF requires working memory, frontal executive processing, initiation ability, sustained attention and search strategy maintenance (Perea, Ladera, & Rodríguez, 2005).

In this study, percentiles are given for each of the VF scores, in addition to the whole test score. In practice this is very useful for clinicians, given the differences in cognitive processing described above. VF patterns are used to distinguish deficits associated with the frontal lobe from those associated with the temporal lobe. Frontal lobe injuries lead to low phonemic (Baldo, Shimamura, Delis, Kramer, & Kaplan, 2001; Hodges et al., 1999) and action VF (Damasio & Tranel, 1993), while temporal lobe injuries lead to deficits in semantic VF (Baldo, Schwartz, Wilkins, & Dronkers, 2006; Hodges et al., 1999) with relatively well preserved verb-evoking ability (Damasio & Tranel, 1993). Our data allow an individual's performance on each VF task to be located, helping to distinguish anterior (frontal) injuries from posterior (temporal) ones.

Finally, it is relevant to note that, even though most neuropsychological test scores are ordinal-level at best, parametric statistical methods are usually found in the reporting of data analysis. The RRSM logistic transformation has served to construct an interval-level variable, which is desirable from both a scientific perspective and a diagnostic one (e.g., it allows change in patient status to be measured). Comparison of a patient with the remaining participants is implicit in the percentile-based category system, which facilitates personalized interpretation. Moreover, as usual in the Rasch approach, unexpected response patterns can give rise to new clinical and/or scientific hypotheses (Prieto, Delgado, Perea, & Ladera, 2010).