Case Study #8: ALCOVE
1. The goal and architecture of ALCOVE
1.1. The cognitive task of the model
Category learning abounds in everyday life, and is a fundamental aspect of cognition. How, for example, do children learn that certain squiggles on paper belong to the category "R," but other squiggles belong to the category "P"? How do mushroom hunters learn that particular fungi are edible but others are poisonous? How do physicians learn which symptoms indicate which disease?
In a standard laboratory paradigm for studying category learning, a participant is presented with a stimulus, such as a simple geometric figure or a list of symptoms, and the participant must guess the correct category label from a pre-specified set of possible labels, such as F, G, H or J. After his or her guess, the participant is told the correct label, and then is shown another stimulus to try to categorize. After many such trials, the participant learns which stimuli belong to which categories, and also makes systematic generalizations to novel stimuli. Cognitive scientists try to understand the principles of category learning in this circumscribed laboratory paradigm with the hope that these principles will also apply in more complicated applications, just as the physicist can profitably understand how ball bearings fall in a vacuum as a step to understanding how leaves fall in a thunderstorm.
1.2. The data and input/output task
Despite the restrictions imposed by this laboratory paradigm, the observed effects are rich, varied, robust, and perplexing. Some types of categorical distinctions are easier to learn than others, and some instances of categories are easier to learn than other instances. Generalizations to novel stimuli seem to be based sometimes on graded similarities to training instances but other times seem to be based on application of discrete rules. Generalizations seem sometimes to take category base rates (frequencies) into account, but other times seem to contradict the base rates.
The ALCOVE model addresses trial-by-trial category learning. On each training trial, ALCOVE is presented with a stimulus, makes a prediction of the distribution of category choices, is presented with the correct classification, and then adjusts its associative weights and dimensional attention strengths (defined below). In most applications, ALCOVE is trained on the same set of training sequences experienced by the human participants. A concrete example is provided below.
1.3. The network architecture
ALCOVE is a feedforward network with three layers of nodes. The input nodes encode the stimulus, with one node per dimension or feature. Each node is gated by a multiplicative attention strength, which is adjusted during learning to reflect the relevance of the dimension for the categorical distinction being learned.
The internal (hidden) layer consists of nodes that represent training exemplars. The activation of a hidden node reflects the similarity of the stimulus to the exemplar represented by the node. These "exemplar" nodes compute their activation in two steps: First, they compute the distance between the stimulus and the exemplar they represent, then they compute their activation as a monotonically decreasing function of distance. (A special case of this type of node is a "radial basis function," but exemplar nodes in ALCOVE use somewhat different functions to better reflect psychological data.)
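The two-step computation of an exemplar node can be sketched as follows. This is a minimal illustration, assuming the attention-weighted distance and exponential-decay form usually reported for ALCOVE; the function name, the exponents r and q (both set to 1 here), the attention values, and the second exemplar's coordinates are illustrative assumptions, not values taken from this paper.

```python
import math

def exemplar_activation(stimulus, exemplar, attention, specificity, r=1.0, q=1.0):
    """Activation of one exemplar node for a given stimulus.

    Step 1: attention-weighted distance between stimulus and exemplar
            (r = 1 gives a city-block metric).
    Step 2: activation falls off monotonically with distance
            (q = 1 gives exponential decay).
    """
    distance = sum(
        a * abs(h - x) ** r for a, h, x in zip(attention, exemplar, stimulus)
    ) ** (1.0 / r)
    return math.exp(-specificity * distance ** q)

# Stimulus coordinates as (segment position, height).
stimulus = (-0.58, 1.60)
attention = (0.5, 0.5)  # assumed equal initial attention strengths

# An exemplar identical to the stimulus is maximally activated.
same = exemplar_activation(stimulus, (-0.58, 1.60), attention, specificity=1.0)
# A distant (hypothetical) exemplar is only weakly activated.
far = exemplar_activation(stimulus, (0.58, -1.60), attention, specificity=1.0)
```

Setting a dimension's attention strength to zero removes that dimension from the distance, which is how attention makes exemplars that differ only on an irrelevant dimension look identical to the network.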
The output layer consists of one node per category, with each category node's activation computed as a sum of weighted activations from the exemplar nodes. Category activations are converted to choice probabilities using an exponentiated form of the Luce choice rule (a.k.a. the softmax rule in computer science).
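A minimal sketch of the output layer and the choice rule is given below. The function names, the example activations and weights, and the value of the scaling constant `phi` are illustrative assumptions; the softmax is written in its exponentiated form, with the maximum subtracted for numerical stability.

```python
import math

def category_activations(hidden, weights):
    """Weighted sum of exemplar activations for each category node.

    weights[k][j] is the association from exemplar node j to category node k.
    """
    return [sum(w * a for w, a in zip(row, hidden)) for row in weights]

def choice_probabilities(outputs, phi):
    """Map category activations to choice probabilities (softmax with scale phi)."""
    top = max(outputs)  # subtracting the max leaves the ratios unchanged
    exps = [math.exp(phi * (o - top)) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical network state: two exemplar nodes, two category nodes.
hidden = [1.0, 0.2]
weights = [[0.8, -0.1],
           [-0.8, 0.1]]
outputs = category_activations(hidden, weights)
probs = choice_probabilities(outputs, phi=2.0)
```

Larger values of `phi` make the predicted choices more deterministic; at `phi = 0` every category is chosen with equal probability regardless of the activations.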
1.4. A concrete example
As an illustration of how ALCOVE is applied to empirical data, consider the filtration-condensation experiment described by Kruschke (1993a, pp. 12-21). Each stimulus was a rectangle with an internal vertical segment. On different trials the rectangle could be one of four heights, and the internal segment could be in one of four lateral positions. Of the sixteen possible combinations, eight stimuli were used, as diagrammed schematically in Figure 1.
[Figure 1: diagram of the eight rectangle stimuli, each with an internal vertical segment.]
Figure 1. Schematic diagram of stimuli in an experiment by Kruschke (1993a). The rectangle had four possible heights, arrayed vertically in this Figure from shortest to tallest. The internal segment had four possible lateral positions, arrayed horizontally in this Figure from left-most to right-most.
In one of the "filtration" conditions, stimuli were classified such that if the internal segment was left of center then the correct category label was "B," otherwise the correct response was "N" (with responses made by pressing the corresponding key on the computer keyboard). In this situation there were two stimulus dimensions, eight training exemplars, and two categories, so ALCOVE has two input nodes, eight exemplar nodes, and two category nodes.
The input values for the stimuli are determined by a separate multi-dimensional scaling study. (See the Appendix of Kruschke, 1993a, for full details.) After asking observers to rate the similarities of pairs of stimuli, one can derive the coordinates in psychological space that best satisfy the similarity ratings. In particular, whereas the physical increments in lateral position of the internal segment were equal, the psychological increments were not equal: The psychological difference between just-left-of-center and just-right-of-center is greater than the psychological difference between far-left and just-left-of-center.
Suppose that on a given trial the left stimulus in the top row of Figure 1 is presented. Its scale values for segment position and height are -0.58 and 1.60, respectively. Hence, the input node for segment position is given an activation of -0.58, and the input node for height is given an activation of 1.60. Each of the eight exemplar nodes then computes how similar the input is to the exemplar it represents. The exemplar node activations are then propagated to the category nodes, and the predicted choice probabilities are determined by the relative activations of the two category nodes. The network predicts choice probabilities; it does not make discrete choices. The correct category is then provided; in this case the "B" category node is given a teacher value of 1.0 and the "N" category node is given a teacher value of -1.0. The error is computed and propagated back down the network, and then the association weights and dimensional attention strengths are adjusted to reduce the error. On this trial, the attention strength on the lateral-position node is increased, and the attention strength on the height node is decreased. The associative weight from the presented exemplar node to the "B" category node is increased, the associative weight from the presented exemplar node to the "N" category node is decreased, and the weights from the other activated exemplar nodes change in the same directions, in proportion to their activations. The process repeats on the next trial. The network can be trained on the same sequence of stimuli observed by human learners.
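The trial just described can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: only the coordinates (-0.58, 1.60) come from the text, and the other seven coordinates are hypothetical values filled in by symmetry; the parameter values, the initial attention strengths, the "humble" teacher (which does not penalize overshooting the target), and the clipping of attention at zero are likewise assumptions.

```python
import math

def alcove_trial(stimulus, correct, exemplars, attention, weights,
                 c=1.0, lr_w=0.1, lr_a=0.05):
    """Run one trial, updating `attention` and `weights` in place.

    Returns the category-node activations computed before the update.
    """
    n_cat, n_ex = len(weights), len(exemplars)
    # Forward pass: per-dimension distances, exemplar and category activations.
    dists = [[abs(h - x) for h, x in zip(ex, stimulus)] for ex in exemplars]
    hidden = [math.exp(-c * sum(a * d for a, d in zip(attention, dj)))
              for dj in dists]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    # Teacher values: +1 for the correct category, -1 otherwise.
    # Assumption: a "humble" teacher that accepts overshoot without error.
    teacher = [max(1.0, o) if k == correct else min(-1.0, o)
               for k, o in enumerate(out)]
    err = [t - o for t, o in zip(teacher, out)]
    # Error propagated back to each exemplar node (using pre-update weights).
    back = [sum(err[k] * weights[k][j] for k in range(n_cat))
            for j in range(n_ex)]
    # Gradient step on attention strengths, clipped at zero (assumption).
    for i in range(len(attention)):
        grad = sum(back[j] * hidden[j] * c * dists[j][i] for j in range(n_ex))
        attention[i] = max(0.0, attention[i] - lr_a * grad)
    # Gradient step on association weights.
    for k in range(n_cat):
        for j in range(n_ex):
            weights[k][j] += lr_w * err[k] * hidden[j]
    return out

# Hypothetical (position, height) coordinates; only (-0.58, 1.60) is from the text.
exemplars = [(-0.58, 1.60), (0.58, 1.60), (-1.60, 0.58), (1.60, 0.58),
             (-1.60, -0.58), (1.60, -0.58), (-0.58, -1.60), (0.58, -1.60)]
# Filtration on position: left of center is "B" (index 0), otherwise "N" (index 1).
labels = [0 if pos < 0 else 1 for pos, _ in exemplars]

attention = [0.5, 0.5]                          # [position, height]
weights = [[0.0] * len(exemplars) for _ in range(2)]
for epoch in range(20):
    for stim, label in zip(exemplars, labels):
        alcove_trial(stim, label, exemplars, attention, weights)

final_out = alcove_trial(exemplars[0], labels[0], exemplars, attention, weights)
```

After training on this filtration structure, the attention strength on position should exceed the attention strength on height, and the "B" node should be the more active node for the top-left stimulus. A faithful simulation would use the scaled coordinates and trial sequences from the original experiment.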
The mean predictions of ALCOVE are compared with choice proportions of human learners, and the four free parameters of the model are adjusted to best fit the human data. The four free parameters consist of (1) a learning rate for the association weights, (2) a learning rate for the attention strengths, (3) a "specificity" for the exemplar nodes that determines an overall scale for computing similarity, and (4) a scaling constant for mapping category node activations to choice proportions.
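The fitting step can be sketched as a search over the four parameters for the setting whose predictions deviate least from the observed choice proportions. Everything below is a hypothetical illustration: the names `AlcoveParams`, `sum_squared_deviation`, and `grid_search` are invented here, sum-squared deviation is only one possible measure of fit, a coarse grid is only one possible search method, and `toy_simulate` is a stand-in learning curve rather than ALCOVE itself.

```python
import math
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class AlcoveParams:
    lr_weights: float    # (1) learning rate for the association weights
    lr_attention: float  # (2) learning rate for the attention strengths
    specificity: float   # (3) overall similarity scale of the exemplar nodes
    choice_scale: float  # (4) maps category activations to choice proportions

def sum_squared_deviation(predicted, observed):
    """Discrepancy between predicted and observed choice proportions."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def grid_search(simulate, observed, grid):
    """Return (best_params, best_loss) over every combination in `grid`.

    `simulate(params)` must return predicted proportions aligned with `observed`.
    """
    best = None
    for values in product(*grid.values()):
        params = AlcoveParams(**dict(zip(grid.keys(), values)))
        loss = sum_squared_deviation(simulate(params), observed)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best

def toy_simulate(params):
    # Stand-in learning curve over eight blocks; NOT the ALCOVE model.
    return [1.0 - 0.5 * math.exp(-params.lr_weights * block)
            for block in range(1, 9)]

# Pretend "human" data generated from known parameters, then recovered by search.
observed = toy_simulate(AlcoveParams(0.1, 0.05, 1.0, 2.0))
grid = {
    "lr_weights": [0.05, 0.1, 0.2],
    "lr_attention": [0.05],
    "specificity": [1.0],
    "choice_scale": [2.0],
}
best_params, best_loss = grid_search(toy_simulate, observed, grid)
```

In a real fit, `simulate` would run the network over the participants' training sequences and return its mean predicted choice proportions.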
2. Structure, memory, time and change in ALCOVE
2.1. Structure. There are two important structural assumptions embodied in ALCOVE, both of which stem directly from the model's predecessor, the Generalized Context Model (Nosofsky, 1986), and are based on well-established psychological evidence. First, the input representation for a stimulus is its scale values on independently derived psychological dimensions. The psychological dimensions and their scale values are determined by multidimensional scaling techniques conducted independently of the category learning experiments. Thus, the structure of the input is not a cheat or trick (Lachter & Bever, 1988) that trivializes the modeling, but it is instead a constraint that demands a particular input representation. The goal of the model is to learn associations between the input representation and category labels, in a manner comparable to human learning. The form of the input representation is necessary but not sufficient to achieve this goal. One reason it is important to have the psychological dimensions represented explicitly is that people can learn to attend selectively to psychological dimensions that are relevant to the category distinction, and the model reflects this ability directly and explicitly.
The second major structural assumption is that all features of every stimulus are stored in memory; hence ALCOVE is called "exemplar-based," as opposed to prototype-, exemplar-fragment-, or rule-based. Each hidden node in the model represents a training exemplar, and a hidden node's activation represents the psychological similarity of the current stimulus to the exemplar represented by the node. The model learns associative strengths between memory exemplars and category labels.
2.2. Memory. The model is supposed to learn two things: the relevance of each stimulus dimension to the category distinction, and the associative strength of each exemplar to the category labels. The method for learning is gradient descent on error, via the mechanism of backward propagation.
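For concreteness, the two learning rules can be written out. The notation below is an assumption introduced here rather than taken from this paper: a_i^in is the activation of input node i, h_ji is the coordinate of exemplar j on dimension i, alpha_i is the attention strength, c is the specificity, w_kj is the association weight from exemplar j to category k, t_k is the teacher value, and lambda_w and lambda_alpha are the two learning rates. The sketch assumes a city-block distance with exponential decay of activation.

```latex
\begin{align*}
a_j^{\mathrm{hid}} &= \exp\Bigl(-c \sum_i \alpha_i \,\bigl|h_{ji} - a_i^{\mathrm{in}}\bigr|\Bigr),
\qquad
a_k^{\mathrm{out}} = \sum_j w_{kj}\, a_j^{\mathrm{hid}}, \\
E &= \tfrac{1}{2} \sum_k \bigl(t_k - a_k^{\mathrm{out}}\bigr)^2, \\
\Delta w_{kj} &= -\lambda_w \frac{\partial E}{\partial w_{kj}}
  = \lambda_w \bigl(t_k - a_k^{\mathrm{out}}\bigr)\, a_j^{\mathrm{hid}}, \\
\Delta \alpha_i &= -\lambda_\alpha \frac{\partial E}{\partial \alpha_i}
  = -\lambda_\alpha \sum_j \Bigl[\sum_k \bigl(t_k - a_k^{\mathrm{out}}\bigr)\, w_{kj}\Bigr]
    a_j^{\mathrm{hid}}\, c\, \bigl|h_{ji} - a_i^{\mathrm{in}}\bigr|.
\end{align*}
```

The bracketed term in the last line is the error propagated back to exemplar node j, which is why attention learning depends on the current association weights.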
2.3. Time. ALCOVE measures time in terms of single-trial increments in laboratory experiments. It does not model real-time dynamics in the process of categorization or learning. This approach to temporal dynamics is a pre-theoretical simplification, wherein it is assumed that the largest effects in this paradigm of category learning appear at the scale of trial-to-trial exposures. In future research, as finer temporal details of learning are addressed, models will have to incorporate real-time dynamics.
2.4. Change. ALCOVE assumes that the learning takes place over a time scale short enough that the underlying psychological structures do not change. Thus, ALCOVE assumes that the psychological dimensions of the input representation are unchanging during the course of learning. On a larger time scale, of course, various psychological dimensions can develop, be learned, or be refined, but ALCOVE does not address such changes.
3. Identifying functional roles of components in ALCOVE
"The goal of this workshop is ... to identify the functional roles of ANN [artificial neural network] components, the link from the component to the overall behaviour and how they combine to explain cognitive phenomena" (from the Rationale section of the Call for Papers). Presumably, there are two problems that make this goal non-trivial. First, the principles embodied by components of ANN's might not have direct correspondence with the explanatory principles for particular cognitive phenomena. Second, the behavior of the ANN itself might be only partially understood (McCloskey, 1991). ALCOVE is less prone to these problems than vanilla ANN's because it directly formalizes domain-specific explanatory principles with parameterized functions.
ALCOVE embodies three principles of category learning (Kruschke, 1993b). First, people learn to selectively attend to (psychological) dimensions that are relevant to the category distinction. For example, stimuli might vary in size and color, with the correct category dependent only on size. People quickly learn in this situation that color is irrelevant. A second principle in ALCOVE is that people encode the complete exemplar in memory (perhaps unconsciously) on every trial. This principle is useful for explaining people's sensitivity to the frequencies of individual training exemplars, people's ability to learn exceptions to rules, and people's ability to remember prior learning when confronted with new stimuli to learn. A third principle is that learning is error driven, so that associative weights and attention strengths are adjusted so as to reduce the discrepancy between the predicted category and the correct category.
In ALCOVE, the formalization of each principle is explicit, in that each principle is separately formalized and parameterized. None of the principles emerges exclusively from lower level mechanisms. Consequently, the embodiments of structure, memory, time and change are explicit and transparent, and the functional roles of each principle are clearly defined and demonstrable.
3.1. Role of attention. The major emphasis of ALCOVE is the role of learning which dimensions to attend to. Two empirical phenomena graphically illustrate the importance of learning dimensional relevance. Both phenomena show that when fewer dimensions are relevant to the category distinction, people learn the distinction faster (i.e., in fewer trials). First, Kruschke (1993a) showed that ALCOVE exhibits an appropriate advantage for learning one-dimensional distinctions ("filtration") relative to two-dimensional distinctions ("condensation"). Second, Kruschke (1992) showed that ALCOVE exhibits the proper ordering of the six category distinctions studied by Shepard, Hovland & Jenkins (1961), and Nosofsky, Gluck, Palmeri, McKinley & Glauthier (1994) showed that ALCOVE gives good quantitative fits to learning the six categories.
The functional role of attention is easily isolated in these cases because attention is separately parameterized in the model. When the learning rate for dimensional attention is fixed at zero, ALCOVE cannot show the proper effects. Moreover, Kruschke (1993a) showed that vanilla backpropagation, which does not have dimensional attention, cannot exhibit the proper effects, but if attentional gates analogous to ALCOVE are added, then the extended backpropagation model can also display learning behavior comparable to humans in these situations.
3.2. Role of exemplar representation. A number of empirical phenomena provide evidence for exemplar representation in human category learning. Two such phenomena are emphasized here. First, exemplar-based representation is useful for explaining the fact that people do not "catastrophically" forget previously learned classifications when learning new classifications. Kruschke (1993a) showed that ALCOVE can fit human performance well in a situation where (attentionally extended) backpropagation exhibits catastrophic forgetting (cf. McCloskey & Cohen, 1989). The situation was designed perniciously to take advantage of standard backpropagation's lack of exemplar representation.
There is no implication that humans use exemplar-based representation exclusively, nor is there an implication that humans remember complete exemplars on every occasion. In particular, people can also use rules for classifying stimuli. Even with rules, however, there can be exceptions, and the ability to learn exceptions, the second phenomenon emphasized here, can be accounted for by positing exemplar-based representation of the exceptions. Kruschke and Erickson (1994) combined ALCOVE with a rule-based module to account for the relative speeds of learning rules and exceptions. Without the ALCOVE module, specific exceptions could not be learned, but without the rule-based module, ALCOVE could not extrapolate the rule far or fast enough.
3.3. Role of error-driven learning. The third main principle in ALCOVE is the use of error to drive learning of both the attention strengths on dimensions and the associative weights from exemplars to categories. Kruschke (1992, 1993b; see also Nosofsky, Kruschke & McKinley, 1992) argued that error-driven learning of associative weights was crucial for ALCOVE's account of "apparent base-rate neglect." Whereas error-driven learning of associative weights was crucial for the particular learning sequences modeled, a general account of apparent base-rate neglect and the "inverse base-rate effect" depends on error-driven learning of featural attention strengths (Kruschke, 1996).
4. Summary and Conclusion
A variety of phenomena in category learning are addressed by the psychological principles formalized in ALCOVE and its extensions (Choi, McDaniel & Busemeyer, 1993; Kruschke, 1992, 1993a, in press; Kruschke & Erickson, 1994; Nosofsky, Gluck, Palmeri, McKinley & Glauthier, 1994; Nosofsky & Kruschke, 1992; Nosofsky, Kruschke & McKinley, 1992). ALCOVE emphasizes three psychological principles - error-driven shifts of attention, exemplar-based representation, and error-driven learning of associative weights - formalized in a particular way. There are many other possible formalizations of the principles. The power of a formalization is that precise predictions can be derived from it, independent of the vagaries of the verbal principles and the intuitions of the theorist. In this framework, the explanatory power, i.e., the quality that makes human thinkers say "ah, now I understand," derives from the principles, and the formalization plays the role of buttressing the principles and giving them specific predictive applications. If the predictions are confirmed, both the principles and their formalization are strengthened. If the predictions are disconfirmed, then either the formalization or the principles or both are wrong in some respect.
The three principles embodied in ALCOVE are fairly specific to the domain of category learning. They are not necessarily generic principles of "brain style" computation, or parallel distributed processing (PDP), as defined by McClelland (1992; see also Cohen, Servan-Schreiber & McClelland, 1992) and promoted by Seidenberg (1993). The principles in ALCOVE could be implemented in connectionist networks that adhere to the generic computational style, but not every generic network can implement the specific principles in ALCOVE (Kruschke, 1993a).
The research referred to here was supported in part by NIMH FIRST Award 1-R29-MH51572-01, by a Summer Faculty Fellowship from Indiana University, and by NIH Biomedical Research Support Grant RR 7031-25.