This is a plain-ASCII version of a paper published in Presence. Copyright remains with the MIT Press. The full bibliographic citation for this work is: Newby, Gregory B. "Gesture Recognition Based upon Statistical Similarity." 1994. Presence 3(3): 236-243.

-----

Gesture Recognition Based Upon Statistical Similarity

Gregory B. Newby
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Urbana, IL, 61801. Email: gbnewby@uiuc.edu

Abstract

One of the improvements virtual reality offers traditional human-computer interfaces is that it enables the user to interact with virtual objects using gestures. The use of natural hand gestures for computer input provides opportunities for direct manipulation in computing environments, but not without some challenges. The mapping of a human gesture onto a particular system function is not nearly so easy as mapping with a keyboard or mouse. Reasons for this difficulty include individual variations in the exact gesture movement, the problem of knowing when a gesture starts and ends, and variation in the relative positions of other body parts which might help to identify a gesture but are not measured. A further difficulty stems from limitations on the number of gestures that a person can reliably remember and reproduce. This paper describes work on the statistical recognition of gestures based on the sum of squares. A DataGlove(tm) was employed to measure finger position and "train" software to recognize the letters and numbers of the American Sign Language (ASL) manual alphabet. This technique for gesture recognition is more effective than methods commonly employed in VR applications in that it can distinguish dozens of gestures and is not bound by the input sequences of a particular user. The work described here is limited in that it only examines gestures which do not occur across time.
Applications for speakers of ASL and for VR are discussed, and future directions for gesture recognition research are introduced. These include adding a motion tracker and potential for recognizing gestures that do occur across time.

1. Introduction

Several aspects of virtual reality (VR) are different from those of other forms of human-computer interaction. One of the most important is the use of gesture-oriented input devices. Traditional input devices (e.g., keyboards) do not use natural gestures. Mice and lightpens allow for two dimensional input (plus a button to depress), and enable a limited set of gestures for input. VR input devices have gone beyond the limits of previous generations of input devices by giving computers the capability to recognize human gestures. This is in contrast to other forms of human-computer interaction which require the user to translate what he or she wants to do into a set of input sequences the computer can understand [1]. With VR input devices, the computer can "learn" appropriate responses to gestures, instead of requiring the user to learn the correct sequences to control the computer.

The use of gestures, especially gestures involving the hands, is the focus of this work. Glove devices which measure finger positions and trackers to assess the position and orientation of the hands, head, and body are basic interface components of gesture input devices. This investigation is limited to the finger position on one hand only, although the same methods may be applied for two hands and for various body position and orientation data.

Speakers of American Sign Language (ASL) employ gestures to communicate with each other to a degree that far exceeds that of spoken English (hereafter, referred to as "English" [2]). ASL is primarily used for communication among deaf people or between the deaf and non-deaf, and does not make use of sound, which is the primary component of spoken communication among English speakers.
People who communicate with ASL use hand movements as a primary component of communication, augmented with body movement and facial expression, to a larger extent than do speakers of English.

The purpose of this work is to investigate the ability of a computer to recognize gestures such as those produced by speakers of ASL. Various methods for recognizing gestures are discussed, and the role of gesture recognition for human-computer interaction and for ASL speech recognition is presented. A small-scale study to recognize letters of the manual alphabet of ASL is described. Future directions for gesture recognition research compose the concluding sections.

Although this work is presented in the context of ASL, the use of gestures for interaction in virtual environments is often fundamental. These techniques may be used in any virtual environment where the recognition of gestures is required.

2. Goals for Gesture Recognition

Two general purposes of gesture recognition for computer applications are (1) as input to control events in computer applications and (2) for translation into spoken or written English. The second purpose assumes the first, and is more difficult. The first case is the more general, and may be applied for VR or non-VR applications as well.

In VR, the most frequently discussed applications include a very small number of available input choices -- grabbing and pointing are the only gestures used in many "fly through" virtual worlds (cf. Newby, 1993). Additionally, graphical scenes may be based on the position of the user's head. In contrast to a typical computer application, the number of discrete types of input that a VR application allows is very limited. On a word processor, for example, about 100 keys may be pressed, often in combination, and a mouse may be used to modify the input. Pull-down menus provide further options.
Even a simple program that seems appropriate for implementation in a virtual world -- a paint or draw application, for example -- requires a number of gestures to grab and release tools, apply/use them, select areas, and so forth. If we are to develop VR applications to make use of the fuller range of input that the gloves and motion trackers of VR offer, we must develop methods for distinguishing those input events from one another with greater precision than is currently found.

"Fly through" applications for VR are commonly demonstrated. These use the simple gestures of grabbing and pointing for navigation through a visual environment. Grabbing and pointing are typically recognized using the "fixed-parameter" method described below. More complicated applications require the capability to distinguish among a larger number of gestures. ASL provides a good model for VR interaction as it includes thousands of gestures well suited to measurement with VR input devices. The remainder of this section discusses ASL as a method for communication, starting with the automatic generation of English translations from ASL input.

2.1. ASL Grammar

The grammar of ASL is generally simpler than that of English (American Standard or any other dialect). For example, modifiers for tense (past, present, future) occur before a phrase is signed, and stay in effect until another modifier is given. The single sign for "to go" could mean "will go," "have/has gone," and "am going," depending on whether a modifier for future, present, or past tense is present [3]. ASL phrases leave out articles and other parts of speech, e.g., "a," "an," "the," "of." The placement of modifiers, lack of articles, and use of multi-part signs contribute to the lack of isomorphism between ASL phrases and their English counterparts. For example, a gesture-by-gesture translation of an ASL phrase into English might yield:

    Future I go again eat here.

The equivalent English phrase would be: "I will eat here again."
More complex ideas expressed in ASL may be further removed from their grammatically accurate expression in English. Although direct translations are not very difficult for people knowledgeable in both languages, the creation of computer algorithms to translate between the languages, so that a gesture recognition program could be used to produce synthesized English speech, would involve more than simple one-to-one matching of ASL signs and their English counterparts. The remainder of this work is concerned with the identification of ASL signs, not their translation into another language.

2.2. ASL Signs

There are three components to an ASL sign: the location relative to the body, the finger position, and the movement. Many signs are made with both hands and arms, and a large number of common English words or ideas have single ASL signs. When a word does not have a sign or the sign is not known, it is spelled using the manual alphabet. This happens frequently with proper names and technical terms. Many signs include components from other signs, such as the sign for "green" which starts with the hand position for the letter "G" but has added movement.

Signs in ASL are made relative to body reference points: parts of the head and face, the arms, the shoulders, and torso. Some two-handed signs are symmetric, others involve different movements with each hand. Signing also relies heavily on facial expression to support the communication or add emphasis. Signers may speak words silently in English with their mouths as they speak ASL with their hands and body.

3. Methods for the Automatic Recognition of Gestures

For ASL speakers, the dominant communication components are the gestures made while communicating. In virtual environments, gestures are used to navigate through a representation of a physical or physical-like domain or to interact with applications.
Both ASL speakers and VR users can benefit from effective methods for a computer to recognize gestures and learn to distinguish among gestures. This section discusses two computer methods and their associated issues: fixed-parameter recognition and sum of squares statistical recognition.

"Fixed-parameter" is used here to refer to gesture recognition in virtual environments that sets up a list of parameters for the input measures and compares the current input list to those parameters in order to recognize the gesture. A typical VR application might involve using these gestures: a fist, an open hand, and pointing with one or two fingers. The DataGlove produces two numeric values for each finger [4]. Fixed-parameter recognition sets up a list of minimum or maximum values for each of the fingers that distinguishes these gestures. So, a fist would involve high values on every finger; an open hand would have low values on every finger; pointing would have high values on some and low on others.

The problems with the fixed-parameter approach are, first, that it is not well suited for large numbers of gestures and, second, that the parameters may be different for different users (or even for the same user at different times). The main advantages of this method are conceptual simplicity and computational efficiency. This method is used in most VR applications, especially those that involve "flying through" a virtual environment. These use finger gestures to indicate direction and velocity and grabbing gestures to select items in the virtual world.

A somewhat more sophisticated approach to gesture recognition is the measurement of similarity by the "sum of squares" method. The sum of squares is one of the simplest statistics used for assessing the similarity of two sets of scores (e.g., Tukey, 1977).
The sum of squares statistic is calculated by squaring the differences between current values and each matched prototype value in the set of known gestures and then summing these squared difference scores. With the DataGlove, this may be accomplished by obtaining "prototype" values for each gesture to be matched. Then, a gesture to be recognized is compared to each known prototype. The known gesture with the smallest sum of squares score is the one "closest" to the current input gesture.

The DataGlove used for the work described here produces values from the 2 fiber-optic strands mounted on each finger for a total of 10 values. Raw data from each sensor range from 0 (completely straight) to 255 (highly non-straight). Calibration software is used to produce floating-point values from 0 to 2 for each sensor with the upper value corresponding to the maximum bend which a particular user makes when wearing the glove (e.g., while making a fist). Specific values therefore correspond with fair precision to a finger bend position at the knuckle and center finger joint for a particular user. By calibrating at the start of a session and re-calibrating periodically during a session, effects of the drift in absolute (0-255) DataGlove values due to the equipment warming up, the hand swelling during a session, and small variations in hand size across sessions may be minimized.

The sum of squares for each known gesture is calculated by taking the difference between the current gesture and the prototype gesture on every DataGlove value, producing 10 difference scores. Each difference is squared, then the squared values are added. The one number resulting is the sum of squares for that known gesture and the current gesture. In the case where the known gesture is identical to the current gesture on all 10 finger values, the sum of squares will be 0. As the two gestures diverge, the sum of squares increases.
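Written as a formula, the statistic just described takes the following form. The symbols are introduced here for illustration and do not appear in the original: c_j is the calibrated value of sensor j for the current gesture, p_{ij} is the value of the same sensor in the prototype for known gesture i, and S_i is the resulting score.

```latex
S_i = \sum_{j=1}^{10} \left( c_j - p_{ij} \right)^2
```

Assuming the calibrated values stay within the 0 to 2 range, each squared difference is at most 4, so S_i lies between 0 (identical on all 10 sensors) and 40.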
The set of sum of squares scores can be ranked to see which gesture is closest, and a tolerance for error may be used to cull scores which are too large to be a "good" match.

Algorithm:

    ! For each gesture we have trained to recognize:
    For i = 1 to Number_of_Known_Gestures
        ! Sum the squared difference scores
        score(i) = Sum for j = 1 to 10      ! 10 sensors
                   (current_finger(j) - prototype_finger(i,j))^2

    ! What's the lowest difference score?
    Best_Match = [ gesture i with minimum (score(1), score(2), ...
                   score(Number_of_Known_Gestures)) ]

    ! Is the lowest difference score small enough to
    ! be interesting?
    If [ score(Best_Match) < Tolerance ] then Match = Best_Match ;
    else Match = none

    ! Now, "Match" is either the known gesture which
    ! most closely matches the current gesture, or "none"

The benefit of the sum of squares approach is that it can efficiently distinguish a far larger number of gestures than the fixed-parameter method. As long as the DataGlove values of new gestures are different from known gestures, new gestures may be added. Different users can provide their own prototype gestures, so that individual differences will not detract from the accuracy (this could be done for fixed-parameter methods as well, but in practice seldom is). Prototype gestures may be shared by different users, but the best results are obtained when each user provides his or her own prototypes.

The main problem with the sum of squares method is its computational complexity. Even though this is a simpler statistic than most, it does involve several dozen floating-point operations for each known gesture, followed by ranking and comparison to the tolerance for error. To be effective for gesture recognition in virtual environments, this must be accomplished for each incoming set of DataGlove values as they become available, in "real time."
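The matching procedure described in this section can be sketched in Python as follows. This is a minimal illustration, not the C code used in this work: the function names, the prototype values, and the ordering of the 10 sensor values are all hypothetical, and the default tolerance of 0.5 is only a placeholder to be tuned.

```python
from typing import Dict, Optional, Sequence, Tuple

NUM_SENSORS = 10  # two bend sensors on each of five fingers


def sum_of_squares(current: Sequence[float], prototype: Sequence[float]) -> float:
    """Sum of squared differences between two sets of sensor values."""
    if len(current) != NUM_SENSORS or len(prototype) != NUM_SENSORS:
        raise ValueError("expected 10 sensor values per gesture")
    return sum((c - p) ** 2 for c, p in zip(current, prototype))


def recognize(
    current: Sequence[float],
    prototypes: Dict[str, Sequence[float]],
    tolerance: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Return (label, score) for the closest prototype.

    The label is None when even the lowest score is not under the tolerance.
    """
    if not prototypes:
        raise ValueError("at least one prototype gesture is required")
    scores = {label: sum_of_squares(current, proto)
              for label, proto in prototypes.items()}
    best = min(scores, key=scores.get)
    if scores[best] < tolerance:
        return best, scores[best]
    return None, scores[best]


# Hypothetical prototype values for illustration only (not measured data).
PROTOTYPES = {
    "fist": [2.0] * 10,
    "open": [0.0] * 10,
    "point": [0.0, 0.0] + [2.0] * 8,
}

# A slightly imperfect pointing gesture still matches "point".
noisy_point = [0.1, 0.0, 1.9, 2.0, 2.0, 1.8, 2.0, 2.0, 1.9, 2.0]
print(recognize(noisy_point, PROTOTYPES))

# A hand halfway between gestures is far from every prototype: no match.
between_gestures = [1.0] * 10
print(recognize(between_gestures, PROTOTYPES))
```

Each known gesture costs 10 subtractions, 10 multiplications, and 9 additions in this sketch, which is consistent with the "several dozen" operations per gesture noted above.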
A desirable minimum threshold for VR is about 10 updated "frames" per second -- so the calculations should take no longer than 1/10 second, including any time needed to get current DataGlove finger position values. The software developed for the experiment described below was able to operate within these limits. However, it will be desirable to add a movement tracker, as well as finger position measurement, to make the transition beyond the manual alphabet to ASL proper. As a standard Polhemus tracking device produces only about six new measurements per second, this addition may produce a performance limitation. The use of tracking devices with faster update rates can help to eliminate this problem.

The sum of squares approach described above is well suited for comparing "slices" of time -- comparing one set of DataGlove values to another, where the sets are the same size. However, ASL signs take place across time, which means that a program capable of recognizing ASL during actual use would need to match a set of DataGlove scores representing a sign (say, sampled 60 times per second) with an ongoing stream of incoming data. The sum of squares method is suitable for this: the known gestures would be stored in an array of DataGlove values, and compared to an array of the same size taken from the most recent input values. Implementation is not so straightforward as for the static measures, especially for guarding against false matches as would likely occur when the start of one gesture matches all of another.

Other forms of gesture recognition could be used. One of the most promising may be a neural network approach (cf. Rumelhart & McClelland, 1986). Benefits of this approach may be anticipated based on success in other applications, and may include automatic identification of the critical features for discriminating among similar gestures (e.g., the position of the index finger for a pointing gesture).
However, neural networks do not decrease the difficulties of dealing with gestures across time, and they would almost certainly not be as fast as the sum of squares approach. More importantly, adding or deleting a gesture might require retraining the entire network (something that takes at least several seconds in most neural network implementations).

4. Experiments on the Automatic Recognition of Gestures

Results are presented for Experiment 1. Experiment 2 is pending. The goal of these experiments is to assess the effectiveness of the statistical approach to gesture recognition described above for signs that do not occur across time. The method is limited to evaluating the ability of software employing the sum of squares method for recognition of the 26 letters of the manual alphabet of ASL, and does not enter the domain of the moving gestures employed for regular ASL speech.

4.1. Setting

The Virtual Reality laboratory of the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.

4.2. Equipment

The input device used was a VPL DataGlove model II. The glove had two sensors to measure finger bend on each finger, one over the knuckle and the other over the middle joint of the finger (no sensors measured the palm, wrist, or last finger joint positions).

4.3. Software

A Silicon Graphics SkyWriter(tm) graphics supercomputer received input from the DataGlove and performed all calculations. The SkyWriter uses a flavor of the UNIX(tm) operating system. Gesture recognition software was written in the C programming language (an earlier version was written in FORTRAN) [5]. Additional software for interaction with the DataGlove and VR devices was used by the gesture recognition code [6].

4.4. Subject

A native sign language speaker volunteered to assess the performance of the recognition software. She also taught ASL, and was fluent in spoken English [7].

4.5. Method for Experiment 1

The subject generated template gestures for the 26 letters of the ASL manual alphabet and the numbers 1 through 10. The software was used to match gestures made with those it knew about. Different values of tolerance for error could be selected so that the software would "guess" a gesture even though a perfect match was not obtained. "Tolerance" is a number, chosen at run time or pre-set, above which matches are not considered to be interesting. For this research, a match with a sum of squares score over 0.75 was frequently spurious -- e.g., the software would identify a "match" when the user was between gestures. Excessively low tolerance values, below 0.20, would result in the user needing to form every gesture very carefully. Tolerance values around 0.50 yielded good results.

4.6. Method for Experiment 2

This experiment has not yet been carried out. Instead of gestures, words will be spelled with the manual alphabet. Unlike the gestures investigated in Experiment 1, these words occur across time. Rather than comparing a given gesture to all known gestures, an array of current and past gestures will be compared to known arrays.

5. Results

Experiment 1 demonstrated the promise of the statistical approach to gesture recognition for gestures that do not occur across time. Of the 36 gestures "taught" to the software, only a few were not recognized reliably and unambiguously. "I" and "J" were confused, as they are identical hand positions but "J" moves. "Z" and "1" are also identical, except that "Z" moves and has a different orientation. This is no surprise since the method described here was only applied to time slices -- single measurements -- not across time. So, "Z" and "1" would be expected to be confused, because both have the same finger position.
The addition of the capability to measure hand orientation and take measurements across time will provide data to disambiguate these problematic pairs, but will introduce other problems discussed below.

Performance of the software was adequate for real-time recognition, suitable for use in VR applications. However, it was only barely fast enough for use by native ASL speakers, who use the manual alphabet at a rate of about ten letters per second. The software could be optimized somewhat, and a dedicated processor used, to remove some of the lag. There will necessarily be some delay for obtaining data from the DataGlove, though, and the methods for fine-tuning the algorithm (below) will result in extra time being taken to disambiguate similar gestures. It seems likely that this approach will require the native ASL speaker to sign somewhat more slowly than normal for optimum recognition using these methods, given current technology.

6. Enhanced Methods for Gesture Recognition

The sum of squares method used here works very well for static gestures which have distinguishing features in finger position only. It may be expected that the addition of a positional indicator will increase the performance for gestures in which finger positions are similar but the orientation is different, even without requiring across-time measurement. A complication caused by current VR equipment is that the scales on the Polhemus tracker and VPL DataGlove are not the same, so the numbers from one would need to be scaled to match the range of the other before being used in sum of squares calculations such as those demonstrated here (e.g., the Polhemus tracker can produce numbers ranging from 0-180 degrees, while the DataGlove values as used here range from 0-2 per finger). The library of computer functions developed for this research may be incorporated into current VR applications with little difficulty, or the simple sum of squares algorithm may be written from scratch.
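The scaling step mentioned above can be sketched as a simple linear rescaling applied before the sum of squares is computed. This is an illustration under stated assumptions, not part of the software described here: the function names are hypothetical, the tracker is assumed to report three orientation angles each in the 0-180 degree range, and the example values are invented.

```python
from typing import List, Sequence


def rescale(value: float, src_min: float, src_max: float,
            dst_min: float = 0.0, dst_max: float = 2.0) -> float:
    """Linearly map value from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max == src_min:
        raise ValueError("source range must not be empty")
    fraction = (value - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


def combined_features(fingers: Sequence[float],
                      angles_deg: Sequence[float]) -> List[float]:
    """Append orientation angles, rescaled from 0-180 degrees to 0-2,
    to the calibrated finger values."""
    return list(fingers) + [rescale(a, 0.0, 180.0) for a in angles_deg]


def sum_of_squares(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length value sets."""
    if len(a) != len(b):
        raise ValueError("value sets must be the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical data: identical finger positions, different hand orientation.
fingers = [0.0, 0.0] + [2.0] * 8
upright = combined_features(fingers, [90.0, 0.0, 0.0])
tilted = combined_features(fingers, [0.0, 0.0, 0.0])

# Only the first orientation angle differs, by 90 degrees (1.0 after scaling).
print(sum_of_squares(upright, tilted))
```

Without rescaling, the same 90-degree difference would contribute 8100 to the score, swamping finger differences that contribute at most 4 per sensor. Whether orientation should weigh exactly as much as a finger sensor is a design choice that this sketch does not settle.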
Enhancements to the sum of squares approach will take two directions. First is the identification of relations among gestures, so that similar gestures may be distinguished. Second is the implementation of the method for gestures that occur across time.

Knowledge about relations among gestures can help augment the monolithic sum of squares statistic generated by the method described above. The basic approach is to assess a similarity score for each known gesture to the current gesture. In practice, though, some gestures are similar. Additionally, the input gesture may be ill-formed and produce measurements that diverge from the prototype "known" measures. In this case, a second pass at data analysis may focus on the factors that distinguish pairs of known gestures from each other. A simple method would be to create a set of between-gestures difference scores for the two candidate gestures for each matched finger value. Then, each difference score between the candidate gestures and the current gesture would be multiplied by the between-gestures score for that finger value, thus inflating scores that helped to distinguish between that pair of candidate gestures. This would need to be completed for every pair of candidate gestures that fell under the threshold for error -- 3 candidates would require 3 passes, 4 candidates would require 6 passes, etc.

A more sophisticated analysis, but one that may be more computationally efficient, would be to investigate the most common factors in the set of finger scores. Figure 1 shows the outcome of one such analysis, in which similarities among letters of the manual alphabet were assessed through a statistical procedure called principal components analysis. The multidimensional space that emerges from such an analysis rates the most important dimensions in the data (which are akin to factors from another statistical procedure, factor analysis).
Then, only the relative placement of known gestures on the most important dimensions needs to be taken into account, rather than all possible combinations as with the method described in the previous paragraph. For a set of only a few dozen gestures the effort involved in completing the principal components or other analysis (which does not occur in real time) would probably not result in greatly increased performance. However, significant computational time may be saved by eliminating, say, half of all comparisons when hundreds of gestures are involved.

7. Conclusion

Future work on gesture recognition will necessarily go into far more complicated environments than those described in this work. For example, the use of two hands and the tracking of head and body movements are necessary components for effective human communication with ASL, and have direct applications for virtual environments. The simple sum of squares statistic demonstrated here may be easily used in VR applications that require more than a few gestures. Such applications include teleoperation and telerobotics (e.g., for keying simulated buttons or keys), navigation through complex data structures (e.g., a visual environment where the user must control acceleration, direction, selection of objects in the environment, signals for different views on the environment, and levels for items such as color intensity and sound intensity), and other environments in which numerous input sequences are needed. Currently, the author and his colleagues are successfully using the methods described here for gesture recognition for all applications that involve discriminating more than two or three gestures, and incorporating the methods into previously existing applications.

NOTES

1. Norman and Draper (1986), among others, have argued against the need for humans to translate their intentions to conform to limitations of the computer interface.
The best interfaces, though, can only go part way: people may be able to use a mouse to move a pointer and select (click on) an icon, but this is only slightly similar to reaching out and physically grabbing something.

2. "English" for the current purposes refers to "American Standard English." However, the comparison with ASL applies generally to other dialects.

3. Many instructional and descriptive texts for ASL are available. These include Riekehof (1961), Hoemann (1976), and Cokely (1980).

4. This is the most commonly used model of DataGlove. Other models include a third finger sensor and sensors for the web of the hand.

5. The gesture recognition code is available over the Internet via anonymous FTP to gpx.lis.uiuc.edu (128.174.4.40) as pub/VR/recog.tar.Z. The code assumes that the NASA and UNC driver programs are present (see below), but could be adjusted to take input in other forms. Consult the README file for implementation details and distribution limitations. This archive will be maintained at least through 1994.

6. The basic libraries for interaction with the DataGlove and other devices such as Polhemus trackers and head-mounted displays were obtained from NASA and the University of North Carolina (UNC) at Chapel Hill.

7. The volunteer subject was not deaf, but both her parents were.

8. The values displayed on the scatterplot were obtained through a principal components analysis of the set of 10 finger scores for each letter (analysis with SAS(tm) version 6.18). Note that this is only the first two dimensions (PRIN 1 and PRIN 2) of a 10 dimensional space, so relations among particular items may not be faithfully represented on these two axes. The first 2 dimensions accounted for about 18% of the variance in the set of finger data overall (all 10 dimensions account for 100% of the variance, by definition).

REFERENCES

Cokely, Dennis. (1980). American Sign Language: A Student Text. Silver Spring, Maryland: T.J. Publishers.

Hoemann, Harry W. (1976). American Sign Language: Lexical and Grammatical Notes with Translation Exercises. Silver Spring, Maryland: National Association for the Deaf.

Newby, Gregory B. (1993). "Virtual Reality." In: Williams, Martha E. (Ed.). Annual Review of Information Science and Technology 28. Medford, New Jersey: Learned Information. 187-217.

Norman, Donald A.; Draper, Stephen W. (1986). User-Centered System Design: New Perspectives on Human-Computer Interaction. Hillsdale, New Jersey: L. Erlbaum Associates.

Riekehof, Lottie L. (1961). The American Sign Language: A Manual of the Signs Most Commonly Used by the Deaf of North America. Indianapolis: Shaneyfelt.

Rumelhart, David E.; McClelland, James L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, Massachusetts: MIT Press.

Tukey, John W. (1977). Exploratory Data Analysis. Reading, Massachusetts: Addison-Wesley.

Figure 1: Scatterplot of the statistical relations among finger values for the 26 letters of the manual alphabet [8]. Vertical axis: PRIN 1 (3.00 to -6.00). Horizontal axis: PRIN 2 (-2.50 to 2.00). Horizontal positions are not preserved in this copy; from highest to lowest PRIN 1 the plotted letters are: l; p; q; g; k, z, t, n; y; r, s; h, u, m, x; v, i; a; d; o; w; e; c; b; f.