Office Activity
HMMs were successfully used for activity recognition in an office environment in [1] and [2], through the use of Layered Hidden Markov Models (see LHMMs for more information). At the lowest level of the hierarchy, three kinds of sensors capture observations from the environment:
1. Binaural microphones (for sound classification)
2. USB camera (for video signals)
3. Keyboard and mouse (for capturing activity with computers)
Their software architecture is a two-level cascade of HMMs organized into three processing layers. The lowest layer captures video, audio, and keyboard and mouse activity, and computes the feature vectors associated with each of these signals. The middle layer includes two banks of HMMs that classify the audio and video feature vectors. On the audio side, the classes are human speech, music, silence, ambient noise, phone ringing, and the sound of keyboard typing. The video signals are classified by another set of HMMs that implement a person detector: at this level the system detects whether nobody, one person, one active person, or multiple people are present in the office.
The inferential results from this layer (i.e., the outputs of the audio and video classifiers) and the history of keyboard and mouse activity constitute a feature vector that is passed to the third and highest layer of analysis. This layer handles concepts with a longer temporal extent. Its classification output is an activity such as Phone Conversation, Face Conversation, Presentation, Other Activity, Nobody Around, or Distant Conversation.
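The cascade described above can be sketched in a few lines. This is only a toy illustration, not the SEER implementation: the binary observation symbols, the model parameters, the helper names (`two_state`, `peaked`, `top_symbol`, `recognize`), and the choice to pass hard class labels (rather than full distributions) to the top layer are all assumptions made for the example. Only two classes per bank and two top-level activities are modeled.

```python
import math

def _log(p):
    return math.log(p) if p > 0 else -math.inf

def _logsumexp(xs):
    m = max(xs)
    return m if m == -math.inf else m + math.log(sum(math.exp(x - m) for x in xs))

def log_likelihood(obs, hmm):
    """Forward algorithm in log space for a discrete HMM (start, trans, emit)."""
    start, trans, emit = hmm
    n = len(start)
    alpha = [_log(start[i]) + _log(emit[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            _logsumexp([alpha[i] + _log(trans[i][j]) for i in range(n)]) + _log(emit[j][o])
            for j in range(n)
        ]
    return _logsumexp(alpha)

def classify(obs, bank):
    """Pick the label whose HMM assigns the highest likelihood to obs."""
    return max(bank, key=lambda label: log_likelihood(obs, bank[label]))

def two_state(p_one_a, p_one_b):
    """Toy 2-state HMM over binary symbols; p_one_* = P(symbol 1 | state)."""
    return ([0.5, 0.5],
            [[0.9, 0.1], [0.1, 0.9]],
            [[1 - p_one_a, p_one_a], [1 - p_one_b, p_one_b]])

def peaked(symbol, n_symbols=8, p=0.79):
    """Toy 1-state HMM that emits `symbol` with probability p."""
    rest = (1 - p) / (n_symbols - 1)
    return ([1.0], [[1.0]], [[p if k == symbol else rest for k in range(n_symbols)]])

# Middle layer: one bank of HMMs per modality (hypothetical parameters).
AUDIO = ["silence", "speech"]
VIDEO = ["nobody", "one person"]
audio_bank = {"silence": two_state(0.1, 0.2), "speech": two_state(0.8, 0.9)}
video_bank = {"nobody": two_state(0.1, 0.2), "one person": two_state(0.8, 0.9)}

def top_symbol(audio_label, video_label, keyboard_active):
    """Encode the middle-layer outputs plus keyboard activity as one symbol."""
    return AUDIO.index(audio_label) * 4 + VIDEO.index(video_label) * 2 + int(keyboard_active)

# Top layer: another bank of HMMs over the encoded symbols.
top_bank = {
    "Nobody Around": peaked(top_symbol("silence", "nobody", False)),
    "Phone Conversation": peaked(top_symbol("speech", "one person", False)),
}

def recognize(windows):
    """windows: list of (audio_obs, video_obs, keyboard_active) per time window."""
    symbols = [
        top_symbol(classify(a, audio_bank), classify(v, video_bank), k)
        for a, v, k in windows
    ]
    return classify(symbols, top_bank)

windows = [([1, 1, 0, 1], [1, 1, 1, 1], False)] * 3
print(recognize(windows))  # Phone Conversation
```

Because the top layer only sees the labels produced by the middle layer, it works on a coarser time scale and does not depend directly on the raw audio and video features.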
The results show that, for the same amount of training data, the accuracy of the LHMM (about 99%) is significantly higher than that of a single HMM. Moreover, LHMMs are more robust to changes in the environment: HMMs need to be tuned to a particular testing environment, while at least the highest level of the LHMM did not require retraining despite changes in office conditions. Finally, the discriminative power of the LHMM is notably higher than that of the HMM; that is, the distance between the log-likelihoods of the two most likely models is larger for the LHMM than for the simple HMM classifier, whose smaller margin makes it prone to instability and classification errors.
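The notion of discriminative power used here, the gap between the log-likelihoods of the two most likely models, can be written as a small helper. The function name `likelihood_margin` and the scores below are made up for illustration and are not taken from [1] or [2].

```python
def likelihood_margin(log_likelihoods):
    """Gap between the best and second-best model log-likelihoods."""
    ranked = sorted(log_likelihoods.values(), reverse=True)
    return ranked[0] - ranked[1]

# Made-up scores, for illustration only.
confident = {"Phone Conversation": -10.0, "Presentation": -25.0, "Other Activity": -40.0}
ambiguous = {"Phone Conversation": -10.0, "Presentation": -10.5, "Other Activity": -40.0}

print(likelihood_margin(confident))  # 15.0
print(likelihood_margin(ambiguous))  # 0.5
```

A small margin means that a slight perturbation of the observations can change which model wins, which is why a classifier with small margins tends to be unstable.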

Figure 1. Architecture of the LHMM activity recognition software (SEER) (source: [3]).
In [4], the authors used LHMMs to recognize actions in an office environment. The main contribution of this paper is to demonstrate the applicability of LHMMs to the activity recognition problem when a person's actions (e.g., putting a stamp), rather than their "state" or "situation" (e.g., a phone conversation), have to be detected. Recognizing abstract actions is a challenging problem, since the order of the series of events is of great importance. To achieve this goal, the structure of human action must be decomposed. The application of LHMMs becomes feasible by exploiting two typical characteristics of human activity:
1. Hierarchical and chronological structure of activity
2. Distribution of activity to multiple cooperative agents
Figure 2. a) Hierarchical structure of activity. b) Distribution of activity to multiple cooperative agents.
In effect, the authors combine a Hierarchical HMM (HHMM) and a Coupled HMM (CHMM) to estimate actions in the office environment, and solve the resulting model with an LHMM. That is, instead of using the standard algorithms for these models, they use a general approach. The idea seems to work well unless the resulting hierarchical model has overlapping branches, which, as explained below, is the case in their specific problem.
Their model consists of three layers. The first layer recognizes movements of the agents (i.e., the left hand and the right hand) from one place to another, for example from the phone to the pencil holder. The input to this layer is raw position and velocity data for the hands, obtained from video observation. Its output is the Primitive Motions (PMs) of the agents. These PMs are the inputs to the second layer, which recognizes the Abstract Motions (AMs). The AMs include Pick up phone, Put stamp, Take pen, and Approach screen, for the left and right hand. The result of this layer is sent to the third layer, the Agent Inference Integration layer. In this layer the model distinguishes between two abstract motions: Switching the screen on/off and Adjusting the monitor (see Figure 3).
Figure 3. Three-layer model.
Note that in the third layer the observation sequences for the two actions overlap. The HMM at this level cannot distinguish between Switching the screen on/off and Adjusting the monitor, since in both cases it observes the same abstract motion, namely Approach screen by the right agent (it is assumed that the monitor can be switched on/off only by the right hand). Unfortunately, the result for this motion is deliberately omitted from the paper, which suggests that this analysis is correct. Although this approach may have difficulties with HHMMs in general, it might be a good solution for generalizing non-overlapping HHMMs and CHMMs.
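The overlap argument can be illustrated with a toy experiment. The sketch below assumes, as argued above, that both actions present the third layer with the same abstract motion sequence. It uses one-state HMMs fitted by simple counting, which is a simplification and not the training procedure of [4]; the symbol names and the helper `fit_emissions` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical symbol names for the Abstract Motions reported by the second layer.
AM_SYMBOLS = ["pick_up_phone", "put_stamp", "take_pen", "approach_screen"]

def fit_emissions(sequences, smoothing=1.0):
    """Estimate smoothed emission probabilities of a one-state HMM by counting."""
    counts = Counter(symbol for seq in sequences for symbol in seq)
    total = sum(counts[s] + smoothing for s in AM_SYMBOLS)
    return {s: (counts[s] + smoothing) / total for s in AM_SYMBOLS}

def log_likelihood(sequence, emissions):
    return sum(math.log(emissions[s]) for s in sequence)

# Assumption from the text: both actions look the same to the third layer.
switch_data = [["approach_screen"] * 3 for _ in range(5)]
adjust_data = [["approach_screen"] * 3 for _ in range(5)]

switch_model = fit_emissions(switch_data)
adjust_model = fit_emissions(adjust_data)

test_sequence = ["approach_screen"] * 3
margin = (log_likelihood(test_sequence, switch_model)
          - log_likelihood(test_sequence, adjust_model))
print(margin)  # 0.0
```

Models trained on identical observations end up identical, so the margin between them is zero and any decision between the two actions is arbitrary.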
References
- Oliver, N., E. Horvitz, and A. Garg, "Layered Representations for Human Activity Recognition", In Fourth IEEE Int. Conf. on Multimodal Interfaces, pp. 3–8, 2002.
- Oliver, N., A. Garg, and E. Horvitz, "Layered representations for learning and inferring office activity from multiple sensory channels", Comput. Vis. Image Underst., vol. 96, no. 2, New York, NY, USA, Elsevier Science Inc., pp. 163–180, 2004.
- Oliver, N., A. Garg, and E. Horvitz, "Layered representations for learning and inferring office activity from multiple sensory channels", Comput. Vis. Image Underst., vol. 96, no. 2, New York, NY, USA, Elsevier Science Inc., pp. 163–180, 2004.
- Perdikis, S., D. Tzovaras, and M. G. Strintzis, "Recognition of Human Activities Using Layered Hidden Markov Models", CIP 2008 Workshop, 2008.