Contents
To provide a tool for audio understanding, the first step is to classify an audio clip, or to segment an audio stream, into several distinguishable classes. In our project, six audio classes are considered: pure speech, speech over music, speech over noise, pure music, background sound, and pause/silence. Some new features are proposed and used, and several methods, including approaches based on heuristic rules and on SVM, GMM, and KNN classifiers, are employed and compared.

To understand more subtle structure, we further analyze the speech stream in two directions. The first is to segment a speech stream at the semantic level, for example sentence segmentation without speech recognition. The other is segmentation based on speaker identity. We propose a new approach to real-time speaker change detection and speaker tracking in which both the speaker identities and the number of speakers are assumed unknown. This differs from most previously published work on speaker segmentation, which was designed either for off-line processing without prior knowledge or for real-time processing with prior knowledge and supervised training. Our algorithm works in real time and requires no prior knowledge.

Similarity measurement is a fundamental step in content-based audio analysis tasks such as audio classification and audio retrieval. In most current audio analysis systems, the similarity measure is based on statistical characteristics of the temporal and spectral features of each frame; statistics such as the mean, standard deviation, or covariance are used to describe the properties of an audio clip. These statistical features have proved effective in many previous works. However, they capture only the feature variation averaged over time, and ignore the detailed state at each time point and the trend of each feature's variation. In this project, feature structure patterns are proposed to improve the similarity measurement.
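The limitation of clip-level statistics noted above can be illustrated with a small sketch. It compares a synthetic fade-in with the same samples played backwards: the mean and standard deviation of the frame energies are practically identical, even though the energy trend is reversed. The clips, the frame length, and all function names are assumptions made for illustration. The point-by-point envelope comparison at the end is a plain Euclidean distance, not the structure patterns proposed in this project.

```python
import math
import statistics


def frame_energies(samples, frame_len):
    """Short-time energy of consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def clip_statistics(frame_features):
    """Clip-level descriptor: (mean, standard deviation) of a frame feature."""
    return (statistics.fmean(frame_features), statistics.pstdev(frame_features))


# Hypothetical clip whose amplitude rises over time (a fade-in).
fade_in = [(n / 400) * math.sin(2 * math.pi * 0.05 * n) for n in range(400)]
# The same samples in reverse order: a fade-out with the opposite trend.
fade_out = fade_in[::-1]

e_in = frame_energies(fade_in, 40)
e_out = frame_energies(fade_out, 40)

# Distance between the two clip-level descriptors (close to zero).
stat_distance = math.dist(clip_statistics(e_in), clip_statistics(e_out))

# Distance between the two energy envelopes, frame by frame (clearly non-zero).
envelope_distance = math.dist(e_in, e_out)

print(f"descriptor distance: {stat_distance:.6f}")
print(f"envelope distance:   {envelope_distance:.6f}")
```

Because the mean and standard deviation do not depend on frame order, any reordering of the frames yields the same descriptor; a representation that keeps the time axis is needed to tell the two clips apart.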
A feature structure pattern is a representative pattern that describes the structural characteristics of both temporal and spectral features. Three structure patterns are proposed in our current work: the energy envelope pattern, the pitch contour pattern, and the harmonic pattern.

In music understanding, we classify music and songs into several pre-defined classes based on genre and mood. Five genres are considered: two classical genres (Baroque and Romantic), Pop, Rock, and Jazz. Six moods are employed: Sober, Gloomy, Lyrical, Joyous, Restless, and Majestic. Timbre, intensity, and rhythm features are extracted and used. A new timbre feature, octave-based spectral contrast, is used, which proved better than conventional MFCC features. In our music project, beat tracking is also implemented for popular songs, and several beat types, such as bass and snare, are identified.

Similar to image and video thumbnails, a music snippet (music thumbnail) can be used for fast browsing of a large number of music files. A music snippet is usually part of the repeated melody, main theme, or chorus. In this project, the most salient segment of the music is first detected based on its occurrence frequency and energy information. Meanwhile, the boundaries of musical phrases are detected based on the estimated phrase length and the phrase boundary confidence of each frame. These boundaries are used to ensure that an extracted snippet does not break musical phrases. Finally, the musical phrases that include the most salient segment are extracted as the music snippet.

In this project, we allow the user to query a music database by singing or whistling a desired tune. This work relies on pitch extraction algorithms to produce a note-like representation of the pitch contour, which can be used to query a database. However, extracting score-like attributes from the waveform of even the simplest pieces is very difficult.
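As a rough illustration of a note-like pitch contour used as a query, the sketch below turns frame-level pitch estimates into MIDI note numbers, reduces them to an up/down/same string, and ranks database entries by edit distance. This is a common simplification and is not claimed to be the representation or matching method used in this project. The frame pitches, the two-entry database, and all function names are hypothetical.

```python
import math


def hz_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def to_notes(frame_pitches_hz):
    """Collapse frame-level pitches into notes; None marks an unvoiced frame."""
    notes = []
    prev = None  # note of the previous frame, None after an unvoiced frame
    for f in frame_pitches_hz:
        cur = None if f is None else hz_to_midi(f)
        if cur is not None and cur != prev:
            notes.append(cur)  # a new note starts here
        prev = cur
    return notes


def contour(notes):
    """Up (U), down (D), or same (S) between successive notes."""
    return "".join(
        "U" if b > a else "D" if b < a else "S"
        for a, b in zip(notes, notes[1:])
    )


def edit_distance(s, t):
    """Levenshtein distance; tolerates inserted, dropped, or wrong symbols."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


# Hypothetical pitch track of a hummed tune, in Hz, one value per frame.
hummed = [262, 262, None, 262, 392, 392, None, 392, 440, None, 440, 392]

# Hypothetical database of melodies stored as MIDI note numbers.
database = {
    "twinkle": [60, 60, 67, 67, 69, 69, 67],
    "ode": [64, 64, 65, 67, 67, 65, 64, 62],
}

query = contour(to_notes(hummed))
best = min(database, key=lambda title: edit_distance(query, contour(database[title])))
print(query, best)
```

Because only the direction of each interval is kept, the query does not depend on the key in which the user hums, at the cost of discarding interval sizes and note durations.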
We finesse this problem by using archives of MIDI files, which are score-like representations of music. To adapt to people's humming habits and tolerate inevitable humming errors, a triplet melody representation and a new hierarchical matching method are employed. Future work will focus on acoustic music.

Audio texture is a new audio medium. It provides an efficient means of synthesizing a continuous, perceptually meaningful audio stream from an example audio clip. It is perceptually meaningful in the sense that the synthesized audio stream is perceptually similar to the input example clip. An audio texture is not just a simple repetition of the audio patterns contained in the input; it can be composed of variations of the original patterns to give a more vivid stream. The audio stream can be of arbitrary length, as needed. Audio textures can be used in many applications, such as lullabies, game music, background music, and other effects.

We also extend this idea to audio texture restoration, or constrained audio texture synthesis, for restoring the missing part of an audio clip. It is useful in many applications, such as error concealment for audio/music delivery with packet loss on the Internet. Compared with most traditional error concealment methods, which deal only with short errors (typically around 20 ms or several packets), our method can restore audio with a much longer loss, such as one second or more.

In this project, an attention model is established to measure the relative importance, or attended level, of each part of an audio stream. The model is based on low-level features such as energy and on perceptually meaningful highlight sound effects such as applause, laughter, and cheers. It is helpful for many applications, including video summarization.

Last Update: June 1, 2003