A central topic in spoken-language-systems research is speaker diarization: computationally determining how many speakers feature in a recording and which of them speaks when. Speaker diarization would be an essential function of any program that automatically annotated audio or video recordings.
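To make "who speaks when" concrete, a diarization result can be pictured as a list of time segments, each labeled with a speaker. The segment times and labels below are made up for illustration, not taken from any real system:

```python
# Illustrative only: a diarization output as "who speaks when" segments,
# each a (start_seconds, end_seconds, speaker_label) tuple.
segments = [
    (0.0, 4.2, "spk1"),
    (4.2, 7.5, "spk2"),
    (7.5, 9.0, "spk1"),
]

# How many distinct speakers feature in the recording?
speakers = {label for _, _, label in segments}

# Total talk time per speaker.
talk_time = {}
for start, end, label in segments:
    talk_time[label] = talk_time.get(label, 0.0) + (end - start)

print(len(speakers))   # number of distinct speakers
print(talk_time)       # seconds of speech attributed to each
```

An automatic annotation tool would attach such segment labels to a recording's timeline.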
To date, the best diarization systems have relied on supervised machine learning: They're trained on sample recordings that a human has indexed, indicating which speaker enters when. In the October issue of IEEE Transactions on Audio, Speech, and Language Processing, however, MIT researchers describe a new speaker-diarization system that achieves comparable results without supervision: No prior indexing is necessary. Stephen Shum, a graduate student in MIT's Department of Electrical Engineering and Computer Science, is lead author on the new paper.
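The unsupervised idea can be sketched in miniature. This is not the MIT system's actual algorithm, just a toy illustration of the principle: represent each speech segment as an embedding vector, then group segments by similarity with no labeled training data, so the number of clusters (speakers) emerges from the recording itself. The embeddings and the distance threshold here are invented for the example:

```python
# Illustrative sketch, not the published method: greedy unsupervised
# clustering of per-segment speaker embeddings (toy 2-D vectors here).
import math

def cluster(embeddings, threshold=1.0):
    """Assign each embedding to the nearest existing cluster centroid,
    or start a new cluster if none is within `threshold`."""
    centroids, counts, labels = [], [], []
    for e in embeddings:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = math.dist(e, c)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            centroids.append(list(e))   # new speaker cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1           # update running mean centroid
            centroids[best] = [(c * (counts[best] - 1) + x) / counts[best]
                               for c, x in zip(centroids[best], e)]
            labels.append(best)
    return labels

# Toy embeddings: two well-separated "speakers"
embs = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (0.05, 0.0)]
print(cluster(embs))  # → [0, 0, 1, 1, 0]
```

A supervised system would instead learn from human-indexed recordings; the appeal of the unsupervised route is that no such indexing is needed.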
Continue reading the article on MIT News.