MISP-Meeting Corpus
Introduction
The MISP-Meeting corpus focuses on the meeting scenario: 4–8 attendees sit around an 8-microphone array and a panoramic camera, placed adjacent to each other on the table in a standard meeting room, and hold a natural conversation. Each participant also wears a headset microphone synchronized with a Zoom F8N recorder so that all devices share a common clock. This recording setup yields a wealth of audio-visual data, including near-field mono speech for each speaker (*-F8N.zip), far-field 8-channel speech (*-CSOBx3.zip), and 360-degree panoramic video (*-PSCx3.zip). Notably, the far-field 8-channel speech captures not only each participant's spoken contributions but also a rich variety of background sounds, such as clicking, keyboard typing, doors opening and closing, and fan noise. In contrast, the near-field mono speech largely suppresses interference from unwanted sources while maintaining a signal-to-noise ratio (SNR) above 15 dB. The panoramic camera captures the entire meeting room, including each participant's facial expressions, body movements, and visual focus of attention, providing a rich source of multimodal cues for analysis.
| Set               | Train    | Dev     | Eval    | Total    |
| ----------------- | -------- | ------- | ------- | -------- |
| Duration (h)      | 118.80   | 3.24    | 3.11    | 125.15   |
| Rooms             | 15       | 4       | 4       | 23       |
| Participants      | 233      | 25      | 28      | 286      |
| – Male            | 115      | 13      | 14      | 142      |
| – Female          | 118      | 12      | 14      | 144      |
| Avg. Duration (s) | 2832.37  | 1943.51 | 1863.23 | 2763.98  |
| Avg. Length       | 13085.13 | 7557.83 | 7624.00 | 12680.65 |
| Avg. Turns        | 463.33   | 118.83  | 321.67  | 445.43   |
| Avg. Speakers     | 5.57     | 5.00    | 5.50    | 5.55     |
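A quick way to inspect the far-field recordings is to de-interleave the 8 channels. Below is a minimal sketch using only Python's standard library; the 16-bit PCM assumption and the synthetic demo file are illustrative, so check the actual WAV headers of the released archives before relying on them:

```python
import os
import struct
import tempfile
import wave

def split_channels(path):
    """Read a multi-channel 16-bit PCM WAV and return one list of samples per channel."""
    with wave.open(path, "rb") as w:
        n_ch, n_frames = w.getnchannels(), w.getnframes()
        raw = w.readframes(n_frames)
    # Samples are interleaved frame by frame: ch0, ch1, ..., ch7, ch0, ...
    samples = struct.unpack("<%dh" % (n_ch * n_frames), raw)
    return [list(samples[c::n_ch]) for c in range(n_ch)]

# Create a tiny synthetic 8-channel file to demonstrate the round trip;
# channel c carries the constant value c in every frame.
tmp = os.path.join(tempfile.mkdtemp(), "demo_8ch.wav")
with wave.open(tmp, "wb") as w:
    w.setnchannels(8)
    w.setsampwidth(2)       # 16-bit PCM
    w.setframerate(16000)
    frames = [c for _ in range(4) for c in range(8)]
    w.writeframes(struct.pack("<%dh" % len(frames), *frames))

channels = split_channels(tmp)
```

Replacing the synthetic file with a real far-field WAV extracted from a *-CSOBx3.zip archive yields one mono signal per microphone.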

If you find this corpus useful in your research, please cite the following papers:

MISP-Meeting: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization

@inproceedings{chen2025misp,
  title     = "{MISP-Meeting}: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization",
  author    = "Chen, Hang and Yang, Chao-Han Huck and Gu, Jia-Chen and Siniscalchi, Sabato Marco and Du, Jun",
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year      = "2025",
  publisher = "Association for Computational Linguistics",
  pages     = "1--14"
}

Downloads
This dataset is available under license. By using the corpus, you agree to the terms of this license. If you do not agree to this license, then you do not have any rights to use the corpus, and you must immediately cease using it.
| File                       | Size     | MD5 Checksum                     |
| -------------------------- | -------- | -------------------------------- |
| training-CSOBx3.zip        | 151 G    | 9cdb120a91c6e0d7dbc90e0fa1e3bd9a |
| training-F8N.zip           | 46.2 G   | 222e288a6f8c28351787a135adfded78 |
| training-PSCx3.1.zip       | 325 G    | ac234e7de8350cb6435edaa02ccd87e2 |
| training-PSCx3.2.zip       | 318 G    | 11d3e0369a05341f330c100f98c308e2 |
| training-PSCx3.3.zip       | 289 G    | 6b5c4de9c7b461ae14896e15fca04b42 |
| training-PSCx3.4.zip       | 313 G    | ac234e7de8350cb6435edaa02ccd87e2 |
| training-PSCx3.5.zip       | 307 G    | 59c47d06ba7f1947d27e6921c8e557df |
| training-PSCx3.6.zip       | 283 G    | 83d7001e5061ffea8c321cf00bf52ef7 |
| training-PSCx3.7.zip       | 301 G    | 97c68815cfc234c1255abe800e03edda |
| training-PSCx3.8.zip       | 302 G    | 6414de042e70db249d7da7a7b6468700 |
| dev-CSOBx3.zip             | 4.27 G   | c7f51ae471fffec39aefa22ff4fe4e33 |
| dev-F8N.zip                | 1.26 G   | 05fb159399f723e3803731c1478d70dc |
| dev-PSCx3.1.zip            | 31.9 G   | d2e9741a65d3950ecf62f3235389e48b |
| dev-PSCx3.2.zip            | 31.9 G   | 2bbfd588eb5a2c58b440b201d01bf401 |
| eval-CSOBx3.zip            | 4.11 G   | 580e5a71411dfbe6f634d7630a20300c |
| eval-PSCx3.zip             | 71.56 G  | 148ea330c027ed590cb8ec4cbf17a05b |
| eval-F8N.zip               | 1.27 G   | 3a3291c63e1b3c16a1862e83f5244008 |
| training-Transcription.zip | 12.23 M  | 68e7be287ed38eedc7e4201bad2b3a30 |
| dev-Transcription.zip      | 448.93 K | 9f6e50958c8b3b59f469dca066e13b69 |
| meeting_summary.json       | 25.83 K  | 917b37316a36cd61bc02b9468ab28fc6 |
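After downloading, each archive can be verified against the MD5 column above. A minimal sketch that hashes in chunks, so multi-hundred-gigabyte zips never need to fit in memory (the demo file is a placeholder; substitute a real archive name from the table):

```python
import hashlib
import os
import tempfile

def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a tiny placeholder file; in practice call e.g.
# md5sum("dev-F8N.zip") and compare with the checksum in the table.
tmp = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(tmp, "wb") as f:
    f.write(b"hello")
digest = md5sum(tmp)
```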
Updated AVSR corpus of MISP2021 challenge
Introduction
The updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge is a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. It is the first distant multi-microphone conversational Chinese audio-visual corpus recorded in the home TV scenario, where several people chat in Chinese while watching TV and interacting with a smart speaker/TV in a living room. Based on the corpus presented in the MISP2021 challenge, this update corrects asynchronous samples in the training/development sets and adds more data to increase the diversity of the evaluation set. If you find this corpus useful in your research, please cite the following papers:
If you already downloaded the corpus during the MISP2021 challenge and want to update the corpus, please re-download the zip with [Update] and [New] tags.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
Updated AVWWS database of MISP2021 challenge
Introduction
The updated audio-visual wake word spotting (AVWWS) database of the MISP2021 challenge covers a range of scenarios, with audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, to create a shared, publicly available database for WWS. The wake word is "Xiao T Xiao T". A sample is labeled positive if it contains the wake word and negative otherwise; each sample contains at most one wake word. The data is divided into three subsets: Train, Development, and Evaluation, with the split enforcing speaker and room independence. Some noise data are also provided. To facilitate data transmission, the audio and video data are packed into compressed archives named according to their content; you can prepare the data directories by extracting the downloaded zip files. For more information about the directory structure, please refer to https://mispchallenge.github.io/mispchallenge2021/task1_data.html. Papers that make use of this database further need to cite [1] and [2].
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus. If you already downloaded the corpus during the MISP2021 challenge and want to update the corpus, please check the training set and development set according to the latest file list [Train_dev_file_list.zip], and re-download the evaluation set again.
Downloads
AVSD&AVDR corpus of MISP2022 challenge
Introduction
In the MISP 2021 challenge, we released a large multi-microphone conversational audio-visual corpus. In follow-up work, we resolved authorization and storage issues to fully release the updated AVWWS and AVSR corpora of the MISP2021 challenge to all researchers. For the MISP 2022 challenge, our training set is based on the updated MISP2021 AVSR corpus described in the first section and adds the RTTM/timestamp directories. The new development set is selected from the development and evaluation sets of the updated MISP2021 AVSR corpus. In addition, we will also release a new evaluation set, which shares no speakers with the other sets.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
AVTSE corpus of MISP 2023 challenge
Introduction
In the MISP 2021 challenge, we released a large multi-microphone conversational audio-visual corpus. In follow-up work, we resolved authorization and storage issues to fully release the updated AVWWS and AVSR corpora of the MISP 2021 challenge to all researchers. In the MISP 2022 challenge, we released a new development set and a new evaluation set. For the MISP 2023 challenge, we focus on the audio-visual target speaker extraction (AVTSE) task. Our training set is based on the updated MISP2021 AVSR corpus described in the first section, and the development set is based on the development set of the MISP 2022 AVSD&AVDR corpus. In addition, we will add the middle-field videos to the development set. For the evaluation set, we will add some new sessions focusing on female dialogue scenarios.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
—Training set
Please refer to the "Updated AVSR Corpus of MISP2021 Challenge" above for the download link of the training set. If you have already downloaded the training set, there is no need to repeat the download. We suggest that participants use near-field audio to simulate far-field audio to ensure complete alignment. We will provide a simulation solution in the baseline, and participants can also use different simulation methods or propose more innovative methods to solve this problem.
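One common way to realize the suggested simulation is to convolve the near-field signal with a room impulse response (RIR) and add noise. The sketch below uses a synthetic decaying-exponential RIR and white noise purely as placeholders; it is not the official baseline recipe, and toolkits that generate realistic RIRs (or measured RIRs) would normally replace them:

```python
import math
import random

def convolve(signal, rir):
    """Direct-form convolution: y[n] = sum_k x[k] * h[n - k]."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out

def simulate_far_field(near, rir, noise_level=0.01, seed=0):
    """Convolve near-field speech with an RIR and add white noise."""
    rng = random.Random(seed)
    reverberant = convolve(near, rir)
    return [s + rng.gauss(0.0, noise_level) for s in reverberant]

# Toy example: a 440 Hz tone as stand-in speech, a decaying exponential as a stand-in RIR.
near = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
rir = [0.9 ** k for k in range(32)]
far = simulate_far_field(near, rir)
```

Because the simulated far-field signal is derived directly from the near-field one, the two stay sample-aligned, which is the alignment property the suggestion above is after.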
—Development set
For the far-field audio and the transcription of the development set, please refer to the "AVSD&AVDR corpus of MISP2022 challenge" above for the download link. In addition, we provide the middle-field video and the lip detection results. The download links for these two are as follows.
Wake word lipreading corpus of ChatCLR challenge
Introduction
For the wake word lipreading task of the ChatCLR challenge, the training and development sets are based on the far-field videos of the updated MISP2021 AVWWS dataset. For the evaluation set, we will release new sessions that include words sharing similar lip shapes with the wake word, "Xiao T Xiao T", to increase the difficulty.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
—Training set & development set
Please refer to the far-field videos of the training and development sets in the "Updated AVWWS database of MISP2021 challenge" above for the download links. If you have already downloaded the training and development sets, there is no need to download them again.
Target speaker lipreading corpus of ChatCLR challenge
Introduction
For the target speaker lipreading task of the ChatCLR challenge, we utilize the far-field videos from the training and development sets of the MISP2021 AVSR dataset. Other official open-source video datasets can also be used. The development and evaluation sets contain 12 participants whose videos are also included in the training set. Each speaker has approximately 30 minutes of data: two-thirds of each person's data make up the development set, while the remainder makes up the evaluation set.
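The two-thirds/one-third per-speaker split described above can be sketched as follows. The utterance IDs and ordering are illustrative; the official file lists should be used where provided:

```python
from collections import defaultdict

def split_per_speaker(utterances, dev_fraction=2 / 3):
    """Split (speaker, utt_id) pairs so that the first dev_fraction of each
    speaker's utterances goes to dev and the remainder to eval."""
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)
    dev, evl = [], []
    for spk, utts in by_speaker.items():
        cut = round(len(utts) * dev_fraction)
        dev += [(spk, u) for u in utts[:cut]]
        evl += [(spk, u) for u in utts[cut:]]
    return dev, evl

# Toy example: two hypothetical speakers with 9 and 6 utterances each.
utts = [("S01", f"utt{i:02d}") for i in range(9)] + \
       [("S02", f"utt{i:02d}") for i in range(6)]
dev, evl = split_per_speaker(utts)
```

Splitting within each speaker, rather than across speakers, matches the setup above in which the same 12 participants appear in both the development and evaluation sets.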
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
—Training set
Please refer to the far-field videos of the training and development sets in the "Updated AVSR corpus of MISP2021 challenge" above for the download link. If you have already downloaded the training set, there is no need to download it again.