MISP-Meeting Corpus
Introduction
The MISP-Meeting corpus focuses on the meeting scenario: 4–8 attendees sit around an 8-microphone array and a panoramic camera, placed adjacent to each other on the table in a standard meeting room, and hold a natural conversation. Each participant also wears a headset microphone synchronized with a Zoom F8N recorder so that all devices share a common clock. This recording setup yields a wealth of audio-visual data, including near-field mono speech for each speaker (*-F8N.zip), far-field 8-channel speech (*-CSOBx3.zip), and 360-degree panoramic video (*-PSCx3.zip). Notably, the far-field 8-channel speech captures not only each participant's spoken contributions but also a rich variety of background sounds, such as clicking, keyboard typing, doors opening and closing, and fan noise. In contrast, the near-field mono speech largely suppresses interference from unwanted sources while maintaining a signal-to-noise ratio (SNR) above 15 dB. The panoramic camera captures the entire meeting room, including each participant's facial expressions, body movements, and visual focus of attention, providing a rich source of multimodal cues for analysis.
| Set               | Train    | Dev     | Eval    | Total    |
| ----------------- | -------- | ------- | ------- | -------- |
| Duration (h)      | 118.80   | 3.24    | 3.11    | 125.15   |
| Rooms             | 15       | 4       | 4       | 23       |
| Participants      | 233      | 25      | 28      | 286      |
| – Male            | 115      | 13      | 14      | 142      |
| – Female          | 118      | 12      | 14      | 144      |
| Avg. Duration (s) | 2832.37  | 1943.51 | 1863.23 | 2763.98  |
| Avg. Length       | 13085.13 | 7557.83 | 7624.00 | 12680.65 |
| Avg. Turns        | 463.33   | 118.83  | 321.67  | 445.43   |
| Avg. Speakers     | 5.57     | 5.00    | 5.50    | 5.55     |
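A quick way to inspect the far-field recordings is to de-interleave the 8 channels. Below is a minimal sketch using only Python's standard library; the 16-bit PCM assumption and the synthetic demo file are illustrative, so check the actual WAV headers of the released archives before relying on them:

```python
import os
import struct
import tempfile
import wave

def split_channels(path):
    """Read a multi-channel 16-bit PCM WAV and return one list of samples per channel."""
    with wave.open(path, "rb") as w:
        n_ch, n_frames = w.getnchannels(), w.getnframes()
        raw = w.readframes(n_frames)
    # Samples are interleaved frame by frame: ch0, ch1, ..., ch7, ch0, ...
    samples = struct.unpack("<%dh" % (n_ch * n_frames), raw)
    return [list(samples[c::n_ch]) for c in range(n_ch)]

# Create a tiny synthetic 8-channel file to demonstrate the round trip;
# channel c carries the constant value c in every frame.
tmp = os.path.join(tempfile.mkdtemp(), "demo_8ch.wav")
with wave.open(tmp, "wb") as w:
    w.setnchannels(8)
    w.setsampwidth(2)       # 16-bit PCM
    w.setframerate(16000)
    frames = [c for _ in range(4) for c in range(8)]
    w.writeframes(struct.pack("<%dh" % len(frames), *frames))

channels = split_channels(tmp)
```

Replacing the synthetic file with a real far-field WAV extracted from a *-CSOBx3.zip archive yields one mono signal per microphone.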

If you find this corpus useful in your research, please cite the following papers:

MISP-Meeting: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization

@inproceedings{chen2025misp,
  title     = "{MISP-Meeting}: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization",
  author    = "Chen, Hang and Yang, Chao-Han Huck and Gu, Jia-Chen and Siniscalchi, Sabato Marco and Du, Jun",
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year      = "2025",
  publisher = "Association for Computational Linguistics",
  pages     = "1--14"
}

Downloads
This dataset is available under license. By using the corpus, you agree to the terms of this license. If you do not agree to this license, then you do not have any rights to use the corpus, and you must immediately cease using it.
| File                       | Size     | MD5 Checksum                     |
| -------------------------- | -------- | -------------------------------- |
| training-CSOBx3.zip        | 151 G    | 9cdb120a91c6e0d7dbc90e0fa1e3bd9a |
| training-F8N.zip           | 46.2 G   | 222e288a6f8c28351787a135adfded78 |
| training-PSCx3.1.zip       | 325 G    | ac234e7de8350cb6435edaa02ccd87e2 |
| training-PSCx3.2.zip       | 318 G    | 11d3e0369a05341f330c100f98c308e2 |
| training-PSCx3.3.zip       | 289 G    | 6b5c4de9c7b461ae14896e15fca04b42 |
| training-PSCx3.4.zip       | 313 G    | ac234e7de8350cb6435edaa02ccd87e2 |
| training-PSCx3.5.zip       | 307 G    | 59c47d06ba7f1947d27e6921c8e557df |
| training-PSCx3.6.zip       | 283 G    | 83d7001e5061ffea8c321cf00bf52ef7 |
| training-PSCx3.7.zip       | 301 G    | 97c68815cfc234c1255abe800e03edda |
| training-PSCx3.8.zip       | 302 G    | 6414de042e70db249d7da7a7b6468700 |
| dev-CSOBx3.zip             | 4.27 G   | c7f51ae471fffec39aefa22ff4fe4e33 |
| dev-F8N.zip                | 1.26 G   | 05fb159399f723e3803731c1478d70dc |
| dev-PSCx3.1.zip            | 31.9 G   | d2e9741a65d3950ecf62f3235389e48b |
| dev-PSCx3.2.zip            | 31.9 G   | 2bbfd588eb5a2c58b440b201d01bf401 |
| eval-CSOBx3.zip            | 4.11 G   | 580e5a71411dfbe6f634d7630a20300c |
| eval-PSCx3.zip             | 71.56 G  | 148ea330c027ed590cb8ec4cbf17a05b |
| eval-F8N.zip               | 1.27 G   | 3a3291c63e1b3c16a1862e83f5244008 |
| training-Transcription.zip | 12.23 M  | 68e7be287ed38eedc7e4201bad2b3a30 |
| dev-Transcription.zip      | 448.93 K | 9f6e50958c8b3b59f469dca066e13b69 |
| meeting_summary.json       | 25.83 K  | 917b37316a36cd61bc02b9468ab28fc6 |
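After downloading, each archive can be verified against the MD5 column above. A minimal sketch that hashes in chunks, so multi-hundred-gigabyte zips never need to fit in memory (the demo file is a placeholder; substitute a real archive name from the table):

```python
import hashlib
import os
import tempfile

def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a tiny placeholder file; in practice call e.g.
# md5sum("dev-F8N.zip") and compare with the checksum in the table.
tmp = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(tmp, "wb") as f:
    f.write(b"hello")
digest = md5sum(tmp)
```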
Updated AVSR corpus of MISP2021 challenge
Introduction
The updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge is a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. It is the first distant multi-microphone conversational Chinese audio-visual corpus recorded in the home TV scenario, where several people chat in Chinese while watching TV and interacting with a smart speaker/TV in a living room. Based on the corpus presented in the MISP2021 challenge, this update corrects asynchronous samples in the training/development sets and adds more data to increase the diversity of the evaluation set. If you find this corpus useful in your research, please cite the following papers:
If you already downloaded the corpus during the MISP2021 challenge and want to update the corpus, please re-download the zip with [Update] and [New] tags.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
Updated AVWWS database of MISP2021 challenge
Introduction
The updated audio-visual wake word spotting (AVWWS) database of the MISP2021 challenge covers a range of scenarios, with audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, to create a shared, publicly available database for WWS. The wake word is "Xiao T Xiao T". A sample is labeled positive if it contains the wake word and negative otherwise; each sample contains at most one wake word. The data is divided into three subsets: Train, Development, and Evaluation, with the split enforcing speaker and room independence. Some noise data are also provided. To facilitate data transmission, the audio and video data are packed into compressed archives named according to their content; you can prepare the data directories by extracting the downloaded zip files. For more information about the directory structure, please refer to https://mispchallenge.github.io/mispchallenge2021/task1_data.html. Papers that make use of this database further need to cite [1] and [2].
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus. If you already downloaded the corpus during the MISP2021 challenge and want to update the corpus, please check the training set and development set according to the latest file list [Train_dev_file_list.zip], and re-download the evaluation set again.
Downloads
AVSD&AVDR corpus of MISP2022 challenge
Introduction
In the MISP 2021 challenge, we released a large multi-microphone conversational audio-visual corpus. In follow-up work, we resolved authorization and storage issues to fully release the updated AVWWS and AVSR corpora of the MISP2021 challenge to all researchers. For the MISP 2022 challenge, our training set is based on the updated MISP2021 AVSR corpus described in the first section and adds the RTTM/timestamp directories. The new development set is selected from the development and evaluation sets of the updated MISP2021 AVSR corpus. In addition, we will also release a new evaluation set, which shares no speakers with the other sets.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
AVTSE corpus of MISP 2023 challenge
Introduction
In the MISP 2021 challenge, we released a large multi-microphone conversational audio-visual corpus. In follow-up work, we resolved authorization and storage issues to fully release the updated AVWWS and AVSR corpora of the MISP 2021 challenge to all researchers. In the MISP 2022 challenge, we released a new development set and a new evaluation set. For the MISP 2023 challenge, we focus on the audio-visual target speaker extraction (AVTSE) task. Our training set is based on the updated MISP2021 AVSR corpus described in the first section, and the development set is based on the development set of the MISP 2022 AVSD&AVDR corpus. In addition, we will add the middle-field videos to the development set. For the evaluation set, we will add some new sessions focusing on female dialogue scenarios.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
—Training set
Please refer to the "Updated AVSR Corpus of MISP2021 Challenge" above for the download link of the training set. If you have already downloaded the training set, there is no need to repeat the download. We suggest that participants use near-field audio to simulate far-field audio to ensure complete alignment. We will provide a simulation solution in the baseline, and participants can also use different simulation methods or propose more innovative methods to solve this problem.
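One common way to realize the suggested simulation is to convolve the near-field signal with a room impulse response (RIR) and add noise. The sketch below uses a synthetic decaying-exponential RIR and white noise purely as placeholders; it is not the official baseline recipe, and toolkits that generate realistic RIRs (or measured RIRs) would normally replace them:

```python
import math
import random

def convolve(signal, rir):
    """Direct-form convolution: y[n] = sum_k x[k] * h[n - k]."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out

def simulate_far_field(near, rir, noise_level=0.01, seed=0):
    """Convolve near-field speech with an RIR and add white noise."""
    rng = random.Random(seed)
    reverberant = convolve(near, rir)
    return [s + rng.gauss(0.0, noise_level) for s in reverberant]

# Toy example: a 440 Hz tone as stand-in speech, a decaying exponential as a stand-in RIR.
near = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
rir = [0.9 ** k for k in range(32)]
far = simulate_far_field(near, rir)
```

Because the simulated far-field signal is derived directly from the near-field one, the two stay sample-aligned, which is the alignment property the suggestion above is after.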
—Development set
For the far-field audio and the transcription of the development set, please refer to the "AVSD&AVDR corpus of MISP2022 challenge" above for the download link. In addition, we provide the middle-field video and the lip detection results. The download links for these two are as follows.
Wake word lipreading corpus of ChatCLR challenge
Introduction
For the wake word lipreading task of the ChatCLR challenge, the training and development sets are based on the far-field videos of the updated MISP2021 AVWWS dataset. For the evaluation set, we will release new sessions that include words sharing similar lip shapes with the wake word, "Xiao T Xiao T", to increase the difficulty.
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
—Training set & development set
Please refer to the far-field videos of the training and development sets in the "Updated AVWWS database of MISP2021 challenge" above for the download links. If you have already downloaded the training and development sets, there is no need to download them again.
Target speaker lipreading corpus of ChatCLR challenge
Introduction
For the target speaker lipreading task of the ChatCLR challenge, we utilize the far-field videos from the training and development sets of the MISP2021 AVSR dataset. Other official open-source video datasets can also be used. The development and evaluation sets contain 12 participants whose videos are also included in the training set. Each speaker has approximately 30 minutes of data: two-thirds of each person's data make up the development set, while the remainder makes up the evaluation set.
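The two-thirds/one-third per-speaker split described above can be sketched as follows. The utterance IDs and ordering are illustrative; the official file lists should be used where provided:

```python
from collections import defaultdict

def split_per_speaker(utterances, dev_fraction=2 / 3):
    """Split (speaker, utt_id) pairs so that the first dev_fraction of each
    speaker's utterances goes to dev and the remainder to eval."""
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)
    dev, evl = [], []
    for spk, utts in by_speaker.items():
        cut = round(len(utts) * dev_fraction)
        dev += [(spk, u) for u in utts[:cut]]
        evl += [(spk, u) for u in utts[cut:]]
    return dev, evl

# Toy example: two hypothetical speakers with 9 and 6 utterances each.
utts = [("S01", f"utt{i:02d}") for i in range(9)] + \
       [("S02", f"utt{i:02d}") for i in range(6)]
dev, evl = split_per_speaker(utts)
```

Splitting within each speaker, rather than across speakers, matches the setup above in which the same 12 participants appear in both the development and evaluation sets.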
License

This dataset is available under license. By using the corpus, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the corpus, and you must immediately cease using the corpus.
Downloads
—Training set
Please refer to the far-field videos of the training and development sets in the "Updated AVSR corpus of MISP2021 challenge" above for the download link. If you have already downloaded the training set, there is no need to download it again.