The Multilingual-BC3 corpus is a dataset comprising multilingual conversations that took place via email. Multilingual-BC3 is not a naturally multilingual dataset; it was constructed by manually translating parts of the BC3 dataset. The dataset was produced by three people at the Intelligent Information Systems Lab, University of Tehran.
The original BC3 is annotated with sentence labels such as Speech Acts and Subjectivity. In addition, ConThread-BC3 is a version of BC3 that is annotated with conversation threads and body-text segmentation information. This extra information is language independent, so it can easily be carried over to Multilingual-BC3.
The conversations in Multilingual-BC3 are in two languages, Persian and English. To simulate real multilingual conversations, the translation follows a specific policy. We assume that each person sends email in only one language. We therefore selected 25 of the 160 Person-IDs and translated all of their emails. These people were chosen so that, after translation, every conversation thread in the dataset is multilingual; in other words, each conversation includes emails in both languages. As a result, 107 of the 261 emails were translated. Note that if a translated email is quoted in other emails, the quoted parts in those emails are also replaced by the equivalent translation. Consequently, the body text of some emails is mixed-language (e.g., the main content may be in a different language from the quoted parts). Details of this procedure are described in a README file.
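The selection constraint above (every thread must end up with emails in both languages) can be checked mechanically. The following is only a minimal sketch of such a check, not the procedure used to build the corpus, which is documented in the README. The function names, the `(thread_id, person_id)` tuple layout, the toy data, and the assumption that translated emails are Persian (`"fa"`) while untranslated ones stay English (`"en"`) are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def email_language(person_id, translated_persons):
    """Language of an email under the one-language-per-person assumption.

    Assumption: emails of selected persons are translated to Persian ("fa");
    all other emails remain in English ("en").
    """
    return "fa" if person_id in translated_persons else "en"


def monolingual_threads(emails, translated_persons):
    """Return the sorted ids of threads that contain only one language.

    `emails` is an iterable of (thread_id, person_id) pairs (hypothetical layout).
    An empty result means the selection satisfies the policy.
    """
    languages_per_thread = defaultdict(set)
    for thread_id, person_id in emails:
        languages_per_thread[thread_id].add(
            email_language(person_id, translated_persons)
        )
    return sorted(
        thread_id
        for thread_id, languages in languages_per_thread.items()
        if len(languages) < 2
    )


# Toy data (not taken from the corpus): three threads, four senders.
toy_emails = [
    ("t1", "p1"), ("t1", "p2"),
    ("t2", "p2"), ("t2", "p3"),
    ("t3", "p2"), ("t3", "p4"),
]

good_selection = {"p2"}  # p2 takes part in every thread
bad_selection = {"p1"}   # p1 only takes part in thread t1

remaining_good = monolingual_threads(toy_emails, good_selection)
remaining_bad = monolingual_threads(toy_emails, bad_selection)
```

In this toy example, translating only `p2` makes every thread bilingual, whereas translating only `p1` leaves `t2` and `t3` entirely in English. A check of this kind says nothing about how the 25 Person-IDs were actually chosen; it only verifies a candidate selection against the stated constraint.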
There are two versions of Multilingual-BC3. The first is the multilingual version of the original BC3 dataset, and the second is the multilingual version of ConThread-BC3.
Citing the Multilingual-BC3 Corpus:
When citing or discussing the Multilingual-BC3 corpus, please reference these papers:
- Mostafa Dehghani, A. Shakery, M. Asadpour, and A. Koushkestani, "A Learning Approach for Email Conversation Thread Reconstruction", Journal of Information Science (JIS), Volume 39, Issue 6, 2013, pp. 846-863. [ACM-DL Link]
- Mostafa Dehghani, M. Asadpour, and A. Shakery, "An Evolutionary-Based Method for Reconstructing Conversation Threads in Email Corpora", in Proceedings of the 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM'12), 2012. [ACM-DL Link]
The Multilingual-BC3 Corpus by Mostafa Dehghani is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://www.cs.ubc.ca/labs/lci/bc3.html. Here you can download Multilingual-BC3.
If you have any questions, ideas or suggestions, please do not hesitate to contact me!