CoST dataset
This is the main dataset designed during the CoST project:
Publication: Dosso C, Moreno JG, Chevalier A, and Tamine L. 2021. CoST: An annotated Data Collection for Complex Search. CIKM 2021.
Access: https://doi.org/10.6084/m9.figshare.15286353
Other datasets
Approaches developed during the project may be evaluated on the following benchmarks and collections:
- The TREC Session Track (2010 to 2014), based on the ClueWeb09 and ClueWeb12 collections
- The TREC Tasks Track (2015 to 2017), based on the ClueWeb12 collection
- AOL search query logs
- Based on AOL:
  - The AOL User Task [1]
  - The AOL Task extraction [2]
- The Webis-SMC-12 corpus, containing 8,840 search engine interactions of 127 users
- A lifelogging dataset from the NTCIR Lifelogging track, comprising 60 days of logs with over 1,600 annotated activities. The dataset was used during the Task Intelligence Workshop @ WSDM 2019
- The CLEF 2018 Dynamic Search for Complex Tasks
- The dataset used in the Yandex Personalized Web Search Challenge on Kaggle
- The Webis Query-Task-Mapping Corpus 2019 (Webis-QTM-19) [3]
[1] Lucchese, C., Orlando, S., Perego, R., Silvestri, F., and Tolomei, G. Discovering User Tasks from Search Engine Query Logs. In ACM Transactions on Information Systems (ACM TOIS), vol. 31, issue 3 – July 2013, pp. 14:1–14:43.
[2] Sen, P., Ganguly, D., and Jones, G. J. F. Tempo-lexical context driven word embedding for cross-session search task extraction. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1–6 June 2018, New Orleans, LA, USA.
[3] Völske, M., Fatehifar, E., Stein, B., and Hagen, M. Query-Task Mapping. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19). Association for Computing Machinery, New York, NY, USA, 2019, pp. 969–972.