Datasets - CoST

CoST dataset

This is the main dataset designed during the CoST project:
Publication: Dosso C, Moreno JG, Chevalier A, and Tamine L. 2021. CoST: An annotated Data Collection for Complex Search. CIKM 2021.
Access: https://doi.org/10.6084/m9.figshare.15286353

Other datasets

Approaches developed during the project might be evaluated using the following benchmarks/collections:

The TREC Session Track (2010 to 2014), based on the ClueWeb09 and ClueWeb12 collections
The TREC Task Track (2015 to 2017), based on the ClueWeb12 collection
The AOL search query logs
Datasets based on the AOL logs:
- The AOL User Task [1]
- The AOL Task extraction [2]
The Webis-SMC-12 corpus, containing 8,840 search engine interactions from 127 users
A lifelogging dataset from the NTCIR Lifelogging track, comprising 60 days of logs with over 1,600 annotated activities. The dataset was used during the Task Intelligence Workshop @ WSDM 2019
The CLEF 2018 Dynamic Search for Complex Tasks
The dataset used in the Yandex Personalized Web Search Challenge on Kaggle
The Webis Query-Task-Mapping Corpus 2019 (Webis-QTM-19) [3]

[1] Lucchese, C., Orlando, S., Perego, R., Silvestri, F., and Tolomei, G. Discovering User Tasks from Search Engine Query Logs. In ACM Transactions on Information Systems (ACM TOIS), vol. 31, issue 3 – July 2013, pp. 14:1–14:43.

[2] Sen, P., Ganguly, D., and Jones, G. J. F. Tempo-lexical context driven word embedding for cross-session search task extraction. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1–6 June 2018, New Orleans, LA, USA.

[3] Michael Völske, Ehsan Fatehifar, Benno Stein, and Matthias Hagen. 2019. Query-Task Mapping. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19). Association for Computing Machinery, New York, NY, USA, 969–972.