Published August 16, 2021 | Version 0.0
Dataset Open

Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

  • 1. The Ohio State University
  • 2. Microsoft Research

Description

This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773. 

It contains the following files:

- spider-realistic.json
  # The spider-realistic evaluation set
  # Examples: 508
    # Databases:         19
- dev.json
  # The original dev split of Spider
    # Examples: 1034
    # Databases:         20
- tables.json
  # The original DB schemas from Spider
    # Databases:          166
- README.txt
- license

The Spider-Realistic dataset is created based on the dev split of the Spider dataset realsed by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The sql queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.

This dataset is distributed under the CC BY-SA 4.0 license.

If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

@article{deng2020structure,
  title={Structure-Grounded Pretraining for Text-to-SQL},
  author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
  journal={arXiv preprint arXiv:2010.12773},
  year={2020}
}

@inproceedings{Yu&al.18c,
  year =         2018,
  title =        {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  booktitle =    {EMNLP},
  author =       {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}

@InProceedings{P18-1033,
  author =     "Finegan-Dollak, Catherine
        and Kummerfeld, Jonathan K.
        and Zhang, Li
        and Ramanathan, Karthik
        and Sadasivam, Sesh
        and Zhang, Rui
        and Radev, Dragomir",
  title =     "Improving Text-to-SQL Evaluation Methodology",
  booktitle =     "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year =     "2018",
  publisher =     "Association for Computational Linguistics",
  pages =     "351--360",
  location =     "Melbourne, Australia",
  url =     "http://aclweb.org/anthology/P18-1033"
}

@InProceedings{data-sql-imdb-yelp,
  dataset   = {IMDB and Yelp},
  author    = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
  title     = {SQLizer: Query Synthesis from Natural Language},
  booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
  month     = {October},
  year      = {2017},
  pages     = {63:1--63:26},
  url       = {http://doi.org/10.1145/3133887},
}

@article{data-academic,
  dataset   = {Academic},
  author    = {Fei Li and H. V. Jagadish},
  title     = {Constructing an Interactive Natural Language Interface for Relational Databases},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {8},
  number    = {1},
  month     = {September},
  year      = {2014},
  pages     = {73--84},
  url       = {http://dx.doi.org/10.14778/2735461.2735468},

@InProceedings{data-atis-geography-scholar,
  dataset   = {Scholar, and Updated ATIS and Geography},
  author    = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
  title     = {Learning a Neural Semantic Parser from User Feedback},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2017},
  pages     = {963--973},
  location  = {Vancouver, Canada},
  url       = {http://www.aclweb.org/anthology/P17-1089},
}

@inproceedings{data-geography-original
  dataset   = {Geography, original},
  author    = {John M. Zelle and Raymond J. Mooney},
  title     = {Learning to Parse Database Queries Using Inductive Logic Programming},
  booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
  year      = {1996},
  pages     = {1050--1055},
  location  = {Portland, Oregon},
  url       = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}

@inproceedings{data-restaurants-logic,
  author    = {Lappoon R. Tang and Raymond J. Mooney},
  title     = {Automated Construction of Database Interfaces: Intergrating Statistical and Relational Learning for Semantic Parsing},
  booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
  year      = {2000},
  pages     = {133--141},
  location  = {Hong Kong, China},
  url       = {http://www.aclweb.org/anthology/W00-1317},
}

@inproceedings{data-restaurants-original,
 author    = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
 title     = {Towards a Theory of Natural Language Interfaces to Databases},
 booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
 year      = {2003},
 location  = {Miami, Florida, USA},
 pages     = {149--157},
 url       = {http://doi.acm.org/10.1145/604045.604070},
}

@inproceedings{data-restaurants,
  author    = {Alessandra Giordani and Alessandro Moschitti},
  title     = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
  booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
  year      = {2012},
  location  = {Montpellier, France},
  pages     = {59--76},
  url       = {https://doi.org/10.1007/978-3-642-45260-4_5},
}

Files

dev.json

Files (6.4 MB)

Name Size Download all
md5:ff99bb627ccb6491fc93609933f0c83b
3.6 MB Preview Download
md5:f1385c9c86eafb2029bbf801bc36f831
20.1 kB Download
md5:25d7aa4fd51a1f0f863c5cdbc8bcf5e9
6.3 kB Preview Download
md5:40252e199d88e33ebf13825fa6ac4ca5
1.9 MB Preview Download
md5:a0fd4de9c7703334f0bd3fcd84463f5e
811.0 kB Preview Download