Building a Deep Learning Dataset

May 22, 2020

My final project for OpenAI is to generate GraphQL queries from natural-language English prompts. The first project deliverable is a GraphQL dataset. This post details the process I went through to create that dataset and my reflections on each step.

Research & Plan

I started with some preliminary research as I prepared my project proposal. This included researching different tools and resources that I could use to create a dataset. The tools I used are detailed in my project proposal. It took a bit of balancing between the available tools and the tools I'm comfortable with. One of the most important resources I used was the Spider Dataset, a Text-to-SQL dataset. My idea was to build off that dataset and convert it to GraphQL. It took some research to figure out whether that was even possible. I spent days looking at SQL and GraphQL grammar, and I reviewed the dataset throughout. I took notes about possible approaches and worked through different exercises until I felt comfortable with one. It took around 2 weeks of research and planning before I felt I had a solid approach and the right tools.

I felt that those 2 weeks were enough to give me a solid foundation to build on, though a little more might have been helpful. Much more and I would have lost time for running experiments. Any less and I might not have had enough underlying understanding to build the dataset.

Coding & Building

Building the dataset took a decent amount of engineering. All in all, I wrote 7 scripts and around 2,000 lines of code, plus some more generated code. Most of the process went pretty smoothly. I don't want to go into too much technical detail, but there were some choices that ended up being instrumental in building the dataset.

One strategy I took was to use the language and IDE to offload as much cognitive load as possible, since I expected the engineering to get intense. I used Swift for its type system, since I was dealing with structured data. It's also a language I'm comfortable with, and its speed and debugging tools were beneficial too.
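To give a rough idea of what leaning on the type system can look like, here is a minimal sketch, not the code from my scripts. It assumes each Spider record carries the JSON fields db_id, question, and query; the SpiderExample type, the decodeExamples function, and the sample record are hypothetical names and data made up for illustration.

```swift
import Foundation

// Hypothetical record type. The JSON keys (db_id, question, query) are an
// assumption about the Spider file layout; adjust them if your copy differs.
struct SpiderExample: Codable {
    let dbID: String
    let question: String
    let query: String

    enum CodingKeys: String, CodingKey {
        case dbID = "db_id"
        case question
        case query
    }
}

// Illustrative sample data, not taken from the real dataset files.
let sampleJSON = """
[
  {"db_id": "concert_singer", "question": "How many singers do we have?", "query": "SELECT count(*) FROM singer"}
]
"""

// Decoding fails with an error if a field is missing or has the wrong type,
// so malformed records surface immediately instead of deep inside a script.
func decodeExamples(from json: String) throws -> [SpiderExample] {
    return try JSONDecoder().decode([SpiderExample].self, from: Data(json.utf8))
}

let examples = (try? decodeExamples(from: sampleJSON)) ?? []
for example in examples {
    print("\(example.dbID): \(example.question)")
}
```

Once the data is decoded into a struct like this, the compiler and the IDE's autocomplete keep track of which fields exist, which is the kind of cognitive load I wanted to hand off.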

I also found myself making a lot of assumptions about the dataset, so I validated them by littering my scripts with assert statements and fatal errors. This had the secondary benefit of giving me a decent understanding of the examples in the dataset.
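Here is a minimal sketch of that habit, under stated assumptions. The ConvertedExample type and the three checks (non-empty question, SQL that starts with SELECT, balanced braces in the GraphQL) are hypothetical stand-ins chosen for illustration, not the actual assumptions from my scripts.

```swift
// Hypothetical converted example: an English question with its SQL and GraphQL.
struct ConvertedExample {
    let question: String
    let sql: String
    let graphQL: String
}

// Returns true when every "{" has a matching "}" in the right order.
func hasBalancedBraces(_ text: String) -> Bool {
    var depth = 0
    for character in text {
        if character == "{" {
            depth += 1
        } else if character == "}" {
            depth -= 1
            if depth < 0 { return false }
        }
    }
    return depth == 0
}

// Collects every assumption the example violates; empty means all held.
// These particular checks are illustrative assumptions only.
func violatedAssumptions(in example: ConvertedExample) -> [String] {
    var problems: [String] = []
    if example.question.allSatisfy({ $0.isWhitespace }) {
        problems.append("question is empty")
    }
    if !example.sql.uppercased().hasPrefix("SELECT") {
        problems.append("SQL does not start with SELECT")
    }
    if !hasBalancedBraces(example.graphQL) {
        problems.append("GraphQL braces are unbalanced")
    }
    return problems
}

// Stops the script as soon as an assumption is wrong, naming what failed.
func checkAssumptions(_ example: ConvertedExample) {
    let problems = violatedAssumptions(in: example)
    precondition(problems.isEmpty, "Assumption failed: \(problems)")
}

let sample = ConvertedExample(
    question: "How many singers do we have?",
    sql: "SELECT count(*) FROM singer",
    graphQL: "query { singer_aggregate { aggregate { count } } }"
)
checkAssumptions(sample)
```

Crashing loudly on the first bad example sounds harsh, but it meant every surprise in the data forced me to stop and look at the record that caused it.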

Surprisingly, I only ran into a couple of issues in the process, though these are fairly technical and specific to my dataset. One of the big features of the Spider dataset is that it includes many complex queries, and the issues I ran into were in these queries. For example, Hasura (the tool that builds a GraphQL endpoint on top of a DB) has the limitation that its default endpoint can't represent the GROUP BY clause. With issues like these I had to decide between writing queries manually or throwing out those examples.

I did expect to run into more issues while engineering, but I was pretty happy with how smoothly it went and how the dataset turned out. This step ended up taking around 3 weeks before the dataset was at a point where I could start validating it and running some deep learning experiments with it.

Review & Validation

This step had me going back to the dataset and fixing issues that I found after the fact. The process included manually looking at examples and verifying that the queries were syntactically valid and ran on Hasura. There were some issues I only ran into while running the dataset through a baseline network. Luckily, I found most issues by writing a script to validate the queries with an ESLint plugin.

Although this step added more work than I expected, I learned the obvious lesson that review and validation are necessary to build a high-quality dataset.

Next Steps

I'm actually not finished reviewing my dataset. I'm still catching and fixing minor issues, and I'll keep improving it and adding examples until the end of the project.

At the end of the project I'll be releasing the dataset along with all of the source code. For now I'll be working on building those baseline models and improving the dataset.