My final project involves taking natural language English prompts and running them through a model to generate corresponding GraphQL queries. This post details the preliminary model experiments and validation strategies I've used over the past couple of weeks.
As you can see in my project proposal document, I planned to experiment with a few models, including BART, the vanilla Transformer, and the Reformer. I've stayed close to that original plan, though I replaced the vanilla Transformer with T5. The idea was to run preliminary experiments with each of those models, collect baseline results, and then work on improving those metrics later in the project.
I think it's helpful to reflect on some of the issues I found as I started training the models. There were some minor issues including errors with my dataset, hyperparameter issues, and computation constraints. But those roadblocks were easily fixed with a little debugging. The biggest issue I ran into that I didn't plan for was my evaluation metrics.
Initially, I just assumed I would be able to use the same evaluation metrics as Spider (the dataset mine is based on). Then, while running my models, I realized I needed to do a little more research. Eventually I landed on the idea of parsing both my model's output and the target query into Abstract Syntax Trees (ASTs). I could then use set comparison between the ASTs to get a boolean accuracy for each example. Luckily, I found a library that parses GraphQL queries into an AST. I had to make a few minor tweaks, but eventually I had an accuracy metric I could validate my models against.
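The set-comparison idea can be sketched in a few lines. This is a minimal illustration, not the actual code: it assumes the parsed AST has already been converted into plain nested dicts, and the names `freeze` and `ast_match` are mine. A real GraphQL AST has many more node kinds and metadata fields than the toy nodes below.

```python
def freeze(node):
    """Recursively convert a dict-based AST node into a hashable,
    order-insensitive structure so two trees can be compared as sets."""
    if isinstance(node, dict):
        return frozenset((key, freeze(value)) for key, value in node.items())
    if isinstance(node, list):
        # Treat sibling nodes (e.g. fields in a selection set) as a set,
        # so ordering differences alone don't count as mismatches.
        return frozenset(freeze(item) for item in node)
    return node

def ast_match(predicted_ast, target_ast):
    """Boolean accuracy for one example: True iff the trees are equivalent."""
    return freeze(predicted_ast) == freeze(target_ast)

# Toy dict-shaped ASTs for two queries selecting the same fields
# in a different order.
pred = {"kind": "field", "name": "user",
        "selections": [{"kind": "field", "name": "id"},
                       {"kind": "field", "name": "email"}]}
gold = {"kind": "field", "name": "user",
        "selections": [{"kind": "field", "name": "email"},
                       {"kind": "field", "name": "id"}]}
print(ast_match(pred, gold))  # → True
```

Averaging `ast_match` over the validation set then gives a single accuracy number per model.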
Once I ran my models against this validation metric, I was pretty excited by the results: my T5 model reached 48% accuracy! That seemed like a big deal to me, since this is a new dataset and a new task, and I'm already getting results without much hyperparameter tuning.
The validation metric I came up with works for my purposes, but there's room for improvement. Since queries are just trees based on the schema graph, there are multiple ways to represent equivalent queries, so ideally I'd find a way to evaluate those representations as equal, though I'm not sure my time constraints will allow it. Another option is adding execution accuracy, verifying how queries actually run against my GraphQL endpoints.
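Execution accuracy could look something like the sketch below: run both the predicted and the target query against an endpoint and compare the returned data. The endpoint URL is a placeholder, and this uses only the standard library's `urllib`; the actual post will depend on how the GraphQL server is set up.

```python
import json
from urllib import request

GRAPHQL_URL = "http://localhost:4000/graphql"  # hypothetical endpoint

def execute(query: str, url: str = GRAPHQL_URL) -> dict:
    """POST a query to a GraphQL endpoint and return the JSON response."""
    payload = json.dumps({"query": query}).encode("utf-8")
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def execution_match(predicted: str, target: str,
                    url: str = GRAPHQL_URL) -> bool:
    """Execution accuracy for one example: both queries must run and
    return identical data."""
    try:
        return execute(predicted, url) == execute(target, url)
    except Exception:
        # A predicted query that fails to execute counts as incorrect.
        return False
```

One nice property of this metric is that it would also catch queries that are syntactically different but semantically equivalent, which the AST comparison currently misses.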
In the coming weeks I'll continue to experiment with my models, collect further results and compile everything I've gathered from the program!