Towards GameDevGPT - Adapting LLMs to Specialized Programming Tasks

Oct 9, 2023


  • There is a lot of demand from game developers for a specialized code generation model, as it could save significant time across a variety of tasks

  • But LLMs struggle when it comes to specialized programming tasks

  • Code generations are often cut short, mixed up… or just plain wrong

  • Open-source models can be adapted to improve results and compete with the best-performing closed models

  • Our results support this hypothesis - showing that these adapted models can perform comparably with the best closed models in specialized technical tasks

  • We believe there is significant headroom for specialized models to continue to improve, drastically outperforming more generalized alternatives and becoming commercially viable

The future of specialized programming

The last few years have given us a glimpse into the future of programming. Copilot (code completion), ChatGPT (code generation) & GPT Engineer (autonomous software engineering) inspire us all to think of a world where we:

  • 🛩️ Develop at warp speed

  • 🦾 Create without vast amounts of technical know-how

  • 🤖 Build autonomously, without human input at each step

Yet for most tasks, and especially for specialized programming tasks like game development, these tools remain ineffective and, as such, underutilized.

Even disregarding typical issues in generative AI (e.g. data provenance, security and cost), performance limitations mean that these models are unable to cross the chasm to mainstream adoption.

In short, the demand is there but the value is not yet readily accessible:

“I use ChatGPT to help generate code in some instances. I prefer it to finding code on Stack Overflow, but even so I still often get generated code that looks great at first, but is actually gibberish & uses elements from multiple frameworks.”

Steve, Indie Game Developer

The need for specialized models

This is where Unakin comes in. We take foundation models and transform them into useful AI agents - capable of autonomously completing highly valuable, highly specific technical tasks.

To do this, a core component of our technology is the development of a best-in-class CodeGen model, which ultimately enables our users to rapidly build and iterate.

A few factors mean that such a model is probably the only way forward to commercialization:

  • Improved performance: custom models can incorporate multiple state-of-the-art (SotA) techniques to provide better results across specific metrics

  • Improved outcomes with domain-adaptation: specialized models may better understand relevant syntax, libraries & frameworks, providing code that is more aligned with user needs

  • Transparent data provenance: especially in industries like gaming - where IP & copyright are hugely important - clear data pipelines are preferable to black-box models

“My company’s lawyer addressed us and banned us from using any OpenAI product. They won’t let us use anything without prior confirmation.”

Tim, Professional Game Developer at mid-sized studio

  • Custom features: code generation is just one component of a useful product. Custom models can enable product extensibility, which ultimately can underpin a vastly superior user experience

  • Visibility on core technology roadmap: no-one likes having the rug pulled from underneath them. Specialized models remove the dependency upon a generalized LLM provider where you might have to deal with model degradation or a shift in focus over time

  • Cost savings: whilst not universally true, building custom models can open up cheaper avenues for fine-tuning and inference spend

Our methodology - overview

Like so many others, we had previously tested the GameDev CodeGen capabilities of ChatGPT and GPT-4. The results were… frustrating.

Using ChatGPT with the prompt: "catch on fire"

Despite this frustration, it’s clear that ChatGPT/GPT-4 are the best performing models across most software engineering metrics. You can see their comparative scoring across popular benchmarks HumanEval & MBPP below.

We set out with the objective to build a competitive (performance-wise) model benchmarked across one sub-discipline - UI programming code within game dev - whilst maintaining the benefits of custom models.

As a model base, we chose the Llama family as they are performant open-source models, although they still lag behind ChatGPT and GPT-4 in most coding metrics. Specifically, we worked with Llama 2 and, later, Code Llama.

Our methodology - specific techniques

Here is an overview of some of the more specific techniques we used during this adaptation. It's a bit more technical, so feel free to skip!

Giant batch size

TL;DR: With complete control over the model and training implementations, we can train, fully fine-tune, or apply parameter-efficient fine-tuning (PEFT) with very large batch sizes.

Training throughput is one of the most crucial elements when performing experiments, especially on large data samples. By using our own re-implementation of the models together with a custom training loop, we gain access to a large number of run-time and static optimizations. As such, we’re able to do full fine-tuning at sequence lengths of up to 64K with our main model.

Since we have full, explicit control over the attention kernels that we’re using, we can pack several mini-batches into the same sequence by employing a custom attention mask, without breaking the auto-regressive conditional dependence. On 8xA100 80GB, this allows us to train at a maximum of 512K tokens per step, which takes just a few seconds.
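To make the packing idea concrete, here is a minimal, illustrative sketch of a block-diagonal causal mask. It is plain Python rather than an attention kernel, and the function name `packed_causal_mask` is hypothetical. Each token may attend only to earlier tokens from the same packed sample, which is what keeps the auto-regressive dependence intact.

```python
def packed_causal_mask(segment_lengths):
    """Build a block-diagonal causal mask for samples packed into one sequence.

    mask[i][j] is True when token i may attend to token j: j must belong
    to the same packed sample as i and must not lie in the future.
    """
    # Label every position with the index of the sample it belongs to.
    segment_ids = []
    for seg, length in enumerate(segment_lengths):
        segment_ids.extend([seg] * length)

    total = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


# Two samples of length 3 and 2 packed into a single sequence of 5 tokens.
mask = packed_causal_mask([3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid shows two independent lower-triangular blocks: tokens of the second sample never see tokens of the first, even though they share a sequence.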

Careful fine-tuning

TL;DR: We use a custom training schedule and meta-learning-like techniques to avoid most pitfalls of training LLMs or learning from out-of-distribution data.

Our earliest experiments on several models were affected by a common pitfall of LLM training, namely loss spiking. Usually this is handled either by manually restarting the training run, or by frequently performing a costly Exponential Moving Average aggregation of the weights during the optimization process. These methods hurt the overall cost efficiency of model training, so we decided to carefully analyse when and why spiking happens.

To this end, we devised a careful training curriculum which takes inspiration both from curriculum learning and domain adaptation. With this novel optimization schedule we were able to fully mitigate the loss spiking instability in a purely online fashion without requiring additional computational costs or manual intervention from our engineering team.

Another important aspect of our training routine is that it's heavily inspired by meta-learning, namely Model-Agnostic Meta-Learning (MAML). At training time, we randomly sample some of the parameters for which we want to compute an update step and then obtain the pseudo-gradient using low-rank approximations. Then, at inference time, we compute the average pseudo-gradient and apply it to the base model's weights. Compared with more well-established training methodologies, this approach minimises the model variance that the sampling procedure might introduce, while the gradient averaging provides implicit regularisation.
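The final averaging step can be pictured with a toy example. This is only one possible reading of the description above, not the actual training routine. It assumes each pseudo-gradient is stored as a pair of low-rank factors (B, A) whose product has the shape of the weight matrix, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_averaged_updates(base_weights, low_rank_updates):
    """Add the mean of several low-rank updates (B @ A) to the base weights."""
    rows, cols = len(base_weights), len(base_weights[0])
    mean_delta = [[0.0] * cols for _ in range(rows)]
    for b_factor, a_factor in low_rank_updates:
        delta = matmul(b_factor, a_factor)  # one low-rank pseudo-gradient
        for i in range(rows):
            for j in range(cols):
                mean_delta[i][j] += delta[i][j] / len(low_rank_updates)
    return [
        [base_weights[i][j] + mean_delta[i][j] for j in range(cols)]
        for i in range(rows)
    ]


# Toy example: a 2x2 weight matrix and two rank-1 updates (made-up numbers).
base = [[1.0, 0.0], [0.0, 1.0]]
updates = [
    ([[1.0], [0.0]], [[2.0, 0.0]]),  # B is 2x1, A is 1x2
    ([[0.0], [1.0]], [[0.0, 4.0]]),
]
merged = apply_averaged_updates(base, updates)
print(merged)
```

Averaging several updates before applying them damps the effect of any single noisy sample, which is the intuition behind the variance reduction described above.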

Curated datasets

TL;DR: We use a dedicated data collection and pre-processing pipeline to ensure better model performance.

Several works have shown the importance of high-quality data, and as such we carefully curate the samples that we use for training. Our efforts are two-fold: using ethically gathered data (only permissively licensed code, or other sources that explicitly provide permissive licences for derivative work) and ensuring a minimum level of data quality.

For data collected from GitHub, we use several repository-level heuristics to decide whether or not a repository should be included in our training dataset. After that, we use a mix of automatic parsing and manual inspection to make sure we’re not using any faulty code.

Finally, we format code at both train and test time to ensure that none of the statistical reasoning that the model gains is style-dependent.
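As a rough sketch of what such a pipeline could look like, the snippet below filters repositories, drops files that fail to parse, and normalizes formatting. Everything specific here is an assumption made for illustration: the licence allow-list, the star threshold, and the record fields are hypothetical, and Python's `ast` module stands in for whatever parser and formatter the target game-dev language would need.

```python
import ast

# Hypothetical allow-list; the actual licence policy is not specified here.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


def keep_repository(repo, min_stars=5):
    """Repository-level filter: permissive licence plus a crude quality signal."""
    return repo["license"].lower() in PERMISSIVE_LICENSES and repo["stars"] >= min_stars


def parses_cleanly(source):
    """File-level filter: drop code that does not even parse."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def normalize_style(source):
    """Re-emit code in one canonical style so formatting quirks are removed."""
    return ast.unparse(ast.parse(source))  # requires Python 3.9+


# Made-up repository records, purely for illustration.
repos = [
    {"name": "a", "license": "MIT", "stars": 40, "files": ["x=1\n", "def f(:\n"]},
    {"name": "b", "license": "GPL-3.0", "stars": 900, "files": ["y = 2\n"]},
]
dataset = [
    normalize_style(source)
    for repo in repos
    if keep_repository(repo)
    for source in repo["files"]
    if parses_cleanly(source)
]
print(dataset)
```

Applying the same `normalize_style` step to prompts at test time keeps the model from seeing a formatting style it was never trained on.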

Extending context length

TL;DR: We reimplement and port readily available models, which lets us use significantly more efficient implementations. We use synthetic instructions and context to make the model better at long sequences.

When working with codegen for large repositories, there are several scenarios in which the model might have to account for a large number of related code fragments, or code fragments that might be affected by the feature that is being generated. CodeLlama and subsequent works showcase simple yet effective strategies for scaling rotary position embeddings (RoPE).
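For readers unfamiliar with RoPE scaling, the sketch below illustrates the strategy described in the Code Llama paper: raising the base period of the rotary embeddings from 10,000 to 1,000,000 so that the slowest-rotating dimensions cover much longer distances. The head dimension of 128 is an assumption made for the example, and the function name is hypothetical.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength (in tokens) of each rotary frequency pair.

    Pair i rotates by an angle of position * base ** (-2 * i / head_dim),
    so its wavelength is 2 * pi * base ** (2 * i / head_dim).
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


default = rope_wavelengths(head_dim=128, base=10_000)
scaled = rope_wavelengths(head_dim=128, base=1_000_000)
print(f"longest wavelength, base 1e4: {default[-1]:,.0f} tokens")
print(f"longest wavelength, base 1e6: {scaled[-1]:,.0f} tokens")
```

With the larger base, the longest wavelengths grow by roughly two orders of magnitude, so positions far apart in a long file remain distinguishable.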

We adopt the same approach as in CodeLlama and introduce training on much longer sequence lengths that are sourced from very large code files. To emulate the live behaviour of our CodeGen agent, we build out a synthetic instruction generation pipeline, where we mask some code fragments from the original file and ask the LLM to reconstruct them. Unlike code infilling, this method is special-token agnostic, and it also comes with the added benefit of directly training the model with a 1:1 copy of the behaviour that will be used when collaborating with the agent.
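A minimal sketch of how one such synthetic example could be built is shown below. The instruction wording, the placeholder comment, and the function name are all hypothetical. The point is only that the masked region is marked with ordinary text rather than a special infilling token.

```python
import random


def make_reconstruction_example(source, span_lines=3, seed=0):
    """Mask a contiguous block of lines and ask for it to be reconstructed."""
    lines = source.splitlines()
    rng = random.Random(seed)
    start = rng.randrange(0, max(1, len(lines) - span_lines + 1))
    end = min(len(lines), start + span_lines)

    placeholder = "# <code omitted>"  # plain text, not a special tokenizer token
    context = lines[:start] + [placeholder] + lines[end:]
    prompt = (
        "The file below has a block of code omitted. "
        "Write the omitted block.\n\n" + "\n".join(context)
    )
    target = "\n".join(lines[start:end])
    return prompt, target


sample = "a = 1\nb = 2\nc = 3\nd = 4\ne = 5\nf = 6"
prompt, target = make_reconstruction_example(sample)
print(prompt)
print("--- expected completion ---")
print(target)
```

Because the prompt is ordinary instruction text, the same format can be reused unchanged when an agent asks the model to fill in or rewrite part of a file.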

Additional note on copyright:

To address the concerns of Tim's lawyer, we specifically set up a benchmark where we fine-tune a commercially licensed model in a strictly unsupervised manner on readily available, permissively licensed data from the internet. Our aim is to provide value even to the most stringent clients, for whom the preferred solution would be an on-premise deployment of a model tailored to their specific needs. We omit instructions and synthetic data to showcase how cost-effective our baseline solution can be, as collecting instruction datasets can be prohibitively costly in most cases.

Results: evaluation methodology

It should be noted that evaluating CodeGen models is already contentious. Static measurements often lack the nuance of real-world usage & can therefore fail to capture the actual value driven by these models.

This issue is even more stark in specialized programming tasks. For example, in game dev, even code that is seemingly ok can ultimately fail to deliver the required experience for the end user. 

Moving forward, the industry will need better, more in-depth & more aligned evaluation metrics to truly drive towards customer success. 

Having said that, we decided to use two evaluation metrics to measure performance.

Firstly, we use an automated metric called ‘perplexity’. Perplexity measures how well a language model predicts a sample of text: the lower the perplexity, the better the model is at making predictions. In the absence of alternatives, this can be used as a proxy for model improvement. Unfortunately, we cannot test this against models like GPT-4 since its perplexity is not publicly released information.
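For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities (natural log). How those log-probabilities are obtained from a given model is left out, and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns every token probability 0.25 has perplexity 4:
# it is as uncertain as a uniform choice between four options.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))

# More confident (higher-probability) predictions give lower perplexity.
confident = [math.log(0.9)] * 10
print(perplexity(confident))
```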

Secondly, we have a much more involved manual evaluation, where game developers are provided with randomized samples generated by different models. The test is conducted blind (to remove any potential bias), and the developers are tasked with running and reviewing the code within the game engine. The samples are then scored against each other based upon the developer’s review of how well each aligns with the prompt. A pairwise comparison is then calculated to directly compare models.

The above method is itself more aligned with the customers’ desires, yet is restrictive due to its resource-intensive nature and resultant smaller sample sizes.
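One simple way to turn blind head-to-head judgements into a pairwise comparison is a win rate per model pair, sketched below. The tie handling (half a win each), the data layout, and the placeholder model names are assumptions for illustration. The judgements are made up and are not our results.

```python
from collections import defaultdict


def pairwise_win_rates(judgements):
    """Compute each model's win rate against each opponent.

    `judgements` is a list of (model_a, model_b, winner) tuples, where winner
    is model_a, model_b, or None for a tie (counted as half a win each).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in judgements:
        for model, opponent in ((model_a, model_b), (model_b, model_a)):
            games[(model, opponent)] += 1
            if winner is None:
                wins[(model, opponent)] += 0.5
            elif winner == model:
                wins[(model, opponent)] += 1.0
    return {pair: wins[pair] / games[pair] for pair in games}


# Made-up judgements purely for illustration.
judgements = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_y"),
    ("model_x", "model_y", None),
    ("model_y", "model_x", "model_x"),
]
rates = pairwise_win_rates(judgements)
for (model, opponent), rate in sorted(rates.items()):
    print(f"{model} vs {opponent}: {rate:.3f}")
```

With small sample sizes, such win rates carry wide uncertainty, which is one reason the manual evaluation is best read alongside the automated metric.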



Across the board, our adapted models showed significant improvements in perplexity. This would likely continue to improve with additional data and technical implementations. 

Manual evaluation

We are really excited to announce that our adapted model performed comparably with GPT-4, whilst showing demonstrable progress from Code Llama 🥳.

  • The results show significant improvement from the base Code Llama model

  • Performance is very comparable between GPT-4 & Unakin's adapted Code Llama model, with GPT-4 just edging out our model

  • GPT-4 had slightly more consistent results

This is extremely promising; it's a powerful argument for domain-specific models. We hope that the experiments we currently have in progress will provide further evidence to support this belief.

Side-by-side sample

The example below was not part of the benchmarking, but serves to illustrate our codegen capabilities.

Prompt: 'create a particle system that follows your mouse around.'

Model: ChatGPT

Prompt: 'create a particle system that follows your mouse around.'

Model: Unakin-CodeLlama

Cost-related metrics

An additional benefit of our custom recipe is drastically cheaper fine-tuning potential. 

Compared to the quoted OpenAI fine-tuning costs, our current models would be ~50x cheaper to fine-tune. This means that companies that wish to continually fine-tune models on their latest data are able to do so on a near-constant basis.

What's next?

It’s clear - from 1000s of conversations - that programming is about to change forever. But particularly in more specialized fields, it’s also clear that there is still a lot of work to be done. First and foremost, performance has to be drastically improved before we start seeing daily usage. 

We’re especially pumped about some of our latest experiments in these models, particularly around generating synthetic data (view Unnatural Llama’s results in the table to see why we’re so positive!).

Outside of pure CodeGen, there are entire products and processes to be built around these models still - amplifiers of the core value that LLMs provide. 

We’ve been working on some extraordinarily exciting things for builders around the world, with several products on the horizon. Stay tuned!