It’s the Data Stupid: AI Models and Your Business

I have not been able to stop thinking about this since reading it a few days ago:

This mirrors the conclusions of a paper published by Google DeepMind earlier this month that transformer-based architectures (i.e., all the AI systems that have been getting so much hype lately) just don’t generalize well beyond their training data.

If model architecture doesn’t matter, there are all sorts of implications if you develop AI models or plan to use one in your business anytime soon.

First off, if you want to offer a differentiated AI product with any sort of moat, then you’d better have access to differentiated data. This seems obvious, but I’ve seen so many companies launch lately that are just thin wrappers on top of ChatGPT or that train their own models on some public data set. Those are all doomed to fail or become commodities.

Secondly, since the most critical part of creating AI models is selecting the training data, your organization had better have a good data strategy:

  • Save EVERYTHING. Bits are cheap. When in doubt just record it because it might be useful later.
  • Record it in a sane, consistent format. If the critical path toward developing AI models is selecting the appropriate subset of data, you shouldn’t make data engineers’ lives more painful by having them repeatedly fight to make sense of it. Make it easy for them! This is a bit tangential, but that’s why data lakes have never made any sense to me… they seem more like data cesspools. If someone could give me a good use case for one of those, I’d appreciate it.
  • Version the data. Data is constantly streaming into your organization. If you find a set that produces a great model, you’ll probably want to be able to recreate that model consistently. This comes naturally when you can access specific, older versions of the dataset.
  • Be thoughtful about what data you release in the wild. If data is the key to your competitive advantage, then you probably don’t want to make that data available to your competitors.
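
To make the "version the data" point concrete, here is a minimal sketch of one way to do it: derive a version ID from the contents of the dataset itself, so the same records always map to the same ID. Everything here is an assumption for illustration (the function names `record_digest` and `dataset_version`, the toy records, and the in-memory `registry`); a real setup would likely lean on a dedicated data versioning tool or object-store snapshots instead.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Hash one record in a canonical form so key order doesn't matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_version(records: list[dict]) -> str:
    """Derive a short version ID from the set of record digests.

    Sorting the digests makes the ID independent of record order.
    """
    digests = sorted(record_digest(r) for r in records)
    combined = hashlib.sha256("\n".join(digests).encode("utf-8"))
    return combined.hexdigest()[:12]


# Hypothetical training records, for illustration only.
v1 = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
v2 = v1 + [{"id": 3, "text": "new data streamed in"}]

# A toy registry: version ID -> the exact records that produced it.
registry = {dataset_version(v1): v1, dataset_version(v2): v2}
```

If you log the version ID next to every trained model, you can look up exactly which records went into it later, even after more data has streamed in.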

Third, don’t waste too much time mucking around with the model parameters. This can be a huge cost-saver, since AI engineers’ time is SO freaking expensive. The cost of compute is nothing to laugh at either. I have no doubt that jbekter used at least eight figures’ worth of cloud compute to learn the lesson in the quote at the top of this page. Thankfully, now we don’t have to.

These implications might all change if researchers discover AI architectures that do generalize beyond their training data, but at that point we’re probably all going to be robot sex slaves unless someone figures out the alignment problem, so no need to worry about developing new models.

Until then, happy data wrangling!

2 responses to “It’s the Data Stupid: AI Models and Your Business”

  1. dangdana

    Some supporting evidence for this article is the growing marketplace for datasets, e.g. https://magic.dev/blog/data-bounty

    All that matters is the data AND the representation of the data. Bill Dally (Nvidia) recently spent half of a HW AI trends talk mulling over numerics: https://www.youtube.com/watch?v=kLiwvnr4L80&list=RDCMUCfSiYryINctnCaKe-jilVeA&start_radio=1

    1. Nate

      That’s super interesting. Thanks for the link, Dana!
