Technology: We Ain't There Yet. The WAiTY List is a series of articles on technologies that we shouldn't get too excited about, at least until they can meet certain criteria.
This is a topic I've been grumbling to myself about as I listened to the This Week in Tech podcasts talk about machine learning over the last year or two. In fact, this topic coming back up in This Week in Google's year-in-review for 2022 is what inspired me to sit my currently-unemployed ass down in the chair and start writing these articles, even if this wasn't the one that I started with. (Note: I started this essay in January, but other things came up and I set it aside for a while.)
The specific thing that inspired this grouchiness was the conversation around Blake Lemoine (a Google engineer who said, after a brief discussion with a machine learning language model, that he was convinced it was sentient), and in particular Leo's argument that perhaps LaMDA wasn't doing things too differently from how human beings create language. I know Leo was making a very narrow claim that's easy to misrepresent–he repeatedly said he wasn't even suggesting that LaMDA was sentient or anything similar–but it still got me grumpy, in the way that all the WAiTY topics get me grumpy.
Machine learning is deceptively close, but We Ain’t There Yet, and I suspect that we won’t be until we learn how to give machine learning an exhaustive database of internal state, a task that we’re not yet really qualified to attempt.
To explain what I mean, I'll take you on a short journey and then use it to illustrate my point. I'll start with a relevant example: a blog post on AI Weirdness entitled "Halloween Candy," in which AI humorist Janelle Shane generates (topical, at the time) images of Halloween candy from basic descriptions using Dall-E 2. Many of the images are very good, with some of the most obvious faults being in the writing; some of the images, you could convince yourself, really are pictures of knock-off-brand candies like "Skite," "Tvizzles," or "Thix," but an unfortunate number of them have things that barely look like letters, let alone the correct letters in the correct order to spell familiar words.
Now, Dall-E is still a relatively new technology, and it's fantastic for its age and for our relatively poor understanding of what we're trying to accomplish, but it's also clearly trying to achieve very poorly defined goals. Strictly speaking, Dall-E 2 is a diffusion model rather than a GAN–a generative adversarial network–but the dynamic is similar: the system is constantly scoring whether the image fits the description given. As long as it ends up depicting whatever was written, that's good enough–right? Well, yes and no, but importantly, that's not how human beings operate, and there's a reason why humans don't operate that way. And, yes, there's a connection from this back to why Leo's responses in TWiG made me grouchy.
The second step I’d like to take on this short little journey is AI Dungeon, a text-only “adventure” that lets you input any response to a prompt, and a text transformer will… it would be generous to say it responds appropriately to your input, in the style of classic text-based adventure games. It will try to keep some momentum and ignore some inappropriate actions, but broadly speaking, it has neither imagination nor proper state, and is simply trying to say something that looks correct, given the last few prompts and responses. But if, in your adventure with orcs and lizard people, you say that you punch the sun… then you punch the sun, no matter how impossible that may be under the circumstances. And perhaps, for example, it explodes, leaving you in nothing but painful light. But then, that may also not be true, if you decide it isn’t and lead the game in that direction.
What is important here is the fact that the (being generous) AI-powered "game" has no sense of state. If you ask it to keep track of your character's inventory–a staple of more programmatic text adventure games–you'll find it can't, not over time or in response to events. If you expect it to notice when major things happen, like the death of a dragon or the explosion of a sun, you'll get… decent odds, perhaps, that massive, world-changing events have made some impact on its output, but better than even odds that it will simply have generated some new inventory list that it thinks is entirely appropriate to the circumstances, and which may or may not look like the old one at all.
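For contrast, here's a minimal sketch of the kind of explicit state a classic programmatic text adventure keeps–state that changes only in response to events, and persists no matter what text comes next. Every name here is my own invention for illustration, not anyone's actual engine:

```python
# Minimal sketch of explicit game state, the kind the ML "game" lacks.
# Inventory and world facts change only when an event says so.

class GameState:
    def __init__(self):
        self.inventory = {"torch": 1, "gold": 10}
        self.world_facts = {"sun_exists": True}

    def apply_event(self, event, **details):
        if event == "pick_up":
            item = details["item"]
            self.inventory[item] = self.inventory.get(item, 0) + 1
        elif event == "drop":
            item = details["item"]
            if self.inventory.get(item, 0) > 0:
                self.inventory[item] -= 1
        elif event == "sun_explodes":
            # A world-changing event leaves a permanent mark on state.
            self.world_facts["sun_exists"] = False

state = GameState()
state.apply_event("pick_up", item="sword")
state.apply_event("sun_explodes")
# Whatever prose gets generated next, the sword and the dead sun persist.
```

A text transformer has nothing like this; its only "memory" is whatever happens to survive in the last few prompts and responses.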
This is ultimately similar to the candy image generator's inability to generate appropriate labels. Turning text into graphics is a task best suited to a machine with specific internal state–that is to say, a machine that has already decided what is being written and where, with algorithms specifically written to turn that internal state into the requested graphic. That doesn't necessarily mean it can't be done with modern AI; however, when that isn't a design consideration in the training model, you can't reliably generate the text you want–the text you choose–from a description.
Ultimately, any ML-based AI is only trying to do something–anything–appropriate to the prompt. However, machine learning prompts as they exist today are deliberately informal, translating natural language into appropriate results. This conceals the general truth that any sufficiently advanced discipline uses technical language, and anyone trained in the use of that technical language will be able to express more, and more precisely, than an outsider using natural language.
As a trivial example of technical language, let's talk about a retail pack of M&M's candy. It takes only one word in that description to imply a number of technical requirements: retail packaging implies specific labels that we as consumers have grown used to, such as a net weight label, ingredients list, nutrition facts, barcode, best-by date, copyright or trademark information, and the address of the packaging or distribution center, plus the various things added to entice consumers, such as recipes, flavor descriptions, coupons, website URLs, and so on. If you wanted to create highly accurate fake retail packaging, you would need to know which of these details are necessary, and how to generate each one correctly. Generating a fake ingredients list or nutrition facts chart is complicated enough that it would be entirely unreasonable to expect an image-specific generation algorithm to succeed at the task.
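To make that one-word expansion concrete, here's a toy sketch of what unpacking "retail" into a checklist might look like. The field names are hypothetical, chosen for illustration:

```python
from dataclasses import dataclass, field

# Hypothetical spec for what the single word "retail" implies.
# Each entry is a requirement an image generator would need to satisfy.

REQUIRED_LABELS = [
    "net_weight", "ingredients", "nutrition_facts",
    "barcode", "best_by_date", "trademark", "distributor_address",
]

@dataclass
class PackagingSpec:
    product_name: str
    labels: dict = field(default_factory=dict)

    def missing_labels(self):
        return [name for name in REQUIRED_LABELS if name not in self.labels]

spec = PackagingSpec("M&M's")
spec.labels["net_weight"] = "47.9 g"
spec.labels["barcode"] = "040000000327"
# Everything still missing is a spot where a naive image
# generator will produce letter-shaped gibberish.
```

A human forger works from a checklist like this, consciously or not; today's image generators don't.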
Fine, Let’s Get Complicated Then
To get to the point more directly, now: although I'm not an insider to the AI industry, I believe (as much as anything, from first principles) that machine learning as we have it today is foundationally wrong, starting from the assumption that natural language describes images and requests fully. I believe that no single machine learning algorithm will accomplish the goal successfully, and I do not expect that we will continue looking at single-algorithm models in the long term.
Instead, as I said before, we are ultimately going to be concerned with AI that works on an exhaustive database of internal state. What form that will take is a highly technical question, and we may not be able to fully answer it for years, possibly decades, but I will posit the following as a start: machine learning must do what the human mind does, and split itself into parts and pieces, each specific to a task or a subsection of a task.
As part of that, I would argue, advanced AI requires at least three forms of state: knowledge, decisions, and analyses. Knowledge in this context is perhaps the most familiar form of state; it describes facts, and that alone is such an exhaustive list to generate that it takes human children decades of constant effort. But even knowing what the knowledge graph should look like for an AI is difficult and confusing; given (as I’ll describe a bit below) that machine learning focuses so heavily on analysis, ideally knowledge should be a combination of summary and detail, such that you can analyze in summary and get approximately the same results as if you analyzed in detail, while taking less time and resources in the moment. Updating the details may at times require updating the summary, and vice versa, but not always, and…
Well, suffice it to say, you could write a Ph.D. thesis on exactly what an actual operational knowledge graph should look like, let alone how to implement it, but let's move on.
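Still, one way to picture that summary-plus-detail idea is a node whose cheap summary can stand in for its expensive detail most of the time. This is purely illustrative–nothing here reflects a real knowledge-graph design:

```python
# Illustrative knowledge node: a fast, approximate summary backed by
# slow, exhaustive detail. Analysis consults the summary first and
# only descends into detail when the question demands it.

class KnowledgeNode:
    def __init__(self, summary, details):
        self.summary = summary
        self.details = details

    def answer(self, question, depth="summary"):
        if depth == "summary":
            return self.summary.get(question)
        # Detail-level lookup, falling back to the summary if needed.
        return self.details.get(question, self.summary.get(question))

mayor = KnowledgeNode(
    summary={"accent": "Boston"},
    details={"accent": "Boston", "age": 52, "term_started": 1846},
)
```

The hard part the thesis would cover is keeping the two layers consistent as either one changes; this sketch conveniently ignores that.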
Analytical state is something that we are just coming to grips with now in machine learning. In a generative adversarial network (GAN), a discriminator determines whether the generator is achieving the task set for it, and this analysis is used, with a quick turnaround, to fine-tune the actions of the machine, sometimes on a microsecond-by-microsecond basis. It compares the declared intent, the current progress, and the hypothetical progress if various versions of an algorithm were used, or various noise were added to the data–whatever. That hypothetical progress is analyzed, and the result deemed closest to the declared intent becomes the current progress, onto which the algorithms will iterate once again in the next processing cycle.
Which brings us to the need for decision state, something we are currently missing. If today I asked an image generator to show me a picture of retail candy packaging, it would create something through differential analysis of raw chaos until it arrived at something that looks, roughly, like retail candy packaging. If I gave it a highly specific prompt about said candy packaging, including the name of the candy, its net weight, the copyright mark, what numbers the barcode should decode to, the ingredients, and the nutritional facts… well, I would not expect the image generator to properly create that image, but crucially, you could imagine someone training an AI to perform that highly specific task, if they were extremely dedicated to forging fake candy wrappers.
Ultimately, if what you want is to ask an AI to, in so many words, “generate a fake label for a packet of M&M’s” and receive an image good enough to survive more than a passing glance, it first must parse that request into pieces and generate a much better request, using technical language to ensure that the natural-language prompt is successfully completed while knowing everything that is required. This task–translating one request into another–is still not necessarily decision state; it is analysis of the prompt matched against the knowledge state of the machine.
Instead, once that more technical request is created, in order to create a consistent result, yet another machine learning pass must be performed, creating a decision state to which all subsequent generated content must conform. This decision state is not knowledge, and although it is the result of analysis (and, where necessary, random generation), it is a qualitatively different layer of data from the iteration-by-iteration analytical states of the machine.
Only once the decision state is created can we begin generating a final result that conforms to three layers of constraints: the original prompt, the technical-language version of that prompt based on known requirements, and the decision state of the machine. In my example, the "fake label for a packet of M&M's" request must generate an image that shows, for example, an ingredients list and nutrition information that conform to standards. That requires the image to show specific text with specific formatting, and there can be no glitchy letter-ish things mixed in, even if–by the image generator's analysis–they look correct. Ideally, if you sit down and read the nutrition facts, they will be plausible given the ingredients and stated net weight of the packet, even if they are ultimately randomly generated.
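The three layers might stack something like this toy pipeline: natural-language prompt, then technical request, then decision state, and only then generation. Every function and field name here is hypothetical:

```python
import random

# Toy sketch: prompt -> technical request -> decision state -> generation.
# Generation must stay consistent with all three layers above it.

def to_technical_request(prompt):
    # Analysis against knowledge state: "retail packet" implies these parts.
    return {"product": "M&M's",
            "required": ["net_weight", "ingredients",
                         "nutrition_facts", "barcode"]}

def make_decisions(request, rng):
    # Decision state: concrete values all later output must conform to.
    return {"net_weight": f"{rng.randint(40, 50)} g",
            "barcode": "".join(str(rng.randint(0, 9)) for _ in range(12))}

def generate(prompt, request, decisions):
    # The generator can no longer improvise the weight or barcode;
    # those were fixed by decision state before rendering began.
    return (f"[image: {request['product']} packet, "
            f"net wt {decisions['net_weight']}, "
            f"barcode {decisions['barcode']}]")

rng = random.Random(0)
prompt = "generate a fake label for a packet of M&M's"
request = to_technical_request(prompt)
decisions = make_decisions(request, rng)
label = generate(prompt, request, decisions)
```

The values are randomly generated, but they are generated once, up front, and then treated as fixed–which is exactly what today's single-pass generators don't do.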
Taking Things in a Different Direction
We can similarly look at the earlier example of a ML-powered text adventure game. In this case, the knowledge and analysis graphs are very different from those used to create images; we are specifically looking at a text-only rendition of places, things, and actions, with some reasonable connection between actions and consequences. However, while today’s AI Dungeon game (and others like it in the future) may only be capable of spewing out pseudo-random responses to your request, if we were going to attempt to create a full ML-powered tabletop game master replacement, it would need to have exhaustive knowledge and also keep a long and detailed list of decision state.
In this case, in fact, decision state becomes a much more complicated and arguably much more interesting part of the process. The AI Game Master must keep track of a lot more state, and specifically, it doesn’t have the luxury of disposing of decision state when a single process is completed. If you leave your house and return to it after a year of adventuring, barring some traumatic plot hook in which your home village is wiped out by orcs, you should reasonably expect that house to look the same when you return as it did when you left, and if the mayor of your hometown talks with a distinct Boston accent, that should not switch to French just because you walked away and came back. But what is allowed to change? If the mayor is replaced, there’s no reason the replacement must also talk with a Bostonian accent, unless everyone in the village does.
In the case of an evolving world such as this, decision state doesn’t just encompass objects and their properties, but relationships and rules of existence. Some of this is arguably knowledge and not decisions, but there are distinctions: in a small town, the replaced mayor might have the same accent, but in a large city, especially a melting pot, it wouldn’t be surprising that the new mayor is nothing like the old one.
In this kind of task, and more broadly, decisions determine which specific rules are in play, whether those rules are objective and external or generated and unique to the situation. A real game master draws not only on their extensive knowledge of humanity, physics, and role-playing game mechanics, but also on their understanding of the specific circumstances, both stated and unstated, in order to know what the next words out of their own mouth should be. In order for an AI to step into the same shoes, it must know what kinds of rules can be in play, then decide which rules are in play, and then make sure that its response to any given prompt is consistent with the current rules. And once an action is taken, it must update its own internal state database, which may change the rules in play–going from peaceful negotiation to combat, for instance.
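That rules-in-play loop could be sketched roughly like this, with the peaceful-to-combat transition from the example above (all names hypothetical):

```python
# Sketch of a game-master loop: consult the rules in play, respond
# consistently with them, update state, which may change the rules.

RULESETS = {
    "peaceful": {"combat_allowed": False},
    "combat": {"combat_allowed": True},
}

class GameMaster:
    def __init__(self):
        self.mode = "peaceful"

    def rules_in_play(self):
        return RULESETS[self.mode]

    def respond(self, action):
        if action == "attack" and not self.rules_in_play()["combat_allowed"]:
            # The action changes the world: negotiation breaks down,
            # and a different ruleset governs every response from here on.
            self.mode = "combat"
            return "Negotiations collapse; swords are drawn."
        if self.rules_in_play()["combat_allowed"]:
            return "The fight continues."
        return "Talks continue peacefully."

gm = GameMaster()
first = gm.respond("attack")
second = gm.respond("attack")
```

The same input produces different responses on the two calls–not randomly, but because the first call changed the decision state that governs the second.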
Too Late to Make This Short
Consistent results from a machine learning algorithm are going to take vastly, vastly more work than we have yet done, and we will have to adapt what we've created in ways we're not yet ready to attempt. The foundational logic loops that unite knowledge, decisions, and analyses are not yet set in stone, and I don't suspect that, like the Hitchhiker's Guide's Infinite Improbability Drive, we'll be able to just plug the problem into an ML algorithm and have the perfect result pop out–or at the very least, not yet.
All of which is to say that machine learning isn't yet anything like what human beings do when we talk or write. Our process requires a careful balance of multiple factors, some of which we have barely begun to analyze or replicate. And while I forgive Leo–a talk show host–for making a rough summary of a complex topic in order to get people thinking, I fully believe that he was wrong, and that we're far from achieving the real, desirable results of machine learning.
To summarize the summary of the summary, We Ain’t There Yet, so ML remains on the WAiTY list for a while yet.