Technology: We Ain’t There Yet. The Waity List is a series of articles on technologies that we shouldn’t get too excited about, at least until they can meet certain criteria.
I have written about this on my old blog, now defunct, but I was inspired listening to This Week in Google‘s year in review for 2022 to create a series of articles, and since I already know what I have to say about this one, it seemed like an adequate place to start.
I am of the opinion that augmented reality won’t come of age without certain technologies–technologies that, if you were simply to already have them, would be a utility in and of themselves. Consider the following example: a pair of augmented reality glasses, which by means of calibrated eye-facing and front-facing cameras, knows what you are looking at, outlines it in the display, and displays next to it what the object is, how far it is from you (including vertical distance, and angle above or below the horizon), its speed relative to you, and (if applicable) how long before it reaches you. Or, to summarize the summary, the most basic of the things that Iron Man’s JARVIS computer HUD is shown doing in the movies.
If you had a pair of goggles that did absolutely nothing other than this, it could be considered a utility. Notably, the ability to tell exactly how far something is away from you is a professional skill for, say, surveyors, and it would be handy when driving to be able to put numbers on exactly how far you are behind a car, and whether that distance is opening or closing. If the identification database was good enough, it could be a utility for everything from birdwatchers and botanists to, yes, thieves and stalkers, who might be interested in finding out exactly what or who just happens to be in their field of view.
But those utilities are based on fundamental tools, many of which are already in development and some of which are relatively mature. Eye tracking, for instance, has been studied for a long time, and while there aren’t a lot of commercially available utilities, in my admittedly limited understanding there is a lot of academic and closed-door research being done on what can be learned from your eyes. However, that is just a piece, and that piece needs to be integrated; once you have an angle from each eye and can pinpoint where (in 3D space ahead of the glasses) you are looking, the next and seemingly trivial thing to do is put those 3D rays into context with cameras (and/or other sensors).
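As a rough sketch of that "seemingly trivial" step, here is one way to turn two gaze rays into a single 3D point: find the closest approach between the rays and take the midpoint. Everything here is an assumption for illustration, including the function name `fixation_point`, the coordinate frame (meters, origin between the eyes, z pointing forward), and the idea that the eye tracker hands you a clean origin and direction per eye. Real gaze data is noisy, so the two rays almost never truly intersect.

```python
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


def _add(a: Vec, b: Vec) -> Vec:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _scale(a: Vec, s: float) -> Vec:
    return (a[0] * s, a[1] * s, a[2] * s)


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def fixation_point(left_origin: Vec, left_dir: Vec,
                   right_origin: Vec, right_dir: Vec) -> Optional[Vec]:
    """Midpoint of the shortest segment between two gaze rays.

    Returns None when the rays are (nearly) parallel or when the closest
    approach lies behind either eye, i.e. the eyes are not converging.
    Directions do not need to be normalized.
    """
    w0 = _sub(left_origin, right_origin)
    a = _dot(left_dir, left_dir)
    b = _dot(left_dir, right_dir)
    c = _dot(right_dir, right_dir)
    d = _dot(left_dir, w0)
    e = _dot(right_dir, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel gaze: looking at "infinity"
    t_left = (b * e - c * d) / denom
    t_right = (a * e - b * d) / denom
    if t_left <= 0 or t_right <= 0:
        return None  # closest approach is behind the head
    p_left = _add(left_origin, _scale(left_dir, t_left))
    p_right = _add(right_origin, _scale(right_dir, t_right))
    return _scale(_add(p_left, p_right), 0.5)


# Hypothetical example: eyes 64 mm apart, both aimed at a point 2 m ahead.
point = fixation_point((-0.032, 0.0, 0.0), (0.032, 0.0, 2.0),
                       (0.032, 0.0, 0.0), (-0.032, 0.0, 2.0))
```

Note how quickly this gets fragile: with eyes only a few centimeters apart, the two rays are almost parallel for anything far away, so a small angular error in either eye turns into a large error in depth. That is part of why the gaze point needs the cameras and depth sensors for context.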
At this point, we run into another, seemingly unrelated technical problem, which is nonetheless critical for augmented reality: whether with one camera, two cameras, lidar, ultrasound, or any other method, you have to parse the view in front of the glasses, with an eye towards depth, as it were. Knowing how far things are from you isn’t just part of the utility I set us the task of creating; it’s also critical to parsing what you see in any meaningful way. While we can use techniques in computer vision to parse a single 2D image and get a lot of information about what objects are, the task of identifying things semantically, that is to say, understanding what you are looking at and not merely that you are looking at it, necessarily requires a very large database and a large amount of processing that should, by all rights, not be done on device even were we to advance technology a good 50-100 years from where we are; if you can narrow down that semantic search by separating foreground and background, that searching can at least be done piece by piece, or you can simply ignore everything but the object of interest. In either case, this parsing of depth information can be done either on-device or on the nearest convenient processor (read: cellphone or edge device) for rapid results.
Along the way, these depth maps can (but do not necessarily) help with another task I set the glasses to: outlining the object you are looking at. While machine learning algorithms nowadays can do things like find and outline people or things, unless you have an exceptionally well-trained and highly specific algorithm, I suspect you’ll find it doesn’t have the performance necessary to refresh that outline as quickly as the glasses need it to, whenever your head moves slightly or the thing you’re looking at moves around. Assuming the computer vision algorithm can isolate things at similar depth, we can once again send only the relevant data off to processing and (hopefully) produce that highlight quickly enough that a 120 Hz (or similar) display is updated on time, every frame.
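To make "isolate things at similar depth" concrete, here is a minimal sketch, assuming the glasses already produce a per-pixel depth map in meters and a gaze pixel to start from. It flood-fills outward from the gaze pixel and keeps connected pixels whose depth is close to the starting depth. The function name `region_at_similar_depth` and the 0.25 m tolerance are made up for illustration; this is a plain flood fill, not a claim about how any real headset does it.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def region_at_similar_depth(depth: List[List[float]],
                            seed: Pixel,
                            tolerance_m: float = 0.25) -> Set[Pixel]:
    """Collect the 4-connected pixels around `seed` whose depth is within
    `tolerance_m` of the seed pixel's depth (a simple breadth-first flood fill).
    """
    rows, cols = len(depth), len(depth[0])
    seed_depth = depth[seed[0]][seed[1]]
    region: Set[Pixel] = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(depth[nr][nc] - seed_depth) <= tolerance_m:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Hypothetical 4x4 depth map: an object about 2 m away against a 9 m wall.
depth_map = [
    [9.0, 9.0, 9.0, 9.0],
    [9.0, 2.0, 2.1, 9.0],
    [9.0, 2.0, 2.0, 9.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = region_at_similar_depth(depth_map, seed=(1, 1))
```

Only the pixels in `mask` would need to be cropped and sent off for identification, and the boundary of `mask` is a first guess at the outline. At 120 Hz the whole pipeline has roughly 8.3 milliseconds per frame, which is why shrinking the work this way matters.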
However, that assumes that the object you’re highlighting is all at a uniform or near-uniform depth. What if you are standing in front of the Taj Mahal, and your glasses (correctly) separate the nearest and furthest towers as being in separate depth planes, but the image parsing (correctly) labels all four of them, plus the central building, as being all part of the same iconic landmark? Correctly outlining the entire building in 3D space, especially if (for example) you were on a helicopter ride making a circular path around it, is a complex task that requires identifying a continuous structure across depth planes, or else using a more expensive machine-learning algorithm to brute-force the solution from the full video feed every frame.
Suddenly this task of highlighting the object you’re looking at, which sounds so simple in concept, is one of many areas where We Ain’t There Yet–and this is only one part of the proof-of-concept utility glasses.
Fine, then; let’s set aside the task of highlighting and get back to the glasses. Let’s even assume that we are close enough to that goal that we can identify what you’re looking at, and have enough object permanence that if we walk in a circle around the Statue of David, we can recognize it even when looking at it from the angles where it’s not typically photographed–we can even trust that, because we are constantly judging the depth, we will know that we’re still looking at the Statue of David even if one were to walk really, really close to the statue, such that the only thing in your glasses’ field of vision is… the statue’s statuesque curly hair, sculpted abs, or its intricately detailed marble toes. We can just assume for the moment that we have that technology.
For now, imagine that you are staring at a mountain. Suppose it’s not a particularly interesting mountain; it’s not Everest, Fuji, Kilimanjaro, or even Eyjafjallajökull, but it is your own friendly local mountain, and it has a name, darn it. This is still within the range of feasibility; machine learning can pick out something that looks like a mountain, and with GPS, a bearing, a distance, and a friendly mapping application, you can make somewhere between an educated guess and a definitive statement about exactly which mountain you’re looking at. But, hold on–are you looking at the mountain, the forest on the mountain, a tree in the forest, a bird in the tree, or the worm in the bird’s mouth? What if by accident, while looking at the mountain, your gaze just so happens to linger on a worm on a bird in a tree in a forest that you weren’t even particularly interested in? What should your utility glasses tell you you’re looking at? And if they tell you that you’re looking at the mountain, what distance do they report–the distance to the worm, to the mountain directly under the worm, to the mountain behind the worm, to the mountain peak, or to the closest edge of what can be considered the mountain, geographically?
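The "GPS, bearing and distance" part, at least, is well-understood geometry. Here is a minimal sketch, assuming a spherical Earth and ignoring elevation, of the standard great-circle destination formula: given where you stand, which compass direction you are looking, and how far away the thing is, compute the latitude and longitude you are looking at. The name `project_point` is hypothetical, and the mapping application that turns that coordinate into a mountain's name is left entirely to the imagination.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation


def project_point(lat_deg: float, lon_deg: float,
                  bearing_deg: float, distance_m: float):
    """Return (lat, lon) in degrees reached by travelling `distance_m` from
    the start point along the initial compass bearing `bearing_deg`
    (0 = north, 90 = east), following a great circle on a sphere.
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    bearing = math.radians(bearing_deg)
    angular = distance_m / EARTH_RADIUS_M  # distance as an angle at Earth's center

    lat2 = math.asin(math.sin(lat1) * math.cos(angular)
                     + math.cos(lat1) * math.sin(angular) * math.cos(bearing))
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(angular) * math.cos(lat1),
        math.cos(angular) - math.sin(lat1) * math.sin(lat2),
    )
    # Wrap longitude back into the range [-180, 180).
    lon2_deg = (math.degrees(lon2) + 540.0) % 360.0 - 180.0
    return math.degrees(lat2), lon2_deg


# Hypothetical reading: standing at 45 N, 110 W, looking due north at
# something the depth sensing says is 12 km away.
target_lat, target_lon = project_point(45.0, -110.0, 0.0, 12_000.0)
```

The math is the easy part. The hard part is the question the paragraph above ends on: the formula needs one distance, and the mountain, the tree, the bird, and the worm each offer a different one.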
Okay, okay–we can be less picky. The glasses don’t have to be perfect, or even really intelligent; they’ll just report the distance to whatever you’re looking at, even if it’s a bird. If you want to know the distance to the peak, look at the peak; if you want to know the distance to the bird, look at the bird. Let’s call that part good when it can just name the mountain that you’re looking at, or the forest if your vision seems to be only and always looking at the forest, or a specific tree if you end up staring at that one tree for a long time. What else?
Oh, right–relative speed, closure rates, and estimated time of arrival. That’s an interesting question, and one that seems easy enough in principle–until you realize that those visual cues are either complicated or subtle. For example, you are standing in a scorching desert, walking backwards with your thumb out in classic hitch-hiker pose, looking down a long straight stretch of asphalt at the ruby-red sportscar that you’re sure must be really booking it. And yet, even if it were going eighty miles an hour, if it’s still ten miles out, it will continue to look very small for a very long time. It won’t really start giving you a visual indication of the distance, and therefore not much about the closure rate, until it gets fairly close–and then it may pass you in a flash, the self-centered driver completely uninterested in picking up a sweaty, sandy hobo along a hot desert road.
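A little arithmetic shows why the sportscar is such a poor visual cue. This is a minimal sketch, assuming a car about 1.8 meters wide (my number, not a measurement) and the figures from the story: ten miles out, eighty miles an hour. The function names are made up for illustration.

```python
import math

METERS_PER_MILE = 1609.344


def angular_width_deg(object_width_m: float, distance_m: float) -> float:
    """Apparent width of an object, in degrees, seen face-on at a distance."""
    return math.degrees(2.0 * math.atan(object_width_m / (2.0 * distance_m)))


def time_to_arrival_s(distance_m: float, closing_speed_mps: float):
    """Seconds until the object reaches you at a constant closing speed.

    Returns None if the object is not closing (speed zero or negative).
    """
    if closing_speed_mps <= 0:
        return None
    return distance_m / closing_speed_mps


car_width_m = 1.8                         # assumed width of the sportscar
far_m = 10 * METERS_PER_MILE              # ten miles out
near_m = 1 * METERS_PER_MILE              # one mile out
speed_mps = 80 * METERS_PER_MILE / 3600   # eighty miles an hour

far_angle = angular_width_deg(car_width_m, far_m)    # roughly 0.006 degrees
near_angle = angular_width_deg(car_width_m, near_m)  # roughly 0.06 degrees
eta_s = time_to_arrival_s(far_m, speed_mps)          # about 450 s, i.e. 7.5 minutes
```

For the first nine of those ten miles the car grows from about six thousandths of a degree to about six hundredths, which is a tenfold change on paper and next to nothing on a camera sensor. Any closure rate estimated from apparent size alone has very little signal to work with until the last stretch.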
Granted, there are ways to help this. If you can see the car from ten miles out, presumably you can see most or all of the ten miles of road in between, perhaps because of a gentle slope in the road–and again, with GPS, distance, and bearing, you might be able to drop an approximate map pin on the spot of road you’re looking at, even as that spot (and the car it represents) moves. This, in theory, might give you a good idea of whether the person you’re looking at is doing 50, 80, or 120 miles per hour down the road, and whether or not they seem to be slowing down as they approach you–and knowing whether or not they seem to be slowing might help you make a decision between getting off the road or stepping further out into it with a plaintive, helpless look on your face, in the hopes that they will take pity and give you a ride, or at least throw a bottle of water at your face as they pass by. Perhaps, even if the car doesn’t slow, if you turn your head at just the right moment, the glasses will catch the license plate number of the selfish prick, enabling you to take actions which break the law and which I do not endorse and have definitely never performed.
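The map-pin idea can be sketched in a few lines, assuming the glasses can already produce a timestamped latitude and longitude for the spot of road the car occupies. The sketch uses the haversine formula for the distance between consecutive pins and divides by the elapsed time. The pin format `(lat, lon, timestamp_s)` and the function names are assumptions for illustration, and the sample pins are invented.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation
METERS_PER_MILE = 1609.344


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def speeds_mph(pins):
    """Speed in miles per hour between each pair of consecutive pins.

    Each pin is a (lat_deg, lon_deg, timestamp_s) tuple, oldest first.
    """
    speeds = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(pins, pins[1:]):
        dt = t2 - t1
        if dt <= 0:
            raise ValueError("pins must be in increasing time order")
        meters_per_second = haversine_m(lat1, lon1, lat2, lon2) / dt
        speeds.append(meters_per_second * 3600 / METERS_PER_MILE)
    return speeds


# Invented pins, 30 seconds apart, for a car heading east along the equator.
pins = [(0.0, 0.000, 0.0), (0.0, 0.010, 30.0), (0.0, 0.018, 60.0)]
readings = speeds_mph(pins)
slowing = readings[-1] < readings[0]
```

The catch is that each pin inherits every error upstream of it: a small error in bearing or distance at ten miles moves the pin by a long way, so a real version would need to smooth over many pins before calling anyone a speeder.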
The other side of this fantastical and definitely not auto-biographical narrative* is the idea of getting a good relative velocity reading on a more complicated object in flight–say, a football tumbling end over end as it spins directly towards your head and the expensive hardware resting thereupon, or perhaps a bone-shaped dog toy being tossed your way by your stepsister, or perhaps your stepsister’s hundred-and-fifty pound Tibetan mastiff, eagerly chasing after said dog toy. Each of these will have a different and shifting profile as they approach, and distance estimates based entirely on the depth-sensing might be tricked temporarily as the ball, toy, or canine appears, visually, to grow, shrink, or stay in place due to some trick of geometry or hair fluff. Accounting for all the various factors will require advanced knowledge of optical illusions, both the classical varieties and new ones that will be created based purely on the specifics of how the computer vision algorithms are run.
In any case, while we may have or be close to having many of these techniques, we still have to place the images onto the augmented reality display with a high enough refresh rate that they trick the eyes into seeing them as a part of your surroundings–but also, with enough precision that they look natural. This part specifically is about outlining what you’re looking at, and it’s one of the reasons why it’s worth adding that to the list–imagine, for example, that due to sweat, your augmented smart glasses slip just an eighth of an inch down your nose. If the smart displays are static, and assume that they are positioned just so on your face, the sudden difference in how the glasses are positioned will throw off the outlining by an equivalent amount, which can seem quite large indeed. Better, I assume (without having done millions of dollars of research and development) would be to align the images on the smart displays using data generated from the eye-facing cameras–yet another task that must be performed once every frame, just in case the subtle vibrations of the helicopter you’re riding for your tour around the Taj Mahal end up shaking your glasses. You wouldn’t want your technology to be wrong even for a fraction of a second, after all!
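To put a rough number on the slipping-glasses problem, here is a deliberately naive sketch. It assumes the overlay is drawn on a flat display sitting about 20 mm in front of the eye and that nothing compensates for movement; both the 20 mm figure and the function names are assumptions of mine. Real augmented reality optics typically project a virtual image much farther away, which shrinks the error from a pure slide, so treat this as an upper-end illustration rather than a measurement.

```python
import math

INCH_M = 0.0254


def overlay_error_deg(slip_m: float, eye_to_display_m: float) -> float:
    """Angular misalignment, in degrees, for a distant object when a display
    at `eye_to_display_m` from the eye slides by `slip_m` (naive flat-display
    model with no optical or software correction)."""
    return math.degrees(math.atan(slip_m / eye_to_display_m))


def apparent_offset_m(slip_m: float, eye_to_display_m: float,
                      object_distance_m: float) -> float:
    """How far the outline appears to miss an object at `object_distance_m`,
    by similar triangles under the same naive model."""
    return object_distance_m * slip_m / eye_to_display_m


slip = INCH_M / 8          # an eighth of an inch, about 3.2 mm
eye_relief = 0.020         # assumed 20 mm from eye to display

error_deg = overlay_error_deg(slip, eye_relief)       # roughly 9 degrees
miss_m = apparent_offset_m(slip, eye_relief, 2.0)     # about 0.32 m at 2 m
```

Under those assumptions, an eighth of an inch on your nose becomes an outline that misses an object two meters away by about a foot, which is why re-measuring the eye's position against the display every frame looks less like a luxury and more like a requirement.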
But in truth, we end up back at the beginning–those tools, built into a pair of glasses, are all just the building blocks on which to build other augmented reality tools. Imagine these same glasses letting you place a virtual Taj Mahal on the table next to your mound of clay, or perhaps mashed potatoes, letting you visualize the artwork you are about to create. As you study the august Indian architecture, you decide it needs flavor, and so you add a Tibetan mastiff chasing the early bird that caught the worm, as his owner sits watching from the open gull-wing door of their ruby red sports car. You get to work on your mashed potatoes, sculpting one intricate detail after another with your fork, only briefly glancing up from your actual artistry to see the virtual scene that you’ve laid out a paltry foot away.
This trivial–trivial–bit of augmented reality requires much of the same technology that the aforementioned glasses have, unless you are really okay with just half-assing the result. You need to be constantly understanding what you are looking at in order to provide a consistent illusion that your “jerk sister’s dumb dog chasing a bird in the Taj Mahal” is a physically present object, and if you were–for example–to provide an extra layer of augmented reality, and provide comparisons between your Close Encounters copy and the real thing for the purposes of self-improvement or perhaps art criticism, you would need to have enough fidelity to turn what you’re seeing into a full 3D model, or at least a point cloud, in order to give relevant feedback such as the dog being out of position, or the Taj Mahal having an incorrect slope to its roof.
The more tools you can build, the better the platform will be, but perhaps more importantly, each of the tools that you build in order to make a platform can and should be a selling point in and of itself. In the early days of computing, basic tools like calculators, notepads, spreadsheets, calendars, and even clocks were common applications despite being extremely basic, and all because those basic tools were, in fact, valuable. We’re in an era now where better versions of all the basic tools are available on computers and phones, and so I think we may have lost that basic vision of what makes a platform into a platform, even though it’s perhaps one of the most important aspects.
If you don’t know what good you can do with a tool, why on earth are you asking people to give you money for it?