Monday, April 5, 2010
I talk a lot about the integral relationship between compression and intelligence. Here are some simple methods. We will talk of images but images are not special in any way (just easier to visualize). Recognizing pattern in an image is easier if you can't see very well.
Blur your eyes and you vastly reduce the information that has to be processed. Garbage in, brilliance out!
Do this with every image you want to compare. Make copies and blur them heavily. Now compress their size down to a very small bitmap (say 10 by 10 pixels) using a pixel averaging algorithm. Now convert each to grey scale. Now increase the contrast (about, 150 percent). Store them thus compressed. Now compare each image to all of the rest: subtract the target image from the compared image. The result will be the delta between the two. Reduce this combined image to one pixel. It will have a value somewhere between pure white (0) and pure black (256), representing the gross difference between the two images. Perform this comparison between your target image and all of the images in your data base. Rank and group them from most similar to least.
Now perform image averages of the top 10 percent matches. Build a graph that has all of the source images at the bottom, the next layer is the image averages you just made. Now perform the same comparison to the 10 percent that make up this new layer of averages, that will be your next layer. Repeat until your top layer contains two images.
Once you have a graph like this, you can quickly find matching images by moving down the graph and making simple binary choices for the next best match. Very fast. If you also take the trouble to optimize your whole salience graph each time you add a new image, your filter should get smarter and smarter.
To increase the fidelity of your intelligence, simply compare individual regions of your image that were most salient in the hierarchical filtering that cascaded down to cause the match. This process can back-propagate up the match hierarchy to help refine salience in the filter graph. Same process works for text or sound or video or topology of any kind. If you have information, this process will find pattern in it. Lots of parameters to tweak. Work the parameters into your fitness or salience breading algorithm and you have a living breathing learning intelligence. Do it right and you shouldn't have to know which category your information originated from (video, sound, text, numbers, binary, etc.). Your system should find those categories automatically.
Remember that intelligence is a lossy compression problem. What to pay attention to, what to ignore. What to save, what to throw away. And finally, how to store your compressed patterns such that the graph that results says something real about the meta-paterns that exist natively in your source set.
This whole approach has a history of course. Over the history of human scientific and practical thought many people have settled in on the idea that fast filtering is most efficient when it is initiated on a highly compressed pattern range. It is more efficient for instance to go right to the "J's" than to compare the word "joy" to every word in a dictionary or database. This efficiency is only available if your match set is highly structured (in this example, alphabetically ordered). One can do way way way better than alphabetically ordered lists of 3 million words. Lets say there are a million words in a dictionary. If one sets up a graph, an inverted pyramid, where each level where the level one has 2 "folders" and each folder is named for the last word in the subset of all words at that level divided into two groups. The first folder would reference all words from "A" to something like "Monolith" (and is named "Monolith") The second folder at that level contains all words alphabetically larger than "Monolith" (maybe starting with "Monolithic") and is named "Zyzer" (or what ever the last word is in the dictionary). Now, put two folders in each of these folders to make up the second tier of your sorting graph. At the second level you will have 4 folders. Do this again at the third level and you will have 8 folders each named for the last word in the graph referenced in the tiers of the graph above them. It will only take 20 levels to reference a million words, 24 levels for 15 million words. That represents a 6 order of magnitude savings over an unstructured sort.
A cleaver administrative assistant working for Edward Hubble (or was it Wilson, I can't find the reference?) made punch cards of star positions from observational photo plates of the heavens and was able to perform fast searches for quickly moving stars by running knitting needles into the punch holes in a stack of cards.
Pens A and B found their way through all cards. Pen C hits the second card.
What matters, what is salient, is always that which is proximal in the correct context. What matters is what is near the object of focus at some specific point in time.
Lets go back to the image search I introduced earlier. As in the alphabetical word search just mentioned, what should matter isn't the search method (that is just a perk), but rather the association graph that is produced over the course of many searches. This structured graph represents a meta-pattern inherent in the source data set. If the source data is structurally non-random, its structure will encode part of its semantic content. If this is the case, the data can be assumed to have been encoded according to a set of structural rules themselves encoding a grammar.
For each of these grammatical rule sets (chunking/combinatorial schemes) one should be able to represent content as a meta-pattern graph. One of the graphs representing a set of words might be pointers to the full lexicon graph. A second graph of the same source text might represent the ordered proximity of each word to its neighbors (remember the alphabetical meta-pattern graph simply represents the neighbors at the character chunk level).
What gets interesting of course are the meta-graphs that can be produced when these structured graphs are cross compressed. In human cognition these meta-graphs are called associative memory (experience) and are why we can quickly reference a memory when we see a color or our nose picks up a scent.
At base, all of these storage and processing tricks depend on two things, storing data structures that allow fast matching, and getting rid of details that don't matter. In concert these two goals result in a self optimization towards maximal compression.
The map MUST be smaller than the territory or it isn't of any value.
It MUST hold ONLY those aspects of the territory that matter to the entity referencing them. The difference between photos and text: A photo-sensor in a digital camera doesn't know for human salience. It sees all points of the visual plane as equal. The memory chips upon which these color points are stored see all pixels as equal. So far, no compression, and no salience. Salience only appears at the level of where digital photos originate (who took them, where, and when). On the other hand, text is usually highly compressed from the very beginning. What a person writes about and how they write it always represents a very very very small subset of
I am convinced that the secret to developing intelligence (in any substrate, including your brain) lies in the percentage of the data coming in that you are willing (or forced) to toss. Lossy compression is the key to intelligence. Of course there is a caveat… you can't just trash anything and everything.
The first line of the book I am writing about evolution: "What matters is what matters, knowing what matters and how to know it matters the most."
I am convinced that evolving systems can only work towards mechanisms that process salience if they are forced to maximize the amount of stuff they can trash.
If you are forced to get rid of 99.999 percent of everything that comes in, well you will have to get good at knowing the difference between needles and hay and you will have to get good at knowing the difference in a hurry. The "needles and hay" metaphor doesn't map well to what I am talking towards. If the system you are dealing with is so unstructured as to fit the haystack metaphor, you really aren't doing anything I would classify as intelligence. If there is nothing of structure in the haystack you are storing than your compression system should already have tossed the whole thing out.
Many techniques for the filtering of essence, for finding pattern, for storing pattern and for storing pattern of pattern have been developed. The most impressive reduce raw input streams and store pattern from the most general to the most specific as hierarchically stratified graphs.
Being forced to reduce data to storage formats that maximize lossy-ness minimizes necessary storage. But that is just a perk. What really gates intelligence is the amount of a complex system (or map thereof) that can be made proximal to immediate processing. Our brains might be big and mighty, but what really matters is how much of the right parts of what is stored can be brought together in one small space for semi-real-time simulations processing. Information, when organized optimally for maximal storage density, will also be information that is ideally organized for localized serialization and simultaneity of processing.
To think, a system has to be able to grab highly compressed pattern hierarchies and move them into superposition on top of each other for near instantaneous comparison. You can't do this with a whole brain's worth of data, no matter how well organized it is.
Lets say you have to store everything you know about every sport you have ever heard of, and you have to do it in a very limited space. You will be forced to build a hierarchy of grammars in which general concepts shared in every sport (opponents, the goal to win, a set of rules and consequences, physical playing geometries, equipment, etc.), with layers of groupings that allow for the similarities between some sports and so on up to the specifics that are are only present in each individual sport. Keep compressing this set. Always compress. Try all day (or all night) for even more compression. Compress until you can't even get to lots of the specifics any more. Keep compressing. Dump the sports you don't care about. Keep on throwing stuff out.
Now lets say I have some sort of morbid sense of humor and I tell you that you are going to have to store everything you encounter and everything you think about, your entire life, in that same database that you have optimized for sports.
You will have to learn to look for the meta-patterns that will allow you to store your first romance in a structure that also allows you to store everything you know about kitchen utensils and geo-politics and the way the Beatles White Album makes you feel when it is windy outside.
The necessity to toss, enforced by limited storage and an obsession to compress will result in domain-blending salience hierarchies. It is why we can find deep similarities between music and geological topologies. It is why we can "think".
For years people have tried to come up with the algorithms of thought. What we need instead is to build into our artificial systems, a very mean and ornery compression task master that forces over time, all of our disparate sensation streams into the same shared graph.
Once you have all of your memories stored within the same graph, by necessity sharing the same meta-pattern, the job of evolving processing algorithms is made that much easier.
An intelligent system will spend most if not all of its time compressing data. We have a tendency to bifurcate the behavior of a mind into storage on the one hand, and processing on the other. I am beginning to think that the thing we call "thinking" and "thought" is exclusively and only a side-effect of constant attempts at compression – that there really isn't anything separate that happens outside of compression. Is this possible?