Monday, April 5, 2010
I talk a lot about the integral relationship between compression and intelligence. Here are some simple methods. We will talk of images but images are not special in any way (just easier to visualize). Recognizing pattern in an image is easier if you can't see very well.
Blur your eyes and you vastly reduce the information that has to be processed. Garbage in, brilliance out!
Do this with every image you want to compare. Make copies and blur them heavily. Now compress their size down to a very small bitmap (say 10 by 10 pixels) using a pixel averaging algorithm. Now convert each to grey scale. Now increase the contrast (about, 150 percent). Store them thus compressed. Now compare each image to all of the rest: subtract the target image from the compared image. The result will be the delta between the two. Reduce this combined image to one pixel. It will have a value somewhere between pure white (0) and pure black (256), representing the gross difference between the two images. Perform this comparison between your target image and all of the images in your data base. Rank and group them from most similar to least.
Now perform image averages of the top 10 percent matches. Build a graph that has all of the source images at the bottom, the next layer is the image averages you just made. Now perform the same comparison to the 10 percent that make up this new layer of averages, that will be your next layer. Repeat until your top layer contains two images.
Once you have a graph like this, you can quickly find matching images by moving down the graph and making simple binary choices for the next best match. Very fast. If you also take the trouble to optimize your whole salience graph each time you add a new image, your filter should get smarter and smarter.
To increase the fidelity of your intelligence, simply compare individual regions of your image that were most salient in the hierarchical filtering that cascaded down to cause the match. This process can back-propagate up the match hierarchy to help refine salience in the filter graph. Same process works for text or sound or video or topology of any kind. If you have information, this process will find pattern in it. Lots of parameters to tweak. Work the parameters into your fitness or salience breading algorithm and you have a living breathing learning intelligence. Do it right and you shouldn't have to know which category your information originated from (video, sound, text, numbers, binary, etc.). Your system should find those categories automatically.
Remember that intelligence is a lossy compression problem. What to pay attention to, what to ignore. What to save, what to throw away. And finally, how to store your compressed patterns such that the graph that results says something real about the meta-paterns that exist natively in your source set.
This whole approach has a history of course. Over the history of human scientific and practical thought many people have settled in on the idea that fast filtering is most efficient when it is initiated on a highly compressed pattern range. It is more efficient for instance to go right to the "J's" than to compare the word "joy" to every word in a dictionary or database. This efficiency is only available if your match set is highly structured (in this example, alphabetically ordered). One can do way way way better than alphabetically ordered lists of 3 million words. Lets say there are a million words in a dictionary. If one sets up a graph, an inverted pyramid, where each level where the level one has 2 "folders" and each folder is named for the last word in the subset of all words at that level divided into two groups. The first folder would reference all words from "A" to something like "Monolith" (and is named "Monolith") The second folder at that level contains all words alphabetically larger than "Monolith" (maybe starting with "Monolithic") and is named "Zyzer" (or what ever the last word is in the dictionary). Now, put two folders in each of these folders to make up the second tier of your sorting graph. At the second level you will have 4 folders. Do this again at the third level and you will have 8 folders each named for the last word in the graph referenced in the tiers of the graph above them. It will only take 20 levels to reference a million words, 24 levels for 15 million words. That represents a 6 order of magnitude savings over an unstructured sort.
A cleaver administrative assistant working for Edward Hubble (or was it Wilson, I can't find the reference?) made punch cards of star positions from observational photo plates of the heavens and was able to perform fast searches for quickly moving stars by running knitting needles into the punch holes in a stack of cards.
Pens A and B found their way through all cards. Pen C hits the second card.
What matters, what is salient, is always that which is proximal in the correct context. What matters is what is near the object of focus at some specific point in time.
Lets go back to the image search I introduced earlier. As in the alphabetical word search just mentioned, what should matter isn't the search method (that is just a perk), but rather the association graph that is produced over the course of many searches. This structured graph represents a meta-pattern inherent in the source data set. If the source data is structurally non-random, its structure will encode part of its semantic content. If this is the case, the data can be assumed to have been encoded according to a set of structural rules themselves encoding a grammar.
For each of these grammatical rule sets (chunking/combinatorial schemes) one should be able to represent content as a meta-pattern graph. One of the graphs representing a set of words might be pointers to the full lexicon graph. A second graph of the same source text might represent the ordered proximity of each word to its neighbors (remember the alphabetical meta-pattern graph simply represents the neighbors at the character chunk level).
What gets interesting of course are the meta-graphs that can be produced when these structured graphs are cross compressed. In human cognition these meta-graphs are called associative memory (experience) and are why we can quickly reference a memory when we see a color or our nose picks up a scent.
At base, all of these storage and processing tricks depend on two things, storing data structures that allow fast matching, and getting rid of details that don't matter. In concert these two goals result in a self optimization towards maximal compression.
The map MUST be smaller than the territory or it isn't of any value.
It MUST hold ONLY those aspects of the territory that matter to the entity referencing them. The difference between photos and text: A photo-sensor in a digital camera doesn't know for human salience. It sees all points of the visual plane as equal. The memory chips upon which these color points are stored see all pixels as equal. So far, no compression, and no salience. Salience only appears at the level of where digital photos originate (who took them, where, and when). On the other hand, text is usually highly compressed from the very beginning. What a person writes about and how they write it always represents a very very very small subset of