This file contains the following subfiles:
10 - common uncertainties
11 - classes and templates
11.1 - Why the definitions are unstable
11.25 - words to templates and vice versa: why?
11.27 - analyzing for essence
11.5 - automatically associate subsets
11.55 - musical motives and managing subgroupings of data
(subfile 10: dealing with uncertainties)
When dealing with language it is essential to have an organized way of
dealing with such problems as approximation, unknown elements of a
definition, elements which comprise a range of values, and definitions
which specify classes of other definitions. In this project's model
these four ideas are represented in similar ways. Imagine a printed set
of star-light spectra, and that we want to find a subset of those
spectra with certain characteristics: e.g., values must be present for
some frequencies, and those values must fall in a certain range. It
would be possible to construct a piece of cardboard – a template – with
spans cut in certain places defined by these characteristics. By
holding the template over the individual spectra, we could see which
ones correspond to the set of characteristics we defined. The size of
the hole cut for a given frequency would be one parameter associated
with that frequency. Also imagine that the templates’ spans come with
conditions, so that one can specify things like “if at 750 angstroms
the value is between x and y, then ignore the value you see through
that span at 800 angstroms.” Expanding the information held at any
given axis from a single value to a range and a probability function
is what lets the associated meaning of the object represent
the uncertainties mentioned above.
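The cardboard template can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the names Span, ignore_if and matches are hypothetical, a spectrum is reduced to a mapping from wavelength to value, and the probability function attached to each range is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Spectrum = Dict[int, float]  # wavelength in angstroms -> measured value


@dataclass
class Span:
    wavelength: int   # where the hole is cut
    low: float        # lower edge of the hole
    high: float       # upper edge of the hole
    # Hypothetical condition (other_wavelength, low, high): if the value seen
    # at other_wavelength lies in [low, high], this span is ignored.
    ignore_if: Optional[Tuple[int, float, float]] = None


def matches(template: List[Span], spectrum: Spectrum) -> bool:
    """Return True if every span that is not ignored sees a value in range."""
    for span in template:
        if span.ignore_if is not None:
            other, lo, hi = span.ignore_if
            seen = spectrum.get(other)
            if seen is not None and lo <= seen <= hi:
                continue  # condition met: do not look through this span
        value = spectrum.get(span.wavelength)
        if value is None or not (span.low <= value <= span.high):
            return False
    return True


template = [
    Span(750, 0.0, 1.0),
    Span(800, 0.4, 0.6, ignore_if=(750, 0.2, 0.3)),
]
spectra = {
    "A": {750: 0.25, 800: 0.9},  # 800 is ignored because 750 is in [0.2, 0.3]
    "B": {750: 0.50, 800: 0.9},  # 800 is checked and falls outside the hole
    "C": {750: 0.50, 800: 0.5},  # both spans are satisfied
}
selected = [name for name, s in spectra.items() if matches(template, s)]
```

Holding the template over each spectrum in turn leaves "A" and "C" in the selected subset.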
(subfile 11: classes)
The removal of one axis from a definition often leaves behind the
definition of a class-word. An apple is a plant-product that's red,
sugary and edible. Take away “red”, and what you're left with looks a
lot like “fruit”. Fuzzy-storage methods provide the mechanism for
finding definitions close together in MS, so that these class-word
definitions, made by dismembering other definitions, can always be
found (see "Fuzzy data, fuzzy logic" main file, page 22).
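A toy version of the apple example, in Python. The axis names and values are invented for illustration, and an exact count of differing axis/value pairs stands in for whatever distance the fuzzy-storage methods really use; drop_axis, distance and nearest_word are hypothetical names.

```python
from typing import Dict

Definition = Dict[str, str]  # axis name -> value (toy stand-in for MS coordinates)


def drop_axis(definition: Definition, axis: str) -> Definition:
    """Dismember a definition by removing one axis."""
    return {a: v for a, v in definition.items() if a != axis}


def distance(a: Definition, b: Definition) -> int:
    """Count axis/value pairs present in one definition but not in the other."""
    return len(set(a.items()) ^ set(b.items()))


def nearest_word(probe: Definition, vocabulary: Dict[str, Definition], skip: str) -> str:
    """Return the stored word whose definition lies closest to the probe."""
    candidates = [w for w in vocabulary if w != skip]
    return min(candidates, key=lambda w: distance(probe, vocabulary[w]))


vocabulary = {
    "apple": {"kind": "plant-product", "color": "red", "taste": "sugary", "edible": "yes"},
    "fruit": {"kind": "plant-product", "taste": "sugary", "edible": "yes"},
    "brick": {"kind": "building-material", "color": "red", "edible": "no"},
}

stripped = drop_axis(vocabulary["apple"], "color")
class_word = nearest_word(stripped, vocabulary, skip="apple")
```

With this vocabulary, the dismembered "apple" lands exactly on "fruit".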
(subfile 11.1: Why the definitions are unstable)
The various reasons for the instability of definitions will be
described in detail later, after some relevant and necessary
constructions are presented. In brief:
1) In the training of an entity such as this program, at any given time
there are unknown amounts of as-yet-untaught information. Additionally,
there are extensive means by which the program processes and changes
its own database - each of these also takes time, and there is no way
to know how much relevant processing may have been completed when some
current interaction requires accessing a particular node. Any
calculation made today may be obviated by additional information or
analysis made available tomorrow.
2) All information-storage in this paradigm is "fuzzy" (see p. 21) - that
is, while definitions appear to specify a point, in fact all the
routines in this program treat such specifications as the central cores
of objects that are extensive in MS. Accessing an object might
therefore return slightly different coordinates either at random or
according to other influences affecting the probe. Not only are objects
(and probes) fuzzy, but different aspects of objects are activated by
context; therefore two calculations that may appear to involve the same
object can lead to different results, depending on this activation (see
"context-controlled axis activation", p.5).
3) Definitions of objects, definition complements, summed objects, etc.,
all consist of collections of coordinates and relations between them.
The axes that create the coordinate space are sometimes only marginally
connected to any dependable reality, and there are limited means for
optimizing them. Primarily, the axes, and the metrics that allow values
to be assigned along them, merely provide one of an undoubted infinity
of systems for labeling things that are perceived by the computer (see
"MS definitions: they're not real!" below, p. 34).
(subfile 11.25: WHY words to templates and vice versa?)
Frames and scripts are venerable AI concepts having to do with the
organization of everyday knowledge (see below, subfile 39.6). A
learning program should be able to construct and progressively refine
both structures. Script learning is comparatively straightforward,
since it involves series of behaviors; every branch of the
short-term-memory in this program consists of statements and actions,
and the series of actions are automatically stored and evaluated both
by statistical usage and by reinforcement (such storage and evaluation
are parts of the simplest Purr-Puss formulations, as described later).
Frames, however, consist of associations required for the sensible
consideration of a particular 'subject' – these and their default
values need to be built up from both experience and intrinsic
relationships among definitions. The
experiential learning is similar to script learning, and depends on the
appearance and use of objects in conversations with Teacher. Some
relations among definitions, and some questions that can elucidate
these relations, however, are discoverable by the program on its own
(see also 'deduction' in subfile #39.7).
An essential part of "apple-ness" is "crunchy". Let's imagine that the
texture-axis has received a value, but that little else has as yet been
discovered about apples beyond that they are liked – that is, good to
eat. (For example, one might see an object on the table, and one's
horse might come along and eat it, making a sharp, sudden sound
incompatible with "soft & mushy".) A routine concerned with newish
words might well loop through axes that possess values, so we can
easily imagine a situation in which a comparatively un-defined word
would be in one register while the texture axis would be in another.
The simplest of question-generators would merely have to proceed from
'word' to 'template', along the texture axis, to know that it should
ask "Do you also like liquid apples?" An affirmative answer always indicates
reinforcement, and so the motion from apple to apple-juice would be a
candidate for membership in the frame of an apple. Since "apple" is a
small point in the region of MS known as "fruit" the same motion would
(more weakly) be a candidate for membership in the frame for fruit.
Later this frame element could independently postulate the existence of
any fruit having an associated juice.
The series of events just described requires that a question be put
to Teacher. More important is the establishment of frame elements
without requiring the participation of any entity outside the program.
Frame terminals are those sub-contexts that are known to be associated
with the central idea of the frame; for example, the function-frame for
“chair” includes a back (if you’re talking about sitting in a
chair, you must be able to lean back against something, or else it’s a
bench). It is essential, for a program that purports "to learn", to be
able to construct frame terminals on its own, and the opening and
closing of spans on axes can be a central part of this.
Suppose “running” has been discussed, but the frame for “walking” is as
yet incomplete. The most basic PUSS would learn to recall words that came
after “running” in Teacher's sentences - and in this example “running”
is taken to be adjectival. (This means it is actually located in a
different region of MS than is the present participle of the verb.)
There would be “running water” and “running shoes” and not too much
more. “Running water” would not match any aspects of the current
context if pedal locomotion were the subject, so “running shoes” would
appear near the top of the heap of relevant associations.
Expanding the “speed” axis in “running” would point to “walking”, among
other things (perhaps “jogging” and “sprinting”). This is clearly a
short enough list that each element could be considered - this means
that “walking” would at some point come to the top of the association
stack. Then, having a strong association to
the words “walking” and “shoes”, one is in a position to add
“shoes” to the “walking” frame. An addition like this is more like a
postulate until repetition confirms that it is an appropriate part of
the frame.
Three mechanisms exist for removing unconfirmed frame elements. First,
all additions to data structures like frames trigger the dark-cycle
crawler that constructs questions for Teacher (see main file,
p. 22). This is no complex matter: in the case being considered, the
crawler simply adds to a stack of questions needing answers, that here
might say “are shoes relevant to walking?” and subsequently “what
default value(s) should there be?”
Second, reinforcement drives many processes, and if, in conversation,
Teacher refers to shoes and walking in close proximity, then the
"shoes" terminal in the "walking" frame will be reinforced. This is no
complicated matter, either; all frames associated with active words are
present in the buffer (see main file, p. 25) as soon as they
appear, and Teacher's references are so important that they immediately
become addresses for content-probes. This would lead to the "walking"
frame without any search whatever. In this case, a frame element
that is "not walking" would not immediately be removed because of this
reinforcement of "walking", but would sink lower into the bin of
relevant objects; each such sinking reduces the lifetime of the object
in the environment (see "Big Buffer", p. 25).
Third, all the terminals in a frame are stored (like everything else
around here) as PUSS-records, which already include a feature that
simply counts the number of times a particular association is made or
accessed. After a large amount of teaching has occurred, the counters
for each postulated terminal could be examined, and those whose
appearances are too few could be suppressed.
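A minimal Python sketch of the bookkeeping behind these mechanisms. The Frame class, its method names and the threshold min_count are all hypothetical; the buffer, with its sinking objects and lifetimes, is not modeled, so reinforcement is reduced here to bumping the same counter that the third mechanism examines.

```python
from typing import Dict, List


class Frame:
    def __init__(self, subject: str) -> None:
        self.subject = subject
        self.counts: Dict[str, int] = {}  # terminal -> times made or accessed

    def postulate(self, terminal: str, questions: List[str]) -> None:
        """Add an unconfirmed terminal and queue questions for Teacher."""
        if terminal not in self.counts:
            self.counts[terminal] = 0
            questions.append(f"are {terminal} relevant to {self.subject}?")
            questions.append(f"what default value(s) should {terminal} have?")

    def reinforce(self, terminal: str) -> None:
        """Teacher mentioned the terminal in proximity to the frame's subject."""
        if terminal in self.counts:
            self.counts[terminal] += 1

    def suppress_rare(self, min_count: int) -> List[str]:
        """Drop terminals whose counters stayed below min_count."""
        rare = [t for t, n in self.counts.items() if n < min_count]
        for t in rare:
            del self.counts[t]
        return rare


questions: List[str] = []
walking = Frame("walking")
walking.postulate("shoes", questions)
walking.postulate("umbrellas", questions)
for _ in range(3):
    walking.reinforce("shoes")
removed = walking.suppress_rare(min_count=2)
```

After three mentions of shoes and none of umbrellas, only the "shoes" terminal survives the pruning pass.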
(subfile 11.27: analyzing for essence)
Suppose a number of defined words are seen to be isolated in a region
of MS, with regions of relative emptiness all around. If all of the
words consist of fully expanded definitions (that is, ones in which
there appear only axes and pointers, no sub-words) then it may be
useful to find out which word best defines the entire group, which
words divide the group into classes, what the characteristics of those
classes are, and what makes each word a distinct individual. All this
is accomplished with a dark-cycle tabulation of subset appearances.
1) best group definition
It is perfectly reasonable to construct all the subsets of axes present
in a set of words. By "reasonable" I mean practical from a
computational point of view: it doesn't take too long. (It is also
reasonable to imagine the neurological analog: many (all?) subsets of
sense-inputs must cause different cortical states to arise.)
Different subsets occur in different numbers of words, allowing the
simple summing up of a score for each subset. Call the largest subset
that occurs in all the words of the group Smax. Then, of the words
containing Smax, that word with the fewest additional axis values is
likely to be the best class-word for the group.
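The tabulation can be sketched as follows, in Python. Several simplifications are ours: a word is reduced to the set of axes it uses (axis values are ignored), the function names are hypothetical, and subsets are enumerated directly, which is comfortable here only because the toy words have a handful of axes each.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def subset_scores(group: Dict[str, Set[str]]) -> Counter:
    """Score each non-empty axis subset by the number of words containing it."""
    scores: Counter = Counter()
    for axes in group.values():
        ordered = sorted(axes)
        for size in range(1, len(ordered) + 1):
            for combo in combinations(ordered, size):
                scores[frozenset(combo)] += 1
    return scores


def best_class_word(group: Dict[str, Set[str]]) -> Tuple[FrozenSet[str], str]:
    """Find Smax, then the word with the fewest axes beyond Smax."""
    scores = subset_scores(group)
    shared = [s for s, n in scores.items() if n == len(group)]
    smax = max(shared, key=len, default=frozenset())
    best = min(group, key=lambda w: len(group[w] - smax))
    return smax, best


group = {
    "fruit": {"kind", "taste", "edible"},
    "apple": {"kind", "taste", "edible", "color", "texture"},
    "lemon": {"kind", "taste", "edible", "color"},
}
smax, best = best_class_word(group)
```

Here Smax is {kind, taste, edible} and "fruit", which adds nothing to it, comes out as the class-word.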
2) division of the group into classes
Examination of the ranked list of scored subsets would not in itself
necessarily provide a list of words, since a subset might not
constitute a word and vice versa. Subsets by definition divide MS up
into regions that overlap according to the amount of common axis
information in the subsets. There is no obvious and simple way, on the
basis of axis content alone, to decide which subsets constitute more
meaningful or useful partitions of the space.
Fortunately, the combination of subsets with extant (English) words can
serve that purpose. We assert that human language must have evolved a
sensible subdivision of meanings, and use the presence or absence of a
word to score a subset that overlaps the word's location. This is of no
help in non-linguistic realms, of course.
In those cases one must make some assumption about scoring. For
example, one could assign higher scores to groups of subsets on the
basis of isolation (the less overlap, the higher the score).
Assumptions such as these would be among the program parameters to be
influenced by reinforcement.
3) characteristics of the classes and individuation
Once a region has been divided, one has by definition differentiable
sets of axis-content, and that axis content IS "the characteristics of
the classes". Likewise, each actual word's definition, minus the axis
content of the class in which it is found, constitutes the properties
that make that word an individual among classmates.
Such a dark-cycle analysis of subsets can be running continuously,
setting up scored boundaries within MS. In this case, a boundary would
receive a higher score the greater were the distances to word-clusters
on the sides of the boundary. This is conceptually trivial: we wish to
know what regions of the space are incompatible with other regions. The
extent to which a possible region is separated from its neighbors is a
purely numerical matter. Once a region is found, or once it is decided
to consider an area to be a candidate for "region-hood", then its
compatibility with non-contiguous regions can be analyzed, just as was
its region-hood, by a tabulation of bombs and other associations
originating in the region.
(subfile 11.5: learning word fragments, and what is a "syllable" in a melody, an action by an arm, or a path?)
The syllable function will be performed by a dark-cycle crawler (which
see: main file, p. 22). The goal of that procedure is to examine the
whole vocabulary of defined words, in order to determine what syllables
can be extracted that have related meanings. Initially we identify
word-fragments that have similar spelling or sound. Any words
containing the fragment are treated as a 'cluster' (see main file
p.27):
1 - create a group of words with some suspected internal commonalities:
    such a commonality is suspected - initially - if there is a similarly
    spelt fragment or if there is a sound with a similar subset of I.P.A.
    designators
2 - the coordinate sets comprising the words' definitions provide a list
    of axes
3 - each axis receives a score
    a - proportional to the frequency of its appearance in the whole
        group, and
    b - inversely proportional to the MS distance between the coordinates
        in which it appears.
If some of the words in the group contain an identifiable
fragment with a common meaning, then the axes associated with the
common fragment will have created peaks in the distribution.
(This is a somewhat simplified description; because of the difficulty
of determining appropriate thresholds for the peaks, the quantity
actually used is the first derivative of the change in score as the end
of the examination of the group is approached.) At that point the
fragments can be stored away as if they were words; they are treated
like suffixes, prefixes, and conjugated verbs.
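Step 3 of the procedure might look like this in Python. The sketch rests on assumptions that are ours alone: definitions are toy axis-to-number mappings, "MS distance" is taken to be the mean pairwise difference of the values on that one axis, and the score is frequency / (1 + distance) so that a zero distance cannot divide by zero. The derivative-based refinement mentioned above is not attempted.

```python
from itertools import combinations
from typing import Dict

Definition = Dict[str, float]  # axis name -> coordinate on that axis


def axis_scores(group: Dict[str, Definition]) -> Dict[str, float]:
    """Score each axis: higher when frequent in the group and tightly clustered."""
    axes = {axis for definition in group.values() for axis in definition}
    scores: Dict[str, float] = {}
    for axis in axes:
        values = [d[axis] for d in group.values() if axis in d]
        frequency = len(values) / len(group)
        pairs = list(combinations(values, 2))
        spread = sum(abs(x - y) for x, y in pairs) / len(pairs) if pairs else 0.0
        scores[axis] = frequency / (1.0 + spread)
    return scores


# Hypothetical cluster of words containing the fragment "quant".
cluster = {
    "quantity": {"amount": 0.90, "abstract": 0.7},
    "quantify": {"amount": 0.80, "action": 0.9},
    "quantize": {"amount": 0.85, "action": 0.8, "discrete": 0.9},
    "Quantz": {"person": 1.0, "music": 0.9},
}
scores = axis_scores(cluster)
peak_axis = max(scores, key=scores.get)
```

The "amount" axis forms the peak, while the axes contributed by "Quantz" stay at the floor because they occur in only one word of the cluster.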
An example of such a fragment would be "quant" from the word
"quantity". All the words like "quantify", "quantize", etc., would
include axis subsets that would show up as peaks when this function is
used. The vocabulary might include "Quantz" (the 18th century composer)
or "Quantum" (the model of car), but these words would share no
significant axes with the others, and therefore would not contribute to
the statistics that would define the fragment (which is, in this realm,
a syllable).
Likewise, syllables that can have two separate meanings will form two
peaks, and will be separated and dealt with just as homonyms. An
example would be "duct", whose usages in "ductile" and "conduct" would
have no overlap in axes.
Finally, it is necessary to avoid inappropriate assignment of meaning
to subgroups that have so many meanings that they should never be
considered alone. Consider "con". A large number of words contain this
spelling and this sound, and so this collection of letters would cause
the function to be initiated. The function would create a number of
peaks, some of which would be inefficient as members of a working
vocabulary. "Con" has rather different meanings when it is followed by
"t", "sent", "v", etc. "Co" would be even worse. The difference between
these "problem syllables" and ones that are useful is extremely simple:
multiple peaks in the axis histogram created by a proposed syllable
always imply that the selection made is too short. Additional letters
need to be added and the procedure repeated. Eventually so many letters
will be added that all peaks will disappear (the entire collection
"conversion", for example, would create no peaks: it would appear only
in one word, and would not be considered as a candidate). Note the
similarity to window resolution issues in Purr-Puss.
Successfully established fragments have a form exactly like all other
objects, and therefore they can be extracted into templates like any
other word. This provides a sort of class-word for morphemes. Combining
templates of word fragments will be a principal means of elucidating
(guessing at?) the meaning of unknown words, and of postulating the
existence of words not yet entered into the dictionary.
Different bodies of information about words are expected to be available
depending on the initial clustering function. Fragments identified by
spelling will not always coincide with fragments identified by sound.
Relationships such as those between "clean" and "kleenex", "focused"
and "focussed", or "Pittsburgh" and "Allenberg" would be missed if
syllables are only defined by spelling.
We are quite ready to accept the possibility that word fragments thus
established might be "real", useful and consistent parts of the
language. These fragments can be found by a brute-force comparison
of vocabulary elements as well. That is, instead of seeking an initial
group of words that is suspected to contain meaningful subunits, we
simply set the comparator going and accept any group that is extracted
as a candidate for containing useful sub-items for isolation. Thus, for
example, similarities related to the structural elements of words would
come out (see 'deftrans', main file p. 14).
Since linguistic words look the same as the objects in the musical,
navigational and manipulator realms, the program will attempt to find
parallel relationships among non-linguistic objects as well. For
example, suppose that the robot arm has learned to do lots of things in
the blocks world. Each task would have presented series of actions that
could be used (interpreted) the same way that series of letters are.
Then certain subsets of actions would emerge as being consistently
used, just like syllables. For example, grasping a block would
undoubtedly be followed by such a small set of actions that each such
pair would be found by the function described above. We could
expect "grasp-lift", "grasp-rotate.left", and "grasp-compress" to
emerge as useful "words"; in fact, "grasp" might never be used alone
after considerable learning has taken place (except perhaps in
sequences of actions used to debug ‘grasping’).
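As a rough Python illustration, treating action series like letter series amounts to tabulating adjacent pairs. The task lists are invented, the function name pair_counts is hypothetical, and plain counting stands in for the axis-scoring function described above.

```python
from collections import Counter
from typing import List, Tuple


def pair_counts(tasks: List[List[str]]) -> Counter:
    """Count every pair of consecutive actions across all recorded tasks."""
    counts: Counter = Counter()
    for actions in tasks:
        counts.update(zip(actions, actions[1:]))
    return counts


# Hypothetical action series learned in the blocks world.
tasks = [
    ["grasp", "lift", "move", "release"],
    ["grasp", "rotate.left", "release"],
    ["grasp", "lift", "release"],
    ["grasp", "compress", "release"],
    ["grasp", "lift", "move", "release"],
]
counts = pair_counts(tasks)

# The small set of actions ever seen directly after "grasp".
after_grasp = sorted(b for (a, b) in counts if a == "grasp")

# Pairs used at least twice become candidate "words".
frequent = {f"{a}-{b}": n for (a, b), n in counts.items() if n >= 2}
```

Only three actions ever follow "grasp" in this data, and "grasp-lift" is the most frequent pair.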
(subfile 11.55: musical motives and managing subgroupings of data)
The idea of a motive in composition covers almost any organizational
element created for a particular piece. Music theorists do not usually
use the word for elements that are common to all music in the style
(such as 'major triad' in common-practice-period Western music), but
this program does not make that distinction. Any perceptible process or
relation will be learned (if it is repeated). Some separate
mechanism would have to be active if the learned elements are to be
sorted into subgroups, such as "motives appropriate for a morning raga"
or "common harmonic progressions in Bach". Without such subgrouping,
the program, when asked to "create", would combine styles without
prejudice.
These subgroupings in music are easy to understand; linguistic analogs
include "the current subject" but also "the people present". It is
another delightful convenience of Purr-Puss that these subgroupings can
be separated from one another in a trivially simple way - and in a way
that is closely related to the core function of Purr-Puss.
Without involving any other code, one separates such regions from one
another merely by adding one window element to the feature vectors used
in the Puss storage method. (At this point readers not familiar with
the workings of learning algorithms using content-addressable memory
should read the section on OTHELLO, p.42, main file.) For
example, suppose the program is learning to move blocks. A large number
of actions might work perfectly well when all the objects present are
dry. These actions would be governed by feature vectors such as
goal: "no red blocks"
action: remove (block with property: red)
end condition: (find(block with property: red) fails).
The action would be a composite daemon, whose subDaemons would include
"grasp" and "lift".
This would be found to fail under the circumstance "oily". Grasping
wouldn't work, and instead, maybe scooping the block up would be an
available option. If this succeeded, then the feature vector for the
task could have added to it a channel for dryness, and if it had the
value "oily", then a different series of behaviors would have to be
learned. This technique has a lot of problems not addressed here.
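The effect of the added channel can be shown with a toy content-addressable store in Python. An exact-match dictionary stands in for the Puss storage method, which it does not reproduce; ContentMemory and the channel name "dryness" are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Window = Dict[str, str]             # channel name -> value
Key = FrozenSet[Tuple[str, str]]    # hashable form of a window


class ContentMemory:
    """Toy store addressed by the full content of a feature vector."""

    def __init__(self) -> None:
        self._cells: Dict[Key, List[str]] = {}

    def store(self, window: Window, actions: List[str]) -> None:
        self._cells[frozenset(window.items())] = actions

    def recall(self, window: Window) -> Optional[List[str]]:
        return self._cells.get(frozenset(window.items()))


memory = ContentMemory()
goal = {"goal": "no red blocks"}

# One extra channel separates the two regions of learned behavior.
memory.store({**goal, "dryness": "dry"}, ["grasp", "lift"])
memory.store({**goal, "dryness": "oily"}, ["scoop"])

dry_plan = memory.recall({**goal, "dryness": "dry"})
oily_plan = memory.recall({**goal, "dryness": "oily"})
```

The same goal now retrieves "grasp, lift" when the blocks are dry and "scoop" when they are oily.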
When this program uses separation of data regions, it requires an
additional storage cycle for each item. In return, the separated data is
available both as part of the whole and as a subgroup. For example, if
a game is being played at an unrestricted level, and a move has been
recovered from the unseparated part of MS, then with one additional
probe of memory, the program can discover that some move would be the
best one available at a specified restricted level.
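A small Python sketch of the two storage cycles and the extra probe. Everything here is assumed for illustration: the class SplitMemory, the marker channel "level", the moves and their scores. Each item is written once without the marker and once with it, so a probe without the marker sees the whole and a probe with it sees only the subgroup.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Window = Dict[str, str]
Marker = Tuple[str, str]            # (channel name, value) that names a subgroup
Key = FrozenSet[Tuple[str, str]]


class SplitMemory:
    def __init__(self) -> None:
        self._cells: Dict[Key, Dict[str, float]] = {}

    @staticmethod
    def _key(window: Window, marker: Optional[Marker]) -> Key:
        full = dict(window)
        if marker is not None:
            full[marker[0]] = marker[1]
        return frozenset(full.items())

    def store(self, window: Window, item: str, score: float, marker: Marker) -> None:
        """Two storage cycles: the unseparated region, then the subgroup."""
        for m in (None, marker):
            self._cells.setdefault(self._key(window, m), {})[item] = score

    def best(self, window: Window, marker: Optional[Marker] = None) -> Optional[str]:
        """Probe once; pass a marker to restrict the probe to a subgroup."""
        cell = self._cells.get(self._key(window, marker), {})
        return max(cell, key=cell.get) if cell else None


memory = SplitMemory()
position = {"position": "P1"}
memory.store(position, "sacrifice queen", 0.9, ("level", "expert"))
memory.store(position, "develop knight", 0.6, ("level", "beginner"))

unrestricted = memory.best(position)
restricted = memory.best(position, ("level", "beginner"))
```

The unrestricted probe returns the highest-scoring move overall, and one additional probe with the "beginner" marker returns the best move available at that level.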
The following examples are equivalent with respect to
establishing boundaries between subgroupings of data:
1 - conversation participants
2 - current subject
3 - whose opinion is currently in play
4 - which branch of an argument is being considered
5 - what musical style is involved, or what mode, or which raga, etc.
6 - what level of game play is being used
7 - which team is "on offense"
8 - what "tense" is current (past, future, hypothetical, wishful, present-sarcastic, etc.)
One way to clarify this equivalence is to characterize the sorts of data
regions that are being separated.
1 - Clearly, in normal conversation, one knows to
whom one is speaking. This knowledge is critical: I have an
acquaintance who, as an infant, spoke English with Dad, French with the
Nanny, and German with Mom. She was unaware she was tri-lingual: the
different languages were just "how you talk to <this person>".
Another aspect of "keeping track" of conversations
involves "who has been informed of which facts".
As with all of the examples that follow, both of
these types of regions can be separated by adding single identifiers to
all the feature vectors. If one is talking to a person with whom no
previous conversation has taken place, all sorts of conversational
tacks are available that become irrelevant with partners who are very
familiar. If "the person present" can be part of the feature vector
(the Puss window) that is involved in making a decision about where a
conversation should go, then the decisions made (which are based on the
points in MS accessed) can be sensibly limited, for example, to things
not yet discussed with this person.
2 - Suppose we're discussing skateboards. Different
facts are useful depending on whether one is considering 'safety' or
'fun'. If the current conversation involves one or the other of these,
then a single coordinate can be added to the windows to prevent
irrelevant matters from being brought to the fore. Whole different
regions of MS - that is, whole different sets of associations - must be
enabled or inhibited by whatever the current context includes.
On the other hand, even a conversation concerning
the 'fun' available with a skateboard might occasionally turn to issues
of safety; re-enabling those associations is accomplished by removing
the limiting window element and probing the memory anew.
3,4,5 - In normal conversation we commonly
evaluate different opinions in turn. Reasoning appropriate for one side
of a debate, or logic of a sort used by one particular person, might be
out of place for others. In order to follow a single "train of
thought" without contamination, for example, by facts unknown to
that side of the debate, it is once again necessary to limit the
availability of portions of MS. The limiting could be accomplished by
inserting an identifier into feature vectors as long as the limiting is
required.
This procedure can be taken to extremes: if every
feature vector ever used includes some arbitrary marker, then an
entirely separate 'learning entity' can be running using the same code
and memory, without interference with the original program. (An
interesting side effect of doing this would be the ability of the
entities to enjoy any degree of common access to knowledge desired.)
This is the process for using a single program image and memory block
to learn to compose melodies in different styles. One batch of
learning, accomplished with a training set from one style, can be kept
entirely separate from other styles learned, or, the output routines
could be allowed access to both batches of learning, producing a
mixture of styles.
This implies a functional analogy between musical
style and personal opinion; not surprisingly, since the choices of any
given composer must of necessity be based on that composer's 'personal
opinion' about "what will sound best".
6 - Many computer programs for executing game
strategies are endowed with an option to limit the level of advancement
the program can employ (sometimes this limitation is desirable to
reduce the amount of time required by the computer to make its next
move). Sometimes these limitations are a matter of limiting the depth
of exhaustive search, but if the adjustment is more heuristic, then the
limitation of ability amounts to a limitation on the regions of MS
available to the decision-making routines.
7 - See SOCCER, main file, p. 47, in which the
effective bit-width of the window is doubled by the use of a single bit
whose value reflects 'possession'. In this case, the value of the
possession bit not only determines which half of MS is available,
but it also allows the meaning of input channels to be toggled whenever
possession changes. This would be rather the same as separating the two
halves of memory needed for chess openings. One could allow the
function of a program to phase from opening-moves into middle-game
moves by allowing successively more access to the other half of memory
as the two sides (black and white) become more equivalent.
8 - We all have modes of discourse that are
determined either by social considerations (how polite one should
be at some moment) or by what might be called literary or theatrical
"stance". A sarcastic reply, prompted when some disagreement is
sensed, would call for a different set of linguistic paths than a reply
meant to comfort a child. Of course, merely having the
equipment to express different stances doesn't tell one when to use
them - this is a far more difficult question.
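The "arbitrary marker" idea from items 3, 4 and 5 can be illustrated with a toy melody learner in Python. This is not Purr-Puss: the window is shrunk to a single previous note, and StyleMemory, the style names and the melodies are all hypothetical. The point is only that one memory serves several separate batches of learning, and that a probe may consult one batch or several.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple


class StyleMemory:
    def __init__(self) -> None:
        # (marker, previous note) -> counts of the notes that followed
        self._next: DefaultDict[Tuple[str, str], Counter] = defaultdict(Counter)

    def learn(self, marker: str, melody: List[str]) -> None:
        """Store each note-to-note step under the given marker."""
        for prev, nxt in zip(melody, melody[1:]):
            self._next[(marker, prev)][nxt] += 1

    def continuations(self, markers: Iterable[str], prev: str) -> Counter:
        """Probe one region per marker and merge what comes back."""
        merged: Counter = Counter()
        for marker in markers:
            merged.update(self._next.get((marker, prev), Counter()))
        return merged


memory = StyleMemory()
memory.learn("raga", ["C", "Db", "E", "F"])
memory.learn("bach", ["C", "D", "E", "F"])
memory.learn("bach", ["C", "D", "G"])

raga_only = memory.continuations(["raga"], "C")
bach_only = memory.continuations(["bach"], "C")
mixture = memory.continuations(["raga", "bach"], "C")
```

Probing with one marker keeps the styles apart; probing with both gives the output routines the mixture of styles described above.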