Recently I wrote this post about the basics of what the field of computer science means. I thought I'd follow up with a regular bit on some of the more interesting problems in the field, again for the edification of my readers. Note that none of this should be taken as a clue as to who I am. I will probably end up writing about a wide variety of stuff.
So today's topic is Natural Language Processing. This falls under the major subfield of Artificial Intelligence and is highly related to Linguistics.
Anyway, there are two major parts to NLP, in my opinion: language acquisition and language translation. I'll start briefly with acquisition.
One of the basic problems with most AI subfields is that we don't know exactly how a brain works, or at least we cannot build an electronic brain that functions like a human brain. If we could, we could just teach it language the way we teach children language, and it would just pick it up, like a human. But obviously, we can't do that right now. So we have to eliminate this approach and go with something a bit more brute force. The possible approaches fall on a spectrum -- from total brute force (i.e. a separate rule for every single instance) to very general (i.e. just a handful of general rules). The more general the rules, the more potential the machine has to learn beyond exactly what you tell it.
For example, here is a brute-force-style rule:
"She sat." is a sentence.
Is that a good rule? You could probably code up a REALLY simple bot that knows some language by brute-forcing a bunch of sentences into its vocabulary, but that wouldn't be very helpful, would it? Getting the bot to the point of having any conversational ability would take many, many, many rules, and it would probably have to think for 20 seconds before saying anything because of the number of rules it would have to go through. And it could never pick up anything new that wasn't an explicit rule.
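To make the brute-force end of the spectrum concrete, here is a minimal sketch in Python. The names (KNOWN_SENTENCES, is_sentence) and the list of sentences are made up for illustration; this is not how any real bot is built, just the idea of "one rule per sentence" written down.

```python
# Hypothetical brute-force "grammar": every acceptable sentence is its own rule.
KNOWN_SENTENCES = {
    "She sat.",
    "He went.",
    "He goes.",
}

def is_sentence(text: str) -> bool:
    """Return True only if this exact text was entered as a rule."""
    return text in KNOWN_SENTENCES

print(is_sentence("She sat."))   # listed explicitly, so it is accepted
print(is_sentence("She goes."))  # perfectly good English, but never listed
```

The second call is the whole problem in miniature: the bot rejects a fine sentence simply because nobody typed it in.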
Here is a "middle" rule:
Sentences are of the form "subject verb".
This might be slightly more useful, in that it's more general (though still very specific). You would probably need far fewer of these rules to reach the same level of language acquisition as with the previous style, and a bot using them could accept more sentences than just the ones you tell it exactly. This also brings in some semblance of Linguistics, which as I mentioned is closely related to the field. This way, the "chunks" a bot learns will also be known by their parts of speech, which would be useful for comprehending novel sentences.
Finally, the state of the art involves lots more crazy rules that draw on lots more linguistic knowledge and are a lot more general. There is also a lot of statistical analysis involved, a la Markov chains and such. A Markov chain is basically a chain of states where the probability of the next state depends only on the current state (or, in higher-order versions, on the last few states). Here's how you can utilize this for language.
If I begin a sentence with "Insofar" - what do you think the probability is that my next word will be "as"? What about "banana"? A bot could totally learn these probabilities, and update them for its chain as it encounters more and more sentences. This is the other end of the spectrum - a highly general structure by which a bot can learn a lot more than exactly what you tell it, and still be pretty likely to come out OK.
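Here is a rough sketch of the "subject verb" rule in Python. The tiny LEXICON and the function name are assumptions invented for this example; a real system would need a far bigger vocabulary and far more patterns.

```python
# Hypothetical part-of-speech lexicon: each known word is tagged once.
LEXICON = {
    "she": "subject",
    "he": "subject",
    "sat": "verb",
    "went": "verb",
    "goes": "verb",
}

def matches_subject_verb(text: str) -> bool:
    """Check whether the text has the form 'subject verb'."""
    words = text.rstrip(".").lower().split()
    # Unknown words get the tag None, so they can never match the pattern.
    tags = [LEXICON.get(word) for word in words]
    return tags == ["subject", "verb"]

print(matches_subject_verb("She goes."))  # accepted, though never listed as a sentence
print(matches_subject_verb("Sat she."))   # rejected: the tags are in the wrong order
```

Five words and one pattern already cover six sentences, and every new word added to the lexicon covers several more without any new rule.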
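The "Insofar" question can be sketched with a first-order (bigram) Markov model in Python. The BigramModel class and the three training sentences are made up for illustration; the point is only that the probabilities come from counting what the bot has seen, and shift as it sees more.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy first-order Markov chain over words (a sketch, not a real NLP library)."""

    def __init__(self):
        # counts[current_word][next_word] = times next_word followed current_word
        self.counts = defaultdict(Counter)

    def learn(self, sentence: str) -> None:
        """Update the counts from one more sentence."""
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            self.counts[current][following] += 1

    def probability(self, current: str, following: str) -> float:
        """Estimate P(following | current) from the counts seen so far."""
        seen = self.counts.get(current.lower())
        if not seen:
            return 0.0  # the bot has never seen this word start a pair
        return seen[following.lower()] / sum(seen.values())

bot = BigramModel()
bot.learn("insofar as I can tell it works")
bot.learn("insofar as we know it is fine")
bot.learn("insofar that is true")

print(bot.probability("insofar", "as"))      # 2 of the 3 sentences continue with "as"
print(bot.probability("insofar", "banana"))  # never seen, so the estimate is zero
```

A real system would smooth those zero estimates so that unseen word pairs are merely unlikely rather than impossible, but the learning loop is the same idea.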
In my mind, until we know how to either generalize language, or mimic a brain, this is a tough, tough problem to completely "finish."
A highly related but different subfield of NLP is Automatic Machine Translation.
Step back and imagine that you are trying to build a universal translator a la Hitchhiker's Guide or Star Trek. If you forget for the moment that you have to actually *build* the thing, think about how you are going to run the translation engine. Can language be *generalized* to the extent that you can code some rules into a piece of computational machinery and it will always be able to translate, based on those rules, between any two languages?
My guess is no. Though all human languages share a lot of traits, to the point that automatic translators can do an OK job, it is pretty tough to totally generalize. At the same time, there were those episodes in Star Trek where the translators didn't work for a particular alien race because the language was so dependent on local metaphor that the translator couldn't handle it. Those Star Trek writers knew their geek. Local metaphor is TOO SPECIFIC to be generalized, just as local slang can be. If I asked Babelfish to translate "off the heezy" for me, you can be sure it would be very, very, very confused unless I happened to be punching into Urban Babelfish with special rules for American Slang or something.
Another difficulty lies in language structure, not just complexities like slang. For example, in English:
"He went." vs. "He goes."
The difference in tense is presented by the alteration of the verb form.
In Chinese, however (I'll do romanized spellings):
"Ta qu le." vs. "Ta qu."
The difference is presented by an extra word, the particle "le", rather than by changing the verb! (Strictly speaking, "le" marks a completed action rather than past tense as such, but the translation problem is the same.) This can be very confusing for an automatic translator. I think the technology these days requires a specific translator with specific rules for each specific pair of languages because of strangeness like this. There are other languages where important tense modifiers come many words later in a sentence - which makes translation very hard to generalize.
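As a toy illustration of what a pair-specific rule might look like, here is a Python sketch that handles only the two example sentences above. The tables, the translate function, and the "append le for past" rule are all assumptions invented for this example, not how any real translator works.

```python
# Hypothetical English -> romanized Chinese tables covering only the examples above.
PRONOUNS = {"he": "ta"}
# English verb form -> (Chinese verb, whether the English form is past tense)
VERBS = {"went": ("qu", True), "goes": ("qu", False)}

def translate(sentence: str) -> str:
    """Translate a two-word 'subject verb' English sentence."""
    subject, verb = sentence.rstrip(".").lower().split()
    chinese_verb, is_past = VERBS[verb]
    words = [PRONOUNS[subject], chinese_verb]
    if is_past:
        # Pair-specific rule: English changes the verb, Chinese adds a word.
        words.append("le")
    return " ".join(words).capitalize() + "."

print(translate("He went."))  # Ta qu le.
print(translate("He goes."))  # Ta qu.
```

Note that the rule only makes sense for this one pair of languages. Going from English to a language that also changes the verb form would need a different rule entirely, which is why a single set of general rules is so hard to write.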
So there is creeping progress in this area of research, but I still think until we can build a humanoid electronic brain, this will involve a lot of complexity and detail without getting even close to "all the way there." However, it is still very fascinating to consider how to come up with general rules to characterize language and be able to TEST those rules via translation tests. Very cool.
Anyway, that's my bit on NLP in CS.
Enjoy, dear readers :).