Friday, 15 January 2010

.net - NLP/Quest. Answering - Retrieving information from DB -




I've been doing a bit of reading on NLP recently, and so far I've got a (very) basic idea of how everything works, ranging from sentence splitting to POS-tagging, and also knowledge representation.

I understand that there's a wide variety of NLP libraries out there (mostly in Java or Python), and I have found a .NET implementation (SharpNLP). It's been great, actually. There is no need to write any custom processing logic; just use its functions and voila! User input is well separated and POS-tagged.

What I don't understand is where to go from here, if my main motivation is to build a question answering system (something like a chatterbot). What libraries (preferably .NET) are available for me to use? If I want to build my own KB, how should I represent my knowledge? Do I need to parse the POS-tagged input into something else that my DB can understand? And if I'm using MS SQL, is there any library that helps map POS-tagged input to database queries? Or do I need to write my own database querying logic, according to procedural semantics (which I've read about)?

The next step, of course, is to formulate a well-constructed reply, but I think I can leave that for later. Right now what's bugging me is the lack of resources in this area (knowledge representation, NLP to KB/DB retrieval), and I'd really appreciate it if any of you out there could offer me your expertise :)

This is a very broad question, and as such it barely fits the format of StackOverflow; nevertheless, I'd like to give it a stab.

First, a word on NLP. The broad availability of mature tools in the area of NLP is in itself somewhat misleading. To be sure, all/most NLP functions, from, say, POS-tagging or chunking to, say, automatic summarization or named entity recognition, are covered and generally well served by the logic and supporting data of the various libraries. However, building real-world solutions from these building blocks is hardly a trivial task. One needs to:

- Architect the solution along some sort of pipeline or chain, whereby the results of a particular transformation feed the input of subsequent processes.
- Configure the individual processes: the computational framework of these is well established but extremely sensitive to the underlying data, such as the training/reference corpus, optional tuning parameters, etc.
- Select and validate the proper functions/processes.
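As a rough illustration of the pipeline idea in the list above, here is a minimal Python sketch. The stage functions (`split_sentences`, `tokenize`, `tag_tokens`) and the tiny lexicon are hypothetical stand-ins invented for this example; they are not the API of SharpNLP or of any other library, and a real tagger would be a model trained on a corpus.

```python
import re
from functools import reduce

# Hypothetical toy lexicon; a real tagger is a statistical model, not a lookup table.
TOY_LEXICON = {"cats": "NOUN", "mice": "NOUN", "eat": "VERB", "the": "DET"}

def split_sentences(text):
    """Stage 1: naive sentence splitting on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentences):
    """Stage 2: lowercase word tokens for each sentence."""
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences]

def tag_tokens(tokenized):
    """Stage 3: look each token up in the toy lexicon; unknown words get 'X'."""
    return [[(tok, TOY_LEXICON.get(tok, "X")) for tok in sent] for sent in tokenized]

def run_pipeline(text, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda data, stage: stage(data), stages, text)

STAGES = [split_sentences, tokenize, tag_tokens]
tagged = run_pipeline("Cats eat mice. The mice run!", STAGES)
print(tagged)
```

Keeping the stages in a plain list makes it easy to swap one implementation for another and to validate each step in isolation, which is where most of the configuration effort tends to go.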

The above is particularly hard for the part of the solution associated with the extraction and handling of semantic elements of the text (information extraction at large, but also co-reference disambiguation, relationship extraction or sentiment analysis, to name a few). These NLP functions, and the corresponding implementations in the various libraries, tend to be harder to configure and more sensitive to domain-dependent patterns, to variations in the level of speech, or even to the "format" of the supporting corpora.

In a nutshell, NLP libraries provide essential building blocks for applications such as the "question answering systems" mentioned in the question, but much "glue", and much discretion as to how and where to apply the glue, is required (along with a good dose of non-NLP technologies, such as the issue of knowledge representation, discussed below).

On knowledge representation. As hinted above, POS-tagging alone isn't a sufficient element of the NLP pipeline. Essentially, POS-tagging adds information to each word in the text, indicating the [likely] grammatical role of the word (as in noun vs. adjective vs. verb vs. pronoun, etc.). This POS info is quite useful, as it allows, for example, the subsequent chunking of the text into logically related groups of words and/or a more precise lookup of individual words in dictionaries, taxonomies or ontologies.
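To make the chunking step concrete, here is a small Python sketch that groups POS-tagged tokens into noun phrases. The tag names (`DET`, `ADJ`, `NOUN`, `VERB`) and the `chunk_noun_phrases` function are assumptions made up for this illustration; real chunkers rely on trained models or much richer grammars.

```python
def chunk_noun_phrases(tagged):
    """Group consecutive DET/ADJ/NOUN tokens into noun-phrase chunks.

    A run is kept only if it contains at least one NOUN.
    """
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((word, tag))
            continue
        # Any other tag closes the current run.
        if any(t == "NOUN" for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    # Flush the last run at the end of the sentence.
    if any(t == "NOUN" for _, t in current):
        chunks.append(" ".join(w for w, _ in current))
    return chunks

tagged_sentence = [
    ("the", "DET"), ("hungry", "ADJ"), ("cat", "NOUN"),
    ("eats", "VERB"),
    ("a", "DET"), ("mouse", "NOUN"),
]
noun_phrases = chunk_noun_phrases(tagged_sentence)
print(noun_phrases)  # ['the hungry cat', 'a mouse']
```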

To illustrate the kind of information extraction and underlying knowledge representation that may be required for a "question answering system", I'll discuss a common format used in various semantic search engines. Beware, however, that this format is maybe more conceptual than prescriptive for semantic search, and that other applications, such as expert systems or translation machines, require yet other forms of knowledge representation.

The idea is to use NLP techniques, along with supporting data (from plain "lookup tables" for simple lexicons, to tree-like structures for taxonomies, to ontologies expressed in specialized languages), to extract triplets of entities from the text, with the following structure:

- an agent: who or what is "doing" something
- a verb: what is being done
- an object: the person or item upon which the "doing" is done (or, more generically, some complement of information about the "doing")

Examples:
  cat/agent eat/verb mouse/object
  john-grisham/agent write/verb the-pelican-brief/object
  cows/agent produce/verb milk/object
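The agent/verb/object notation used in these examples maps naturally onto a small record type. The following Python sketch assumes a hypothetical `Fact` tuple and a `parse_fact` helper, both invented here for illustration, that reads the `value/role` notation.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    """One agent/verb/object triplet (hypothetical type for this sketch)."""
    agent: str
    verb: str
    obj: str

def parse_fact(notation):
    """Parse a string such as 'cat/agent eat/verb mouse/object' into a Fact."""
    roles = {}
    for part in notation.split():
        value, role = part.rsplit("/", 1)
        roles[role] = value
    return Fact(roles["agent"], roles["verb"], roles["object"])

facts = [
    parse_fact("cat/agent eat/verb mouse/object"),
    parse_fact("john-grisham/agent write/verb the-pelican-brief/object"),
    parse_fact("cows/agent produce/verb milk/object"),
]
print(facts[1])
```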

Furthermore, these triplets, sometimes called "facts", can be categorized into various types corresponding to specific patterns of semantics, typically organized around the semantics of the verb. For example, "cause-effect" facts have a verb that expresses some causality, "contains" facts have a verb that implies a container-to-containee relationship, "definition" facts are for patterns where the agent/subject is defined [if only partially] by the object (e.g. "cats are mammals"), etc.
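One simple way to picture this categorization is a lookup from the (normalized) verb to a fact type. The verb lists below are assumptions chosen for the example, not an established taxonomy, and `classify_fact` is a hypothetical helper.

```python
# Hypothetical verb lists; a real system would derive these from a lexicon or ontology.
FACT_TYPES = {
    "cause-effect": {"cause", "trigger", "lead-to"},
    "contains": {"contain", "include", "hold"},
    "definition": {"be"},
}

def classify_fact(verb):
    """Return the fact type whose verb list contains the verb, else 'generic'."""
    for fact_type, verbs in FACT_TYPES.items():
        if verb in verbs:
            return fact_type
    return "generic"

print(classify_fact("be"))       # definition (e.g. cats/agent be/verb mammals/object)
print(classify_fact("contain"))  # contains
print(classify_fact("eat"))      # generic
```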

One can easily imagine how such databases of facts can be queried to provide answers to questions, and also to provide various smarts and services such as synonym substitution or improving the relevance of answers to questions (compared with plain keyword matching).
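As a sketch of what querying such a fact store could look like, the Python example below uses the standard-library `sqlite3` module as a stand-in for MS SQL. The table layout, the `SYNONYMS` map and the `who_did` function are assumptions made for illustration; the same parameterized `SELECT` would carry over to another SQL engine with minor changes.

```python
import sqlite3

# Hypothetical synonym map: surface verbs are normalized to the stored verb.
SYNONYMS = {"wrote": "write", "authored": "write", "eats": "eat"}

def normalize_verb(verb):
    """Synonym substitution: map a surface verb to its canonical form."""
    verb = verb.lower()
    return SYNONYMS.get(verb, verb)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (agent TEXT, verb TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("cat", "eat", "mouse"),
        ("john-grisham", "write", "the-pelican-brief"),
        ("cows", "produce", "milk"),
    ],
)

def who_did(verb, obj):
    """Answer 'who <verb> <obj>?' by looking up matching agents."""
    rows = conn.execute(
        "SELECT agent FROM facts WHERE verb = ? AND obj = ? ORDER BY agent",
        (normalize_verb(verb), obj),
    ).fetchall()
    return [agent for (agent,) in rows]

print(who_did("wrote", "the-pelican-brief"))  # ['john-grisham']
```

The hard part, mapping a free-form question such as "Who wrote The Pelican Brief?" onto the `verb` and `obj` arguments, is left out here; that mapping is itself an information extraction problem.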

The real difficulty is in extracting facts from the text. Many NLP functions are put in play for this purpose. For example, one of the steps in the NLP pipeline is to replace pronouns by the noun they reference (anaphora resolution or, more generally, co-reference resolution in NLP lingo). Another step is to identify named entities: names of people, of geographic places, of books, etc. (NER in NLP lingo). Another step may be to rewrite clauses joined by "and" so as to create facts by repeating the grammatical elements that are implied. For example, maybe the John Grisham example above came from a text excerpt like:

  Author J. Grisham was born in Arkansas. He wrote "A Time to Kill" in 1989 and "The Pelican Brief" in 1992.

Getting to john-grisham/agent wrote/verb the-pelican-brief/object implies (among other things):

- Identifying "J. Grisham" and "The Pelican Brief" as specific entities.
- Replacing "He" by "john-grisham" in the 2nd sentence.
- Rewriting the 2nd sentence as two facts: "john-grisham wrote a-time-to-kill in 1989" and "john-grisham wrote the-pelican-brief in 1992".
- Dropping the "in 1992" part (or, better yet, creating another fact, a "time fact": "the-pelican-brief/agent is-related-in-time/verb year-1992/object"). (BTW, this would also imply having identified 1992 as being a time entity of type "year".)
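The steps listed above can be sketched in Python for this one excerpt. Everything here is a toy under stated assumptions: the `GAZETTEER` lookup stands in for real named entity recognition, `resolve_pronouns` assumes there is only one person in the text, and the regular expressions only handle the "X wrote A in YEAR and B in YEAR" shape. None of this generalizes; it only shows how the steps feed one another.

```python
import re

# Hypothetical gazetteer standing in for NER: surface form -> (canonical id, kind).
GAZETTEER = {
    "J. Grisham": ("john-grisham", "person"),
    '"A Time to Kill"': ("a-time-to-kill", "book"),
    '"The Pelican Brief"': ("the-pelican-brief", "book"),
}

def tag_entities(text):
    """NER stand-in: replace known surface forms with their canonical ids."""
    for surface, (entity_id, _kind) in GAZETTEER.items():
        text = text.replace(surface, entity_id)
    return text

def resolve_pronouns(text):
    """Co-reference stand-in: assume 'He' refers to the only known person."""
    person = next(eid for eid, kind in GAZETTEER.values() if kind == "person")
    return re.sub(r"\bHe\b", person, text)

def extract_facts(text):
    """Split 'X wrote A in Y1 and B in Y2.' into write facts and time facts."""
    facts = []
    for agent, rest in re.findall(r"(\S+) wrote (.+?)\.", text):
        for clause in rest.split(" and "):
            match = re.fullmatch(r"(\S+) in (\d{4})", clause)
            if match:
                work, year = match.groups()
                facts.append((agent, "write", work))
                facts.append((work, "is-related-in-time", f"year-{year}"))
    return facts

excerpt = ('Author J. Grisham was born in Arkansas. '
           'He wrote "A Time to Kill" in 1989 and "The Pelican Brief" in 1992.')
extracted = extract_facts(resolve_pronouns(tag_entities(excerpt)))
for fact in extracted:
    print(fact)
```

Note that entity tagging runs first on purpose: once "J. Grisham" has been replaced by "john-grisham", the period in the initial no longer gets confused with the end of a sentence.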

In a nutshell: information extraction is a complicated task, even when applied to relatively limited domains and when leveraging existing NLP functions available in a library. It is a much "messier" activity than merely telling nouns from adjectives and verbs ;-)

.net sql-server nlp artificial-intelligence question-answering
