Automatic taxonomy construction

Automatic taxonomy construction (ATC) is the use of software programs to generate taxonomical classifications from a body of texts called a corpus. ATC is a branch of natural language processing, which in turn is a branch of artificial intelligence.

A taxonomy is a model used to organize and index knowledge (stored as documents, articles, videos, etc.) so that users can find the information they are searching for. Taxonomies are typically tree-structured and divide a domain into categories based on the value of properties called taxa. Taxonomies were first used by biologists to group organisms into categories such as kingdom, phylum, genus, and species. Taxonomies are often represented as is-a hierarchies in which each level is more specific than (in mathematical language, "a subset of") the level above it. For example, a basic biology taxonomy would have concepts such as mammal, which is a subset of animal, and dog and cat, which are subsets of mammal. This kind of taxonomy is called an is-a model because the specific objects are considered instances of a concept. For example, Fido is an instance of the concept dog (Fido is-a dog) and Fluffy is-a cat.^[1]
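
The is-a hierarchy from the biology example can be sketched as a small tree. This is a minimal illustration rather than a standard representation: the `PARENT` map and the `ancestors` and `is_a` helpers are hypothetical names, and the sketch treats "instance of" and "subset of" links uniformly, as the example above does.

```python
# Hypothetical child -> parent map for the biology example.
# Each entry reads "key is-a value".
PARENT = {
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "Fido": "dog",
    "Fluffy": "cat",
}


def ancestors(concept):
    """Return the increasingly general concepts above `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(specific, general):
    """True if `general` lies somewhere above `specific` in the tree."""
    return general in ancestors(specific)


print(ancestors("Fido"))       # ['dog', 'mammal', 'animal']
print(is_a("Fido", "animal"))  # True
print(is_a("dog", "cat"))      # False
```

Because every concept has a single parent, walking upward from any node reaches the root, which is what makes the structure a tree.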

In linguistics, the is-a relation is called hyponymy. Words that describe categories are called hypernyms, and words that are examples of categories are called hyponyms. In the simple biology example, dog is a hypernym and Fido is one of its hyponyms. A word can be both a hyponym and a hypernym: dog is a hyponym of mammal and also a hypernym of Fido. One of the most important tasks in ATC programs is the discovery of hypernym and hyponym relations among words.
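
To illustrate how one word can play both roles, the sketch below indexes a handful of (hyponym, hypernym) pairs and reports the roles each word takes. The pair list and the `build_index` and `roles` helpers are assumptions made for illustration; in an ATC system the pairs would be discovered from a corpus rather than written by hand.

```python
from collections import defaultdict

# Hand-written (hyponym, hypernym) pairs taken from the biology example.
PAIRS = [
    ("dog", "mammal"),
    ("cat", "mammal"),
    ("mammal", "animal"),
    ("Fido", "dog"),
    ("Fluffy", "cat"),
]


def build_index(pairs):
    """Index the pairs in both directions."""
    hypernyms = defaultdict(set)  # word -> more general words
    hyponyms = defaultdict(set)   # word -> more specific words
    for hypo, hyper in pairs:
        hypernyms[hypo].add(hyper)
        hyponyms[hyper].add(hypo)
    return hypernyms, hyponyms


def roles(word, hypernyms, hyponyms):
    """Return the set of roles ('hyponym', 'hypernym') that `word` plays."""
    found = set()
    if hypernyms.get(word):
        found.add("hyponym")   # the word has something more general above it
    if hyponyms.get(word):
        found.add("hypernym")  # the word has something more specific below it
    return found


hypernyms, hyponyms = build_index(PAIRS)
print(sorted(roles("dog", hypernyms, hyponyms)))     # ['hypernym', 'hyponym']
print(sorted(roles("Fido", hypernyms, hyponyms)))    # ['hyponym']
print(sorted(roles("animal", hypernyms, hyponyms)))  # ['hypernym']
```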

The development and maintenance of a taxonomy is a knowledge-intensive task requiring significant time and resources. In addition, domain modelers have their own points of view, which inevitably, even if unintentionally, work their way into the model. ATC uses artificial intelligence techniques to generate a taxonomy for a domain automatically in order to avoid these problems. There are several approaches to ATC. One approach is to use rules to detect patterns in the corpus and use those patterns to infer relations such as hyponymy. Other approaches use machine learning techniques such as Bayesian inference and artificial neural networks.^[2]
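
The rule-based approach can be sketched with a single lexical pattern. The example below assumes one illustrative rule, "X such as Y, Z and W", from which it infers that Y, Z and W are hyponyms of X. The pattern, the sample sentences, and the `extract_pairs` helper are assumptions chosen for illustration and are not taken from any particular ATC system; a practical system would use many patterns together with tokenization, lemmatization, and noun-phrase detection.

```python
import re

# Separators allowed between the listed examples: ", ", " and ", ", and ", " or ", ", or ".
SEP = r"(?:,? and |,? or |, )"

# Illustrative rule: "<hypernym> such as <hyponym>{SEP}<hyponym>..."
# Only single-word terms are handled in this sketch.
SUCH_AS = re.compile(rf"(\w+) such as ((?:\w+{SEP})*\w+)")


def extract_pairs(text):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(SEP, match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs


corpus = (
    "The zoo keeps mammals such as dogs, cats and otters. "
    "Insurers sell products such as annuities and bonds."
)
for hyponym, hypernym in extract_pairs(corpus):
    print(f"{hyponym} is-a {hypernym}")
```

The words are returned exactly as they appear in the text (for example, plural "dogs"), so a real pipeline would normalize them before merging the pairs into a taxonomy.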

ATC systems have been used to generate large ontologies automatically for domains such as insurance and finance. They have also been used to enhance existing large networks such as WordNet to make them more complete and consistent.^[3]^[4]^[5]

References

[1] Brachman, Ronald (October 1983). "What IS-A is and isn't. An Analysis of Taxonomic Links in Semantic Networks". IEEE Computer. 16 (10).
[2] Neshati, Mahmood; Alijamaat, Ali; Abolhassani, Hassan; Rahimi, Afshin; Hoseini, Mehdi (2–5 November 2007). "Taxonomy Learning Using Compound Similarity Measure". IEEE/WIC/ACM International Conference on Web Intelligence. IEEE. doi:10.1109/WI.2007.135. Retrieved 8 March 2017.
[3] Velardi, Paola; Faralli, Stefano; Navigli, Roberto (10 October 2012). "OntoLearn Reloaded: A Graph-based Algorithm for Taxonomy Induction". Computational Linguistics. Association for Computational Linguistics. Retrieved 8 March 2017.
[4] Liu, Xueqing; Song, Yangqiu; Liu, Shixia; Wang, Haixun (12–16 August 2012). "Automatic Taxonomy Construction from Keywords" (PDF). KDD ’12. ACM. Retrieved 7 March 2017.
[5] Snow, Rion; Jurafsky, Daniel; Ng, Andrew. "Semantic Taxonomy Induction from Heterogenous Evidence" (PDF). Stanford University. Retrieved 8 March 2017.
