libraries/libexttextcat: Added (Text Categorization library).

Signed-off-by: Willy Sudiarto Raharjo <willysr@slackbuilds.org>
author: Hunter Sezen 2015-12-08 19:25:33 +0100
committer: Willy Sudiarto Raharjo 2015-12-08 19:59:50 +0100
commit: 004c61c7bdfbd12d9a40803752dc5b85f55903fb (patch)
tree: c31d472203e77e190d856eb001ff3f29354a88a4 /libraries/libexttextcat/README
parent: 4c0a36c790bf8b45635331844324792b872c4a20 (diff)
download: slackbuilds-004c61c7bdfbd12d9a40803752dc5b85f55903fb.tar.gz
1 files changed, 20 insertions, 0 deletions
diff --git a/libraries/libexttextcat/README b/libraries/libexttextcat/README
new file mode 100644
index 0000000000..3b9743c04a
--- /dev/null
+++ b/libraries/libexttextcat/README
@@ -0,0 +1,20 @@
+Libtextcat is a library with functions that implement the
+classification technique described in Cavnar & Trenkle, "N-Gram-Based
+Text Categorization". It was primarily developed for language
+guessing, a task on which it is known to perform with near-perfect
+accuracy.
+     
+The central idea of the Cavnar & Trenkle technique is to calculate a
+"fingerprint" of a document with an unknown category, and compare this
+with the fingerprints of a number of documents of which the categories
+are known. The categories of the closest matches are output as the
+classification. A fingerprint is a list of the most frequent n-grams
+occurring in a document, ordered by frequency. Fingerprints are
+compared with a simple out-of-place metric. See the article for more
+details.
+     
+Considerable effort went into making this implementation fast and
+efficient. The language guesser processes over 100 documents/second on
+a simple PC, which makes it practical for many uses. It was developed
+for use in our webcrawler and search engine software, in which it
+handles millions of documents a day.