Text mining with LingPipe

1 Text Mining with LingPipe
Advanced Seminar Information Retrieval, PD Dr. Karin Haenelt, University of Heidelberg
Presentation by Alexander Cap, winter semester 2008/2009

2 Overview
Text Mining: Definition & Demarcation, Tools, Processes & Examples, Applications
LingPipe: The Tool, Functions, Installation, Architecture & Components
Demonstration: Example Applications
Summary
Alexander Cap - Text Mining with LingPipe 2

3 Text Mining: Definition
Alternative terms in the literature: Text Data Mining, Textual Data Mining, or Text Knowledge Engineering.
Freely translated from Wikipedia: text mining refers to the automated discovery of relevant information in text data.
It is the process of compiling, organizing and analyzing large document collections in order to extract information and to discover hidden relationships between texts and text fragments.
In short: gaining knowledge from texts.

4 Text Mining: Demarcation
From information retrieval (IR): in response to a search query, IR returns those documents from a collection that are relevant to answering the question. In contrast to text mining, it retrieves entire documents rather than individual pieces of information or facts.
From information extraction (IE): IE aims to extract individual facts from texts and to present them in a schema. In contrast to text mining, at least the categories for which information is sought are known in advance (the user knows what he does not know).
From data mining: the two share many methods but not their subject matter. Data mining operates on highly structured data (e.g. in relational databases); text mining operates on texts, i.e. unstructured or weakly structured data.

5 Text Mining Tools
Text mining tools extract information from texts that should enable users to expand their knowledge.
In the best case, and in interaction with their users, they provide information or relationships that the users did not previously know existed.
Text mining tools can also generate hypotheses, check them, and refine them step by step.

6 Text Mining Processes (1)
The actual text mining processes can only ever build on data that has already been at least partially analyzed.
Essential tasks: making explicit the information that is only implicit in texts, and making visible the relationships between pieces of information that appear in different texts.

7 Text Mining Processes (2)
Text mining processes work with statistical and linguistic means: methods of logical reasoning, and methods of exploratory data analysis, i.e. the examination and assessment of data about whose interrelationships little is known.
Machine learning plays a major role in the development of such methods.

8 Text Mining Examples (1)
Search for (lexical) associations in texts and rate them by their strength. Example: a strong association between the name of a drug and negative expressions suggests that the drug has a bad reputation (according to the texts).
Recognizing associations presupposes that problems of synonymy (sameness of meaning) and polysemy (ambiguity) in natural-language texts have largely been resolved.
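
The association strength described above can be illustrated with pointwise mutual information (PMI), one common association measure. The slides do not say which measure is meant, so PMI is an assumption here. The sketch is plain Python rather than LingPipe code, and the corpus, including the drug name "cardiozol", is invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Return PMI (base 2) for every word pair that co-occurs in a document."""
    n_docs = len(documents)
    word_counts = Counter()   # number of documents containing each word
    pair_counts = Counter()   # number of documents containing both words
    for text in documents:
        words = set(text.lower().split())  # presence per document, not frequency
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, joint in pair_counts.items():
        a, b = tuple(pair)
        p_joint = joint / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        scores[pair] = math.log2(p_joint / (p_a * p_b))
    return scores

# Hypothetical mini-corpus; "cardiozol" is an invented drug name.
docs = [
    "cardiozol caused severe nausea",
    "cardiozol linked to severe headaches",
    "patients report severe cardiozol reactions",
    "the weather was pleasant today",
    "the new clinic opened today",
]
scores = pmi_scores(docs)
strong = scores[frozenset(("cardiozol", "severe"))]
```

A positive score means the two words appear together more often than their individual frequencies would predict. On real text, synonyms and ambiguous words would first have to be normalized, as the slide notes.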

9 Text Mining Examples (2)
Finding documents that relate to the same subject although their wording differs.
A set of documents is converted into a high-dimensional vector space ([...] dimensions and more) of terms and term frequencies.
The resulting matrix is reduced to a low-dimensional matrix (several hundred dimensions) by singular value decomposition.
Evaluating the relationships between terms in this matrix makes it possible to establish associative relationships between terms; these often correspond to semantic relationships and can be represented in an ontology.
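
A minimal sketch of the first step described above, assuming whitespace tokenization: each document becomes a vector of term frequencies, and documents are compared by cosine similarity. The singular value decomposition step is not shown. Without it, only documents that share terms score as similar, which is the limitation the decomposition is meant to overcome. This is plain Python, not LingPipe code, and the sample documents are invented.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a document to its term-frequency vector (sparse, as a Counter)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents for illustration only.
docs = {
    "a": "the car engine needs repair",
    "b": "engine repair for the car",
    "c": "the senate passed the budget",
}
vectors = {name: term_vector(text) for name, text in docs.items()}
sim_ab = cosine(vectors["a"], vectors["b"])  # same subject, shared terms
sim_ac = cosine(vectors["a"], vectors["c"])  # different subject
```

Here `sim_ab` is higher than `sim_ac` because documents "a" and "b" share four of their five terms, while "a" and "c" share only "the".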

10 Text Mining Applications
Most applications try to identify patterns and trends in text data. Planned and in some cases already implemented applications:
a service that tracks how the reputation of companies and products develops, based on relevant contributions in newsgroups, weblogs, etc.;
a service that determines whether certain pharmaceutical product developments have already been attempted, and whether these attempts were successful or why they failed;
monitoring of internal company networks (extrusion prevention) to ensure that no confidential data leaves the company;
intelligence-service monitoring of the media, states, minorities and other groups of people;
spam checkers (software that can distinguish advertising from meaningful or desired mail);
search engines for research of all kinds.

12 LingPipe: The Tool
LingPipe is a collection of Java libraries for the linguistic analysis of human language.
Free license for research purposes (requires that any software and processed data be made freely available).
Commercial license from $9500/year (for startups) and $[...] per server for companies (also includes support and upgrades).
Robust enough for commercial use (e.g. used by the US administration and the military).
Flexible enough to be of interest to researchers (various universities).

13 LingPipe Functions (1)
Finding occurrences of entities: people, places, companies, biomedical terms (e.g. organisms and genes).
Realized through: training of statistical models, comparison with dictionaries, regular expressions.
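
Two of the three approaches named above, dictionary lookup and regular expressions, can be sketched in a few lines. This is an illustration in plain Python, not LingPipe's API. The dictionary entries, the entity labels, and the year pattern are all assumptions chosen for the example.

```python
import re

# Hypothetical gazetteer; a real system would load a much larger dictionary.
DICTIONARY = {
    "Heidelberg": "LOCATION",
    "Karin Haenelt": "PERSON",
    "BRCA1": "GENE",
}

# Illustrative pattern: a four-digit year between 1900 and 2099.
PATTERNS = [(re.compile(r"\b(?:19|20)\d{2}\b"), "YEAR")]

def find_entities(text):
    """Return (start, end, label, surface) tuples sorted by start offset."""
    found = []
    for phrase, label in DICTIONARY.items():
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            found.append((m.start(), m.end(), label, m.group()))
    for pattern, label in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)

text = "Karin Haenelt taught in Heidelberg in 2008."
entities = find_entities(text)
```

Dictionary and pattern matching only find what was anticipated in advance. The statistical models mentioned on the slide are what allow unseen names to be recognized.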

14 LingPipe Functions (2)
Classification of documents or text passages by language, character encoding, genre, topic, and sentiment.
Sentiment analysis covers the separation of subjective opinions from objective facts and the differentiation between positive and negative contributions.
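
As an illustration of positive/negative classification, here is a minimal multinomial naive Bayes sketch with add-one smoothing. Naive Bayes is a stand-in chosen for brevity. The slides do not say which classifier LingPipe uses, and the training sentences and the `NaiveBayes` class are invented for this example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data.
nb = NaiveBayes()
nb.train("great product works well", "positive")
nb.train("really good and reliable", "positive")
nb.train("terrible product broke quickly", "negative")
nb.train("bad quality and unreliable", "negative")
```

The same structure would serve for topic or language classification by changing the labels and training texts. Only the features (words here) would need rethinking, for example character n-grams for language identification.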

15 LingPipe Functions (3)
Uncovering relations between entities and actions.
Spelling correction based on a given text collection.
Clustering of documents by topic and uncovering trends over time.
Linking the results with database entries.
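
Spelling correction driven by a text collection can be sketched as follows: words seen in the collection count as correct, and an unknown word is replaced by the most frequent known word within one edit. This is a simplified, well-known textbook approach in plain Python. It is an assumption for illustration and not a description of LingPipe's own spell checker.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better is known; leave the input unchanged
    return max(candidates, key=lambda w: (counts[w], w))

# Hypothetical text collection that defines the known vocabulary.
corpus = "text mining extracts knowledge from text collections and text archives"
counts = Counter(corpus.split())
```

Because the vocabulary comes from the collection itself, domain terms that a general dictionary would flag are accepted as long as they occur in the texts.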

16 LingPipe Functions (4)
Sentence recognition.
Splitting sentences into their parts (tokenization).
Identifying parts of speech (part-of-speech tagging).
Finding important terms and sentences, e.g. based on their frequency across different documents.
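
Sentence recognition and tokenization can be approximated with two regular expressions. The sketch below is plain Python with deliberately naive rules, both of which are assumptions: a sentence ends at ".", "!" or "?" followed by whitespace and a capital letter, and a token is either a run of word characters or a single punctuation mark. LingPipe's own sentence models and tokenizers are more elaborate.

```python
import re

# A token is a run of word characters or one non-space punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

# Split at whitespace that follows ., ! or ? and precedes a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)

text = "LingPipe runs on Java. It splits text into sentences! Does it tag words?"
sentences = split_sentences(text)
tokens = tokenize(sentences[1])
```

The boundary rule breaks on abbreviations such as "Dr. Haenelt", which is one reason practical systems use trained or hand-tuned sentence models instead.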

17 LingPipe Installation
LingPipe Core is available as a precompiled .jar archive (version [...], status: [...]).
It runs on every platform with a Java Virtual Machine (version 1.4 or higher).
Control is via the command line (GUI implementations are possible).
Live web demos are available.

18 LingPipe Data Formats
LingPipe accepts the following input formats: Unicode, HTML, XML, plain text.
The output is always XML.

19 LingPipe Architecture
According to the developer, the architecture is designed so that LingPipe is efficient, scalable, reusable and robust.
Users confirm that LingPipe is fast enough on ordinary PCs for real-time analysis; up to [...] words per second can be processed.
Architecture facts: Java API with source code and unit tests; models for many languages, application areas and genres; training for new languages and new tasks is possible.

20 LingPipe Components (1)
The basic components of LingPipe are the models.
They represent and specify the elements of the computational-linguistic analysis, are loaded at runtime, and form the core of the language processing.
Each model is specified by task, language, genre, and training corpus.

21 LingPipe Components (2)
Models included in the basic package:
Part-of-speech tagging (English):
  General: pos-en-general-brown.hiddenmarkovmodel
  Biomedicine: pos-en-bio-genia.hiddenmarkovmodel
  Biomedicine: pos-en-bio-medpost.hiddenmarkovmodel
Entity recognition (English):
  News: ne-en-news-muc6.abstractcharlmrescoringchunker
  Genes: ne-en-bio-genetag.hmmchunker
  Genomics: ne-en-bio-genia.tokenshapechunker
Word segmentation for Chinese:
  Academia Sinica, SIGHAN 2005: words-zh-as.compiledspellchecker
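
The part-of-speech models listed above are hidden Markov models, as their file names indicate. A standard way to tag with an HMM is Viterbi decoding, sketched below in plain Python. The tag set and all probabilities are invented toy values, not taken from any LingPipe model, and the function assumes a non-empty word list.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under the given HMM."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy model: all numbers are illustrative assumptions.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "park": 0.3},
    "VERB": {"walk": 0.3, "barks": 0.4, "runs": 0.3},
}
tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
```

The word "walk" can be a noun or a verb in this toy model. After "the", the high DET-to-NOUN transition probability makes the decoder choose NOUN, which shows how context resolves the ambiguity.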

22 LingPipe Packages
Packages for the individual functions: (term) classification, (term) clustering, sentence recognition, tokenization, etc.
Supporting packages: general Java utilities, statistical methods, matrices and vectors, input/output processing, XML processing.

24 Demo: Sentence Recognition

25 Demo: Finding Entities

26 Demo: Part-of-Speech Tagging

28 Summary
Text mining is a means of deriving connections and knowledge from the flood of information (on the Internet).
LingPipe as a text mining tool: a considerable range of functions plus extensibility; scientifically founded methods; robust, efficient and flexible; well documented and therefore easy to use.

29 Thank you for your attention!

30 Sources
Gerhard Heyer, Uwe Quasthoff, Thomas Wittig (2006). Text mining: text as a raw material for knowledge - concepts, algorithms, results. W3L Verlag, Herdecke, Bochum.
Jürgen Franke, Gholamreza Nakhaeizadeh, Ingrid Renz (2003). Text Mining - Theoretical Aspects and Applications. Physica-Verlag, Berlin.
Bastian Buch (2008). Text mining for automatic knowledge extraction from unstructured text documents. VDM Verlag.
Text Mining - Overview: (Visited:)
LingPipe v3.7.0: (Visited:)
LingPipe API: