Domain-specific classifiers

This Work Package deals with an array of NLP tasks: language identification, use of ontologies, entity recognition, categorization and use of rule engines. Together, these components fall under the general term "information extraction", and they interact: language identification triggers the NLP engine for entity extraction, links the lexemes in a text to ontological instances, and serves as a feature in categorization. Entity extraction relies on ontological instances linked to the lexemes in a text in order to achieve higher accuracy and a more focused identification of entities. The NLP features together form part of the feature set used by the categorization engine to build its statistical vectors. These tasks are critical for the project, as they are the first building blocks in the process of identifying radical and terrorist messages on the Internet.
For this work package, an extensive terrorism and radicalism ontology has been enhanced, building on the core IP of INT. The language identification models remain those used by the INT technology. Entity recognition has been re-trained during the project in order to improve accuracy. Finally, the task of generating classification models for textual terrorist content has been completed (i.e. the models do not cover audio, image or video sources). The models enable the platform to perform two levels of categorization, allowing effective triage of the large quantities of data that will be input into the system. These levels are:
A language-dependent "domain detector", based on features of lexical occurrences in texts reduced to their stems and lemmas, which determines whether a text conforms to any of the general domains of interest to the platform and allows the platform to refrain from further categorization of texts outside the domain of interest (terrorism);
A language-independent "categorization engine", which uses ontological features generated by the NLP engine to perform deep categorization of a text according to a number of categorization clusters (e.g. violence level, radicalism level, political-ideological affiliation, religious affiliation, document type, theatre, etc.).
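The stem/lemma feature reduction underlying the domain detector can be sketched as follows. The crude suffix-stripping stemmer and the example sentence are hypothetical illustrations, not the project's actual linguistic components, which use full stemming and lemmatization per supported language.

```python
import re
from collections import Counter

def toy_stem(token: str) -> str:
    """Very crude suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ations", "ation", "ings", "ing", "ies", "ers", "er", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_features(text: str) -> Counter:
    """Reduce a text to a bag of stems: the feature space the domain detector scores."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(t) for t in tokens)

feats = stem_features("Bombings and a bombing attack were reported; the attackers fled.")
# "bombings" and "bombing" collapse to the stem "bomb",
# "attack" and "attackers" collapse to the stem "attack".
```

Collapsing inflected forms onto a shared stem is what makes lexical-occurrence counts usable as classifier features despite surface variation.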
Hence any new document input into the system will first be triaged by the domain detector to determine whether it warrants further processing; the relevant documents will then be sent to the categorization engine for deeper classification.
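The two-stage triage flow above can be sketched as a simple gate. The domain lexicon, the threshold value, and the stage-2 stub are hypothetical placeholders for the platform's trained models, shown only to make the control flow concrete.

```python
from collections import Counter

# Hypothetical domain lexicon of stems consumed by the stage-1 domain detector.
DOMAIN_STEMS = {"attack", "bomb", "radical", "recruit", "martyr"}

def domain_score(stems: Counter) -> float:
    """Fraction of the text's token mass that falls inside the domain lexicon."""
    total = sum(stems.values())
    if total == 0:
        return 0.0
    return sum(c for s, c in stems.items() if s in DOMAIN_STEMS) / total

def categorize(stems: Counter) -> dict:
    """Stage-2 stub: the real engine uses ontological features from the NLP engine."""
    return {"violence_level": "unknown", "document_type": "unknown"}

def triage(stems: Counter, threshold: float = 0.05):
    """Stage 1 gates the document; only in-domain texts reach stage 2."""
    if domain_score(stems) < threshold:
        return None  # out of domain: no further processing
    return categorize(stems)
```

For example, `triage(Counter({"bomb": 2, "report": 1}))` passes the gate and returns a category dictionary, while `triage(Counter({"weather": 5}))` is discarded as out of domain.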
The models have been generated on the basis of approximately 100,000 texts of all types: short texts, social media, formal articles, and texts in different domains (political, religious, financial, technical, etc.), in a number of languages currently supported by the NLP engine. Testing for accuracy reveals a high level of precision and recall (F-Score) on large texts. The output currently consists of .ps files, which will be converted into JSON format.
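The F-Score cited above combines precision and recall as their harmonic mean. The counts in this sketch are illustrative only, not the project's evaluation figures.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision (tp/(tp+fp)) and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 90 true positives, 10 false positives, 10 false negatives
# give precision = recall = 0.9, hence F1 = 0.9.
print(round(f_score(90, 10, 10), 3))  # 0.9
```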

D2.1 – Domain Specific Classifiers FINAL