Linguistic feature extraction tool

This document, Deliverable 2.4 “Linguistic Features Report” (henceforth D2.4), reports on Task 2.3 “Implement linguistic feature identification and extraction” (henceforth T2.3). T2.3 integrates a set of methods for extracting features, or attributes, from the text of social media content. To this end, the task includes selected Natural Language Processing (NLP) methods, ranging from low-level ones (whether the message includes hashtags, exclamation marks, etc.) to higher-level ones (entity extraction, topic classification, etc.).
The main objective of the Project is to understand and detect how terrorism-related content spreads across social media platforms, and the analysis of text content is one of its main features. The figure below shows the diagram of the whole process (see Deliverable 1.7 “System architecture” for a more detailed description of the modules).

The result of T2.3 is a set of new features extracted from the text in order to improve the Deep Learning methodology that classifies the text according to its level of suspiciousness (included in the NLP module, Task 2.2). These new features enrich the text analysis by identifying text properties that help detect suspicious messages.
For example, messages supporting ISIS often use pronouns to set “we” (the real Muslims) against “they” (the other). Using pronouns in this way can therefore be an “indicator” of affiliation with this group.
Similarly, other properties such as text length, readability, use of articles, etc. allow us to understand and identify radical messages. In terms of technical development, this task will be integrated into the NLP module.
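As an illustration of the kind of properties discussed in this section, the following minimal Python sketch counts hashtags, exclamation marks, in-group and out-group pronouns, text length, and the share of articles in a message. It is not the implementation used in the NLP module: the function name `extract_features`, the feature names, the word lists, and the simple regular-expression tokenizer are all assumptions made for the example, and it handles English text only. Readability is left out because it depends on the choice of a specific readability formula.

```python
import re

# Hypothetical word lists, chosen only for illustration (English only).
IN_GROUP = {"we", "us", "our", "ours"}
OUT_GROUP = {"they", "them", "their", "theirs"}
ARTICLES = {"a", "an", "the"}


def extract_features(text: str) -> dict:
    """Return a few simple linguistic features of a message."""
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    n_tokens = len(tokens)
    n_articles = sum(t in ARTICLES for t in tokens)
    return {
        # Text length, in characters and in tokens.
        "n_chars": len(text),
        "n_tokens": n_tokens,
        # Low-level surface markers.
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_exclamations": text.count("!"),
        # Pronoun usage ("we" against "they").
        "in_group_pronouns": sum(t in IN_GROUP for t in tokens),
        "out_group_pronouns": sum(t in OUT_GROUP for t in tokens),
        # Share of tokens that are articles (0.0 for empty text).
        "article_ratio": n_articles / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    example = "We stand together, they will not divide us! #unity"
    print(extract_features(example))
```

In a setting like the one described here, a dictionary of this kind could be turned into a numeric vector and supplied to the classifier alongside the text itself; how the features are actually combined with the Deep Learning model is not specified in this section.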

The objective of T2.3 is to improve the accuracy of the Machine Learning model by adding new features that give more information about the text content of the message being analyzed.
In the context of the project, T2.3 is closely related to Task 2.2 (Deep Learning methodologies). Both will be part of the NLP module, which analyzes text and outputs the threat score, a score reflecting the suspiciousness of the text.
