Recognising and classifying unique identifiers in unstructured text as named entities using NLP
integrating LSTM and CNN layers for entities with limited structure and context
Summary
Information can be exchanged in many ways. The standard format for textual information is the document, which comes in a practically unlimited variety of structures. The Ministry of Internal Affairs and Kingdom Relations handles an overwhelming number of documents on a daily basis and uses a rule-based approach to find and capture the phone numbers, email addresses, and other contact details in a given document. This brings structure to the text and makes it possible to find the necessary information quickly, which matters all the more because the Ministry receives a wide variety of documents. The named entities in these documents, however, are recorded in many different formats, which made them increasingly difficult to capture as exceptions arose. That is, a named entity whose format does not conform to, or deviates too much from, its internationally established format cannot be found using traditional rule-based methods without excessive manual work. The hypothesis was that a machine learning (ML) approach could automatically learn and capture new formats, owing to its ability to adapt and generalise to new data. The first challenge was obtaining training data for the ML model, since no internal data was made available for privacy reasons. Textual data related to the individual named entities was collected by scraping and diversified using four data augmentation techniques: synonym replacement, random insertion, random swap, and random deletion. The named entities were annotated using lookup tables and regular expressions, which were treated as verifiable ground truth. This ensured that the vast majority of named entities in the dataset were annotated. To recognise and classify the named entities in text, both contextual and syntactical features were considered. The named entities concerned have a strict pattern that can mostly be identified by syntactical features.
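The annotation and augmentation steps described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the two regular expressions, the BIO tagging scheme, the whitespace tokenisation, and the function names `annotate` and `random_deletion` are all assumptions made for the example. Restricting random deletion to non-entity tokens is likewise an assumed design choice that keeps the annotated spans intact.

```python
import random
import re

# Hypothetical patterns for illustration; the thesis does not list its own.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}


def annotate(text):
    """Return (token, tag) pairs, using regex matches as weak ground truth."""
    # Character spans of every regex match in the text.
    spans = [
        (m.start(), m.end(), label)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    tagged = []
    for tok in re.finditer(r"\S+", text):  # whitespace tokenisation (assumed)
        tag = "O"
        for start, end, label in spans:
            # A token belongs to an entity if it overlaps the matched span.
            if tok.start() < end and tok.end() > start:
                prefix = "B-" if tok.start() <= start else "I-"
                tag = prefix + label
                break
        tagged.append((tok.group(), tag))
    return tagged


def random_deletion(tagged, p, rng):
    """Drop each non-entity token with probability p; entity tokens are kept."""
    return [(tok, tag) for tok, tag in tagged if tag != "O" or rng.random() >= p]


if __name__ == "__main__":
    sentence = "Call +31 6 1234 5678 or mail info@example.org today"
    labelled = annotate(sentence)
    print(labelled)
    print(random_deletion(labelled, p=0.3, rng=random.Random(0)))
```

Because the regular expressions only fire on formats they already know, the labels are reliable for the base cases. This is why the augmented context around the entities, rather than the entity strings themselves, is what adds variety to the training data in this sketch.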
If a named entity does not conform to its respective pattern, contextual features help in recognising, and primarily in classifying, the named entity by its context. A bi-LSTM-CNN-CRF model was developed as a solution. The bi-LSTM core architecture considers the context from both directions and can therefore determine whether the context relates to a named entity. The convolution layer takes a matrix representation of an individual token and applies several filters followed by max pooling, resulting in feature maps that capture the syntactic pattern, or format, of a potential named entity. The CRF layer adds self-learned constraints to better classify named entities that span multiple separate tokens. The model achieved an average F1 score of 0.95 on the test subset, whereas the rule-based approach achieved an average F1 score of 0.88. The model is better at recognising and classifying the edge cases, which was the primary goal, but not the base cases. Therefore, the newly developed solution should be combined with the rule-based approach rather than replace it. To create a single, all-encompassing model, future work should weight the syntactical features more heavily and expand their number.
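The convolution and max-pooling step over a single token can be illustrated with a small pure-Python sketch. Everything specific here is an assumption made for the example: the three character classes used to build the token matrix, the two hand-set filters, and the names `encode`, `conv_max`, and `features`. In the actual model the filters would be learned during training, and the abstract does not specify how the token matrix is constructed.

```python
# Each character becomes a one-hot row over three assumed classes:
# column 0 = digit, column 1 = letter, column 2 = anything else.
def encode(token):
    rows = []
    for ch in token:
        if ch.isdigit():
            rows.append([1.0, 0.0, 0.0])
        elif ch.isalpha():
            rows.append([0.0, 1.0, 0.0])
        else:
            rows.append([0.0, 0.0, 1.0])
    return rows


def conv_max(matrix, kernel):
    """Slide the kernel over the character rows and max-pool the responses."""
    width = len(kernel)
    channels = len(kernel[0])
    # Zero-pad tokens that are shorter than the kernel.
    padded = matrix + [[0.0] * channels for _ in range(max(0, width - len(matrix)))]
    responses = []
    for i in range(len(padded) - width + 1):
        responses.append(
            sum(
                padded[i + j][c] * kernel[j][c]
                for j in range(width)
                for c in range(channels)
            )
        )
    return max(responses)


# Hand-set filters for illustration only; a trained CNN learns its own.
FILTERS = {
    "three_digits": [[1, 0, 0], [1, 0, 0], [1, 0, 0]],
    "letter_symbol_letter": [[0, 1, 0], [0, 0, 1], [0, 1, 0]],
}


def features(token):
    """One pooled value per filter: a fixed-size summary of the token's shape."""
    matrix = encode(token)
    return {name: conv_max(matrix, kernel) for name, kernel in FILTERS.items()}


if __name__ == "__main__":
    for token in ("0612345678", "info@example.org", "06-12"):
        print(token, features(token))
```

Max pooling makes each feature independent of where in the token the pattern occurs and of the token's length, which is what lets a fixed-size feature vector describe formats of varying length.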
Organisation | Fontys |
Programme | HBO-ICT |
Department | Fontys ICT |
Partner | Ministerie van Binnenlandse Zaken en Koninkrijksrelaties |
Year | 2020 |
Type | Bachelor |
Language | English |