From free text to clusters of content in health records: an unsupervised graph partitioning approach

Healthcare records contain rich unstructured data in different forms. Free text constitutes a large portion of such data, yet this source of highly detailed information often remains under-used (i.e., either partially read manually or ignored) because of a lack of suitable methodologies to extract interpretable content.
Here we present the application of network-theoretical tools to the analysis of free text in Patient Incident records from the National Health Service (NHS), as reported by hospitals across England and Wales since 2004 via the National Reporting and Learning System (NRLS). Our aim is to find clusters of incident reports in an unsupervised manner, based directly on the free text contained within them, in order to provide alternative, intrinsic classifications of the records that emanate from their content. To do so, we introduce a network-based framework for the unsupervised clustering of text documents, which combines recently developed deep neural-network methods for high-dimensional text embedding with multi-scale Markov Stability community detection applied to a similarity graph of documents obtained from sparsified text vector similarities.
We showcase the approach with the analysis of a dataset of patient incident reports from the NHS. First, we use the 13 million records collected by the NRLS since 2004 to train our text embedding. Then we analyse the subset of 3229 records collected in St Mary’s Hospital in London over three months in 2014 to extract clusters of incidents at different levels of resolution in terms of content. Our method reveals the multiple levels of intrinsic structure in the topics of the dataset, as shown by the extraction of relevant word descriptors from the grouped records. We also carried out an a posteriori comparison against hand-coded categories assigned by healthcare personnel. Several of our clusters of content exhibit good correspondence with well-defined hand-coded categories, yet our results also provide a distinct level of resolution in certain areas and reveal complementary categories of incidents not defined in the external classification. Finally, we discuss how our method can be used to monitor health incident reports over time and to compare across healthcare providers with different patterns of incidents.
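As a rough illustration of the pipeline described above (document vectors, a sparsified similarity graph, then graph partitioning), the following minimal Python sketch may help. It is not the authors' implementation: the vectors in `doc_vectors` are made-up toy values standing in for learned text embeddings, the k-nearest-neighbour sparsification is only one possible choice, and connected components are used as a deliberately simple stand-in for multi-scale Markov Stability community detection. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(vectors, k):
    """Sparsify the full similarity matrix: link each document to its
    k most similar documents, then symmetrise the edges."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        sims = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Stand-in partitioning step (NOT Markov Stability): group documents
    that are reachable from one another in the sparsified graph."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Toy "embeddings" (assumed values): two groups of three similar documents.
doc_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.0, 0.2, 1.0],
]

graph = knn_graph(doc_vectors, k=2)
clusters = connected_components(graph)
print(clusters)
```

In this toy setting the two groups of similar vectors end up in separate clusters. A multi-scale method such as Markov Stability would instead yield a sequence of partitions at different resolutions, which connected components cannot provide.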

M. Tarik Altuncu, Erik Mayer, Sophia N. Yaliraki and Mauricio Barahona
Tuesday, September 25, 2018 - 17:30 to 17:45

