8th Biennial ACSPRI Social Science Methodology Conference

Gorkem Sezgin

Gorkem Sezgin is a healthcare researcher with extensive experience in applying innovative and advanced techniques to extract knowledge from big data in the areas of healthcare systems and delivery. Past research experience includes linking components of large healthcare datasets from hospitals and general practices, with a proven record of publications based on linked datasets. Further skills include developing statistical modelling techniques to describe outcomes.


Session

Thursday 24th November 2022
14:00
15min
Linking Hospital Emergency and Inpatient Admissions for secondary data analysis: a case study using Natural Language Processing
Gorkem Sezgin

Background: Hospital admission records are a rich data resource for healthcare research, providing direct insight into processes and procedures whilst being resilient to the biases and limitations that afflict other sampling methods. In Australian hospitals, most data records are standardised or otherwise classified using internationally established conventions (e.g., the World Health Organisation's International Classification of Diseases), thereby providing a robust data source for research. However, hospital admission records are not centrally stored: emergency and inpatient datasets are held separately, with different structures and frameworks. Pre-processing steps are therefore needed before hospital records data can be used to report outcomes. Here, we homogenise and link emergency and inpatient admission datasets and apply natural language processing to the linked datasets to create predictive models for patient length of stay and readmission.

Methods: The dataset contains emergency and inpatient hospital admission records from two local health districts (South Eastern Sydney and Illawarra Shoalhaven Local Health Districts) between 2020 and 2021. Both datasets were configured to a patient-admission level by reshaping them so that the diagnostic records for each admission expand across a row rather than down a column. A custom algorithm was created to link the reshaped datasets, using de-identified patient IDs as the key and matching overlapping admission and departure/discharge date-times. Two outcome variables were generated for natural language processing: one indicating whether the patient was readmitted within 28 days, and another indicating whether the patient was admitted for more than one day. Diagnostic records from the emergency and inpatient datasets, together with the age and gender of the patient, were used as inputs to natural language processing models (random forest classification with TF-IDF and word2vec vectors) to predict the outcomes. Stata MP 15.1 was used to pre-process the datasets, and Python was used to link the datasets at a patient-admission level and run the natural language processing algorithm. The study was conducted under ethical approval from the South Eastern Sydney Local Health District Human Research Ethics Committee (HREC/16/POWH/412) and Macquarie University, and funded under a National Health and Medical Research Council Partnership Project (1111925).
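The abstract does not give the details of the custom linkage algorithm, so the following is only a minimal sketch of the general idea (same de-identified patient ID, overlapping date-time intervals), written in Python with the standard library. The `Episode` class, its field names, the `link_episodes` function, the inclusive overlap rule, and the sample records are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Episode:
    """One emergency or inpatient episode (hypothetical field names)."""
    patient_id: str   # de-identified patient ID used as the linkage key
    start: datetime   # arrival / admission date-time
    end: datetime     # departure / discharge date-time


def overlaps(a: Episode, b: Episode) -> bool:
    """Return True when the two intervals share at least one instant.

    Treating the end points as inclusive is an assumption of this sketch.
    """
    return a.start <= b.end and b.start <= a.end


def link_episodes(emergency, inpatient):
    """Pair each emergency episode with overlapping inpatient episodes
    belonging to the same patient."""
    # Index inpatient episodes by patient ID so each lookup is cheap.
    by_patient = {}
    for ip in inpatient:
        by_patient.setdefault(ip.patient_id, []).append(ip)

    links = []
    for ed in emergency:
        for ip in by_patient.get(ed.patient_id, []):
            if overlaps(ed, ip):
                links.append((ed, ip))
    return links


# Made-up records, for illustration only.
emergency = [
    Episode("P1", datetime(2020, 3, 1, 8, 0), datetime(2020, 3, 1, 12, 0)),
    Episode("P2", datetime(2020, 3, 5, 9, 0), datetime(2020, 3, 5, 11, 0)),
]
inpatient = [
    Episode("P1", datetime(2020, 3, 1, 11, 30), datetime(2020, 3, 4, 10, 0)),
    Episode("P2", datetime(2020, 4, 1, 9, 0), datetime(2020, 4, 2, 9, 0)),
]

linked = link_episodes(emergency, inpatient)
```

In this toy data only patient P1 is linked, because P2's inpatient stay begins weeks after the emergency visit ends. A real linkage would likely also need a tolerance for small gaps between emergency departure and inpatient admission, which this sketch leaves out.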

Results: Without the emergency dataset linked, the TF-IDF model predicted readmission with 96% precision and 76.2% recall. Linkage increased the precision to 96.4% and the recall to 76.3%. The unlinked word2vec model had a precision of 96.7% and a recall of 74.8%; after linkage, precision increased to 97.1% and recall fell slightly to 74.6%. For predicting whether the patient would be admitted for more than one day, the unlinked TF-IDF model had 86.2% precision and 89.7% recall, which increased to 87.4% precision and 90.5% recall after linkage. The unlinked word2vec model had 81.9% precision and 89% recall, and the linked model had 83.4% precision and 88.6% recall.
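For readers less familiar with the two metrics reported above, precision is the share of predicted positives that are truly positive, and recall is the share of true positives that the model finds. The short Python sketch below computes both from confusion-matrix counts; the function name `precision_recall` and the counts are illustrative assumptions and are not data from the study.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Compute precision and recall from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts, not taken from the study.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```

With these made-up counts, 90 of the 100 predicted positives are correct (precision 0.90), while 90 of the 120 actual positives are found (recall 0.75). A model can therefore gain precision while losing a little recall, as the linked word2vec results show.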

Conclusion: Hospital admission records provide a rich source of data for secondary data analysis, and pre-processing and linking the different components of a patient’s stay improves predictive modelling. Here we show an improvement in predictive modelling from linking the diagnostic records of inpatient and emergency datasets. Linkage with pathology tests, radiology tests, and medications would further improve predictive models and the reporting of outcomes.

Recording link: https://acspri-org-au.zoom.us/rec/share/sFS4E0Eva6L3CETeEbh9cr8bLW7jHf3BqadVrD3ZBgTXkIpMYjtO_h9UWipxxHi-.ZbxNTlh6Oe1hAEZ5?startTime=1669255431000

Data linkage and modelling
Zoom Breakout Room 3