Data Cleansing

  • Category Methods
  • Date Published Dec 04, 2018
  • Written by LAITEK

Data migration presents a one-time opportunity to “clean up” legacy image data as it is moved to the new archive. This so-called “Data Cleansing” is offered as an extra-cost option in data migration projects. Exception conditions that are candidates for remediation include:

Misidentified Data

  • Patient demographic data (Name, Patient ID, Birth date, Gender)
  • Exam/Order identification (Accession Number, Requested Procedure ID)

Junk Data

  • Test images
  • Discarded images

Bad Data

  • Unsupported old SOP classes (image types)
  • Corrupted/invalid DICOM objects

Hard Errors

  • Media read errors
  • Missing primary or backup volumes

Some of these exception conditions were errors at the time the data sets were created, and others may have been introduced later. Patient Identification, for example, may be updated as a person’s name changes or when identification errors are discovered, sometimes years after the fact. In some systems, the name-change process leaves the same study in the archive under both the old and new patient names. In addition, data migration may consolidate sets of image data stored with different patient ID systems, and it may be desirable to map patient IDs to a new master ID system when they are migrated.
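One way to picture the mapping to a master ID system is a simple crosswalk table keyed on the issuing system and the legacy patient ID. The sketch below is illustrative only: the site names, ID formats, and the `map_to_master_id` helper are assumptions made for this example, not part of any particular migration product.

```python
# Hypothetical crosswalk: (issuing system, legacy patient ID) -> master patient ID.
CROSSWALK = {
    ("SITE_A", "12345"): "MPI-000001",
    ("SITE_B", "A-778"): "MPI-000001",  # same person, registered at a second site
    ("SITE_B", "A-902"): "MPI-000002",
}


def map_to_master_id(issuer, legacy_id, crosswalk=CROSSWALK):
    """Return the master ID for a legacy ID, or None if no mapping is known."""
    # Normalize whitespace and case so trivially different spellings still match.
    key = (issuer.strip().upper(), legacy_id.strip().upper())
    return crosswalk.get(key)


studies = [
    {"issuer": "SITE_A", "patient_id": "12345"},
    {"issuer": "site_b", "patient_id": " a-778 "},
    {"issuer": "SITE_B", "patient_id": "Z-000"},  # not in the crosswalk
]

for study in studies:
    study["master_id"] = map_to_master_id(study["issuer"], study["patient_id"])

# Unmapped studies are set aside for manual review rather than guessed at.
unmapped = [s for s in studies if s["master_id"] is None]
```

In this sketch the first two studies resolve to the same master ID even though they came from different ID systems, while the third is left unmapped and flagged.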

Matching methods are commonly rule-based, heuristic or probability-based, or a combination of these.

Heuristic and probability-based methods are usually proprietary algorithms that often match names better than rule-based methods. However, this incremental improvement comes at the expense of transparency: it is not known what the “black box” algorithms are doing. Rule-based methods sequentially apply a set of agreed-upon rules to the image and examination inventories. The rules are refined iteratively in consultation with the customer. The advantage of rule-based methods is that they are deterministic, resulting in a documented and agreed-upon set of operations that will be performed on the data as it is migrated.
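A minimal sketch of the rule-based approach follows. It applies an ordered list of rules to one record and logs every change, which is what makes the result reproducible and reviewable. The three rules, the 8-digit padding convention, and the record layout are hypothetical examples; a real rule set would be whatever is agreed with the customer.

```python
import re


def strip_whitespace(record):
    """Remove leading and trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def uppercase_name(record):
    """Store patient names in upper case."""
    return {**record, "PatientName": record["PatientName"].upper()}


def pad_patient_id(record):
    """Hypothetical site rule: purely numeric IDs are left-padded to 8 digits."""
    pid = record["PatientID"]
    if re.fullmatch(r"\d{1,8}", pid):
        pid = pid.zfill(8)
    return {**record, "PatientID": pid}


# The order is part of the agreed rule set; changing it may change the result.
RULES = [strip_whitespace, uppercase_name, pad_patient_id]


def apply_rules(record, rules=RULES):
    """Apply each rule in order; return the cleaned record and an audit log."""
    log = []
    for rule in rules:
        updated = rule(record)
        for key in updated:
            if updated[key] != record.get(key):
                # Record which rule changed which attribute, and how.
                log.append((rule.__name__, key, record.get(key), updated[key]))
        record = updated
    return record, log


cleaned, audit = apply_rules({"PatientName": " Doe^Jane ", "PatientID": "4711"})
```

Because each rule is a plain function with no hidden state, running the same rule list over the same input always yields the same output and the same audit log.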

Dirty Data

The most common data cleansing operations are the correction of images misidentified at the Patient or Study level.

Patient Level cleansing fills in or corrects missing or erroneous patient attributes based on patterns in the other patient attributes, usually matching to an authoritative patient list from the Hospital or Radiology Information System (HIS/RIS).
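As a rough illustration of Patient Level cleansing, the sketch below accepts a match only when the Patient ID agrees with the authoritative list and either the name or the birth date agrees as well, then takes all patient attributes from the authoritative entry. The matching rule, the sample patients, and the function names are assumptions chosen for this example.

```python
# Hypothetical extract of the authoritative HIS/RIS patient list.
MASTER_PATIENTS = [
    {"PatientID": "1001", "PatientName": "DOE^JANE",
     "PatientBirthDate": "19700101", "PatientSex": "F"},
    {"PatientID": "1002", "PatientName": "ROE^RICHARD",
     "PatientBirthDate": "19650515", "PatientSex": "M"},
]


def find_master(record, master=MASTER_PATIENTS):
    """Return the one master patient matching on ID plus name or birth date."""
    hits = [
        p for p in master
        if record.get("PatientID") == p["PatientID"]
        and (record.get("PatientName") == p["PatientName"]
             or record.get("PatientBirthDate") == p["PatientBirthDate"])
    ]
    # Only an unambiguous match is accepted.
    return hits[0] if len(hits) == 1 else None


def cleanse_patient(record):
    """Return (record, matched). Matched records take the master's attributes."""
    match = find_master(record)
    if match is None:
        return record, False  # left untouched and flagged for review
    return {**record, **match}, True


# Missing birth date and wrong sex, but ID and name agree with the master list.
fixed, ok = cleanse_patient(
    {"PatientID": "1001", "PatientName": "DOE^JANE",
     "PatientBirthDate": "", "PatientSex": "M"}
)

# ID agrees but neither name nor birth date does, so nothing is changed.
kept, ok2 = cleanse_patient(
    {"PatientID": "1002", "PatientName": "SMITH^JOHN", "PatientBirthDate": ""}
)
```

Requiring a second agreeing attribute is a deliberately conservative choice in this sketch: an ID that matches by coincidence does not overwrite the demographics of an unrelated patient.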

Study Level cleansing matches the imaging studies to a list of examinations from the HIS/RIS and populates the Accession Number and Requested Procedure attributes with values from the matched HIS/RIS examination. Patient attributes are also used in study-level matching, which gives better results than matching on examination attributes or patient attributes alone.
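The following sketch shows why combining patient and examination attributes helps. It matches a study to a hypothetical HIS/RIS exam list on Patient ID, date, and modality, and fills in the identifiers only when exactly one exam qualifies. The exam list, the choice of matching keys, and the assumption that each study has a single modality are illustrative, not a description of any specific product.

```python
# Hypothetical HIS/RIS examination list.
RIS_EXAMS = [
    {"AccessionNumber": "ACC100", "RequestedProcedureID": "RP100",
     "PatientID": "1001", "ExamDate": "20180301", "Modality": "CT"},
    {"AccessionNumber": "ACC101", "RequestedProcedureID": "RP101",
     "PatientID": "1001", "ExamDate": "20180301", "Modality": "MR"},
    {"AccessionNumber": "ACC200", "RequestedProcedureID": "RP200",
     "PatientID": "1002", "ExamDate": "20180301", "Modality": "CT"},
]


def match_exam(study, exams=RIS_EXAMS, use_patient=True):
    """Return the single exam matching the study, or None if zero or several."""
    hits = [
        e for e in exams
        if e["ExamDate"] == study["StudyDate"]
        and e["Modality"] == study["Modality"]
        and (not use_patient or e["PatientID"] == study["PatientID"])
    ]
    return hits[0] if len(hits) == 1 else None


def cleanse_study(study):
    """Populate the exam identifiers from the matched exam, if there is one."""
    exam = match_exam(study)
    if exam is None:
        return study  # ambiguous or unmatched: leave for manual review
    return {**study,
            "AccessionNumber": exam["AccessionNumber"],
            "RequestedProcedureID": exam["RequestedProcedureID"]}


study = {"PatientID": "1002", "StudyDate": "20180301",
         "Modality": "CT", "AccessionNumber": ""}

cleaned_study = cleanse_study(study)
# Without the patient attribute, two CT exams on that date qualify.
ambiguous = match_exam(study, use_patient=False)
```

Here date and modality alone leave two candidate exams, so no match is made, while adding the Patient ID narrows the result to a single exam.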
