Reindexing OMIM fails

Description

Reindexing fails for ONLINE MENDELIAN INHERITANCE IN MAN.

Environment

None

Activity

Veronika Koltunova
March 1, 2017, 3:57 AM

Sometimes there are things like "MOVED TO 116200 AND 107250".

Sergiu Dumitriu
March 1, 2017, 4:07 AM

Entries starting with `^` shouldn't be ignored, but indexed as-is. They won't normally be included in searches, since the searches explicitly ignore them. Further than that, it gets complicated. If they're "REMOVED FROM DATABASE", there's nothing more we can do, but we should still keep them so that if a patient record uses one of the terms that later gets removed, we will at least have an indication that the OMIM identifier used to be valid, instead of simply getting a 'term not found'. For things that have been moved, ideally we would add their identifier as an 'alt_id' value for the term where they have been moved, but a deprecated term can be moved to 1, 2, or even 3 other terms, so it would be wrong to choose only one of them as the new term. In summary, simply store as-is, without any further processing.

For columns 4 and 5, in an ideal world we should split the label and the gene, but it's not possible in OMIM, since some entries have the format "label; label; gene", or "label; label", or "label; gene; gene", so it's hard to know for certain whether something is a label or a gene name. Since this isn't the main label or the main gene name, and searching will match the correct word regardless of whether alternative/included genes are in their own column or together with the labels, it's simpler to just store them together.
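
A minimal Python sketch of the "store as-is" approach described above. The five-column row layout (symbol, identifier, title, alternative titles, included titles) and the field names `deprecated`, `alternative`, and `included` are assumptions for illustration, not the actual index schema; the identifier in the usage example is a placeholder.

```python
def build_document(row):
    """Turn one parsed row into a dict to index, keeping deprecated rows."""
    # Assumed column order: symbol, identifier, title, alternative, included.
    symbol, mim_id, title, alternative, included = row
    return {
        "id": mim_id,
        "symbol": symbol,
        # Hypothetical flag: searches could filter on it, but the entry
        # itself is still stored so old identifiers remain resolvable.
        "deprecated": symbol == "^",
        # Stored verbatim, e.g. "MOVED TO 116200 AND 107250";
        # no attempt is made to resolve the target terms.
        "name": title,
        # Labels and gene symbols are kept together, unsplit.
        "alternative": alternative,
        "included": included,
    }


# "000000" is a placeholder identifier, not a real OMIM entry.
moved = build_document(["^", "000000", "MOVED TO 116200 AND 107250", "", ""])
```

With this shape, a patient record that references a removed or moved identifier still finds a document, rather than getting a "term not found".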

Veronika Koltunova
March 1, 2017, 4:14 AM

Regarding your comments:

  • "name", in our current index, has the format "#100300 ADAMS-OLIVER SYNDROME 1; AOS1", which includes the symbol, the identifier, the name, and the gene or disorder abbreviation; it should instead contain just the name: "ADAMS-OLIVER SYNDROME 1"

  • add a new multi-valued "short_name" field for storing the abbreviation after the name: "AOS1"; note that there can be more than one short name
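
The two points above could be sketched in Python roughly as follows. This assumes the current "name" value always has the shape "<symbol><6-digit id> <name>; <abbreviation>; ..."; the function name `parse_name` and the returned keys are hypothetical.

```python
import re

# Optional OMIM symbol, a six-digit identifier, then the rest of the title.
TITLE_RE = re.compile(r"^(?P<symbol>[*+#%^])?(?P<id>\d{6})\s+(?P<rest>.*)$")


def parse_name(raw):
    """Split a combined title into symbol, id, name and short names."""
    match = TITLE_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"unexpected title format: {raw!r}")
    # The first ';'-separated part is the name, the rest are short names.
    parts = [p.strip() for p in match.group("rest").split(";") if p.strip()]
    return {
        "symbol": match.group("symbol") or "",
        "id": match.group("id"),
        "name": parts[0] if parts else "",
        "short_name": parts[1:],  # multi-valued: zero, one, or more
    }


parsed = parse_name("#100300 ADAMS-OLIVER SYNDROME 1; AOS1")
# parsed["name"] is "ADAMS-OLIVER SYNDROME 1", parsed["short_name"] is ["AOS1"]
```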

Do the name and the short name both have to come from the same 3rd column? If so, I can't seem to find an example row where more than one short name is present. Are there any?

However, the 4th and 5th columns come in various forms: just abbreviations, or many names and abbreviations. Examples:
"RDC7"
"A2AR;; ADORA2;; RDC8"
"SOMATOTROPINOMA, FAMILIAL ISOLATED; FIS;; ISOLATED FAMILIAL SOMATOTROPINOMA; IFS;; SOMATOTROPHINOMA, FAMILIAL;; ACROMEGALY DUE TO PITUITARY ADENOMA 1"
"PITUITARY ADENOMA PREDISPOSITION, INCLUDED; PAP, INCLUDED;; PITUITARY ADENOMA, FAMILIAL ISOLATED, INCLUDED; FIPA, INCLUDED"

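Judging from the examples above, ";;" appears to separate whole entries, while a single ";" separates a label from its abbreviation within an entry. A small Python sketch under that assumption (the helper name `split_titles` is hypothetical) splits only on ";;" and leaves each entry unsplit:

```python
def split_titles(field):
    """Split a column 4/5 value into entries on ';;', trimming whitespace."""
    return [entry.strip() for entry in field.split(";;") if entry.strip()]


genes = split_titles("A2AR;; ADORA2;; RDC8")
# genes is ["A2AR", "ADORA2", "RDC8"]

mixed = split_titles(
    "SOMATOTROPINOMA, FAMILIAL ISOLATED; FIS;; "
    "ISOLATED FAMILIAL SOMATOTROPINOMA; IFS;; "
    "SOMATOTROPHINOMA, FAMILIAL;; "
    "ACROMEGALY DUE TO PITUITARY ADENOMA 1"
)
# mixed has four entries; the first is "SOMATOTROPINOMA, FAMILIAL ISOLATED; FIS"
```

Each resulting entry can be indexed as one value of a multi-valued field without deciding whether its parts are labels or gene symbols.
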
Sergiu Dumitriu
March 1, 2017, 4:16 AM

These 3 have two short names:

Felicia Collura
March 1, 2017, 2:50 PM

Yes, actually I couldn't have explained that better myself. That's a very logical way of looking at something that is overcomplicated by commas.

Fixed

Assignee

Veronika Koltunova

Reporter

Sasha Andjic

Labels

External issue ID

None

Epic Link

Components

Fix versions

Affects versions

Priority

Medium