Reindexing fails for ONLINE MENDELIAN INHERITANCE IN MAN.
Sometimes there are things like "MOVED TO 116200 AND 107250".
Entries starting with `^` shouldn't be ignored, but indexed as-is. They won't normally be included in searches, since the searches explicitly ignore them. Beyond that, it gets complicated. If they're "REMOVED FROM DATABASE", there's nothing more we can do, but we should still keep them, so that if a patient record uses a term that later gets removed, we will at least have an indication that the OMIM identifier used to be valid, instead of simply getting a 'term not found'. For entries that have been moved, ideally we would add their identifier as an 'alt_id' value on the term they were moved to, but a deprecated term can be moved to 1, 2, or even 3 other terms, so it would be wrong to choose only one of them as the new term. In summary, simply store them as-is, without any further processing.
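To make the "moved to several terms" problem concrete, here is a minimal Python sketch that only classifies the title of a `^` entry. The function name `parse_deprecated_title` and the returned dictionary layout are hypothetical, not part of the current indexer, and the stored title itself would stay untouched, as proposed above.

```python
import re

# Hypothetical helper for illustration only; the indexer would still store
# the original title unchanged.
MOVED_RE = re.compile(r"^MOVED TO\s+(.+)$")


def parse_deprecated_title(title):
    """Classify the title of a '^' entry as removed, moved, or unknown."""
    title = title.strip()
    if title == "REMOVED FROM DATABASE":
        return {"status": "removed", "moved_to": []}
    match = MOVED_RE.match(title)
    if match:
        # A deprecated term may point to several targets,
        # e.g. "116200 AND 107250", so collect all of them.
        return {"status": "moved", "moved_to": re.findall(r"\d+", match.group(1))}
    return {"status": "unknown", "moved_to": []}


print(parse_deprecated_title("MOVED TO 116200 AND 107250"))
```

Because `moved_to` can hold more than one identifier, there is no single obvious term to attach an 'alt_id' to, which is why storing the entry as-is is the safer choice.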
For columns 4 and 5, in an ideal world we would split the label and the gene, but that's not possible in OMIM, since some entries have the format "label; label; gene", or "label; label", or "label; gene; gene", so it's hard to know for certain whether something is a label or a gene name. Since this isn't the main label or the main gene name, and searching will match the correct word regardless of whether alternative/included genes are in their own column or together with the labels, it's simpler to just store them together.
Regarding your comments:
"name", in our current index, has the format "#100300 ADAMS-OLIVER SYNDROME 1; AOS1", which includes the symbol, the identifier, the name, and the gene or disorder abbreviation; it should instead contain just the name: "ADAMS-OLIVER SYNDROME 1"
add a new multi-valued "short_name" field for storing the abbreviation after the name: "AOS1"; note that there can be more than one short name
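As a rough sketch of the two points above, the following Python splits the current combined value into its parts. The function name `split_title`, the output keys, and the assumption that everything after the first `;` is a short name are illustrative only; the real indexer may need different rules.

```python
import re

# Optional OMIM prefix symbol, numeric identifier, then the remaining text.
TITLE_RE = re.compile(r"^(?P<symbol>[*+#%^])?(?P<id>\d+)\s+(?P<rest>.+)$")


def split_title(raw):
    """Split '#100300 NAME; ABBR' into symbol, id, name and short names."""
    match = TITLE_RE.match(raw.strip())
    if match is None:
        raise ValueError("unrecognized title format: %r" % raw)
    # Assumption: the first ';'-separated part is the name,
    # and any following parts are short names.
    parts = [p.strip() for p in match.group("rest").split(";") if p.strip()]
    return {
        "symbol": match.group("symbol") or "",
        "id": match.group("id"),
        "name": parts[0],
        "short_name": parts[1:],  # multi-valued, may be empty
    }


print(split_title("#100300 ADAMS-OLIVER SYNDROME 1; AOS1"))
```

Keeping `short_name` as a list matches the proposed multi-valued field, even when only one abbreviation is present.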
Do name and short name both have to come from the same 3rd column? If so, I can't seem to find an example row with more than one short name. Are there any?
However, the 4th and 5th columns come in various forms: just abbreviations, or many names and abbreviations. Examples:
"RDC7"
"A2AR;; ADORA2;; RDC8"
"SOMATOTROPINOMA, FAMILIAL ISOLATED; FIS;; ISOLATED FAMILIAL SOMATOTROPINOMA; IFS;; SOMATOTROPHINOMA, FAMILIAL;; ACROMEGALY DUE TO PITUITARY ADENOMA 1"
"PITUITARY ADENOMA PREDISPOSITION, INCLUDED; PAP, INCLUDED;; PITUITARY ADENOMA, FAMILIAL ISOLATED, INCLUDED; FIPA, INCLUDED"
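The examples above suggest that `;;` separates alternative entries and a single `;` separates the parts within one entry. Assuming that reading is right, a minimal Python sketch of the split could look like this; `split_alternatives` is a hypothetical name, and the code deliberately does not try to decide whether a part is a label or a gene symbol.

```python
def split_alternatives(field):
    """Split a column 4/5 value into groups on ';;', then into parts on ';'."""
    groups = []
    for group in field.split(";;"):
        # Parts within a group stay together; no label/gene guessing here.
        parts = [p.strip() for p in group.split(";") if p.strip()]
        if parts:
            groups.append(parts)
    return groups


for group in split_alternatives("A2AR;; ADORA2;; RDC8"):
    print(group)
```

Commas are left alone, so a title such as "SOMATOTROPINOMA, FAMILIAL ISOLATED" stays in one piece.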
These 3 have two short names:
Yes, actually I couldn't have explained that better myself. That's a very logical way of looking at something that is overcomplicated by commas.