Add a custom parser for the NCIT flat file

Description

 

  • Add support for a new “source” special case of the indexing servlet (LFS-101), when “source=ncit”

  • In this case, the version parameter is mandatory

  • Download the flat zip file from "https://evs.nci.nih.gov/ftp1/NCI_Thesaurus/Thesaurus_" + version + ".FLAT.zip" into a temporary file

  • Open it with a ZipInputStream, call getNextEntry() to position the stream at the first entry, wrap the stream in a BufferedReader and then in a CSVParser, then read it row by row

  • create a new map String → String[] for recording parents

  • for each row, key = column 0, values = column 2 split by |, store this in the map

  • create a new map String → String[] to record ancestors

  • for each entry in the parents map, ancestors.put(entry.getKey(), computeAncestors(parents, entry.getKey())), where computeAncestors recursively gathers the parents of the term, then their parents, and so on

  • Open the temporary zip file again as a CSVParser

  • For each entry, create a new Node with jcr:primaryType=lfs:VocabularyTerm, identifier=row[0], label=StringUtils.defaultIfBlank(row[5], row[3].split("\\|")[0]), description=row[4], synonyms=row[3].split("\\|"), parents=parentsMap.get(row[0]), ancestors=ancestorsMap.get(row[0])

  • Delete the temporary file

  • Save the session
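
The parents and ancestors steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the intended implementation: the class name NcitAncestorsSketch and the helpers readParents and computeAllAncestors are hypothetical, rows are assumed to be already parsed into String[] (for example from the CSVParser records), and a blank parents column is assumed to mark a root term.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, named only for this sketch
final class NcitAncestorsSketch
{
    private static final String[] NONE = new String[0];

    /** First pass: map each term (column 0) to its direct parents (column 2, split by |). */
    static Map<String, String[]> readParents(List<String[]> rows)
    {
        Map<String, String[]> parents = new LinkedHashMap<>();
        for (String[] row : rows) {
            String raw = row.length > 2 ? row[2].trim() : "";
            // Assumption: a blank parents column means a root term, stored as an empty array
            parents.put(row[0], raw.isEmpty() ? NONE : raw.split("\\|"));
        }
        return parents;
    }

    /** Recursively gathers the parents of a term, then their parents, and so on. */
    static String[] computeAncestors(Map<String, String[]> parents, String term)
    {
        Set<String> result = new LinkedHashSet<>();
        collect(parents, term, result);
        return result.toArray(NONE);
    }

    private static void collect(Map<String, String[]> parents, String term, Set<String> seen)
    {
        for (String parent : parents.getOrDefault(term, NONE)) {
            // add() returns false for a term already visited, which avoids
            // duplicates and also stops the recursion if the data contains a cycle
            if (seen.add(parent)) {
                collect(parents, parent, seen);
            }
        }
    }

    /** Builds the ancestors map for every term in the parents map. */
    static Map<String, String[]> computeAllAncestors(Map<String, String[]> parents)
    {
        Map<String, String[]> ancestors = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : parents.entrySet()) {
            ancestors.put(entry.getKey(), computeAncestors(parents, entry.getKey()));
        }
        return ancestors;
    }
}
```

The sketch recomputes the ancestors of each term from scratch, which keeps it short; caching already computed ancestor sets would be a possible optimization if this turns out to be slow on the full file.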

 

Environment

None

Status

Assignee

Bruce He

Reporter

Sergiu Dumitriu

Labels

None

External issue ID

None

Sprint

None

Priority

Medium