
NLM UMLS MRREL is broken / incomplete


I have been working with the Unified Medical Language System (UMLS) for decades, but for some years now (since 2017) I have been aware that the MRREL table is woefully defective. How can that possibly be?

I have plenty of examples, but I will keep this very simple. The ATC classification is a simple tree. Among its top-level categories are 'G' (CUI: C3653431) and 'C' (CUI: C3540036).

To be absolutely sure that I am not losing anything due to my importing process into a relational database, I am checking the raw files from the UMLS distribution:

for part in aa ab ac ad; do
   unzip -p 2021AA-full/2021aa-2-meta.nlm "2021AA/META/MRREL.RRF.$part.gz" | zcat
done | grep -E 'C3540036|C3653431'

and here is what I get:

|||PAR|C3540036|A22726695||inverse_isa|R162880348||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162896206||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162888235||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162884662||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162904098||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162892260||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162895918||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162895969||||||N||
|||PAR|C3540036|A22726695||inverse_isa|R162884408||||||N||
|||CHD|C3540036|A22726695||isa|R162905548||||||N||
|||CHD|C3653431|A22724193||isa|R145149031||||||N||
C3540036|A22726695|AUI|CHD|C0001645|A22729715|AUI|isa|R162894118||ATC||||N||
C3653431|A22724193|AUI|CHD|C3653561|A22721518|AUI|isa|R145152424||ATC||||N||
|||PAR|C3653431|A22724193||inverse_isa|R145147348||||||N||
|||PAR|C3653431|A22724193||inverse_isa|R145150236||||||N||
|||PAR|C3653431|A22724193||inverse_isa|R145153001||||||N||
|||PAR|C3653431|A22724193||inverse_isa|R162904046||||||N||

Why would there be only one link for each of these top-level ATC categories?

  • CUI: C0001645 is ATC C07 - BETA BLOCKING AGENTS
  • CUI: C3653561 is ATC G03 - SEX HORMONES AND MODULATORS OF THE GENITAL SYSTEM

But where are C06, C05 (CUI: C0304533), G02 (CUI: C3653939), etc.?

Let's search the other way around:

for part in aa ab ac ad; do
   unzip -p 2021AA-full/2021aa-2-meta.nlm "2021AA/META/MRREL.RRF.$part.gz" | zcat
done | grep -E 'C0001645|C0304533|C3653561|C3653939' \
| grep -F '|ATC|'

This time I keep only the MRREL rows whose source is ATC. First, C07AA as a child of C07:

C0001645|A22726519|AUI|CHD|C0304515|A22728404|AUI|isa|R145146143||ATC||||N||
C0001645|A22729715|AUI|CHD|C0001645|A22726519|AUI|isa|R162909942||ATC||||N||

Look above: there is even a cycle, since the second row links C0001645 to itself. And where are all the other children of C07? Nowhere. The only other row with C07 is the link to C that we already had:

C3540036|A22726695|AUI|CHD|C0001645|A22729715|AUI|isa|R162894118||ATC||||N||

And C05? Only one child, C05B, but no parent link to C and no other children:

C0304533|A22730499|AUI|CHD|C0360720|A22722089|AUI|isa|R162902080||ATC||||N||

Now here is G02 with 3 of its (certainly more) children:

C3653939|A22723315|AUI|CHD|C3653712|A22724891|AUI|isa|R162905420||ATC||||N||
C3653939|A22731353|AUI|CHD|C3653306|A22721882|AUI|isa|R162890442||ATC||||N||
C3653939|A22722139|AUI|CHD|C0164398|A22725073|AUI|member_of|R162897807||ATC||||N||

Then we have inverse links that are not actually from ATC; those concepts come from SNOMED and other sources:

C0164398|A22725073|AUI|PAR|C3653939|A22722139|AUI|has_member|R162896052||ATC||||N||
C0754280|A26456152|AUI|PAR|C3653939|A22722139|AUI|has_member|R171341743||ATC||||N||
C1721339|A32510681|AUI|PAR|C3653939|A22722139|AUI|has_member|R202594180||ATC||||N||
C3652943|A22728555|AUI|PAR|C3653939|A22722139|AUI|has_member|R162895991||ATC||||N||
C3652944|A22730286|AUI|PAR|C3653939|A22722139|AUI|has_member|R162884649||ATC||||N||

And here is the link from G to G03:

C3653431|A22724193|AUI|CHD|C3653561|A22721518|AUI|isa|R145152424||ATC||||N||

This one is also not an ATC link; the target is in SNOMED and other sources, but not in ATC:

C3653561|A22721518|AUI|CHD|C0002844|A22722789|AUI|isa|R145149338||ATC||||N||

So this looks completely random.

I remember from decades ago that MRREL was quite redundant, carrying both directions for every relationship. Not anymore. What is going on here?


Solution

  • I sent a problem report to NLM, and they replied that the files in UMLS-Full.zip that end in .nlm, which also contain the UMLS data tables, are somehow incomplete, and that one needs their MetamorphoSys program to assemble the right files.

    It seems they apply some row-level data compression (for whatever reason), which reduces the size of the MRREL file by about 29%, going by the two sizes below.

    • MRREL.RRF from the Metathesaurus distribution: 5,137,657,601 bytes

    • MRREL.RRF from the UMLS-Full .nlm file: 3,662,797,614 bytes

      $ head MRREL.RRF.met
      C0000005|A13433185|SCUI|RB|C0036775|A7466261|SCUI||R86000559||MSHFRE|MSHFRE|||N||
      C0000005|A26634265|SCUI|RB|C0036775|A0115649|SCUI||R31979041||MSH|MSH|||N||
      C0000039|A0016515|AUI|SY|C0000039|A11754881|AUI|translation_of|R101808683||MSHSWE|MSHSWE|||N||
      C0000039|A0016515|AUI|SY|C0000039|A12080359|AUI|sort_version_of|R64565540||MSH|MSH|||N||
      C0000039|A0016515|AUI|SY|C0000039|A12091182|AUI|entry_version_of|R64592881||MSH|MSH|||N||
      C0000039|A0016515|AUI|SY|C0000039|A13042554|AUI|translation_of|R193408122||MSHCZE|MSHCZE|||N||
      C0000039|A0016515|AUI|SY|C0000039|A13096036|AUI|translation_of|R73331672||MSHPOR|MSHPOR|||N||
      C0000039|A0016515|AUI|SY|C0000039|A1317708|AUI|permuted_term_of|R28482432||MSH|MSH|||N||
      C0000039|A0016515|AUI|SY|C0000039|A18972171|AUI|translation_of|R124061564||MSHPOL|MSHPOL|||N||
      C0000039|A0016515|AUI|SY|C0000039|A28315139|AUI||R173174221||RXNORM|RXNORM|||N||

      $ head MRREL.RRF.nlm
      C0000005|A13433185|SCUI|RB|C0036775|A7466261|SCUI||R86000559||MSHFRE||||N||
      C0000005|A26634265|SCUI|RB|C0036775|A0115649|SCUI||R31979041||MSH||||N||
      C0000039|A0016515|AUI|SY|C0000039|A11754881|AUI|translation_of|R101808683||MSHSWE||||N||
      C0000039|A0016515|AUI|SY|C0000039|A12080359|AUI|sort_version_of|R64565540||MSH||||N||
      |||SY|C0000039|A12091182||entry_version_of|R64592881||||||N||
      C0000039|A0016515|AUI|SY|C0000039|A13042554|AUI|translation_of|R193408122||MSHCZE||||N||
      C0000039|A0016515|AUI|SY|C0000039|A13096036|AUI|translation_of|R73331672||MSHPOR||||N||
      C0000039|A0016515|AUI|SY|C0000039|A1317708|AUI|permuted_term_of|R28482432||MSH||||N||
      C0000039|A0016515|AUI|SY|C0000039|A18972171|AUI|translation_of|R124061564||MSHPOL||||N||
      C0000039|A0016515|AUI|SY|C0000039|A28315139|AUI||R173174221||RXNORM||||N||

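    Compare the 5th rows: in the .nlm file, the columns that repeat the 4th row's values are left empty, and the complete row is recovered by copying the previous row's values into those empty columns. The second source column, which repeats the first in the complete file, is also left empty in every .nlm row.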

    That seems to be the issue.
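
For anyone who wants to check this without running MetamorphoSys, the copying described above can be sketched as a small awk filter. This is only a guess reconstructed from the ten rows shown here, not NLM's documented format: the function name expand_mrrel is made up, and it assumes that a row with an empty first column inherits columns 1, 2, 3, 7 and 11 (CUI1, AUI1, STYPE1, STYPE2, SAB) from the row before it, and that an empty column 12 (SL) repeats column 11. MetamorphoSys remains the tool NLM points to for producing the real files.

```shell
# Hypothetical helper (not an NLM tool): expand compressed MRREL rows on stdin.
# Assumptions, inferred only from the sample rows above:
#   - an empty first field marks a row that repeats the previous row's
#     CUI1, AUI1, STYPE1 (fields 1-3), STYPE2 (field 7) and SAB (field 11)
#   - an empty SL (field 12) repeats SAB (field 11)
expand_mrrel() {
  awk -F'|' -v OFS='|' '
    $1 == "" {                        # compressed row: restore inherited values
      $1 = cui1; $2 = aui1; $3 = stype1
      if ($7 == "")  $7 = stype2
      if ($11 == "") $11 = sab
    }
    {
      if ($12 == "") $12 = $11        # SL mirrors SAB in the complete file
      # remember this row so the next compressed row can inherit from it
      cui1 = $1; aui1 = $2; stype1 = $3; stype2 = $7; sab = $11
      print
    }
  '
}
```

The filter has to see the whole stream in its original order, so it would go between the unzip/zcat step and the grep; filtering first throws away the rows that carry the values to be copied. Whether the inheritance also carries across the .aa/.ab/.ac/.ad part boundaries is something I have not verified.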