Tags: javascript, string, loops, data-structures, substring

Searching an array of long strings for specific values, then saving them in a JavaScript object


I'm new to this. I'm working on a program that reads a batch of docx documents. I extract each document's content from its XML using XPath and xmldom, which gives me an array with every line of the document. The problem is that the result looks like this:

[
  '-1911312-14668500FECHA:  15-12-25',
  'NOMBRE Y APELLIDO:  Jhon dee',
  'C.I.: 20020202                                  EDAD: 45                       ',
  'DIRECCION:  LA CASA',
  'TLF:  55555555',
  'CORREO: thiisatest@gmail',
  '                                            HISTORIA CLINICA GINECO-OBSTETRICA',
  'HO',
  'NULIG',
  'FUR',
  '3-8-23',
  'EG',
  '',
  'FPP',
  '',
  'GS',
  '',
  'GSP',
  '',
  '',
  'MC:  CONTROL GINECOLOGICO',
  'HEA',
  '',
  'APP:  NIEGA PAT, NIEGA ALER, QX NIEGA.',
  'APF: MADRE HTA, ABUELA DM.',
  '',
  'AGO: MENARQUIA:  10                FUR:                         CICLO:      4/28              ',
  '    TIPO: EUM',
  ' MET ANTICONCEP:  GENODERM DESDE HACE 3 AÑOS.',
  'PRS:                                      NPS:                                                   ITS: VPH LIE BAJO GRADO 2017 , BIOPSIA.',
  'FUC:  NOV 2022, NEGATIVA. COLPO NEGATIVA.',
  '',
  '',
  'EMBARAZO',
  '#/AÑO',
  'TIPO DE PARTO',
  'INDICACION',
  'RN',
  'SEXO',
  'RN',
  'PESO',
  'OBSERVACIONES',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'EXAMEN FISICO:',
  'PESO:  80,1                  TALLA:                    TA: MMHG                    FC:                    FR: ',  
  '',
  'PIEL Y MUCOSA:  DLN',
  'CARDIOPULMONAR: DLN',
  '',
  'MAMAS: ',
  '',
  'ABDOMEN: ',
  'GENITALES:  CUELLO SIN SECRECION , COLPO SE EVDIENCIA DOS LEISONES HPRA 1 Y HORA 5',
  '',
  'EXTREMIDADES: DLN',
  'NEUROLOGICO: DLN',
  '',
  ' IDX:  LESION EN CUELLO UTERINO',
  '',
  'PLAN: DEFEROL OMEGA, CAUTERIZACION Y TIPIFICACION VIRAL',
  '22-8-23',
  'SE TOMA MUESTRA DE TIPIFICACION VIRAL.',
  '',
  '',
  '',
  'LABORATORIOS:',
  'FECHA',
  'HB/HTO',
  'LEU/PLAQ',
  'GLICEMIA',
  'UREA',
  'CREAT',
  'HIV/VDRL',
  'UROANALISIS',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ... 44 more items
]

So, I want to put this content into a JS object like:

const customObj = {
 fecha: "fecha on the doc",
....
}

I thought this would work:

const fillObject = (inputArray, keywords) => {
    const customObj = {};
    keywords.forEach((keyword, index) => {
        customObj[keyword] = inputArray.map(line => {
            const keywordIndex = line.indexOf(keyword);
            if (keywordIndex !== -1) {
                const nextKeywordIndex = keywords.slice(index + 1).reduce((acc, nextKeyword) => {
                    const nextKeywordIndex = line.indexOf(nextKeyword);
                    return nextKeywordIndex !== -1 && nextKeywordIndex < acc ? nextKeywordIndex : acc;
                }, line.length);
                return line.slice(keywordIndex, nextKeywordIndex).trim();
            }
            return null;
        }).filter(Boolean);
    });
    console.log(customObj);
    return customObj;
};

The function returns each keyword together with the content up to the next keyword, but I only want the important data (the value). The document format is always the same, but sometimes there are spaces between a keyword and its content and sometimes there aren't. The keywords are always in uppercase.

I tried the function above, but I want the search to be more precise and the resulting object to be cleaner. The current output looks like this:

'FECHA:': [ 'FECHA: 19-10-23' ],
    'NOMBRE Y APELLIDO:': [ 'NOMBRE Y APELLIDO: John Dee' ],
    'C.I.:': [ 'C.I.: 3232323' ],
    'EDAD:': [ 'EDAD: 56' ],
    'DIRECCION:': [ 'DIRECCION:   Marylan' ],
    'TLF:': [ 'TLF:  55555555' ],
    'CORREO:': [ 'CORREO:  5555555AS@GMAIL.COM' ],
    'CONTACTO:': [
      'CONTACTO:  IG                                            HISTORIA CLINICA GINECO-OBSTETRICA'
    ],

As you can see, some properties come out wrong; for example, "CONTACTO" also captures the heading that follows it.


Solution

  • Instead of providing a set of keys, I would just parse the input data to recognise the "key: value" pattern.

    Note that you could get ambiguity. For instance, if an input line were:

    TEST: A B C: OK
    

    Then, this could be interpreted as:

    {
        "TEST": "A",
        "B C": "OK"
    }
    

    or as:

    {
        "TEST": "A B",
        "C": "OK"
    }
    

    To break such ties, we could make the capture of the value greedy, so that in the above example the second output would be generated. If, however, the words are separated by at least three spaces, we could interpret what follows as a new key/value pair, so that this input:

    TEST: A   B C: OK
    

    ...would be interpreted as:

    {
        "TEST": "A",
        "B C": "OK"
    }
    
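As a minimal sketch of this tie-breaking rule, the snippet below runs both example lines through the same pattern that `makeObject` uses further down. The helper name `parseLine` is hypothetical and exists only for this illustration; it parses a single line and skips the comma handling.

```javascript
// Same pattern as the regex used by makeObject, repeated here so the
// snippet runs on its own.
const pairRegex = /((?:[A-Z.]+ )*[A-Z.]+):((?: {0,2}(?!\S*:)\S+)*)/g;

// parseLine is a hypothetical helper: it turns one line into an object
// of trimmed key/value pairs.
const parseLine = line => Object.fromEntries(
    Array.from(line.matchAll(pairRegex), ([, key, value]) => [key, value.trim()])
);

// Single spaces: the value is captured greedily, so "B" stays with TEST.
console.log(parseLine("TEST: A B C: OK"));   // { TEST: 'A B', C: 'OK' }

// Three spaces: the value stops at the gap, so "B C" becomes the next key.
console.log(parseLine("TEST: A   B C: OK")); // { TEST: 'A', 'B C': 'OK' }
```

The value part accepts at most two spaces between words (` {0,2}`), and the negative lookahead `(?!\S*:)` stops it in front of any word that contains a colon.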

    Secondly, if a value has commas, you could turn that value into an array (except if the comma is part of a numeric value).
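The comma rule can be sketched on its own, assuming that "part of a numeric value" means "immediately followed by a digit". The helper name `splitValue` is hypothetical; the sample strings come from the data in the question.

```javascript
// splitValue is a hypothetical helper: it splits on commas that are not
// directly followed by a digit, trims each part, and returns a plain
// string when there is only one part.
const splitValue = value => {
    const parts = value.split(/,(?!\d)/).map(part => part.trim());
    return parts.length > 1 ? parts : parts[0];
};

console.log(splitValue("MADRE HTA, ABUELA DM.")); // [ 'MADRE HTA', 'ABUELA DM.' ]
console.log(splitValue("80,1"));                  // 80,1 (decimal comma, kept whole)
console.log(splitValue("LA CASA"));               // LA CASA (no comma, plain string)
```

One limitation of this heuristic: a list written without spaces whose next item starts with a digit (for example `A,2 B`) would not be split.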

    We can use the power of regular expressions to do this kind of parsing.

    Here is a function makeObject and how that could work for your sample input:

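    // Unwrap single-element arrays so simple values stay plain strings.
    const multiple = arr => arr.length > 1 ? arr : arr[0];
    // Key: uppercase words (dots allowed) followed by a colon.
    // Value: words separated by at most two spaces, skipping any word that contains a colon.
    const regex = /((?:[A-Z.]+ )*[A-Z.]+):((?: {0,2}(?!\S*:)\S+)*)/g;
    const makeObject = data => Object.fromEntries(
        // Join the lines so that one global regex can scan the whole document.
        Array.from(data.join("\n").matchAll(regex), ([, key, value]) => [
            key,
            // Split on commas unless a digit follows (keeps decimals like "80,1").
            multiple(value.split(/,(?!\d)/).map(val => val.trim()))
        ])
    );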
    
    // Your sample data:
    const data = ['-1911312-14668500FECHA:  15-12-25','NOMBRE Y APELLIDO:  Jhon dee','C.I.: 20020202                                  EDAD: 45                       ','DIRECCION:  LA CASA','TLF:  55555555','CORREO: thiisatest@gmail','                                            HISTORIA CLINICA GINECO-OBSTETRICA','HO','NULIG','FUR','3-8-23','EG','','FPP','','GS','','GSP','','','MC:  CONTROL GINECOLOGICO','HEA','','APP:  NIEGA PAT, NIEGA ALER, QX NIEGA.','APF: MADRE HTA, ABUELA DM.','','AGO: MENARQUIA:  10                FUR:                         CICLO:      4/28              ','    TIPO: EUM',' MET ANTICONCEP:  GENODERM DESDE HACE 3 AÑOS.','PRS:                                      NPS:                                                   ITS: VPH LIE BAJO GRADO 2017 , BIOPSIA.','FUC:  NOV 2022, NEGATIVA. COLPO NEGATIVA.','','','EMBARAZO','#/AÑO','TIPO DE PARTO','INDICACION','RN','SEXO','RN','PESO','OBSERVACIONES','','','','','','','','','','','','','','','','','','','','EXAMEN FISICO:','PESO:  80,1                  TALLA:                    TA: MMHG                    FC:                    FR: ','','PIEL Y MUCOSA:  DLN','CARDIOPULMONAR: DLN','','MAMAS: ','','ABDOMEN: ','GENITALES:  CUELLO SIN SECRECION , COLPO SE EVDIENCIA DOS LEISONES HPRA 1 Y HORA 5','','EXTREMIDADES: DLN','NEUROLOGICO: DLN','',' IDX:  LESION EN CUELLO UTERINO','','PLAN: DEFEROL OMEGA, CAUTERIZACION Y TIPIFICACION VIRAL','22-8-23','SE TOMA MUESTRA DE TIPIFICACION VIRAL.','','','','LABORATORIOS:','FECHA','HB/HTO','LEU/PLAQ','GLICEMIA','UREA','CREAT','HIV/VDRL','UROANALISIS','','','','','','','','',];
    console.log(makeObject(data));

    You'll see in the output all keys it could find, even those that have an empty value (like AGO). Just extract from this object what you need.
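To get from there to the kind of `customObj` the question asks for, one possible approach is a small mapping from the property names you want to the keys found in the document. This is only a sketch: `pickFields`, `fieldMap`, and the target property names are hypothetical, and `parsed` is a hand-written stand-in for part of what `makeObject` returns for the sample data.

```javascript
// Stand-in for part of the object that makeObject returns for the sample data.
const parsed = {
    "FECHA": "15-12-25",
    "NOMBRE Y APELLIDO": "Jhon dee",
    "C.I.": "20020202",
    "EDAD": "45",
    "AGO": "",
    "APF": ["MADRE HTA", "ABUELA DM."]
};

// Hypothetical mapping: target property name -> key as it appears in the document.
const fieldMap = {
    fecha: "FECHA",
    nombre: "NOMBRE Y APELLIDO",
    cedula: "C.I.",
    antecedentes: "APF",
    telefono: "TLF" // deliberately absent from `parsed` above
};

// pickFields (hypothetical) copies only the mapped keys and stores null
// for keys that are missing or have an empty value.
const pickFields = (source, map) => Object.fromEntries(
    Object.entries(map).map(([target, key]) => {
        const value = source[key];
        return [target, value === undefined || value === "" ? null : value];
    })
);

const customObj = pickFields(parsed, fieldMap);
console.log(customObj.fecha);    // 15-12-25
console.log(customObj.telefono); // null
```

Keep in mind that `Object.fromEntries` keeps only the last value when the same key occurs more than once, so a key that appears twice with a colon in one document would need separate handling.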