Tags: regex, python-3.x, nlp, data-extraction, google-natural-language

Fetching names and ages from a text file


I have a .txt file from which I have to extract names and ages. The data in the file looks like this:

Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs    Height: 5 feet 1 inch; weight is 56 kgs. 
This medical record is 10 years old. 

Output 1: John, Sam, Kenner
Output 2: 47, 29, 36

I am using regular expressions to extract the data. For example, for age, I am using the following patterns:

re.compile(r'age:\s*\d{1,3}',re.I)

re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)

re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)

re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)

I then apply another regular expression to the output of these patterns to extract the numbers. The problem is that these patterns also match data I do not want. For example:

This medical record is 10 years old.

I get '10' from the sentence above, which I do not want. I only want to extract the names of people and their ages. What approach should I take? I would appreciate any help.
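The two-stage approach described above can be reproduced with a minimal sketch. It uses the first and fourth patterns from the question and the sample text as given; the helper name `extract_numbers` and the second-stage pattern `\d+` are assumptions made for illustration, since the question does not show that step.

```python
import re

# Sample text from the question.
TEXT = """Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs    Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old."""

# First stage: two of the patterns from the question.
AGE_LABEL = re.compile(r'age:\s*\d{1,3}', re.I)
YEARS_OLD = re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*', re.I)

# Second stage (assumed): pull the digits out of each first-stage match.
NUMBER = re.compile(r'\d+')


def extract_numbers(pattern, text):
    """Apply a first-stage pattern, then collect the numbers in each match."""
    numbers = []
    for match in pattern.finditer(text):
        numbers.extend(NUMBER.findall(match.group(0)))
    return numbers


label_hits = extract_numbers(AGE_LABEL, TEXT)
years_hits = extract_numbers(YEARS_OLD, TEXT)
print(label_hits)
print(years_hits)
```

The "years" pattern has no way to tell a person from a document, so `years_hits` includes '10' from the medical record sentence. Because of the leading `.*`, it also picks up every other number earlier on the same line.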


Solution

  • Take a look at the Cloud Data Loss Prevention (DLP) API, which detects typed entities such as names and ages instead of relying on hand-written patterns. There is a GitHub repo with examples. This is likely what you want:

    def inspect_string(project, content_string, info_types,
                       min_likelihood=None, max_findings=None, include_quote=True):
        """Uses the Data Loss Prevention API to analyze strings for protected data.
        Args:
            project: The Google Cloud project id to use as a parent resource.
            content_string: The string to inspect.
            info_types: A list of strings representing info types to look for.
                A full list of info type categories can be fetched from the API.
            min_likelihood: A string representing the minimum likelihood threshold
                that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
                'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
            max_findings: The maximum number of findings to report; 0 = no maximum.
            include_quote: Boolean for whether to display a quote of the detected
                information in the results.
        Returns:
            None; the response from the API is printed to the terminal.
        """
    
        # Import the client library.
        import google.cloud.dlp
    
        # Instantiate a client.
        dlp = google.cloud.dlp.DlpServiceClient()
    
        # Prepare info_types by converting the list of strings into a list of
        # dictionaries (protos are also accepted).
        info_types = [{'name': info_type} for info_type in info_types]
    
        # Construct the configuration dictionary. Keys which are None may
        # optionally be omitted entirely.
        inspect_config = {
            'info_types': info_types,
            'min_likelihood': min_likelihood,
            'include_quote': include_quote,
            'limits': {'max_findings_per_request': max_findings},
        }
    
        # Construct the `item`.
        item = {'value': content_string}
    
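        # Convert the project id into a full resource id.
        # (google-cloud-dlp releases before 2.0 used dlp.project_path(project).)
        parent = 'projects/{}'.format(project)
    
        # Call the API. Releases 2.0 and later take a single request
        # dictionary; earlier releases took these as positional arguments.
        response = dlp.inspect_content(
            request={
                'parent': parent,
                'inspect_config': inspect_config,
                'item': item,
            }
        )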
    
        # Print out the results.
        if response.result.findings:
            for finding in response.result.findings:
                try:
                    if finding.quote:
                        print('Quote: {}'.format(finding.quote))
                except AttributeError:
                    pass
                print('Info type: {}'.format(finding.info_type.name))
                print('Likelihood: {}'.format(finding.likelihood))
        else:
            print('No findings.')
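For the question's data, the info types to request would presumably be `PERSON_NAME` and `AGE`, both built-in DLP infoType names. The sketch below only shows how the configuration dictionary could be built with `None` keys left out, as the comment in the function allows. The helper name `build_inspect_config` is hypothetical and not part of the client library, and nothing here calls the API.

```python
def build_inspect_config(info_types, min_likelihood=None,
                         max_findings=None, include_quote=True):
    """Build an inspect_config dictionary, leaving out keys whose value is None.

    Hypothetical helper for illustration; it mirrors the dictionary built
    inside inspect_string but does not contact the DLP service.
    """
    config = {
        'info_types': [{'name': name} for name in info_types],
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
    }
    # Only add the limits block when a maximum was actually requested.
    if max_findings is not None:
        config['limits'] = {'max_findings_per_request': max_findings}
    # Drop top-level keys that were left as None.
    return {key: value for key, value in config.items() if value is not None}


config = build_inspect_config(['PERSON_NAME', 'AGE'], min_likelihood='POSSIBLE')
print(config)
```

With the `inspect_string` function above, the equivalent call would be `inspect_string('my-project', text, ['PERSON_NAME', 'AGE'])`, where `my-project` is a placeholder project id and `text` is the contents of the .txt file. Whether the service flags "10 years old" in the medical record sentence as an age is something to check against real output.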