python regex tokenize

Extract certain items from text file for tokenisation


Below is the structure of the text file "info.txt". I need to extract the IDs and Descriptions from this file (any method that extracts them accurately is fine). The file contains about 500 ID/Description entries. Each ID corresponds to one title and one description, as shown in the sample below.

The first thing I am unsure about is whether to store the IDs and descriptions in two lists. If I use lists, would I be able to use the "Description" list to tokenise each description (keeping in mind that this list would hold 500 descriptions)?

ID: #22579462
Title: Quality Engineer
Description: Our client are a leading supplier of precision machined, high integrity components, integrated kits of parts and complete mechanical assemblies. Due to an large increase in workload they are recruiting a Quality Engineer Reporting to the Quality Manager, the successful individual will be responsible for providing documentation to fulfil our customers quality assurance requirements on specific contracts, whilst maintaining a system of storage and retrieval for documentation. The role will also support the internal audit schedule, performing audits as required. Responsibilities include: Documentation Checking all vendor supplied documentation to ensure it complies with the requirements or Express s customer specifications. Produce accurate, legible documentation packs, in accordance with customer requirements. Quality Systems Maintain system of storage and retrieval of all associated QA documentation in accordance with ISO9001:**** Certification Ensure certificates of conformance are checked, in accordance with the C of C matrix and any applicable concessions are referenced Material Certification Verify and approve certification on receipt for conformance to customer requirements and resolve discrepancies with suppliers Non conformance Raise and submit supplier reject reports and concessions. Store all responses received in relevant databases. Internal Auditing Carry out internal audits as and when required in line with the internal audit schedule. Identify and report all nonconformances within Quality Management System, and assist in corrective actions to close them out Supplier Rejects Ensure corrective action is received for supplier rejects submitted to key suppliers The Individual: Has experience within the quality department of a related company in a similar role Ideally from a mechanical or manufacturing engineering background. 
Ideally be familiar with the range of processes involved in the markets of Oil Must have good communication and organisational skills Has the ability to work as part of a team or as an individual. Has the ability to be customer facing and discuss technical / quality issues with vendors and customers

ID: #22933091
Title: Chef de Partie  Award Winning Dining  Live In  Share of Tips
Description: A popular hotel located in Norfolk which is a very busy operation has a position available for a Chef de Partie Role: A Chef de Partie capable of coping well under pressure is required to join the kitchen team at a hotel that has an excellent reputation for offering high quality dining to its guests and has gained accreditations in the main restaurant.The busy Brasserie style restaurant regularly serves **** covers for lunch and dinner so this Chef de Partie role will require you to be organised on your section ensuring all prep is complete to the standards expected by the Head Chef before each service. Requirements: All Chef de Parties applying for this role must have a strong background with highlights previous AA Rosette experience in a high volume operation.A candidate who is self motivated and capable of working well in a busy team of chefs would be ideal for this role. Benefits Include: Uniform Provided Meals on Duty Accommodation Available Share of Tips – IRO **** Per Month Excellent Opportunities To Progress If you are interested in this position or would like information on the other positions we are recruiting for or any temporary assignments please send your CV by clicking on the 'apply now' button below and our consultant Sean Bosley will do his utmost to assist you in your search for employment. In line with the requirements of the Asylum Immigration Act **** all applicants must be eligible to live and work in the UK. Documented evidence of the eligibility will be required from candidates as part of the recruitment process. This job was originally posted as  

ID: #23528672
Title: Senior Fatigue and Damage Tolerance Engineer
Description: Senior Fatigue Static stress (metallic or composite) Finite element analysis. Senior Fatigue Aerospace  ****K****K (dep on exp)  benefits package Bristol, Avon

ID: #23529949
Title: C I Design Engineer
Description: We are currently recruiting on behalf of our client who have an exciting opportunity available for a CE Produce CE Control Panel designs  Genera Arrangements, Detail drawings, Schematics Diagrams, Interlock Diagrams for typically PLC Specification of hardware and production of parts list. Manufacturing specification. Ensure Company policies and procedures are being applied across the projects. Manage the interface between CE Communicate at all levels with both internal and external customers to meet their expectations while meeting the project budget and programme constraints. Support the Lead Engineer in the delivery of scope to budget and programme. Provide technical expertise to tenders as and when required. Provide input to the development of the CE&l function and resource

There are two things I am trying to achieve here. One: create a unigram vocabulary of all descriptions in the format word_string:integer_index. Two: create a text file where each line corresponds to one description. Each line would start with the ID (keeping the #), and the rest of the line would be the sparse representation of the corresponding description in the form word_index:word_freq, separated by commas.

I guess this is why I thought storing the IDs and descriptions in lists would be ideal. That way, index 0 in the ID list would be #22579462 and index 0 in the description list would be the corresponding description text.

Thanks in advance


Solution

  • You can read the whole file at once, then parse it with re.findall. The 'rslt' list contains (ID, Description) tuples; note that the ID is captured as digits only, without the leading "#":

    import re

    with open("info.txt") as ff:
        rslt = re.findall(r"(?sm)^\s*ID:\s*#(\d+)\s*$.*?^Description:(.*?)(?:\s*(?=^ID: #)|\Z)", ff.read())
    

    (?sm) --> m: multiline mode (^ and $ match at line boundaries); s: the dot (.) also matches newlines;

    ^\s*ID:\s*#(\d+) --> matches the start of a line, optional whitespace, the "ID:" label, optional whitespace and "#", then the digits, which are captured (see the parentheses);

    \s*$ --> after the digits, the line may contain only whitespace;

    .*?^Description: --> skips the Title line and matches the "Description:" label at the start of a line;

    (.*?)(?:\s*(?=^ID: #)|\Z) --> (.*?) captures the description text up to the next block beginning with "ID: #" or the end of the string (\Z).
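
The two goals from the question (a word:index vocabulary and one sparse line per description) could then be built on top of the parsed records. The following is only a minimal sketch, not part of the original answer. It assumes that a token is a lowercase run of letters or digits, that vocabulary indices follow alphabetical order, and that the ID is separated from the index:frequency pairs by a comma (the question does not specify any of these). The helper names parse_records, tokenize, build_vocabulary and sparse_line are hypothetical. The pattern is the one above with the "#" moved inside the capture group, so the ID keeps its "#".

```python
import re
from collections import Counter

# Same pattern as in the answer, except the group also captures the "#".
RECORD_RE = re.compile(
    r"(?sm)^\s*ID:\s*(#\d+)\s*$.*?^Description:(.*?)(?:\s*(?=^ID: #)|\Z)"
)
# Assumption: a token is a run of lowercase letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def parse_records(text):
    """Return two parallel lists: IDs (with '#') and description strings."""
    pairs = RECORD_RE.findall(text)
    ids = [job_id for job_id, _ in pairs]
    descriptions = [desc.strip() for _, desc in pairs]
    return ids, descriptions


def tokenize(description):
    """Lowercase the text and split it into unigram tokens."""
    return TOKEN_RE.findall(description.lower())


def build_vocabulary(descriptions):
    """Map each distinct word to an integer index (alphabetical order)."""
    words = sorted({w for d in descriptions for w in tokenize(d)})
    return {word: index for index, word in enumerate(words)}


def sparse_line(job_id, description, vocab):
    """Format one description as '#id,word_index:word_freq,...'."""
    counts = Counter(tokenize(description))
    pairs = sorted((vocab[w], n) for w, n in counts.items())
    return job_id + "," + ",".join(f"{i}:{n}" for i, n in pairs)
```

To use it, pass the file contents to parse_records, build the vocabulary from the returned descriptions, and write one sparse_line per (ID, description) pair to the output file. Keeping the IDs and descriptions in two parallel lists, as the question suggests, works fine at this size (about 500 records).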