
How to scrape fixed-width files in Python?


In Python 3, I have a series of links to fixed-width files. They are hosted on a website with public information about companies, and each line of a file holds information about a company.

Example links:

http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC

and

http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214RO

I have these links in a dictionary. The key is the name of the region of the country where the companies are located, and the value is the link:

for chave, valor in dict_val.items():
    print(f'Region of country: {chave} - and link with information: {valor}')

Region of country: Acre - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
Region of country: Espírito Santo - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214ES
...

I want to read these links (fixed-width files) and save the content to a CSV file. Example content:

0107397388000155ASSOCIACAO CULTURAL                                                                                                                                                          
02073973880001552              16MARIA DO SOCORRO RODRIGUES ALVES BRAGA                                                                                                                      
0101904573000102ABREU E SILVA COMERCIO DE MEDICAMENTOS LTDA-ME  - ME                                                                                                                         
02019045730001022              49JETEBERSON OLIVEIRA DE ABREU                                                                                                                                
02019045730001022              49LUZINETE SANTOS DA SILVA ABREU                                                                                                                              
0101668652000161CONSELHO ESCOLAR DA ESCOLA ULISSES GUIMARAES                                                                                                                                 
02016686520001612              10REGINA CLAUDIA RAMOS DA SILVA PESSOA                                                                                                                        
0101631137000107FORTERM * REPRESENTACOES E COMERCIO LTDA                                                                                                                                     
02016311370001072              49ANTONIO MARCOS GONCALVES                                                                                                                                    
02016311370001072              22IVANEIDE BERNARDO DE MENEZES 

But to fill the columns of the CSV, I need to test each line of these fixed-width files and split it into fields.

I must follow rules like these:

1. If the line begins with "01", it is a line with the company's registration number and its name. Example: "0107397388000155ASSOCIACAO CULTURAL"

1.1 - The "01" indicates this record type

1.2 - The next 14 positions on the line are the company code - starts at position 3 and ends at 16 - (07397388000155)

1.3 - The following 150 positions are the company name - starts at position 17 and ends at 166 - (ASSOCIACAO CULTURAL)

and

2. If the line starts with "02", it has information about the partners of the company. Example: "02073973880001552              16MARIA DO SOCORRO RODRIGUES ALVES BRAGA"

2.1 - The "02" indicates this record type

2.2 - The next 14 positions are the company registration code - starts at position 3 and ends at 16 - (07397388000155)

2.3 - The next position is a member identifier code, which can be 1, 2 or 3 - starts and ends at position 17 - (2)

2.4 - The next 14 positions are another code identifying the member - starts at position 18 and ends at 31 - ("" - empty in this case)

2.5 - The next two positions are another code identifying the member - starts at position 32 and ends at 33 - (16)

2.6 - The final 150 positions are the name of the partner - starts at position 34 and ends at 183 - (MARIA DO SOCORRO RODRIGUES ALVES BRAGA)
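
To make the positions above concrete, here is a minimal sketch of how the two layouts could map onto Python slices. The function name `parse_line` and all field names are hypothetical labels chosen for illustration; only the positions come from the rules above (converted from 1-based, inclusive positions to 0-based, end-exclusive slices).

```python
# Hypothetical field names; slice bounds are the 1-based positions from the
# rules above converted to 0-based, end-exclusive Python slices.
LAYOUTS = {
    "01": {
        "company_code": slice(2, 16),     # positions 3-16
        "company_name": slice(16, 166),   # positions 17-166
    },
    "02": {
        "company_code": slice(2, 16),       # positions 3-16
        "member_type": slice(16, 17),       # position 17
        "member_code": slice(17, 31),       # positions 18-31
        "member_qualifier": slice(31, 33),  # positions 32-33
        "member_name": slice(33, 183),      # positions 34-183
    },
}


def parse_line(line):
    """Return a dict of stripped fields, or None for an unknown record type."""
    record_type = line[:2]
    layout = LAYOUTS.get(record_type)
    if layout is None:
        return None
    fields = {name: line[pos].strip() for name, pos in layout.items()}
    fields["record_type"] = record_type
    return fields


# Sample lines built from the examples above; the "02" line is assembled
# piece by piece so the 14 blank positions are explicit.
company_line = "01" + "07397388000155" + "ASSOCIACAO CULTURAL"
partner_line = (
    "02" + "07397388000155" + "2" + " " * 14 + "16"
    + "MARIA DO SOCORRO RODRIGUES ALVES BRAGA"
)

company = parse_line(company_line)
partner = parse_line(partner_line)
print(company)
print(partner)
```

Slicing past the end of a short line simply returns whatever characters are there, so lines whose trailing spaces were trimmed should still parse.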

In this case, would one possible strategy be to save each link as a TXT file and then try to separate the positions? Or is there a better way to parse fixed-width files?


Solution

  • You can take a look at any HTTP client module. I recommend Requests, although you can use urllib, which comes bundled with Python.

    With that in mind, you can get the text from the page. Since it doesn't require a login of any form, with Requests it is simply a matter of:

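    import requests

    # Replace the placeholder with one of the receita.fazenda.gov.br links
    r = requests.get('Your link from receita.fazenda.gov.br')
    r.raise_for_status()  # stop early on an HTTP error
    page_text = r.text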
    

    Read more in the Quickstart section of the Requests documentation. I'll leave the 'position-separating' to you.

    Hint: Use regex.
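
As a rough illustration of the regex hint, the sketch below matches the two record types described in the question and writes one CSV row per partner. The pattern names, group names, the `rows_from_text` helper, and the column headers are all assumptions made for this example; only the field widths come from the question. It runs on an in-memory sample rather than a downloaded page, so `sample` stands in for `page_text`.

```python
import csv
import io
import re

# Widths follow the positions given in the question; group names are
# hypothetical. Names use {0,150} because trailing spaces may be trimmed.
COMPANY_RE = re.compile(r"^01(?P<code>.{14})(?P<name>.{0,150})")
PARTNER_RE = re.compile(
    r"^02(?P<code>.{14})(?P<kind>.)(?P<member_code>.{14})"
    r"(?P<qualifier>.{2})(?P<name>.{0,150})"
)


def rows_from_text(page_text):
    """Yield one row per "02" line, adding the company name taken from the
    "01" line that has the same company code."""
    company_names = {}
    for line in page_text.splitlines():
        m = COMPANY_RE.match(line)
        if m:
            company_names[m["code"]] = m["name"].strip()
            continue
        m = PARTNER_RE.match(line)
        if m:
            yield [
                m["code"],
                company_names.get(m["code"], ""),
                m["kind"],
                m["member_code"].strip(),
                m["qualifier"],
                m["name"].strip(),
            ]


# Two lines taken from the example content in the question.
sample = "\n".join([
    "01" + "07397388000155" + "ASSOCIACAO CULTURAL",
    "02" + "07397388000155" + "2" + " " * 14 + "16"
    + "MARIA DO SOCORRO RODRIGUES ALVES BRAGA",
])

# Written to an in-memory buffer here; a real file opened with
# open(path, "w", newline="") could be passed to csv.writer instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["company_code", "company_name", "member_type",
                 "member_code", "member_qualifier", "member_name"])
rows = list(rows_from_text(sample))
writer.writerows(rows)
print(buffer.getvalue())
```

In this sketch a company with no "02" lines produces no row; if such companies should appear in the CSV, the "01" branch would need to emit a row of its own. Plain string slicing at the same positions would work equally well and may be easier to read than the patterns.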