In Python 3, I have a series of links to fixed-width files. They are hosted on websites with public information about companies, and each line of a file holds information about a company.
I have these links in a dictionary. The key is the name of the region of the country where the companies are located, and the value is the link:
for chave, valor in dict_val.items():
    print(f'Region of country: {chave} - and link with information: {valor}')
Region of country: Acre - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
Region of country: Espírito Santo - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214ES
...
I want to read these links (fixed-width files) and save the content to a CSV file. Example content:
0107397388000155ASSOCIACAO CULTURAL
02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA
0101904573000102ABREU E SILVA COMERCIO DE MEDICAMENTOS LTDA-ME - ME
02019045730001022 49JETEBERSON OLIVEIRA DE ABREU
02019045730001022 49LUZINETE SANTOS DA SILVA ABREU
0101668652000161CONSELHO ESCOLAR DA ESCOLA ULISSES GUIMARAES
02016686520001612 10REGINA CLAUDIA RAMOS DA SILVA PESSOA
0101631137000107FORTERM * REPRESENTACOES E COMERCIO LTDA
02016311370001072 49ANTONIO MARCOS GONCALVES
02016311370001072 22IVANEIDE BERNARDO DE MENEZES
To fill the columns of the CSV, I need to test each line of these fixed-width files and split it into fields.
I must follow rules like these:
1. If the line begins with "01", it is a line with the company's registration number and its name. Example: "0107397388000155ASSOCIACAO CULTURAL"
1.1 - The "01" indicates this record type
1.2 - The next 14 positions on the line are the company code - starts at position 3 and ends at 16 - (07397388000155)
1.3 - The following 150 positions are the company name - starts at position 17 and ends at 166 - (ASSOCIACAO CULTURAL)
2. If the line starts with "02", it has information about the partners of the company. Example: "02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA"
2.1 - The "02" indicates this record type
2.2 - The next 14 positions are the company registration code - starts at position 3 and ends at 16 - (07397388000155)
2.3 - The next position is a member identifier code, which can be 1, 2 or 3 - starts and ends at position 17 - (2)
2.4 - The next 14 positions are another code identifying the member - starts at position 18 and ends at 31 - ("", empty in this case)
2.5 - The next 2 positions are another code identifying the member - starts at position 32 and ends at 33 - (16)
2.6 - The final 150 positions are the name of the partner - starts at position 34 and ends at 183 - (MARIA DO SOCORRO RODRIGUES ALVES BRAGA)
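As a minimal sketch, the positions listed in these rules can be turned into Python slices (1-based inclusive positions become 0-based, end-exclusive slices). The field names `member_type`, `member_code` and `member_code_2` and the function `parse_line` are hypothetical, chosen only for illustration. The sample lines are padded by hand to the stated widths, on the assumption that the real files keep that padding.

```python
# Hypothetical layouts derived from the stated positions.
# Position N (1-based) corresponds to index N - 1 in Python.
COMPANY_FIELDS = {
    "company_code": slice(2, 16),    # positions 3-16
    "company_name": slice(16, 166),  # positions 17-166
}
PARTNER_FIELDS = {
    "company_code": slice(2, 16),    # positions 3-16
    "member_type": slice(16, 17),    # position 17
    "member_code": slice(17, 31),    # positions 18-31
    "member_code_2": slice(31, 33),  # positions 32-33
    "partner_name": slice(33, 183),  # positions 34-183
}
LAYOUTS = {"01": COMPANY_FIELDS, "02": PARTNER_FIELDS}


def parse_line(line):
    """Return (record_type, fields) for a "01"/"02" line, else None."""
    record_type = line[:2]
    layout = LAYOUTS.get(record_type)
    if layout is None:
        return None
    # Slicing past the end of a short line is safe: it yields a shorter string.
    fields = {name: line[pos].strip() for name, pos in layout.items()}
    return record_type, fields


# Sample lines, padded by hand to match the stated field widths.
company = "01" + "07397388000155" + "ASSOCIACAO CULTURAL".ljust(150)
partner = ("02" + "07397388000155" + "2" + " " * 14 + "16"
           + "MARIA DO SOCORRO RODRIGUES ALVES BRAGA".ljust(150))

print(parse_line(company))
print(parse_line(partner))
```

Plain slicing avoids any pattern matching, which tends to suit formats where every field has a fixed position.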
In this case, would one possible strategy be to save each link as a TXT file and then try to separate the positions? Or is there a better way to parse fixed-width files?
You can take a look at any HTTP client module. I recommend Requests, although you can also use urllib, which comes bundled with Python.
With that in mind, you can get the text from the page. Since it doesn't require a login of any form, with Requests it is simply a matter of:
import requests
r = requests.get('Your link from receita.fazenda.gov.br')
page_text = r.text
Read more in the Quickstart section of the Requests documentation. I'll leave the position-separating to you.
Hint: Use regex.
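To make the regex hint concrete, here is one possible sketch, not a definitive implementation. It assumes the field widths given in the question, that lines keep their padding, and that each partner line can be joined to its company through the 14-character company code. The group names and the helpers `rows_from_text` and `write_csv` are hypothetical. The sample text stands in for `page_text` so the block runs without a network call.

```python
import csv
import io
import re

# Patterns built from the field widths stated in the question.
# Group names are illustrative, not official field names.
COMPANY_RE = re.compile(r"^01(?P<company_code>.{14})(?P<company_name>.{0,150})")
PARTNER_RE = re.compile(
    r"^02(?P<company_code>.{14})(?P<member_type>.)"
    r"(?P<member_code>.{14})(?P<member_code_2>.{2})(?P<partner_name>.{0,150})"
)
FIELDNAMES = ["company_code", "company_name", "member_type",
              "member_code", "member_code_2", "partner_name"]


def rows_from_text(page_text):
    """Yield one dict per partner line, with the company name looked up by code."""
    company_names = {}
    for line in page_text.splitlines():
        match = COMPANY_RE.match(line)
        if match:
            company_names[match["company_code"]] = match["company_name"].strip()
            continue
        match = PARTNER_RE.match(line)
        if match:
            row = {key: value.strip() for key, value in match.groupdict().items()}
            row["company_name"] = company_names.get(row["company_code"], "")
            yield row


def write_csv(page_text, out_file):
    """Write the parsed partner rows to an open text file object as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows_from_text(page_text))


# Hand-padded sample standing in for r.text.
sample = "\n".join([
    "01" + "07397388000155" + "ASSOCIACAO CULTURAL".ljust(150),
    "02" + "07397388000155" + "2" + " " * 14 + "16"
    + "MARIA DO SOCORRO RODRIGUES ALVES BRAGA".ljust(150),
])
buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("out.csv", "w", newline="", encoding="utf-8")` (the filename is just an example). Since every field has a fixed position, plain string slicing would work equally well here; the regex mainly adds a check that a line is long enough to be a valid record.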