Tags: python, rest, sharepoint

Access Excel on SharePoint via REST API and copy data to MS SQL


We have the following use case:
Users are not able to input all the data they need into our ERP system, so they maintain Excel files with that data. For planning activities they want to combine their Excel data with ERP data (from the database) in yet another Excel file. They want to add comments in this file, and they want it to be updated every day with the most recent data from the ERP system.

The solution that I envision is a Python script that gathers the Excel data at night, loads it into MS SQL, transforms it there, and then loads it back to SharePoint, replacing the old file if the process succeeds.

The difficult part is accessing the Excel file on SharePoint. I wanted to use the Microsoft Graph API but was asked to use a Python library called Office365 instead. I don't get along with the documentation too well, so I will describe what I need in pseudo-code below and hope for your help.

Do you know a way in Python to achieve roughly the following?

import MagicLibrary as MagL
import pandas as pd

SHAREPOINT_USERNAME = "[email protected]"
SHAREPOINT_PASSWORD = "Password"
SHAREPOINT_URL = "Sharepoint URL"
SHAREPOINT_PATH = "Path to Document"
FILE_NAME = "NeededData.xls"
SHEET_NAME = "Sheet 1"
RANGE = "A1:D20"

SharePoint_Connection = MagL.Connection(SHAREPOINT_URL, SHAREPOINT_USERNAME, SHAREPOINT_PASSWORD)

Excel_Range = SharePoint_Connection.Path(SHAREPOINT_PATH).File(FILE_NAME).Sheets(SHEET_NAME).Range(RANGE).As_Pandas_Usable_Array
df = pd.DataFrame(Excel_Range)

#Use SQL Alchemy below to write dataframe to MS SQL Server. 

    #Code

Alternatively, if the API is only able to get into SharePoint, but not into Excel, I would also be fine with loading the Excel file to the server that executes the script and then accessing the file contents there, using Pandas.


Solution

  • I wasn't able to convince the API to give me the data from within an Excel file, but I managed to download the Excel file and then turn it into a pandas data frame.

    In the production script I use, I have called the source_files array "mapping". In addition to file names, I also store information about the Excel files there, which I use in the pd.read_excel call to define where to take the data from.
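A "mapping" entry along those lines might look like the sketch below. All names and values here are hypothetical illustrations, not taken from the actual script: each entry pairs a file-name regex with the pd.read_excel parameters for that file and a target table name.

```python
# Hypothetical structure for the "mapping" list described above.
mapping = [
    {
        "file_regex": r"NeededData.*\.xlsx?",  # regex matched against SharePoint file names
        "read_args": {                         # keyword arguments passed to pd.read_excel via **
            "sheet_name": "Sheet 1",
            "header": None,
            "skiprows": 1,
            "usecols": "A:D",
            "names": ["id", "item", "quantity", "comment"],
            "nrows": 1000000,
        },
        "sql_table": "planning_data",          # target table name for df.to_sql later
    },
]

# The download loop would then match on entry["file_regex"], and the load loop
# would call pd.read_excel(local_path, **entry["read_args"]).
for entry in mapping:
    print(entry["sql_table"], entry["read_args"]["usecols"])
```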

    #For working with Dataframes
    import pandas as pd
    
    #To access the Office 365 web API for downloading Excel files. 
    from office365.sharepoint.client_context import ClientContext
    from office365.runtime.auth.user_credential import UserCredential
    
    #To load the credentials from the config file.
    import json
    
    #To build local paths for the downloaded Excel files.
    import os
    
    #To use Regex when checking for names of Excel files. 
    import re
    
    #Open file with usernames, passwords and addresses
    with open('config.json') as content:
        config = json.load(content)
    
    #create needed static variables
    SHAREPOINT_SITE = "https://eurotech2.sharepoint.com/sites/YourSite"
    SHAREPOINT_PATH = "Shared Documents/YourPath"
    SHAREPOINT_USER = config['SHAREPOINT_USER_MAIL']
    SHAREPOINT_PASSWORD = config['SHAREPOINT_PASSWORD']
    
    source_files= ["YourFirstFileNameRegexString","YourSecondFileNameRegexString"]
    
    #Connect to SharePoint
    sp_credentials = UserCredential(SHAREPOINT_USER,SHAREPOINT_PASSWORD)
    sp_con = ClientContext(SHAREPOINT_SITE).with_credentials(sp_credentials)
    
    #Store the file objects at the chosen location in a variable
    sp_files = sp_con.web.get_folder_by_server_relative_url(SHAREPOINT_PATH).files
    sp_con.load(sp_files).execute_query()
    
    #For every source file
    for i, source_file in enumerate(source_files):
        #For every file in the SharePoint folder
        for sp_file in sp_files:
            #Use Regex to check whether we found our searched-for source. 
            if re.search(source_file, sp_file.name):
                #Take the script's path, add the file name and store this back into
                #source_files as the new source. (Rebinding the loop variable alone
                #would not update the list for the second loop below.)
                source_files[i] = os.path.abspath(sp_file.name)
                #With the new source path we created,
                with open(source_files[i], "wb") as local_file:
                    #download the SharePoint file to the local system.
                    file_for_download = sp_con.web.get_file_by_server_relative_url(sp_file.serverRelativeUrl)
                    file_for_download.download(local_file)
                    sp_con.execute_query()
                
                #If there are multiple files matching the Regex, this causes only the first to be downloaded. 
                break
    
    #For every source file
    for source_file in source_files:
        #Check whether the original Regex string was replaced by a valid path that leads to an existing file. 
        if os.path.isfile(source_file):
            #Read the Excel file and assign column names as defined. 
            df = pd.read_excel(source_file       #Excel location
                , sheet_name="Your sheet name"
                , header=None
                , skiprows=0                     #Number of rows you want to skip
                , usecols="A:D"                  #Excel columns you want to use
                , names=["your", "column", "names", "here"]  #Names you want to give to the columns
                #For some reason, if I don't set this, the dataframe gets unreasonably large. 
                , nrows=1000000
                )
    
            #Do something with the dataframe before it gets overwritten by the next iteration. 
    

    From here on out I use df.to_sql to write the data frames to MS SQL. I define the table name in the "mapping" array, and the SQL Alchemy engine and schema are set up as static variables at the beginning of the script.
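That final load step might look roughly like this. It is a sketch under assumptions: the table name, columns, and connection string are placeholders, and an in-memory SQLite engine stands in for MS SQL so the snippet runs anywhere; for SQL Server you would swap in an mssql+pyodbc URL and pass your schema.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder engine: in-memory SQLite stands in for MS SQL here.
# For SQL Server you would use something like:
# engine = create_engine("mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server")
engine = create_engine("sqlite://")

# Stand-in for a data frame produced by pd.read_excel above.
df = pd.DataFrame({"id": [1, 2], "quantity": [10, 20]})

# Replace the table on every nightly run so it always holds the latest data.
# With MS SQL you would also pass schema="your_schema".
df.to_sql("planning_data", engine, if_exists="replace", index=False)

# Read it back to confirm the write worked.
check = pd.read_sql("SELECT COUNT(*) AS n FROM planning_data", engine)
```

`if_exists="replace"` matches the nightly full-refresh described in the question; `append` would be the choice for incremental loads instead.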