I am currently trying to upload Excel tables (as .xls) to the OpenRefine (or OntoRefine) module of Ontotexts GraphDB. Since I had problems uploading the xls, I decided to first convert the xls file to a csv file and then upload it. Unfortunately, OpenRefine does not automatically recognize every time the file as CSV. So all data in each row is stored in a single column. E.g.:
--------------------------------------------------
| Col1, Col2, Col3, Col4 |
--------------------------------------------------
| Row11, Row12, Row13, Row14 |
--------------------------------------------------
| Row21, Row22, Row23, Row24 |
--------------------------------------------------
Instead of:
--------------------------------------------------
| Col1 | Col2 | Col3 | Col4 |
--------------------------------------------------
| Row11 | Row12 | Row13 | Row14 |
--------------------------------------------------
| Row21 | Row22 | Row23 | Row24 |
--------------------------------------------------
With the Post Request
POST /command/core/create-project-from-upload
a file format in the 'format' parameter and a json with the delimiter in 'options' parameter can be added to the POST request. However, this does not work either and the official OpenRefine documentation (https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-API) does not contain any hints as to the syntax of the 'options' JSON.
My current code looks like this:
import os
import xlrd
import csv
import requests
import re
xls_file_name_ext = os.path.basename('excel_file.xls')
# create the filename with path to the new csv file (path + name stays the same)
csv_file_path = os.path.dirname(xls_file_name_ext) + '/' + os.path.splitext(xls_file_name_ext)[0] + '.csv'
# remove all comma in xls file
xls_wb = xlrd.open_workbook(xls_file_name_ext)
xls_sheet = xls_wb.sheet_by_index(0)
for col in range(xls_sheet.ncols):
for row in range(xls_sheet.nrows):
_new_cell_val = str(xls_sheet.cell(row, col).value).replace(",", " ")
xls_sheet._cell_values[row][col] = _new_cell_val
# write to csv
with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
c_w = csv.writer(csv_file, delimiter=',')
for row in range(xls_sheet.nrows):
c_w.writerow(xls_sheet.row_values(row))
ontorefine_server = 'http://localhost:7200/orefine'
# filename of csv as project name in OntoRefine
onterefine_project_name = os.path.splitext(os.path.basename(csv_file_path))[0]
# the required paraneters for the post request
ontorefine_data = {"project-name": onterefine_project_name,
"format": "text/line-based/*sv",
"options": {
"separator": ","
}
}
ontorefine_file = {'project-file': open(csv_file_path, "rb")}
# execute the post request
ontorefine_response = requests.post(
ontorefine_server + '/command/core/create-project-from-upload', data=ontorefine_data, files=ontorefine_file
)
I assume that I am passing the POST request parameters incorrectly.
It all depends on your input data, of course, but the formatting looks OK. Here's what OntoRefine does "behind the curtains" if you try to import from the UI. You can see the same payload for yourself by intercepting your network traffic:
{
"format": "text/line-based/*sv",
"options": {
"project-name":"Your-project-here",
"separator":","
}
Judging from that, it looks like the project-name location is the only difference. Here is a curl command which does the same:
curl 'http://localhost:7200/orefine/command/core/importing-controller?controller=core%2Fdefault-importing-controller&jobID=1&subCommand=create-project' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data 'format%3Dtext%2Fline-based%2F*sv%26options%3D%7B%22separator%22%3A%22%2C%22%22projectName%22%3A%22Your-project-name%22%7D'