Search code examples
pythonconfluenceconfluence-rest-api

Get table as JSON from Confluence


I'm trying to get table contents from a Confluence page in JSON format. It's all SSO so I can only use an API key and I haven't found a way to access Confluence with the requests library. Unfortunately, the output of the Confluence API is plain html.

This is what I've got so far. Can the Confluence library output tables in JSON format (rather than displaying the raw html code in a dictionary)?

from atlassian import Confluence
import os

user = "[email protected]"
api_key = os.environ['confluence_api_key']
server = "https://xxxxxx.atlassian.net"
api_url = "/rest/api/content"
page_id = "12345"

confluence = Confluence(url=server, username=user, password=api_key)
page = confluence.get_page_by_title("TEST", "page 1", expand="body.storage")
content = page["body"]["storage"]
print(content)

And the output looks like this:

{'value': '<p>Something something.</p><p /><table data-layout="default" ac:local-id="xxx"><colgroup><col style="width: 226.67px;" /><col style="width: 226.67px;" /><col style="width: 226.67px;" /></colgroup><tbody><tr><th><p><strong>name</strong></p></th><th><p><strong>type</strong></p></th><th><p><strong>comment</strong></p></th></tr><tr><td><p>text1</p></td><td><p>varchar(10)</p></td><td><p /></td></tr><tr><td><p>123</p></td><td><p>int</p></td><td><p /></td></tr></tbody></table>', 'representation': 'storage', 'embeddedContent': [], '_expandable': {'content': '/rest/api/content/12345'}}

404 error with the requests library:

request_url = "{server}{api_url}/{page_id}?expand=body.storage".format(
    server=server, api_url=api_url, page_id=page_id
)

requestResponse = requests.get(request_url, auth=(user, api_key))

print(requestResponse.status_code)

Solution

  • Why would you want to use the requests library directly? The atlassian python API is using it already, sparing you some work.

    I happen to have run into the same problem this week and I had to parse tables with BeautifulSoup. I think for a generic solution it would be best to get your tables as Dataframes:

    from atlassian import Confluence
    import os
    from bs4 import BeautifulSoup
    import pandas as pd
    
    user = "[email protected]"
    api_key = os.environ['confluence_api_key']
    server = "https://xxxxxx.atlassian.net"
    
    confluence = Confluence(url=server, username=user, password=api_key)
    page = confluence.get_page_by_title("TEST", "page 1", expand="body.storage")
    body = page["body"]["storage"]["value"]
    
    tables_raw = [[[cell.text for cell in row("th") + row("td")]
                        for row in table("tr")]
                        for table in BeautifulSoup(body, features="lxml")("table")]
    
    tables_df = [pd.DataFrame(table) for table in tables_raw]
    for table_df in tables_df:
        print(table_df)
    

    You can then convert your DataFrames to JSON using to_json depending on how you want to structure your dictionaries...

    EDIT: style information (and other tags like links) get lost in this case (we only get the text of the cell) so beware if you want to update the content of your page after modification
    Also if you wanna use table content as dictionary keys you might want to change row/column index

    EDIT2: This is an old answer but since it got upvoted recently, I would just like to add that in this case, tables_raw can be computed with the pandas built-in read_html:

    tables_df = pd.read_html(body)
    

    This will even set tables headers as df column names directly and has arguments to extract links or parse dates. However, especially if you need more than cell.text (in my initial case I wanted to import icons) in your df, the above answer is still valid.