matlab api wikipedia wikipedia-api mediawiki-api

Read data from Wikipedia using the API


For my project, I am trying to read data from Wikipedia, but I am not sure how to do that.

My main goal is to read the date, location and subject of an event. As a start, I am trying to read this information for the 91st Academy Awards.

I tried using the Wikipedia query service, but it didn't help much.

Then I came across the API and ran the following URL: https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=sections&page=91st_Academy_Awards

But I didn't find the information I was looking for.

I am trying to read the information marked with the red box in the image below:

https://i.sstatic.net/W7hFL.png

Can somebody help me with this and let me know how I can read the above-mentioned section?

PS: I am using MATLAB to write my algorithm.


Solution

  • A possible solution is to fetch the rendered page through the API using webread and process the returned HTML with functions from the Text Analytics Toolbox (htmlTree, findElement, extractHTMLText):

    % Fetch the parsed page; webread decodes the JSON response into a struct.
    raw = webread('https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=text&page=91st_Academy_Awards');
    
    % Specify sections of interest.
    SectionsOfInterest = ["Date","Site","Preshow hosts","Produced by","Directed by"];
    
    % Parse HTML data (the JSON key "*" is decoded as the field name x_).
    myTree = htmlTree(raw.parse.text.x_);
    
    % Find table elements; the first one is assumed to be the infobox.
    tableElements = findElement(myTree,"table");
    tableOfInterest = tableElements(1);
    
    % Find header cell elements.
    thElements = findElement(tableOfInterest,"th");
    % Find cell elements.
    tdElements = findElement(tableOfInterest,"td");
    
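    % Extract text. The lookup below pairs th and td entries by position,
    % so it assumes header and data cells occur in the same order and number.
    thHTML = thElements.extractHTMLText;
    tdHTML = tdElements.extractHTMLText;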
    
    for section = 1:numel(SectionsOfInterest)
    
       sectionName = SectionsOfInterest(section);
       sectIndex = strcmp(sectionName,thHTML);
    
       % Remove spaces if present from section name.
       sectionName = strrep(sectionName,' ','');
    
       % Clean up data: collapse line breaks inside a cell into a '.' separator.
       sectData = regexprep(tdHTML(sectIndex),'\n+','.');
    
       % Create structure.
       s.(sectionName) = sectData;
    end
    

    Visualising the output structure:

    >> s
    s = 
    
    struct with fields:
    
            Date: "February 24, 2019"
            Site: "Dolby Theatre.Hollywood, Los Angeles, California, U.S."
    Preshowhosts: "Ashley Graham.Maria Menounos.Elaine Welteroth.Billy Porter.Ryan Seacrest. "
      Producedby: "Donna Gigliotti.Glenn Weiss"
      Directedby: "Glenn Weiss"
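
If the Text Analytics Toolbox is not available, a rough alternative is to request the page's wikitext instead of its HTML and pull the infobox fields out with regular expressions from base MATLAB. The following is only a sketch under assumptions: the wikitext excerpt is a made-up sample, and the field names (date, site, producer, director) are guesses at what the infobox template uses, not values checked against the live article. The commented-out webread call shows how the real text might be fetched; it is untested here.

```matlab
% Hypothetical infobox excerpt; field names and values are assumptions
% for illustration, not copied from the live article.
wikitext = strjoin({ ...
    '{{Infobox Academy Awards', ...
    '| date = February 24, 2019', ...
    '| site = [[Dolby Theatre]]<br />Hollywood, Los Angeles, California, U.S.', ...
    '| producer = Donna Gigliotti<br />Glenn Weiss', ...
    '| director = Glenn Weiss', ...
    '}}'}, newline);

% To try this on the real page (network required), something like the
% following may work; formatversion=2 should return the wikitext as text:
% raw = webread('https://en.wikipedia.org/w/api.php', 'action','parse', ...
%     'format','json','formatversion','2','prop','wikitext', ...
%     'section','0','page','91st_Academy_Awards');
% wikitext = raw.parse.wikitext;

% Capture every "| name = value" line as a {name, value} token pair.
tokens = regexp(wikitext, '^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$', ...
    'tokens', 'lineanchors', 'dotexceptnewline');

info = struct();
for k = 1:numel(tokens)
    % Make sure the infobox key is usable as a struct field name.
    name  = matlab.lang.makeValidName(tokens{k}{1});
    value = tokens{k}{2};
    % Turn <br> line breaks into comma separators.
    value = regexprep(value, '<br\s*/?>', ', ');
    % Reduce [[target|label]] and [[target]] links to their visible text.
    value = regexprep(value, '\[\[(?:[^\]|]*\|)?([^\]]*)\]\]', '$1');
    info.(name) = value;
end
```

After running this on the sample text, info has one field per infobox line (for example info.date and info.site). Real infoboxes often contain nested templates and references, which this simple pattern does not handle, so the HTML approach above is likely more robust when the toolbox is available.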