Tags: php, csv, spreadsheet, file-get-contents, fputcsv

file_get_contents from html explode, write to cell of spreadsheet


What I'm trying to do is pull specific content out of a URL's source with file_get_contents(), explode() on the markup surrounding that content so that only the HTML-formatted content is returned, and then write each piece to a single cell of a spreadsheet or CSV. Easy, I thought.

This is what I've got:

<?php

//My .html

$url = 'http://spiderlearning.com/demo/ALG_SA_U1_L1.html';

//Get content

$content = file_get_contents($url);

//Get content sections

$lesson_name = explode( '<section id="nameField" class="editable" contenteditable="false">' , $content);

$section_title1 = explode( '<a onclick="goToByScroll(\'obj0\')" href="#">' , $content);

$challenge_q = explode( '<section id="redactor_content" class="editable" contenteditable="false">' , $content);

//Write content

$write1 = explode("</section>" , $lesson_name[1]);
$write2 = explode("</a>" , $section_title1[1]);
$write3 = explode("</section>" , $challenge_q[1]);

//Into arrays

$line1 = array($write1[0],$write2[0],$write3[0]);

$list = array($line1);

//Open .csv

$file = fopen("data/data.csv", "w");

//Write each line, delimited with ";"

foreach ($list as $line) fputcsv($file, $line, ';');

//Close

fclose($file);

?>

Which returns:

(Screenshot: CSV)

(Screenshot: Excel)

What I'm looking for is:

CSV:

Unit 1 Lesson 1; 1. Challenge Questions; <p><img src="https://s3-eu-west-1.amazonaws.com/teacher-uploads.fishtree.com/SpiderLearning/1428953716a42b06b9-1ce1-4594-badd-4ab8c9b65ac0.jpeg" alt="" rel="float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;" style="float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;"></p><p>Before you begin this lesson, let's see what you already know about the topic. Take a moment to complete the three Challenge Questions that follow.</p>

The problem seems to be the carriage returns in the formatted content. It's also picking up parentheses around the returned content, but I'm not sure where from. Is there any way to escape these? I've put together similar functions in the past with no problem whatsoever, but this is my first file_get_contents() into CSV, and a couple of weeks in I've finally hit a wall.


Solution

  • First, to get rid of the line breaks, do this: foreach ($list as $line) fputcsv($file, preg_replace( "/\r|\n/", "", $line), ';');

    It would be better to keep the field enclosures (the double quotes) that fputcsv adds. Without them, any semicolon inside one of the fields would break your CSV. The CSV you want then looks like:

    "Unit 1 Lesson 1";"1. Challenge Questions";"<p><img src=""https://s3-eu-west-1.amazonaws.com/teacher-uploads.fishtree.com/SpiderLearning/1428953716a42b06b9-1ce1-4594-badd-4ab8c9b65ac0.jpeg"" alt="""" rel=""float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;"" style=""float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;""></p><p>Before you begin this lesson, let's see what you already know about the topic. Take a moment to complete the three Challenge Questions that follow.</p>"
    

    In most cases, though, you cannot open this directly in Excel (there is a global setting for this somewhere). Instead, you need to import the data and then set the following options:

    (Screenshot: settings for import)
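
To check the line-break stripping and the quoting before involving Excel at all, here is a minimal sketch that writes one row to an in-memory stream and reads it back with fgetcsv(). The helper name write_row() and the shortened sample row are assumptions made for illustration; they are not part of the original code, and the sample HTML stands in for the content exploded from the page.

```php
<?php
// Hypothetical helper (not from the original post): strip CR/LF from every
// field, then let fputcsv() add the enclosures and double any inner quotes.
function write_row($handle, array $fields, string $delimiter = ';'): void
{
    // preg_replace() accepts an array subject and returns the cleaned array.
    $clean = preg_replace('/\r|\n/', '', $fields);
    fputcsv($handle, $clean, $delimiter, '"', '\\');
}

// Shortened sample data standing in for the exploded page sections.
// The third field contains double quotes, semicolons and a line break.
$row = [
    "Unit 1 Lesson 1",
    "1. Challenge Questions",
    "<p><img alt=\"\" style=\"float: left; width: 171px;\"></p>\r\n<p>Before you begin.</p>",
];

// Write to memory instead of data/data.csv so the sketch has no side effects.
$handle = fopen('php://memory', 'w+');
write_row($handle, $row);

// Raw CSV text as it would land in the file.
rewind($handle);
$csv_text = stream_get_contents($handle);

// Parse the same line back with the same delimiter and enclosure.
rewind($handle);
$parsed = fgetcsv($handle, 0, ';', '"', '\\');
fclose($handle);

echo $csv_text;
```

If the enclosures are doing their job, $parsed should hold exactly three fields, and the semicolons inside the style attribute should stay within the third field instead of splitting it into extra columns.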