java, html, csv, screen-scraping

Inconsistent spacing when building a CSV file


I'm using screen-scraper to scrape the raw text data below from a site. I then have to format it and write it to a CSV file. The scraped text has the right content, but the line breaks don't carry over.

Servicing Option: Retained/CTOS,,,Servicing Fee Rate: 0.250,,,Remittance Type: Gold
Seller Number: 143939

Mortgage product name from the Pricing Engine: 30-Year Fixed Rate Conventional,,,Valid for: 05/17/2013 at 01:56:39 PM, EDT
Interest Rate,5-DAY Contract Expiration Date : 05/22/2013,10-DAY Contract Expiration Date : 05/28/2013,15-DAY Contract Expiration Date : 06/03/2013,30-DAY Contract Expiration Date : 06/17/2013,45-DAY Contract Expiration Date : 07/01/2013,60-DAY Contract Expiration Date : 07/16/2013,75-DAY Contract Expiration Date : 07/31/2013,90-DAY Contract Expiration Date : 08/15/2013,
2.750,94.587,94.549,94.511,94.392,94.302,94.176,94.080,93.975,
2.875,95.574,95.535,95.497,95.363,95.273,95.134,95.038,94.919,
3.000,96.549,96.510,96.472,96.323,96.234,96.082,95.986,95.854,
3.125,97.489,97.450,97.412,97.250,97.160,96.997,96.901,96.757,
3.250,99.325,99.279,99.232,99.136,99.027,98.917,98.800,98.714,
3.375,100.333,100.287,100.240,100.126,100.017,99.891,99.774,99.673,
3.500,101.201,101.154,101.107,100.980,100.871,100.734,100.617,100.504,
3.625,102.016,101.970,101.923,101.785,101.676,101.529,101.413,101.290,
3.750,102.699,102.652,102.606,102.458,102.350,102.195,102.079,101.948,
3.875,103.326,103.271,103.216,103.146,103.018,102.915,102.777,102.703,
4.000,104.095,104.040,103.985,103.910,103.782,103.672,103.535,103.453,
4.125,104.834,104.779,104.724,104.641,104.513,104.399,104.262,104.176,
4.250,105.454,105.399,105.344,105.253,105.125,105.006,104.868,104.777,
4.375,104.469,104.405,104.342,104.441,104.293,104.315,104.157,104.209,
4.500,105.196,105.133,105.070,105.158,105.010,105.027,104.869,104.919,
4.625,105.892,105.828,105.765,105.847,105.699,105.712,105.554,105.599,
4.750,106.438,106.375,106.312,106.388,106.241,106.251,106.093,106.134,

Mortgage product name from the Pricing Engine: 20-Year Fixed Rate Conventional,,,Valid for: 05/17/2013 at 01:56:41 PM, EDT
Interest Rate,5-DAY Contract Expiration Date : 05/22/2013,10-DAY Contract Expiration Date : 05/28/2013,15-DAY Contract Expiration Date : 06/03/2013,30-DAY Contract Expiration Date : 06/17/2013,45-DAY Contract Expiration Date : 07/01/2013,60-DAY Contract Expiration Date : 07/16/2013,75-DAY Contract Expiration Date : 07/31/2013,90-DAY Contract Expiration Date : 08/15/2013,
2.750,95.080,95.042,95.003,94.869,94.779,94.640,94.543,94.424,
2.875,95.934,95.896,95.857,95.713,95.623,95.474,95.378,95.249,
3.000,96.777,96.739,96.700,96.546,96.456,96.299,96.202,96.065,
3.125,97.593,97.555,97.517,97.353,97.263,97.098,97.002,96.856,
3.250,100.570,100.523,100.476,100.364,100.255,100.131,100.014,99.915,
3.375,101.473,101.427,101.380,101.256,101.147,101.013,100.896,100.786,
3.500,102.276,102.229,102.183,102.050,101.941,101.799,101.682,101.564,
3.625,103.027,102.981,102.934,102.794,102.685,102.537,102.421,102.296,
3.750,103.584,103.538,103.491,103.346,103.237,103.085,102.969,102.839,
3.875,103.952,103.897,103.842,103.763,103.635,103.525,103.388,103.307,
4.000,104.664,104.609,104.554,104.474,104.346,104.232,104.095,104.009,
4.125,105.338,105.283,105.228,105.144,105.016,104.901,104.763,104.676,
4.250,105.824,105.769,105.714,105.624,105.496,105.380,105.243,105.154,
4.375,104.700,104.637,104.574,104.663,104.515,104.526,104.368,104.411,
4.500,105.361,105.298,105.234,105.315,105.167,105.178,105.020,105.063,
4.625,105.966,105.903,105.840,105.920,105.772,105.784,105.626,105.669,
4.750,106.336,106.273,106.210,106.290,106.143,106.158,106.000,106.046,

Mortgage product name from the Pricing Engine: 15-Year Fixed Rate Conventional,,,Valid for: 05/17/2013 at 02:04:38 PM, EDT
Interest Rate,5-DAY Contract Expiration Date : 05/22/2013,10-DAY Contract Expiration Date : 05/28/2013,15-DAY Contract Expiration Date : 06/03/2013,30-DAY Contract Expiration Date : 06/17/2013,45-DAY Contract Expiration Date : 07/01/2013,60-DAY Contract Expiration Date : 07/16/2013,75-DAY Contract Expiration Date : 07/31/2013,90-DAY Contract Expiration Date : 08/15/2013,
2.250,98.764,98.734,98.704,98.649,98.542,98.478,98.351,98.274,
2.375,99.509,99.479,99.449,99.388,99.280,99.212,99.085,99.001,
2.500,100.254,100.224,100.194,100.126,100.019,99.946,99.818,99.729,
2.625,100.868,100.838,100.808,100.734,100.627,100.549,100.422,100.325,
2.750,101.649,101.611,101.573,101.496,101.411,101.325,101.214,101.163,
2.875,102.380,102.342,102.304,102.220,102.135,102.044,101.933,101.873,
3.000,103.046,103.008,102.969,102.881,102.796,102.701,102.590,102.525,
3.125,103.598,103.559,103.521,103.430,103.344,103.248,103.137,103.068,
3.250,104.053,104.015,103.976,103.880,103.795,103.697,103.586,103.514,
3.375,103.999,103.952,103.906,103.803,103.802,103.688,103.672,103.715,
3.500,104.604,104.558,104.511,104.404,104.403,104.288,104.272,104.311,
3.625,105.149,105.102,105.056,104.944,104.943,104.826,104.809,104.845,
3.750,105.597,105.551,105.504,105.389,105.388,105.268,105.252,105.282,
3.875,104.646,104.591,104.536,104.413,104.497,104.363,104.395,104.496,
4.000,105.305,105.250,105.196,105.069,105.153,105.016,105.048,105.145,
4.125,105.819,105.764,105.709,105.579,105.662,105.524,105.556,105.649,
4.250,106.253,106.198,106.143,106.009,106.093,105.953,105.984,106.074,

I've written the following code to format it:

outputfile = "c:/" + session.getName() + ".csv"; 

out = new FileWriter(outputfile, true);

_prices = session.getv("Prices");

_prices = _prices.replace("GoldSeller", "Gold\nSeller");
_prices = _prices.replace("Seller Number: 143939", "Seller Number: 143939\n\n");
_prices = _prices.replace("EDTInterest", "EDT\nInterest");
_prices = _prices.replace("Mortgage", "\n\nMortgage");

String[] words = _prices.split(",");

for(i = 0; i < words.length; i++) {
    try {
        if(words[i].length() > 1) {
            if(words[i].substring(0,2).equals("0.") ||
                 words[i].substring(0,2).equals("1.") ||
                 words[i].substring(0,2).equals("2.") ||
                 words[i].substring(0,2).equals("3.") ||
                 words[i].substring(0,2).equals("4.") ||
                 words[i].substring(0,2).equals("5.") ||
                 words[i].substring(0,2).equals("6.") ||
                 words[i].substring(0,2).equals("7.") ||
                 words[i].substring(0,2).equals("8.") ||
                 words[i].substring(0,2).equals("9.")) {    

                _prices = _prices.replace(words[i], "\n" + words[i]);
            }
        }
    } 
    catch(Exception e){
        session.log(e.toString() + " " + words[i].length());
    }       
}

out.write(_prices);

out.close();

The problem is that the \n character adds two extra lines in some places and one in others.

I don't want any empty rows except where I've explicitly added \n\n.

When I don't use \n, everything ends up on one line.


Solution

  • I think I just figured it out.

    _prices = _prices.replace(words[i], "\n" + words[i]);
    

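    This line replaces every occurrence of words[i] in _prices, not just the one at index i. Values such as 2.750 appear several times in the data, so each pass through the loop that hits one of them adds another \n before all of its occurrences, which produces the extra blank lines.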
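
    One way around this, as a minimal sketch rather than a tested fix for the real feed: rebuild the text token by token instead of calling replace on the whole string, so each token gets at most one \n. The class name RowBreaks and the method insertRowBreaks are made up for illustration, and the sketch assumes, like the original check, that a row starts at any token beginning with a single digit followed by a dot.

```java
// Hypothetical helper, not part of screen-scraper: the names are illustrative only.
final class RowBreaks {

    // Same test as the original substring(0,2) chain: one digit, then a dot (e.g. "2.750").
    static boolean startsRow(String token) {
        return token.length() > 1
                && token.charAt(0) >= '0' && token.charAt(0) <= '9'
                && token.charAt(1) == '.';
    }

    // Re-joins the comma-separated tokens, putting a single "\n" before each row-start token.
    static String insertRowBreaks(String prices) {
        // A limit of -1 keeps trailing empty tokens, so a trailing comma is preserved.
        String[] tokens = prices.split(",", -1);
        StringBuilder sb = new StringBuilder(prices.length() + 64);
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            if (startsRow(tokens[i])) {
                sb.append('\n'); // added once per token, even if the value repeats elsewhere
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shortened sample; 2.750 appears twice on purpose.
        String sample = "Interest Rate,5-DAY,10-DAY,2.750,94.587,94.549,2.750,95.080,95.042,";
        System.out.print(insertRowBreaks(sample));
    }
}
```

    In the original script this would stand in for the for loop: apply the four fixed replace calls first, then write the result of insertRowBreaks(_prices) to the file.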