Extract TypeScript source code from an HTML page


How can I extract the TypeScript code from this HTML page? It has a p element with the file name and a pre element whose code is wrapped in spans with the classes "synStatement", "synIdentifier", "synConstant", and "synType". I am learning jsoup. The output of my Java jsoup program is incomplete and not formatted properly.

<p>Currency.ts</p>

<pre class="code lang-typescript" data-lang="typescript" data-unlink><span class="synStatement">export</span> <span class="synStatement">type</span> Currency <span class="synStatement">=</span> <span class="synIdentifier">{</span>
  unit: <span class="synConstant">'EUR'</span> | <span class="synConstant">'GBP'</span> | <span class="synConstant">'JPY'</span> | <span class="synConstant">'USD'</span>
  value: <span class="synType">number</span>
<span class="synIdentifier">}</span>

<span class="synStatement">export</span> <span class="synStatement">const</span> Currency <span class="synStatement">=</span> <span class="synIdentifier">{</span>
  <span class="synStatement">from(</span>value: <span class="synType">number</span><span class="synStatement">,</span> unit: Currency<span class="synIdentifier">[</span><span class="synConstant">'unit'</span><span class="synIdentifier">]</span> <span class="synStatement">=</span> <span class="synConstant">'USD'</span><span class="synStatement">)</span>: Currency <span class="synIdentifier">{</span>
    <span class="synStatement">return</span> <span class="synIdentifier">{</span> unit<span class="synStatement">,</span> value <span class="synIdentifier">}</span>
  <span class="synIdentifier">}</span>
<span class="synIdentifier">}</span>
</pre>

Desired output:

Currency.ts

export type Currency = {
  unit: 'EUR' | 'GBP' | 'JPY' | 'USD'
  value: number
}

export const Currency = {
  from(value: number, unit: Currency['unit'] = 'USD'): Currency {
    return { unit, value }
  }
}

I tried:

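import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;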

public class Currency
{
    public static void main( String[] args )
    {
        try {
            File input = new File("Currency.html");
            Document doc = Jsoup.parse(input, "UTF-8", "");
            List<String> typescriptCode = new ArrayList<String>();
            String strs[] = {
                "synStatement",
                "synIdentifier",
                "synConstant",
                "synType",
            };
            for (String str : strs) {
                Elements spansWithsynStatementElements = doc.select("span." + str);
                if (spansWithsynStatementElements != null) {
                    for (Element e : spansWithsynStatementElements) {
                        String text = "";
                        text += e.ownText();
                        typescriptCode.add(text);
                    }
                }
            }
            
            int size = typescriptCode.size();
            for (int i = 0; i < size; i++) {
                System.out.println(typescriptCode.get(i));
                System.out.println("");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Solution

  • If formatting is not an issue for you, you can simply extract and print the text:

    String script = doc.text(); 
    System.out.println(script);
    

    The output is:

    Currency.ts export type Currency = { unit: 'EUR' | 'GBP' | 'JPY' | 'USD' value: number}export const Currency = { from(value: number, unit: Currency['unit'] = 'USD'): Currency { return { unit, value } }}

    The line breaks and indentation are lost because text() normalizes whitespace. If you want formatted output, one option is to run the result through a pretty-print library. You can look here for example.
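
An alternative that keeps the original layout without a separate formatter is to read the pre element with wholeText(), which returns the text of an element and its children without normalizing whitespace. The attempt in the question loses content for two reasons: it collects the spans one class at a time, so document order is gone, and ownText() on a span never sees the text that sits between spans (such as "Currency" or "unit:"). The following is a minimal sketch, assuming jsoup 1.11.3 or later; the class name PreTextExtractor, the method extractCode, and the shortened HTML string are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PreTextExtractor {

    // Hypothetical helper: returns the text of the first <pre class="code ...">
    // with its line breaks and indentation preserved, or "" if there is none.
    static String extractCode(String html) {
        Document doc = Jsoup.parse(html);
        Element pre = doc.selectFirst("pre.code");
        // wholeText() walks all descendant text nodes in document order
        // and does not collapse whitespace the way text() does.
        return pre == null ? "" : pre.wholeText();
    }

    public static void main(String[] args) {
        // Shortened stand-in for Currency.html (an assumption, not the full page).
        String html =
            "<p>Currency.ts</p>\n"
            + "<pre class=\"code lang-typescript\" data-lang=\"typescript\">"
            + "<span class=\"synStatement\">export</span> "
            + "<span class=\"synStatement\">type</span> Currency "
            + "<span class=\"synStatement\">=</span> "
            + "<span class=\"synIdentifier\">{</span>\n"
            + "  value: <span class=\"synType\">number</span>\n"
            + "<span class=\"synIdentifier\">}</span>\n"
            + "</pre>";

        Document doc = Jsoup.parse(html);
        // The file name lives in the <p> before the <pre>.
        System.out.println(doc.selectFirst("p").text());
        System.out.println();
        System.out.print(extractCode(html));
    }
}
```

Run against the shortened snippet, this should print the file name followed by the three code lines with their line breaks and two-space indentation intact. With the real file, the same extractCode call could be fed the document parsed from Currency.html instead of the inline string.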