
How to truncate HTML string, to be used as a "preview" version of the original?


Background

We allow the user to create some text that will get converted to HTML, using a rich-text editor library (called Android-RTEditor).

The output HTML text is saved as is on the server and the device.

Because in some cases we need to show many of these items at once, we also want to save a "preview" version of the content: a much shorter one (say, 120 visible characters; the characters that make up the HTML tags are not counted).

What we want is a minimized version of the HTML. Some tags may optionally be removed, but whatever we choose, we still want to see lists (numbered/bulleted), because list markers appear as text to the user (the bullet is a character, and so are the numbers with their dots).

The line-break tag should also be handled, since moving to the next line is important.

The problem

Unlike a normal string, where I can just call substring with the required number of characters, doing that on HTML might break the tags.

What I've tried

I've thought of 2 possible solutions for this:

  1. Convert to plain text (while handling some tags), and then truncate: Parse the HTML, replacing some tags with Unicode alternatives and removing the others. For example, instead of a bullet list, put a bullet character (maybe this), and similarly for a numbered list (put the numbers instead). All other tags would be removed. The same goes for the line-break tag ("<br/>"), which should be replaced with "\n". After that, I could safely truncate the plain text, because there are no more tags that could be broken.

  2. Truncate nicely inside the HTML: Parse the HTML while identifying the text within it, truncate the text there, and close all open tags on reaching the truncation position. This might be even harder.
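
As a rough illustration of the first approach, the sketch below flattens simple editor output to plain text and then truncates it. It rests on several assumptions and is not a tested solution: `htmlToPreviewText` and `decodeBasicEntities` are hypothetical names, tags are located with a regular expression rather than a real HTML parser, only a handful of entities are decoded, and newlines count towards the limit.

```kotlin
// Hypothetical helper: decodes only a few common entities ("&amp;" last, to avoid double decoding).
fun decodeBasicEntities(s: String): String =
    s.replace("&lt;", "<").replace("&gt;", ">").replace("&quot;", "\"")
        .replace("&nbsp;", " ").replace("&amp;", "&")

// Hypothetical helper: converts simple HTML to plain text, then keeps at most maxChars characters.
fun htmlToPreviewText(html: String, maxChars: Int): String {
    val tagRegex = Regex("<(/?)([a-zA-Z0-9]+)[^>]*>")
    val out = StringBuilder()
    // One entry per open list: -1 for <ul>, otherwise the next number to print for an <ol>.
    val listCounters = mutableListOf<Int>()
    var pos = 0
    for (match in tagRegex.findAll(html)) {
        // Copy the text that precedes this tag.
        out.append(decodeBasicEntities(html.substring(pos, match.range.first)))
        pos = match.range.last + 1
        val closing = match.groupValues[1] == "/"
        val name = match.groupValues[2].lowercase()
        if (name == "br") {
            if (!closing) out.append('\n')
        } else if (name == "ul" || name == "ol") {
            if (!closing) listCounters.add(if (name == "ul") -1 else 1)
            else if (listCounters.isNotEmpty()) listCounters.removeAt(listCounters.lastIndex)
        } else if (name == "li") {
            if (closing) {
                out.append('\n')
            } else {
                val counter = listCounters.lastOrNull() ?: -1
                if (counter < 0) {
                    out.append("• ")
                } else {
                    out.append(counter).append(". ")
                    listCounters[listCounters.lastIndex] = counter + 1
                }
            }
        }
        // Every other tag is simply dropped.
    }
    out.append(decodeBasicEntities(html.substring(pos)))
    return out.toString().trimEnd().take(maxChars)
}
```

With this sketch, `htmlToPreviewText("Aaa<br/><b>Bbb<br/></b>Ccc", 10)` yields `"Aaa\nBbb\nCc"`, because each newline uses up one of the ten characters.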

I'm not sure which is easier, but I can think of possible disadvantages for each. It is just a preview though, so I don't think it matters much.

I've searched the Internet for such solutions, to see whether others have done this. I've found some links that talk about "cleaning" or "optimizing" HTML, but I don't see that they can handle replacing tags or truncating. Moreover, since it's HTML, most are unrelated to Android and use PHP, C#, Angular and other technologies.

Here are some links that I've found:

The questions

  1. Are those solutions that I've written possible? If so, is there maybe a known way to implement them? Or even a Java/Kotlin/Android library? How hard would it be to make such a solution?

  2. Maybe another solution I haven't thought of?


EDIT: I've also tried using some old code I wrote in the past (here), which parses XML. Maybe it will work. I'm also now investigating third-party libraries for parsing HTML, such as Jsoup. I think it can help with the truncating, while supporting "faulty" HTML inputs.
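
For the second idea (truncating inside the markup), a lenient scan that does not require well-formed XML might look like the sketch below. It is only an illustration under stated assumptions: `truncateHtml` is a hypothetical name, tags are matched with a regular expression (so attribute values containing `>` would confuse it), the set of void tags is a guess at what the editor emits, and an entity such as `&amp;` is counted as five characters and could be cut in the middle.

```kotlin
// Hypothetical helper: keeps at most maxTextChars characters of text, copies tags
// through unchanged, and closes whatever is still open at the cut position.
fun truncateHtml(html: String, maxTextChars: Int): String {
    val tagRegex = Regex("<(/?)([a-zA-Z0-9]+)[^>]*?(/?)>")
    val voidTags = setOf("br", "hr", "img") // assumed list of tags that never get a closing tag
    val out = StringBuilder()
    val openTags = mutableListOf<String>()
    var remaining = maxTextChars
    var pos = 0
    for (match in tagRegex.findAll(html)) {
        // Copy as much of the text before this tag as the budget allows.
        val text = html.substring(pos, match.range.first)
        pos = match.range.last + 1
        out.append(text.take(remaining))
        remaining -= minOf(text.length, remaining)
        if (remaining <= 0) break // budget used up: ignore the rest of the input

        val name = match.groupValues[2].lowercase()
        val isClosing = match.groupValues[1] == "/"
        val isSelfClosing = match.groupValues[3] == "/" || name in voidTags
        if (isClosing) {
            // Forget the most recently opened tag with this name, if there is one.
            val index = openTags.lastIndexOf(name)
            if (index >= 0) openTags.removeAt(index)
        } else if (!isSelfClosing) {
            openTags.add(name)
        }
        out.append(match.value)
    }
    // Text after the last tag (nothing is added when the budget is already exhausted).
    if (remaining > 0) out.append(html.substring(pos).take(remaining))
    // Close the tags that are still open, innermost first.
    for (name in openTags.asReversed()) out.append("</").append(name).append(">")
    return out.toString()
}
```

For example, `truncateHtml("<b>Hello <i>world</i></b> again", 8)` returns `<b>Hello <i>wo</i></b>`. Unlike a strict XML parser, this scan accepts `<br>` without a slash and input without a single root tag, at the price of not validating anything.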


Solution

  • OK, I think I got it, using my old code for converting an XML string into an object. It would still be great to see more robust solutions, but I think what I got is good enough, at least for now.

    The code below uses it (original XmlTag class available here):

    XmlTagTruncationHelper.kt

    object XmlTagTruncationHelper {
        /**
         * @param maxTextCharacters max text characters to permit. If <0, there is no restriction
         * @param maxLines max lines to permit. If <0, there is no restriction
         */
        class Restriction(val maxTextCharacters: Int, val maxLines: Int) {
            var currentTextCharactersCount: Int = 0
            var currentLinesCount: Int = 0
        }
    
        @JvmStatic
        fun truncateXmlTag(xmlTag: XmlTag, restriction: Restriction): String {
            if (restriction.maxLines == 0 || (restriction.maxTextCharacters >= 0 && restriction.currentTextCharactersCount >= restriction.maxTextCharacters))
                return ""
            val sb = StringBuilder()
            sb.append("<").append(xmlTag.tagName)
            val numberOfAttributes = if (xmlTag.tagAttributes != null) xmlTag.tagAttributes!!.size else 0
            if (numberOfAttributes != 0)
                for ((key, value) in xmlTag.tagAttributes!!)
                    sb.append(" ").append(key).append("=\"").append(value).append("\"")
            val numberOfInnerContent = if (xmlTag.innerTagsAndContent != null) xmlTag.innerTagsAndContent!!.size else 0
            if (numberOfInnerContent == 0)
                sb.append("/>")
            else {
                sb.append(">")
                for (innerItem in xmlTag.innerTagsAndContent!!) {
                    if (restriction.maxTextCharacters >= 0 && restriction.currentTextCharactersCount >= restriction.maxTextCharacters)
                        break
                    if (innerItem is XmlTag) {
                        if (restriction.maxLines < 0)
                            sb.append(truncateXmlTag(innerItem, restriction))
                        else {
                            // Each <br> or <li> counts as one line against maxLines
                            if (innerItem.tagName == "br" || innerItem.tagName == "li") {
                                ++restriction.currentLinesCount
                                if (restriction.currentLinesCount >= restriction.maxLines)
                                    break
                            }
                            sb.append(truncateXmlTag(innerItem, restriction))
                        }
                    } else if (innerItem is String) {
                        if (restriction.maxTextCharacters < 0)
                            sb.append(innerItem)
                        else
                                if (restriction.currentTextCharactersCount < restriction.maxTextCharacters) {
                                    val extraCharactersAllowedToAdd = restriction.maxTextCharacters - restriction.currentTextCharactersCount
                                    // take() never reads past the end of the string, so no explicit min() is needed
                                    val strToAdd = innerItem.take(extraCharactersAllowedToAdd)
                                    if (strToAdd.isNotEmpty()) {
                                        sb.append(strToAdd)
                                        restriction.currentTextCharactersCount += strToAdd.length
                                    }
                                }
                    }
                }
                sb.append("</").append(xmlTag.tagName).append(">")
            }
            return sb.toString()
        }
    }
    
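
One caveat with the helper above: a pull parser normally hands back text and attribute values with entities already decoded, while the helper appends them to the output as they are. If the content can contain characters such as `&` or `<`, re-escaping them on output is probably needed. A minimal sketch, where `escapeXmlText` and `escapeXmlAttribute` are hypothetical helpers that are not part of the original classes:

```kotlin
// Hypothetical helper: escapes text content before it is appended between tags.
// "&" is replaced first so that the entities produced below are not escaped again.
fun escapeXmlText(text: String): String =
    text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

// Hypothetical helper: escapes a value that will be written inside double quotes.
fun escapeXmlAttribute(value: String): String =
    escapeXmlText(value).replace("\"", "&quot;")
```

Under that assumption, the helper would call `sb.append(escapeXmlText(strToAdd))` instead of `sb.append(strToAdd)`, and wrap attribute values in `escapeXmlAttribute` in the same way. The character count should still be taken from the unescaped string.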

    XmlTag.kt

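    //based on https://stackoverflow.com/a/19115036/878126
    import android.content.Context
    import org.xmlpull.v1.XmlPullParser
    import org.xmlpull.v1.XmlPullParserFactory
    import java.io.StringReader
    import java.util.Stack

    /**
     * An XML tag, including its name, attributes and inner content.
     * @param tagName the name of the XML tag. For example, in <a>b</a> the name of the tag is "a"
     */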
    class XmlTag(val tagName: String) {
        /** a hashmap of all of the tag attributes. example: <a c="d" e="f">b</a> . attributes: {{"c"="d"},{"e"="f"}}     */
        @JvmField
        var tagAttributes: HashMap<String, String>? = null
        /**list of inner text and xml tags*/
        @JvmField
        var innerTagsAndContent: ArrayList<Any>? = null
    
        companion object {
            @JvmStatic
            fun getXmlFromString(input: String): XmlTag? {
                val factory = XmlPullParserFactory.newInstance()
                factory.isNamespaceAware = true
                val xpp = factory.newPullParser()
                xpp.setInput(StringReader(input))
                return getXmlRootTagOfXmlPullParser(xpp)
            }
    
            @JvmStatic
            fun getXmlRootTagOfXmlPullParser(xmlParser: XmlPullParser): XmlTag? {
                var currentTag: XmlTag? = null
                var rootTag: XmlTag? = null
                val tagsStack = Stack<XmlTag>()
                xmlParser.next()
                var eventType = xmlParser.eventType
                var doneParsing = false
                while (eventType != XmlPullParser.END_DOCUMENT && !doneParsing) {
                    when (eventType) {
                        XmlPullParser.START_DOCUMENT -> {
                        }
                        XmlPullParser.START_TAG -> {
                            val xmlTagName = xmlParser.name
                            currentTag = XmlTag(xmlTagName)
                            if (tagsStack.isEmpty())
                                rootTag = currentTag
                            tagsStack.push(currentTag)
                            val numberOfAttributes = xmlParser.attributeCount
                            if (numberOfAttributes > 0) {
                                val attributes = HashMap<String, String>(numberOfAttributes)
                                for (i in 0 until numberOfAttributes) {
                                    val attrName = xmlParser.getAttributeName(i)
                                    val attrValue = xmlParser.getAttributeValue(i)
                                    attributes[attrName] = attrValue
                                }
                                currentTag.tagAttributes = attributes
                            }
                        }
                        XmlPullParser.END_TAG -> {
                            currentTag = tagsStack.pop()
                            if (!tagsStack.isEmpty()) {
                                val parentTag = tagsStack.peek()
                                parentTag.addInnerXmlTag(currentTag)
                                currentTag = parentTag
                            } else
                                doneParsing = true
                        }
                        XmlPullParser.TEXT -> {
                            val innerText = xmlParser.text
                            if (currentTag != null)
                                currentTag.addInnerText(innerText)
                        }
                    }
                    eventType = xmlParser.next()
                }
                return rootTag
            }
    
            /**returns the root xml tag of the given xml resourceId , or null if not succeeded . */
            fun getXmlRootTagOfXmlFileResourceId(context: Context, xmlFileResourceId: Int): XmlTag? {
                val res = context.resources
                val xmlParser = res.getXml(xmlFileResourceId)
                return getXmlRootTagOfXmlPullParser(xmlParser)
            }
        }
    
        private fun addInnerXmlTag(tag: XmlTag) {
            if (innerTagsAndContent == null)
                innerTagsAndContent = ArrayList()
            innerTagsAndContent!!.add(tag)
        }
    
        private fun addInnerText(str: String) {
            if (innerTagsAndContent == null)
                innerTagsAndContent = ArrayList()
            innerTagsAndContent!!.add(str)
        }
    
        /**formats the xmlTag back to its string format,including its inner tags     */
        override fun toString(): String {
            val sb = StringBuilder()
            sb.append("<").append(tagName)
            val numberOfAttributes = if (tagAttributes != null) tagAttributes!!.size else 0
            if (numberOfAttributes != 0)
                for ((key, value) in tagAttributes!!)
                    sb.append(" ").append(key).append("=\"").append(value).append("\"")
            val numberOfInnerContent = if (innerTagsAndContent != null) innerTagsAndContent!!.size else 0
            if (numberOfInnerContent == 0)
                sb.append("/>")
            else {
                sb.append(">")
                for (innerItem in innerTagsAndContent!!)
                    sb.append(innerItem.toString())
                sb.append("</").append(tagName).append(">")
            }
            return sb.toString()
        }
    
    }
    

    Sample usage:

    build.gradle

        compileOptions {
            sourceCompatibility JavaVersion.VERSION_1_8
            targetCompatibility JavaVersion.VERSION_1_8
        }
    
    ...
    dependencies {
        implementation 'com.1gravity:android-rteditor:1.6.7'
        ...
    }
    ...
    

    MainActivity.kt

    class MainActivity : AppCompatActivity() {
    
    
        override fun onCreate(savedInstanceState: Bundle?) {
            super.onCreate(savedInstanceState)
            setContentView(R.layout.activity_main)
    //        val inputXmlString = "<zz>Zhshs<br/>ABC</zz>"
            val inputXmlString = "Aaa<br/><b>Bbb<br/></b>Ccc<br/><ul><li>Ddd</li><li>eee</li></ul>fff<br/><ol><li>ggg</li><li>hhh</li></ol>"
    
            // XML must have a root tag
            val xmlString = if (!inputXmlString.startsWith("<"))
                "<html>$inputXmlString</html>" else inputXmlString
    
            val rtApi = RTApi(this, RTProxyImpl(this), RTMediaFactoryImpl(this, true))
            val mRTManager = RTManager(rtApi, savedInstanceState)
            mRTManager.registerEditor(beforeTruncationTextView, true)
            mRTManager.registerEditor(afterTruncationTextView, true)
            beforeTruncationTextView.setRichTextEditing(true, inputXmlString)
            val xmlTag = XmlTag.getXmlFromString(xmlString)
    
            Log.d("AppLog", "xml parsed: " + xmlTag.toString())
            val maxTextCharacters = 10
            val maxLines = 20
    
            val output = XmlTagTruncationHelper.truncateXmlTag(xmlTag!!, XmlTagTruncationHelper.Restriction(maxTextCharacters, maxLines))
            afterTruncationTextView.setRichTextEditing(true, output)
            Log.d("AppLog", "xml with truncation : maxTextCharacters: $maxTextCharacters , maxLines: $maxLines output: " + output)
        }
    }
    

    activity_main.xml

    <LinearLayout
        xmlns:android="http://schemas.android.com/apk/res/android" xmlns:app="http://schemas.android.com/apk/res-auto"
        xmlns:tools="http://schemas.android.com/tools" android:layout_width="match_parent"
        android:layout_height="match_parent" android:gravity="center" android:orientation="vertical"
        tools:context=".MainActivity">
    
        <com.onegravity.rteditor.RTEditText
            android:id="@+id/beforeTruncationTextView" android:layout_width="match_parent"
            android:layout_height="wrap_content" android:background="#11ff0000" tools:text="beforeTruncationTextView"/>
    
    
        <com.onegravity.rteditor.RTEditText
            android:id="@+id/afterTruncationTextView" android:layout_width="match_parent"
            android:layout_height="wrap_content" android:background="#1100ff00" tools:text="afterTruncationTextView"/>
    </LinearLayout>
    

    And the result:

    [Screenshot: the two RTEditText views, showing the content before and after truncation]