Tags: lucene, cluster-computing, liferay-6, ehcache, jgroups

Liferay 6.2 clustering issue with multicast


I am trying to cluster Ehcache and Lucene with a Liferay 6.2 EE SP2 bundle on 2 servers with multicast enabled. We have Apache HTTPD servers fronting the Tomcat servers as reverse proxies. A valid 6.2 license is deployed on both nodes.

We use the following properties in portal-ext.properties:

cluster.link.enabled=true
lucene.replicate.write=true
ehcache.cluster.link.replication.enabled=true

# Since we are using SSL on the frontend
web.server.protocol=https

# set this to any server that is visible to both the nodes
cluster.link.autodetect.address=dbserverip:dbport

#ports and ips we know work in our environment for multicast
multicast.group.address["cluster-link-control"]=ip
multicast.group.port["cluster-link-control"]=port1

multicast.group.address["cluster-link-udp"]=ip
multicast.group.port["cluster-link-udp"]=port2

multicast.group.address["cluster-link-mping"]=ip
multicast.group.port["cluster-link-mping"]=port3

multicast.group.address["hibernate"]=ip
multicast.group.port["hibernate"]=port4

multicast.group.address["multi-vm"]=ip
multicast.group.port["multi-vm"]=port5

We are running into issues with the Ehcache and Lucene clustering not working. The following tests fail:

  1. Moving a portlet on node 1 does not show up on node 2

There are no errors except for a startup error with Lucene:

14:19:35,771 ERROR [CLUSTER_EXECUTOR_CALLBACK_THREAD_POOL-1][LuceneHelperImpl:1186] Unable to load index for company 10157
com.liferay.portal.kernel.exception.SystemException: java.net.ConnectException: Connection refused
    at com.liferay.portal.search.lucene.LuceneHelperImpl.getLoadIndexesInputStreamFromCluster(LuceneHelperImpl.java:488)
    at com.liferay.portal.search.lucene.LuceneHelperImpl$LoadIndexClusterResponseCallback.callback(LuceneHelperImpl.java:1176)
    at com.liferay.portal.cluster.ClusterExecutorImpl$ClusterResponseCallbackJob.run(ClusterExecutorImpl.java:614)
    at com.liferay.portal.kernel.concurrent.ThreadPoolExecutor$WorkerTask._runTask(ThreadPoolExecutor.java:682)
    at com.liferay.portal.kernel.concurrent.ThreadPoolExecutor$WorkerTask.run(ThreadPoolExecutor.java:593)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:625)
    at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:160)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
    at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)

We verified that JGroups multicast works outside of Liferay by running the following commands with a downloaded copy of jgroups.jar, substituting each of the 5 multicast IPs and ports.

Testing with JGroups

1) McastReceiver -

java -cp ./jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.10.10.10 -port 5555

ex. java -cp jgroups-final.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.10.10.10 -port 5555

2) McastSender -

java -cp ./jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.10.10.10 -port 5555

ex. java -cp jgroups-final.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.10.10.10 -port 5555

From there, typing things into the McastSender will result in the Receiver printing it out.

Thanks!


Solution

  • After a lot of troubleshooting, and with help from various folks on my team and at Liferay support, we switched to unicast and it worked much better.

    Here is what we did:

    • Extracted jgroups.jar from tomcat home/webapps/ROOT/WEB-INF/lib and saved it locally.
    • Unzipped the jgroups.jar file, then extracted and saved the tcp.xml it contains.
    • As a baseline test, changed the TCPPING section in tcp.xml as follows and saved it (its place in the full file is sketched just below):

      <TCPPING timeout="3000"
               initial_hosts="${jgroups.tcpping.initial_hosts:servername1[7800],servername2[7800]}"
               port_range="1"
               num_initial_members="10"/>
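
    For orientation, here is a rough sketch of where that element sits in tcp.xml. Only the TCPPING attributes above come from our setup; the bind_port value and the surrounding protocol names are just illustrative of a stock JGroups TCP stack and were left as shipped:

      <config xmlns="urn:org:jgroups">
          <!-- TCP transport; bind_port is the port referenced as [7800] in initial_hosts -->
          <TCP bind_port="7800"/>
          <!-- Static unicast discovery: list every node of the cluster here -->
          <TCPPING timeout="3000"
                   initial_hosts="${jgroups.tcpping.initial_hosts:servername1[7800],servername2[7800]}"
                   port_range="1"
                   num_initial_members="10"/>
          <!-- MERGE2, FD_SOCK, pbcast.NAKACK, pbcast.GMS, ... as shipped -->
      </config>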

    • Copy the tcp.xml to the Liferay home on both nodes

    • Change the portal-ext.properties to remove the multicast properties and add the following lines.

      cluster.link.channel.properties.control=${liferay.home}/tcp.xml
      cluster.link.channel.properties.transport.0=${liferay.home}/tcp.xml

    • Start node 1

    • Start node 2

    • Check the logs

    • Do the cluster cache test:

    • Moving a portlet on node 1 shows up on node 2.

    • Under Control Panel -> License Manager, both nodes show up with valid licenses.

    • Searching on node 2 for a user added on node 1 under Control Panel -> Users and Organizations works.

    All of the above tests worked.

    So we shut down the servers and changed the tcp.xml to use JDBC_PING rather than TCPPING, so that we don't have to specify node names manually.

    Steps for the JDBC_PING config:

    1. Create the table in the Liferay database manually:

      CREATE TABLE JGROUPSPING (own_addr varchar(200) not null, cluster_name varchar(200) not null, ping_data blob default null, primary key (own_addr, cluster_name))

    2. Change tcp.xml: remove the TCPPING section and add the following (the surrounding stack is sketched after these steps).

      <JDBC_PING datasource_jndi_name="java:comp/env/jdbc/LiferayPool" initialize_sql="" />

    3. Save and push the file manually to both nodes.
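
    For reference, here is a minimal sketch of the discovery part of tcp.xml after the swap. Only the JDBC_PING element and the JNDI name come from the steps above; the TCP element is shown purely for orientation and the rest of the stack stays as shipped. My understanding is that the empty initialize_sql tells JGroups not to try to create the table itself, since it was created manually in step 1:

      <TCP bind_port="7800"/>
      <!-- Database-backed discovery: each node writes its own address into the
           JGROUPSPING table and reads the other rows to find cluster members -->
      <JDBC_PING datasource_jndi_name="java:comp/env/jdbc/LiferayPool"
                 initialize_sql=""/>
      <!-- MERGE2, FD_SOCK, pbcast.NAKACK, pbcast.GMS, ... unchanged -->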

    Start the servers and repeat the tests above.

    It should work seamlessly.

    It was invaluable to have the debug logging for JGroups turned on, as mentioned in the following post:

    https://bitsofinfo.wordpress.com/2014/05/21/clustering-liferay-globally-across-data-centers-gslb-with-jgroups-and-relay2/

    Here is the tomcat home/webapps/ROOT/WEB-INF/classes/META-INF/portal-log4j-ext.xml file I used to triage various clustering-related issues on bootup (a possible extra category for raw JGroups logging is sketched after it):
    
    <?xml version="1.0"?>
    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
    
    <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
    
    <category name="com.liferay.portal.cluster">
    <priority value="TRACE" />
    </category>
    
    <category name="com.liferay.portal.license">
    <priority value="TRACE" />
    </category>

    </log4j:configuration>
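
    If you also want the raw JGroups protocol logging that the post above covers, a category along these lines can be added inside the same log4j:configuration element (this category is my addition, not part of the original setup, and TRACE is very verbose):

    <category name="org.jgroups">
    <priority value="TRACE" />
    </category>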
    

    We also found that the Lucene cluster replication startup errors were addressed in a fix pack, and we are getting a patch for it.

    https://issues.liferay.com/browse/LPS-51714

    https://issues.liferay.com/browse/LPS-51428

    We added the following portal instance properties so that Lucene replication works better between the 2 nodes:

      # Port that the app servers listen on, e.g. 8080
      portal.instance.http.port=8080
      portal.instance.protocol=http
    

    Hope this helps someone.

    Update

    The Lucene index load in a cluster issue was resolved by a Liferay 6.2 EE patch from support for the LPS issues mentioned above.