Implementing Distributed Cache in Spring Boot using Hazelcast

Overview

  • A brief summary of how to implement Hazelcast, a well-known open-source in-memory data grid and distributed cache, as an Embedded Cache in a Spring Boot project.

Adding Library Dependencies

  • To use Hazelcast in the project, add the following to the build.gradle.kts in the project root.
dependencies {
    implementation("com.hazelcast:hazelcast:5.3.7")

    // To use Smile for object serialization in distributed cache, add the following
    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-smile:2.17.0")
}

Writing the @Configuration Class

  • Create the HazelcastConfig class as follows to define the HazelcastInstance bean necessary for using Hazelcast.
import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.databind.MapperFeature
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.SerializationFeature
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule
import com.hazelcast.config.Config
import com.hazelcast.core.Hazelcast
import com.hazelcast.core.HazelcastInstance
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.http.converter.json.Jackson2ObjectMapperBuilder

@Configuration
class HazelcastConfig {

    @Bean
    fun hazelcastInstance(): HazelcastInstance {
        return Hazelcast.newHazelcastInstance(
            Config().apply {
                this.clusterName = "foo-dev"
            }
        )
    }

    // To use Smile for object serialization in distributed cache, add the following
    @Bean("smileObjectMapper")
    fun smileObjectMapper(): ObjectMapper {

        return Jackson2ObjectMapperBuilder
            .smile()
            .serializationInclusion(JsonInclude.Include.ALWAYS)
            .failOnEmptyBeans(false)
            .failOnUnknownProperties(false)
            .featuresToEnable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS)
            .featuresToDisable(MapperFeature.USE_ANNOTATIONS)
            .modulesToInstall(JavaTimeModule())
            .build()
    }
}

Writing the @Configuration Class for Amazon ECS Environment

  • In the Amazon ECS environment, additional configuration is necessary because UDP multicast for node discovery is not supported. First, ensure that the Task Role and Task Execution Role designated in the ECS task definition have the following IAM permissions:
ecs:ListTasks
ecs:DescribeTasks
ec2:DescribeNetworkInterfaces
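As a sketch, the permissions above could be granted with an inline policy like the following (the resource scope is an assumption; Hazelcast's AWS discovery only needs read access):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:ListTasks",
        "ecs:DescribeTasks",
        "ec2:DescribeNetworkInterfaces"
      ],
      "Resource": "*"
    }
  ]
}
```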
  • Next, create the HazelcastInstance bean as follows:
    @Bean
    fun hazelcastInstance(): HazelcastInstance {
        return Hazelcast.newHazelcastInstance(
            Config().apply {
                this.clusterName = "foo-dev"
                this.networkConfig.join.multicastConfig.isEnabled = false
                this.networkConfig.join.awsConfig.apply {
                    isEnabled = true
                    isUsePublicIp = false
                    setProperty("connection-retries", "2")
                    setProperty("connection-timeout-seconds", "3")                    

                    // Enter the actual service name of the ECS instance
                    setProperty("service-name", "foo-dev")
                }
                this.networkConfig.interfaces
                    .setEnabled(true)

                    // Enter the actual VPC CIDR block of the ECS instance
                    .addInterface("10.0.*.*")
            }
        )
    }
  • Lastly, add firewall rules. Unless you have pinned the port explicitly, open the inbound TCP port range 5701-5801, which covers Hazelcast's default port (5701) and its auto-increment range.
# Open the inbound range of TCP/5701-5801 ports
# In production, consider narrowing the CIDR ranges below to your VPC instead of 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress \
     --group-id {security-group-id} \
     --ip-permissions '[{"IpProtocol": "tcp", "FromPort": 5701, "ToPort": 5801, "Ipv6Ranges": [{"CidrIpv6": "::/0"}], "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]'
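To confirm the rule was applied, the security group can be inspected with the same placeholder group id:

```shell
# Verify the inbound rules of the security group
$ aws ec2 describe-security-groups --group-ids {security-group-id}
```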

Writing Distributed Cache Health Check API

  • A health check API that monitors the current distributed cache status in real time can be greatly beneficial for troubleshooting at the production level. Create it as follows.
import com.hazelcast.core.HazelcastInstance
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.*

@RestController
class SystemCacheController(
    private val hazelcastInstance: HazelcastInstance
) {
    @GetMapping("/cache-state")
    fun getCacheState(): ResponseEntity<Any> {

        return ResponseEntity.ok(
            mapOf(
                "clusterState" to hazelcastInstance.cluster.clusterState,
                "memberSize" to hazelcastInstance.cluster.members.size,
                "members" to hazelcastInstance.cluster.members.map {
                    mapOf(
                        "host" to it.address.host,
                        "port" to it.address.port,
                        "isIpv4" to it.address.isIPv4,
                        "isIpv6" to it.address.isIPv6,
                        "isLocalMember" to it.localMember(),
                        "isLiteMember" to it.isLiteMember
                    )
                }
            )
        )
    }
}
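With an instance running, the endpoint above can be invoked with curl in the same way as the other examples (the port is assumed to be 8080):

```shell
# Query the current distributed cache status
$ curl -i -X GET 'http://localhost:8080/cache-state'
```

In a healthy cluster, clusterState is ACTIVE and memberSize equals the number of running instances.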

Writing the FooDTO Data Class

  • Write the FooDTO class to be used as an example for storing data in the distributed cache as follows.
import java.io.Serializable
import java.time.Instant

data class FooDTO(

    var id: Long = 1,
    var name: String? = null,
    var createdAt: Instant = Instant.now()

) : Serializable

Writing the FooController Controller Class

  • Create a controller to store and retrieve the FooDTO class in the distributed cache as follows.
import com.fasterxml.jackson.databind.ObjectMapper
import com.hazelcast.core.HazelcastInstance
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.*
import java.util.concurrent.TimeUnit

@RestController
class FooController(
    private val hazelcastInstance: HazelcastInstance,
    private val smileObjectMapper: ObjectMapper
) {
    @GetMapping("/foos/{id}")
    fun getFoo(@PathVariable id: Long): ResponseEntity<Any> {

        val fooMap = hazelcastInstance.getMap<Long, ByteArray>("foo-map")

        return fooMap[id]
            ?.let { ResponseEntity.ok(smileObjectMapper.readValue(it, FooDTO::class.java)) }
            ?: ResponseEntity.notFound().build()
    }

    @PostMapping("/foos")
    fun setFoo(@RequestBody foo: FooDTO): ResponseEntity<Any> {

        val fooMap = hazelcastInstance.getMap<Long, ByteArray>("foo-map")

        // Store the serialized object with a 2-hour TTL
        fooMap.set(foo.id, smileObjectMapper.writeValueAsBytes(foo), 2, TimeUnit.HOURS)

        return ResponseEntity.noContent().build()
    }
}

Running the Application

  • To test the written distributed cache, run the application with two instances.
# Running the first instance
$ SERVER_PORT=8080 ./gradlew bootRun

# Running the second instance
$ SERVER_PORT=8081 ./gradlew bootRun

Testing Distributed Cache Operation

  • Test the operation of the distributed cache by storing the cache in the first instance and querying it in the second instance.
# Storing the cache in the first instance
$ curl -i -X POST -H "Content-Type:application/json" -d '
{
  "id": 1,
  "name": "One"
}
' 'http://localhost:8080/foos'

# Querying the cache in the second instance
$ curl -i -X GET 'http://localhost:8081/foos/1'
{
    "id": 1,
    "name": "One",
    "createdAt": "2023-04-04T14:34:38.091031743Z"
}

Reliable Topic: Distributed PUB-SUB Messaging Model

  • Topic and Reliable Topic are distributed PUB-SUB messaging models provided by Hazelcast.

  • The difference between Topic and Reliable Topic lies in reliability. A Topic publishes in a fire-and-forget manner without confirming receipt, while a Reliable Topic guarantees that every subscribed member receives the message. In addition, when multiple nodes publish concurrently, a Topic does not guarantee message arrival order, whereas a Reliable Topic does so strictly. This is possible because a Reliable Topic is backed internally by a Ringbuffer, which incurs additional overhead.

  • I frequently use Reliable Topic to broadcast notifications in real time, in order and without loss, to the many end users browsing the service.

  • To use Reliable Topic, one must first implement the MessageListener interface. Below is an example of PUB-SUB using a FooDTO object.

import com.hazelcast.core.HazelcastInstance
import com.hazelcast.topic.Message
import com.hazelcast.topic.MessageListener
import org.springframework.context.annotation.Lazy
import org.springframework.stereotype.Service

@Service
class FooService(
    // @Lazy avoids a circular dependency, since the HazelcastInstance bean
    // is created by a configuration class that itself depends on this service
    @Lazy private val hazelcastInstance: HazelcastInstance
) : MessageListener<FooDTO> {

    // [1] Synchronously publish the message.
    fun publishMessage(message: FooDTO) {
        hazelcastInstance.getReliableTopic<FooDTO>("foo-topic").publish(message)
    }

    // [1] Asynchronously publish the message.
    fun publishMessageAsync(message: FooDTO) {
        hazelcastInstance.getReliableTopic<FooDTO>("foo-topic").publishAsync(message)
    }

    // [2] Receive the published message.
    override fun onMessage(message: Message<FooDTO>) {
        println(message.publishingMember)
        println(message.publishTime)
        println(message.messageObject)
    }
}
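The Ringbuffer-backed behavior can be tuned when building the Config. The sketch below (the values are illustrative, not recommendations) adjusts the listener's read batch size and pins the overload policy, which controls what a publisher does when the backing Ringbuffer is full:

```kotlin
import com.hazelcast.config.Config
import com.hazelcast.config.ReliableTopicConfig
import com.hazelcast.topic.TopicOverloadPolicy

// Illustrative tuning for the "foo-topic" Reliable Topic
val config = Config().apply {
    clusterName = "foo-dev"
    addReliableTopicConfig(
        ReliableTopicConfig("foo-topic").apply {
            // Number of messages a listener reads from the Ringbuffer per batch
            readBatchSize = 25
            // BLOCK (the default) makes publishers wait when the backing
            // Ringbuffer is full, instead of dropping messages
            topicOverloadPolicy = TopicOverloadPolicy.BLOCK
        }
    )
}
```

Besides BLOCK, the alternatives are DISCARD_OLDEST, DISCARD_NEWEST, and ERROR.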
  • For a published message to be delivered, there must be subscribed member nodes. You can subscribe to the topic immediately after creating the HazelcastInstance bean, as shown below.
@Configuration
class HazelcastConfig(
    private val fooService: FooService
) {
    @Bean
    fun hazelcastInstance(): HazelcastInstance {
        return Hazelcast.newHazelcastInstance(Config().apply { this.clusterName = "foo-dev" })
            .also { cache -> cache.getReliableTopic<FooDTO>("foo-topic").addMessageListener(fooService) }
    }
}

Graceful Shutdown

  • Since Hazelcast stores data in the JVM heap memory area, the Graceful Shutdown of instances that are terminated during a new application deployment is very important. In most deployment environments, a SIGTERM signal sent to the terminating instance naturally triggers the Graceful Shutdown logic, ensuring that the instance's partitions migrate without interruption and the member leaves the cluster correctly.

  • In projects based on Spring Boot and using Hazelcast in Embedded mode, as in this example, enabling the Graceful Shutdown option in Spring Boot and disabling it in Hazelcast completes the Graceful Shutdown setup. Activating both options simultaneously may cause Hazelcast's Graceful Shutdown to run twice, potentially leading to unexpected behavior.

# [1] Enable Spring Boot's Graceful Shutdown option using environment variables
SERVER_SHUTDOWN=graceful
SPRING_LIFECYCLE_TIMEOUT_PER_SHUTDOWN_PHASE=30s

# [2] Disable Hazelcast's Graceful Shutdown option using JVM parameters
# The SLF4J logging type option must be enabled so that Graceful Shutdown logs are printed properly
-Dhazelcast.shutdownhook.enabled=false -Dhazelcast.logging.type=slf4j -Dlogging.level.com.hazelcast=INFO
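The environment variables in [1] correspond to standard Spring Boot properties, so the same settings can be kept in application.yml instead:

```yaml
server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
```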
  • If the application is terminated with a SIGKILL signal instead of SIGTERM, no shutdown hooks run and Hazelcast is given no time to prepare, so partition migration can be cut off unexpectedly. In such environments, explicitly execute Graceful Shutdown at the application level before termination to complete partition migration and leave the cluster safely.
// Explicitly execute Graceful Shutdown for cluster members
hazelcastInstance.shutdown()
  • When Graceful Shutdown is executed, the following logs can be observed in the console.
[hz.zealous_keldysh.async.thread-2] INFO  c.hazelcast.core.LifecycleService - [192.168.0.2]:5701 [foo-dev] [5.2.3] [192.168.0.2]:5701 is SHUTTING_DOWN
[hz.charming_montalcini.migration] INFO  c.h.i.p.impl.MigrationManager - [192.168.0.2]:5701 [foo-dev] [5.2.3] Repartitioning cluster data. Migration tasks count: 204
[hz.charming_montalcini.migration] INFO  c.h.i.p.impl.MigrationManager - [192.168.0.2]:5701 [foo-dev] [5.2.3] All migration tasks have been completed. (repartitionTime=Fri May 19 01:56:06 UTC 2023, plannedMigrations=204, completedMigrations=204, remainingMigrations=0, totalCompletedMigrations=1425)
[hz.zealous_keldysh.async.thread-2] INFO  com.hazelcast.instance.impl.Node - [192.168.0.2]:5701 [foo-dev] [5.2.3] Shutting down multicast service...
[hz.zealous_keldysh.async.thread-2] INFO  com.hazelcast.instance.impl.Node - [192.168.0.2]:5701 [foo-dev] [5.2.3] Shutting down connection manager...
[hz.zealous_keldysh.async.thread-2] INFO  c.h.instance.impl.NodeExtension - [192.168.0.2]:5701 [foo-dev] [5.2.3] Destroying node NodeExtension.
[hz.zealous_keldysh.async.thread-2] INFO  com.hazelcast.instance.impl.Node - [192.168.0.2]:5701 [foo-dev] [5.2.3] Hazelcast Shutdown is completed in 99 ms.
[hz.zealous_keldysh.async.thread-2] INFO  c.hazelcast.core.LifecycleService - [192.168.0.2]:5701 [foo-dev] [5.2.3] [192.168.0.2]:5701 is SHUTDOWN

Rolling Member Upgrade is Only Supported in Paid Version

  • If you upgrade the Hazelcast version and run the application, it will fail to start because it cannot join the existing cluster running on the older version, as shown in the logs below. It is crucial to check this before introducing Hazelcast into a production environment. This limitation exists in the free open-source version; purchasing the enterprise version enables the Rolling Member Upgrade feature.
[hz.nervous_hodgkin.generic-operation.thread-1] ERROR com.hazelcast.security - [192.168.0.2]:5701 [foo-dev] [5.3.0] Node could not join cluster. Before join check failed node is going to shutdown now!
[hz.nervous_hodgkin.generic-operation.thread-1] ERROR com.hazelcast.security - [192.168.0.2]:5701 [foo-dev] [5.3.0] Reason of failure for node join: Joining node's version 5.3.0 is not compatible with cluster version 5.2 (Rolling Member Upgrades are only supported in Hazelcast Enterprise)

After Applying to Production

  • After applying this to production, log monitoring over 8 months in a three-member cluster showed that reads from a Map used as an object cache took 0ms in 53.3% of cases and 1ms in 46.4% (the remaining 0.3% took longer). Writes took 1ms in 96.3% of cases and 2ms in 2.29% (the remaining 1.41% took longer). This is excellent performance, as expected of an in-memory store.
