Memory management and data persistence mechanism in Exlibris ecommerce platform and bookstore

September, 2020
Share It

How to persist data if there is no “visible database”

Hey developers have you ever wondered how to make monolith structure WEB application to hold data reliably without much maintenance effort when DB is far far away(in a Galaxy :) )

To get started right away at the core of this problem let’s take a look of small maybe not that small Web Desktop application I’m working past 3 years. It is ecommerce bookshelf app. During 90s all of us can remember those book and video stores where we stay in line and in front of book & cd store for hours, waiting new extension of a video game or blockbuster movie Terminator or Star Wars etc. ok maybe someone bought or rented a book (writer of this is comic-video-game-guy but also great appreciator of Russian and Green Antique classic books)

During that time in the country I lived in, Digital Piracy was State industry and we all rent or get our CDs “burnt” right in this kind of stores. In new epoch 2000s came with faster more reliable Internet so all these stores went online. There our story starts. Exlibris is a ecommerce platform for selling books and all other “digital stuff”. No need to wait in front of the store (maybe just for new Iphone :) ), no need for renting content all of this now is on one-click away. Yes VOD came and online shops. All that content from 1000 lines of bookshelves now is in some databases, in online data storage centers. There comes up our problem. We are developing application where there are too many vendors and providers of content books, office material, digital content, mp3, audio CDs, videos, DVDs various SD, HD, full HD, 4K formats, pc and console games. So for the protection of this content and also policy of the company we are working on project, have rigorous measures for content firewall privacy and asset management. So how you get all this “stuff”, that probably can’t be in your databases. Well OK put the Soap or Rest API towards DBs problem solved…well not. Maybe for the main content but for our application internals it is not that simple. So OK let’s put everything in property files. Well flat file approach is maybe good for student project with 1000 lines of code but not for the live system that changes each 2 weeks sprint. So you come up with setup of MySQL database and fix it. Yes that could be our approach but each DB requires DB administrator (fearsome person in developers circles) which knows foreign and surrogate keys to all tables, by heart and always holds the password for rollback procedure if someone blows DB from unexperienced developers (for example writing new entry every time action for save address is called) don’t think it is stupid developer nooo it is just less experienced one, we all came from the same level of experience so don’t judge them. So to make it clear you have to have small datastore on your side with at least or preferably none maintenance and also “not much required knowledge” to use (SQL is separate concept of language for accessing databases that requires additional effort for “of-the-shelf-0-experience-student-developer to appropriately put in use). So straight to the point. LET’S USE IN-MEMORY DATABASE…All right problem solved.

What is NoSQL DB?

NoSQL database –A document is roughly synonymous with a record in a relational database, and a collection is roughly a table.

Another popular type of NoSQL database is a key-value store. It is much like it sounds, storing key value pairs in a very flat manner. This can be likened to a Java Map<String, String>, though some key-value stores are multivalue and can store data more like a Map<String, List<String>>. Some popular key-value stores include Apache Cassandra, Freebase, Amazon DynamoDB, memcache, Redis, Apache River, Couchbase, and MongoDB. (Yes, some document databases double as key-value stores.) Ok maybe use Graph database….mmm we have strong relationships in app Product can be everything from office paper and toner to DVD hmmm…but let see what are those DBs.

Graph databases are NoSQL databases that focus on object relationships. In a graph database objects have attributes (properties), objects have relationships to other objects, and those relationships have attributes. Data is stored and represented as a graph, with relationships between entities existing naturally, not in the contrived manner created with foreign keys. This makes it easy, for example, to solve the degrees of separation problem, finding relationships between entities that might not have been realized at insertion time. Perhaps the most popular graph database is Neo4j.

Though Neo4j is written in Java and despite its name, it can work on any platform. It is, however, very well suited for Java applications. Well now we are even more confused. “Deegrees of separation” was the keyword where I lost my attention while writing this probably many of you switched to watching new Youtube video right now…So basically we don’t need that complexity because we are not doing search operations, greedy regex matching of data, filtering, all of this is done at our partners ERP system so don’t worry, we need to store just the internals of our application to speed up startup time and process time of server requests..Uf good so we need simple data store that act like DB but it isn’t. So close to the final solution we are (writing in style of Joda the Great from Star Wars) keep up with (Kardashian’s) me.

What to use? A NO-SQL database, or maybe Graph database or maybe combination well Cassandra? Difficult question. The right question is what we need to store. Basically we need fast datastore that is primarily used to speed up application. Hardware engineers will come up with this term…you need cache. Right, but some kind of clever cache that is below its representation still a cache but then acts as in-memory database. Finally anticipated wait resulted in solution…we injected EHCache free open source Java datastore with good performance and least maintenance effort because servers are in Zurich at some third party firm because of that, maintenance is costly.

To get to the subject of this blog. First to present our “small” application.

Application is built upon Java Spring Framework with a lot of other integrated modules. The whole overview and data flow in application is not subject of this post. So just to focus on EhCache. We have couple of nodes with the cache structures. So the first problem that arises is, how to synchronize this data.

When user opens a product on one server quickly switches back to another product is not there anymore.

(left – product found: exlibris.ch site, right – product missing: exlibris.ch site)

Likely EhCache comes up with something called Cache-Replication Mechanism. Easy-to-use feature just turn it on in xml configuration :)

RMICacheReplicationFactory does the magic…but does it?

RMI Cache Replicator stands for Remote Method Invocation it is for newbies an old but reliable java protocol that enables calling remote method on a different machine in network, to finish some task…

Because of the nature of this protocol replication of data works fine, only if there is low latency between nodes in network and no firewall boundaries….usually in real life environment that is not the case.

To see the magnitude of the problem we can take a look at the cache overview monitoring page in our application. In this table all caches are listed with their parameters as sizes in memory on disk, TTL and other times, hits statistics and so on.

The list scrolls down almost indefinitely so…we need some reliable way of telling, does this mechanism works on each node at each particular timeframe, in order to reset cache and keep the data integrity always in check. So how to do this?

To get with the smart solution we need to check how those EhCache replicators work.

Heartbeat

EhCache RMI replication

Relies on node discovery network level multicast protocol, which is adapted and called heartbeat protocol. In new environments routers block this multicast and make it difficult to pass firewalls. We need to implement some algorithm which checks this vital cache feature health status in order to keep data consistent in our app.

Cache replication mechanism

Automatic Peer Discovery

Automatic discovery uses TCP multicast to establish and maintain a multicast group. It features minimal configuration and automatic addition to and deletion of members from the group. No a priori knowledge of the servers in the cluster is required. This is recommended as the default option. Peers send heartbeats to the group once per second. If a peer has not been heard of for 5 seconds it is dropped from the group. If a new peer starts sending heartbeats it is admitted to the group.

Any cache within the configuration set up as replicated will be made available for discovery by other peers.

Multicast Blocking

The automatic peer discovery process relies on multicast. Multicast can be blocked by routers. Virtualisation technologies like Xen and VMWare may be blocking multicast. If so enable it. You may also need to turn it on in the configuration for your network interface card. An easy way to tell if your multicast is getting through is to use the Ehcache remote debugger and watch for the heartbeat packets to arrive.

Multicast Not Propagating Far Enough or Propagating Too Far

You can control how far the multicast packets propagate by setting the badly misnamed time to live. Using the multicast IP protocol, the timeToLive value indicates the scope or range in which a packet may be forwarded.

Algorithm

Problem is then clear? We need to monitor this replication process and alarm network administrating agency if blockage occurs in network during these multicasts.

For the purpose of monitoring cache replication we created dashboard restfull view inside existing cacheResetWS application which is deployed on each of the server nodes.

Dummy cache instance rmiReplicationCheckerTimestampCache which has only one element is created, for the purposes of testing.

RMIReplicationChecker dashboard is added.

It uses small algorithm that controls values inside this cache during put and read operations. It puts value inside cache, then reads from other server node’s caches. It traverses all adjacent nodes, and if all timestamps on all server nodes are the same, then status OK is returned. Also logging is added and full description of steps that execute during test. Number of test are equal to number of server nodes, and timestamp should be the same inside each test. If test passes, OK is returned, otherwise NOK and complete reason of failure is logged and displayed in this small dashboard to user.
Algorithm steps:
1. Initialization phase – Each of the nodes create special cache element called rmiReplicationCheckerTimestampCache which has only one Long value for timestamp and stores it to its cache.
2. Write/Read timestamp phase – Algorithm traverses nodes and each node stores timestamp in its cache, sleeps for one second for replicators to work, then reads timestamp from adjacent nodes.
3. Checks timestamps from others and compares with its timestamp and if timestamps differs than algorithm uses short-circuit form to break the loop and return bad timestamp meaning replication is not working replication message, and in case it is valid, then it traverses to another node and repeats process again.
4. If all timestamps match then full-duplex communication is established and cache replication is working between nodes.

5. There is HTML and machine JSON response, returned back to user.
First one is used for humans other one for automatic health checking mechanism which alerts all parts of the system for network instability issue:
JSON response:

{
“ServerName: “exl-25”,
“ReplicationStatus”: “OK”
}

Dashboards:

Basically there are 3 levels:
Critical (NOK) Warning (COMM_ERROR) and OK (OK meaning replication is correctly working)

Algorithm written in Java

Rest Method

Algorithm

Enough of maths magic, programming practices and algorithm processing. What’ve we achieved here?

Data integrity, consistency and isolation without need of complex database solutions, additional maintenance effort and robust complexity.

What if links to the database connection are down and caches are empty – no data. This is outplayed by clever system of cache resetting and cleaning. This happens only in case data is corrupt assuring that basic application stability is reached even internet is dead.

How this is done, you wonder? (Joda style saying :))
Well this is my first blog article and Word says it has 2570 words :D until now, too much. Something should be left for other blog articles…

Stay safe and tuned to good software engineering practices.

Yours faithfully,
One engineer
Stefan.

Software engineer,
Enjoying.

Author: Stefan Dželadinović, Software Engineer at enjoy.ing

Check out open postitions at enjoy.ing: https://www.enjoying.rs/open-positions/

Previous Next