Proactive Monitoring

This is a knowledge-sharing article. Whether or not you already know which tools are available today, I would like to share what we did.

Motivation

As customer implementations grow and production environments run for longer, customers are likely to ask for something called “proactive monitoring”.

We tried to implement processes and set up the tools available today to fulfill this need.

Our main use case is that in the past we had QA and prod partitions go down. In some cases it was not a one-time event where the partition suddenly went down. When analyzing the root cause, we found that memory load, CPU load and failing jobs had already signaled that something was wrong before the event, so we could potentially have seen the problem ahead of time.

First of all, to make proactive monitoring possible, the customer decided to hire a DSE (dedicated support engineer), a support engineer dedicated specifically to this customer. He is responsible for solving all of this customer’s support tickets and regularly uses the proactive monitoring tools.

Setting goals

Below is the list of goals we defined at the beginning. I am not going to cover every topic on the list, only those that might bring the most benefit to the customer.

[1] Error notifications from partitions

Set up error notifications in the platform.
Check warnings and errors and treat them either by creating SE JIRA tickets or, if the issue looks environment-related, by starting an analysis to understand it.

[2] Resources monitoring

Monitor resource usage (memory, CPU, database load, database response times) across all partitions and proactively analyze peaks and overall performance degradation before and after releases (pfx upgrade or customer-layer business release).
Monitor and compare AWS pod utilization and notify us when usage approaches 80%, which means one of two things: raise the AWS pod memory setting or make the logic more efficient.
Monitor the volume of business data in the DM, DS, PX and CX tables and tell us when we are approaching the limits of the PFX solution. More generally, keep these KPIs for the customer: how the data volume and application usage change over time, whether we need to set up regular data backups, and so on.
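The 80% check on pod utilization can be illustrated with a minimal sketch. It assumes the memory figures have already been exported from the monitoring stack; the `PodSample` structure, pod names and numbers are hypothetical and only serve the example.

```python
from dataclasses import dataclass

# Warning threshold taken from the goal above: 80% of the configured pod memory.
USAGE_THRESHOLD = 0.80


@dataclass
class PodSample:
    pod: str          # hypothetical pod name
    used_mib: float   # memory currently in use
    limit_mib: float  # configured pod memory limit


def pods_near_limit(samples, threshold=USAGE_THRESHOLD):
    """Return (pod, utilization) pairs at or above the threshold, highest first."""
    flagged = [
        (s.pod, s.used_mib / s.limit_mib)
        for s in samples
        if s.limit_mib > 0 and s.used_mib / s.limit_mib >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative numbers only.
samples = [
    PodSample("prod-app-0", used_mib=6800, limit_mib=8192),
    PodSample("prod-app-1", used_mib=4100, limit_mib=8192),
    PodSample("qa-app-0", used_mib=3900, limit_mib=4096),
]

for pod, utilization in pods_near_limit(samples):
    print(f"{pod}: {utilization:.0%} of memory limit in use")
```

A flagged pod leads to the same two options named above: raise the memory setting or make the logic more efficient.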

[3] Review aws / partition setup

Take a close look at the AWS cluster setup so that all environments have the same settings, and watch over new partitions to ensure customer-specific settings are applied (this was the root cause of most AWS-related critical tickets in the past).
Analyze the database indexes and give us an overview of which additional indexes have been added to the different databases, so that we can unify the setup across all environments.
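The index comparison can be sketched as a simple set difference. This assumes the index list of each environment has already been exported into a set of (table, index) pairs; the environment, table and index names below are hypothetical.

```python
def missing_indexes(indexes_by_env):
    """Map each environment to the indexes that exist elsewhere but not there."""
    all_indexes = set().union(*indexes_by_env.values())
    return {
        env: sorted(all_indexes - present)
        for env, present in indexes_by_env.items()
        if all_indexes - present
    }


# Hypothetical export: environment -> set of (table, index) pairs.
indexes_by_env = {
    "dev": {("px_table", "idx_sku")},
    "qa": {("px_table", "idx_sku"), ("px_table", "idx_sku_date")},
    "prod": {
        ("px_table", "idx_sku"),
        ("px_table", "idx_sku_date"),
        ("cx_table", "idx_customer"),
    },
}

for env, missing in missing_indexes(indexes_by_env).items():
    print(env, "is missing", missing)
```

Environments that already carry every known index are left out of the result, so an empty result means the setup is unified.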

Discovery and Development phase

JST errors and notifications

Platform Manager currently has a feature to set up a watchdog on the JST table. We can watch for ERROR JST jobs or for a certain subset of jobs. More information on how to set up a notification in Platform Manager is here:
https://pricefx.atlassian.net/wiki/spaces/PM/pages/5271726715
We created a webhook into Teams that aggregates all JST ERRORs from a given partition every day.
- We then ran a 14-day analysis of every issue; most were performance issues connected with the customer solution layer.
- The customer has its own support team and in-house configuration engineers. After mutual agreement (with the above root cause in mind) we use the customer’s Teams structure to pass on these messages. The internal team then does the analysis and creates an internal Jira ticket for bugfixes; if the issue is infrastructure-related, the customer creates a support ticket with pfx.
- The solution has certain limitations, described in ticket PFIM-7245.
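To illustrate the daily aggregation idea, here is a minimal sketch. It is not how Platform Manager implements its notifications. The job record shape, the partition and job names, and the simple `{"text": ...}` webhook payload are all assumptions made for the example.

```python
import json
import urllib.request
from collections import Counter


def build_digest(partition, failed_jobs):
    """Aggregate failed jobs by name into one plain-text message."""
    counts = Counter(job["name"] for job in failed_jobs)
    lines = [f"JST ERRORs on {partition}: {len(failed_jobs)} in the last 24 h"]
    lines += [f"- {name}: {count}" for name, count in counts.most_common()]
    return "\n".join(lines)


def post_to_teams(webhook_url, message):
    """Send the digest as a simple text payload (assumed webhook format)."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical job records; only the digest is built here, nothing is sent.
failed_jobs = [
    {"name": "PA_DataLoad"},
    {"name": "PriceList_Calc"},
    {"name": "PA_DataLoad"},
]
digest = build_digest("customer-prod", failed_jobs)
print(digest)
```

Sending one aggregated message per day, instead of one message per failed job, keeps the channel readable for the team that does the follow-up analysis.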

ERROR lookback

The above creates a watchdog on top of JST jobs, which means async jobs only. What about quotes, contracts, dashboards, custom forms and others? Those can be covered by another Platform Manager tool, Platform Manager ALERTS: https://pricefx.atlassian.net/wiki/spaces/PM/pages/5271726933
The idea is that an Alert setup scans the server Loki log, the same one CEs use when debugging a bug, and you can set up a filter that reports, for example, every ERROR.
The motivation here is that some dashboards lack input validation and can generate massive queries against the DB. Such a dashboard will eventually fail.
The tool has a limitation: we cannot create aggregated messages the same way as with JST. That should be addressed as part of PFIM-7245 as well.

Resources Monitoring

Support’s Alert tool: Support can create any alert on top of Grafana data and can set up filters specific to a given customer. The alerts then arrive in the DSE mailbox, filtered by customer and partition. We set up the following metrics:

OOM (out of memory) of jobs
long-running jobs; in our case, more than 3 hours is considered a long run
high CPU load of a job
underlying operating system issues and other infrastructure-related issues
lack of free space

The above are infrastructure-specific alerts, which have to be handled by the DSE.
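As an illustration of the long-running job rule, the following sketch applies the 3-hour limit to a list of job records. The real alerts are defined by Support on top of Grafana data; the record fields and job names here are hypothetical.

```python
from datetime import datetime, timedelta

# Limit from the list above: more than 3 hours counts as a long run.
LONG_RUN_LIMIT = timedelta(hours=3)


def long_running_jobs(jobs, now, limit=LONG_RUN_LIMIT):
    """Return the names of jobs whose run time exceeds the limit.

    A job without a "finished" time is treated as still running and is
    measured against `now`.
    """
    flagged = []
    for job in jobs:
        end = job.get("finished") or now
        if end - job["started"] > limit:
            flagged.append(job["name"])
    return flagged


# Hypothetical job records.
now = datetime(2024, 1, 10, 12, 0)
jobs = [
    {"name": "nightly_load", "started": datetime(2024, 1, 10, 8, 0)},
    {
        "name": "price_calc",
        "started": datetime(2024, 1, 10, 9, 30),
        "finished": datetime(2024, 1, 10, 11, 0),
    },
    {
        "name": "export",
        "started": datetime(2024, 1, 10, 6, 0),
        "finished": datetime(2024, 1, 10, 9, 30),
    },
]

print(long_running_jobs(jobs, now))
```

Measuring unfinished jobs against the current time means a hanging job is reported while it is still running, not only after it ends.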

Logic efficiency analysis

Performance issues are often caused by customer-layer code, but we have few tools to evaluate it. Right now the only tools are:

Built-in JST analysis - we see the time needed to execute a given element, plus a breakdown of the queries.
StopWatch - if the above is not enough, we can use the StopWatch util to measure certain pieces of code within a given element or library.

The above tools are used for ad hoc analysis. We need something that runs checks automatically and regularly, so we created tools to help us analyze problematic logics:

Memory Performance Troubleshooting Dashboard: https://gitlab.pricefx.eu/components/solutions/memory-performance-troubleshooting - this collects data from Grafana and stores it at partition level. Partition jobs then connect pod memory usage with JST information, so we can actually see which jobs take how much memory. This is applicable only to single pods.
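The idea of connecting pod memory usage with JST information can be sketched as a time-window join. This is not the dashboard’s actual implementation; the sample format (timestamp, MiB), the job record fields and all names are assumptions for illustration.

```python
from datetime import datetime


def peak_memory_per_job(jobs, memory_samples):
    """Return the peak pod memory (MiB) seen during each job's run window.

    memory_samples is a list of (timestamp, mib) pairs for a single pod.
    A job with no sample inside its window maps to None.
    """
    peaks = {}
    for job in jobs:
        in_window = [
            mib
            for ts, mib in memory_samples
            if job["started"] <= ts <= job["finished"]
        ]
        peaks[job["name"]] = max(in_window) if in_window else None
    return peaks


# Hypothetical memory samples for one pod and hypothetical job windows.
day = (2024, 1, 10)
memory_samples = [
    (datetime(*day, 1, 0), 2000),
    (datetime(*day, 1, 30), 5200),
    (datetime(*day, 2, 0), 3100),
    (datetime(*day, 3, 0), 2500),
]
jobs = [
    {"name": "data_load", "started": datetime(*day, 0, 45), "finished": datetime(*day, 1, 45)},
    {"name": "calc", "started": datetime(*day, 1, 50), "finished": datetime(*day, 2, 30)},
    {"name": "cleanup", "started": datetime(*day, 4, 0), "finished": datetime(*day, 4, 10)},
]

print(peak_memory_per_job(jobs, memory_samples))
```

In this sketch, jobs that overlap in time on the same pod share the same samples, so the value is an upper bound for each job rather than an exact attribution.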

On top of the memory performance dashboard we used the https://pricefx.atlassian.net/wiki/spaces/UNITY/pages/5202022228 accelerator to run an analysis once a week. The accelerator basically lets you set up a repetitive job on top of any dashboard or rollup. Our conditions are the following:

If the memory usage of certain jobs goes above a threshold or shows an uptrend compared to the past, we create a task for a customer team member, so that the member can analyze the root cause of the high memory consumption.
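The two conditions, threshold and uptrend, can be sketched as follows. The accelerator’s real configuration is not shown here; the weekly peak values, the threshold and the 1.2 uptrend ratio are hypothetical choices for illustration.

```python
from statistics import mean


def needs_review(weekly_peaks_mib, threshold_mib, uptrend_ratio=1.2):
    """Return a reason string if a job needs review, otherwise None.

    weekly_peaks_mib is ordered oldest first; the last value is the
    current week. "Uptrend" here means the current week exceeds the
    average of the earlier weeks by the given ratio (an assumed rule).
    """
    if not weekly_peaks_mib:
        return None
    *history, current = weekly_peaks_mib
    if current > threshold_mib:
        return "above threshold"
    if history and current > uptrend_ratio * mean(history):
        return "uptrend"
    return None


# Hypothetical weekly peak memory per job, in MiB.
print(needs_review([3000, 3100, 3050, 6500], threshold_mib=6000))
print(needs_review([3000, 3100, 3050, 4200], threshold_mib=6000))
print(needs_review([3000, 3100, 3050, 3200], threshold_mib=6000))
```

Checking the trend as well as the absolute threshold lets a job that is growing steadily be picked up before it reaches the limit.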

What’s in the pipeline

In the future we would like to tackle the topic of DB queries: how to recognize a heavy query, and how to connect that query to a specific logic, ideally to the concrete element.

Conclusion

Above I tried to sum up how we looked at proactive monitoring from the customer’s perspective, and how we set up the environment and tools so that we can claim we are doing some proactive monitoring of our partitions.

Thanks to @Radek Feber and @Aleksandr Volkov for consultations.