Many video operators spend considerable time, effort and resources on detecting, pinpointing and resolving event-oriented problems, such as broken equipment or faulty cables. These processes focus on detecting incidents and then bringing the system back to its normal state of operation. What is sometimes overlooked, or at least handled with lower priority, is whether the normal state is what it should be. What is the overall delivered service quality when there are no known faults? What is our normal level of service quality and availability? What could or should it be?
To address these questions an operator first has to understand what the normal state consists of in terms of issues. Which issues are caused by systematic problems and which are truly random? Our experience from working with operators shows that systematic issues are almost always present, and that successfully addressing them has proved to deliver significant improvements in service quality and dramatic reductions in customer complaints. The problem, of course, is identifying the systematic but non-obvious issues among the hundreds or thousands of small events and anomalies.
FINDING AND ASSESSING THE ISSUES
The first order of business in finding problems under normal operating conditions is to detect them; for non-obvious problems this can be quite complicated. Looking at raw QoS data may be difficult, since there are typically many metrics gathered at many measurement points. In this situation an aggregated QoE metric can be very helpful: it can take multiple QoS parameters into account at a single glance, both for individual customers and for groups of customers. A well-designed QoE metric can thus help detect problems that are otherwise hard to find.
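One way such an aggregated metric can work is to normalise each QoS parameter against a degradation threshold and combine them with weights into a single score. The sketch below is a minimal illustration of that idea; the parameter names, thresholds and weights are assumptions made for the example, not any particular vendor's metric.

```python
# Hypothetical compound QoE score: thresholds and weights are
# illustrative assumptions, not a real product's formula.

def qoe_score(packet_loss_pct, rebuffer_ratio, startup_delay_s):
    """Map several QoS parameters onto a single 0-100 QoE score.

    Each parameter is normalised to [0, 1] against an assumed
    'fully degraded' threshold, then combined with example weights.
    """
    impairments = [
        (min(packet_loss_pct / 2.0, 1.0), 0.4),   # 2% loss -> fully degraded
        (min(rebuffer_ratio / 0.05, 1.0), 0.4),   # 5% rebuffering -> fully degraded
        (min(startup_delay_s / 10.0, 1.0), 0.2),  # 10 s startup -> fully degraded
    ]
    degradation = sum(value * weight for value, weight in impairments)
    return round(100 * (1 - degradation), 1)
```

Averaging such per-customer scores over a group gives the aggregated view described above; a healthy service scores near 100, and any sustained dip flags a problem worth investigating.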
Even with the help of a well-designed QoE metric, the process of identifying the systematic issues caused by non-obvious problems can be complex and is by nature iterative and exploratory. There are a number of dimensions to explore in this process:
- Problems might be intermittent and/or periodic; you might need to look at monitoring data covering extended periods of time to find patterns.
- Problems might be related to combinations of equipment. For instance, a certain model or software version of set-top box combined with a certain model or version of DSLAM or edge QAM might not work well together. Finding connections like this requires access to various metadata about the service delivery chain.
- Problems might be load or usage related. Understanding load and usage – both aggregated in the distribution chain and at the individual user – is important.
- Problems might be statistical. The problem may not show up every time a certain situation occurs, but rather have an increased probability of occurring in certain situations.
- Problems might be related to combinations of all of the above.
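To make the equipment-combination dimension concrete, a first exploratory step could be as simple as counting error events per combination of device metadata and looking for outliers. The sketch below assumes hypothetical event records and field names; real analysis would also normalise counts by the size of the population behind each combination.

```python
# Illustrative sketch: count error events per (STB model, DSLAM model)
# pairing to surface combinations with unusually many problems.
# Event records and field names are assumptions for the example.
from collections import Counter

events = [
    {"stb": "STB-A", "dslam": "DSLAM-1"},
    {"stb": "STB-A", "dslam": "DSLAM-2"},
    {"stb": "STB-A", "dslam": "DSLAM-2"},
    {"stb": "STB-B", "dslam": "DSLAM-2"},
    {"stb": "STB-A", "dslam": "DSLAM-2"},
]

combo_counts = Counter((e["stb"], e["dslam"]) for e in events)
worst_combo, count = combo_counts.most_common(1)[0]
# worst_combo is the pairing that accumulated the most error events
```

In this toy data the STB-A/DSLAM-2 pairing stands out, which is exactly the kind of non-obvious interaction the metadata-driven exploration described above is meant to find.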
To drive this exploratory process you need access to high-resolution data covering extended periods of time, measured at various points in the service delivery chain, preferably all the way down to the customer viewing device. You also need tools enabling aggregation, correlation and visualisation in order to identify the systematic patterns in this vast set of data.
Finally, once a problem has been identified, an aggregated QoE metric is again valuable to assess the overall impact of the problem on the service.
REQUIREMENTS ON A SOLUTION
A solution supporting this process strongly benefits from the following capabilities:
A compound QoE metric
A metric estimating the actual end-user service impact, both for individual customers and aggregated for groups of customers.
Access to high resolution historical data
Some of the issues hiding in the noise are likely to vary over time, driven by activity/load cycles, periodic activities in the solution/equipment, etc. To find these issues, key metrics need to be available at high resolution over extended periods of time.
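A simple way to expose such a periodic pattern in high-resolution historical data is to bucket error timestamps by hour of day across many days and look for a peak. The sketch below uses fabricated timestamps (seconds from an epoch) that simulate errors clustering around 03:00 each night, e.g. from a nightly maintenance job; all the data is an assumption for illustration.

```python
# Minimal sketch: find a daily pattern by bucketing error timestamps
# (seconds since an epoch) by hour of day. Data is fabricated to
# simulate errors clustering around 03:00 every night for a week.
from collections import Counter

error_timestamps = [day * 86400 + 3 * 3600 + offset
                    for day in range(7) for offset in (0, 120, 300)]

hourly = Counter((ts % 86400) // 3600 for ts in error_timestamps)
peak_hour, peak_count = hourly.most_common(1)[0]
# peak_hour reveals the hour of day when problems systematically recur
```

Looking at any single day, three small events are easy to dismiss; accumulated over a week of high-resolution data, the 03:00 spike becomes unmistakable.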
Flexible grouping and aggregation capabilities
To find the systematic patterns in events spread over a large population of customers, where each individual customer may have a very low problem intensity, aggregation over groups of customers is important. The grouping needs to be flexible enough to allow groups to be defined along relevant dimensions such as geography/topology, device type, consumed service type, etc.
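The point about low per-customer problem intensity can be sketched in a few lines: individually, one or two errors per customer look like noise, but aggregating error rates per group (here, by a hypothetical region label) makes a systematic difference visible. The customer data and region assignments are invented for the example.

```python
# Sketch: per-customer error counts are too sparse to judge individually,
# but aggregating by a group dimension (region, device type, ...) exposes
# a systematic difference. All data here is illustrative.
from collections import defaultdict

customer_errors = {
    "cust-1": ("north", 1), "cust-2": ("north", 0), "cust-3": ("north", 2),
    "cust-4": ("south", 0), "cust-5": ("south", 0), "cust-6": ("south", 1),
}

per_region = defaultdict(lambda: [0, 0])  # region -> [total errors, customers]
for region, errors in customer_errors.values():
    per_region[region][0] += errors
    per_region[region][1] += 1

rates = {region: errs / n for region, (errs, n) in per_region.items()}
# rates now shows errors per customer for each region
```

With flexible grouping, the same aggregation can be re-run along any dimension, which is what turns a cloud of individually insignificant events into an actionable finding.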
Visualisation and data exploration tools
Interactive tools for aggregating, correlating and visualising the data are needed to support the iterative, exploratory nature of the process.
THE DATA-DRIVEN APPROACH
Taking a data-driven approach to the systematic improvement process can lead to improved service quality, significant savings and increased customer satisfaction. A systematic process based on objective data, supported by a capable toolset, enables an operator organisation to make informed decisions in reaching the optimal and most cost-efficient QoE for its operations.
Guest post by Mikael Dahlgren, CEO at Agama Technologies.
Published in Videonet Opinions on June 17, 2015.