The Enemies of Performance
Introduction: The Challenge of Data Visualization
Computer hardware has evolved substantially over the past ten years. PCs, servers, and data centers are more powerful than ever and ready to tackle the demands of big data. As technology continues to decrease in cost, sensors and IoT devices proliferate, increasing global data ingestion. Data volume is soaring, but so is the available compute power. Although high-end hardware is more available than ever before, a major challenge businesses face today is processing their data.
While modern hardware is equipped to meet the increasing demands of big data, conventional dashboards are often unoptimized, and many visualization platforms bottleneck and slow down under heavy load. Even products that advertise live analytics are rarely designed to take full advantage of modern hardware. Businesses must be able to draw timely insights from their data, so delayed processing is no longer acceptable. Code inefficiency and legacy architecture are two obstacles that confront many dashboards.
Era of Inefficiency
Modern software development tends to rely more on hardware improvements than on performance coding. Software benefits as hardware compute speed increases, even without explicit optimizations. But this passive approach, depending on hardware speed alone, misses opportunities to fully utilize available resources. Even when backed by the power of modern hardware, inefficient code struggles to meet performance requirements in high-stress data environments such as live data streaming, visualization, and analytics.
In the early days of computing, hardware was limited and costly. Software engineers had no choice but to optimize their code to make the best use of the available hardware. Consequently, high-performance code was at the forefront of design. As hardware has become more powerful, however, performance coding has become less of a priority.
Modern hardware has allowed for a relaxed attitude toward execution speed. Although that early degree of optimization is no longer necessary in everyday applications, performance code is still essential for heavy-duty use cases. When data applications rely solely on hardware improvements for speed gains, without considering the need for optimization, they inevitably run into performance bottlenecks.
Legacy Database Code
Many popular data tools have longstanding origins and were developed for legacy hardware. Though these tools were cutting-edge in their day, hardware capabilities have changed. Today’s hardware enables newer techniques in memory utilization, storage access, and CPU and GPU compute that were not available in the past. Integrating new performance technologies into existing stacks is a major challenge for these older systems. In many cases, their codebases would have to be rewritten to achieve full optimization.
Row64's Modern Performance
Row64 is a real-time visual business intelligence platform that is built from the ground up for maximum speed and scalability. Row64 employs modern performance principles and taps into the full potential of high-performance hardware. Performance is at the core of Row64’s design.
Row64’s founding team has a long history in game technology. Modern game engines prioritize performance and take advantage of chip advancements to achieve the high frame rates needed for AAA multiplayer games. These games are both computationally and graphically intensive: not only do they involve complex computations, such as simulations and real-time rendering, but they also stream interactive data.
The data industry has yet to catch up with the best practices of the game industry. Row64’s expertise in game technology, graphics, and data analytics allows it to herald a new generation of dashboard, data streaming, display, and analytics tools. Row64’s custom stack is, in effect, a game engine for data. Row64’s goal is to break speed records and raise the standard for data processing and visualization.
Identifying the Enemies of Performance
A primary strategy of high-performance coding is to identify performance bottlenecks and steer around them. Think of the classic game Pac-Man, where the player has to navigate a maze while avoiding the ghost enemies. Similarly, creating high-performance software involves navigating among all the parts of making a good product (hardware, software, user interface, and design) while avoiding costly processes or practices that reduce performance.
Row64 has identified four major enemies of performance in streaming and dashboard software:
Memory allocation
Cache misses
Ignoring hardware
Garbage collection
Enemy #1: Memory Allocation
Memory Allocation Overview
Programs need space in memory to run and store information. Memory allocation is the process of designating space in a system’s mass storage (SSD) or memory (RAM) for a program’s working data. When a program needs space, it must request a block of memory from the operating system. The operating system finds a suitable block and returns its address to the program, and from there, the program can use the space. When the program no longer needs the allocated memory, it releases it back to the general pool, making it available to other processes. Allocating and releasing memory is one of the major bottlenecks to performance.
The Cost of Memory Allocation
Memory allocation is slow. There is a time cost when a chunk of memory is requested and allocated, and again when it is freed. This cost varies with whether the allocation occurs in RAM or on the SSD, the type of hardware, and the amount of memory being requested. Current estimates of memory allocation speed range from under 100 μs (microseconds) to 400 μs per megabyte. If code repeatedly requests and releases memory, it incurs the time penalty every time. It’s much faster to allocate once and reuse the memory than to release it and request it again.
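To make the principle concrete, here is a minimal Python sketch comparing per-round allocation against allocate-once-and-reuse (the buffer size, round count, and resulting timings are machine-dependent and illustrative only):

```python
import time

ROUNDS, SIZE = 5_000, 1_000_000  # 5,000 rounds with a 1 MB buffer

# Request a fresh 1 MB buffer from the allocator on every round.
t0 = time.perf_counter()
for _ in range(ROUNDS):
    buf = bytearray(SIZE)  # new allocation (the old buffer is released)
    buf[0] = 1
t1 = time.perf_counter()

# Allocate once up front and reuse the same buffer each round.
buf = bytearray(SIZE)
t2 = time.perf_counter()
for _ in range(ROUNDS):
    buf[0] = 1             # same work, no allocation
t3 = time.perf_counter()

print(f"allocate every round: {t1 - t0:.3f} s")
print(f"allocate once, reuse: {t3 - t2:.3f} s")
```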
An area where memory allocation has an especially high cost is copying. Copying between memory blocks is expensive and introduces additional penalties. Every time a block of memory is copied to another block, the CPU needs time not only to allocate the new space but also to transfer the bytes from the source to the target location. Performant code minimizes copies as much as possible, a strategy referred to as the "Zero-Copy" technique: when you can work on an existing piece of memory in place, don't make a copy.
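As a small illustration in Python, slicing a bytearray copies bytes, while a memoryview lets code work on the existing buffer in place (the data and sizes here are arbitrary):

```python
data = bytearray(b"ts,value\n" * 1_000_000)  # ~9 MB of raw bytes

# Copying: slicing a bytearray materializes a brand-new buffer,
# costing an allocation plus a byte-for-byte transfer.
body_copy = bytes(data[9:])

# Zero-copy: a memoryview slice references the same underlying
# storage; no payload bytes are allocated or moved.
body_view = memoryview(data)[9:]

assert body_view[0] == data[9]  # both read the very same byte
```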
No software truly eliminates all copies, however; this is especially true for web-based applications, since transferring information across networks requires copies. But while good code design can’t eliminate every copy, it can greatly reduce them.
Siloed Stacks
Software often revolves around a technology stack. A stack-based architecture deploys discrete layers of applications that work together to form a comprehensive platform. Generally, data must be copied and transferred between the layers for the platform to work. Stack-based architectures result in a segregation, or siloing, of data.
An example of a stack-based platform is Apache Superset, an open-source dashboard solution. A Superset stack may include a MySQL database on the back end, Python and SQLAlchemy for compute, Flask for network communications, and React for browser display and the UI. Each component of the stack is developed separately by a different team, and the components typically don’t share the same data format, so each is effectively in its own silo. This example Superset stack also involves at least three languages: C (MySQL), Python (compute), and JavaScript (React).
Stack-based platforms constantly copy and translate information across layers just to complete simple tasks. In the Apache Superset example, a piece of data has a long journey before it arrives at the end user: it might be copied from the database to the computing service, processed, have its results copied, and finally be pushed over the network to the front-end display.
Every time communication occurs between the layers, data must be transferred. Each stack component will need memory to be allocated, and then data will need to be copied and moved. If the components use different data formats, extra CPU resources are needed to parse and restructure the data. Language conversions between components involve a more severe cascade of memory allocations and copies.
Python Case Study
Python is a popular scripting language that is used extensively for data analytics and processing. Python is generally considered a user-friendly language that abstracts away the difficulties of lower-level languages, such as casting variable types and memory management. Its standard implementation, CPython, is written in C.
Under the hood, Python stores data in PyObject structures: C structures linked to one another by pointers into a dependency graph. Each PyObject includes a pointer to a second structure, called a PyTypeObject, that contains all the connections and details used to manage the object. Creating a single PyObject is expensive; it not only requires a memory allocation, but it also incurs a noncontiguous memory penalty, since Python internally works with linked chains of pointers.
The following experiment explores how a stack component, written in C, might pass a single date value to a Python application. In C, a single date value can be stored as an 8-byte structure that represents the number of nanoseconds from an epoch. We conducted a simple test to track where and how a single data value is stored and processed when it is passed to Python. This was done by break-pointing Python’s internal code.
For this example, consider a date value: 1/2/2022. Converting this 8-byte date to the equivalent date in Python would require newly allocating 16 + 408 = 424 bytes, simply to hold the basic information. While a single value isn’t significant, how would performance be affected with 100 million dates in the dataset? Each of the 100 million dates would need to generate PyObjects. Not only is there the cost of simply copying the records, but there is also substantial overhead from the additional memory allocations. Language conversions between siloed layers of a technology stack can introduce some of the most severe performance penalties.
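A rough way to observe this overhead from Python itself is sys.getsizeof, which reports each object’s own footprint. Note that it excludes the surrounding type machinery counted in the 424-byte figure above, so it understates the true cost:

```python
import sys
from datetime import datetime

RAW_BYTES = 8  # in C, nanoseconds-from-epoch fits in one 8-byte integer

d = datetime(2022, 1, 2)
print(sys.getsizeof(d))  # e.g. ~48 bytes for the datetime object alone

# Scaling to 100 million dates: raw 8-byte C values vs. a list of
# PyObjects (the +8 accounts for the list's pointer to each object).
n = 100_000_000
print(f"raw C array:   {n * RAW_BYTES / 1e9:.1f} GB")
print(f"PyObject list: {n * (sys.getsizeof(d) + 8) / 1e9:.1f} GB or more")
```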
Python plays an important role in many data analytics use cases. Within the Python community, tools like NumPy and Apache Arrow have attempted to reduce the PyObject penalty and implement performance improvements. However, the best solution is to avoid extra data copies, especially cascading copies. It is better to work in a unified system with vertically-integrated stack components. This leads to speed gains by avoiding copies, parsing, and penalties due to communication between components written in different languages.
Enemy #2: Cache Misses
The Memory Hierarchy
Computer memory hardware is organized in a hierarchy based on the speed, cost, and capacity of the technology. Near the bottom of the hierarchy is mass storage, which includes technologies like HDD and SSD. Mass storage drives are non-volatile and have a high capacity for long-term storage. Although SSDs are substantially faster than HDDs, mass storage remains comparatively slower than other forms of memory. Mass storage technologies are also the least expensive on the hierarchy.
Above mass storage is RAM, which is a computer’s main form of working memory. While mass storage drives are for long-term data preservation, RAM is used to temporarily hold a program’s active data. The underlying technology for RAM is DRAM (Dynamic Random Access Memory). DRAM is volatile, which means that it does not preserve data when it loses power. DRAM is faster, but more expensive, than SSD technology.
Near the top of the hierarchy is CPU cache. Cache is a fast form of SRAM (Static Random Access Memory) that stores information that the CPU immediately needs to perform calculations. SRAM is faster than DRAM, but is substantially more expensive. A CPU commonly has three cache layers: L1, L2, and L3. SRAM technology is extremely fast, but due to its high cost, it is deployed in small quantities.
While prices and speeds vary significantly according to the manufacturer and quality, the following table compares current storage technologies found in common, non-specialized consumer desktop systems:
The further up the hierarchy, the faster and more expensive the storage technology becomes. But, due to the increasing price, the capacity decreases.
When a program runs on a computer, it is first copied from mass storage to RAM, and the program remains in RAM throughout its execution. As the program runs, relevant bytes of data are forwarded from RAM to cache for processing. Since cache is small, data must constantly cycle between RAM and cache. The idea of the memory hierarchy is to keep relevant bytes in faster forms of memory as much as possible to maintain high speeds.
When a program needs bytes of data, the system searches from the top of the memory hierarchy down. L1 cache is the first location the CPU draws data from. If the requested data is not in L1, the CPU checks L2, and then L3, and continues down the hierarchy until the data is found. Searching through the hierarchy is time consuming, especially when slower forms of memory must be queried.
Caching
Caching is the process of moving frequently accessed data from main memory to cache to better serve future requests. This way, when the same data is requested again, it can be immediately processed from cache, saving the time of searching through slower RAM every time it’s needed.
The best scenario is for the needed data to always be available in cache. But, since CPU cache is small, bytes need to be constantly swapped between cache and RAM. Caching uses prefetching, which is a process that attempts to retrieve data the CPU might need next, before it asks for it.
Prefetching is based on the principles of temporal and spatial locality. Temporal locality is the concept that the CPU is likely to use the same data again soon after it initially accesses it. In other words, when the CPU needs information, temporal locality predicts that the CPU might need that same information again in the near future. So, it would be advantageous to temporarily leave it in cache. Spatial locality refers to the tendency for the CPU to request data that is physically near to the location that was recently accessed in memory. With spatial locality, the prefetcher grabs bytes of data that are around the requested data, in an effort to preload cache with relevant data.
Cache Misses
When the needed data is present in cache, the CPU is able to quickly perform the needed calculations and proceed to the next step. When the data is not in cache, however, the system must locate and transfer the data to cache before the CPU can continue. A cache miss occurs when the CPU requests data that is not present in cache. In this situation, the data request propagates down the memory hierarchy until the data is found. The CPU first searches the L1 cache. If it is not there, it proceeds to L2, and then L3. If the requested data is not present at all in cache, it will continue down to RAM. When the requested data is found, the data is transferred to cache and the processing resumes.
Each step down the hierarchy is slower than the previous step. L2 cache is slower than L1 because it is larger and further from the CPU core. L3 is slower than L2 for similar reasons. RAM is significantly slower than L3. For high-performance software, cache misses are costly; each step down the memory hierarchy is nearly an order of magnitude slower than the previous step. Cache misses can potentially take hundreds of clock cycles to resolve.
Caching Benefits of Contiguous Memory
Since cache is the fastest form of memory and closest to the CPU, it is ideal for relevant data to be present in cache as often as possible. Cache misses are extremely costly, so organizing data to optimize prefetching is crucial. Prefetching is heavily influenced by the underlying organization of the data in RAM.
When a program’s data is adjacent, it is easier for the prefetcher to retrieve the requested bytes along with the next block. This is the principle of spatial locality. When data is fragmented across memory, spatial locality is no longer effective, resulting in more cache misses and slowdowns. Since relevant data needs to constantly be fed into cache, contiguous memory is crucial.
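The effect is easy to observe from Python with NumPy: in a C-order array each row is contiguous, so a row-wise traversal cooperates with the prefetcher while a column-wise traversal fights it (exact timings vary by machine):

```python
import time
import numpy as np

a = np.zeros((5_000, 5_000))  # C-order: each 40 KB row is contiguous

t0 = time.perf_counter()
for i in range(a.shape[0]):
    a[i, :].sum()  # walks adjacent bytes: spatial locality pays off
t1 = time.perf_counter()

t2 = time.perf_counter()
for j in range(a.shape[1]):
    a[:, j].sum()  # jumps 40 KB between elements: frequent cache misses
t3 = time.perf_counter()

print(f"row-wise (contiguous): {t1 - t0:.2f} s")
print(f"column-wise (strided): {t3 - t2:.2f} s")
```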
Enemy #3: Ignoring Hardware
Missing Hardware Opportunities
Hardware has a lot more to offer today than in the past, but much software seldom takes full advantage of the available resources. Commonly missed performance-boosting opportunities include parallelization, using the GPU, and caching optimizations.
One reason many software applications aren't fully optimized for current hardware is that they were developed for older hardware and the implementation was simply preserved over time. For example, GPUs were used exclusively for graphics through the late 90s, but in recent years they've become useful for general computing. Applications such as crypto mining and machine learning benefit from GPUs. In the past, moving data between the CPU and the GPU was slow, and gains from the GPU were lost in the transfer time. With the introduction of PCIe 5.0, however, this bottleneck has been significantly reduced.
Legacy Code Challenges
Legacy applications face a formidable challenge in modernizing their code bases. An application’s core architecture defines how much performance can be achieved, and legacy code can hinder the ability to incorporate new performance features into the existing code base. Notable performance opportunities in recent years include innovations in parallelization, integrated GPUs, and GPU access for browser applications. Advances in these areas can’t be integrated into many legacy applications without reworking the entire core architecture; in other words, a rewrite.
Parallelization
Most computers today have at least six cores, with each core supporting two logical processors. The cost of adding cores to a computer has decreased over time, and the speed and capacity of each core have increased dramatically. Historic data tools were written and optimized for single-core performance. Not only that, but many scripting languages don’t use parallelization at all.
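Even in plain Python, where the global interpreter lock prevents threads from running bytecode in parallel, CPU-bound work can still be spread across cores with the standard library. A minimal sketch, with an arbitrary workload and chunking:

```python
from concurrent.futures import ProcessPoolExecutor
import os

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    n = 100_000_000
    workers = os.cpu_count() or 4
    step = n // workers
    # One chunk per core; the last chunk absorbs any remainder.
    chunks = [(i * step, n if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)  # == sum(range(n)), computed across all cores
```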
Parallelization throughput has also increased enormously on GPUs. This is most visible in video games, where today’s leading GPUs render complex scenes at hundreds of frames per second. Legacy applications optimized for single-core performance cannot take advantage of modern parallelization without being rewritten.
Integrated GPUs and GPU on the Browser
GPUs were not widely available on consumer hardware in the past, so historic data applications could not rely on them for performance gains. Today, integrated GPUs are available on almost all computers, tablets, and phones. In addition, integrated GPU speeds have increased enormously in the last five years, approaching the performance of discrete GPUs.
Within the last ten years, browsers started to embrace high-performance technologies, including WebAssembly, WebGL, and WebGPU. WebGL, for example, allows web applications to access a system’s GPU, but did not reach 95% adoption until around 2023. Browser applications built prior to this relied on HTML and JavaScript for display and compute.
Most leading dashboards were built for legacy technology. Row64 stands out because it was built for modern hardware and browsers, with the GPU in mind.
Enemy #4: Garbage Collection
Overview of Garbage Collection
Garbage collection is the process of reclaiming memory that is no longer needed by a program. Higher-level programming languages typically have automatic garbage collection, whereas lower-level languages, like C, C++, and Rust, allow for more granular memory management. Although automatic garbage collection is convenient and simplifies code, it often comes with the cost of reduced performance.
Performance Penalties
For languages that support automatic garbage collection, the garbage collector runs periodically without intervention. When the garbage collector initiates, it first searches for unreferenced memory. This search consumes clock cycles and can cause a brief hiccup or pause in the program during execution. For high-volume applications that require live updates, these pauses can be detrimental.
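In CPython, these pauses can be observed directly: the gc module exposes a callbacks list that is invoked at the start and stop of each collection. A small demonstration (the cycle count is arbitrary, and pause lengths depend on the machine):

```python
import gc
import time

pauses = []

def watch(phase, info):
    # CPython invokes this with phase "start" before a collection
    # begins and "stop" after it finishes.
    if phase == "start":
        watch.t0 = time.perf_counter()
    else:
        pauses.append(time.perf_counter() - watch.t0)

gc.callbacks.append(watch)

# Create lots of reference cycles: garbage only the GC can reclaim.
for _ in range(200_000):
    a, b = [], []
    a.append(b)
    b.append(a)

gc.collect()
gc.callbacks.remove(watch)
print(f"{len(pauses)} collections, longest pause: {max(pauses) * 1e3:.2f} ms")
```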
Garbage collection can also contribute to memory fragmentation. As portions of memory are allocated and deallocated over time, memory becomes scattered and nonadjacent, which could lead to performance degradation.
Combating the Enemies of Performance
Unified Memory
Unified Memory is the core strategy of Row64. Draw and compute across all components of the Row64 platform are based on the principles of Unified Memory. With Unified Memory, data is:
Vertically integrated: Making communication across the platform seamless.
Centralized: Reducing the need for copies and unnecessary memory allocation.
Contiguous: Taking advantage of cache prefetching.
Hardware-driven: Making the most of hardware when it is available.
Vertically-Integrated Stack
Row64 is a unified stack where components are vertically integrated and optimized for working together. Components like the front-end display and back-end compute service share the same memory layout. The data format is also consistent between all Row64 processes, so there is no data transformation or parsing between parts. This extends to components running on different hardware; communications between the Row64 Server and Dashboard Client likewise use the same format. All components are able to read the bytes of any data communication and retrieve the necessary details on demand without extra copies.
Row64 deploys its own RAM database, WebSocket server, and CPU/GPU parallelized compute engine, within a single Linux service. Because these run as a single process, they share RAM. That means Row64’s internal database, computing service, and network I/O can have immediate access to the same memory, eliminating the need for extra memory allocations, copies, and data transfers.
ByteStream and RamDB
Modern dashboards must be able to handle massive datasets and perform interactive investigations at fast speeds. For instance, cross-filtering allows users to drill in from a large dataset to a fine subset of details. This is valuable for tracking anomalies, discovering hidden data patterns, and transforming raw data into actionable insights. Cross-filtering while streaming data updates is especially demanding.
A primary objective for Row64 when developing the byte layout for the data formats used across the platform was to account for both scalability and interactivity. The byte layout is optimized for tasks on both the SSD and RAM.
The primary data format used by Row64 is ByteStream. ByteStream is optimized for contiguous memory, which minimizes cache misses. ByteStream is a binary dictionary, where data is stored as key-value pairs, but it is organized contiguously to yield high performance prefetching results.
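Row64’s exact byte layout is not published here, but a hypothetical sketch conveys the general idea of a contiguous binary dictionary: keys and values packed back-to-back in a single buffer and read back through zero-copy views (the field widths and layout below are illustrative assumptions, not the actual ByteStream specification):

```python
import struct

def pack(d):
    """Pack {str: bytes} into one contiguous buffer:
    [count][key_len][key][val_len][val]... (illustrative layout)."""
    parts = [struct.pack("<I", len(d))]
    for key, val in d.items():
        kb = key.encode()
        parts.append(struct.pack("<I", len(kb)) + kb)
        parts.append(struct.pack("<I", len(val)) + val)
    return b"".join(parts)

def unpack(buf):
    """Read the buffer back; values are zero-copy memoryview slices."""
    view = memoryview(buf)
    (count,) = struct.unpack_from("<I", buf, 0)
    pos, out = 4, {}
    for _ in range(count):
        (klen,) = struct.unpack_from("<I", buf, pos); pos += 4
        key = bytes(view[pos:pos + klen]).decode(); pos += klen
        (vlen,) = struct.unpack_from("<I", buf, pos); pos += 4
        out[key] = view[pos:pos + vlen]  # points into buf, no copy
        pos += vlen
    return out

blob = pack({"region": b"EMEA", "revenue": struct.pack("<d", 1.5e6)})
print(bytes(unpack(blob)["region"]))  # b'EMEA'
```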
Row64 also contains RamDB, a RAM database designed for high-speed data access and manipulation. RamDB uses the ByteStream layout, but includes additional specifications for storing tabular data records. RamDB’s layout is contiguous, so operations in memory reap the benefits of prefetching. Since tabular data also needs to persist when not in use, the RamDB design also considers the need for performance when reading from and writing to SSD.
RamDB’s byte layout takes advantage of modern SSDs and byte-level indexing for short and fast random access reads. This allows Row64 to extract and manipulate targeted chunks of data instantly. This is particularly useful for cross-filtering drill-ins, as records can be quickly accessed and used for compute operations.
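A simplified sketch of the access pattern this enables, assuming fixed-width records (the record size and file name are hypothetical, not RamDB’s actual format):

```python
RECORD_SIZE = 8  # hypothetical: one 8-byte value per record

def read_records(path, record_indices):
    """Fetch scattered records by seeking straight to their byte
    offsets; on an SSD these short random reads are cheap."""
    out = []
    with open(path, "rb") as f:
        for i in record_indices:
            f.seek(i * RECORD_SIZE)
            out.append(f.read(RECORD_SIZE))
    return out

# e.g. pull three records matching a cross-filter, skipping the rest:
# read_records("table.ramdb", [12, 40_960, 7_300_000])
```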
Prior to 2010, storage technology was primarily hard disk drives (HDD). HDDs access data from a magnetic, rotating platter, so the fastest way to access data was to read it sequentially in one block; fragmented or non-sequential reads incurred a heavy performance loss. The introduction of SSDs, however, meant that data could be read from anywhere on the drive without that penalty. While HDDs may seem like a thing of the past, most software stacks still use legacy code that’s optimized for long sequential reads.
Row64 set out to design a new display and compute stack for dashboards that would break records. The goal was for the dashboard to excel at cross-filtering and data drill-ins, but to also be responsive and easy to use. Row64 views cross-filtering as a frontier that drives innovation.
Hardware-Driven Compute
For the best performance during compute-heavy workloads in graphics and data evaluation, Row64 is designed to adapt to the underlying hardware. This adaptability is achieved through the Hybrid Engine. When multiple hardware resources are available, Row64’s Hybrid Engine adjusts to use the optimal resources. Row64 also works at a low level, which enables it to have a direct connection with the hardware.
Additionally, Row64 is highly threaded and parallelized on both the CPU and the GPU. Row64 can utilize all available cores on a CPU, if configured to do so. Network communications and requests, along with dashboard requests and evaluations, are also parallelized.
A New Generation of Data Visualization
Data processing and visualization platforms are presented with four major challenges:
Large data scales
Need for high speeds and interactivity
Multi-tenancy
Live streaming updates
While existing data visualization platforms have scrambled to keep up with these challenges, Row64 was built to address them. Row64 was designed to work closely with hardware at a byte level, and in some places at a bit level. By avoiding high-level stack components, Row64 bypasses many of the performance pitfalls and is able to push past the speed barriers of real-time visualization.
In a fast-changing industry where the demand for high-speed data visualization and streaming updates is growing, Row64 is well positioned to elevate the standards for a new generation of dashboards and data visualization.