Many users considering the use of hashed files in their server jobs may instinctively assume that the parallel data set stage serves as an appropriate functional equivalent. As a result, these users might understandably question why S2PX does not convert hashed file stages into data set stages.
The unique capabilities of the Hashed File stage
Despite its name, the hashed file stage operates very differently from other file-based data storage stages, such as the data set or sequential file stages. It supports several unique design patterns that are not possible with existing DataStage parallel stages, a limitation that often requires intricate parallel job designs and/or job decomposition to replicate the functionality of the source Server job. For Server jobs that utilize hashed files, S2PX delivers a parallel engine-compatible solution that produces the same outcomes as the hashed file stage.
The unique capabilities of the hashed file stage may not be immediately evident from its documentation. Some of them stem, in part, from its use of a hot cache. A hot cache is a type of temporary storage that retains frequently accessed data in memory, enabling quicker retrieval compared to accessing the original data source, which is typically located on a slower medium such as disk. This caching strategy is also utilized when writing data: it enables the staging (and manipulation) of output data in memory before changes are eventually written to the hashed file’s persistent data store. We’ll expand on the implications of this design below.
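To make the write-side behaviour concrete, here is a minimal sketch in plain Python (not DataStage code; the class and method names are invented for illustration) of a write-back "hot cache" that stages keyed records in memory before flushing them to a persistent store:

```python
# Conceptual sketch of a write-back "hot cache": rows sharing a key
# overwrite one another in memory, so only the final version of each
# record reaches the persistent store when the cache is flushed.

class HotCacheWriter:
    def __init__(self):
        self._cache = {}   # in-memory staging area, keyed by record key
        self._store = {}   # stands in for the on-disk hashed file

    def write(self, key, record):
        # A later row with the same key simply replaces the cached value;
        # nothing is written to the persistent store yet.
        self._cache[key] = record

    def flush(self):
        # Persist the de-duplicated cache contents in one pass.
        self._store.update(self._cache)
        self._cache.clear()
        return self._store

writer = HotCacheWriter()
writer.write("A001", {"qty": 1})
writer.write("A002", {"qty": 5})
writer.write("A001", {"qty": 3})   # upsert: replaces the earlier A001 row
store = writer.flush()
print(store["A001"])               # → {'qty': 3}, only the last A001 row survives
```

A real hashed file of course persists to disk and hashes the key to locate a record's group; the dictionary here only illustrates the key-based staging idea.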
Usage analysis
Many users rely on anecdotal information and assumptions about the features used by their Server jobs (particularly the way their Hashed File stages are used) and are sometimes surprised when the complexities of their Server solution are made clear by S2PX analysis and conversion. S2PX’s approach to converting hashed files has been informed by an analysis of over 150K Server jobs from numerous real-world customer projects. This analysis reveals that users' job designs commonly depend on the following unique capabilities of hashed files:
- A hashed file provides database-like record locking and multi-process support, meaning multiple jobs, or multiple stages within the same job, can read and modify the same hashed file concurrently.
- A hashed file stage used as an output stage will use key values to enable the upsert of its data during job execution. The specification of at least one key column is mandatory, and all records are stored based on the provided key. The hashed file stage’s caching strategy therefore enables multiple output data rows with the same primary key to update their values in the cache before they are eventually written to the hashed file’s persistent data store. This key-based update mechanism during writing effectively offers a cost-efficient de-duplication capability without the need to sort the input data, which would be necessary to achieve the same (much slower) result with the parallel engine.
- When using a hashed file as a source for lookups, the hashed file’s use of a hot cache, and its ability to access its contents randomly using a key value, mean that the entire file does not need to reside in memory, providing a very efficient, high-performance lookup capability.
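The lookup behaviour described above can be sketched in plain Python (again, not DataStage code; the class and file layout are invented for illustration). Records are fetched from disk by key on demand, and only the keys actually requested are retained in the hot cache, so the full file never needs to fit in memory:

```python
import json
import os
import tempfile

class KeyedLookup:
    """Sketch of key-based random access backed by a small hot cache.
    Only keys that are actually looked up are held in memory."""

    def __init__(self, path):
        self._path = path
        self._cache = {}

    def lookup(self, key):
        if key in self._cache:            # hot-cache hit: no disk access
            return self._cache[key]
        with open(self._path) as f:       # cache miss: fetch from disk
            for line in f:
                rec = json.loads(line)
                if rec["key"] == key:
                    self._cache[key] = rec
                    return rec
        return None                       # no matching record

# Build a small keyed reference file on disk.
path = os.path.join(tempfile.mkdtemp(), "ref.jsonl")
with open(path, "w") as f:
    for k, v in [("A001", "widget"), ("A002", "gadget")]:
        f.write(json.dumps({"key": k, "value": v}) + "\n")

ref = KeyedLookup(path)
print(ref.lookup("A001")["value"])   # fetched from disk, then cached → widget
print(ref.lookup("A001")["value"])   # served from the hot cache → widget
```

A real hashed file uses the key's hash to seek directly to the record's group rather than scanning the file linearly; the scan here simply keeps the sketch short while illustrating the caching idea.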
As you’ll no doubt appreciate, none of these capabilities exist for the parallel engine’s file-based stages.
Some of these usage patterns can be determined through static analysis, but many cannot, and they can also be difficult to identify reliably through manual investigation. Without a holistic view of your Server solution, it is impossible to know, by looking at a single hashed file in a single job, which capabilities that hashed file relies on when it is used in other jobs. While S2PX may generate jobs that some feel are not ideal, it aims to minimize, as far as possible, the need to perform any of this analysis, while providing a job design that can easily be tweaked when we get it wrong (e.g. switching between a sparse and a normal lookup is easy and covers most cases).
References
See the following pages:
- Why does S2PX generate Parallel jobs which run in Sequential mode?
- We also provide a set of Asset Queries which identify some of the more complex patterns in which your Hashed File stages are deployed.