Steps involved
We start by implementing the SerDe interface and setting up the internal state variables needed by other methods:

public class ColumnarMapSerDe implements SerDe {
  private List<String> columnNames;
  private ObjectInspector objectInspector;
  private Map<String, String> rowMap;
  // additional state used by deserialize() and getSerDeStats() below
  private List<String> rowFields;
  private SerDeStats stats;
  private long deserializedByteCount;
The initialize method is called when a table is created. It is responsible for verifying that the table definition is compatible with the underlying serialization and deserialization mechanism. In this case, we only support strings.
@Override
public void initialize(
    Configuration conf,
    Properties tableProperties)
    throws SerDeException {
  final List<TypeInfo> columnTypes =
      TypeInfoUtils.getTypeInfosFromTypeString(
          tableProperties.getProperty(LIST_COLUMN_TYPES));
  // verify that every column is a primitive string type
  for (TypeInfo type : columnTypes) {
    if (!type.getCategory().equals(PRIMITIVE) ||
        // one plausible way to require strings: compare type names
        !type.getTypeName().equals(STRING_TYPE_NAME)) {
      throw new SerDeException("Only string columns are supported");
    }
  }
  // ... (the rest of the method stores the column names and builds
  // the StructObjectInspector, as described under "Theory behind it")
}
The serialize method is called whenever a row should be written to HDFS. It is responsible for using the object inspector to extract the columnar data from the row and converting that data to a Hadoop Writable object:

@Override
public Writable serialize(
    Object obj,
    ObjectInspector objectInspector)
    throws SerDeException {
  StringBuilder builder = new StringBuilder();
  StructObjectInspector structOI =
      (StructObjectInspector) objectInspector;
  List<? extends StructField> structFields =
      structOI.getAllStructFieldRefs();
  // ... (the per-column loop is sketched under "Theory behind it")
  return new Text(builder.toString());
}
Similar to the serialize method, the deserialize method is responsible for converting a Hadoop Writable object back into columnar form. In this case, the input is a Text object containing our serialized map, which we break into keys and values and return as a List of Strings.
@Override
public Object deserialize(Writable writable)
    throws SerDeException {
  // reset internal state reused across rows
  rowMap.clear();
  rowFields.clear();
  Text text = (Text) writable;
  String content = text.toString();
  // ... (the split-and-fill logic is sketched under "Theory behind it")
  return rowFields;
}
The next two methods are used by Hive to describe the types used by this SerDe. Object inspectors describe how Hive should interpret the tabular data, and the serialized class describes how Hadoop should serialize the data to HDFS:

@Override
public ObjectInspector getObjectInspector()
    throws SerDeException {
  return objectInspector;
}

@Override
public Class<? extends Writable> getSerializedClass() {
  return Text.class;
}
Finally, implementations can optionally record and report statistics about the data they are serializing and deserializing:
@Override
public SerDeStats getSerDeStats() {
  // deserializedByteCount is incremented as rows are deserialized
  stats.setRawDataSize(deserializedByteCount);
  return stats;
}
}
This class depends on the Hadoop and Hive libraries, which can be found in $HADOOP_HOME/ and $HIVE_HOME/lib/. Compile this class and package it into a standard JAR file. You can use your preferred Java development environment or simply use the following commands:

$ javac \
    -cp ${HADOOP_HOME}/'*':${HIVE_HOME}/lib/'*' \
    *.java
$ mkdir -p com/leeriggins/hive/
$ mv *.class com/leeriggins/hive/
$ jar cf udf.jar com
We can now use the SerDe to define a new table. The first step is dynamically adding our new jar file to the Hive classpath:
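For example, from the Hive shell (the jar path here is an assumption; adjust it to wherever the jar was built):

hive> ADD JAR udf.jar;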
Now, we can create our tables and copy the data from the input table to the output table:
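A minimal sketch of those statements; the table and column names are hypothetical, so substitute your own:

hive> CREATE TABLE output_table (key1 STRING, key2 STRING)
    > ROW FORMAT SERDE 'com.leeriggins.hive.ColumnarMapSerDe';
hive> INSERT OVERWRITE TABLE output_table
    > SELECT key1, key2 FROM input_table;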
Theory behind it
The central part of this example is our implementation of the ColumnarMapSerDe class, which implements the SerDe interface. Our class describes how to transform a Hadoop record into the columns of a Hive table.
The two main types involved with serialization and deserialization are Writable and ObjectInspector. Writable is a Hadoop type used to represent keys and values in MapReduce jobs. An ObjectInspector is a Hive type containing the necessary logic for converting between the various Hive representations of data and the more standard Java and Hadoop types.
There are multiple ObjectInspector implementations for each of the primitive and complex Hive types. These can be lazy or not, and can be backed by Hadoop Writable objects or by standard Java classes. Generally, using the lazy versions or the versions backed by Writable objects can be more efficient; however, using these object inspectors efficiently is more complicated than using the standard Java object inspectors.
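For example, a value can be read through a Writable-backed string inspector like this (a standalone sketch, not part of the SerDe class itself):

StringObjectInspector soi =
    PrimitiveObjectInspectorFactory.writableStringObjectInspector;
Text raw = new Text("hello");
// either as a plain Java String...
String javaValue = soi.getPrimitiveJavaObject(raw);
// ...or as the underlying Hadoop Writable
Text writableValue = soi.getPrimitiveWritableObject(raw);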
The complex Hive types all contain nested object inspectors describing the types of their contents. Primitive Hive types are all represented by subtypes of PrimitiveObjectInspector.
Object inspectors should never be created directly; instead, Hive provides the ObjectInspectorFactory and PrimitiveObjectInspectorFactory classes that may be used to create instances.
First, we need to initialize our SerDe. Hive strongly encourages reusing objects to reduce the need for garbage collection. Therefore, we will precompute as much as possible on initialization and store this information in instance variables.
In the initialize method, Hive will pass us the Hadoop configuration and table properties so we can configure our internal state for the specific table. We can get the names and types of each of the columns from the table properties. Since we are modeling a map of strings to strings, we will throw an exception if any of the columns are not strings. Any other SerDe properties, such as the input.regex used by the RegexSerDe, are also available from this table properties object.
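A sketch of reading the column names inside initialize; LIST_COLUMNS comes from the same Hive constants class as the LIST_COLUMN_TYPES property used earlier:

// column names arrive as a single comma-separated string
columnNames = Arrays.asList(
    tableProperties.getProperty(LIST_COLUMNS).split(","));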
For this SerDe, we want to store the column names so we can later extract the appropriate values from each row. We also want to create the ObjectInspector that describes this table. We will return a single StructObjectInspector whose fields correspond to the columns of our table. Each field should be a string, so we will use StringObjectInspectors.
Finally, we will initialize the instance variables that we will use during serialization and deserialization. Hive asks for the Writable class and ObjectInspector used by our SerDe by calling getSerializedClass and getObjectInspector.
When serializing a row, we are given an object containing that row's data plus the ObjectInspector necessary to read the data from the object. Since this object represents an entire row, the object inspector will be a StructObjectInspector whose fields correspond to the columns of the table. The StructObjectInspector also contains a nested ObjectInspector for each field; in this case, each will be a StringObjectInspector.
So, our serialize method needs to use each of these nested object inspectors to read each field, combine the data with the name of the column that we read at initialization time, and build a map string, as sketched below. In this case, we have hard-coded the \001 and \002 delimiters; however, these could also have been determined at initialization time from table properties.
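A minimal sketch of that loop, picking up where the serialize excerpt above left off (variable names follow the excerpt; error handling is omitted):

for (int i = 0; i < structFields.size(); i++) {
  StructField field = structFields.get(i);
  // nested inspector for this column; initialize() guarantees strings
  StringObjectInspector fieldOI =
      (StringObjectInspector) field.getFieldObjectInspector();
  Object fieldData = structOI.getStructFieldData(obj, field);
  String value = fieldOI.getPrimitiveJavaObject(fieldData);
  if (i > 0) {
    builder.append('\001');  // separates key-value entries
  }
  builder.append(columnNames.get(i));
  builder.append('\002');    // separates each key from its value
  builder.append(value);
}
return new Text(builder.toString());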
The deserialize method reverses the serialization process. In this case, we want to convert the Writable object passed to us into a row. Deserialization is somewhat simpler, as we can just return a List of Strings. Hive will use the ObjectInspector we return from getObjectInspector to convert this value into whatever internal representation it may decide to use.
Because we returned the Text class from the getSerializedClass method, we know that we can convert the incoming Writable into text, split it by our delimiters, and extract the values corresponding to our column names, as sketched below.
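A minimal sketch of that logic, continuing the deserialize excerpt above (it assumes the same hard-coded delimiters; a production version would validate the input):

// split the serialized map into entries, then into key-value pairs
for (String entry : content.split("\001")) {
  String[] pair = entry.split("\002", 2);
  if (pair.length == 2) {
    rowMap.put(pair[0], pair[1]);
  }
}
// emit values in column order so they match the StructObjectInspector
for (String columnName : columnNames) {
  rowFields.add(rowMap.get(columnName));
}
deserializedByteCount += text.getLength();
return rowFields;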
The deserialize method has one additional side effect, which is incrementing the number of bytes that we read during deserialization. These will be encapsulated in a SerDeStats object and returned to the framework in the getSerDeStats method. Tracking this information is optional; a SerDe may simply always return zero for the amount of deserialized data.