Eclipse DataVec - ETL & vectorization library (Pandas for the JVM)

DataVec solves one of the most important obstacles to effective machine or deep learning: getting data into a format that neural nets can understand. Nets understand vectors, so vectorization is the first problem many data scientists have to solve before they can train their algorithms on data. DataVec is intended to handle nearly all of your data transformations; if you are not sure whether it applies to your case, please ask on the Gitter channel. DataVec supports most common data formats out of the box, and you can also implement your own custom record reader.

If your data is in CSV (Comma Separated Values) format, stored in flat files that must be converted to numeric form and ingested, or if your data is a directory structure of labelled images, then DataVec is the tool to help you organize that data for use in DeepLearning4J.

Introductory Video

This video describes the conversion of image data to a vector.

Key Aspects

  • DataVec uses an input/output format system. This is similar in some ways to how Hadoop MapReduce uses InputFormat to determine InputSplits and RecordReaders; DataVec likewise provides RecordReaders to serialize data
  • Designed to support all major types of input data (text, CSV, audio, image and video) with these specific input formats
  • Uses an output format system to specify an implementation-neutral type of vector format (ARFF, SVMLight, etc.)
  • Can be extended for specialized input formats (such as exotic image formats); that is, you can write your own custom input format and let the rest of the codebase handle the transformation pipeline
  • Makes vectorization a first-class citizen
  • Built-in transformation tools to convert and normalize data
  • See the DataVec Javadoc for API details

A Few Examples

  • Convert the CSV-based UCI Iris dataset into svmLight open vector text format
  • Convert the MNIST dataset from raw binary files to the svmLight text format
  • Convert raw text into the Metronome vector format
  • Convert raw text into TF-IDF based vectors in a text vector format {svmLight, metronome, arff}
  • Convert raw text into word2vec vectors in a text vector format {svmLight, metronome, arff}
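
To make the first example above concrete, here is a minimal plain-Java sketch of what converting one Iris CSV row into an svmLight line involves. It does not use the DataVec API; the class name `SvmLightSketch`, the method `toSvmLight`, and the choice to map each label to its position in a fixed list are assumptions made purely for illustration. The svmLight text format writes a target value followed by 1-based `index:value` pairs and normally omits zero-valued features.

```java
import java.util.Arrays;
import java.util.List;

class SvmLightSketch {
    // Assumed label ordering: the integer class id is the position in this list.
    static final List<String> LABELS =
            Arrays.asList("Iris-setosa", "Iris-versicolor", "Iris-virginica");

    /** Converts one CSV row (numeric features, label in the last column) to an svmLight line. */
    static String toSvmLight(String csvLine) {
        String[] fields = csvLine.trim().split(",");
        String label = fields[fields.length - 1].trim();
        int classId = LABELS.indexOf(label);
        if (classId < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        StringBuilder sb = new StringBuilder().append(classId);
        for (int i = 0; i < fields.length - 1; i++) {
            double value = Double.parseDouble(fields[i].trim());
            if (value != 0.0) {
                // svmLight is a sparse format: indices are 1-based and zeros are skipped.
                sb.append(' ').append(i + 1).append(':').append(value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: 0 1:5.1 2:3.5 3:1.4 4:0.2
        System.out.println(toSvmLight("5.1,3.5,1.4,0.2,Iris-setosa"));
    }
}
```

In a real pipeline the reading and writing would be handled by DataVec's record readers and output formats rather than by hand-written string handling like this.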

Targeted Vectorization Engines

  • Any CSV to vectors with a scriptable transform language
  • MNIST to vectors
  • Text to vectors
    • TF-IDF
    • Bag of Words
    • word2vec
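
As a rough illustration of the two simpler text vectorizations listed above, the following plain-Java sketch builds bag-of-words and TF-IDF vectors for a small set of documents. It is not DataVec code: the class `TfIdfSketch`, its method names, the naive tokenizer, and the specific weighting (raw term count multiplied by ln(N / df)) are all assumptions chosen for brevity, and real engines typically offer several tokenization and weighting variants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class TfIdfSketch {
    /** Lowercases and splits on non-letter characters; a deliberately naive tokenizer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Sorted vocabulary, so that each word gets a stable column index. */
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    /** Bag of words: raw term counts, one entry per vocabulary word. */
    static double[] bagOfWords(String doc, List<String> vocab) {
        double[] counts = new double[vocab.size()];
        for (String t : tokenize(doc)) {
            int idx = vocab.indexOf(t);
            if (idx >= 0) {
                counts[idx]++;
            }
        }
        return counts;
    }

    /** TF-IDF with tf = raw count and idf = ln(N / df), where df counts documents containing the word. */
    static double[] tfIdf(String doc, List<String> docs, List<String> vocab) {
        double[] vec = bagOfWords(doc, vocab);
        for (int i = 0; i < vec.length; i++) {
            int df = 0;
            for (String d : docs) {
                if (tokenize(d).contains(vocab.get(i))) {
                    df++;
                }
            }
            vec[i] = (df == 0) ? 0.0 : vec[i] * Math.log((double) docs.size() / df);
        }
        return vec;
    }
}
```

With the two documents "The cat sat" and "The dog sat down", the words "the" and "sat" appear in both documents and therefore get a TF-IDF weight of zero, while "cat" keeps a positive weight in the first document. This is the property that makes TF-IDF more discriminative than raw counts.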

CSV Transformation Engine

If your data is numeric and appropriately formatted, then CSVRecordReader may be sufficient. If, however, your data has non-numeric fields, such as strings representing booleans (T/F) or strings for labels, then a schema transformation is required. DataVec uses Apache Spark to perform transform operations. Note that you do not need to know the internals of Spark to be successful with DataVec transforms.
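
The kind of work a schema transformation does can be sketched in plain Java as follows. This is only a conceptual illustration and does not use DataVec's transform API or Spark; the three-column layout (a numeric amount, a T/F flag, a string label), the label list, and the names `SchemaTransformSketch`, `transform`, and `parseFlag` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SchemaTransformSketch {
    // Hypothetical schema: column 0 is numeric, column 1 is "T"/"F", column 2 is a label.
    static final List<String> LABELS = Arrays.asList("low", "medium", "high");

    /** Maps one raw CSV row to an all-numeric vector according to the assumed schema. */
    static double[] transform(String csvLine) {
        String[] fields = csvLine.split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 columns, got " + fields.length);
        }
        double amount = Double.parseDouble(fields[0].trim());
        double flag = parseFlag(fields[1].trim());
        int label = LABELS.indexOf(fields[2].trim());
        if (label < 0) {
            throw new IllegalArgumentException("Unknown label: " + fields[2].trim());
        }
        return new double[] {amount, flag, label};
    }

    /** Converts the string boolean "T" or "F" to 1.0 or 0.0. */
    static double parseFlag(String s) {
        if (s.equals("T")) {
            return 1.0;
        }
        if (s.equals("F")) {
            return 0.0;
        }
        throw new IllegalArgumentException("Expected T or F, got: " + s);
    }

    public static void main(String[] args) {
        // Prints: [12.5, 1.0, 2.0]
        System.out.println(Arrays.toString(transform("12.5,T,high")));
    }
}
```

Declaring the schema up front means malformed rows fail loudly with a clear message instead of silently producing bad vectors, which is one reason to describe the columns explicitly before transforming them.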

Schema Transformation Video

A video tutorial of a simple DataVec transform along with code is available below.

Chris Nicholson

Chris Nicholson is the CEO of Skymind. He previously led communications and recruiting at the Sequoia-backed robo-advisor, FutureAdvisor, which was acquired by BlackRock. In a prior life, Chris spent a decade reporting on tech and finance for The New York Times, Businessweek and Bloomberg, among others.
