July 17-19 | Tokyo, Japan
Thursday, July 18 • 12:00 - 12:40
Schema-Less Columnar Storage Format Yosegi - Yasunori Oto, Yahoo Japan

Flexibility and performance are both critical for a big data analytics platform built on distributed systems.
For flexibility, it is essential to apply the schema at analysis time. A strategy called schema-on-read is widely used for this purpose. This strategy allows us to accumulate various types of data without worrying about how they will be analyzed. Additionally, it simplifies the data platform by removing the need for a schema synchronization system between the data collection tier and the data query tier. However, this method has a performance disadvantage because data formats of this kind are not optimized for analysis.

For performance, a columnar format can improve CPU and memory utilization when reading data. Apache ORC and Apache Parquet are well-known formats of this kind. These formats adopt the schema-on-write strategy. This strategy is optimal for querying, but it requires the schema to be specified before the data is stored. As a result, we must already know how we intend to analyze the data when we design the schema. To alleviate this, we can use a map type for input data. However, this spoils the effect of projection push-down, because these formats deserialize all the data contained in the map type.
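
The map-type trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the flat key/value arrays with row offsets are a simplified stand-in and not the actual ORC or Parquet encoding, and every name here (`rows`, `read_from_map`, `read_from_struct`) is hypothetical.

```python
# Simplified model only: NOT the real ORC/Parquet encoding.
rows = [
    {"os": "ios", "lang": "ja"},
    {"os": "android", "lang": "en"},
]

# Layout A: one map column, stored as flat key/value arrays plus row offsets.
map_keys, map_values, offsets = [], [], [0]
for attrs in rows:
    for key, value in attrs.items():
        map_keys.append(key)
        map_values.append(value)
    offsets.append(len(map_keys))


def read_from_map(key):
    """Return one key's values and the number of map entries scanned."""
    scanned, out = 0, []
    for start, end in zip(offsets, offsets[1:]):
        found = None
        for i in range(start, end):
            scanned += 1  # every entry of the map is visited
            if map_keys[i] == key:
                found = map_values[i]
        out.append(found)
    return out, scanned


# Layout B: each key is stored as its own column (struct-like layout).
struct_columns = {}
for row_index, attrs in enumerate(rows):
    for key, value in attrs.items():
        struct_columns.setdefault(key, [None] * len(rows))[row_index] = value


def read_from_struct(key):
    """Return one key's values and the number of values read."""
    column = struct_columns.get(key, [None] * len(rows))
    return column, len(column)


map_result = read_from_map("os")        # scans entries for all keys
struct_result = read_from_struct("os")  # reads only the "os" column
```

In this model, fetching one key from the map layout visits every key/value entry, while the struct-like layout reads a single column, which is the effect that projection push-down relies on.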

We developed "Yosegi" as a schema-less columnar format to obtain both benefits. At write time, the format constructs the namespace for the storage schema from the input data, interprets map-type data as struct-type data consisting of the keys contained in the map, and stores the data on a columnar basis. At read time, it assigns the fields of the analysis-time schema to the namespace and fetches the data by field name. These two design devices provide the flexibility of schema-on-read and the performance of schema-on-write. The format is available as Apache 2.0 licensed OSS at https://github.com/yahoojapan/yosegi.
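
The write and read steps described above can be sketched roughly as follows. This is a minimal illustration of the idea and not Yosegi's actual API or file layout: the functions `write_columns` and `read_columns`, the use of tuples as field paths, and the sparse per-column dictionaries are all assumptions made for the example.

```python
def _shred(value, path, row_index, columns):
    """Walk one value; maps (dicts) are treated as structs of their keys."""
    if isinstance(value, dict):
        for key, child in value.items():
            _shred(child, path + (key,), row_index, columns)
    else:
        # Leaf value: store it in the column named by its field path.
        columns.setdefault(path, {})[row_index] = value


def write_columns(records):
    """Build the column namespace from the input data; no schema is declared."""
    columns = {}
    for row_index, record in enumerate(records):
        _shred(record, (), row_index, columns)
    return columns, len(records)


def read_columns(columns, row_count, schema):
    """Fetch only the field paths named in the analysis-time schema."""
    result = {}
    for path in schema:
        stored = columns.get(path, {})  # unknown fields yield all-None columns
        result[path] = [stored.get(i) for i in range(row_count)]
    return result


records = [
    {"user": "a", "event": {"type": "click", "x": 10}},
    {"user": "b", "event": {"type": "view"}, "referrer": "top"},
]
columns, row_count = write_columns(records)
projected = read_columns(
    columns, row_count, [("user",), ("event", "x"), ("missing",)]
)
```

Here the two records have different shapes, yet both are written without a predeclared schema, and the reader touches only the columns it asks for. Fields absent from a row, or from the data entirely, come back as `None`.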

In this presentation, we introduce how the format writes and reads data, compare its performance with other columnar formats, and explain how we leverage the format in our data platform.


Yasunori Oto

Data Engineer, Yahoo Japan

Thursday July 18, 2019 12:00 - 12:40 JST
Hall B (3) (Floor 4F)