Design a storage system that encodes 24 information bits on 8 disks of 4 bits each, such that:

1. Combining the 8*4 bits into a 32-bit number (taking one nibble from each disk), a function f from 24 bits to 32 bits can be computed using only 5 operations, each drawn from the set {+, -, *, /, %, &, |, ~} (addition, subtraction, multiplication, integer division, modulo, bitwise-and, bitwise-or, and bitwise-not) on variable-length integers. In other words, if every operation takes a nanosecond, the function can be computed in 5 nanoseconds.
2. One can recover the original 24 bits even after any 2 of the 8 disks crash (making them unreadable and hence losing 2 nibbles).
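Requirement 2 is an erasure-coding property: the encoding must remain injective after any two nibbles are dropped. The following is a small Python sketch (not a solution to the puzzle) that brute-force checks this property for a candidate encoding function; the toy candidates used below, identity and triple replication of an 8-bit message, are illustrative assumptions.

```python
from itertools import combinations

NIBBLES = 8  # one 4-bit nibble per disk

def split_nibbles(word32):
    """Split a 32-bit codeword into 8 nibbles, one per disk."""
    return [(word32 >> (4 * i)) & 0xF for i in range(NIBBLES)]

def survives_erasures(f, messages, erased):
    """True if no two distinct messages collide on the surviving nibbles."""
    seen = {}
    for m in messages:
        nibbles = split_nibbles(f(m))
        key = tuple(n for i, n in enumerate(nibbles) if i not in erased)
        if key in seen and seen[key] != m:
            return False
        seen[key] = m
    return True

def check_code(f, messages):
    """A code tolerates any 2 disk crashes iff every 2-erasure pattern is injective."""
    return all(survives_erasures(f, messages, set(pair))
               for pair in combinations(range(NIBBLES), 2))

# Identity placement fails: erasing the two disks that hold the data loses it.
print(check_code(lambda m: m, range(256)))
# Triple replication of an 8-bit message survives, since each nibble
# lives on 3 disks and at most 2 are erased.
print(check_code(lambda m: m | (m << 8) | (m << 16), range(256)))
```

For the real 24-bit puzzle the same checker applies, though exhaustively enumerating all 2^24 messages per erasure pattern is much slower.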
A. Caching and prefetching
A storage system consists of tiers of devices with varying cost and performance. The top tier has the smallest capacity and is the fastest and most expensive. Fully utilizing the top tier is the key to minimizing latency while keeping cost down. Kroeger et al. proposed caching frequently accessed data, and prefetching data to be accessed in the near future, into the top tier [1].

B. Data-semantic-aware devices
Physical devices that understand the specific applications issuing IO requests to them can organize data to minimize the number of accesses and each access's latency. Sivathanu et al. proposed database-aware storage [2]. The storage system snooped the write-ahead log of the database system to accurately infer the evolving access pattern. It also gathered statistics such as query access times, correlations between tables/indexes, and the number of queries on tables over a period of time to devise a caching scheme. Arpaci-Dusseau et al. applied a similar idea to construct a file-system-aware disk, which used statistics similar to those proposed by Sivathanu et al., along with file-system-specific directory/inode structure, to infer applications' view of data blocks [3]. More transparency between upper-level applications and physical devices is the most direct way to construct an intelligent storage system, but it incurs significant costs: complicated design and extra hardware overhead.

C. Device-characteristics-aware applications
This approach tackles the opacity between application and device from the direction opposite to that described in section B. Upper-level applications are developed with assumptions about specific hardware's behavior. Schindler et al. proposed embedding knowledge of the disk's physical geometry in applications' algorithms to align related accesses within a disk track and minimize access latency [4]. This works well until the hardware design becomes obsolete.

D. Machine learning approach
We can apply machine learning to allow the storage system to infer data semantics without the transparency between higher-level applications and physical devices described in section B. Wildani et al. applied k-means to IO access history to identify upper-level applications' working sets [5]. Our proposal also uses a clustering algorithm to learn applications' access patterns, but incorporates an additional inferred feature. We also explored multiple runs of the clustering algorithm on the same data set to learn the time domain and spatial domain separately.
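To make the clustering step concrete, the following is a minimal pure-Python k-means sketch (not the authors' or Wildani et al.'s implementation); the feature choice of (block address, access time) and the toy trace are assumptions for illustration only.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over feature tuples, e.g. (block address, access time)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:  # keep the old center if the cluster emptied
                centers[c] = tuple(sum(xs) / len(members)
                                   for xs in zip(*members))
    return assign, centers

# Toy IO trace: a hot low-address region, then a scan at high addresses.
trace = [(addr, t) for t, addr in
         enumerate([10, 12, 11, 13, 900, 902, 901, 903])]
labels, _ = kmeans(trace, k=2)
# Accesses to nearby blocks end up in the same cluster (working set).
```

Running such a clustering twice, once on the time feature and once on the address feature, is one plausible way to realize the separate time-domain and spatial-domain runs described above.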