Java enabled network file storage system

2012-01-12 @ admin

It is widely held that storing files on the Internet has great potential for the computing community. Compared with local storage, network storage offers several advantages: effectively unlimited capacity, easier management of distributed files, and greater reliability and security. In my opinion, the best online file storage system must have two characteristics:

  1. it should have massive storage capacity to support a multitude of users;
  2. it should provide high performance for all users.

No such file storage system exists yet. For example, an FTP server’s storage capacity is typically about 100 Gbytes. If each user is allocated 1 Gbyte of space, one FTP server can support only 100 users, which is too few. As for performance, the service is acceptable for users near the FTP server, but for users far from it performance is poor because of limited network bandwidth and high latency.

As another example, Xdrive provides a file storage service on the Internet. It stores files in a cluster file system, which can support massive storage, but the performance problem remains. Because the service is centralized, its performance depends heavily on fluctuating Internet conditions, so the average performance seen by general users is inevitably poor.

A CSFS (cryptographic storage file system) has two design objectives:

  1. having massive file storage capacity;
  2. providing acceptable performance for all users.

Its designers introduce a star-like architecture that consolidates many distributed file servers into one integral system. Massive storage capacity comes from the aggregated capacity of all file servers. High performance comes from proximity-based serving policies, e.g. creating files on the closest file server.
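A proximity-based policy of this kind can be sketched as follows. This is only an illustration, not the CSFS implementation: the `FileServer` record, its fields, and the use of measured round-trip time as the notion of "closest" are all assumptions made for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical proximity-based policy: pick the nearest server that can hold the file. */
final class ClosestServerPolicy {
    /** Returns the server with the lowest round-trip time among those with enough free space. */
    static Optional<FileServer> select(List<FileServer> servers, long fileSizeBytes) {
        return servers.stream()
                .filter(s -> s.freeBytes() >= fileSizeBytes)
                .min(Comparator.comparingDouble(FileServer::roundTripMillis));
    }

    public static void main(String[] args) {
        List<FileServer> servers = List.of(
                new FileServer("fs-far", 180.0, 50_000_000_000L),
                new FileServer("fs-near", 12.5, 20_000_000_000L),
                new FileServer("fs-full", 3.0, 0L)); // closest, but has no free space
        // Prints "fs-near": fs-full is filtered out, and fs-near beats fs-far on latency.
        System.out.println(select(servers, 1_000_000L).map(FileServer::id).orElse("none"));
    }
}

/** Assumed view of a file server: an id, a measured round-trip time, and free capacity. */
record FileServer(String id, double roundTripMillis, long freeBytes) {}
```

Filtering on free space before ranking by distance keeps the sketch from choosing a nearby server that cannot actually store the file.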

In a CSFS, file directories are stored in a name server, whereas file data are stored in file servers. File names are location independent, so the migration of files between file servers is transparent to users. The designers also created a user-oriented multi-namespace directory scheme to make file organization more effective. In this scheme, each user has a private directory tree that is invisible to other users.
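To make the idea concrete, here is a minimal sketch of a name server that keeps one private directory tree per user and maps each location-independent name to the file server currently holding the data. The class and method names are hypothetical and the trees are flattened to path-to-server maps for brevity; the real CSFS data structures are not described in this post.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical name server: per-user namespaces mapping logical paths to file server ids. */
final class NameServer {
    // user -> (logical path -> id of the file server currently storing the data)
    private final Map<String, Map<String, String>> namespaces = new HashMap<>();

    /** Registers a file in the user's private namespace. */
    void create(String user, String path, String serverId) {
        namespaces.computeIfAbsent(user, u -> new HashMap<>()).put(path, serverId);
    }

    /** Resolves a logical path within one user's namespace only. */
    Optional<String> resolve(String user, String path) {
        Map<String, String> tree = namespaces.get(user);
        return tree == null ? Optional.empty() : Optional.ofNullable(tree.get(path));
    }

    /** Moves a file to another server; the logical path the user sees does not change. */
    boolean migrate(String user, String path, String newServerId) {
        Map<String, String> tree = namespaces.get(user);
        if (tree == null || !tree.containsKey(path)) {
            return false;
        }
        tree.put(path, newServerId);
        return true;
    }

    public static void main(String[] args) {
        NameServer ns = new NameServer();
        ns.create("alice", "/docs/report.txt", "fs-1");
        ns.create("bob", "/docs/report.txt", "fs-2"); // same path, separate namespace
        ns.migrate("alice", "/docs/report.txt", "fs-3");
        System.out.println(ns.resolve("alice", "/docs/report.txt").orElse("missing")); // fs-3
        System.out.println(ns.resolve("bob", "/docs/report.txt").orElse("missing"));   // fs-2
    }
}
```

Because clients only ever hold the logical path, a migration changes one entry on the name server and nothing on the client side.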

They implemented CSFS 1.0 in pure Java. To examine its performance, developers modified the standard distributed file system benchmark AB to simulate multiple simultaneous users. The modified benchmark is named GAB, and they also implemented it in Java. The test results of CSFS 1.0 show:

  1. the performance of general file operations such as directory scanning and file reading is comparable with NFS;
  2. the performance of file upload and download is similar to FTP;
  3. the system architecture has good scalability—people can apply the system in a small-to-medium sized network to provide a network file storage service;
  4. the system can support many concurrent users—more than 450 users can be supported simultaneously.
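The post does not show how GAB simulates simultaneous users, so the following is only a sketch of one common approach: start one thread per simulated user, release them all at once with a latch, and time each user's workload. The class name and the placeholder workload are assumptions, not part of GAB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness that runs the same workload for many simulated users at once. */
final class ConcurrentUserBenchmark {
    /** Runs `users` copies of `workload` concurrently; returns each user's elapsed nanoseconds. */
    static List<Long> run(int users, Runnable workload) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        CountDownLatch startSignal = new CountDownLatch(1);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (int i = 0; i < users; i++) {
                Callable<Long> simulatedUser = () -> {
                    startSignal.await(); // wait so that all users begin together
                    long begin = System.nanoTime();
                    workload.run();
                    return System.nanoTime() - begin;
                };
                pending.add(pool.submit(simulatedUser));
            }
            startSignal.countDown(); // release every simulated user at once
            List<Long> elapsed = new ArrayList<>();
            for (Future<Long> f : pending) {
                elapsed.add(f.get());
            }
            return elapsed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder workload; a real benchmark would issue file system operations here.
        Runnable workload = () -> Math.sqrt(System.nanoTime());
        List<Long> elapsed = run(50, workload);
        double avgMillis = elapsed.stream().mapToLong(Long::longValue).average().orElse(0) / 1e6;
        System.out.println(elapsed.size() + " users, average " + avgMillis + " ms");
    }
}
```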

Early distributed file systems, e.g. NFS and Sprite, aimed at transparently accessing remote files. They did not consider how to organize multiple file servers to support massive file storage. If users needed multiple file servers, they had to configure multiple computers manually. File servers were independent of each other in such a situation.

AFS and Coda organized distributed file servers to form a unified system and to enable cooperation between file servers. From the angle of supporting network file storage, AFS and Coda have made more progress than early distributed file systems. However, there are still many limitations.

For example, in AFS a file name is location independent within a cell but not across cells, so files can migrate transparently only within a cell. This limits system performance optimization. In addition, AFS and Coda both adopt a single-namespace scheme: all users share one directory tree, which is too large for an individual user to manage their own files in.

Internet file systems such as UFO, Jade, ALEX and WebFS let applications transparently access heterogeneous file systems via the Internet. Like early distributed file systems, they do not address how to organize multiple file servers to support massive file storage.

Prospero is a global file system based on the Virtual System Model. In fact, CSFS 1.0 adopted the spirit of the Virtual System Model's naming scheme. However, Prospero is more like a directory system than a file system: its basic function is to resolve file names and direct users’ requests to the corresponding file servers. The CSFS goes further. In addition to name resolution, it also manages file storage, file server load balancing and system performance optimization, which makes it a more complete file system than Prospero.

Napster is an MP3 file-sharing service for users across the Internet. It establishes a centralized name server to store and resolve file names. Its architecture is similar to CSFS 1.0, but it does not manage file storage, whereas CSFS 1.0 does.

There are some Web services [7, 1] that consist of multiple Web servers and work in a similar way to CSFS 1.0. In these systems, a modified DNS server obtains relevant information (e.g. load) about all Web servers in the system and forwards users’ requests to the appropriate Web server based on that information. The DNS server plays a role similar to that of the name server in a CSFS.
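As a rough illustration of load-based forwarding, the sketch below keeps the latest load reported by each server and directs the next request to the least-loaded one. The class name, the host names, and the choice of a single integer as the load metric are assumptions for the example; the cited Web services may weigh other information as well.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical redirector that forwards each request to the least-loaded known server. */
final class LoadAwareRedirector {
    // server host -> most recently reported load (assumed metric, e.g. active connections)
    private final Map<String, Integer> reportedLoad = new ConcurrentHashMap<>();

    /** Records the latest load figure reported by a server. */
    void reportLoad(String host, int load) {
        reportedLoad.put(host, load);
    }

    /** Picks the host with the smallest reported load, if any server has reported. */
    Optional<String> pickServer() {
        return reportedLoad.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        LoadAwareRedirector redirector = new LoadAwareRedirector();
        redirector.reportLoad("www1.example.org", 40);
        redirector.reportLoad("www2.example.org", 7);
        redirector.reportLoad("www3.example.org", 22);
        System.out.println(redirector.pickServer().orElse("none")); // www2.example.org
        redirector.reportLoad("www2.example.org", 90); // www2 becomes busy
        System.out.println(redirector.pickServer().orElse("none")); // www3.example.org
    }
}
```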

iSCSI is a protocol for network data storage that carries the SCSI protocol over TCP/IP. Via iSCSI, applications can store and access data on the Internet transparently, just as if accessing local SCSI peripherals. A CSFS and iSCSI address network storage at different levels: a CSFS works at the level of files, whereas iSCSI works at the level of data blocks.

CSFS 1.0 proposed a star-like architecture that organizes file servers scattered across the Internet into a unified network file storage system. In CSFS 1.0, file names are stored on the name server, whereas file data are stored on the file servers. The main tasks of the name server are file name resolution and file server administration. The main tasks of the file server are file storage and file transfer. There are two merits of the architecture:

  1. file names become location independent;
  2. it benefits mobile computing.

Since the name server deals only with small amounts of directory information, the architecture also scales well.

The CSFS designers created a user-oriented multi-namespace scheme to manage file names, which makes file management more effective than a single namespace does. They designed various file server selection policies for load balancing and performance improvement, implemented CSFS 1.0 in Java, and tested it with a modified benchmark, GAB. The test results show that it delivers acceptable performance for general file operations; file upload and download perform as well as FTP; the load placed on the name server by file server management is light; and more than 450 users can be supported simultaneously. They conclude that CSFS 1.0 is a suitable network file storage system for small-to-medium-sized networks, such as a campus network or a county network.
