Efficient serialization of HDF5 data


Efficient serialization of HDF5 data

Michaël Melchiore

Dear HDF experts,

I am building an application that operates on NetCDF data using Big Data technologies.

My design aims to avoid writing data to disk unnecessarily. Instead, I want to operate in memory as much as possible. The challenge is data (de)serialization for distributed communication between computing nodes.

Since NetCDF4 and HDF5 already provide a portable data format, a simple and efficient design would access the raw binary data directly and exchange it over the network.

Currently, I cannot access this buffer without creating files. I am investigating the use of the Apache Commons VFS RAM file system to trick NetCDF into working in memory.

However, a suggestion on the NetCDF Java mailing list (see ticket MQO-415619) was to build an alternative to the core driver. I feel this is the more desirable course of action, as it improves the existing solutions instead of working around their limitations.

Do you think this approach is feasible? Any starting pointers would be appreciated!

Kind regards,

Michaël

_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Re: Efficient serialization of HDF5 data

Andrey Paramonov
Hello Michaël!

In reply to Michaël Melchiore's message of 04.12.2017 21:23.

I am not a distinguished expert in HDF5, but I would venture to suggest
that you check
https://www.hdfgroup.org/downloads/spark-connector/
It would be great if you could share your experience and whether the Spark
connector helped you implement in-memory processing.

Best wishes,
Andrey Paramonov


Re: Efficient serialization of HDF5 data

Michaël Melchiore
Dear Andrey,

While Apache Spark does aim to work in memory when possible, my need is not related to Spark. There are many alternatives to Spark that can be used for in-memory processing (Apache Storm, Apache Flink, Google Dataflow...).
I have registered for more information regarding the Spark Connector, but I am not sure it is what I am looking for.

Kind regards,

Michaël

In reply to the message from Андрей Парамонов <[hidden email]> of 2017-12-05 15:11 GMT+01:00.

Re: Efficient serialization of HDF5 data

Martijn Jasperse
Dear Michaël,
Have you tried using the core driver with a file image? It seems to me that this is what you want to do; see H5Pset_file_image. This lets you "open" file data that is already in memory and, once you have finished your operations, retrieve it again with H5Fget_file_image.

We have previously used this for networked HDF5-based data transfer; admittedly with small data instead of big data, but the disk access overhead was unacceptable in that case too.

Cheers,
Martijn
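
A minimal sketch of the file-image round trip described above, written against the h5py Python bindings rather than the C API. This code is not from the thread. It assumes an h5py build whose HDF5 library supports file images; the function names `serialize` and `deserialize`, the dataset name `payload`, and the generated file names are hypothetical and chosen only for illustration.

```python
import uuid

import h5py
import numpy as np


def serialize(array):
    """Build an HDF5 file in memory and return its raw bytes."""
    # driver="core" keeps the whole file in RAM; backing_store=False means
    # nothing is written to disk on close. The name is only a label.
    name = f"{uuid.uuid4().hex}.h5"  # hypothetical placeholder name
    with h5py.File(name, "w", driver="core", backing_store=False) as f:
        f.create_dataset("payload", data=array)
        f.flush()  # make sure the in-memory image is complete
        # Wraps H5Fget_file_image: returns the file contents as bytes.
        return f.id.get_file_image()


def deserialize(image):
    """Open HDF5 bytes received from elsewhere without touching disk."""
    fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
    fapl.set_fapl_core(backing_store=False)  # core (in-memory) driver
    fapl.set_file_image(image)               # wraps H5Pset_file_image
    # The name must not refer to an existing file, so a random one is used.
    name = f"{uuid.uuid4().hex}.h5".encode()
    fid = h5py.h5f.open(name, h5py.h5f.ACC_RDONLY, fapl=fapl)
    with h5py.File(fid) as f:
        return f["payload"][...]  # copy the data out before closing


if __name__ == "__main__":
    original = np.arange(12, dtype="f8").reshape(3, 4)
    image = serialize(original)      # bytes that could be sent over a socket
    restored = deserialize(image)
    print(len(image), np.array_equal(original, restored))
```

The bytes returned by `serialize` are a complete HDF5 file, so the receiving node needs only the HDF5 library to read them; no custom wire format is involved.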

In reply to the message from Michaël Melchiore <[hidden email]> of 6 December 2017 at 03:43.

Re: Efficient serialization of HDF5 data

Michaël Melchiore
Dear Martijn,

Yes, this is very promising. Thank you for bringing this to my attention.

Michaël

In reply to the message from Martijn Jasperse <[hidden email]> of 2017-12-05 21:34 GMT+01:00.

Re: Efficient serialization of HDF5 data

Ed Hartnett
In reply to this post by Michaël Melchiore
Have you looked at "diskless" files in netCDF? They are created in memory.

Also have a look at netCDF's support for DAP. Perhaps what you want is to read a diskless file through DAP. I'm not sure if that is possible...

Ed Hartnett
