Azure, DataLake, Spark, Hadoop suggestions....

Rowe, Jim

Hello HDF Gurus,

We are doing some machine learning work against HDF5 data (several hundred files, 5-50GB each).

 

We are looking for others who may have blazed, or may be blazing, this trail.  We are in Azure, using Microsoft Data Lake storage, and are working through how to read the data into RDDs for use in Spark.

 

We have been working with h5py, but we are running into issues where we cannot access files that MS exposes via the “adl://” URI; our assumption is that, however that scheme is implemented, it does not translate to a filesystem the underlying HDF5 library can read.  Our best option so far is to copy the files locally, which introduces an extra step and delay in the process.
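For the copy-then-read workaround, one pattern is to pull each dataset into memory in bounded row chunks and hand those chunks to Spark. The sketch below is illustrative, not from the thread: the file path, dataset name, and chunk size are assumptions, and it presumes the file has already been copied off ADL (h5py cannot open adl:// URIs directly).

```python
# Sketch: chunked read of a local HDF5 copy, feeding the pieces to Spark.
# The helper is pure Python; the h5py/PySpark usage below it is commented
# out and assumes illustrative names (file path, dataset path, a live
# SparkContext `sc`).

def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row ranges covering n_rows rows in chunk_size steps."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Hypothetical usage (not executed here):
#
# import h5py
# with h5py.File("/local/copy/data.h5", "r") as f:   # local copy, not adl://
#     dset = f["/some/dataset"]                      # illustrative dataset name
#     pieces = [dset[s:e] for s, e in chunk_bounds(dset.shape[0], 100_000)]
# rdd = sc.parallelize(pieces)                       # sc: existing SparkContext
```

For the larger (50 GB) files one would likely parallelize the (start, stop) ranges themselves and have each executor open its own local copy, rather than materializing all slices on the driver; the slicing helper is the same either way.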

 

If anyone has suggestions or insights on how to architect a cloud solution as roughly described, we would love to talk to you.  We are also potentially looking for some paid consulting help in this area if anyone is interested.

 

 

Warm regards,

--Jim


_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Re: Azure, DataLake, Spark, Hadoop suggestions....

Gerd Heber

Jim, do you need bare-bones RDDs or one of the more structured types (Spark DataFrame, Dataset)?

How about loading the data via HDF5/JDBC?

 

G.

 

From: Hdf-forum [mailto:[hidden email]] On Behalf Of Rowe, Jim
Sent: Monday, January 30, 2017 9:23 AM
To: HDF Users Discussion List <[hidden email]>
Cc: Smith, Jacob <[hidden email]>
Subject: [Hdf-forum] Azure, DataLake, Spark, Hadoop suggestions....

 


Re: Azure, DataLake, Spark, Hadoop suggestions....

Gerd Heber
Jake, are you using 100%, 80%, 60%, ... of the data that you'd be copying?
If you were using just a fraction (< 20%), copying all those files sounds like a waste.

[OK, I'm peddling HDF5/JDBC server here...]

With HDF5/JDBC server you could:

1. Limit (SELECT) the amount of data brought in over the network.
2. With something like Sqoop, save the data in any big-data format you like.
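A rough sketch of what point 1 could look like from the Spark side. The JDBC URL, table, and column names here are hypothetical, and the HDF5/JDBC server's actual URL scheme and SQL dialect should be checked before use.

```python
# Sketch: push a SELECT down to the server so only the needed rows cross
# the network, by wrapping the query as the "dbtable" of a Spark JDBC read.
# URL, table, and columns are hypothetical placeholders.

def jdbc_options(url, query):
    """Option map for a Spark JDBC read, wrapping the query as a subselect."""
    return {
        "url": url,
        "dbtable": "({}) AS pushed".format(query),
        "fetchsize": "10000",
    }

opts = jdbc_options(
    "jdbc:hdf5://server:1234/store",                       # hypothetical URL
    "SELECT t, x, y FROM measurements WHERE run_id = 42",  # hypothetical query
)

# Hypothetical Spark usage (not executed here):
# df = spark.read.format("jdbc").options(**opts).load()
```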

G.

________________________________________
From: Smith, Jacob <[hidden email]>
Sent: Monday, January 30, 2017 11:19:55 AM
To: Gerd Heber; HDF Users Discussion List
Subject: RE: Azure, DataLake, Spark, Hadoop suggestions....

Gerd,

Thanks for the response!  My name is Jake Smith and I will be working on this cloud solution.  Currently, our HDF5 files are in Data Lake, and we use a Python Jupyter notebook on Azure's HDInsight with a Spark cluster.  We want to load our data from HDF5 into an H2O frame to build additional models, using Sparkling Water (the integration of H2O and Spark).  Since h5py (a Python module) doesn't seem to facilitate remote querying of HDF5 files (though I'm not sure whether that is a characteristic of HDF5 itself or of this Python client), we are wondering whether it is a good idea to download these files to the Spark cluster before transforming them into RDDs.
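For reference, the download-then-convert hand-off described above might look roughly like the sketch below. Only the small column-name helper runs as-is; the commented pipeline assumes illustrative file and dataset names, and the pysparkling calls are an assumption to be checked against the Sparkling Water documentation, not a tested recipe.

```python
# Sketch of the HDF5 -> Spark DataFrame -> H2O frame hand-off. The helper
# below is executable; the pipeline in the comments uses hypothetical
# names and the pysparkling API as recalled, so verify before relying on it.

def h2o_column_names(n_cols, prefix="c"):
    """Simple column names for the Spark DataFrame handed to H2O."""
    return ["{}{}".format(prefix, i) for i in range(n_cols)]

# Hypothetical pipeline (not executed here):
#
# import h5py
# from pysparkling import H2OContext
#
# with h5py.File("/local/copy/data.h5", "r") as f:   # local copy of one file
#     arr = f["/some/dataset"][:]                    # illustrative dataset
# df = spark.createDataFrame(arr.tolist(), h2o_column_names(arr.shape[1]))
# hc = H2OContext.getOrCreate(spark)
# h2o_frame = hc.as_h2o_frame(df)                    # feed H2O model building
```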


Re: Azure, DataLake, Spark, Hadoop suggestions....

Miller, Mark C.
In reply to this post by Rowe, Jim

I am woefully ignorant of most of this technology, so I will just suggest that maybe one of these might help:

 

https://github.com/LLNL/spark-hdf5

 

http://www.nersc.gov/users/data-analytics/data-management/i-o-libraries/hdf5-2/h5spark/

 

Mark

 

 

"Hdf-forum on behalf of Rowe, Jim" wrote:

 

Hello HDF Gurus,

We are doing some machine learning work against HDF5 data (several hundred files, 5-50GB each).

 

We are looking for others who may have blazed or been blazing this trail.  We are in Azure using Microsoft DataLake storage and working through trying to read data into RDDs for use in Spark.

 

We have been working with h5py, but running into issues where we cannot access files that MS exposes using the “adl://” URI—our assumption is that however that is implemented, it does not translate to a filesystem the underlying HDF5 libraries can read (?).   Our best option so far is to copy the files locally, which introduces an extra step and delay in the process.

 

If anyone has suggestions or insights on how to architect a cloud solution as roughly described, we would love to talk to you.  We are also potentially looking for some paid consulting help in this area if anyone is interested.

 

 

Warm regards,

--Jim


_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
Loading...