CSV data into HDF5 data structure and files

CSV data into HDF5 data structure and files

nitin chandra
Hi All,

I have been reading the HDF documentation and samples for a long time
(whenever possible), but I guess that until I get my toes wet I will
not gain even basic experience :).

I am on Linux Mint (64-bit), with Python and the HDF libraries already
installed. I need help with coding this in both Python and C++,
separately, from the command line.

Background : The first data set is a road alignment centre line (File1,
irregular interval); the second is a grade file (File2, irregular
interval).

Objective : File1 - File2 = TempFile1

TempFile1 :
contains all Km's in metres (at both intermediate and regular intervals)
and their respective heights.

I then want to plot TempFile1, overlaid on the data of File1 and File2.
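
The objective above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a solution tested on the real files: it assumes each file reduces to (distance in metres, height) pairs sorted by distance, that linear interpolation between neighbouring points is acceptable, and that TempFile1 is sampled at a regular step over the range both files cover. The function names, the sample numbers and the 10 m step are all made up.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Linearly interpolate the height at distance x.

    points: list of (distance_m, height_m) tuples sorted by distance.
    x must lie within the range covered by points.
    """
    distances = [d for d, _ in points]
    if not distances[0] <= x <= distances[-1]:
        raise ValueError(f"distance {x} is outside the data range")
    # Index of the first point strictly to the right of x.
    i = bisect_right(distances, x)
    if i == len(points):
        # x coincides with the last point.
        return points[-1][1]
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def difference_profile(align, grade, step):
    """Return [(distance, align_height - grade_height), ...] at a regular step."""
    # Only the stretch covered by both profiles can be compared.
    start = max(align[0][0], grade[0][0])
    end = min(align[-1][0], grade[-1][0])
    result = []
    n = 0
    while start + n * step <= end:
        x = start + n * step
        result.append((x, interpolate(align, x) - interpolate(grade, x)))
        n += 1
    return result


# Made-up sample data: (distance in m, height in m), irregular intervals.
align = [(0.0, 100.0), (15.0, 103.0), (40.0, 108.0)]
grade = [(0.0, 99.0), (25.0, 104.0), (40.0, 105.5)]

for distance, diff in difference_profile(align, grade, step=10.0):
    print(f"{distance:6.1f} m  {diff:+.3f} m")
```

With NumPy available, `numpy.interp` could replace the hand-written `interpolate` function; the pure-Python version is used here only to keep the sketch free of dependencies.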

Q) Will my working directory be the "/" directory ?

Q) Where and how do I create File1, File2 and TempFile1 ?

Q) What parameters do I need to set to write data into each file, and
then to read from it ?

Q) How do I edit data in a file directly, for example to add or remove
columns in each file ?

I understand that the calculation will be done at the code level (Python or C++).

I have sample data; please let me know when to post it.

Thank you

Nitin Chandra

_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Re: CSV data into HDF5 data structure and files

Francesc Altet-2

Hi Nitin,


I think that before getting into details, you need to look into how to efficiently read data from CSV files and write it into HDF5 in Python.  For this, pandas is a great library to use.  My advice is to have a look at the excellent documentation on the pandas website:


http://pandas.pydata.org/pandas-docs/stable/io.html


In particular, you want to use `pandas.read_csv()`, which is one of the fastest ways to read CSV files that I am aware of.  Also, for storing the data in HDF5, `pandas.HDFStore()` comes in handy because it can generate HDF5 files out of pandas DataFrames.  In addition, to avoid loading all the data into an in-memory DataFrame at once, you want to use the `chunksize` keyword, which lets you read the CSV file in chunks and store each chunk as it arrives.
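
As a rough illustration of that workflow (this is a separate sketch, not the attached csv_demo.py), the function below reads a CSV file in chunks and appends each chunk to an HDF5 table. The function name `csv_to_hdf5`, the key `"data"`, the file names and the chunk size are assumptions chosen for the example, and it needs PyTables installed alongside pandas.

```python
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, key="data", chunksize=100_000):
    """Copy a CSV file into an HDF5 table, one chunk of rows at a time.

    Returns the number of rows stored.
    """
    rows = 0
    # mode="w" starts a fresh file; the context manager closes it on exit.
    with pd.HDFStore(h5_path, mode="w") as store:
        # With chunksize set, read_csv yields DataFrames of at most
        # chunksize rows instead of one large DataFrame.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            # append() creates the table on the first call and extends it after.
            store.append(key, chunk)
            rows += len(chunk)
    return rows


if __name__ == "__main__":
    import os
    import tempfile

    import numpy as np

    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "demo.csv")
        h5_path = os.path.join(tmp, "demo.h5")
        # Small random CSV standing in for the real data.
        frame = pd.DataFrame(np.random.rand(1000, 3), columns=list("abc"))
        frame.to_csv(csv_path, index=False)
        print(csv_to_hdf5(csv_path, h5_path, chunksize=250), "rows stored")
        print(pd.read_hdf(h5_path, "data").shape)
```

One caveat for files with text columns: the width of a string column is fixed by the first chunk written, so a later chunk with longer strings may be rejected unless a width is reserved up front (the `min_itemsize` argument of `append()` is meant for that).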


I have prepared an example for you (attached) so that you can see how to use all of this (it is simpler than it may seem).  Here is the output on my machine:


$ python csv_demo.py
CSV creation time: 1.491 (67.092 Krow/s)
CSV reading time: 0.134 (748.360 Krow/s)
HDF5 store time: 0.322 (310.228 Krow/s)
HDF5 read time: 0.006 (15622.990 Krow/s)


So, once the data is stored in HDF5, reading it back is much faster than reading the CSV (as expected): roughly 20 times faster in this run.


HTH,


Francesc



Attachment: csv_demo.py (2K)

Re: CSV data into HDF5 data structure and files

nitin chandra
Thank you Francesc,

Please give me 2-3 days to try your example, do some reading and
run some tests based on the link you mentioned.

I shall repost soon.

Thank you

Nitin

In reply to Francesc Altet's post of 30 January 2017 at 17:14.

Re: CSV data into HDF5 data structure and files

nitin chandra
Hi Francesc,

I tried your example as is; I have not had time to modify it or try
something new.

I ran

$ python csv_demo.py

It created a CSV file with 10 columns, populated with random numbers.

demo.h5 was created, and I used HDFView 2.9 to look at its contents.

Two objects were created: a directory, "table", and a data table, also
"table".

In the data table there are 2 columns:

index   |   value_block_0

empty   | no value
no data | but 10 commas

So that I can relate your guidance to my problem, please find attached
2 sample files.
Also, note the first row in the attached CSVs: it was created to
initialise the start point of the data sequence. Would it be good
practice to have it in the h5 tables as well ? The last column has
string values, and I need them.

ALIGN data goes into File1 and GRADE data into File2, so I am looking
for a write function to write into the respective tables and a read
function to read from them.

After the data is in the H5 file, can I insert or append a new row
between other rows or at the end of the file ? Which editor or method
should I use to do it ?

Thank you,

Nitin

In reply to nitin chandra's post of 30 January 2017 at 23:01.

Attachment: ALIGN_NewfmtH5.csv (4K)
Attachment: GRAD_newfmtH5.csv (196 bytes)

Re: CSV data into HDF5 data structure and files

nitin chandra
Hi All,

Any solution would be helpful.

Thank you,

Nitin

In reply to nitin chandra's post of 2 February 2017 at 00:34.

Re: CSV data into HDF5 data structure and files

Francesc Altet-2
In reply to this post by nitin chandra

Hi Nitin,

Yes, more rows can easily be appended to HDF5 files generated by pandas, using the HDFStore.append() method (as shown in the documentation and in my examples).
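
A minimal sketch of both cases from the question, appending at the end and placing a row between existing rows, might look like this. The helper names, the key `"align"` and the column names are hypothetical. As far as I know, HDFStore offers no insert-in-the-middle operation, so the second helper simply reads the table, merges and sorts it in memory, and writes it back; that is an assumption-laden workaround that only suits tables small enough to fit in memory.

```python
import pandas as pd


def append_rows(h5_path, key, new_rows):
    """Append new_rows (a DataFrame) to the end of the table stored under key."""
    with pd.HDFStore(h5_path, mode="a") as store:
        store.append(key, new_rows)


def insert_sorted(h5_path, key, new_rows, sort_by):
    """Rewrite the table so that new_rows end up in sorted position.

    The whole table is read, merged with new_rows, sorted on the
    sort_by column and written back in table format.
    """
    with pd.HDFStore(h5_path, mode="a") as store:
        merged = pd.concat([store[key], new_rows], ignore_index=True)
        merged = merged.sort_values(sort_by, kind="stable").reset_index(drop=True)
        # put() replaces whatever was stored under key before.
        store.put(key, merged, format="table")


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "road.h5")
        # Made-up starting table; format="table" makes it appendable.
        first = pd.DataFrame({"distance_m": [0.0, 20.0], "height_m": [100.0, 104.0]})
        first.to_hdf(path, key="align", format="table")

        append_rows(path, "align",
                    pd.DataFrame({"distance_m": [40.0], "height_m": [108.0]}))
        insert_sorted(path, "align",
                      pd.DataFrame({"distance_m": [10.0], "height_m": [102.0]}),
                      sort_by="distance_m")
        print(pd.read_hdf(path, "align"))
```

The same helpers could be pointed at a second key (for instance one for the GRADE data) in the same file, since one HDF5 file can hold several tables.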

Regarding visualization, pandas uses its own layout on top of HDF5 to store DataFrames, which is why a standard HDF5 viewer (like HDFView) does not show the table (i.e. compound type) that you might expect.  It is better to use pandas itself to read the HDF5 dataset (or parts of it) and then visualize the resulting DataFrame with one of the many existing tools that interact well with pandas:

http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#visualization
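
Reading the data back with pandas, either whole or in part, could be sketched as follows before handing the DataFrame to a plotting tool. The helper names, keys and column names are made up for the example, and reading a row range this way assumes the data was written in table format.

```python
import pandas as pd


def list_tables(h5_path):
    """Return the keys (paths) of all pandas objects stored in the file."""
    with pd.HDFStore(h5_path, mode="r") as store:
        return store.keys()


def read_rows(h5_path, key, start_row, stop_row):
    """Read rows [start_row, stop_row) of a table-format dataset."""
    return pd.read_hdf(h5_path, key, start=start_row, stop=stop_row)


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "road.h5")
        # Two made-up tables in one file, one per input data set.
        align = pd.DataFrame({"distance_m": [0.0, 10.0, 20.0, 40.0],
                              "height_m": [100.0, 102.0, 104.0, 108.0]})
        grade = pd.DataFrame({"distance_m": [0.0, 25.0],
                              "height_m": [99.0, 104.0]})
        align.to_hdf(path, key="align", format="table")
        grade.to_hdf(path, key="grade", format="table")

        print(list_tables(path))
        # Only the requested rows are loaded into memory.
        print(read_rows(path, "align", 1, 3))
```

The resulting DataFrame is an ordinary pandas object, so any of the tools listed at the link above can plot it.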

Take your time to decide which tool works best for your case.  Meanwhile, you can have a glance at the kind of plots that plotly can produce from HDF5 files written by pandas:

In general, if you want to proceed down the pandas path, you may want to ask on the pandas mailing list, where far more people will be ready to help you.


Francesc Alted


In reply to nitin chandra's message to the HDF Users Discussion List, sent Wednesday, February 1, 2017 8:04:58 PM.