Querying a large amount of data processed by Hive
Say I have about 10-20 GB of data in HDFS in the form of a Hive table, obtained after several map-reduce steps and a join over two different datasets. I need to make this data queryable by users. What options do I have?
- Transfer the data from HDFS to an RDS such as PostgreSQL. But I don't want to spend so much time on data transfers. I have tested HDFS -> RDS using Sqoop within the same AWS region, and it takes 4-8 minutes for 800 MB of data. So you can imagine that ~60 GB of data would be very inefficient to transfer. This would be my last resort.
- Query Hive directly from my webserver per user request. I have never used Hive this way, so I am doubtful about it. It struck me because I just learned that you can query a Hive table remotely after some port forwarding on the EMR cluster. But being new to big(ish) data, I am not sure about the risks associated with this. Is it common to do this?
- Some other solution - how do people generally do this kind of thing? It seems like a fairly common task.
Just for completeness, my data looks like this:
id     time        cat1  cat2  cat3  metrics[200]
A123   1234212133  12    ABC   24    4,55,231,34,556,123 .... (~200)
(time is the epoch)
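For reference, such a table could be declared in Hive roughly like this (a sketch; the table name tbl and the column types are my assumptions, not given in the question):

    -- Hypothetical DDL matching the sample row above; names/types are guesses
    CREATE TABLE tbl (
      id      STRING,
      `time`  BIGINT,        -- epoch timestamp
      cat1    INT,
      cat2    STRING,
      cat3    STRING,
      metrics ARRAY<DOUBLE>  -- ~200 values per row, indexed as metrics[i]
    );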
And my query looks like this:
select cat1, corr(metrics[2], metrics[3]), corr(metrics[2], metrics[4]), corr(metrics[2], metrics[5]), corr(metrics[2], metrics[6])
I need the correlation function, which is why I chose PostgreSQL over MySQL.
Hive has the correlation function:
corr(col1, col2)
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
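Since corr is built into Hive as an aggregate function, the query can run where the data already lives, with no transfer to PostgreSQL. A minimal sketch against the schema above (the table name tbl is an assumption):

    -- Pearson correlation of metric pairs, computed per cat1 group in Hive
    SELECT cat1,
           corr(metrics[2], metrics[3]) AS c2_3,
           corr(metrics[2], metrics[4]) AS c2_4,
           corr(metrics[2], metrics[5]) AS c2_5,
           corr(metrics[2], metrics[6]) AS c2_6
    FROM tbl
    GROUP BY cat1;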