database - Querying large amounts of data processed by Hive -


Say that I have about 10-20 GB of data in HDFS in the form of a Hive table, obtained after many map-reduce steps and a join over two different datasets. I need to make this data queryable by users. What options do I have?

  1. Transfer the data from HDFS to an RDS such as PostgreSQL. But I don't want to spend so much time on data transfers. I have tested HDFS -> RDS using Sqoop within the same AWS region, and it takes 4-8 minutes for 800 MB of data. So you can imagine that transferring ~60 GB of data will be very inefficient. This would be my last resort.
  2. Query Hive directly from my webserver per user request. I have never used Hive this way, so I am doubtful about this. It struck me because I just found out that you can query a Hive table remotely after some port forwarding on the EMR cluster. But being new to big(ish) data, I am not sure about the risks associated with this. Is it common to do this?
  3. Some other solution - how do people generally do this kind of thing? It seems like a very common task.
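For option 1, the HDFS-to-PostgreSQL transfer would typically be a single Sqoop export. A hedged sketch follows; the host, database, table name, and warehouse path are all placeholders, not values from the question, and the delimiter assumes Hive's default `\001` field separator:

```shell
# Hypothetical Sqoop export from the Hive table's HDFS directory into PostgreSQL.
# All connection details, names, and paths below are placeholders.
sqoop export \
  --connect jdbc:postgresql://mydb.example.com:5432/analytics \
  --username analytics_user -P \
  --table joined_metrics \
  --export-dir /user/hive/warehouse/joined_metrics \
  --input-fields-terminated-by '\001' \
  --num-mappers 8
```

Raising `--num-mappers` parallelizes the export, which may help with the 4-8 minutes per 800 MB observed above, though the database side often becomes the bottleneck.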

Just for completeness, my data looks like this:

  id    time        cat1  cat2  cat3  metrics[200]  
  A123  1234212133  12    ABC   24    4,55,231,34,556,123 .... (~200)  

(time is epoch time)

And my query looks like this:

select cat1, corr(metrics[2], metrics[3]), corr(metrics[2], metrics[4]), corr(metrics[2], metrics[5]), corr(metrics[2], metrics[6])

I need a correlation function, which is why I chose PostgreSQL over MySQL.

Hive has a correlation function:

corr(col1, col2)

Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.

