java - Spark data processing with grouping -
I need a group group of CSV lines through a specific column and do some processing on each group.
JavaRDD & lt; String & gt; Rows = sc .textfile ("somefile.csv"); JavaPairRDD & lt; String, string & gt; PairRDD = lines.mapToPair (new SomeParser ()); & Lt; String & gt; Keys = pairRDD.keys (). Distinct (). collection (); (String key: keys) {list & lt; String & gt; Rows = jodi rdd Lookup (key); NoOfVisits = rows.size (); Country = COMMA.split (rows.get (0)) [6]; AccessDuration = getAccessDuration (rows, time format); Click the & lt; String, integer & gt; Calculation = gets (rows); Whitepapers = counts.get ("white paper"); Tutorial = counts.get ("Tutorial"); Workshops = Calculations ("Workshop"); Accidental = Calculations ("Concise"); Product page = Calculations ("Product page"); } Private Static Long Dade Pursure (String Destestering) throws ParseException {Simple Data Format Format = New SimpleDateform ("MMM DD Yyyy HH: MMA"); Date date = format PRS (datestring); Return date .getTime (); } Is called dateParser for each row, then it is calculated to get the minimum and maximum usage duration for the group. There are other string matches
pairRDD.lookup is extremely slow .. Is this a better way to do with spark.
I think you can use that column as a key and a GroupByKey
. There is no mention on the operation on those lines. If it is an action that combines those lines in some way, then you can also reduce lesser
.
Something like this:
import org.apache.spark sparkcontact._ // the built-in pair functional pairing = lines.map (parser _) well grouped = pairs.groupByKey / / Here is grouped: (key, iterator [string])
* edit After looking at this process, I think every line Will be more able to contribute, which contributes to it, and then to reduce all of them to the totality
Use total.
def consolidated Becky [u: class tag] (zeroview: u) (U, U) = & gt; U): RDD [(K, U)]
The first task is a partition aggregator and it will run efficiently through local partitions, local copy Partition Segmented Partial
val line = sc.textFile ("somefile.csv") Something like this:) // Parsed gives a key and a decomposed record of the values of the track: (Key, record ("country", timestamp, "whitepaper", ...)) val records = lines.map (parse (_)) val totals = Records.aggregateByKey ((0, set [string]. Fair, long (Record, Count, Desert, Minetime, Maximum Time, Countermap)) => (Count + 1, countrySet + record.country, mathematics). .min (minTime, record.timestamp), math.max (maxTime, record.timestamp), ...) (cumm1, cumm2) = & gt; ??? //
Comments
Post a Comment