regex - Matching and counting strings (k-mer of DNA) in R
I have a list of strings (DNA sequences made up of A, T, C, G). I want to find all the matches and put them in a table whose columns are all possible combinations of those DNA letters (4^k of them; "k" is the length of each match, i.e. the k-mer, and should be specified by the user) and whose rows represent the number of matches in each sequence of the list.
Let's say my list includes 5 members:

    DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")

If k = 2 (2-mer), then 4^2 = 16 combinations are available, including AA, AT, AC, AG, TA, TT, ... So my table will have 5 rows and 16 columns. I want to count the number of matches between my k-mers and the list members.
My desired result:

    df:
    lstMemb AA AT AC AG TA TT TC ...
    1        2  1  1  0  0  3  0
    2       ...
    3
    4
    5
Can you help me implement it in R?
If you are looking for speed, the obvious solution is the stringi package, whose stri_count_fixed function counts pattern occurrences. Here are the code and a benchmark:
    library(stringi)

    DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")

    # All 16 possible 2-mers: AA, AC, AG, AT, CA, ...
    dna <- stri_paste(rep(c("A", "C", "G", "T"), each = 4), c("A", "C", "G", "T"))

    # Count overlapping occurrences of every 2-mer in every sequence
    result <- t(sapply(DNAlst, stri_count_fixed, pattern = dna, overlap = TRUE))
    colnames(result) <- dna
    result
    ##      AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    ## [1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
    ## [2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
    ## [3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
    ## [4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
    ## [5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0

    library(Biostrings)
    library(microbenchmark)

    # stringi version
    fstri <- function(x) {
      t(sapply(x, stri_count_fixed, pattern = dna, overlap = TRUE))
    }

    # Biostrings version, for comparison
    fbio <- function(x) {
      t(sapply(x, function(x) {
        x1 <- DNAString(x)
        oligonucleotideFrequency(x1, 2)
      }))
    }

    all(fstri(DNAlst) == fbio(DNAlst))  # results are the same
    ## [1] TRUE

    longDNA <- sample(DNAlst, 100, TRUE)
    microbenchmark(fstri(longDNA), fbio(longDNA))
    ## Unit: microseconds
    ##            expr        min         lq        mean     median         uq        max neval
    ##  fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
    ##   fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100

    127245.489 / 766.862
    ## 165.9301
So stringi is about 165 times faster :)
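The question asks for a user-specified k, while the code above hard-codes k = 2. Below is a minimal base-R sketch of one way to generalize it, assuming only base R is available. The helper names all_kmers and count_kmers are hypothetical, not part of stringi or Biostrings, and the sketch swaps in a plain sliding window instead of stri_count_fixed, so it makes no claim about speed.

```r
# Hypothetical helper: all 4^k k-mers over the DNA alphabet,
# ordered AA, AC, AG, AT, CA, ... (last position varies fastest).
all_kmers <- function(k, alphabet = c("A", "C", "G", "T")) {
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  # expand.grid varies the first column fastest, so reverse the column order
  do.call(paste0, unname(as.list(grid[k:1])))
}

# Hypothetical helper: one row per sequence, one column per k-mer,
# counting overlapping occurrences.
count_kmers <- function(seqs, k) {
  kmers <- all_kmers(k)
  counts <- t(vapply(seqs, function(s) {
    n <- nchar(s) - k + 1
    # Sliding window of width k; sequences shorter than k yield no windows
    windows <- if (n > 0) {
      substring(s, seq_len(n), seq_len(n) + k - 1)
    } else {
      character(0)
    }
    # Windows that are not valid k-mers (e.g. containing N) are ignored
    as.integer(table(factor(windows, levels = kmers)))
  }, integer(length(kmers))))
  dimnames(counts) <- list(NULL, kmers)
  counts
}

DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")
counts2 <- count_kmers(DNAlst, k = 2)  # 5 rows, 16 columns
counts3 <- count_kmers(DNAlst, k = 3)  # 5 rows, 64 columns
```

For k = 2 the columns come out in the same order as the dna vector above, so the result can be compared directly with the stringi matrix.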