regex - matching and counting strings (k-mer of DNA) in R -


I have a list of strings (DNA sequences made up of A, T, C, G). I want to find all matches and put them into a table whose columns are all possible combinations of the DNA alphabet (4^k of them; each is a "k-mer", where k is the length of each match and must be specified by the user) and whose rows hold the number of matches in each sequence of the list.

Let's say my list has 5 members:

    DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")

With k = 2 (2-mers), there are 4^2 = 16 possible combinations: AA, AT, AC, AG, TA, TT, ...

My table will then have 5 rows and 16 columns. I want to count how many times each k-mer occurs in each list member.

My desired result (df):

    lstMemb AA AT AC AG TA TT TC ...
    1        2  1  1  0  0  3  0 ...
    2       ...
    3
    4
    5

Can you help me implement it in R?

If you are looking for speed, the obvious solution is the stringi package. Its stri_count_fixed function counts occurrences of a pattern. And now, check the code and the benchmark:

    library(stringi)
    library(Biostrings)      # DNAString(), oligonucleotideFrequency()
    library(microbenchmark)

    DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")

    # All 16 possible 2-mers: AA, AC, AG, AT, CA, ...
    dna <- stri_paste(rep(c("A", "C", "G", "T"), each = 4), c("A", "C", "G", "T"))

    # One row per sequence, one column per 2-mer.
    # overlap = TRUE counts overlapping matches, so "TTTT" contains "TT" 3 times.
    result <- t(sapply(DNAlst, stri_count_fixed, pattern = dna, overlap = TRUE))
    colnames(result) <- dna
    result
         AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    [1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
    [2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
    [3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
    [4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
    [5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0

    # stringi version; overlap is passed by name so it is not mistaken for another option
    fstri <- function(x) {
      t(sapply(x, stri_count_fixed, pattern = dna, overlap = TRUE))
    }
    # Biostrings version, for comparison
    fbio <- function(x) {
      t(sapply(x, function(x) {
        x1 <- DNAString(x)
        oligonucleotideFrequency(x1, 2)
      }))
    }

    all(fstri(DNAlst) == fbio(DNAlst))  # results are identical
    [1] TRUE

    longDNA <- sample(DNAlst, 100, TRUE)
    microbenchmark(fstri(longDNA), fbio(longDNA))
    Unit: microseconds
               expr        min         lq        mean     median         uq        max neval
     fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
      fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100

    127245.489 / 766.862
    ## [1] 165.9301

About 165x faster :)
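The question asks for a k chosen by the user, while the code above hardcodes k = 2. A minimal sketch of how the same stringi approach might be generalized follows. The helper name `kmer_counts` and its `alphabet` argument are made up for illustration and are not part of the original answer or of stringi; only `stri_count_fixed` with `overlap = TRUE` comes from the answer.

```r
library(stringi)

# Hypothetical helper (an assumption, not from the original answer):
# count overlapping k-mers in each sequence of `seqs` for a user-chosen `k`.
kmer_counts <- function(seqs, k, alphabet = c("A", "C", "G", "T")) {
  # expand.grid() varies its first column fastest, so the columns are reversed
  # before pasting to get the order AA, AC, AG, AT, CA, ... (for k = 2).
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  kmers <- do.call(paste0, unname(as.list(rev(grid))))

  # One row per sequence, one column per k-mer (length(alphabet)^k columns).
  counts <- t(sapply(seqs, stri_count_fixed, pattern = kmers, overlap = TRUE))
  dimnames(counts) <- list(NULL, kmers)
  counts
}

DNAlst <- list("CAAACTGATTTT", "GATGAAAGTAAAATACCG", "ATTATGC", "TGGA", "CGCGCATCAA")
res2 <- kmer_counts(DNAlst, 2)  # 5 x 16 matrix
res3 <- kmer_counts(DNAlst, 3)  # 5 x 64 matrix
```

Note that the number of columns grows as 4^k, so for large k most columns will be all zeros; counting only the k-mers that actually occur may be a better fit in that case.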
