I would like to have a function that searches a vector
for a specific pattern, "4 after 1" (that is, "1" followed by "4"). It should list all the sequences it finds and print
out the ratios for each one, its length, and where it starts and ends.
It should search parts of the vector of length N>=8 for the number pair (1,4) in the following vector, with these conditions in mind:
1) a specific ratio, defined as:
BigRatio = Number of (1,4) * N / (Number of (1) * Number of (4))
has to be greater than or equal to 0.2 %
2) and the ratio of 1s and 4s
in the segment, defined as:
SmallRatio = (Number of 1 + Number of 4) / (length of sequence)
has to reach 0.3 %
If the conditions are met, it should then print the sequence and the ratios for every match.
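A minimal sketch of the two ratios for a single segment may make the definitions concrete. The helper name `segment_ratios` and its arguments `a` and `b` are hypothetical, and returning NA for BigRatio when the segment contains no 1s or no 4s is an assumption, since the post does not say what should happen in that case.

```r
# Hypothetical helper: ratios for one segment `seg` and the pair (a, b).
segment_ratios <- function(seg, a = 1, b = 4) {
  n    <- length(seg)
  n_a  <- sum(seg == a)                      # Number of (1)
  n_b  <- sum(seg == b)                      # Number of (4)
  n_ab <- sum(seg[-n] == a & seg[-1] == b)   # Number of (1,4): a directly followed by b
  # Assumption: BigRatio is undefined (NA) when either count is zero.
  big  <- if (n_a > 0 && n_b > 0) n_ab * n / (n_a * n_b) else NA_real_
  c(BigRatio = big, SmallRatio = (n_a + n_b) / n)
}

segment_ratios(c(1, 4, 1, 4, 2, 3, 1, 4))
```

For this eight-element segment there are three 1s, three 4s and three (1,4) pairs, so BigRatio is 3 * 8 / (3 * 3), about 2.67, and SmallRatio is 6 / 8 = 0.75.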
This is the vector:
vector=c(1,1,1,1,1,1,1,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,4,4,2,3,1,4,1,4,1,4,1,4,1,4,
1,4,1,4,4,2,3,1,1,1,1,4,1,1,1,4,4,4,4,2,3,1,1,4,1,4,1,4,1,1,1,4,4,4,4,2,3,3,1,1,
4,1,4,1,4,1,1,1,4,4,4,4,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,1,1,4,2,1,1,1,1,1,1,4,3,
2,4,2,1,5,6,2,3,1,2,4,1,2,3,1,1,1,1,1,1,1,2,3,4,5,1,2,3,4,1,1,1,1,1,1,2,3,4,1,1,
1,2,3,1,2,3,1,2,3,4,3,1,2,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,4,1,
4,1,4,4,2,3,1,1,1,1,4,1,1,1,3,1,1,1,1,4,1,1,1,3,1,1,1,1,4,1,1,1,4,1,1,1,3,1,1,
1,1,4,2,3,1,1,4,1,4,1,4)
vector2=as.character(vector)
I converted it to character because I thought it would be easier that way. I may be wrong.
My code/Progress so far
I had two ideas about this:
1) The function could search 8 or more numbers at once (I can choose the length in the function) and then check the ratios. It would then report on that chunk if it is a good piece of 8 numbers.
2) The other idea is a scoring system giving 5 points for a pair of 1,4 and -1 for every other number. It should then somehow estimate where these parts are and find these segments.
The problem with the first idea is that one segment might have 40 % and the next segment 20 %, and together they might have more than that. So I was trying to figure out how to escape this trap of false negatives. Maybe the search should check every number or number pair rather than a whole segment. This is more complicated, but also more precise.
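The scoring idea (5 points per pair, -1 otherwise) could be sketched as a maximum-scoring contiguous segment search. Everything here is an assumption layered on the description: the function names `score_pairs` and `best_segment` are hypothetical, the pair's 5 points are placed on its first element with the second element scored 0, and only the single best segment is returned.

```r
# Hypothetical scoring: +hit on the first element of each (a, b) pair,
# 0 on its second element, `miss` on everything else.
score_pairs <- function(v, a = 1, b = 4, hit = 5, miss = -1) {
  n <- length(v)
  starts <- which(v[-n] == a & v[-1] == b)  # index of the `a` in each pair
  s <- rep(miss, n)
  s[starts] <- hit
  s[starts + 1] <- 0
  s
}

# Best contiguous segment by total score (Kadane-style scan).
best_segment <- function(s) {
  best <- -Inf; best_start <- 1; best_end <- 1
  cur <- 0; cur_start <- 1
  for (i in seq_along(s)) {
    if (cur <= 0) {          # a non-positive running sum cannot help: restart here
      cur <- s[i]
      cur_start <- i
    } else {
      cur <- cur + s[i]
    }
    if (cur >= best) {       # >= so a trailing 0 (the pair's second element) is included
      best <- cur
      best_start <- cur_start
      best_end <- i
    }
  }
  c(start = best_start, end = best_end, score = best)
}

best_segment(score_pairs(c(2, 3, 1, 4, 1, 4, 2, 1, 4, 3, 3)))
```

Because the segment boundaries are chosen by the scores rather than fixed in advance, a dense region that straddles two fixed chunks is found as one segment. The ratios can then be checked on the reported start and end.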
With the code, I am stuck on how to write the function. I know the arguments should be
vector
and the desired length of the sequence I would like to search for (if I go for
the first idea). I think I have to use a for loop
to go through every number (or two numbers) so that I can check whether they are equal to (1,4), then "remember" it and calculate
the length of that part. And of course search every part for 1 or 4 to
calculate the ratios for them.
I thought of using this kind of loop:
for (i in 1:length(vector)) {
  idx <- agrep(vector[i],x)
  matches[i] <- length(vector)
}
But I think it is wrong.
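For comparison, the first idea (checking windows of N numbers at a time) might look roughly like the sketch below. The function name `scan_windows`, the `step` argument and the threshold defaults are hypothetical; the defaults 0.2 and 0.3 are placeholders because the intended scale of the "0.2 %" and "0.3 %" thresholds is not clear from the description.

```r
# Hypothetical window scan: one row per window that passes both thresholds.
scan_windows <- function(v, N = 8, step = N, a = 1, b = 4,
                         min_big = 0.2, min_small = 0.3) {
  stopifnot(length(v) >= N)
  starts <- seq(1, length(v) - N + 1, by = step)
  res <- lapply(starts, function(s) {
    seg  <- v[s:(s + N - 1)]
    n_a  <- sum(seg == a)
    n_b  <- sum(seg == b)
    n_ab <- sum(seg[-N] == a & seg[-1] == b)   # a directly followed by b
    # Assumption: BigRatio is NA when the window has no a or no b.
    big  <- if (n_a > 0 && n_b > 0) n_ab * N / (n_a * n_b) else NA_real_
    data.frame(start = s, end = s + N - 1, pairs = n_ab,
               BigRatio = big, SmallRatio = (n_a + n_b) / N)
  })
  res <- do.call(rbind, res)
  keep <- !is.na(res$BigRatio) & res$BigRatio >= min_big &
    res$SmallRatio >= min_small
  res[keep, ]
}

# First 16 elements of the vector, two non-overlapping windows of 8
scan_windows(c(1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 2, 3, 1, 1, 1), N = 8)
```

With the default `step = N` the windows do not overlap. Setting `step = 1` slides the window one element at a time, which avoids missing a dense region that is split across two fixed chunks, at the cost of many overlapping rows.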
I am still new to programming and R.
Additional question:
How would the function look if it were used on a data frame? Would it change the search to specific rows? Is it possible to convert a vector into a data frame?
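As a rough illustration of the data frame case: a vector can be wrapped in a data frame with `data.frame()`, and the search can then be restricted to chosen rows. The column names `pos` and `value`, the short example vector and the row subset are all hypothetical.

```r
v  <- c(1, 4, 1, 4, 2, 3, 1, 4)             # hypothetical short example
df <- data.frame(pos = seq_along(v), value = v)

rows <- 3:8                                  # hypothetical subset of rows to search
sub  <- df$value[rows]
# Rows where a 1 is directly followed by a 4, within the chosen subset
pair_start <- rows[which(head(sub, -1) == 1 & tail(sub, -1) == 4)]
df[pair_start, ]
```

The search logic itself does not change; only the column is extracted first and the hit indices are mapped back to row numbers.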
EDIT:
Another example and clarification:
sample2=c("aaaaabababababababababababababababcabcbababc bcbabcbcdddcbcbcdcbcbcbdcb
bcbcbcbdbdbcbcbcbccbbcbbcbcbcbcbcbcbabababababababccbbcbbcbcbcbcbcbcbdbdbcbcbcbccb
bcbcbcbdbdbcbcbcbccbbcbbcbcbcbcbcbbababababababababababababacbcbacbcbcdcbcbcbdcbbcdaddabcbac
cabcbabcbabcbcbbabbabababababababababababa")
nchar(sample2)
So this is what it should do:
1) idea
Search the string in parts of 50 characters, that is, this part first:
"aaaaabababababababababababababababcabcbababcbcbabc"
and then this part (the next sequence of 50 elements of that string)
"bcbabcbcdddcbcbcdcbcbcbdcbbcbcbcbdbdbcbcbcbccbbcbb"
And do this for every subsequent 50 elements of the string.
As you can see, the second 50 elements contain "ba", but not often enough to meet the condition, so that part will not be shown.
The next idea was to calculate what the optimal segment for >0.5 in this string would be. There would be a problem if the first part of 50 elements had 0.4 of "ba" in it, and the next 50 had 0.1 of "ba" in it, right at the beginning of that part. Imagine the first 50 have a lot of "ba" at the end, but not enough:
"aaaaabababababdcdcdcdacacbababababababababababababab"
The next 50 have a lot at the beginning:
"bababababababcbcdcbcbcbdcbbcbcbcbdbdbcbcbcbccbbcbbcd"
So how can this be made more optimal? Should there be a scoring system for "ba", as explained above, to find the optimal lengths of a segment for satisfying the conditions?
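One way around the chunk-boundary problem is to measure the density of "ba" in every window of a given width, not only in fixed chunks. The sketch below assumes that "0.4 of ba" means the share of window characters covered by matches. The function name `pattern_density` and its arguments are hypothetical, and the pattern is assumed not to overlap itself.

```r
# Hypothetical helper: match count and covered fraction for every window of width w.
pattern_density <- function(s, pattern = "ba", w = 50) {
  n <- nchar(s)
  m <- nchar(pattern)
  stopifnot(n >= w, w >= m)
  hits <- as.integer(gregexpr(pattern, s, fixed = TRUE)[[1]])
  mark <- integer(n)
  if (hits[1] > 0) mark[hits] <- 1L          # 1 where a match starts (-1 means no match)
  cs <- c(0L, cumsum(mark))                  # cs[k + 1] = matches starting at or before k
  starts <- seq_len(n - w + 1)
  # matches that start in the window and also end inside it
  count <- cs[starts + w - m + 1] - cs[starts]
  data.frame(start = starts, end = starts + w - 1,
             count = count, fraction = m * count / w)
}

d <- pattern_density("abababcccc", "ba", w = 4)
d[which.max(d$fraction), ]
```

The window with the highest fraction is a candidate segment. Trying several widths `w` gives a crude way to look for the segment length that satisfies the condition.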
I'm rather annoyed that after producing useful code there is still no upvote and the problem still seems ambiguous. The new example has linefeeds in it, but it's not clear whether we are supposed to read these in as separate lines, since:
> nchar(readLines(textConnection(sample2)))
[1] 71 92 102 52
It's not that hard to split a long character value into smaller parts:
samp3 <- paste(rep("a", 300), collapse="")
mapply( substr, seq(1,nchar(samp3),by=50), seq(1,nchar(samp3),by=50)+49, MoreArgs=list(x=samp3))
[1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[2] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[3] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[4] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[5] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[6] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
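If the linefeeds are not meant to be line boundaries, a possible preprocessing step is to strip all whitespace first and then split. The sketch below uses `substring()`, which is vectorized over the start and stop positions. The example string `raw` and the chunk width of 5 are hypothetical.

```r
raw    <- "ababab\ncdcd cd\nefef"            # hypothetical string with linefeeds and a space
clean  <- gsub("[[:space:]]", "", raw)       # drop every whitespace character
starts <- seq(1, nchar(clean), by = 5)
parts  <- substring(clean, starts, starts + 4)
parts
```

The last piece is shorter when the string length is not a multiple of the chunk width.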
If you want to progress in your academic pursuits you need to work on expressing a concrete example in a manner others can execute.
------------First attempt:
Here is some vectorized code that should produce the tools needed to do this. Finding the correct vectorized functions lets you move beyond the for-loop-mentality fostered by SAS and BASIC. Loops can be useful when needed but generally R programmers try to avoid them unless really needed. I'm not sure what the exact desired outcome is, but at least this should move the conversation forward:
# convert to single character item
collapsV <- paste0(vector,collapse="")
pos14 <- gregexpr("14", collapsV) # regex pattern matching
# look for runs of 2 differences , i.e. "14"'s next to each other
diff14_2 <- rle( diff(gregexpr("14", collapsV)[[1]]) )
#Run Length Encoding ...# value is a two element list that looks like
# lengths: int [1:22] 1 1 6 1 1 1 2 1 1 2 ...
# values : int [1:22] 13 7 2 8 4 8 2 4 9 2 ...
which( diff14_2$values==2 & diff14_2$lengths>4)
[1] 3 16
So the third run in diff14_2 starts at the third gregexpr "hit" (the first two runs each have length 1), which gives the position in "vector" of the first 1414... run with more than 4 consecutive differences of 2. Check it:
> pos14[[1]][3]
[1] 27
> vector[27:40]
[1] 1 4 1 4 1 4 1 4 1 4 1 4 1 4
> vector[25:40]
[1] 2 3 1 4 1 4 1 4 1 4 1 4 1 4 1 4
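The second index, 16, refers to run 16 of diff14_2, not to gregexpr hit 16, so it has to be mapped through the cumulative run lengths. Looking up hit 16 directly lands on a shorter run of only three pairs: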
> pos14[[1]][16]
[1] 76
> vector[76:(76+8)]
[1] 1 4 1 4 1 4 1 1 1
You should print out all the intermediate values to see what is happening.
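To translate run indices from the rle result into positions in the vector, the run lengths have to be accumulated. A minimal sketch on the first 84 elements of the vector follows. The names `v84`, `first_hit` and `keep`, and the cutoff of at least 3 adjacent pairs, are hypothetical choices, and collapsing to a string assumes every value is a single digit.

```r
# First 84 elements of `vector` from the question
v84 <- c(1,1,1,1,1,1,1,4,4,4,4,2,3,1,1,1,1,1,1,1,4,4,4,4,2,3,1,4,1,4,1,4,1,4,1,4,
         1,4,1,4,4,2,3,1,1,1,1,4,1,1,1,4,4,4,4,2,3,1,1,4,1,4,1,4,1,1,1,4,4,4,4,2,3,3,1,1,
         4,1,4,1,4,1,1,1)
hits <- as.integer(gregexpr("14", paste0(v84, collapse = ""), fixed = TRUE)[[1]])
runs <- rle(diff(hits))

# Index (into `hits`) of the hit that opens each run
first_hit <- cumsum(c(1, head(runs$lengths, -1)))

# Runs of adjacent pairs: k differences of 2 mean k + 1 pairs in a row
keep  <- which(runs$values == 2 & runs$lengths >= 2)
pairs <- runs$lengths[keep] + 1
data.frame(start = hits[first_hit[keep]],
           end   = hits[first_hit[keep]] + 2 * pairs - 1,
           pairs = pairs)
```

On this prefix the runs start at positions 27, 59 and 76, with 7, 3 and 3 pairs respectively. The same mapping applies unchanged to the full vector.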