|
Post by Roger Cabo on Dec 15, 2022 1:34:19 GMT 1
Hi everyone, Does anyone have an idea to do a spell check to a textbox anyhow? I found a German dictionary text file and all words are sorted. The text file contains about 1 million words. Text Dictionary File German 30MB. Does anyone have an idea how to find words from a text are not correct? Perhaps with a binary search? Thanks!
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 15, 2022 1:45:12 GMT 1
Binary needs 20 tries to find a word for sure. This should be the best way.
|
|
|
Post by (X) on Dec 15, 2022 2:52:59 GMT 1
This is the source code to find a string in a file which may be adapted to your needs.
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 15, 2022 17:20:03 GMT 1
However, FindFiles.g32 would be extremely fat for a very small problem. The area to be searched is already sorted alphabetically, so this is extremely fast with a very simple binary search.
Do you already have a solution for a binary search, Rogercabo or do you still need one?
---german---
FindFiles.g32 wäre aber extrem fett für ein sehr kleines Problem. Der zu durchsuchende Bereich ist bereits alphabetisch sortiert, von daher geht das mit einer sehr einfachen binären Suche extrem schnell.
Hast du schon eine Lösung für eine Binärsuche, Rogercabo oder brauchst du noch eine?
|
|
|
Post by Roger Cabo on Dec 15, 2022 20:40:34 GMT 1
However, FindFiles.g32 would be extremely fat for a very small problem. The area to be searched is already sorted alphabetically, so this is extremely fast with a very simple binary search. Do you already have a solution for a binary search, Rogercabo or do you still need one? ---german--- FindFiles.g32 wäre aber extrem fett für ein sehr kleines Problem. Der zu durchsuchende Bereich ist bereits alphabetisch sortiert, von daher geht das mit einer sehr einfachen binären Suche extrem schnell. Hast du schon eine Lösung für eine Binärsuche, Rogercabo oder brauchst du noch eine?
From what I see files search works simply with instr and rematch. When loading the 30MB file into a string it takes 10ms to find a word at the end of the string. = "zzgl". A bit too long.
So what issue I see for binary search, I need an index. Because the words are in a string stream containing #13#10. How you will find exactly the half of the next search?
I thought about to use a Hash table, but this would be about 250MB after creating an array with about 1million words. Then I came to the conclusion to build and array of 100 elements that contains the index of A,AR, B,BR, C,CR.. etc?
Currently I have no binary search for a sorted text stream.. any idea how to solve the index problem for binary search?
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 15, 2022 21:28:48 GMT 1
Ok, I briefly documented a code of mine from the last millennium and transformed it a bit into English.
Ok, ich habe einen Code von mir aus dem letzten Jahrtausend kurz dokumentiert und etwas in Englisch transformiert.
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 15, 2022 21:34:54 GMT 1
'Bin_search, 12/2022 WeBu, Germany
Dim j%, n%, a$(1910000) OpenW 1, 16, 16 AutoRedraw = 1 SetFont 16 Mode StrSpace 0 Const FName$ = "wordlist-german.txt"
Print "open "; FName$`"..."` Open FName$ for Input As # 1 Recall # 1, a$(), -1, n Close # 1 Print "-> "; n & " items"
Print "Sort ... "; // Sorting necessary because of the German special characters, the sorting order is thereby wrong for the search // Sortieren notwendig wegen der deutschen Sonderzeichen, die Sortierreihenfolge ist dadurch für die Suche falsch QSort a$(), n Print "ready."#10
Print "Start finding all "; n; " items ..." Dim t# = -Timer For j = 0 To n - 1 bin_search(a$(j)) Next t += Timer Print "... ready after "; Round(t, 2); " secs" Procedure bin_search(such$) ' Searches in the range of 0 - (n-1) and returns the correct position in the VAR 'fnd'. If the element of 'fnd' is then not the one searched for, then it does not exist and can be inserted at exactly this position 'fnd' with the Insert-command. ' Sucht im Bereich von 0 - (n-1) und gibt in der VAR 'fnd' die richtige Stelle zurück. Ist das Element von 'fnd' dann nicht das gesuchte, dann ist es nicht vorhanden und kann an genau dieser Stelle 'fnd' mit Insert-Befehl eingefügt werden. Dim i% Dim unten% = -1 Dim fnd% = n + 1 ' Repeat i = (unten + fnd) / 2 If a$(i) < such$ unten = i Else fnd = i EndIf Until fnd = unten + 1 Check: If j <> fnd Color 255 : Print j; "<>"; fnd, such$, a$(fnd) : Color 0 Return
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 15, 2022 21:43:19 GMT 1
' bin_search' finds in a loop all the 1.908.815 items in the load array in 1 - 3 seconds. All 1.908.815 tiems! If you have a text with 1000 words, it will check it in a millisecond.
This routine i buildet in 1989 and is from GFA 3.07 Atari. In g32 transformed today
|
|
|
Post by Roger Cabo on Dec 15, 2022 21:57:59 GMT 1
thank you very much!
You wrote: >>> Sorting necessary because of the German special characters, the sorting order is thereby wrong for the search
Does that mean the sorting order is now incorrect?
|
|
|
Post by Roger Cabo on Dec 16, 2022 0:16:17 GMT 1
It seems there is a bug on the Dict word table? This word exist 2 times Betriebsmasse <--- Betriebsmaße Betriebsmasse <--- After Sorting with gb32.
Betriebsmaße Betriebsmasse <--- Betriebsmasse <--- Because all words must be found in your demo. I think I should eliminate doubles as well from the dict after sorting. Seems there are any more errors inside, or Gb32 does have a problem. The word üppigstes exist only 1 time but it shows an error:
That' strange the word exist, but it would not be found "üppigste"... Perhaps
i = (unten + fnd) / 2 the result does not point correct under some circumstances because / 2 cause in integer rounding error anyhow! :-)
|
|
|
Post by Roger Cabo on Dec 16, 2022 1:27:44 GMT 1
Seems gb32 qSort has issues... after sorting there are 1188 empty elements in the array.
|
|
|
Post by dragonjim on Dec 16, 2022 12:07:52 GMT 1
If there are empty elements after QSort, then there are probably empty or unreadable elements spotted around the array itself.
A quick way to find them is as follows:
Dim n%, ct% = 7, w$(1 To ct%), w%(1 To ct%) // This just populates the example array Restore Dataline For n% = 1 To ct% Read w$(n%) w%(n%) = n% // Stores the element in the original array Next n% Dataline: Data "Hello", "Goodbye","","Monday","Tuesday","","Wednesday" // Sort while preserving the original element number QSort w$(), ct%, w%() // Run through the sorted array, identifying the original elements that were either empty or not converted properly For n% = 1 To ct% If w$(n%) = "" Then Debug "Element" & w%(n%) & " in the original array is empty" Next n% Debug.Show
You could run through the original array and delete the empty elements but the above will also catch any issues with QSort not handling strings properly, if that is the case.
|
|
|
Post by (X) on Dec 16, 2022 16:56:00 GMT 1
Divide and Conquer... One could also sort the word file by string length. Fewer elements = Faster search
|
|
|
Post by (X) on Dec 16, 2022 19:12:57 GMT 1
This is a quick survey of the string lengths...
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 16, 2022 21:31:39 GMT 1
1. Yes, the word "Betriebsmasse" is duplicated, but is still found. Duplicate words are not a problem with binary search. 2. "QSort a$(), n" is necessary. n is the number of elements to be sorted only. If only "Qsort a$()" (without n) is used, the empty strings at the end of the array are sorted upwards and are at the beginning. This can never be needed, therefore the parameter n 3. In the file "wordlist-german.txt" i tried, there are no empty lines, also not after sorting. Just try inside the loop the command "If !a$(j) Stop" and the program will not stop. 4. Be carefull. The file "wordlist-german.txt" ist UTF-8-encoded. If you try to search for your own words, these search-words must also be UTF-8-encoded.The best way is to convert the file to your own desired encoding, sort it once and then store it back with "Store a$() ...". Then you don't need to sort after loading. 5. "Fewer elements = Faster search". Well, whether a search needs 20 comparisons or only 17 should not matter much. It was Rogercabo about a check of a textbox, so very few words. This still does not take a millisecond without prior indexing.
6. Which division you use doesn't matter, all versions will work find all words, but if you use the integer division it will of course be about 33% faster again (compiled).
Repeat i = (unten + fnd) \ 2 // Integer-div works fine If a$(i) < such$ unten = i Else fnd = i EndIf Until fnd = unten + 1
|
|
|
Post by dragonjim on Dec 16, 2022 23:09:07 GMT 1
Another way to make it faster is to just test the word currently being edited (plus possibly the one before and afterwards in the case of a space being inserted).
|
|
|
Post by Roger Cabo on Dec 17, 2022 18:21:01 GMT 1
'Bin_search, 12/2022 WeBu, Germany
Dim j%, n%, a$(1910000) OpenW 1, 16, 16 AutoRedraw = 1 SetFont 16 Mode StrSpace 0 Const FName$ = "wordlist-german.txt"
Print "open "; FName$`"..."` Open FName$ for Input As # 1 Recall # 1, a$(), -1, n Close # 1 Print "-> "; n & " items"
Print "Sort ... "; // Sorting necessary because of the German special characters, the sorting order is thereby wrong for the search // Sortieren notwendig wegen der deutschen Sonderzeichen, die Sortierreihenfolge ist dadurch für die Suche falsch QSort a$(), n Print "ready."#10
Print "Start finding all "; n; " items ..." Dim t# = -Timer For j = 0 To n - 1 bin_search(a$(j)) Next t += Timer Print "... ready after "; Round(t, 2); " secs" Procedure bin_search(such$) ' Searches in the range of 0 - (n-1) and returns the correct position in the VAR 'fnd'. If the element of 'fnd' is then not the one searched for, then it does not exist and can be inserted at exactly this position 'fnd' with the Insert-command. ' Sucht im Bereich von 0 - (n-1) und gibt in der VAR 'fnd' die richtige Stelle zurück. Ist das Element von 'fnd' dann nicht das gesuchte, dann ist es nicht vorhanden und kann an genau dieser Stelle 'fnd' mit Insert-Befehl eingefügt werden. Dim i% Dim unten% = -1 Dim fnd% = n + 1 ' Repeat i = (unten + fnd) / 2 If a$(i) < such$ unten = i Else fnd = i EndIf Until fnd = unten + 1 Check: If j <> fnd Color 255 : Print j; "<>"; fnd, such$, a$(fnd) : Color 0 Return
I have some small troubles to return a value if a word does not exist.
If I simply call :
bin_search("fsjdhksfhsfk")
I like to return the index if the word was found. Return fnd And -1 if the word was not found.
Would that work like this?
Function bin_search(such$) as int Dim i% Dim unten% = -1 Dim fnd% = Ubound(a$() + 1 ' Repeat i = (unten + fnd) / 2 If a$(i) < such$ unten = i Else fnd = i EndIf Until fnd = unten + 1 If a$(fnd) != such$ Return -1 Else Return fnd EndIf
EndFunction
|
|
|
Post by (X) on Dec 17, 2022 18:26:07 GMT 1
I am not sure what to make of this comment in the help file: Does this mean the total number of characters in the array?
or
Does it mean the max string length for 1 element of the array?
A further help file comment may clear this up...
So, I guess there's no need to worry unless we are trying to input strings of length greater than 9999 bytes.
Apparently there is a "work-around" inspired by an example by Roger Cabo...
|
|
|
Post by Roger Cabo on Dec 17, 2022 19:42:20 GMT 1
I am not sure what to make of this comment in the help file: Does this mean the total number of characters in the array?
or
Does it mean the max string length for 1 element of the array?
A further help file comment may clear this up...
So, I guess there's no need to worry unless we are trying to input strings of length greater than 9999 bytes.
Apparently there is a "work-around" inspired by an example by Roger Cabo...
Haha yes! Yesterday it was confusing me as well. But I think a single string should be not longer than 9999 bytes when using store or recall. The function Store/Recall works with about these 1.900.000 lines from the dic well. But anyway: I found an issue with binary search logic. The search cause an error if the elements are not even. I think all binary search function have this problem. I tried several in c# as well.. Example: Return -1 // false Dim a$(10) a$(0) = "0" a$(1) = "1" a$(2) = "2" a$(3) = "3" a$(4) = "4" a$(5) = "5" a$(6) = "6" a$(7) = "7" a$(8) = "8" a$(9) = "9" a$(10) = "10"
MsgBox bin_search("10") & " fail!"
ReDim a$(9)
MsgBox bin_search("9") & " found"
Function bin_search(such$) Dim i% Dim unten% = -1 Dim fnd% = UBound(a$()) ' Repeat i = (unten + fnd) / 2 If a$(i) < such$ unten = i Else fnd = i EndIf Until fnd = unten + 1 If a$(fnd) != such$ Return -1 Else Return fnd EndIf Return
I always take care that the array is even. For sure... the possibility that exactly the last entry was searched is nearly non existent. But that gives me bad feeling not to take care of.
|
|
|
Post by Roger Cabo on Dec 17, 2022 19:53:11 GMT 1
To make a value even, is there any easier logic than this? Dim v% = 99 MsgBox (((v% + 1) / 2) * 2) v% = (((v% + 1) / 2) * 2) - 1 // -1 to get even array elements in gb32 Redim a$(v%)
|
|
|
Post by (X) on Dec 17, 2022 19:54:52 GMT 1
There are the Odd() and Even() functions...
Dim a = 1
If Odd(a) a++
|
|
|
Post by Roger Cabo on Dec 17, 2022 20:06:36 GMT 1
There are the Odd() and Even() functions...
Dim a = 1
If Odd(a) a++
Yes true!!! I'm totally focused to shader programming math at this moment. In shader code you can use if (x == y) but that breaks the multi threading technologies and cause shaders to run slow. Here is an example: doesn't use any if statements: www.shadertoy.com/view/Ms2fWW
|
|
|
Post by (X) on Dec 17, 2022 20:12:58 GMT 1
If the number is an integer you could just check the least significant bit for "0" or "1"... Local n% = 1 Trace Btst(n%, 0) ' Should return -1 which is True, which would mean the number is odd.
' So...
If Btst(n%, 0) n%++ ' Would make it even.
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 17, 2022 20:33:38 GMT 1
Dim a$(10) a$(0) = "0" a$(1) = "1" a$(2) = "2" a$(3) = "3" a$(4) = "4" a$(5) = "5" a$(6) = "6" a$(7) = "7" a$(8) = "8" a$(9) = "9" a$(10) = "10"
MsgBox bin_search("10") & " fail!"
ReDim a$(9)
If you forget to sort your example-array, my program will not work. In a numerically sorted array, the 10 would be in the 10th position. But in a sorted string-array, "10" is not in the 10th position! The front part of your sorted string-array would look like this: "0" "1" "10" "2" "3" ... This is also the reason why it doesn't work up to your "10", but it does up to "9" and this has nothing to do with odd or even.
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 17, 2022 21:10:07 GMT 1
I would give that back this other way. If it is found then return -1, otherwise fnd is global and delivers the index. If a$(fnd) = such$ Return -1
And outside the function:
If @bin_search(word$) >= 0 // not found If MsgBox("'" + word$ + "' not found! Insert?", MB_YESNO, "") = IDYES Insert a$(fnd) = word$ : n++ EndIf Else // word found, all is ok. EndIf
// fnd% must be global in bin_search()
|
|
|
Post by Roger Cabo on Dec 17, 2022 22:23:27 GMT 1
I would give that back this other way. If it is found then return -1, otherwise fnd is global and delivers the index. If a$(fnd) = such$ Return -1
And outside the function:
If @bin_search(word$) >= 0 // not found If MsgBox("'" + word$ + "' not found! Insert?", MB_YESNO, "") = IDYES Insert a$(fnd) = word$ : n++ EndIf Else // word found, all is ok. EndIf
// fnd% must be global in bin_search()
Yes my fault.. must be asccii sorted! this is not numerical. Thank you very much for your help!
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 17, 2022 23:03:08 GMT 1
Just look at this second version. It is a example for you to test and insert
Please dont use Ubound(), this is dangerous. Be carfull and use a var like n% for the highest count, this is the only way. a$(n%) is still outside the area you have. a$(n%-1) is the last valid element of the database.
If you also want to insert elements, you have to preallocate the last element a%(n%) with some Chr$(255). Only then the search routine will also return a fnd% of n% if a new value would have to be inserted/appended to the end of the array. (e.g. "zzzz")
|
|
|
Post by Roger Cabo on Dec 18, 2022 0:12:08 GMT 1
Thank you I will have a look into! I use the German.txt as a fixed ANSI QSorted file. All the elements staying the same for ever. The first entry contain the count of elements. This value declares the string dictionary element count. Then I use a secondary file for additional words added by the user. So it's possible to delete wrong words are already added as well. My 3rd dictionary is a German medical words database with about 4000 entries I made by my self from this website. www.pschyrembel.de/_/webpages---krankheiten-systematisch%23D08K0/doc/In the end, I search through all the 3 Dics. ------ There is one thing I though about.. but did not came to an good result. If a word was *not* found then try to make some suggestions. I think this must be done with pattern recognition. I think that's very difficultly.
|
|
webu
Full Member
Posts: 149
|
Post by webu on Dec 18, 2022 0:38:03 GMT 1
If you want to find with case sensitivity, this is easy. If you want reg exp, this will need time, much time, because the known engines for that are slow.
To find a string in a big memory-block, use the fast Boyer-Moore-Alg. For this Boyer-Moore i wrote a machine-language-code for the 68000 Processor in 1992 (30 years ago on a native Atari ST for GFA 3.07)
|
|
|
Post by Roger Cabo on Dec 18, 2022 17:00:59 GMT 1
If you want to find with case sensitivity, this is easy. If you want reg exp, this will need time, much time, because the known engines for that are slow.
To find a string in a big memory-block, use the fast Boyer-Moore-Alg. For this Boyer-Moore i wrote a machine-language-code for the 68000 Processor in 1992 (30 years ago on a native Atari ST for GFA 3.07)
Cool! But currently not my possibilities to write a fully functional Boyer-Moore-Alg :-)
Another small hint:
the bin_search(such$) require a check function if a$() has only one element for any reason.
Global a$(0) bin_search(such$)
Function bin_search(such$, n) Dim i% Dim unten% = -1 Dim fnd% = n If fnd% = 0 // otherwise it hangs in a loop. Return -1 EndIf
....
Further I found a quick solution to delete doubles in a Array and redim the Array as well in one loop.
Since I worked in C# a lot.. I use definitely the Unbound() function to determine how many elements are in any array and walk through. Now in gb32 I use so many dynamic arrays and never had a problem while redim any datatype in real time.
Dim a$(10),i% a$(0) = "-1" a$(1) = "0" a$(2) = "0" a$(3) = "1" a$(4) = "1" a$(5) = "1" a$(6) = "2" a$(7) = "2" a$(8) = "3" a$(9) = "4" a$(10) = "4" QSort a$()
i% = UBound(a$()) While i > 0 If a$(i%) = a$(i% - 1) Delete a$(i%) ReDim a$(UBound(a$()) - 1) EndIf i%-- Wend
|
|