String Theory - finding values in a largish string

PurpleEdge2214 · February 15, 2025, 8:04am

I have a string, possibly up to 50K in size, that contains an HTML web page with a form that holds lots of values. There might be multiple sets of data, e.g. a list of products with specifications for each product.

I plan to use String Theory to help find the values in an optimal way and I’m just wondering what the best way to do it would be?

The string is formatted and contains CRLF so I could save it to a file and read the file with the BASIC driver and search each row for the information I need. I plan to use ST.Between to find and extract the values I need if I use this technique. I’d need to keep track of the beginning and end of each product segment in the file.

I’m asking if this is the best technique or if String Theory has other methods that might be more efficient?

e.g. Just load the string into a ST object and then use ST.FindBetween to loop through the string and read the values - I’m not that experienced with ST so not even sure how I’d do this?? How do I keep track of where I am in the string, for example? It also sounds like I’d need more code to do it this way?

Any tips welcome!

Edit: Ah, OK I’ve just seen that the “end” value of FindBetween is updated when the searched value is found, so I can see now how I could loop through the string. Worth a try!

seanh · February 15, 2025, 8:14am

StringTheory will definitely do the job for you, and it’s been optimized for speed quite a bit.
But have a look at xFiles (also CapeSoft), which is sort of another layer on top of StringTheory for manipulating XML. That may be a better fit.

vitesse · February 16, 2025, 12:07am

yes ST can easily do this and it will be a good learning exercise.

Perhaps you mean the ASCII driver, but either way with ST you don’t need any of the file drivers. You read the string into memory and process it there. 50K is small, so you won’t run out of memory the way you might if the string were, say, a gigabyte.

Yell out if you have any problems and make sure to show your code so we can help if you need it. A sample of the data would also be useful. Also if you look at the ST code for FindBetween it says:

!-----------------------------------------------------------------------------------
! Finds the string between the passed left and right delimiter, and returns it. The passed pStart and pEnd
! are set to the start and end position of the returned value in the string. If pStart or pEnd is passed as less
! than or equal to zero then they are set to the start and end of the stored string respectively. Otherwise
! they are used as the bounds for the search. This allows FindBetween to be called multiple times to search
! for multiple occurrences using the same delimiter:
!
! limit  = 0 ! set to zero for end of string
! pStart = 0 ! set to zero or one for start of string
! loop
!     pEnd = limit
!     betweenVal = st.FindBetween('[[', ']]', pStart, pEnd)
!     if pStart = 0
!         break
!     else
!         ! do something with the returned betweenVal
!     end
!     pStart =  pEnd + size(pRight) + 1 ! Reset pStart for next iteration. If not doing pExclusive then pStart = pEnd + 1
! end

so that may be helpful - and I see the same example code is in the docs so no need to actually consult the ST code…

https://www.capesoft.com/docs/StringTheory3/StringTheory.htm#FindBetween

also if you just want the start/end position you can call findBetweenPosition instead, which doesn’t return the value.
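
For the "multiple products" case in the original question, here is a rough sketch of how that documented loop might be adapted. It is untested and makes several assumptions: htmlPage is the string you already hold in memory, and the product delimiters, field markers and variable names are all invented for illustration, so substitute whatever your page actually contains. It also assumes Between can be called with just the two delimiters so that it searches the whole stored string; check the docs for the exact parameters.

```clarion
! Sketch only - the delimiters and field names below are hypothetical.
page       StringTheory              ! the whole HTML page
block      StringTheory              ! one product segment at a time
pStart     LONG
pEnd       LONG
rightTag   STRING('<</tr>')          ! a literal < is written << in a Clarion string constant
prodName   STRING(100)
prodPrice  STRING(20)
  CODE
  page.SetValue(htmlPage)            ! htmlPage is assumed to be your existing string
  pStart = 0                         ! zero means start of string
  LOOP
    pEnd = 0                         ! zero means search to end of string
    block.SetValue(page.FindBetween('<<tr class="product">', rightTag, pStart, pEnd))
    IF pStart = 0                    ! no further product segment found
      BREAK
    END
    ! The fields are looked up inside this segment only, so the segment
    ! boundaries are tracked by pStart/pEnd and nothing else.
    prodName  = block.Between('name="name" value="', '"')
    prodPrice = block.Between('name="price" value="', '"')
    ! ... do something with prodName and prodPrice here ...
    pStart = pEnd + SIZE(rightTag) + 1   ! step past the right delimiter
  END
```

Copying each segment into a second ST object keeps the field searches from running past the end of the current product, which is the bookkeeping you were worried about.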

cheers

Geoff R

vitesse · February 27, 2025, 1:00am

see this later discussion about st.findBetweenPosition()

RchdR · February 27, 2025, 7:20am

I was using RegExes to process the TXA, which is similar: variable in size, with a structure that includes additional “foreign” or “alien” components such as embed code in various languages.

Basically, I started with the first line and looked for a start component, then looked for the end component where one existed, or else for the next start component and backed up a line. I took that block of lines and then repeated the process on it, looking for sections.

The thing with the TXA and other files is that the order of the starting sections within a section can be fixed where they exist, so there will be no end component, and different types of line wrapping can be in use, i.e. CRLF, &| and |. Either StringTheory or RegExes will work, though: just break everything down into smaller chunks and you’ll find it fairly easy, so a webpage should be easy enough.

Thinking of writing your own search engine spider?