Refine a working regex exclude periods or dots except if a decimal

sxschech

Registered User.
Local time
Yesterday, 18:21
Joined
Mar 2, 2010
Messages
802
Pulling data out of many historical documents. Put together this regex pattern and has been working fine. Unfortunately, the latest batch has a slightly different format. I modified the pattern, however, would like to know if it could be refined to maintain the original output format of previous documents. Otherwise, will I need to create a separate loop or function to remove/replace unwanted characters? Can Regex convert to uppercase or do I need to wrap the result in a function?

Originally was MP space number or CP space number. (C|M)?P\s\d+\.?\d*
MP 29.3

Now is M dot P dot space number and variations such as: (C|M)\.?P\.?\s\d+\.?\d*
M.p 29.3
m.P. 29.3
mp. 29.3


Sample Input. Actual text is read in from document so function has a variable instead of being hard coded.
Origianl Pattern:
? regexallmatches("This is a sample of MP 8 CP 25.2 and extracting CP 4.5 in order to capture this too MP 88.92 afterwards ending with MP 9 during Fri.","(C|M)?P\s\d+\.?\d*", "/")


New Pattern:
? regexallmatches("This is a sample of mp 8 C.P. 25.2 and extracting Cp. 4.5 in order to capture this too c.p. 88.92 afterwards ending with m.p 9 during Fri.","(C|M)\.?P\.?\s\d+\.?\d*", "/")

Current Output:
mp 8/C.P. 25.2/Cp. 4.5/c.p. 88.92/m.p 9

Desired output:
MP 8/CP 25.2/CP 4.5/CP 88.92/MP 9

Code:
Function RegexAllMatches(txt As String, Optional pttrn As String, Optional sepchar As String = ",") As String
'Match all instances of the pattern
'https://stackoverflow.com/questions/44979363/excel-vba-regex-function-that-returns-multiple-matches-into-a-single-cell
'adjusted to similar syntax of existing regex functions
'20240314
    Dim regex As Object
    Dim myMatch As Object
    Dim s As String
    Dim arrayMatches()
    Dim i As Long
    
    Set regex = CreateObject("vbscript.regexp")
    On Error GoTo notFound
    With regex
        .IgnoreCase = True
        .MultiLine = True
        .Global = True
        If Len(pttrn) = 0 Then
            .Pattern = "(Code)[\s][\d]{2,4}"
        Else
            .Pattern = pttrn
        End If
        If .Test(txt) Then
            For Each myMatch In .Execute(txt)
                ReDim Preserve arrayMatches(i)
                arrayMatches(i) = myMatch.Value
                i = i + 1
                's = s & " " & myMatch
            Next
        End If
    End With
    RegexAllMatches = Join(arrayMatches, sepchar) '" ")
    If Len(RegexAllMatches) = 0 Then
        RegexAllMatches = "Not Found"
    End If
    Exit Function
notFound:
    RegexAllMatches = "Not Found"
End Function
 
Code:
Sub test_ram()
    Debug.Print RegexAllMatches( _
        "This is a sample of mp 8 C.P. 25.2 and extracting Cp. 4.5 in order to capture this too c.p. 88.92 afterwards ending with m.p 9 during Fri.", _
        "([CM]\.?P\.?)( \d+\.?\d*)", "/")
End Sub

Function RegexAllMatches(txt As String, pttrn As String, Optional sepchar As String = ",") As String
'Match all instances of the pattern
'https://stackoverflow.com/questions/44979363/excel-vba-regex-function-that-returns-multiple-matches-into-a-single-cell
'adjusted to similar syntax of existing regex functions
'20240314
' revised AWF / ebs17 2024-03-22
    Static oRegex As Object
    Dim oMC As Object     ' MatchCollection
    Dim arrayMatches() As String
    Dim lCount As Long
    Dim i As Long
  
    If oRegex Is Nothing Then Set oRegex = CreateObject("vbscript.regexp")
    On Error GoTo notFound
    With oRegex
        .IgnoreCase = True
        .MultiLine = True
        .Global = True
        .Pattern = pttrn

        Set oMC = .Execute(txt)
        lCount = oMC.Count
        If lCount > 0 Then
            ReDim Preserve arrayMatches(lCount - 1)
            For i = 0 To lCount - 1
                arrayMatches(i) = UCase(Replace(oMC(i).SubMatches(0), ".", "")) & oMC(i).SubMatches(1)
            Next
        End If
    End With
    RegexAllMatches = Join(arrayMatches, sepchar)
    If Len(RegexAllMatches) = 0 Then
        RegexAllMatches = "Not Found"
    End If
    Exit Function
notFound:
    RegexAllMatches = "Not Found"
End Function
 
Thank you Eberhard. Appreciate it. Seems to work with my test document.

I tried with another file and looks like will have to play around with the pattern a bit more as am discovering additional variations and even cases without the keyword MP or CP. I imagine for those, will have to do manual entry since without context of the search term to extract from can't determine if a number is related to what should be output.
 
another test:
Code:
Public Function fnRemoveDot(ByVal sText As String) As String
    Dim i As Integer, Char As String, ln As Integer
    Dim prevChar As String, sNew As String
    fnRemoveDot = sText
    ln = Len(sText)
    For i = 1 To ln
        Char = Mid$(sText, i, 1)
        If Char = "." Then
            If IsNumeric(prevChar) Then
                sNew = sNew & UCase(Char)
            End If
        Else
            sNew = sNew + UCase(Char)
        End If
        prevChar = Char
    Next
    fnRemoveDot = sNew
End Function
 
Last edited:
as am discovering additional variations and even cases without the keyword MP or CP
Of course, you first have to be familiar with such variants before you can take them into account in the pattern used. So you would first have to analyze your texts and collect the variants in order to be able to test everything.
without the keyword MP or CP
That would mean recording numbers down to individual digits (=> 8).
You then have to be sure that the numbers recorded belong to your context and were not used for other purposes. If you only want to take numbers (or other expressions) into account on a case-by-case basis, you need a descriptive environment in addition to the number can apply a rule.
 
I played some more. It is certainly not a complete solution, but it shows in which directions one can expect problems that would then have to be solved separately.
Code:
Sub test_ram()
    Debug.Print RegexAllMatches( _
        "Do5 or 6. This is a sample of mp 8 C.P. 25.2 and extracting Cp. 4.5 in order to capture this too c.p. 88.92" & _
        " afterwards ending with m.p 9 during Fri." & _
        " Now we want to deal with 27 numbers without an identifier, such as 2.45 or 77.2 or MP4", _
        "([CM]\.?P\.?)?( )?(\d+\.?\d*)", "/")
End Sub

Function RegexAllMatches(txt As String, pttrn As String, Optional sepchar As String = ",") As String
'Match all instances of the pattern
'https://stackoverflow.com/questions/44979363/excel-vba-regex-function-that-returns-multiple-matches-into-a-single-cell
'adjusted to similar syntax of existing regex functions
'20240314
' revised AWF /ebs17 2024-03-22
    Static oRegex As Object
    Dim oMC As Object
    Dim arrayMatches() As String
    Dim lCount As Long
    Dim i As Long
    Dim sKeyword As String
    Dim sResult As String
   
    If oRegex Is Nothing Then Set oRegex = CreateObject("vbscript.regexp")
    On Error GoTo notFound
    With oRegex
        .IgnoreCase = True
        .MultiLine = True
        .Global = True
        .Pattern = pttrn
 
        Set oMC = .Execute(txt)
        lCount = oMC.Count
        Debug.Print "Count Matches: ", lCount
        If lCount > 0 Then
            ReDim Preserve arrayMatches(lCount - 1)
            For i = 0 To lCount - 1
                'Debug.Print oMC(i), Len(oMC(i).SubMatches(0)), Len(oMC(i).SubMatches(1)), Len(oMC(i).SubMatches(2)), oMC(i).SubMatches.Count
                'Debug.Print oMC(i).FirstIndex, oMC(i).Length
                If Len(oMC(i).SubMatches(0)) = 0 Then
                    sKeyword = ""
                Else
                    sKeyword = UCase(Replace(oMC(i).SubMatches(0), ".", ""))
                End If
                If Len(sKeyword) > 0 Then
                    sResult = sKeyword & " " & oMC(i).SubMatches(2)
                Else
                    sResult = oMC(i).SubMatches(2)
                End If
                arrayMatches(i) = sResult
            Next
        End If
    End With
    RegexAllMatches = Join(arrayMatches, sepchar)
    If Len(RegexAllMatches) = 0 Then
        RegexAllMatches = "Not Found"
    End If
    Exit Function
notFound:
    RegexAllMatches = "Not Found"
End Function
 
Last edited:
another flavor:
Code:
Public Function fnRemoveDot(ByVal sText As String, ParamArray NeedCaps() As Variant) As String
'
' note:
'
'       Needcaps is a "list" of word that need to be Capitalized, example:
'
'       fnRemoveDot("This is a sample of M.P. 8 C.P. 25.2", "MP","CP")
'
    Dim i As Integer, Char As String, ln As Integer, j As Integer
    Dim prevChar As String, sNew As String, var As Variant
    fnRemoveDot = sText
    ln = Len(sText)
    For i = 1 To ln
        Char = Mid$(sText, i, 1)
        If Char = "." Then
            If IsNumeric(prevChar) Then
                sNew = sNew & Char
            End If
        Else
            sNew = sNew + Char
        End If
        prevChar = Char
    Next
    var = Split(sNew)
    For i = 0 To UBound(var)
        Char = Trim$(var(i))
        For j = 0 To UBound(NeedCaps)
            If (" " & Char & " ") = (" " & NeedCaps(j) & " ") Then
                Char = UCase(NeedCaps(j))
            End If
        Next
        var(i) = Char
    Next
    sNew = Join(var)
    fnRemoveDot = sNew
End Function
 
Thanks all. As the files are being downloaded by month and year I was randomly selecting and opening files to view the data and for a while, things seemed to fit a particular pattern, then it changed. Now am scanning random files after each download. Have made a few adjustments to the regex as new discoveries are made. After I run the code, still need to go in and clean up some of the data, however, at least it is better than before.

Using this one which I renamed slightly based on post #3 from @ebs17
stKeyWords = RegexAllMatchesSubmatches(RemoveEXTRASpaces(RemoveNonASCII(.tables(gblTbl).cell(gblRow, gblCol + 1).range.Text)), "([CMP]\.?[PS]\.?|\sAT)(\s?\d+\.?\d*)", "/")

Since that code didn't seem to work for the previous patterns, that is to say the ones that didn't have periods, I used an if statement to run the original code that I had when the result of above was Not Found.
If stKeyWords = "Not Found" Then stKeyWords = RegexAllMatches(RemoveEXTRASpaces(RemoveNonASCII(.tables(gblTbl).cell(gblRow, gblCol + 1).range.Text)), "([CMP]\.?[PS]\.?|\sAT)(\s?\d+\.?\d*)|C[PT]\s\w+", "/")

--Code from Post #7 seemed to return Not Found when I put in my code, so that is why continue to use from Post #3

@arnelgp I tried your code and seemed that at least for what I was after, results weren't what I was expecting. I will revisit for use in a future project.

@tvanstiphout, thanks for the heads up. I'll take advantage of using regex in vbscript for as long as it's available, even though I have a long way to go to learn and be more effective using it. I found a few links about it from NoLongerSet and Devhut, seems no definitive replacement as of yet?

 
As the discoveries and requirements for text passages to be found become more and more diverse, it becomes increasingly difficult to fit all requirements into a single search pattern.
In addition to the option you use to simply start different searches, there is also the option of alternating patterns.
Symbolic:
Code:
"(pattern1)|(pattern2)"
 
tvanstiphout said:

I did see a quote from Microsoft some time ago, but unfortunately haven't been able to run it to earth that RegEx would be integrated into VBA in a forthcoming release.

Just so long as they don't do it in the half-baked ridiculously verbose way the did in VB.Net!

Of course a sensible person tucks away a copy of both the \SysWOW64 and the \System32 versions of vbscript.dll just in case.
 
Last edited:

Users who are viewing this thread

Back
Top Bottom