UtterAccess.com
X   Site Message
(Message will auto close in 2 seconds)

Welcome Guest ( Log In | Register )

 
Reply to this topicStart new topic
> Vba To Parse Html Dom    
 
   
cat2phat
post Mar 24 2012, 07:42 PM
Post #1

New Member
Posts: 10



I am looking to parse a known website into a text file.

Example of extremly basic website:
CODE
<html> <head>  <title>  </title> </head> <body>  <div>   <div>    <img />   </div>   <div>    <p></p>    <p></p>    <p></p>   </div>  </div> </body></html>


I want to parse the website into a text file so it looks like:
CODE
<html>
<head>
  <title>
  </title>
</head>
<body>
  <div>
   <div>
    <img />
   </div>
   <div>
    <p></p>
    <p></p>
    <p></p>
   </div>
  </div>
</body>
</html>

I think the basics are:

CODE
Sub ParseWebsite()
set outfile
set IE

for each child in IE.Document.All
  if child.length = 0 then
   print node information
  else
   ParseWebsiteChildren(child, " ", outfile)
  end if
next
End Sub

Function ParseWebsiteChildren(obj, spacing, outfile)
if child.length = 0 then
  print node information
else
  for each child in obj
   ParseWebsiteChildren(child, " " & spacing, outfile)
  next
end if
End Function


Any documentation, scriptlets, and such is much appreciated.

This post has been edited by cat2phat: Mar 24 2012, 07:43 PM
Go to the top of the page
 
+
doctor9
post Mar 26 2012, 10:28 AM
Post #2

UtterAccess VIP
Posts: 9,304
From: Wisconsin



cat2phat,

This looks like a basic "stack" exercise. I have no idea how to do this using Access Macros, but I think I can do it in VBA. You'll need to customize it a bit - right now, so it uses your file path/filename structure. The first filename is your existing HTML file. The second filename is the name you want to give to the file you're creating with the formatted HTML.

CODE
Private Sub ParseHTML()

    Dim TextLine
    Dim strStack(25) As String, intStackPosition As Integer, strOut As String

    Open "H:\Parser\Unformatted.txt" For Input As #1 ' Open file.
    Open "H:\Parser\Reformatted.txt" For Output As #2  ' Open file for output.

'   Loop until we reach the end of the input file.
    Do While Not EOF(1)
'       Read an entire line into a string variable from the input file.
        Line Input #1, TextLine

'       Initialize
        intStackPosition = 0

MoreText:
'       If there's any text between HTML tags, output it now
        If InStr(1, TextLine, "<") > 1 Then
            Print #2, Space(intStackPosition * 2) & Left(TextLine, InStr(1, TextLine, "<") - 1)
            TextLine = Mid(TextLine, InStr(1, TextLine, "<"))
        End If
            
'       If the tag starts with a slash, it's an ending tag.
        If Left(TextLine, 2) = "</" Then

'           Output the ending tag with the appropriate indentation, based on the
'           stack position we're currently at.
            Print #2, Space(intStackPosition * 2) & Left(TextLine, InStr(1, TextLine, ">"))

'           Remove the ending tag from the line of text we've read from the file.
            TextLine = Mid(TextLine, InStr(1, TextLine, ">") + 1)

'           Remove the top tag from the stack, making the stack one tag shorter.
            strStack(intStackPosition) = ""
            intStackPosition = intStackPosition - 1
        Else

'           Starting Tag.  Add this tag to the top of the stack.
            intStackPosition = intStackPosition + 1
            strStack(intStackPosition) = Left(TextLine, InStr(1, TextLine, ">"))

'           Output the starting tag with the appropriate indentation, based on the
'           stack position we're currently at.
            Print #2, Space(intStackPosition * 2) & Left(TextLine, InStr(1, TextLine, ">"))

'           Remove the starting tag from the line of text we've read from the file.
            TextLine = Mid(TextLine, InStr(1, TextLine, ">") + 1)
        End If

'       If there's more text in this line we read from the file, loop back and repeat.
        If Len(Trim(TextLine)) > 0 Then GoTo MoreText
    Loop

'   Cleanup - close both files.
    Close #1
    Close #2
    
End Sub


Hope this helps,

Dennis
Go to the top of the page
 
+
cat2phat
post Mar 26 2012, 03:04 PM
Post #3

New Member
Posts: 10



Here is what I came up with over the weekend. It is crude and I am more than willing to take advice on how to improve.

CODE
Option Compare Database
Option Explicit

Sub test_buildDOMTree()
    Dim oFS
    Dim outFile
    Dim IE As MSHTML.HTMLDocument
    Dim oEle As Object
    Dim oFileToSave
    
    oFileToSave = "c:\Test\WebSiteExtract.txt"

    'use the function GetIDByLocation to find the open window that has a particular location
    'Example - Set IE = GetIDByLocation("http:\\www.google.com")
    'use the function GetIDByTitle to find the open window that has a particular title
    'Example - Set IE = GetIDByTitle("Google")
    Set IE = GetIEByTitle("This is a Title")

    Set oFS = CreateObject("Scripting.FileSystemObject")
    Set outFile = oFS.CreateTextFile(oFileToSave, True)

    If IE Is Nothing Then
        MsgBox "Not Found"
        Exit Sub
    End If
    
    buildDOMTree IE, outFile
    
    'Alternatively if the site has IFRAMES and the content of the IFRAME is needed
    'we can use the contendWindow.Document method to retreive the HTML from that IFRAME
    'Dim Ifrm
    'For Each Ifrm In IE.Document.getElementsByTagName("IFRAME")
    '    If Ifrm.ID = "detail" Then
    '        'buildDOMTree Ifrm.contentWindow.Document, outFile
    '        Exit For
    '    End If
    'Next

    outFile.Close
    Set outFile = Nothing
    Set oFS = Nothing
End Sub

Function GetIEByLocation(strLocation As String) As MSHTML.HTMLDocument
    Set GetIEByLocation = GetIE(strLocation, 1)
End Function

Function GetIEByTitle(strTitle As String) As MSHTML.HTMLDocument
    Set GetIEByTitle = GetIE(strTitle, 2)
End Function

Function GetIE(sAddress As String, intBy As Integer) As MSHTML.HTMLDocument
    Dim objShell As Object
    Dim objShellWindows As Object
    Dim o As Object
    Dim retVal As Object
    Dim sURL As String
    
    Set retVal = Nothing
    Set objShell = CreateObject("Shell.Application")
    Set objShellWindows = objShell.Windows
    
    'see if IE is already open
    For Each o In objShellWindows
        sURL = ""
        On Error Resume Next
        Select Case intBy
            Case 1:
                sURL = o.Document.Location
            Case 2:
                sURL = o.Document.title
        End Select
        
        On Error GoTo 0
        If sURL <> "" Then
              'Debug.Print sURL
              If sURL Like "*" & sAddress & "*" Then
              Set retVal = o.Document
              Exit For
            End If
        End If
    Next o
    Set GetIE = retVal
End Function

Function buildDOMTree(ByRef oParent As MSHTML.HTMLElementCollection, ByRef outFile, Optional strSpacing As String = "")
    Dim i As Integer
    Dim oChild As MSHTML.HTMLElementCollection

    On Error GoTo getTagContent
        For Each oChild In oParent.childNodes
            'nodeType
                '1 = Element - HTML Tags (example - <HTML><HEAD><TITLE>)
                '3 = Text - Texted within HTML TAGS (example - <P>This is texed within HTML TAGS</P>)
            Debug.Print oChild.nodeType
            If oChild.nodeType = 1 Then
                Debug.Print oChild.nodeName
                outFile.write strSpacing & "<" & oChild.nodeName
                buildDOMTree_GetAttributes oChild, outFile
                outFile.Writeline ">"
                buildDOMTree oChild, outFile, strSpacing & " "
                outFile.Writeline strSpacing & "</" & oChild.nodeName & ">"
            Else
                Debug.Print oChild.nodeValue
                outFile.Writeline strSpacing & oChild.nodeValue
            End If
nextElement:
        Next
    On Error GoTo 0
Exit Function

'An error occurs when there is content between tags
'example - <p>I WILL CAUSE AN ERROR CAUSE I AM NOT AN ELEMENT</p>
getTagContent:
    outFile.Writeline strSpacing & oParent.childNodes(0).nodeValue
    Resume nextElement
End Function

Function buildDOMTree_GetAttributes(ByRef oParent, ByRef outFile)
    Dim oAttr
    
    'Go through all known attributes
    For Each oAttr In oParent.Attributes
        'Was it defined within the HTML document
        If oAttr.specified Then
            outFile.write " " & oAttr.nodeName & "=""" & oAttr.nodeValue & """"
        End If
    Next
End Function


This post has been edited by cat2phat: Mar 26 2012, 03:05 PM
Go to the top of the page
 
+

Thank you for your support! Reply to this topicStart new topic

Jump To Forum:
 



RSS Go to Top  ·  Lo-Fi Version Time is now: 25th May 2013 - 01:20 AM

Tag cloud: