My Assistant
![]() ![]() |
|
|
Mar 24 2012, 07:42 PM
Post
#1
|
|
|
New Member Posts: 10 |
I am looking to parse a known website into a text file.
Example of extremly basic website: CODE <html> <head> <title> </title> </head> <body> <div> <div> <img /> </div> <div> <p></p> <p></p> <p></p> </div> </div> </body></html> I want to parse the website into a text file so it looks like: CODE <html> <head> <title> </title> </head> <body> <div> <div> <img /> </div> <div> <p></p> <p></p> <p></p> </div> </div> </body> </html> I think the basics are: CODE Sub ParseWebsite() set outfile set IE for each child in IE.Document.All if child.length = 0 then print node information else ParseWebsiteChildren(child, " ", outfile) end if next End Sub Function ParseWebsiteChildren(obj, spacing, outfile) if child.length = 0 then print node information else for each child in obj ParseWebsiteChildren(child, " " & spacing, outfile) next end if End Function Any documentation, scriptlets, and such is much appreciated. This post has been edited by cat2phat: Mar 24 2012, 07:43 PM |
|
|
|
Mar 26 2012, 10:28 AM
Post
#2
|
|
|
UtterAccess VIP Posts: 9,304 From: Wisconsin |
cat2phat,
This looks like a basic "stack" exercise. I have no idea how to do this using Access Macros, but I think I can do it in VBA. You'll need to customize it a bit - right now, so it uses your file path/filename structure. The first filename is your existing HTML file. The second filename is the name you want to give to the file you're creating with the formatted HTML. CODE Private Sub ParseHTML() Dim TextLine Dim strStack(25) As String, intStackPosition As Integer, strOut As String Open "H:\Parser\Unformatted.txt" For Input As #1 ' Open file. Open "H:\Parser\Reformatted.txt" For Output As #2 ' Open file for output. ' Loop until we reach the end of the input file. Do While Not EOF(1) ' Read an entire line into a string variable from the input file. Line Input #1, TextLine ' Initialize intStackPosition = 0 MoreText: ' If there's any text between HTML tags, output it now If InStr(1, TextLine, "<") > 1 Then Print #2, Space(intStackPosition * 2) & Left(TextLine, InStr(1, TextLine, "<") - 1) TextLine = Mid(TextLine, InStr(1, TextLine, "<")) End If ' If the tag starts with a slash, it's an ending tag. If Left(TextLine, 2) = "</" Then ' Output the ending tag with the appropriate indentation, based on the ' stack position we're currently at. Print #2, Space(intStackPosition * 2) & Left(TextLine, InStr(1, TextLine, ">")) ' Remove the ending tag from the line of text we've read from the file. TextLine = Mid(TextLine, InStr(1, TextLine, ">") + 1) ' Remove the top tag from the stack, making the stack one tag shorter. strStack(intStackPosition) = "" intStackPosition = intStackPosition - 1 Else ' Starting Tag. Add this tag to the top of the stack. intStackPosition = intStackPosition + 1 strStack(intStackPosition) = Left(TextLine, InStr(1, TextLine, ">")) ' Output the starting tag with the appropriate indentation, based on the ' stack position we're currently at. Print #2, Space(intStackPosition * 2) & Left(TextLine, InStr(1, TextLine, ">")) ' Remove the starting tag from the line of text we've read from the file. TextLine = Mid(TextLine, InStr(1, TextLine, ">") + 1) End If ' If there's more text in this line we read from the file, loop back and repeat. If Len(Trim(TextLine)) > 0 Then GoTo MoreText Loop ' Cleanup - close both files. Close #1 Close #2 End Sub Hope this helps, Dennis |
|
|
|
Mar 26 2012, 03:04 PM
Post
#3
|
|
|
New Member Posts: 10 |
Here is what I came up with over the weekend. It is crude and I am more than willing to take advice on how to improve.
CODE Option Compare Database
Option Explicit Sub test_buildDOMTree() Dim oFS Dim outFile Dim IE As MSHTML.HTMLDocument Dim oEle As Object Dim oFileToSave oFileToSave = "c:\Test\WebSiteExtract.txt" 'use the function GetIDByLocation to find the open window that has a particular location 'Example - Set IE = GetIDByLocation("http:\\www.google.com") 'use the function GetIDByTitle to find the open window that has a particular title 'Example - Set IE = GetIDByTitle("Google") Set IE = GetIEByTitle("This is a Title") Set oFS = CreateObject("Scripting.FileSystemObject") Set outFile = oFS.CreateTextFile(oFileToSave, True) If IE Is Nothing Then MsgBox "Not Found" Exit Sub End If buildDOMTree IE, outFile 'Alternatively if the site has IFRAMES and the content of the IFRAME is needed 'we can use the contendWindow.Document method to retreive the HTML from that IFRAME 'Dim Ifrm 'For Each Ifrm In IE.Document.getElementsByTagName("IFRAME") ' If Ifrm.ID = "detail" Then ' 'buildDOMTree Ifrm.contentWindow.Document, outFile ' Exit For ' End If 'Next outFile.Close Set outFile = Nothing Set oFS = Nothing End Sub Function GetIEByLocation(strLocation As String) As MSHTML.HTMLDocument Set GetIEByLocation = GetIE(strLocation, 1) End Function Function GetIEByTitle(strTitle As String) As MSHTML.HTMLDocument Set GetIEByTitle = GetIE(strTitle, 2) End Function Function GetIE(sAddress As String, intBy As Integer) As MSHTML.HTMLDocument Dim objShell As Object Dim objShellWindows As Object Dim o As Object Dim retVal As Object Dim sURL As String Set retVal = Nothing Set objShell = CreateObject("Shell.Application") Set objShellWindows = objShell.Windows 'see if IE is already open For Each o In objShellWindows sURL = "" On Error Resume Next Select Case intBy Case 1: sURL = o.Document.Location Case 2: sURL = o.Document.title End Select On Error GoTo 0 If sURL <> "" Then 'Debug.Print sURL If sURL Like "*" & sAddress & "*" Then Set retVal = o.Document Exit For End If End If Next o Set GetIE = retVal End Function Function buildDOMTree(ByRef oParent As MSHTML.HTMLElementCollection, ByRef outFile, Optional strSpacing As String = "") Dim i As Integer Dim oChild As MSHTML.HTMLElementCollection On Error GoTo getTagContent For Each oChild In oParent.childNodes 'nodeType '1 = Element - HTML Tags (example - <HTML><HEAD><TITLE>) '3 = Text - Texted within HTML TAGS (example - <P>This is texed within HTML TAGS</P>) Debug.Print oChild.nodeType If oChild.nodeType = 1 Then Debug.Print oChild.nodeName outFile.write strSpacing & "<" & oChild.nodeName buildDOMTree_GetAttributes oChild, outFile outFile.Writeline ">" buildDOMTree oChild, outFile, strSpacing & " " outFile.Writeline strSpacing & "</" & oChild.nodeName & ">" Else Debug.Print oChild.nodeValue outFile.Writeline strSpacing & oChild.nodeValue End If nextElement: Next On Error GoTo 0 Exit Function 'An error occurs when there is content between tags 'example - <p>I WILL CAUSE AN ERROR CAUSE I AM NOT AN ELEMENT</p> getTagContent: outFile.Writeline strSpacing & oParent.childNodes(0).nodeValue Resume nextElement End Function Function buildDOMTree_GetAttributes(ByRef oParent, ByRef outFile) Dim oAttr 'Go through all known attributes For Each oAttr In oParent.Attributes 'Was it defined within the HTML document If oAttr.specified Then outFile.write " " & oAttr.nodeName & "=""" & oAttr.nodeValue & """" End If Next End Function This post has been edited by cat2phat: Mar 26 2012, 03:05 PM |
|
|
|
![]() ![]() |
|
Go to Top · Lo-Fi Version | Time is now: 25th May 2013 - 01:20 AM |