My Assistant
![]() ![]() |
|
|
Oct 6 2010, 01:52 PM
Post
#1
|
|
|
New Member Posts: 11 |
I am looking for a way to remove outliers from my data. This process was previously done in an excel sheet by defining the minimum allowable value as 2 standard deviations below the average and the maximum allowable value as 2 standard deviations above the average. The values in the list had conditional formating that caused a value to turn red if it fell outside of the acceptable range. This value was deleted so that it would not be used in the standard deviation calculation and therefore change the acceptable range. The process would continue until no values were red. Any ideas how to do this automatically in with a list of values in a table (or query) in access?
This post has been edited by Kosby: Oct 6 2010, 02:02 PM |
|
|
|
Oct 7 2010, 07:23 AM
Post
#2
|
|
|
UtterAccess VIP Posts: 2,336 From: West Michigan |
Can you give some example data? A list of values, indicate those that *should* be outliers, and what you want the output to look like?
|
|
|
|
Oct 7 2010, 12:05 PM
Post
#3
|
|
|
New Member Posts: 11 |
An example of the initial data is below. I want to remove any values that do not fall within two standard deviations of the average of the group.
1.3064 1.5828 1.4097 0.7527 1.2732 1.3531 1.4139 1.4020 1.3374 st dev = 0.2286 average = 1.3146 min = 0.8574 max = 1.7717 So in the first iteration, 0.7527 is removed because it falls outside of the range, but that causes the standard deviation and average of the group to change. Below is the next iteration. 1.3064 1.5828 1.4097 1.2732 1.3531 1.4139 1.402 1.3374 St dev = 0.0947 average = 1.3848 min = 1.1953 max = 1.5743 In the second iteration 1.5828 is above the max so it is removed. That leaves us with the data below which contains no outliers. 1.3064 1.4097 1.2732 1.3531 1.4139 1.402 1.3374 St dev = 0.0548 average = 1.3565 min = 1.2469 max = 1.4661 |
|
|
|
Oct 7 2010, 12:31 PM
Post
#4
|
|
|
Rent-an-Admin Posts: 8,933 From: Banana Republic |
Since we need the process to be iterative, it's possible you need to use VBA with some Dynamic SQL.
An idea off the top of head would be something like this: CODE Set ars = db.OpenRecordset("SELECT StdDev([Col]) AS ColDev, Avg([Col]) AS ColAvg FROM aTable;") CurrDev = rs.Fields("ColDev") CurrAvg = rs.Fields("ColAvg") Set irs = db.OpenRecordset("SELECT COUNT(*) FROM aTable WHERE [Col] > " & CurrDev - CurrAvg & " AND [Col] < " & CurrDev + CurrAvg ";") Do Until irs.EOF Set ars = db.OpenRecordset("SELECT StdDev([Col]) AS ColDev, Avg([Col]) AS ColAvg FROM aTable WHERE [Col] BETWEEN " & CurrDev - CurrAvg & " AND " & CurrDev + CurrAvg & ";") CurrDev = rs.Fields("ColDev") CurrAvg = rs.Fields("ColAvg") Set irs = db.OpenRecordset("SELECT COUNT(*) FROM aTable WHERE [Col] > " & CurrDev - CurrAvg & " AND [Col] < " & CurrDev + CurrAvg ";") Loop Debug.Print "Final StdDev: " & CurrDev Debug.Print "Final Avg: " & CurrAvg (untested aircode) |
|
|
|
Oct 7 2010, 01:44 PM
Post
#5
|
|
|
UtterAccess VIP Posts: 2,336 From: West Michigan |
Well, I haven't been able to code it, and I can't get Banana's code to come up with the right answers.
But I can do it with a series of queries. Suppose Table1 is your original table. Query1: SELECT Field1 FROM Table1 WHERE Field1>(Select Avg([Field1])-StDev([field1])*2 AS [Min] from Table1) And Field1<(Select Avg([Field1])+StDev([field1])*2 AS [max] from Table1) will produce your first round results Query2: SELECT Field1 FROM Query1 WHERE Field1>(Select Avg([Field1])-StDev([field1])*2 AS [Min] from Query1) And Field1<(Select Avg([Field1])+StDev([field1])*2 AS [max] from Query1); produces your final result. If there were more outliers, you could create another query based on Query2 and so forth. You only have to run the last query in the sequence because all the others get run first. This is less than optimum, but perhaps you can build from there. |
|
|
|
Oct 7 2010, 03:36 PM
Post
#6
|
|
|
UtterAccess VIP Posts: 5,139 From: Dunbar, WV |
You could put a flag in the table and flag the ones that are removed. Recalculate the standard deviation with unfllagged records and do the elimination again with unflagged records. Repeat till done.
Robert |
|
|
|
Jan 20 2012, 12:52 PM
Post
#7
|
|
|
UtterAccess Member Posts: 27 |
I found this old thread while I was trying to do something very similar.
I used the outline from the comments below (thanks!) to come up with the following code that seems to work OK. I have made no effort to make it a clean universal function, so I apologize for the overuse of table and field names (and some data conditions) that are unique to the data I'm using. It might have been better to write this as a function that has a data set passed to it as a parameter, and it returns a dataset with outliers removed; but I did want to keep a record of what points were deleted. I used Chauvenet's criterion to remove outliers (rather than just a fixed number of std deviations away from the mean). I added a 2 x multiplier in there to take into account that I'm looking at the furthest point from the mean each time, and that this could be to the left or the right of the mean - I think I've interpreted Chauvenet's criterion correctly, but I wouldn't guarantee it. And, thanks to the author of the NormsDist function. CODE Function removeOutliers() As Boolean
Dim db As Database Set db = CurrentDb() Dim qtemp, q2temp, q3temp As QueryDef Dim rtemp, r2temp, r3temp As Recordset Dim sql As String Dim p As Double 'empty the previous costs table sql = "DELETE * FROM tblCOGSoutlier" db.Execute (sql) 'put data into the table where qty>0, costs >0, inv date is after Jun 2011 (prior data suspect) sql = "INSERT INTO tblCOGSoutlier ( PartNum, InvDate, CostEa, fromQty, FlaggedOutlier )" sql = sql & " SELECT tblCOGSMkTbl.[Part Num], CDate([Inv Date]) AS InvDate, [total cost extended]/[qty shipped] AS CostEa," sql = sql & " tblCOGSMkTbl.[Qty Shipped], False AS Expr1 FROM tblCOGSMkTbl" sql = sql & " WHERE (((CDate([Inv Date]))>=#Jul/1/2011#) AND (([total cost extended]/[qty shipped])>0) AND ((tblCOGSMkTbl.[Qty Shipped])>0));" db.Execute (sql) 'inflate for multiples? 'not sure about this.... will make for clustered and non normal distribution, but high qtys are more important than low qtys.... 'after some thought, strategy is remove outliers before considering quantities per data point 'for each part number sql = "SELECT tblCOGSoutlier.PartNum FROM tblCOGSoutlier GROUP BY tblCOGSoutlier.PartNum;" Set qtemp = db.CreateQueryDef("", sql) Set rtemp = qtemp.OpenRecordset(DB_OPEN_SNAPSHOT) While Not rtemp.EOF 'get the mean and stddev for the costs not flagged as outliers sql = "SELECT tblCOGSoutlier.PartNum, StDev(tblCOGSoutlier.[CostEa]) AS StDevOfCostEa, " sql = sql & "Avg(tblCOGSoutlier.CostEa) AS AvgOfCostEa, Count(CostEa) AS CountOfCost FROM tblCOGSoutlier " sql = sql & "where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And " sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) GROUP BY tblCOGSoutlier.PartNum;" Set q2temp = db.CreateQueryDef("", sql) Set r2temp = q2temp.OpenRecordset(DB_OPEN_SNAPSHOT) 'get the cost record for this item where the point is furthest from the mean 'select top 1 ordered by abs(cost-mean) sql = "SELECT TOP 1 tblCOGSoutlier.LineID, tblCOGSoutlier.FlaggedOutlier, tblCOGSoutlier.PartNum, " sql = sql & " tblCOGSoutlier.CostEa, Abs([costEa]-" & r2temp!AvgOfCostEa & ") AS Away" sql = sql & " FROM tblCOGSoutlier where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And " sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) ORDER BY Abs([costEa]-" & r2temp!AvgOfCostEa & ") DESC;" Set q3temp = db.CreateQueryDef("", sql) Set r3temp = q3temp.OpenRecordset(dbOpenDynaset, dbInconsistent, dbOptimistic) 'while the point is too far away.... 'MsgBox (r2temp!PartNum & Chr$(13) & "Away: " & r3temp!away & Chr$(13) & "Ave: " & r2temp!AvgOfCostEa & Chr$(13) & "SDev: " & r2temp!StDevOfCostEa) If Not (IsNull(r2temp!StDevOfCostEa)) Then If (r2temp!StDevOfCostEa > 0) Then 'using Chauvenet's criterion..... p = 2 * (1 - SNorm2(r3temp!away / r2temp!StDevOfCostEa)) * r2temp!CountOfCOst 'MsgBox (r2temp!PartNum & Chr$(13) & "Away: " & r3temp!away & Chr$(13) & "Ave: " & r2temp!AvgOfCostEa & Chr$(13) & "SDev: " & r2temp!StDevOfCostEa & Chr$(13) & "Count: " & r2temp!CountOfCOst & Chr$(13) & "P: " & p) While p < 0.5 'flag the point r3temp.MoveFirst r3temp.Edit r3temp!FlaggedOutlier = True r3temp.Update 'get the records that drop this point sql = "SELECT TOP 1 tblCOGSoutlier.LineID, tblCOGSoutlier.FlaggedOutlier, tblCOGSoutlier.PartNum, " sql = sql & " tblCOGSoutlier.CostEa, Abs([costEa]-" & r2temp!AvgOfCostEa & ") AS Away" sql = sql & " FROM tblCOGSoutlier where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And " sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) ORDER BY Abs([costEa]-" & r2temp!AvgOfCostEa & ") DESC;" Set q3temp = db.CreateQueryDef("", sql) Set r3temp = q3temp.OpenRecordset(dbOpenDynaset, dbInconsistent, dbOptimistic) 'recalculate p p = 2 * (1 - SNorm2(r3temp!away / r2temp!StDevOfCostEa)) * r2temp!CountOfCOst Wend End If End If 'to get here, only non outliers are left.... 'get the next part number rtemp.MoveNext Wend r3temp.Close q3temp.Close r2temp.Close q2temp.Close rtemp.Close qtemp.Close removeOutliers = True End Function '*********************************************************************** '* Cumulative Standard Normal Distribution * '* (this function provides similar result as NORMSDIST( ) on Excel) * '* Source: http://www.geocities.com/WallStreet/9245/vba6.htm * '*********************************************************************** Public Function SNorm2(z As Double) As Double Const c1 = 2.506628 Const c2 = 0.3193815 Const c3 = -0.3565638 Const c4 = 1.7814779 Const c5 = -1.821256 Const c6 = 1.3302744 Dim w As Double, x As Double, y As Double If z > 0 Or z = 0 Then w = 1 Else w = -1 End If y = 1 / (1 + 0.231649 * w * z) x = c6 x = y * x + c5 x = y * x + c4 x = y * x + c3 x = y * x + c2 SNorm2 = 0.5 + w * (0.5 - (Exp(-z * z / 2) / c1) * y * x) End Function |
|
|
|
![]() ![]() |
|
Go to Top · Lo-Fi Version | Time is now: 20th June 2013 - 06:40 AM |