Full Version: Removing outliers

I am looking for a way to remove outliers from my data. This process was previously done in an excel sheet by defining the minimum allowable value as 2 standard deviations below the average and the maximum allowable value as 2 standard deviations above the average. The values in the list had conditional formating that caused a value to turn red if it fell outside of the acceptable range. This value was deleted so that it would not be used in the standard deviation calculation and therefore change the acceptable range. The process would continue until no values were red. Any ideas how to do this automatically in with a list of values in a table (or query) in access?

Can you give some example data? A list of values, indicate those that *should* be outliers, and what you want the output to look like?

An example of the initial data is below. I want to remove any values that do not fall within two standard deviations of the average of the group.

.3064

1.5828

1.4097

0.7527

1.2732

1.3531

1.4139

1.4020

1.3374

st dev = 0.2286

average = 1.3146

min = 0.8574

max = 1.7717

So in the first iteration, 0.7527 is removed because it falls outside of the range, but that causes the standard deviation and average of the group to change. Below is the next iteration.

1.3064

1.5828

1.4097

1.2732

1.3531

1.4139

1.402

1.3374

St dev = 0.0947

average = 1.3848

min = 1.1953

max = 1.5743

In the second iteration 1.5828 is above the max so it is removed. That leaves us with the data below which contains no outliers.

1.3064

1.4097

1.2732

1.3531

1.4139

1.402

1.3374

St dev = 0.0548

average = 1.3565

min = 1.2469

max = 1.4661

.3064

1.5828

1.4097

0.7527

1.2732

1.3531

1.4139

1.4020

1.3374

st dev = 0.2286

average = 1.3146

min = 0.8574

max = 1.7717

So in the first iteration, 0.7527 is removed because it falls outside of the range, but that causes the standard deviation and average of the group to change. Below is the next iteration.

1.3064

1.5828

1.4097

1.2732

1.3531

1.4139

1.402

1.3374

St dev = 0.0947

average = 1.3848

min = 1.1953

max = 1.5743

In the second iteration 1.5828 is above the max so it is removed. That leaves us with the data below which contains no outliers.

1.3064

1.4097

1.2732

1.3531

1.4139

1.402

1.3374

St dev = 0.0548

average = 1.3565

min = 1.2469

max = 1.4661

Since we need the process to be iterative, it's possible you need to use VBA with some Dynamic SQL.

On idea off the top of head would be something like this:

(untested aircode)

On idea off the top of head would be something like this:

CODE

Set ars = db.OpenRecordset("SELECT StdDev([Col]) AS ColDev, Avg([Col]) AS ColAvg FROM aTable;")

CurrDev = rs.Fields("ColDev")

CurrAvg = rs.Fields("ColAvg")

Set irs = db.OpenRecordset("SELECT COUNT(*) FROM aTable WHERE [Col] > " & CurrDev - CurrAvg & " AND [Col] < " & CurrDev + CurrAvg ";")

Do Until irs.EOF

Set ars = db.OpenRecordset("SELECT StdDev([Col]) AS ColDev, Avg([Col]) AS ColAvg FROM aTable WHERE [Col] BETWEEN " & CurrDev - CurrAvg & " AND " & CurrDev + CurrAvg & ";")

CurrDev = rs.Fields("ColDev")

CurrAvg = rs.Fields("ColAvg")

Set irs = db.OpenRecordset("SELECT COUNT(*) FROM aTable WHERE [Col] > " & CurrDev - CurrAvg & " AND [Col] < " & CurrDev + CurrAvg ";")

Loop

Debug.Print "Final StdDev: " & CurrDev

Debug.Print "Final Avg: " & CurrAvg

CurrDev = rs.Fields("ColDev")

CurrAvg = rs.Fields("ColAvg")

Set irs = db.OpenRecordset("SELECT COUNT(*) FROM aTable WHERE [Col] > " & CurrDev - CurrAvg & " AND [Col] < " & CurrDev + CurrAvg ";")

Do Until irs.EOF

Set ars = db.OpenRecordset("SELECT StdDev([Col]) AS ColDev, Avg([Col]) AS ColAvg FROM aTable WHERE [Col] BETWEEN " & CurrDev - CurrAvg & " AND " & CurrDev + CurrAvg & ";")

CurrDev = rs.Fields("ColDev")

CurrAvg = rs.Fields("ColAvg")

Set irs = db.OpenRecordset("SELECT COUNT(*) FROM aTable WHERE [Col] > " & CurrDev - CurrAvg & " AND [Col] < " & CurrDev + CurrAvg ";")

Loop

Debug.Print "Final StdDev: " & CurrDev

Debug.Print "Final Avg: " & CurrAvg

(untested aircode)

Well, I haven't been able to code it, and I can't get Banana's code to come up with the right answers.

But I can do it with a series of queries. Suppose Table1 is your original table.

Query1:

SELECT Field1

FROM Table1

WHERE Field1>(Select Avg([Field1])-StDev([field1])*2 AS [Min] from Table1) And Field1<(Select Avg([Field1])+StDev([field1])*2 AS [max] from Table1)

will produce your first round results

Query2:

SELECT Field1

FROM Query1

WHERE Field1>(Select Avg([Field1])-StDev([field1])*2 AS [Min] from Query1) And Field1<(Select Avg([Field1])+StDev([field1])*2 AS [max] from Query1);

produces your final result. If there were more outliers, you could create another query based on Query2 and so forth. You only have to run the last query in the sequence because all the others get run first.

This is less than optimum, but perhaps you can build from there.

But I can do it with a series of queries. Suppose Table1 is your original table.

Query1:

SELECT Field1

FROM Table1

WHERE Field1>(Select Avg([Field1])-StDev([field1])*2 AS [Min] from Table1) And Field1<(Select Avg([Field1])+StDev([field1])*2 AS [max] from Table1)

will produce your first round results

Query2:

SELECT Field1

FROM Query1

WHERE Field1>(Select Avg([Field1])-StDev([field1])*2 AS [Min] from Query1) And Field1<(Select Avg([Field1])+StDev([field1])*2 AS [max] from Query1);

produces your final result. If there were more outliers, you could create another query based on Query2 and so forth. You only have to run the last query in the sequence because all the others get run first.

This is less than optimum, but perhaps you can build from there.

You could put a flag in the table and flag the ones that are removed. Recalculate the standard deviation with unfllagged records and do the elimination again with unflagged records. Repeat till done.

obert

obert

I found this old thread while I was trying to do something very similar.

Oused the outline from the comments below (thanks!) to come up with the following code that seems to work OK.

I have made no effort to make it a clean universal function, so I apologize for the overuse of table and field names (and some data conditions) that are unique to the data I'm using.

It might have been better to write this as a function that has a data set passed to it as a parameter, and it returns a dataset with outliers removed; but I did want to keep a record of what points were deleted.

I used Chauvenet's criterion to remove outliers (rather than just a fixed number of std deviations away from the mean). I added a 2 x multiplier in there to take into account that I'm looking at the furthest point from the mean each time, and that this could be to the left or the right of the mean - I think I've interpreted Chauvenet's criterion correctly, but I wouldn't guarantee it.

And, thanks to the author of the NormsDist function.

Oused the outline from the comments below (thanks!) to come up with the following code that seems to work OK.

I have made no effort to make it a clean universal function, so I apologize for the overuse of table and field names (and some data conditions) that are unique to the data I'm using.

It might have been better to write this as a function that has a data set passed to it as a parameter, and it returns a dataset with outliers removed; but I did want to keep a record of what points were deleted.

I used Chauvenet's criterion to remove outliers (rather than just a fixed number of std deviations away from the mean). I added a 2 x multiplier in there to take into account that I'm looking at the furthest point from the mean each time, and that this could be to the left or the right of the mean - I think I've interpreted Chauvenet's criterion correctly, but I wouldn't guarantee it.

And, thanks to the author of the NormsDist function.

CODE

Function removeOutliers() As Boolean

Dim db As Database

Set db = CurrentDb()

Dim qtemp, q2temp, q3temp As QueryDef

Dim rtemp, r2temp, r3temp As Recordset

Dim sql As String

Dim p As Double

'empty the previous costs table

sql = "DELETE * FROM tblCOGSoutlier"

db.Execute (sql)

'put data into the table where qty>0, costs >0, inv date is after Jun 2011 (prior data suspect)

sql = "INSERT INTO tblCOGSoutlier ( PartNum, InvDate, CostEa, fromQty, FlaggedOutlier )"

sql = sql & " SELECT tblCOGSMkTbl.[Part Num], CDate([Inv Date]) AS InvDate, [total cost extended]/[qty shipped] AS CostEa,"

sql = sql & " tblCOGSMkTbl.[Qty Shipped], False AS Expr1 FROM tblCOGSMkTbl"

sql = sql & " WHERE (((CDate([Inv Date]))>=#Jul/1/2011#) AND (([total cost extended]/[qty shipped])>0) AND ((tblCOGSMkTbl.[Qty Shipped])>0));"

db.Execute (sql)

'inflate for multiples?

'not sure about this.... will make for clustered and non normal distribution, but high qtys are more important than low qtys....

'after some thought, strategy is remove outliers before considering quantities per data point

'for each part number

sql = "SELECT tblCOGSoutlier.PartNum FROM tblCOGSoutlier GROUP BY tblCOGSoutlier.PartNum;"

Set qtemp = db.CreateQueryDef("", sql)

Set rtemp = qtemp.OpenRecordset(DB_OPEN_SNAPSHOT)

While Not rtemp.EOF

'get the mean and stddev for the costs not flagged as outliers

sql = "SELECT tblCOGSoutlier.PartNum, StDev(tblCOGSoutlier.[CostEa]) AS StDevOfCostEa, "

sql = sql & "Avg(tblCOGSoutlier.CostEa) AS AvgOfCostEa, Count(CostEa) AS CountOfCost FROM tblCOGSoutlier "

sql = sql & "where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And "

sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) GROUP BY tblCOGSoutlier.PartNum;"

Set q2temp = db.CreateQueryDef("", sql)

Set r2temp = q2temp.OpenRecordset(DB_OPEN_SNAPSHOT)

'get the cost record for this item where the point is furthest from the mean

'select top 1 ordered by abs(cost-mean)

sql = "SELECT TOP 1 tblCOGSoutlier.LineID, tblCOGSoutlier.FlaggedOutlier, tblCOGSoutlier.PartNum, "

sql = sql & " tblCOGSoutlier.CostEa, Abs([costEa]-" & r2temp!AvgOfCostEa & ") AS Away"

sql = sql & " FROM tblCOGSoutlier where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And "

sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) ORDER BY Abs([costEa]-" & r2temp!AvgOfCostEa & ") DESC;"

Set q3temp = db.CreateQueryDef("", sql)

Set r3temp = q3temp.OpenRecordset(dbOpenDynaset, dbInconsistent, dbOptimistic)

'while the point is too far away....

'MsgBox (r2temp!PartNum & Chr$(13) & "Away: " & r3temp!away & Chr$(13) & "Ave: " & r2temp!AvgOfCostEa & Chr$(13) & "SDev: " & r2temp!StDevOfCostEa)

If Not (IsNull(r2temp!StDevOfCostEa)) Then

If (r2temp!StDevOfCostEa > 0) Then

'using Chauvenet's criterion.....

p = 2 * (1 - SNorm2(r3temp!away / r2temp!StDevOfCostEa)) * r2temp!CountOfCOst

'MsgBox (r2temp!PartNum & Chr$(13) & "Away: " & r3temp!away & Chr$(13) & "Ave: " & r2temp!AvgOfCostEa & Chr$(13) & "SDev: " & r2temp!StDevOfCostEa & Chr$(13) & "Count: " & r2temp!CountOfCOst & Chr$(13) & "P: " & p)

While p < 0.5

'flag the point

r3temp.MoveFirst

r3temp.Edit

r3temp!FlaggedOutlier = True

r3temp.Update

'get the records that drop this point

sql = "SELECT TOP 1 tblCOGSoutlier.LineID, tblCOGSoutlier.FlaggedOutlier, tblCOGSoutlier.PartNum, "

sql = sql & " tblCOGSoutlier.CostEa, Abs([costEa]-" & r2temp!AvgOfCostEa & ") AS Away"

sql = sql & " FROM tblCOGSoutlier where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And "

sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) ORDER BY Abs([costEa]-" & r2temp!AvgOfCostEa & ") DESC;"

Set q3temp = db.CreateQueryDef("", sql)

Set r3temp = q3temp.OpenRecordset(dbOpenDynaset, dbInconsistent, dbOptimistic)

'recalculate p

p = 2 * (1 - SNorm2(r3temp!away / r2temp!StDevOfCostEa)) * r2temp!CountOfCOst

Wend

End If

End If

'to get here, only non outliers are left....

'get the next part number

rtemp.MoveNext

Wend

r3temp.Close

q3temp.Close

r2temp.Close

q2temp.Close

rtemp.Close

qtemp.Close

removeOutliers = True

End Function

'***********************************************************************

'* Cumulative Standard Normal Distribution *

'* (this function provides similar result as NORMSDIST( ) on Excel) *

'* Source: http://www.geocities.com/WallStreet/9245/vba6.htm *

'***********************************************************************

Public Function SNorm2(z As Double) As Double

Const c1 = 2.506628

Const c2 = 0.3193815

Const c3 = -0.3565638

Const c4 = 1.7814779

Const c5 = -1.821256

Const c6 = 1.3302744

Dim w As Double, x As Double, y As Double

If z > 0 Or z = 0 Then

w = 1

Else

w = -1

End If

y = 1 / (1 + 0.231649 * w * z)

x = c6

x = y * x + c5

x = y * x + c4

x = y * x + c3

x = y * x + c2

SNorm2 = 0.5 + w * (0.5 - (Exp(-z * z / 2) / c1) * y * x)

End Function

Dim db As Database

Set db = CurrentDb()

Dim qtemp, q2temp, q3temp As QueryDef

Dim rtemp, r2temp, r3temp As Recordset

Dim sql As String

Dim p As Double

'empty the previous costs table

sql = "DELETE * FROM tblCOGSoutlier"

db.Execute (sql)

'put data into the table where qty>0, costs >0, inv date is after Jun 2011 (prior data suspect)

sql = "INSERT INTO tblCOGSoutlier ( PartNum, InvDate, CostEa, fromQty, FlaggedOutlier )"

sql = sql & " SELECT tblCOGSMkTbl.[Part Num], CDate([Inv Date]) AS InvDate, [total cost extended]/[qty shipped] AS CostEa,"

sql = sql & " tblCOGSMkTbl.[Qty Shipped], False AS Expr1 FROM tblCOGSMkTbl"

sql = sql & " WHERE (((CDate([Inv Date]))>=#Jul/1/2011#) AND (([total cost extended]/[qty shipped])>0) AND ((tblCOGSMkTbl.[Qty Shipped])>0));"

db.Execute (sql)

'inflate for multiples?

'not sure about this.... will make for clustered and non normal distribution, but high qtys are more important than low qtys....

'after some thought, strategy is remove outliers before considering quantities per data point

'for each part number

sql = "SELECT tblCOGSoutlier.PartNum FROM tblCOGSoutlier GROUP BY tblCOGSoutlier.PartNum;"

Set qtemp = db.CreateQueryDef("", sql)

Set rtemp = qtemp.OpenRecordset(DB_OPEN_SNAPSHOT)

While Not rtemp.EOF

'get the mean and stddev for the costs not flagged as outliers

sql = "SELECT tblCOGSoutlier.PartNum, StDev(tblCOGSoutlier.[CostEa]) AS StDevOfCostEa, "

sql = sql & "Avg(tblCOGSoutlier.CostEa) AS AvgOfCostEa, Count(CostEa) AS CountOfCost FROM tblCOGSoutlier "

sql = sql & "where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And "

sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) GROUP BY tblCOGSoutlier.PartNum;"

Set q2temp = db.CreateQueryDef("", sql)

Set r2temp = q2temp.OpenRecordset(DB_OPEN_SNAPSHOT)

'get the cost record for this item where the point is furthest from the mean

'select top 1 ordered by abs(cost-mean)

sql = "SELECT TOP 1 tblCOGSoutlier.LineID, tblCOGSoutlier.FlaggedOutlier, tblCOGSoutlier.PartNum, "

sql = sql & " tblCOGSoutlier.CostEa, Abs([costEa]-" & r2temp!AvgOfCostEa & ") AS Away"

sql = sql & " FROM tblCOGSoutlier where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And "

sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) ORDER BY Abs([costEa]-" & r2temp!AvgOfCostEa & ") DESC;"

Set q3temp = db.CreateQueryDef("", sql)

Set r3temp = q3temp.OpenRecordset(dbOpenDynaset, dbInconsistent, dbOptimistic)

'while the point is too far away....

'MsgBox (r2temp!PartNum & Chr$(13) & "Away: " & r3temp!away & Chr$(13) & "Ave: " & r2temp!AvgOfCostEa & Chr$(13) & "SDev: " & r2temp!StDevOfCostEa)

If Not (IsNull(r2temp!StDevOfCostEa)) Then

If (r2temp!StDevOfCostEa > 0) Then

'using Chauvenet's criterion.....

p = 2 * (1 - SNorm2(r3temp!away / r2temp!StDevOfCostEa)) * r2temp!CountOfCOst

'MsgBox (r2temp!PartNum & Chr$(13) & "Away: " & r3temp!away & Chr$(13) & "Ave: " & r2temp!AvgOfCostEa & Chr$(13) & "SDev: " & r2temp!StDevOfCostEa & Chr$(13) & "Count: " & r2temp!CountOfCOst & Chr$(13) & "P: " & p)

While p < 0.5

'flag the point

r3temp.MoveFirst

r3temp.Edit

r3temp!FlaggedOutlier = True

r3temp.Update

'get the records that drop this point

sql = "SELECT TOP 1 tblCOGSoutlier.LineID, tblCOGSoutlier.FlaggedOutlier, tblCOGSoutlier.PartNum, "

sql = sql & " tblCOGSoutlier.CostEa, Abs([costEa]-" & r2temp!AvgOfCostEa & ") AS Away"

sql = sql & " FROM tblCOGSoutlier where (((tblCOGSoutlier.PartNum) Like """ & rtemp!PartNum & """) And "

sql = sql & "((tblCOGSoutlier.FlaggedOutlier) = No)) ORDER BY Abs([costEa]-" & r2temp!AvgOfCostEa & ") DESC;"

Set q3temp = db.CreateQueryDef("", sql)

Set r3temp = q3temp.OpenRecordset(dbOpenDynaset, dbInconsistent, dbOptimistic)

'recalculate p

p = 2 * (1 - SNorm2(r3temp!away / r2temp!StDevOfCostEa)) * r2temp!CountOfCOst

Wend

End If

End If

'to get here, only non outliers are left....

'get the next part number

rtemp.MoveNext

Wend

r3temp.Close

q3temp.Close

r2temp.Close

q2temp.Close

rtemp.Close

qtemp.Close

removeOutliers = True

End Function

'***********************************************************************

'* Cumulative Standard Normal Distribution *

'* (this function provides similar result as NORMSDIST( ) on Excel) *

'* Source: http://www.geocities.com/WallStreet/9245/vba6.htm *

'***********************************************************************

Public Function SNorm2(z As Double) As Double

Const c1 = 2.506628

Const c2 = 0.3193815

Const c3 = -0.3565638

Const c4 = 1.7814779

Const c5 = -1.821256

Const c6 = 1.3302744

Dim w As Double, x As Double, y As Double

If z > 0 Or z = 0 Then

w = 1

Else

w = -1

End If

y = 1 / (1 + 0.231649 * w * z)

x = c6

x = y * x + c5

x = y * x + c4

x = y * x + c3

x = y * x + c2

SNorm2 = 0.5 + w * (0.5 - (Exp(-z * z / 2) / c1) * y * x)

End Function

This is a "lo-fi" version of UA. To view the full version with more information, formatting and images, please click here.