Search code examples
pythonfiltercurve-fittingoutliersdata-fitting

May I ask there are any algorithms(in python) could filter "deep valley" data points on a sloping straight line?


I have a group of datasets, each of them containing 251 points, which will be fitted as a sloping straight line. However there are around 30 outliers forming a lot "deep valleys" as shown below in every dataset.enter image description here

My task is to remove these deep valleys for future data processing and my initial idea was like this below:

lastData = limit 
def limiting(nowData, limit):
    global lastData
    if (abs(nowData-lastData) > limit):
        return lastData
    else:
        lastData = nowData
        return nowData

and my code is shown as below:

limit = 250
index = np.random.randint(0, 250)
last_data = honing_data_matrix[index, 0]
data_filtered = np.zeros((251, 251))
for i in range(0, len(data[index])):
    current_data = data[index, i]
    if abs(current_data - last_data) <= limit:
        data_filtered[index, i] = current_data
        last_data = current_data
    else:
        data_filtered[index, i] = last_data
        last_data = data_filtered[index, i]
data_filtered[index, 0] = data[index, 0]

It looked ok in several dataset but on most of the datasets the results were bad as shown below, the blue line is the filtered dataset: enter image description here This one up here looks good enter image description here But this one not

The filtered data is as below:

[5455. 5467. 5463. 5468. 5477. 5484. 5480. 5488. 5497. 5501. 5414. 5446.
 5501. 5505. 5509. 5530. 5534. 5538. 5541. 5550. 5548. 5553. 5574. 5569.
 5558. 5578. 5567. 5568. 5575. 5580. 5587. 5592. 5594. 5605. 5611. 5614.
 5612. 5617. 5580. 5441. 5378. 5520. 5642. 5657. 5657. 5673. 5688. 5644.
 5637. 5678. 5694. 5696. 5686. 5690. 5712. 5730. 5700. 5706. 5725. 5719.
 5714. 5712. 5712. 5712. 5712. 5712. 5712. 5533. 5700. 5685. 5676. 5725.
 5756. 5772. 5776. 5714. 5640. 5698. 5752. 5563. 5476. 5563. 5645. 5712.
 5783. 5831. 5835. 5861. 5791. 5650. 5631. 5724. 5806. 5854. 5875. 5889.
 5896. 5904. 5900. 5908. 5905. 5907. 5910. 5916. 5915. 5930. 5934. 5935.
 5938. 5949. 5945. 5917. 5768. 5783. 5840. 5712. 5547. 5499. 5572. 5775.
 5769. 5670. 5793. 5969. 6039. 6025. 6000. 6016. 6026. 6013. 5978. 6005.
 6036. 6044. 6047. 6061. 6072. 6080. 6080. 6090. 6097. 6101. 5971. 5828.
 5751. 5751. 5751. 5751. 5525. 5525. 5525. 5525. 5525. 5525. 5525. 5525.
 5525. 5525. 5525. 5525. 5525. 5525. 5525. 5654. 5520. 5755. 5755. 5755.
 5755. 5564. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.
 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.
 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.
 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.
 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.
 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.
 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326. 5326.]

The original data is as below:

[5455. 5467. 5463. 5468. 5477. 5484. 5480. 5488. 5497. 5501. 5414. 5446.
 5501. 5505. 5509. 5530. 5534. 5538. 5541. 5550. 5548. 5553. 5574. 5569.
 5558. 5578. 5567. 5568. 5575. 5580. 5587. 5592. 5594. 5605. 5611. 5614.
 5612. 5617. 5580. 5441. 5378. 5520. 5642. 5657. 5657. 5673. 5688. 5644.
 5637. 5678. 5694. 5696. 5686. 5690. 5712. 5730. 5700. 5706. 5725. 5719.
 5714. 5712. 5202. 4653. 4553. 4836. 5205. 5533. 5700. 5685. 5676. 5725.
 5756. 5772. 5776. 5714. 5640. 5698. 5752. 5563. 5476. 5563. 5645. 5712.
 5783. 5831. 5835. 5861. 5791. 5650. 5631. 5724. 5806. 5854. 5875. 5889.
 5896. 5904. 5900. 5908. 5905. 5907. 5910. 5916. 5915. 5930. 5934. 5935.
 5938. 5949. 5945. 5917. 5768. 5783. 5840. 5712. 5547. 5499. 5572. 5775.
 5769. 5670. 5793. 5969. 6039. 6025. 6000. 6016. 6026. 6013. 5978. 6005.
 6036. 6044. 6047. 6061. 6072. 6080. 6080. 6090. 6097. 6101. 5971. 5828.
 5751. 5433. 4973. 4978. 5525. 5976. 6079. 6111. 6139. 6154. 6154. 6161.
 6182. 6161. 6164. 6194. 6174. 6163. 6058. 5654. 5520. 5755. 6049. 6185.
 6028. 5564. 5326. 5670. 6048. 6197. 6204. 6140. 5937. 5807. 5869. 6095.
 6225. 6162. 5791. 5610. 5831. 6119. 6198. 5980. 5801. 5842. 5999. 6177.
 6273. 6320. 6335. 6329. 6336. 6358. 6363. 6355. 6357. 6373. 6350. 6099.
 6045. 6236. 6371. 6385. 6352. 6353. 6366. 6392. 6394. 6403. 6405. 6416.
 6415. 6425. 6428. 6426. 6374. 6313. 6239. 6059. 6077. 6197. 6293. 6365.
 6437. 6448. 6469. 6486. 6470. 6473. 6451. 6476. 6509. 6514. 6517. 6535.
 6545. 6525. 6364. 6295. 6388. 6510. 6556. 6568. 6570. 6459. 6343.]

Should I not filter the data one by one? Is there any other better filter for these kinds of sloping straight line data?


Solution

  • What an outlier is can be very dependent on the dataset. Here is a similar question that was answered that might help: Is there a numpy builtin to reject outliers from a list

    In your case it comes done to keeping track of a running average, and checking if values are further out then you would prefer.

    Last but not least scipy has great tools for al sorts of these problems. They have a section on outliers here: https://scikit-learn.org/stable/modules/outlier_detection.html