Transcriptions
Note: this content has been automatically generated.
00:00:00
Data is often dirty: it contains wrong information. According
00:00:04
to recent studies, data scientists spend up to eighty percent of
00:00:08
the time preparing dirty data before it can be used for data analysis,
00:00:12
and therefore they take longer to generate insights.
00:00:16
The main reason for this is that data cleaning is a costly, iterative process, because data scientists
00:00:22
need to perform operations such as filling in missing values, fixing
00:00:26
erroneous values, and applying any sort of transformations.
00:00:30
At the same time, existing tools that try to automate the data cleaning procedure
00:00:35
either focus on a specific data cleaning operation or are inefficient.
00:00:40
Therefore, from a user's perspective, one has to use a
00:00:43
different, potentially inefficient tool for each category of errors.
00:00:48
So how can we support arbitrary cleaning operations, which are subject to
00:00:52
the user's perception of the data, and be fast at the same time?
00:00:56
Our thesis is that we need a cleaning language which is also coupled with an optimisable algebra.
00:01:04
So we call our language CleanM, and CleanM supports multiple types of
00:01:08
data cleaning operations and can be easily extended to support more operations.
00:01:13
It supports operations such as duplicate elimination, violations of integrity constraints such as
00:01:19
violations of functional dependencies for example, term validation using dictionaries, and so on.
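As an illustration of one operation named above, here is a minimal Python sketch of detecting functional dependency violations. It is not CleanM syntax or the speaker's implementation; the function name `fd_violations`, the column names and the sample rows are all hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs.

    Returns the lhs values that are associated with more than one
    distinct rhs value, together with the conflicting rhs values.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample data: "zip" should determine "city".
rows = [
    {"zip": "1015", "city": "Lausanne"},
    {"zip": "1015", "city": "Lausane"},
    {"zip": "8001", "city": "Zurich"},
]
violations = fd_violations(rows, "zip", "city")
```

Here `violations` holds the single zip code that maps to two different spellings of the city, which is the kind of inconsistency such a check is meant to surface.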
00:01:25
And CleanM maps all these operations into a common algebra in order
00:01:30
to be able to optimise them in a unified way.
00:01:34
The algebra is based on the monoid calculus. Monoids
00:01:37
are abstract constructs stemming from category theory, which are used
00:01:41
to represent aggregate and collection operators. For
00:01:45
example, min, max and sum are classic examples of monoids.
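To make the idea concrete: a monoid is an associative combine operation together with an identity element, so a collection can be reduced in any grouping. The Python sketch below is only an illustration of that definition; the `Monoid` class and the names `SUM`, `MIN` and `MAX` are assumptions made for this example, not part of CleanM.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    """An identity element plus an associative binary operation."""
    identity: Any
    combine: Callable[[Any, Any], Any]

    def fold(self, values):
        # Reduce the collection, starting from the identity element.
        return reduce(self.combine, values, self.identity)


SUM = Monoid(0, lambda a, b: a + b)
MIN = Monoid(float("inf"), min)
MAX = Monoid(float("-inf"), max)

values = [4, 1, 7]
total = SUM.fold(values)
smallest = MIN.fold(values)
largest = MAX.fold(values)
```

Because the operation is associative, partial results computed on separate chunks of the data can be combined afterwards, which is what makes this kind of aggregate a natural fit for distributed execution.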
00:01:49
By using the monoid calculus, we can represent the
00:01:52
complex building blocks of data cleaning operations, such as clustering.
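One way to picture a building block like clustering in monoid terms is to treat grouping as a fold: dictionaries of groups form a monoid under key-wise merge, with the empty dictionary as identity. The Python sketch below is written under that assumption and is not how CleanM itself is implemented; `merge_groups`, `cluster_by` and the blocking key are hypothetical names chosen for illustration.

```python
from functools import reduce


def merge_groups(left, right):
    """Associative merge of two {key: [records]} dicts; {} is the identity."""
    merged = {key: list(records) for key, records in left.items()}
    for key, records in right.items():
        merged.setdefault(key, []).extend(records)
    return merged


def cluster_by(records, blocking_key):
    """Group records that share a blocking key by folding singleton groups."""
    singletons = ({blocking_key(record): [record]} for record in records)
    return reduce(merge_groups, singletons, {})


# Hypothetical sample data: names that differ only in case and whitespace.
names = ["Anna", "anna ", "Bob", "ANNA", "bob"]
clusters = cluster_by(names, lambda s: s.strip().lower())
```

Each resulting cluster is a set of candidate duplicates that a later step could compare pairwise, instead of comparing every record with every other record.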
00:01:57
Then, monoid calculus expressions are translated into an algebraic plan, where
00:02:02
we can perform optimisations by exploiting work-sharing opportunities, for example.
00:02:07
And finally, the algebraic plan gets translated into an optimised
00:02:12
physical plan, which can be executed in a scale-out fashion.
00:02:16
We have implemented and evaluated this three-level
00:02:20
optimisation process, and we have observed that, compared to
00:02:23
existing data cleaning techniques, CleanM can support more data