
Lec 6 | MIT 6.046J / 18.410J Introduction to Algorithms (SMA 5503), Fall 2005

  • 0:07 - 0:07
    Today we're going to not talk
    about sorting.
  • 0:11 - 0:11
    This is an exciting new
    development.
  • 0:14 - 0:14
    We're going to talk about
    another problem,
  • 0:18 - 0:18
    a related problem,
    but a different problem.
  • 0:35 - 0:35
    We're going to talk about
    another problem that we would
  • 0:38 - 0:38
    like to solve in linear time.
    Last class we talked about how we
  • 0:42 - 0:42
    could do sorting in linear time.
    To do that we needed some
  • 0:45 - 0:45
    additional assumptions.
    Today we're going to look at a
  • 0:48 - 0:48
    problem that really only needs
    linear time, even though at
  • 0:51 - 0:51
    first glance it might look like
    it requires sorting.
  • 0:54 - 0:54
    So this is going to be an
    easier problem.
  • 0:57 - 0:57
    The problem is I give you a
    bunch of numbers.
  • 1:00 - 1:00
    Let's call them elements.
    And they are in some array,
  • 1:07 - 1:07
    let's say.
    And they're in no particular
  • 1:12 - 1:12
    order, so unsorted.
    I want to find the kth smallest
  • 1:18 - 1:18
    element.
  • 1:26 - 1:26
    This is called the element of
    rank k.
  • 1:37 - 1:37
    In other words,
    I have this list of numbers
  • 1:40 - 1:40
    which is unsorted.
    And, if I were to sort it,
  • 1:43 - 1:43
    I would like to know what the
    kth element is.
  • 1:46 - 1:46
    But I'm not allowed to sort it.
    One solution to this problem,
  • 1:50 - 1:50
    this is the naïve algorithm,
    is you just sort and then
  • 1:54 - 1:54
    return the kth element.
    This is another possible
  • 1:57 - 1:57
    definition of the problem.
    And we would like to do better
  • 2:03 - 2:03
    than that.
    So you could sort,
  • 2:06 - 2:06
    let's call the array A,
    and then return A[k].
  • 2:11 - 2:11
    That is one thing we could do.
    And if we use heap sort or
  • 2:16 - 2:16
    mergesort, this will take n lg n
    time.
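    For reference, here is the naive approach as a minimal Python
    sketch (the function name is ours, not from the lecture):

        # Naive selection: sort, then index -- Theta(n lg n) time.
        def naive_select(A, k):
            """Return the k-th smallest element of A (k is 1-based)."""
            return sorted(A)[k - 1]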
  • 2:20 - 2:20
    We would like to do better than
    n lg n.
  • 2:24 - 2:24
    Ideally linear time.
    The problem is pretty natural,
  • 2:29 - 2:29
    straightforward.
    It has various applications.
  • 2:34 - 2:34
    Depending on how you choose k,
    k could be any number between 1
  • 2:40 - 2:40
    and n.
    For example,
  • 2:41 - 2:41
    if we choose k=1 that element
    has a name.
  • 2:45 - 2:45
    Any suggestions of what the
    name is?
  • 2:48 - 2:48
    The minimum.
    That's easy.
  • 2:50 - 2:50
    Any suggestions on how we could
    find the minimum element in an
  • 2:55 - 2:55
    array in linear time?
    Right.
  • 2:59 - 2:59
    Just scan through the array.
    Keep track of what the smallest
  • 3:04 - 3:04
    number is that you've seen.
    The same thing with the
  • 3:09 - 3:09
    maximum, k=n.
    These are rather trivial.
  • 3:12 - 3:12
    But a more interesting version
    of the order statistic problem
  • 3:17 - 3:17
    is to find the median.
    This is either k equals n plus
  • 3:22 - 3:22
    1 over 2 floor or ceiling.
    I will call both of those
  • 3:26 - 3:26
    elements medians.
  • 3:34 - 3:34
    Finding the median of an
    unsorted array in linear time is
  • 3:37 - 3:37
    quite tricky.
    And that sort of is the main
  • 3:39 - 3:39
    goal of this lecture,
    is to be able to find the
  • 3:42 - 3:42
    medians.
    For free we're going to be able
  • 3:44 - 3:44
    to find the arbitrary kth
    smallest element,
  • 3:46 - 3:46
    but typically we're most
    interested in finding the
  • 3:49 - 3:49
    median.
    And on Friday in recitation
  • 3:51 - 3:51
    you'll see why that is so
    useful.
  • 3:52 - 3:52
    There are all sorts of
    situations where you can use
  • 3:55 - 3:55
    median for really effective
    divide-and-conquer without
  • 3:58 - 3:58
    having to sort.
    You can solve a lot of problems
  • 4:03 - 4:03
    in linear time as a result.
    And we're going to cover today
  • 4:07 - 4:07
    two algorithms for finding order
    statistics.
  • 4:11 - 4:11
    Both of them are linear time.
    The first one is randomized,
  • 4:15 - 4:15
    so it's only linear expected
    time.
  • 4:18 - 4:18
    And the second one is
    worst-case linear time,
  • 4:22 - 4:22
    and it will build on the
    randomized version.
  • 4:25 - 4:25
    Let's start with a randomized
    divide-and-conquer algorithm.
  • 4:46 - 4:46
    This algorithm is called
    rand-select.
  • 5:02 - 5:02
    And the parameters are a little
    bit more than what we're used
  • 5:06 - 5:06
    to.
    The order statistics problem
  • 5:09 - 5:09
    you're given an array A.
    And here I've changed notation
  • 5:13 - 5:13
    and I'm looking for the ith
    smallest element,
  • 5:16 - 5:16
    so i is the index I'm looking
    for.
  • 5:18 - 5:18
    And I'm also going to change
    the problem a little bit.
  • 5:22 - 5:22
    And instead of trying to find
    it in the whole array,
  • 5:26 - 5:26
    I'm going to look in a
    particular interval of the
  • 5:29 - 5:29
    array, A from p up to q.
    We're going to need that for a
  • 5:34 - 5:34
    recursion.
    This better be a recursive
  • 5:36 - 5:36
    algorithm because we're using
    divide-and-conquer.
  • 5:39 - 5:39
    Here is the algorithm.
  • 5:51 - 5:51
    With a base case.
    It's pretty simple.
  • 5:55 - 5:55
    Then we're going to use part of
    the quicksort algorithm,
  • 6:00 - 6:00
    randomized quicksort.
  • 6:09 - 6:09
    We didn't actually define this
    subroutine two lectures ago,
  • 6:13 - 6:13
    but you should know what it
    does, especially if you've read
  • 6:18 - 6:18
    the textbook.
    This says in the array A[p...q]
  • 6:21 - 6:21
    pick a random element,
    so pick a random index between
  • 6:25 - 6:25
    p and q, swap it with the first
    element, then call partition.
  • 6:30 - 6:30
    And partition uses that first
    element to split the rest of the
  • 6:35 - 6:35
    array into less than or equal to
    that random partition and
  • 6:39 - 6:39
    greater than or equal to that
    partition.
  • 6:43 - 6:43
    This is just picking a random
    partition element between p and
  • 6:47 - 6:47
    q, cutting the array in half,
    although the two sizes may not
  • 6:52 - 6:52
    be equal.
    And it returns the index of
  • 6:55 - 6:55
    that partition element,
    some number between p and q.
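    As a sketch, one standard way this subroutine could look in
    Python; the details (a Lomuto-style partition) are our guess,
    not necessarily the textbook's exact pseudocode:

        import random

        def rand_partition(A, p, q):
            """Partition A[p..q] (inclusive) around a uniformly random
            pivot and return the pivot's final index."""
            s = random.randint(p, q)     # random index between p and q
            A[p], A[s] = A[s], A[p]      # swap the pivot to the front
            pivot = A[p]
            i = p                        # invariant: A[p+1..i] <= pivot
            for j in range(p + 1, q + 1):
                if A[j] <= pivot:
                    i += 1
                    A[i], A[j] = A[j], A[i]
            A[p], A[i] = A[i], A[p]      # place the pivot between the sides
            return i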
  • 7:00 - 7:00
    And we're going to define k to
    be this particular value,
  • 7:08 - 7:08
    r minus p plus 1.
    And the reason for that is that
  • 7:15 - 7:15
    k is then the rank of the
    partition element.
  • 7:22 - 7:22
    This is in A[p...q].
    Let me draw a picture here.
  • 7:30 - 7:30
    We have our array A.
    It starts at p and ends at q.
  • 7:34 - 7:34
    There is other stuff,
    but for this recursive call all we
  • 7:39 - 7:39
    care about is p up to q.
    We pick a random partition
  • 7:43 - 7:43
    element, say this one,
    and we partition things so that
  • 7:47 - 7:47
    everything in here,
    let's call this r,
  • 7:51 - 7:51
    is less than or equal to A[r]
    and everything up here is
  • 7:55 - 7:55
    greater than or equal to A[r].
    And A[r] is our partition
  • 8:00 - 8:00
    element.
    After this call,
  • 8:03 - 8:03
    that's what the array looks
    like.
  • 8:06 - 8:06
    And we get r.
    We get the index of where
  • 8:10 - 8:10
    partition element is stored.
    The number of elements that are
  • 8:15 - 8:15
    less than or equal to A[r],
    including A[r] itself, is r minus p plus 1.
  • 8:20 - 8:20
    There will be r minus p
    elements here,
  • 8:24 - 8:24
    and we're adding 1 to get this
    element.
  • 8:28 - 8:28
    And, if you start counting at
    1, if this is rank 1,
  • 8:33 - 8:33
    rank 2, this element will have
    rank k.
  • 8:36 - 8:36
    That's just from the
    construction in the partition.
  • 8:40 - 8:40
    And now we get to recurse.
    And there are three cases --
  • 8:53 - 8:53
    -- depending on how i relates
    to k.
  • 8:55 - 8:55
    Remember i is the rank that
    we're looking for,
  • 8:58 - 8:58
    k is the rank that we happen to
    get out of this random
  • 9:01 - 9:01
    partition.
    We don't have much control over
  • 9:04 - 9:04
    k, but if we're lucky i=k.
    That's the element we want.
  • 9:13 - 9:13
    Then we just return the
    partition element.
  • 9:15 - 9:15
    More likely is that the element
    we're looking for is either to
  • 9:18 - 9:18
    the left or to the right.
    And if it's to the left we're
  • 9:21 - 9:21
    going to recurse in the
    left-hand portion of the array.
  • 9:24 - 9:24
    And if it's to the right we're
    going to recurse in the
  • 9:26 - 9:26
    right-hand portion.
    So, pretty straightforward at
  • 9:29 - 9:29
    this point.
  • 9:45 - 9:45
    I just have to get all the
    indices right.
  • 10:08 - 10:08
    Either we're going to recurse
    on the part between p and r
  • 10:11 - 10:11
    minus 1, that's this case.
    The rank we're looking for is
  • 10:15 - 10:15
    to the left of the rank of
    element A[r].
  • 10:17 - 10:17
    Or, we're going to recurse on
    the right part between r plus 1
  • 10:21 - 10:21
    and q.
    Where we recurse on the left
  • 10:23 - 10:23
    part the rank we're looking for
    remains the same,
  • 10:26 - 10:26
    but when we recurse on the
    right part the rank we're
  • 10:29 - 10:29
    looking for gets offset.
    Because we sort of got rid of
  • 10:34 - 10:34
    the k elements over here.
    I should have written this
  • 10:39 - 10:39
    length is k.
    We've sort of swept away k
  • 10:42 - 10:42
    ranks of elements.
    And now within this array we're
  • 10:47 - 10:47
    looking for the i minus kth
    smallest element.
  • 10:51 - 10:51
    That's the recursion.
    We only recurse once.
  • 10:55 - 10:55
    And random partition is not a
    recursion.
  • 11:00 - 11:00
    That just takes linear time.
    And the total amount of work
  • 11:05 - 11:05
    we're doing here should be
    linear time plus one recursion.
  • 11:09 - 11:09
    And we'd next like to see what
    the total running time is in
  • 11:14 - 11:14
    expectation, but let's first do
    a little example --
  • 11:26 - 11:26
    -- to make this algorithm
    perfectly clear.
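    Putting the pieces together, a hedged Python sketch of
    rand-select, reusing the rand_partition sketch from above
    (p and q are array indices; i is a 1-based rank within A[p..q]):

        def rand_select(A, p, q, i):
            """Return the i-th smallest element of A[p..q]."""
            if p == q:                    # base case: one element left
                return A[p]
            r = rand_partition(A, p, q)   # pivot lands at index r
            k = r - p + 1                 # rank of the pivot in A[p..q]
            if i == k:
                return A[r]               # lucky: the pivot is the answer
            elif i < k:
                return rand_select(A, p, r - 1, i)
            else:
                return rand_select(A, r + 1, q, i - k)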
  • 11:29 - 11:29
    Let's suppose we're looking for
    the seventh smallest element in
  • 11:34 - 11:34
    this array.
  • 11:50 - 11:50
    And let's suppose,
    just for example,
  • 11:53 - 11:53
    that the pivot we're using is
    just the first element.
  • 11:58 - 11:58
    So, nothing fancy.
    I would have to flip a few
  • 12:02 - 12:02
    coins in order to generate a
    random one, so let's just pick
  • 12:07 - 12:07
    this one.
    If I partition at the element
  • 12:10 - 12:10
    6, this is actually an example
    we did two weeks ago,
  • 12:14 - 12:14
    and I won't go through it
    again, but we get the same
  • 12:18 - 12:18
    array, as we did two weeks ago,
    namely 2, 5,
  • 12:21 - 12:21
    3, 6, 8, 13,
    10 and 11.
  • 12:23 - 12:23
    If you run through the
    partitioning algorithm,
  • 12:27 - 12:27
    that happens to be the order
    that it throws the elements
  • 12:31 - 12:31
    into.
    And this is our position r.
  • 12:35 - 12:35
    This is p here.
    It's just 1.
  • 12:37 - 12:37
    And q is just the end.
    And I am looking for the
  • 12:41 - 12:41
    seventh smallest element.
    And it happens when I run this
  • 12:45 - 12:45
    partition that 6 falls into the
    fourth place.
  • 12:48 - 12:48
    And we know that means,
    because all the elements here
  • 12:52 - 12:52
    are less than 6 and all the
    elements here are greater than
  • 12:57 - 12:57
    6, if this array were sorted,
    6 would be right here in
  • 13:01 - 13:01
    position four.
    So, r here is 4.
  • 13:05 - 13:05
    Yeah?
    The 12 turned into an 11?
  • 13:09 - 13:09
    This was an 11,
    believe it or not.
  • 13:14 - 13:14
    Let me be simple.
    Sorry.
  • 13:17 - 13:17
    Sometimes my ones look like
    twos.
  • 13:21 - 13:21
    Not a good feature.
    That's an easy way to cover.
  • 13:27 - 13:27
    [LAUGHTER]
    Don't try that on exams.
  • 13:32 - 13:32
    Oh, that one was just a two.
    No.
  • 13:34 - 13:34
    Even though we're not sorting
    the array, we're only spending
  • 13:38 - 13:38
    linear work here to partition by
    6.
  • 13:40 - 13:40
    We know that if we had sorted
    the array 6 would fall here.
  • 13:44 - 13:44
    We don't know about these other
    elements.
  • 13:46 - 13:46
    They're not in sorted order,
    but from the properties of
  • 13:50 - 13:50
    partition we know 6 went the
    right spot.
  • 13:53 - 13:53
    We now know rank of 6 is 4.
    We happened to be looking for 7
  • 13:56 - 13:56
    and we happened to get this
    number 4.
  • 14:00 - 14:00
    We want something over here.
    It turns out we're looking for
  • 14:04 - 14:04
    10, I guess.
    No, 11.
  • 14:05 - 14:05
    There should be eight elements
    in this array,
  • 14:08 - 14:08
    so it's the next to max.
    Max here is 13,
  • 14:11 - 14:11
    I'm cheating here.
    The answer we're looking for is
  • 14:14 - 14:14
    11.
    We know that what we're looking
  • 14:16 - 14:16
    for is in the right-hand part
    because the rank we're looking
  • 14:20 - 14:20
    for is 7, which is bigger than
    4.
  • 14:23 - 14:23
    Now, what rank are we looking
    for in here?
  • 14:25 - 14:25
    Well, we've gotten rid of four
    elements over here.
  • 14:30 - 14:30
    It happened here that k is also
    4 because p is 1 in this
  • 14:35 - 14:35
    example.
    The rank of 6 was 4.
  • 14:38 - 14:38
    We throw away those four
    elements.
  • 14:41 - 14:41
    Now we're looking for rank 7
    minus 4 which is 3.
  • 14:46 - 14:46
    And, indeed,
    the rank 3 element here is
  • 14:50 - 14:50
    still 11.
    So, you recursively find that.
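    As a quick check of the sketch above on the eight elements from
    the board (their order does not affect the answer):

        A = [2, 5, 3, 6, 8, 13, 10, 11]
        print(rand_select(A, 0, len(A) - 1, 7))   # prints 11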
  • 14:54 - 14:54
    That's your answer.
    Now that algorithm should be
  • 14:58 - 14:58
    pretty clear.
    The tricky part is to analyze
  • 15:03 - 15:03
    it.
    And the analysis here is quite
  • 15:06 - 15:06
    a bit like randomized quicksort,
    although not quite as hairy,
  • 15:10 - 15:10
    so it will go faster.
    But it will be also sort of a
  • 15:14 - 15:14
    nice review of the randomized
    quicksort analysis which was a
  • 15:18 - 15:18
    bit tricky and always good to
    see a couple of times.
  • 15:22 - 15:22
    We're going to follow the same
    kind of outline as before to
  • 15:26 - 15:26
    look at the expected running
    time of this algorithm.
  • 15:31 - 15:31
    And to start out we're going
    to, as before,
  • 15:35 - 15:35
    look at some intuition just to
    feel good about ourselves.
  • 15:40 - 15:40
    Also feel bad as you'll see.
    Let's think about two sort of
  • 15:45 - 15:45
    extreme cases,
    a good case and the worst case.
  • 15:49 - 15:49
    And I should mention that in
    all of the analyses today we
  • 15:54 - 15:54
    assume the elements are
    distinct.
  • 16:04 - 16:04
    It gets really messy if the
    elements are not distinct.
  • 16:08 - 16:08
    And you may even have to change
    the algorithms a little bit
  • 16:12 - 16:12
    because if all the elements are
    equal, if you pick a random
  • 16:17 - 16:17
    element, the partition does not
    do so well.
  • 16:20 - 16:20
    But let's assume they're all
    distinct, which is the really
  • 16:24 - 16:24
    interesting case.
    A pretty lucky case --
  • 16:28 - 16:28
    I mean the best case is we
    partition right in the middle.
  • 16:33 - 16:33
    The number of elements to the
    left of our partition is equal
  • 16:38 - 16:38
    to the number of elements to the
    right of our partition.
  • 16:42 - 16:42
    But almost as good would be
    some kind of 1/10 to 9/10 split.
  • 16:47 - 16:47
    Any constant fraction,
    we should feel that.
  • 16:51 - 16:51
    Any constant fraction is as
    good as 1/2.
  • 16:54 - 16:54
    Then the recurrence we get is,
    let's say at most,
  • 16:58 - 16:58
    this bad.
    So, it depends.
  • 17:01 - 17:01
    If we have let's say 1/10 on
    the left and 9/10 on the right
  • 17:05 - 17:05
    every time we do a partition.
    It depends where our answer is.
  • 17:09 - 17:09
    It could be if i is really
    small it's in the 1/10 part.
  • 17:13 - 17:13
    If i is really big it's going
    to be in the 9/10 part,
  • 17:16 - 17:16
    or most of the time it's going
    to be in the 9/10 part.
  • 17:20 - 17:20
    We're doing worst-case analysis
    within the lucky case,
  • 17:23 - 17:23
    so we're happy to have upper
    bounds.
  • 17:26 - 17:26
    I will say T(n) is at most
    T((9/10)n) + Theta(n).
  • 17:30 - 17:30
    Clearly it's worse if we're in
    the bigger part.
  • 17:35 - 17:35
    What is the solution to this
    recurrence?
  • 17:38 - 17:38
    Oh, solving recurrences was so
    long ago.
  • 17:42 - 17:42
    What method should we use for
    solving this recurrence?
  • 17:48 - 17:48
    The master method.
    What case are we in?
  • 17:51 - 17:51
    Three.
    Good.
  • 17:52 - 17:52
    You still remember.
    This is Case 3.
  • 17:56 - 17:56
    We're looking at n^(log_b a).
    b here is 10/9,
  • 18:01 - 18:01
    although it doesn't really
    matter because a is 1.
  • 18:06 - 18:06
    log base anything of 1 is 0.
    So, this is n^0 which is 1.
  • 18:11 - 18:11
    And n is polynomially larger
    than 1.
  • 18:15 - 18:15
    This is going to be O(n),
    which is good.
  • 18:19 - 18:19
    That is what we want,
    linear time.
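    In symbols, the Case 3 application just described:

        T(n) \le T\bigl(\tfrac{9}{10}n\bigr) + \Theta(n),
        \qquad a = 1,\ b = \tfrac{10}{9},\ n^{\log_b a} = n^0 = 1,

    and f(n) = \Theta(n) is polynomially larger than 1, so
    T(n) = O(n).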
  • 18:22 - 18:22
    If we're in the lucky case,
    great.
  • 18:25 - 18:25
    Unfortunately this is only
    intuition.
  • 18:30 - 18:30
    And we're not always going to
    get the lucky case.
  • 18:33 - 18:33
    We could do the same kind of
    analysis as we did with
  • 18:36 - 18:36
    randomized quicksort.
    If you alternate between lucky
  • 18:38 - 18:38
    and unlucky, things will still
    be good, but let's just talk
  • 18:42 - 18:42
    about the unlucky case to show
    how bad things can get.
  • 18:45 - 18:45
    And this really would be a
    worst-case analysis.
  • 18:53 - 18:53
    The unlucky case we get a split
    of 0:n-1.
  • 19:00 - 19:00
    Because we're removing the
    partition element either way.
  • 19:05 - 19:05
    And there could be nothing less
    than the partition element.
  • 19:10 - 19:10
    We have 0 on the left-hand side
    and we have n-1 on the
  • 19:15 - 19:15
    right-hand side.
    Now we get a recurrence like
  • 19:18 - 19:18
    T(n)=T(n-1) plus linear cost.
    And what's the solution to that
  • 19:24 - 19:24
    recurrence?
    n^2.
  • 19:25 - 19:25
    Yes.
    This one you should just know.
  • 19:28 - 19:28
    It's n^2 because it's an
    arithmetic series.
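    Unrolling the recurrence:

        T(n) = T(n-1) + \Theta(n)
             = \Theta(n) + \Theta(n-1) + \cdots + \Theta(1)
             = \Theta\Bigl(\sum_{j=1}^{n} j\Bigr) = \Theta(n^2).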
  • 19:38 - 19:38
    And that's pretty bad.
    This is much,
  • 19:40 - 19:40
    much worse than sorting and
    then picking the ith element.
  • 19:43 - 19:43
    In the worst-case this
    algorithm really sucks,
  • 19:46 - 19:46
    but most of the time it's going
    to do really well.
  • 19:49 - 19:49
    And, unless you're really,
    really unlucky and every coin
  • 19:52 - 19:52
    you flip gives the wrong answer,
    you won't get this case and you
  • 19:56 - 19:56
    will get something more like the
    lucky case.
  • 19:59 - 19:59
    At least that's what we'd like
    to prove.
  • 20:02 - 20:02
    And we will prove that the
    expected running time here is
  • 20:05 - 20:05
    linear.
    So, it's very rare to get
  • 20:07 - 20:07
    anything quadratic.
    But later on we will see how to
  • 20:10 - 20:10
    make the worst-case linear as
    well.
  • 20:12 - 20:12
    This would really,
    really solve the problem.
  • 20:30 - 20:30
    Let's get into the analysis.
  • 20:43 - 20:43
    Now, you've seen an analysis
    much like this before.
  • 20:47 - 20:47
    What do you suggest we do in
    order to analyze this expected
  • 20:52 - 20:52
    time?
    It's a divide-and-conquer
  • 20:54 - 20:54
    algorithm, so we kind of like to
    write down the recurrence on
  • 20:59 - 20:59
    something resembling the running
    time.
  • 21:09 - 21:09
    I don't need the answer,
    but what's the first step that
  • 21:13 - 21:13
    we might do to analyze the
    expected running time of this
  • 21:17 - 21:17
    algorithm?
    Sorry?
  • 21:18 - 21:18
    Look at different cases,
    yeah.
  • 21:20 - 21:20
    Exactly.
    We have all these possible ways
  • 21:23 - 21:23
    that random partition could
    split.
  • 21:25 - 21:25
    It could split 0 to n-1.
    It could split in half.
  • 21:30 - 21:30
    There are n choices where it
    could split.
  • 21:33 - 21:33
    How can we break into those
    cases?
  • 21:36 - 21:36
    Indicator random variables.
    Cool.
  • 21:39 - 21:39
    Exactly.
    That's what we want to do.
  • 21:41 - 21:41
    Indicator random variable
    suggests that what we're dealing
  • 21:46 - 21:46
    with is not exactly just a
    function T(n) but it's a random
  • 21:51 - 21:51
    variable.
    This is one subtlety.
  • 21:53 - 21:53
    T(n) depends on the random
    choices, so it's really a random
  • 21:58 - 21:58
    variable.
  • 22:05 - 22:05
    And then we're going to use
    indicator random variables to
  • 22:08 - 22:08
    get a recurrence on T(n).
  • 22:25 - 22:25
    So, T(n) is the running time of
    rand-select on an input of size
  • 22:33 - 22:33
    n.
  • 22:40 - 22:40
    And I am also going to write
    down explicitly an assumption
  • 22:46 - 22:46
    about the random numbers.
  • 22:55 - 22:55
    That they should be chosen
    independently from each other.
  • 23:00 - 23:00
    Every time I call random
    partition, it's generating a
  • 23:03 - 23:03
    completely independent random
    number from all the other times
  • 23:07 - 23:07
    I call random partition.
    That is important,
  • 23:10 - 23:10
    of course, for this analysis to
    work.
  • 23:13 - 23:13
    We will see why some point down
    the line.
  • 23:15 - 23:15
    And now, to sort of write down
    an equation for T(n) we're going
  • 23:20 - 23:20
    to define indicator random
    variables, as you suggested.
  • 23:36 - 23:36
    And we will call it X_k.
    And this is for all k=0...n-1.
  • 23:50 - 23:50
    Indicator random variables
    either 1 or 0.
  • 23:54 - 23:54
    And it's going to be 1 if the
    partition comes out k on the
  • 24:01 - 24:01
    left-hand side.
    So say the partition generates
  • 24:07 - 24:07
    a k:n-k-1 split and it is 0
    otherwise.
  • 24:11 - 24:11
    We have n of these indicator
    random variables between
  • 24:17 - 24:17
    0...n-1.
    And in each case,
  • 24:20 - 24:20
    no matter how the random choice
    comes out, exactly one of them
  • 24:28 - 24:28
    will be 1.
    All the others will be 0.
  • 24:32 - 24:32
    Now we can divide out the
    running time of this algorithm
  • 24:37 - 24:37
    based on which case we're in.
  • 24:49 - 24:49
    That will sort of unify this
    intuition that we did and get
  • 24:57 - 24:57
    all the cases.
    And then we can look at the
  • 25:03 - 25:03
    expectation.
    T(n), if we just split out by
  • 25:09 - 25:09
    cases, we have an upper bound
    like this.
  • 25:28 - 25:28
    If we have 0 to n-1 split,
    the worst is we have n-1.
  • 25:33 - 25:33
    Then we have to recurse in a
    problem of size n-1.
  • 25:38 - 25:38
    In fact, it would be pretty
    hard to recurse in a problem of
  • 25:44 - 25:44
    size 0.
    If we have a 1 to n-2 split
  • 25:47 - 25:47
    then we take the max of the two
    sides.
  • 25:51 - 25:51
    That's certainly going to give
    us an upper bound and so on.
  • 26:03 - 26:03
    And at the bottom you get an
    n-1 to 0 split.
  • 26:14 - 26:14
    This is now sort of
    conditioning on various events,
  • 26:17 - 26:17
    but we have indicator random
    variables to tell us when these
  • 26:20 - 26:20
    events happen.
    We can just multiply each of
  • 26:22 - 26:22
    these values by the indicator
    random variable and it will come
  • 26:25 - 26:25
    out 0 if that's not the case and
    will come out 1 and give us this
  • 26:28 - 26:28
    value if that happens to be the
    split.
  • 26:31 - 26:31
    So, if we add up all of those
    we'll get the same thing.
  • 26:38 - 26:38
    This is equal to the sum over
    all k of the indicator random
  • 26:46 - 26:46
    variable times the cost in that
    case, which is T of the max of k
  • 26:53 - 26:53
    and the other side,
    which is n-k-1,
  • 26:57 - 26:57
    plus Theta(n).
    This is our recurrence,
  • 27:02 - 27:02
    in some sense,
    for the random variable
  • 27:05 - 27:05
    representing running time.
    Now, the value will depend on
  • 27:09 - 27:09
    which case we come into.
    We know the probability of each
  • 27:14 - 27:14
    of these events happening is the
    same because we're choosing the
  • 27:19 - 27:19
    partition element uniformly at
    random, but we cannot really
  • 27:24 - 27:24
    simplify much beyond this until
    we take expectations.
  • 27:29 - 27:29
    We know this random variable
    could be as big as n^2.
  • 27:33 - 27:33
    Hopefully it's usually linear.
    We will take expectations of
  • 27:37 - 27:37
    both sides and get what we want.
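    Written out, the recurrence on the board is

        T(n) \le \sum_{k=0}^{n-1}
            X_k \Bigl( T\bigl(\max(k,\, n-k-1)\bigr) + \Theta(n) \Bigr).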
  • 27:54 - 27:54
    Let's look at the expectation
    of this random variable,
  • 27:58 - 27:58
    which is just the expectation,
    I will copy over,
  • 28:02 - 28:02
    summation we have here so I can
    work on this board.
  • 28:30 - 28:30
    I want to compute the
    expectation of this summation.
  • 28:34 - 28:34
    What property of expectation
    should I use?
  • 28:37 - 28:37
    Linearity, good.
    We can bring the summation
  • 28:40 - 28:40
    outside.
  • 29:08 - 29:08
    Now I have a sum of
    expectation.
  • 29:10 - 29:10
    Let's look at each expectation
    individually.
  • 29:13 - 29:13
    It's a product of two random
    variables, if you will.
  • 29:16 - 29:16
    This is an indicator random
    variable and this is some more
  • 29:19 - 29:19
    complicated function,
    some more complicated random
  • 29:22 - 29:22
    variable representing some
    running time,
  • 29:25 - 29:25
    which depends on what random
    choices are made in that
  • 29:28 - 29:28
    recursive call.
    Now what should I do?
  • 29:32 - 29:32
    I have the expectation of the
    product of two random variables.
  • 29:38 - 29:38
    Independence,
    exactly.
  • 29:40 - 29:40
    If I know that these two random
    variables are independent then I
  • 29:46 - 29:46
    know that the expectation of the
    product is the product of the
  • 29:51 - 29:51
    expectations.
    Now we have to check are they
  • 29:55 - 29:55
    independent?
    I hope so because otherwise
  • 29:59 - 29:59
    there isn't much else I can do.
    Why are they independent?
  • 30:05 - 30:05
    Sorry?
    Because we stated that they
  • 30:08 - 30:08
    are, right.
    Because of this assumption.
  • 30:11 - 30:11
    We assume that all the random
    numbers are chosen
  • 30:14 - 30:14
    independently.
    We need to sort of interpolate
  • 30:18 - 30:18
    that here.
    These X_k's,
  • 30:19 - 30:19
    all the X_k's,
    X_0 up to X_n-1,
  • 30:22 - 30:22
    so all the ones appearing in
    this summation are dependent
  • 30:26 - 30:26
    upon a single random choice of
    this particular call to random
  • 30:31 - 30:31
    partition.
    All of these are correlated,
  • 30:37 - 30:37
    because if one of them is 1,
    all the others are forced to be
  • 30:44 - 30:44
    0.
    So, there is a lot of
  • 30:47 - 30:47
    correlation among the X_k's.
    But with respect to everything
  • 30:55 - 30:55
    that is in here,
    and the only random part is
  • 31:00 - 31:00
    this T(max(k, n-k-1)).
    That is the reason that this
  • 31:07 - 31:07
    random variable is independent
    from these.
  • 31:13 - 31:13
    The same thing as quicksort,
    but I know some people got
  • 31:19 - 31:19
    confused about it a couple
    lectures ago so I am
  • 31:25 - 31:25
    reiterating.
    We get the product of
  • 31:29 - 31:29
    expectations,
    E[X_k] E[T(max(k, n-k-1))].
  • 31:35 - 31:35
    I mean the order n comes
    outside, but let's leave it
  • 31:40 - 31:40
    inside for now.
    There is no expectation to
  • 31:45 - 31:45
    compute there for order n.
    Order n is order n.
  • 31:49 - 31:49
    What is the expectation of X_k?
    1/n, because they're all chosen
  • 31:56 - 31:56
    with equal probability.
    There is n of them,
  • 32:00 - 32:00
    so the expectation is 1/n.
    The value is either 1 or 0.
  • 32:05 - 32:05
    We start to be able to split
    this up.
  • 32:07 - 32:07
    We have 1/n times this expected
    value of some recursive T call,
  • 32:12 - 32:12
    and then we have plus 1 over n
    times order n,
  • 32:16 - 32:16
    also known as a constant,
    but everything is summed up n
  • 32:20 - 32:20
    times so let's expand this.
  • 32:35 - 32:35
    I have the sum k=0 to n-1.
    I guess the 1/n can come
  • 32:42 - 32:42
    outside.
    And we have expectation of
  • 32:47 - 32:47
    T(max(k, n-k-1)).
    Lots of nifty braces there.
  • 32:54 - 32:54
    And then plus we have,
    on the other hand,
  • 33:00 - 33:00
    the sum k=0 to n-1.
    Let me just write that out
  • 33:06 - 33:06
    again.
    We have a 1/n in front and we
  • 33:09 - 33:09
    have a Theta(n) inside.
    This summation is n^2.
  • 33:12 - 33:12
    And then we're dividing by n,
    so this whole thing is,
  • 33:17 - 33:17
    again, order n.
    Nothing fancy happened there.
  • 33:20 - 33:20
    This is really just saying the
    expectation of order n is order
  • 33:25 - 33:25
    n.
    Average value of order n is
  • 33:27 - 33:27
    order n.
    What is interesting is this
  • 33:32 - 33:32
    part.
    Now, what could we do with this
  • 33:35 - 33:35
    summation?
    Here we start to differ from
  • 33:39 - 33:39
    randomized quicksort because we
    have this max.
  • 33:43 - 33:43
    Randomized quicksort we had the
    sum of T(k) plus T(n-k-1)
  • 33:48 - 33:48
    because we were making both
    recursive calls.
  • 33:53 - 33:53
    Here we're only making the
    biggest one.
  • 33:56 - 33:56
    That max is really a pain for
    evaluating this recurrence.
  • 34:03 - 34:03
    How could I get rid of the max?
    That's one way to think of it.
  • 34:12 - 34:12
    Yeah?
  • 34:18 - 34:18
    Exactly.
    I could only sum up to halfway
  • 34:21 - 34:21
    and then double.
    In other words,
  • 34:23 - 34:23
    terms are getting repeated
    twice here.
  • 34:26 - 34:26
    When k=0 or when k=n-1,
    I get the same T(n-1).
  • 34:30 - 34:30
    When k=1 or n-2,
    I get the same thing,
  • 34:34 - 34:34
    2 and n-3.
    What I will actually do is sum
  • 34:38 - 34:38
    from halfway up.
    That's a little bit cleaner.
  • 34:43 - 34:43
    And let me get the indices
    right.
  • 34:46 - 34:46
    Floor of n/2 up to n-1 will be
    safe.
  • 34:49 - 34:49
    And then I just have E[T(k)],
    except I forgot to multiply by
  • 34:56 - 34:56
    2, so I'm going to change this 1
    to a 2.
  • 35:01 - 35:01
    And order n is preserved.
    This is just because each term
  • 35:05 - 35:05
    is appearing twice.
    I can factor it out.
  • 35:08 - 35:08
    And if n is odd,
    I'm actually double-counting
  • 35:11 - 35:11
    somewhat, but it's certainly at
    most that.
  • 35:13 - 35:13
    So, that's a safe upper bound.
    And upper bounds are all we
  • 35:17 - 35:17
    care about because we're hoping
    to get linear.
  • 35:20 - 35:20
    And the running time of this
    algorithm is definitely at least
  • 35:25 - 35:25
    linear, so we just need an upper
    bounded linear.
  • 35:29 - 35:29
    So, this is a recurrence.
    E[T(n)] is at most 2/n times
  • 35:33 - 35:33
    the sum of half the numbers
    between 0 and n of
  • 35:36 - 35:36
    E[T(k)]+Theta(n).
    It's a bit of a hairy recurrence.
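    As a display:

        E[T(n)] \le \frac{2}{n}
            \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + \Theta(n).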
  • 35:40 - 35:40
    We want to solve it,
    though.
  • 35:42 - 35:42
    And it's actually a little bit
    easier than the randomized
  • 35:46 - 35:46
    quicksort recurrence.
    We're going to solve it.
  • 35:49 - 35:49
    What method should we use?
    Sorry?
  • 35:51 - 35:51
    Master method?
    Master would be nice,
  • 35:54 - 35:54
    except that each of the
    recursive calls is with a
  • 35:57 - 35:57
    different value of k.
    The master method only works
  • 36:02 - 36:02
    when all the calls are with the
    same value, same size.
  • 36:06 - 36:06
    Alas, it would be nice if we
    could use the master method.
  • 36:09 - 36:09
    What else do we have?
    Substitution.
  • 36:12 - 36:12
    When it's hard,
    when in doubt,
  • 36:14 - 36:14
    use substitution.
    I mean the good thing here is
  • 36:17 - 36:17
    we know what we want.
    From the intuition at least,
  • 36:20 - 36:20
    which is now erased,
    we really feel that this should
  • 36:24 - 36:24
    be linear time.
    So, we know what we want to
  • 36:26 - 36:26
    prove.
    And indeed we can prove it just
  • 36:32 - 36:32
    directly with substitution.
  • 36:42 - 36:42
    I want to claim there is some
    constant c greater than zero
  • 36:46 - 36:46
    such that E[T(n)],
    according to this recurrence,
  • 36:50 - 36:50
    is at most c times n.
    Let's prove that over here.
  • 37:00 - 37:00
    As we guessed,
    the proof is by substitution.
  • 37:13 - 37:13
    What that means is we're going
    to assume, by induction,
  • 37:18 - 37:18
    that this inequality is true
    for all smaller m.
  • 37:23 - 37:23
    I will just say for m less than n.
    And we need to prove it for n.
  • 37:29 - 37:29
    We get E[T(n)].
    Now we are just going to expand
  • 37:34 - 37:34
    using the recurrence that we
    have.
  • 37:36 - 37:36
    It's at most this.
    I will copy that over.
  • 37:54 - 37:54
    And then each of these
    recursive calls is with some
  • 37:58 - 37:58
    value k that is strictly smaller
    than n.
  • 38:01 - 38:01
    Sorry, I copied it wrong,
    floor of n over 2,
  • 38:04 - 38:04
    not zero.
    And so I can apply the
  • 38:07 - 38:07
    induction hypothesis to each of
    these.
  • 38:11 - 38:11
    This is at most c times k by
    the induction hypothesis.
  • 38:17 - 38:17
    And so I get this inequality.
  • 38:37 - 38:37
    This c can come outside the
    summation because it's just a
  • 38:41 - 38:41
    constant.
    And I will be slightly tedious
  • 38:43 - 38:43
    in writing this down again,
    because what I care about is
  • 38:47 - 38:47
    the summation here that is left
    over.
  • 38:56 - 38:56
    This is a good old-fashioned
    summation.
  • 39:01 - 39:01
    And if you remember back to
    your summation tricks or
  • 39:05 - 39:05
    whatever, you should be able to
    evaluate this.
  • 39:08 - 39:08
    If we started at zero and went
    up to n minus 1,
  • 39:11 - 39:11
    that's just an arithmetic
    series, but here we have the
  • 39:15 - 39:15
    tail end of an arithmetic
    series.
  • 39:17 - 39:17
    And you should know,
    at least up to theta,
  • 39:20 - 39:20
    what this is,
    right?
  • 39:21 - 39:21
    n^2, yeah.
    It's definitely Theta(n^2).
  • 39:24 - 39:24
    But we need here a slightly
    better upper bound,
  • 39:27 - 39:27
    as we will see the constants
    really matter.
  • 39:31 - 39:31
    What we're going to use is that
    this summation is at most 3/8
  • 39:35 - 39:35
    times n^2.
    And that will be critical,
  • 39:38 - 39:38
    the fact that 3/8 is smaller
    than 1/2, I believe.
  • 39:42 - 39:42
    So it's going to get rid of
    this 2.
  • 39:44 - 39:44
    I am not going to prove this.
    This is an exercise.
  • 39:48 - 39:48
    When you know that it is true,
    it's easy because you can just
  • 39:52 - 39:52
    prove it by induction.
    Figuring out that number is a
  • 39:56 - 39:56
    little bit more work,
    but not too much more.
  • 40:00 - 40:00
    So you should prove that by
    induction.
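    A sketch of where the 3/8 comes from, ignoring floors:

        \sum_{k=\lfloor n/2 \rfloor}^{n-1} k
            = \sum_{k=0}^{n-1} k - \sum_{k=0}^{\lfloor n/2 \rfloor - 1} k
            \approx \frac{n^2}{2} - \frac{(n/2)^2}{2} = \frac{3}{8} n^2.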
  • 40:04 - 40:04
    Now let me simplify.
    This is a bit messy,
  • 40:09 - 40:09
    but what I want is c times n.
    Let's write it as our desired
  • 40:16 - 40:16
    value minus the residual.
    And here we have some crazy
  • 40:22 - 40:22
    fractions.
    This is 2 times 3 which is 6
  • 40:27 - 40:27
    over 8 which is 3/4,
    right?
  • 40:31 - 40:31
    Here we have 1,
    so we have to subtract off 1/4
  • 40:35 - 40:35
    to get 3/4.
    And this should be,
  • 40:37 - 40:37
    I guess, 1/4 times c times n.
    And then we have this theta n
  • 40:42 - 40:42
    with double negation becomes a
    plus theta n.
  • 40:46 - 40:46
    That should be clear.
    I am just rewriting that.
  • 40:50 - 40:50
    So we have what we want over
    here.
  • 40:53 - 40:53
    And then we hope that this is
    nonnegative because what we want
  • 40:58 - 40:58
    is that this less than or equal
    to c times n.
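    The chain of inequalities being described:

        E[T(n)] \le \frac{2c}{n} \cdot \frac{3}{8} n^2 + \Theta(n)
            = \frac{3}{4} c n + \Theta(n)
            = c n - \Bigl( \frac{1}{4} c n - \Theta(n) \Bigr).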
  • 41:03 - 41:03
    That will be true,
    provided this thing is
  • 41:06 - 41:06
    nonnegative.
    And it looks pretty good
  • 41:09 - 41:09
    because we're free to choose c
    however large we want.
  • 41:13 - 41:13
    Whatever constant is embedded
    in this Theta notation is one
  • 41:18 - 41:18
    fixed constant,
    whatever makes this recurrence
  • 41:21 - 41:21
    true.
    We just set c to be bigger than
  • 41:24 - 41:24
    4 times that constant and then
    this will be nonnegative.
  • 41:28 - 41:28
    So this is true for c
    sufficiently large to dwarf that
  • 41:33 - 41:33
    theta constant.
    It's also the base case.
  • 41:37 - 41:37
    I just have to make the cursory
    mention that we choose c large
  • 41:41 - 41:41
    enough so that this claim is
    true, even in the base case
  • 41:46 - 41:46
    where n is at most some
    constant.
  • 41:48 - 41:48
    Here it's like 1 or so because
    then we're not making a
  • 41:52 - 41:52
    recursive call.
    What we get --
  • 41:55 - 41:55
    This algorithm,
    randomized select,
  • 41:59 - 41:59
    has expected running time order
    n, Theta(n).
  • 42:12 - 42:12
    The annoying thing is that in
    the worst-case,
  • 42:16 - 42:16
    if you're really,
    really unlucky it's n^2.
  • 42:19 - 42:19
    Any questions before we move on
    from this point?
  • 42:24 - 42:24
    This finished off the proof of
    this fact that we have Theta(n)
  • 42:29 - 42:29
    expected time.
    We already saw the n^2
  • 42:33 - 42:33
    worst-case.
    All perfectly clear?
  • 42:35 - 42:35
    Good.
    You should go over these
  • 42:37 - 42:37
    proofs.
    They're intrinsically related
  • 42:40 - 42:40
    between randomized quicksort and
    randomized select.
  • 42:44 - 42:44
    Know them in your heart.
    This is a great algorithm that
  • 42:48 - 42:48
    works really well in practice
    because most of the time you're
  • 42:52 - 42:52
    going to split,
    say, in the middle,
  • 42:55 - 42:55
    somewhere between a 1/4 and 3/4
    and everything is good.
  • 43:00 - 43:00
    It's extremely unlikely that
    you get the n^2 worst-case.
  • 43:03 - 43:03
    It would have to happen with
    like 1 over n^n probability or
  • 43:07 - 43:07
    something really,
    really small.
  • 43:09 - 43:09
    But I am a theoretician at
    least.
  • 43:10 - 43:10
    And it would be really nice if
    you could get Theta(n) in the
  • 43:14 - 43:14
    worst-case.
    That would be the cleanest
  • 43:16 - 43:16
    result that you could hope for
    because that's optimal.
  • 43:19 - 43:19
    You cannot do better than
    Theta(n).
  • 43:21 - 43:21
    You've got to look at the
    elements.
  • 43:23 - 43:23
    So, you might ask,
    can we get rid of this
  • 43:26 - 43:26
    worst-case behavior and somehow
    avoid randomization and
  • 43:29 - 43:29
    guarantee Theta(n) worst-case
    running time?
  • 43:33 - 43:33
    And you can but it's a rather
    nontrivial algorithm.
  • 43:39 - 43:39
    And this is going to be one of
    the most sophisticated that
  • 43:46 - 43:46
    we've seen so far.
    It won't continue to be the
  • 43:51 - 43:51
    most sophisticated algorithm we
    will see, but here it is.
  • 43:58 - 43:58
    Worst-case linear time order
    statistics.
  • 44:09 - 44:09
    And this is an algorithm by
    several, all very famous people,
  • 44:23 - 44:23
    Blum, Floyd,
    Pratt, Rivest and Tarjan.
  • 44:32 - 44:32
    I think I've only met the B and
    the R and the T.
  • 44:35 - 44:35
    Oh, no, I've met Pratt as well.
    I'm getting close to all the
  • 44:40 - 44:40
    authors.
    This is a somewhat old result,
  • 44:42 - 44:42
    but at the time it was a major
    breakthrough and still is an
  • 44:47 - 44:47
    amazing algorithm.
    Ron Rivest is a professor here.
  • 44:50 - 44:50
    You should know him from the R
    in RSA.
  • 44:53 - 44:53
    When I took my PhD
    comprehensives some time ago,
  • 44:56 - 44:56
    on the cover sheet was a joke
    question.
  • 45:00 - 45:00
    It asked of the authors of the
    worst-case linear time order
  • 45:05 - 45:05
    statistics algorithm,
    which of them is the most rich?
  • 45:09 - 45:09
    Sadly it was not a graded part
    of the comprehensive exam,
  • 45:13 - 45:13
    but it was an amusing question.
    I won't answer it here because
  • 45:18 - 45:18
    we're on tape,
    [LAUGHTER] but think about it.
  • 45:22 - 45:22
    It may not be obvious.
    Several of them are rich.
  • 45:25 - 45:25
    It's just the question of who
    is the most rich.
  • 45:30 - 45:30
    Anyway, before they were rich
    they came up with this
  • 45:34 - 45:34
    algorithm.
    They've come up with many
  • 45:36 - 45:36
    algorithms since,
    even after getting rich,
  • 45:39 - 45:39
    believe it or not.
    What we want is a good pivot,
  • 45:42 - 45:42
    guaranteed good pivot.
    Random pivot is going to be
  • 45:45 - 45:45
    really good.
    And so the simplest algorithm
  • 45:48 - 45:48
    is just pick a random pivot.
    It's going to be good with high
  • 45:52 - 45:52
    probability.
    We want to force a good pivot
  • 45:55 - 45:55
    deterministically.
    And the new idea here is we're
  • 45:58 - 45:58
    going to generate it
    recursively.
  • 46:02 - 46:02
    What else could we do but
    recurse?
  • 46:04 - 46:04
    Well, you should know from your
    recurrences that if we did two
  • 46:09 - 46:09
    recursive calls on problems of
    half the size and we have a
  • 46:13 - 46:13
    linear extra work that's the
    mergesort recurrence,
  • 46:16 - 46:16
    T(n) = 2T(n/2) + Theta(n).
    You should recite that in your
  • 46:20 - 46:20
    sleep.
    That's n lg n.
  • 46:21 - 46:21
    So we cannot recurse on two
    problems of half the size.
  • 46:25 - 46:25
    We've got to do better.
    Somehow these recursions have
  • 46:30 - 46:30
    to add up to strictly less than
    n.
  • 46:33 - 46:33
    That's the magic of this
    algorithm.
  • 46:35 - 46:35
    So this will just be called
    select instead of rand-select.
  • 46:40 - 46:40
    And it really depends on an
    array, but I will focus on the
  • 46:44 - 46:44
    i-th element that we want to
    select and the size of the array
  • 46:49 - 46:49
    that we want to select in.
    And I am going to write this
  • 46:53 - 46:53
    algorithm slightly less formally
    than randomized select because
  • 46:58 - 46:58
    it's a bit higher level of an
    algorithm.
  • 47:22 - 47:22
    And let me draw over here the
    picture of the algorithm.
  • 47:31 - 47:31
    The first step is sort of the
    weirdest and it's one of the key
  • 47:36 - 47:36
    ideas.
    You take your elements,
  • 47:39 - 47:39
    and they are in no particular
    order, so instead of drawing
  • 47:44 - 47:44
    them on a line,
    I am going to draw them in a 5
  • 47:48 - 47:48
    by n over 5 grid.
    Why not?
  • 47:50 - 47:50
    This, unfortunately,
    takes a little while to draw,
  • 47:54 - 47:54
    but it will take you equally
    long so I will take my time.
  • 48:00 - 48:00
    It doesn't really matter what
    the width is,
  • 48:03 - 48:03
    but it should be width n over 5
    so make sure you draw your
  • 48:06 - 48:06
    figure accordingly.
    Width n over 5,
  • 48:08 - 48:08
    but the height should be
    exactly 5.
  • 48:10 - 48:10
    I think I got it right.
    I can count that high.
  • 48:13 - 48:13
    Here is 5.
    And this should be,
  • 48:15 - 48:15
    well, you know,
    our number may not be divisible
  • 48:18 - 48:18
    by 5, so maybe it ends off in
    sort of an odd way.
  • 48:21 - 48:21
    But what I would like is that
    these chunks should be floor of
  • 48:25 - 48:25
    n over 5.
    And then we will have,
  • 48:27 - 48:27
    at most, four elements left
    over.
  • 48:30 - 48:30
    So I am going to ignore those.
    They don't really matter.
  • 48:34 - 48:34
    It's just an additive constant.
    Here is my array.
  • 48:37 - 48:37
    I just happened to write it in
    this funny way.
  • 48:40 - 48:40
    And I will call these vertical
    things groups.
  • 48:43 - 48:43
    I would circle them,
    and I did that in my notes,
  • 48:46 - 48:46
    but things get really messy if
    you start circling.
  • 48:49 - 48:49
    This diagram is going to get
    really full, just to warn you.
  • 48:53 - 48:53
    By the end it will be almost
    unintelligible,
  • 48:56 - 48:56
    but there it is.
    If you are really feeling
  • 49:00 - 49:00
    bored, you can draw this a few
    times.
  • 49:03 - 49:03
    And you should draw how it
    grows.
  • 49:06 - 49:06
    So there are the groups,
    vertical groups of five.
  • 49:10 - 49:10
    Next step.
  • 49:18 - 49:18
    The second step is to recurse.
    This is where things are a bit
  • 49:25 - 49:25
    unusual, well,
    even more unusual.
  • 49:28 - 49:28
    Oops, sorry.
    I really should have had a line
  • 49:33 - 49:33
    between one and two so I am
    going to have to move this down
  • 49:37 - 49:37
    and insert it here.
    I also, in step one,
  • 49:40 - 49:40
    want to find the median of each
    group.
  • 49:53 - 49:53
    What I would like to do is just
    imagine this figure,
  • 49:56 - 49:56
    each of the five elements in
    each group gets reorganized so
  • 50:00 - 50:00
    that the middle one is the
    median.
  • 50:02 - 50:02
    So I am going to call these the
    medians of each group.
  • 50:06 - 50:06
    I have five elements so the
    median is right in the middle.
  • 50:10 - 50:10
    There are two elements less
    than the median,
  • 50:13 - 50:13
    two elements greater than the
    median.
  • 50:16 - 50:16
    Again, we're assuming all
    elements are distinct.
  • 50:19 - 50:19
    So there they are.
    I compute them.
  • 50:22 - 50:22
    How long does that take me?
    N over five groups,
  • 50:25 - 50:25
    each with five elements,
    compute the median of each one?
  • 50:30 - 50:30
    Sorry?
    Yeah, 2 times n over 5.
  • 50:32 - 50:32
    It's theta n,
    that's all I need to know.
  • 50:35 - 50:35
    I mean, you're counting
    comparisons, which is good.
  • 50:38 - 50:38
    It's definitely Theta(n).
    The point is within each group,
  • 50:42 - 50:42
    I only have to do a constant
    number of comparisons because
  • 50:46 - 50:46
    it's a constant number of
    elements.
  • 50:48 - 50:48
    It doesn't matter.
    You could use randomized select
  • 50:52 - 50:52
    for all I care.
    No matter what you do,
  • 50:54 - 50:54
    it can only take a constant
    number of comparisons.
  • 50:59 - 50:59
    As long as you don't make a
    comparison more than once.
  • 51:03 - 51:03
    So this is easy.
    You could sort the five numbers
  • 51:07 - 51:07
    and then look at the third one,
    it doesn't matter because there
  • 51:12 - 51:12
    are only five of them.
    That's one nifty idea.
  • 51:16 - 51:16
    Already we have some elements
    that are sort of vaguely in the
  • 51:21 - 51:21
    middle but just of the group.
    And we've only done linear
  • 51:26 - 51:26
    work.
    So doing well so far.
  • 51:29 - 51:29
    Now we get to the second step,
    which I started to write
  • 51:34 - 51:34
    before, where we recurse.
  • 51:58 - 51:58
    So the next idea is,
    well, we have these floor of
  • 52:02 - 52:02
    n over 5 medians.
    I am going to compute the
  • 52:04 - 52:04
    median of those medians.
    I am imagining that I
  • 52:07 - 52:07
    rearranged these.
    And, unfortunately,
  • 52:09 - 52:09
    it's an even number,
    there are six of them,
  • 52:12 - 52:12
    but I will rearrange so that
    this guy, I have drawn in a
  • 52:15 - 52:15
    second box, is the median of
    these elements so that these two
  • 52:19 - 52:19
    elements are strictly less than
    this guy, these three elements
  • 52:22 - 52:22
    are strictly greater than this
    guy.
  • 52:24 - 52:24
    Now, that doesn't directly tell
    me anything, it would seem,
  • 52:28 - 52:28
    about any of the elements out
    here.
  • 52:31 - 52:31
    We will come back to that.
    In fact, it does tell us about
  • 52:35 - 52:35
    some of the elements.
    But right now this element is
  • 52:39 - 52:39
    just the median of these guys.
    Each of these guys is a median
  • 52:43 - 52:43
    of five elements.
    That's all we know.
  • 52:45 - 52:45
    If we do that recursively,
    this is going to take T of n
  • 52:49 - 52:49
    over 5 time.
    So far so good.
  • 52:51 - 52:51
    We can afford a recursion on a
    problem of size n over 5 and
  • 52:55 - 52:55
    linear work.
    We know that works out to
  • 52:58 - 52:58
    linear time.
    But there is more.
  • 53:01 - 53:01
    We're obviously not done yet.
  • 53:10 - 53:10
    The next step is x is our
    partition element.
  • 53:13 - 53:13
    We partition there.
    The rest of the algorithm is
  • 53:16 - 53:16
    just like randomized partition,
    so we're going to define k to
  • 53:19 - 53:19
    be the rank of x.
    And this can be done,
  • 53:22 - 53:22
    I mean it's r minus p plus 1 or
    whatever, but I'm not going to
  • 53:26 - 53:26
    write out how to do that because
    we're at a higher level here.
  • 53:30 - 53:30
    But it can be done.
    And then we have the three-way
  • 53:35 - 53:35
    branching.
    So if i happens to equal k
  • 53:38 - 53:38
    we're happy.
    The pivot element is the
  • 53:41 - 53:41
    element we're looking for,
    but more likely i is either
  • 53:46 - 53:46
    less than k or it is bigger than
    k.
  • 53:49 - 53:49
    And then we make the
    appropriate recursive call,
  • 53:54 - 53:54
    so here we recursively select
    the i-th smallest element --
  • 54:08 - 54:08
    -- in the lower part of the
    array.
  • 54:11 - 54:11
    Left of the partition element.
    Otherwise, we recursively
  • 54:16 - 54:16
    select the i minus k-th smallest
    element in the upper part of the
  • 54:23 - 54:23
    array.
    I am writing this at a high
  • 54:26 - 54:26
    level because we've already seen
    it.
  • 54:30 - 54:30
    All of this is the same as the
    last couple steps of randomized
  • 54:36 - 54:36
    select.
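    For reference, a hedged Python sketch of the whole algorithm.
    It follows the steps on the board but makes arbitrary
    implementation choices: it partitions by value into fresh lists
    rather than in place, and it assumes distinct elements:

        def select(A, i):
            """Return the i-th smallest element of A (i is 1-based),
            in worst-case linear time."""
            n = len(A)
            if n <= 5:                        # small base case: just sort
                return sorted(A)[i - 1]
            # Step 1: medians of the floor(n/5) full groups of five.
            medians = [sorted(A[j:j + 5])[2]
                       for j in range(0, n - n % 5, 5)]
            # Step 2: recursively find x, the median of the medians.
            x = select(medians, (len(medians) + 1) // 2)
            # Step 3: partition around x.
            lo = [a for a in A if a < x]
            hi = [a for a in A if a > x]
            k = len(lo) + 1                   # rank of x in A
            # Steps 4-5: recurse on the side that contains rank i.
            if i == k:
                return x
            elif i < k:
                return select(lo, i)
            else:
                return select(hi, i - k)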
  • 54:45 - 54:45
    That is the algorithm.
    The real question is why does
  • 54:48 - 54:48
    it work?
    Why is this linear time?
  • 54:50 - 54:50
    The first question is what's
    the recurrence?
  • 54:53 - 54:53
    We cannot quite write it down
    yet because we don't know how
  • 54:57 - 54:57
    big these recursive subproblems
    could be.
  • 55:00 - 55:00
    We're going to either recurse
    in the lower part or the upper
  • 55:04 - 55:04
    part, that's just like before.
    If we're unlucky and we have a
  • 55:08 - 55:08
    split of like zero to n minus
    one, this is going to be a
  • 55:12 - 55:12
    quadratic time algorithm.
    The claim is that this
  • 55:15 - 55:15
    partition element is guaranteed
    to be pretty good and good
  • 55:19 - 55:19
    enough.
    The running time of this thing
  • 55:21 - 55:21
    will be T of something times n,
    and we don't know what the
  • 55:25 - 55:25
    something is yet.
    How big could it be?
  • 55:27 - 55:27
    Well, I could ask you.
    But we're sort of indirect here
  • 55:32 - 55:32
    so I will tell you.
    We have already a recursive
  • 55:35 - 55:35
    call of T of n over 5.
    It better be that whatever
  • 55:38 - 55:38
    constant, so it's going to be
    something times n,
  • 55:41 - 55:41
    it better be that that constant
    is strictly less than 4/5.
  • 55:45 - 55:45
    If it's equal to 4/5 then
    you're not splitting up the
  • 55:48 - 55:48
    problem enough to get an n lg n
    running time.
  • 55:51 - 55:51
    If it's strictly less than 4/5
    then you're reducing the problem
  • 55:55 - 55:55
    by at least a constant factor.
    In the sense if you add up all
  • 56:00 - 56:00
    the recursive subproblems,
    n over 5 and something times n,
  • 56:04 - 56:04
    you get something that is a
    constant strictly less than one
  • 56:07 - 56:07
    times n.
    That forces the work to be
  • 56:10 - 56:10
    geometric.
    If it's geometric you're going
  • 56:12 - 56:12
    to get linear time.
    So this is intuition but it's
  • 56:15 - 56:15
    the right intuition.
    Whenever you're aiming for
  • 56:18 - 56:18
    linear time keep that in mind.
    If you're doing a
  • 56:21 - 56:21
    divide-and-conquer,
    you've got to get the total
  • 56:24 - 56:24
    subproblem size to be some
    constant less than one times n.
  • 56:28 - 56:28
    That will work.
    OK, so we've got to work out
  • 56:33 - 56:33
    this constant here.
    And we're going to use this
  • 56:37 - 56:37
    figure, which so far looks
    surprisingly uncluttered.
  • 56:42 - 56:42
    Now we will make it cluttered.
    What I would like to do is draw
  • 56:48 - 56:48
    an arrow between two vertices,
    two points, elements,
  • 56:54 - 56:54
    whatever you want to call them.
    Let's call them a and b.
  • 57:00 - 57:00
    And I want to orient the arrow
    so it points to a larger value,
  • 57:04 - 57:04
    so this means that a is less
    than b.
  • 57:07 - 57:07
    This is notation just for the
    diagram.
  • 57:10 - 57:10
    And so this element,
    I am going to write down what I
  • 57:13 - 57:13
    know.
    This element is the median of
  • 57:16 - 57:16
    these five elements.
    I will suppose that it is drawn
  • 57:19 - 57:19
    so that these elements are
    larger than the median,
  • 57:23 - 57:23
    these elements are smaller than
    the median.
  • 57:26 - 57:26
    Therefore, I have arrows like
    this.
  • 57:28 - 57:28
    Here is where I wish I had some
    colored chalk.
  • 57:33 - 57:33
    This is just stating this guy
    is in the middle of those five
  • 57:37 - 57:37
    elements.
    I know that in every single
  • 57:39 - 57:39
    column.
  • 57:55 - 57:55
    Here is where the diagram
    starts to get messy.
  • 57:58 - 57:58
    I am not done yet.
    Now, we also know that this
  • 58:02 - 58:02
    element is the median of the
    medians.
  • 58:04 - 58:04
    Of all the squared elements,
    this guy is the middle.
  • 58:07 - 58:07
    And I will draw it so that
    these are the ones smaller than
  • 58:10 - 58:10
    the median, these are the ones
    larger than the median.
  • 58:13 - 58:13
    I mean the algorithm cannot do
    this.
  • 58:15 - 58:15
    It doesn't necessarily know how
    all this works.
  • 58:18 - 58:18
    I guess it could,
    but this is just for analysis
  • 58:21 - 58:21
    purposes.
    We know this guy is bigger than
  • 58:23 - 58:23
    that one and bigger than that
    one.
  • 58:25 - 58:25
    We don't directly know about
    the other elements.
  • 58:29 - 58:29
    We just know that that one is
    bigger than both of those and
  • 58:33 - 58:33
    this guy is smaller than these.
    Now, that is as messy as the
  • 58:37 - 58:37
    figure will get.
    Now, the nice thing about less
  • 58:41 - 58:41
    than is that it's a transitive
    relation.
  • 58:43 - 58:43
    If I have a directed path in
    this graph, I know that this
  • 58:47 - 58:47
    element is strictly less than
    that element because this is
  • 58:51 - 58:51
    less than that one and this is
    less than that one.
  • 58:55 - 58:55
    Even though directly I only
    know within a column and within
  • 58:59 - 58:59
    this middle row,
    I actually know that this
  • 59:02 - 59:02
    element --
    This is x, by the way.
  • 59:06 - 59:06
    This element is larger than all
    of these elements because it's
  • 59:11 - 59:11
    larger than this one and this
    one and each of these is larger
  • 59:15 - 59:15
    than all of those by these
    arrows.
  • 59:18 - 59:18
    I also know that all of these
    elements in this rectangle here,
  • 59:23 - 59:23
    and you don't have to do this
    but I will make the background
  • 59:28 - 59:28
    even more cluttered.
    All of these elements in this
  • 59:33 - 59:33
    rectangle are greater than or
    equal to this one and all of the
  • 59:38 - 59:38
    elements in this rectangle are
    less than or equal to x.
  • 59:43 - 59:43
    Now, how many are there?
    Well, this is roughly halfway
  • 59:47 - 59:47
    along the set of groups and this
    is 3/5 of these columns.
  • 59:52 - 59:52
    So what we get is that there
    are at least --
  • 59:57 - 59:57
    We have n over 5 groups and we
    have half of the groups that
  • 60:04 - 60:04
    we're looking at here roughly,
    so let's call that floor of n
  • 60:10 - 60:10
    over 5, over 2, and then within each
    group we have three elements.
  • 60:17 - 60:17
    So we have at least 3 times
    floor of floor of n over 5,
  • 60:23 - 60:23
    over 2, elements that are less
    than or equal to x.
  • 60:30 - 60:30
    And we have the same that are
    greater than or equal to x.
  • 60:36 - 60:36
    Let me simplify this a little
    bit more.
  • 60:40 - 60:40
    I can also give you some more
    justification,
  • 60:45 - 60:45
    beyond the picture we drew,
    for why this is true.
  • 60:51 - 60:51
    We have at least n over 5 over
    2 group medians that are less
  • 60:58 - 60:58
    than or equal to x.
    This is the argument we use.
  • 61:03 - 61:03
    We have half of the group
    medians are less than or equal
  • 61:06 - 61:06
    to x because x is the median of
    the group median,
  • 61:09 - 61:09
    so that is no big surprise.
    This is almost an equality but
  • 61:12 - 61:12
    we're making floors so it's
    greater than or equal to.
  • 61:15 - 61:15
    And then, for each group
    median, we know that there are
  • 61:18 - 61:18
    three elements there that are
    less than or equal to that group
  • 61:22 - 61:22
    median.
    So, by transitivity,
  • 61:23 - 61:23
    they're also less than or equal
    to x.
  • 61:25 - 61:25
    We get this number times three.
    This is actually just floor of
  • 61:31 - 61:31
    n over 10.
    I was being unnecessarily
  • 61:34 - 61:34
    complicated there,
    but that is where it came from.
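
The simplification uses the standard nested-floor identity for positive integers:

```latex
% floor(floor(n/a)/b) = floor(n/(ab)); here a = 5 and b = 2.
\left\lfloor \frac{\lfloor n/5 \rfloor}{2} \right\rfloor = \left\lfloor \frac{n}{10} \right\rfloor
```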
  • 61:38 - 61:38
    What we know is that this thing
    is now at least 3 times n over
  • 61:44 - 61:44
    10, which means roughly 3/10 of
    the elements are on one side.
  • 61:48 - 61:48
    In fact, at least 3/10 of the
    elements are on each side.
  • 61:53 - 61:53
    Therefore, each side has at
    most 7/10 of the elements, roughly.
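
Written out as a sketch: at least 3 times floor of n over 10 elements land on each side, so the side we recurse into has size at most

```latex
n - 3\left\lfloor \frac{n}{10} \right\rfloor
\;\le\; n - 3\left(\frac{n}{10} - 1\right)
\;=\; \frac{7n}{10} + 3
```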
  • 61:59 - 61:59
    So the number here will be
    7/10.
  • 62:01 - 62:01
    And, if I'm lucky,
    7/10 plus 1/5 is strictly less
  • 62:05 - 62:05
    than one.
    I believe it is,
  • 62:06 - 62:06
    but I have trouble working with
    tenths.
  • 62:09 - 62:09
    I can only handle powers of
    two.
  • 62:11 - 62:11
    What we're going to use is a
    minor simplification,
  • 62:15 - 62:15
    which just barely still works,
    is a little bit easier to think
  • 62:19 - 62:19
    about.
    It's mainly to get rid of this
  • 62:22 - 62:22
    floor because the floor is
    annoying.
  • 62:24 - 62:24
    And we don't really have a
    sloppiness lemma that applies
  • 62:28 - 62:28
    here.
    It turns out if n is
  • 62:31 - 62:31
    sufficiently large,
    3 times floor of n over 10 is
  • 62:35 - 62:35
    greater than or equal to n over 4.
    Quarters I can handle.
  • 62:39 - 62:39
    The claim is that each side
    has size at least n/4,
  • 62:42 - 62:42
    therefore each side has size
    at most 3n/4 because there's a
  • 62:47 - 62:47
    quarter on the other side.
    This will be 3/4.
  • 62:49 - 62:49
    And I can definitely tell that
    1/5 is less than 1/4.
  • 62:53 - 62:53
    This is going to add up to
    something strictly less than one
  • 62:57 - 62:57
    and then it will work.
    How is my time?
  • 63:01 - 63:01
    Good.
    At this point,
  • 63:03 - 63:03
    the rest of the analysis is
    easy.
  • 63:06 - 63:06
    How the heck you would come up
    with this algorithm is another matter;
  • 63:10 - 63:10
    you'd have to realize that this is
    clearly a really good choice for
  • 63:15 - 63:15
    finding a partition element,
    just barely good enough that
  • 63:20 - 63:20
    both recursions add up to linear
    time.
  • 63:23 - 63:23
    Well, that's why it took so
    many famous people.
  • 63:28 - 63:28
    Especially in quizzes,
    but I think in general in this
  • 63:31 - 63:31
    class, you won't have to come up
    with an algorithm this clever
  • 63:34 - 63:34
    because you can just use this
    algorithm to find the median.
  • 63:38 - 63:38
    And the median is a really good
    partition element.
  • 63:40 - 63:40
    Now that you know this
    algorithm, now that we're beyond
  • 63:43 - 63:43
    1973, you don't need to know how
    to do this.
  • 63:46 - 63:46
    I mean you should know how this
    algorithm works,
  • 63:48 - 63:48
    but you don't need to do this
    in another algorithm because you
  • 63:52 - 63:52
    can just say run this algorithm,
    you will get the median in
  • 63:55 - 63:55
    linear time, and then you can
    partition to the left and the
  • 63:59 - 63:59
    right.
    And then the left and the right
  • 64:02 - 64:02
    will have exactly equal size.
    Great.
  • 64:05 - 64:05
    This is a really powerful
    subroutine.
  • 64:07 - 64:07
    You could use this all over the
    place, and you will on Friday.
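
Putting the whole lecture together, here is a minimal Python sketch of the deterministic select algorithm described here. The function name `select` is an illustrative choice, and the base-case cutoff of 50 follows the bound used above.

```python
import random

# Minimal sketch of deterministic select (median of medians):
# select(items, k) returns the k-th smallest element (k is 1-based).

def select(items, k):
    if len(items) <= 50:                       # base case from the analysis
        return sorted(items)[k - 1]

    # Step 1: split into groups of 5 and take each group's median (linear).
    groups = [items[i:i + 5] for i in range(0, len(items), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]

    # Step 2: recursively find the median of the medians: the T(n/5) call.
    x = select(medians, (len(medians) + 1) // 2)

    # Step 3: partition around x (linear time).
    less = [e for e in items if e < x]
    equal = [e for e in items if e == x]
    greater = [e for e in items if e > x]

    # Step 4: one recursive call, on a side of size at most roughly 3n/4.
    if k <= len(less):
        return select(less, k)
    if k <= len(less) + len(equal):
        return x
    return select(greater, k - len(less) - len(equal))

# Example: the median (rank 51) of a shuffled 0..100 is 50.
a = list(range(101))
random.shuffle(a)
assert select(a, 51) == 50
```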
  • 64:12 - 64:12
    Have I analyzed the running
    time pretty much?
  • 64:15 - 64:15
    The first step is linear.
    The second step is T of n over
  • 64:19 - 64:19
    5.
    The third step,
  • 64:20 - 64:20
    I didn't write it,
    is linear.
  • 64:22 - 64:22
    And then the last step is just
    a recursive call.
  • 64:25 - 64:25
    And now we know that this is
    3/4.
  • 64:34 - 64:34
    I get this recurrence.
    T of n is, I'll say at most,
  • 64:40 - 64:40
    T of n over 5 plus T of 3/4n.
    You could have also used 7/10.
  • 64:47 - 64:47
    It would give the same answer,
    but you would also need a floor
  • 64:54 - 64:54
    so we won't do that.
    I claim that this is linear.
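
In symbols, with the linear work from the other steps included, the recurrence is:

```latex
T(n) \;\le\; T\!\left(\frac{n}{5}\right) + T\!\left(\frac{3n}{4}\right) + \Theta(n)
```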
  • 65:01 - 65:01
    How should I prove it?
    Substitution.
  • 65:12 - 65:12
    Claim that T of n is at most
    again c times n,
  • 65:16 - 65:16
    that will be enough.
    Proof is by substitution.
  • 65:20 - 65:20
    Again, we assume this is true
    for smaller n.
  • 65:24 - 65:24
    And we want to prove it for n.
    We have T of n is at most this
  • 65:29 - 65:29
    thing.
    T of n over 5.
  • 65:31 - 65:31
    And by induction,
    because n over 5 is smaller than
  • 65:36 - 65:36
    n, we know that this is at most
    c times n over 5.
  • 65:40 - 65:40
    Let me write it as c over 5
    times n.
  • 65:44 - 65:44
    Sure, why not.
    Then we have here 3/4cn.
  • 65:48 - 65:48
    And then we have a linear term.
    Now, unfortunately,
  • 65:53 - 65:53
    I have to deal with things that
    are not powers of two.
  • 66:00 - 66:00
    I will cheat and look at my
    notes.
  • 66:02 - 66:02
    This is also known as 19/20
    times c times n plus theta n.
  • 66:07 - 66:07
    And the point is just that this
    is strictly less than one.
  • 66:11 - 66:11
    Because it's strictly less than
    one, I can write this as one
  • 66:15 - 66:15
    times c times n minus some
    constant, here it happens to be
  • 66:19 - 66:19
    1/20, as long as I have
    something left over here,
  • 66:23 - 66:23
    1/20 times c times n.
    Then I have this annoying theta
  • 66:27 - 66:27
    n term which I want to get rid
    of because I want this to be
  • 66:31 - 66:31
    nonnegative.
    But it is nonnegative,
  • 66:35 - 66:35
    as long as I set c to be
    really, really large,
  • 66:38 - 66:38
    at least 20 times whatever
    constant is here.
  • 66:42 - 66:42
    So this is at most c times n
    for c sufficiently large.
  • 66:46 - 66:46
    And, oh, by the way,
    if n is less than or equal to
  • 66:50 - 66:50
    50, which we used up here,
    then T of n is a constant,
  • 66:54 - 66:54
    it doesn't really matter what
    you do, and T of n is at most c
  • 66:59 - 66:59
    times n for c sufficiently
    large.
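
The arithmetic of that substitution step, written out:

```latex
T(n) \;\le\; \frac{c}{5}\,n + \frac{3}{4}\,c\,n + \Theta(n)
     \;=\; \frac{19}{20}\,c\,n + \Theta(n)
     \;=\; c\,n - \left(\frac{1}{20}\,c\,n - \Theta(n)\right)
     \;\le\; c\,n
\quad \text{once } c \ge 20 \cdot (\text{the constant hidden in } \Theta(n))
```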
  • 67:03 - 67:03
    That proves this claim.
    Of course, the constant here is
  • 67:06 - 67:06
    pretty damn big.
    It depends on exactly what the
  • 67:06 - 67:06
    constants in the running times
    are, which depends on your
  • 67:12 - 67:12
    machine, but practically this
    algorithm is not so hot because
  • 67:15 - 67:15
    the constants are pretty big.
    Even though this element is
  • 67:18 - 67:18
    guaranteed to be somewhere
    vaguely in the middle,
  • 67:21 - 67:21
    and even though these
    recursions add up to strictly
  • 67:24 - 67:24
    less than n and it's geometric,
    it's geometric because the
  • 67:27 - 67:27
    problem is only shrinking to
    19/20 of its size each time.
  • 67:31 - 67:31
    So it actually takes a while
    for the problem to get really
  • 67:35 - 67:35
    small.
    Practically you probably don't
  • 67:37 - 67:37
    want to use this algorithm
    unless you cannot somehow flip
  • 67:41 - 67:41
    coins.
    The randomized algorithm works
  • 67:43 - 67:43
    really, really fast.
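
For contrast, a minimal sketch of that randomized version (the name `randomized_select` is mine): pick the pivot uniformly at random instead of computing a median of medians. Expected linear time, but no worst-case guarantee.

```python
import random

def randomized_select(items, k):
    """Return the k-th smallest element (k is 1-based), expected linear time."""
    x = random.choice(items)                  # random pivot instead of median of medians
    less = [e for e in items if e < x]
    equal = [e for e in items if e == x]
    greater = [e for e in items if e > x]
    if k <= len(less):
        return randomized_select(less, k)
    if k <= len(less) + len(equal):
        return x
    return randomized_select(greater, k - len(less) - len(equal))
```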
    Theoretically this is your
  • 67:46 - 67:46
    dream, the best you could hope
    for because it's linear time,
  • 67:50 - 67:50
    and it's guaranteed,
    worst-case linear time.
  • 67:53 - 67:53
    I will mention,
    before we end,
  • 67:55 - 67:55
    an exercise.
  • 68:03 - 68:03
    Why did we use groups of five?
    Why not groups of three?
  • 68:06 - 68:06
    As you might guess,
    the answer is because it
  • 68:09 - 68:09
    doesn't work with groups of
    three.
  • 68:11 - 68:11
    But it's quite instructive to
    find out why.
  • 68:14 - 68:14
    If you work through this math
    with groups of three instead of
  • 68:18 - 68:18
    groups of five,
    you will find that you don't
  • 68:20 - 68:20
    quite get the problem reduction
    that you need.
  • 68:23 - 68:23
    Five is the smallest number for
    which this works.
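
Sketching where the exercise leads: with groups of three, only about n over 3 elements are guaranteed on each side, so the side you recurse into can have size 2n over 3, and the subproblem fractions sum to exactly one:

```latex
T(n) \;\le\; T\!\left(\frac{n}{3}\right) + T\!\left(\frac{2n}{3}\right) + \Theta(n)
\;\Longrightarrow\; T(n) = \Theta(n \log n)
```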
  • 68:27 - 68:27
    It would work with seven,
    but theoretically not any
  • 68:30 - 68:30
    better than a constant factor.
    Any questions?
  • 68:33 - 68:33
    All right.
    Then recitation Friday.
  • 68:35 - 68:35
    Homework lab Sunday.
    Problem set due Monday.
  • 68:38 - 68:38
    Quiz one in two weeks.